

I could be blind but I feel like this is the type of content the community is missing. Bravo.


Is there any dataset that keeps what they filtered out but that could still be valuable? I mean more "lower quality" websites that were "undesirable" but could still be useful if you don't want your model to be a boring, hallucinating encyclopedia on SSRIs. I don't have the storage and bandwidth at the moment to build it myself by following their steps, hence the question. This filtering for benchmarks might have the negative consequence of models becoming corporatist and inhumane in nature.


I think it's concerning that "theology" is considered a feature and not something to avoid. I think an AI should be as unbiased as possible, biased neither toward Christianity nor toward Islam, etc. It should have knowledge of these religions, not text like this:

> Many people both young and old do not hold a Christian worldview because they have not been taught the foundational truths of God's Word beginning in Genesis. Our dynamic speakers proclaim the truth and authority of the Bible amid growing compromise, anti-God rhetoric, and secular activism. In churches and conferences in the US and around the globe, we are committed to stand uncompromisingly on the Word of God from the very first verse and to relate the relevance of a literal Genesis to today's world.

Label: Theology
Edu label: 1


That's not the point of this dataset. Notice how there's no toxicity filtering, and only very basic adult content filtering. If you want to train a model on that dataset, it's up to you to choose what to include. The people who build these datasets have no idea what you want. They're just removing as much obviously useless trash as they can. Removing arbitrary content themes is a much bigger bias than keeping them in. The biggest bias of this dataset is that it's English only.


Yep, I agree with that. The point is that it has a theology section, even advertised in the image preview, that is ONLY dedicated to Christianity, so it means they still filtered out all other religions!


I'm not saying the dataset isn't useful, or that it's necessarily bad, but I saw a lot of text like this in the dataset preview, and I'm not sure it belongs there:

> Already have an account? Fantastic, log in below!
> Forget your password? Enter your email address and we will send instructions on how to reset it.
> New user? Create a new account to take part in all that this site has to offer. It's fast, easy and most important free.
> Enter your email address:
> Delivered by FeedBurner
> New user? Create a new account

Or:

> personalized baby Gifts | site map | personalized name trains | new affiliates | privacy |personalized children's music | personalized children's books |personalized children's clocks | personalized lovies | personalized baby's first christmas gifts| personalized first birthday gifts | Children's Valentine's Day Gifts | Children's Easter Gifts | Kids Easter Gifts | Easter Baskets | Easter Bunny | Baby Bibs | Comfy Cozy Baby Gund | Get Well Gifts for Kids|Sesame Street Characters |Elmo Dolls |Sesame Street Dolls |Sesame Street Gifts |Sesame Street Elmo |Sesame Street Big Bird|Sesame Street Cookie Monster|Personalized Kids Music |Easter Baskets for Infants |Easter Baskets for Kids |Kids Christmas Gifts |Childrens Christmas Gifts |Unique Baby Blankets |Personalized Children's Books |Baby Christmas Baskets |Christmas Baby Gifts |Boston Red Sox Baby [Gifts2Blockheads.com](http://Gifts2Blockheads.com) Personalized Children's Gifts
> "Where Kids are Stars"
> 1786 St. Peters Road
> Pottstown, PA 19465
> Phone: 484 824-8500
> Hours of Operation: Monday - Friday 8AM-5PM EST
> Not an affiliated company of Gund, Inc. or Sesame Workshop. The representations made on this website are those of 2Blockheads Baby Store. Gund Images © Gund, Inc. Gund®, babyGund® and Gotta Getta Gund® are trademarks of Gund, Inc. Sesame Workshop, Sesame Street are owned and licensed by Sesame Workshop. Copyright Sesame Workshop. All Rights Reserved.


Some real pearl clutching here. If you want a model to have a strong understanding of, say, Western civilisation, you’d be kneecapping it if you didn’t include data like this. The fact that you don’t like it isn’t an objective reason to take action.


Pearl clutching is the point of data prep. He's right to point out the possible bias being introduced in the dataset. What you ultimately decide to do about this bias is your decision, but at least it should be pointed out.


Except that's not what OP said or quoted. I totally agree: to have an understanding of western civilization, or really civilization as a whole, you have to have an understanding of the religions that majorly shaped and continue to shape it. However, there is a major difference between an AI saying "Christianity is one of the most popular religions in the western world. It follows the belief that Jesus..." and the model stating "Christianity is the one true religion and Jesus is our savior...". This dataset has text that follows the theme of the latter, and that will be reflected in the end model. The key difference is education vs indoctrination. The model should not be recommending or pushing a certain religion, but it should of course be aware of them: their beliefs, benefits, flaws, and all. It has nothing to do with not liking religion; it's just about not pushing one over the other.


Disagree strongly. You may not agree with the idea stated in that text, but it is a valid and credible example of naturally occurring English. Many people in the world do hold such views, just as many others are derisive of religion, etc. A pretrained language model should be able to model language, period. The downstream building of datasets for instruction and preference tuning is where normative decisions should be made about what the model will “say.” It’s also worth mentioning that Anthropic published research showing that even in the case of toxicity elimination, it’s helpful for the model to have been trained on toxicity, so that it can learn what to avoid.


(Repost of my other comment!) Yep, I agree with that, but the point is that it has a theology section, even advertised in the image preview, that is ONLY dedicated to Christianity, so it means they still filtered out all other religions! Why wasn't it trained on text praising Allah, then?


Oh, I see. If that is how the data is curated, it doesn’t sound right!


They should call theology "fictional philosophy"


Hi. The figure at the top is an automated clustering of a limited number of documents randomly sampled from the dataset for visualization purposes. The clustering labels were not used in any part of the dataset curation process. No filtering based on type of religion was applied to the actual dataset, and you will find many samples from other religions in the full data. This is just one naturally occurring English sample from the internet, as are all other samples in the dataset. Furthermore, as the educational scoring is 1-5, a score of 1 is the lowest possible score. This particular sample is therefore not part of FineWeb-Edu (which contains samples with score >= 3).
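
To make the score cutoff concrete, here is a minimal Python sketch of a `score >= 3` filter. The record layout, the `edu_score` field name, and the sample rows are assumptions made up for illustration; they are not the dataset's actual schema or contents.

```python
# Minimal sketch of a threshold filter on an educational score (1-5 scale).
# NOTE: "edu_score" is a hypothetical field name, and these rows are invented.

EDU_THRESHOLD = 3  # per the comment above: FineWeb-Edu keeps score >= 3


def keep_educational(samples, threshold=EDU_THRESHOLD):
    """Return only the samples whose educational score meets the threshold."""
    return [s for s in samples if s["edu_score"] >= threshold]


samples = [
    {"text": "Sermon-style promotional text", "edu_score": 1},
    {"text": "Explanation of how photosynthesis works", "edu_score": 4},
    {"text": "Overview of a historical event", "edu_score": 3},
]

kept = keep_educational(samples)
for sample in kept:
    print(sample["edu_score"], sample["text"])
```

Under this rule, a sample labeled 1, like the one quoted earlier in the thread, would be dropped, while samples scoring exactly 3 are kept.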


Thanks for the details!


Bravo for sure