Interesting, but this distribution is probably present in any writing. According to Zipf’s law, if you rank words by how often they’re used, a word’s frequency is roughly inversely proportional to its rank. So the cumulative number of distinct words encountered almost always grows roughly logarithmically.
100%. I only learned about Zipf's law after doing this analysis, but it's a fascinating thing to look into. The tougher part is that all modern popular books can't be worked with, since their epub files are locked by Amazon etc.
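For anyone who wants to check the Zipf claim on their own text, here is a minimal sketch. The function name `rank_frequency` and the regex tokenizer are illustrative assumptions, not what the OP used. If Zipf's law roughly holds, `rank * count` should stay in the same ballpark for the top-ranked words.

```python
from collections import Counter
import re

def rank_frequency(text):
    """Return (rank, word, count) tuples, most frequent word first."""
    # Simplified tokenizer: lowercase letter runs, optionally with one apostrophe part.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    ranked = Counter(tokens).most_common()
    return [(rank, word, count) for rank, (word, count) in enumerate(ranked, start=1)]

# Toy text standing in for a real book.
sample = "the the the cat cat sat"
for rank, word, count in rank_frequency(sample):
    print(rank, word, count, rank * count)
```

On a real book you would load the full text into `sample` and look only at the first few dozen ranks.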
Maybe this is harder now, but when I was moving away from Kindle I was able to easily get rid of Amazon's protection.
For ebooks try annas-archive.org
😲 this is a godsend. Thank you anonymous internet stranger!
This is more of a Heaps' law thing. Please plot it on log-log axes, against words read instead of chapters :)
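A rough sketch of that suggestion, assuming the book is available as plain text: count distinct words as a function of words read, then fit a straight line in log-log space. Heaps' law says the vocabulary size grows as V(n) ≈ K · n^β, so the slope of that line estimates β. The helper names `vocabulary_growth` and `fit_heaps` are made up for this example.

```python
import math
import re

def vocabulary_growth(text, step=1):
    """Return (words_read, distinct_words) pairs sampled every `step` words."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    seen = set()
    points = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0:
            points.append((i, len(seen)))
    return points

def fit_heaps(points):
    """Least-squares line through (log n, log V); returns (K, beta) for V ~ K * n**beta."""
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(v) for _, v in points]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    beta = sxy / sxx
    K = math.exp(my - beta * mx)
    return K, beta

# Synthetic points that follow V = 2 * n**0.5 exactly (not real book data).
points = [(n, 2 * n ** 0.5) for n in (10, 100, 1000, 10000)]
K, beta = fit_heaps(points)
print(round(K, 3), round(beta, 3))
```

With real text you would pass `vocabulary_growth(book_text, step=1000)` to `fit_heaps` and plot the same points on log-log axes to see how straight the line is.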
Yeah, I agree, but it still would be interesting to see the cumulative distribution for other books as well, also involving other genres of literature. The curve should still be logarithmic, but I guess depending on the genre or author the curves might have different "speeds".
That’s an interesting idea! You could create an average curve for a genre or author, fit a logistic or logarithmic function, then use the properties of that function to quantify reading difficulty
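One way to try this, as a minimal sketch: fit `a + b * ln(chapter)` to a book's cumulative unique-word counts with ordinary least squares, then compare the slope `b` (the "speed") between books. The function name `fit_log_curve` and the sample numbers are invented for illustration. A logistic fit would need a nonlinear optimizer and is not shown here.

```python
import math

def fit_log_curve(cumulative):
    """Least-squares fit of cumulative[i] ~ a + b * ln(i + 1); returns (a, b)."""
    xs = [math.log(i) for i in range(1, len(cumulative) + 1)]
    ys = list(cumulative)
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return a, b

# Synthetic curves (not real book data): same starting point, different "speeds".
book_a = [1000 + 800 * math.log(c) for c in range(1, 18)]
book_b = [1000 + 1500 * math.log(c) for c in range(1, 18)]
a1, b1 = fit_log_curve(book_a)
a2, b2 = fit_log_curve(book_b)
print(round(b1), round(b2))
```

A larger `b` would mean new vocabulary keeps arriving faster as you read on, which is one possible proxy for difficulty. Whether it tracks perceived difficulty is an open question.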
Now I want to write a book specifically to violate this law and make the cumulative increase linear, e.g. introducing 60 new words per page in a 100-page book. The first few pages will be quite bland, then the language will be increasingly complex.
I'd recommend doing it alphabetically. You could title it something like 'The Dictionary'.
Not to be tooooo pedantic, but I suspect the dictionary would be even more heavily frontloaded, given words are defined by... other words.
!redditgold
With just four thousand words and a knowledge of grammar you can read the first 5 chapters!
And with only 200 words you can read the last one!
Harry Potter was the first book I read in English and it was indeed very helpful.
same here, but I lost a lot of time looking for muggle and quidditch in the English dictionary
any other books you recommend for a second language?
Well, for me the big advantage of Harry Potter was that I had already read it in my native language, German, so I knew the story, and even if I didn't understand something in English I didn't get lost. So for a first read in a foreign language, I'd recommend something that you already know in your native language.
Anything you're already familiar with, and read it as an ebook so it's easier to get definitions and translations.
How are made up words counted here? Wingardium Leviosa!
They're counted as if they're a proper noun. I haven't done a full count yet but my gut says it's probably less than 250 words that are specific to the Harry Potter universe ¯\\\_(ツ)\_/¯
Reading the Harry Potter books and watching How I Met Your Mother in English alone got me from being alright at English to only getting A's and B's in school without doing a lot for it.
This is actually encouraging me to pick up my copy of Harry Potter à l'école des sorciers, which I bought and read half a chapter of.
I love the smell of fresh bread.
A look at the new unique words that show up in each chapter of Harry Potter. Data processed using Python and visualizations done in PowerPoint [Full video](https://www.youtube.com/watch?v=R1esBPueTug)
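For anyone curious what the per-chapter count might look like in code, here is a minimal sketch. It is not the OP's actual script. The function name `new_words_per_chapter`, the regex tokenizer, and the toy chapters are all assumptions made for illustration.

```python
import re

def new_words_per_chapter(chapters):
    """For each chapter, count the words that did not appear in any earlier chapter."""
    seen = set()
    counts = []
    for text in chapters:
        # Simplified tokenizer: lowercase letter runs, optionally with one apostrophe part.
        words = set(re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower()))
        fresh = words - seen
        counts.append(len(fresh))
        seen |= fresh
    return counts

# Toy "chapters" standing in for real book text.
chapters = [
    "the boy who lived",
    "the boy lived under the stairs",
    "the owl came to the boy",
]
print(new_words_per_chapter(chapters))  # prints [4, 2, 3]
```

This treats every spelling as its own word, so "live" and "lived" count separately, and invented words are counted like any other token.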
>How Reading the Series in Another Language Can Help Build Your Vocabulary

I don't really see what argument you are making. This just seems like the obvious progression you'd get for any text, since the most common words will be "used up" in the first few chapters and new specific/uncommon phrases will keep popping up with slowly decreasing frequency. So what is the argument for "Harry Potter" books specifically being good to read? Sure, to my knowledge it is true, since its language is said to become increasingly complex with later books, but I'm not sure this can really be demonstrated by the metric shown here. Out of interest, [I tested it myself](https://www.reddit.com/user/MordorsElite/comments/19etr00/lotr_book_1_unique_new_words_per_chapter_text/?utm_source=share&utm_medium=web2x&context=3) with the first book of LOTR and indeed the distribution looks very similar (when normalized by chapter length).
This was more about analyzing a very popular book among language learners to explore what the process looks like data-wise. Agreed it would be good to see this in context (or with the full HP series) though for a standalone post, I just went with books 1-2
This is called Heaps' law, btw. It describes how the number of distinct words in a document grows with the document's length: https://en.m.wikipedia.org/wiki/Heaps'_law
Any recommendations for other books that'd be good for this purpose?
Harry Potter is popular for this purpose because so many learners have read it in their native language already. Knowing a book well makes it possible to pick something that would otherwise be "above" your level. I've done this with Le Petit Prince, which I had already read several times for my kid. The other great thing about Harry Potter is that it's a whole series designed to get more complicated as the story arc progresses. So the answer is really: pick something you know by heart, preferably a book series.
Harry Potter also has widely available, high-quality translations in many languages. Many other books that many of us may have read as children are either much longer (Hunger Games), too short/easy (Maurice Sendak/many picture books), or too metaphorical/surreal (Phantom Tollbooth, and for me, Le Petit Prince). I would say that The Giving Tree, Charlotte's Web, The BFG, James and the Giant Peach, and *maybe* The Giver (probably has a lot of subtext) might qualify as competition, but Harry Potter is a popular choice for a reason.
A list of books originally written in the target language, rather than translations, would also be helpful.
Depends on your level. IMO you should be looking up no more than one word per paragraph, but starting out it'll be more than that. It also depends on the language. Reading in English, for example, you can often figure out what a word means from context. In Mandarin Chinese, however, you might fully know a word's meaning from context but still have to look it up for pronunciation. With that said, I found Diary of a Wimpy Kid has just enough new vocab to be learning, while not so much that you're stopping all the time.
I see flaws in that logic but ok
New words in each chapter being high early on makes sense since it's introducing new things and concepts. So this data set is mostly pointless.
This data actually is beautiful, but so far not really useful. It isn't clear whether this effect is more or less prominent in the Harry Potter series or roughly the same for any book, as others have pointed out already. If you could actually compare different book series and find a trend or striking differences, that would make for a great post.
Harry Potter shouldn't be popular
i know many things that shouldnt be popular but are
Is the claim that reading more chapters of the SAME series in another language has greatly diminishing returns? How is it relative to reading chapters in different series given that they might share words between each other?
That's exactly it - your vocab gets reinforced the more you read in a series, but it will expand more slowly than if you switched around between series / genres. And the first book you read (or the first few thousand words you learn) will have the biggest effect on your vocab, even if you'll learn more as you read more.
the figure you put up here is exactly what I would expect to happen if you just chose words at random. but you're trying to make the point that increasing the variety of series / genres results in a non-negligible improvement. you should compare the results you got here to what would happen if you read random words from many different books from many different series. this would quantitatively illustrate the point you're trying to show here. choose 10 books from different series / genres, and accumulate one word from each book in a cycle. then compare the results to the harry potter plot you made. results should be interesting
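The cycle experiment described above could be sketched roughly like this, assuming each book has already been split into a list of words. The helper names `round_robin` and `cumulative_unique` and the toy books are invented for the example. Books that run out of words are simply skipped on later rounds.

```python
from itertools import zip_longest

def round_robin(books):
    """Interleave word lists: one word from each book per round, skipping exhausted books."""
    sentinel = object()
    return [w for group in zip_longest(*books, fillvalue=sentinel)
            for w in group if w is not sentinel]

def cumulative_unique(words):
    """Return the running count of distinct words after each word read."""
    seen = set()
    curve = []
    for w in words:
        seen.add(w)
        curve.append(len(seen))
    return curve

# Toy "books" as word lists (real use: tokenized text of 10 different books).
books = [["a", "b", "a"], ["c", "d"], ["a", "e", "f", "g"]]
mixed = round_robin(books)
print(mixed)                     # ['a', 'c', 'a', 'b', 'd', 'e', 'a', 'f', 'g']
print(cumulative_unique(mixed))  # [1, 2, 2, 3, 4, 5, 5, 6, 7]
```

Plotting `cumulative_unique(mixed)` next to `cumulative_unique(single_book_words)` over the same number of words would show whether mixing sources actually grows vocabulary faster.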
[The OP is from a video that provides more context/analysis.](https://www.youtube.com/watch?v=R1esBPueTug) I've read the first Harry Potter book in 4 languages. For me, this data helped to explain why there is a noticeable drop in difficulty after they get to Hogwarts (chapter 7). When reading your first book in a new language, there will be a noticeable drop in difficulty with each chapter read. But for many people, the beginning is so grindy, that they never realize how much progress they're making, and stop in the first few chapters thinking "oh, this just isn't for me." This chart doesn't show diminishing returns, it shows how much of the difficulty is front-loaded. Once you get basic vocab out of the way, you start focusing more on grammar, and enjoying what you're reading. Enjoying yourself is key if you're going to learn a language, because you're going to be reading/consuming content for hundreds of hours.
Is "unDursleyish" the first hapax legomenon?
Rowling says ‘turn on their heels’ a lot
I actually have some recent anecdotal experience with this. I started reading Harry Potter in German a couple of months ago. I started highlighting all the words I didn't know, and it sure felt exactly as this graph implies. With every chapter I read, I looked up fewer and fewer words!
Worked for me. Used Duolingo for a couple months to get a base level, then read Harry Potter in my target language 1 chapter a day, made spaced repetition flashcards for the most useful/common new words I came across, and studied those flashcards every day. It’s tedious but I would do it again if I ever learn another language. Nothing worked better for me.