Zipf's Law

Zipf's law is an empirical observation from computational linguistics which states that the frequency with which a word appears in a corpus of natural language is inversely proportional to its rank in the corpus's frequency table. Thus the most frequent word appears roughly twice as often as the second most frequent word, and three times as often as the third most frequent word. This statistical pattern continues throughout a corpus of natural language: words are ranked by frequency of occurrence, starting with the most frequent word and continuing down to the least frequent.

Zipf's Law Formula

f(w) = C / r(w)^a

where:
f(w) = frequency of the word w
r(w) = rank of the word w
C = number of times the most frequent word occurs
a = the exponent, approximately 1 for natural language

Zipf's law is a power law with exponent k = -1. Plotting both axes on a logarithmic scale therefore produces a linear relationship between the two: log(y) = log(C) + k log(x).

Write a function to process a large text and plot word frequency against word rank.

Function Output Results

To conclude, the generated results show that words do not occur with an equal ratio of occurrence: in general, most corpora contain a few words that occur very frequently and a large selection of words that occur infrequently. This is exemplified in the work of Christina Sterbenz, who states that the most frequently occurring words are generally prepositions and comprise about 25% of English text. Going further, the top 100 words comprise about 50%, whilst 50,000 words comprise a total of 95%. To account for the remaining 5%, a vocabulary of over a million words is required (Sterbenz, 2013). The table below shows the most frequent words for a range of corpora, confirming that a small selection of words accounts for most occurrences.
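The function described above can be sketched as follows. This is a minimal illustration using the standard library for the counting and matplotlib for the plot; the names zipf_data and plot_zipf are my own, not taken from the report's code.

```python
import re
from collections import Counter


def zipf_data(text):
    """Count word frequencies and return (rank, frequency) pairs,
    most frequent word first (rank 1)."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    freqs = sorted(counts.values(), reverse=True)
    return list(enumerate(freqs, start=1))


def plot_zipf(text):
    """Plot word frequency against word rank on log-log axes."""
    import matplotlib.pyplot as plt  # imported here so zipf_data stands alone
    pairs = zipf_data(text)
    ranks = [r for r, _ in pairs]
    freqs = [f for _, f in pairs]
    plt.loglog(ranks, freqs, marker=".", linestyle="none")
    plt.xlabel("log rank")
    plt.ylabel("log frequency")
    plt.title("Zipf plot")
    plt.show()
```

Under Zipf's law the points of this log-log plot should fall close to a straight line of slope -1.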
Word Comparison of Corpus Analysis Results

To test the validity of Zipf's law and the developed function, a series of tests was run on a range of different corpora. To generate an unbiased and accurate result set, each of the corpora chosen has idiosyncratic attributes unique to that specific body of text.

Brown Corpus

The first corpus analysed with the NLTK frequency distribution function was the Brown Corpus. From the logarithmic scale one can determine that the Brown Corpus holds true to Zipf's law: a few select words occur frequently, whilst many words occur infrequently. However, whilst the formation of the Zipf plot holds true to Zipf's law, the plot generated does not produce an accurately linear relationship between log(x) and log(y). To achieve a more accurate fit, a Zipf-Mandelbrot distribution can be implemented, using the formula f = C / (r + p)^k. The additional offset parameter p corrects for the flattening of the curve at the lowest ranks (which are dominated by stop words), allowing for a better linear fit of the statistical data derived from the plot.

Most frequent 10 words in the Brown Corpus:
1. ('the', 69971)
2. ('of', 36412)
3. ('and', 28853)
4. ('to', 26158)
5. ('a', 23195)
6. ('in', 21337)
7. ('that', 10594)
8. ('is', 10109)
9. ('was', 9815)
10. ('he', 9548)

[Figures: Zipf plots for randomly generated text: 5,000 strings 5 letters long; 5,000 words 5 letters long; 10,000 words 5 letters long]

Web Text Corpus

[Figures: Web Text log-log plot; Web Text frequency distribution]

Most common words in the Web Text Corpus:
[('i', 7925), ('the', 7909), ('to', 6325), ('a', 6199), ('you', 5812), ('and', 4516), ('in', 4293), ('t', 3547), ('it', 3385), ('s', 3320), ('on', 3281), ('of', 3186), ('is', 3024), ('girl', 2956), ('that', 2792), ('guy', 2751), ('not', 2665), ('1', 2261), ('with', 1943), ('for', 1908), ('when', 1833), ('2', 1709), ('like', 1696), ('my', 1558), ('no', 1539), ('this', 1491), ('but', 1447), ('have', 1410), ('so', 1381)]

Gutenberg Corpus

[Figures: Gutenberg Corpus log-log plot; Gutenberg Corpus frequency distribution]
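The Zipf-Mandelbrot formula above can be sketched as a short function. The default parameter values below are illustrative assumptions only; in practice p and k would be fitted to the corpus under analysis.

```python
def zipf_mandelbrot(rank, C, p=2.7, k=1.0):
    """Zipf-Mandelbrot distribution: f = C / (rank + p)**k.

    C : frequency scale (equal to the top-ranked word's count when p = 0)
    p : offset that flattens the curve at the lowest ranks
    k : exponent, close to 1 for natural language
    The default p and k here are illustrative, not values fitted
    to the Brown Corpus.
    """
    return C / (rank + p) ** k
```

With p = 0 and k = 1 the formula reduces to plain Zipf's law, where rank 2 receives half the frequency of rank 1.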
Most common words in the Gutenberg Corpus:
[('the', 133606), ('and', 95452), ('of', 71273), ('to', 48063), ('a', 33977), ('in', 33584), ('i', 30265), ('that', 28798), ('he', 25857), ('it', 22303), ('his', 21402), ('for', 19536), ('was', 18717), ('with', 17599), ('not', 17373), ('is', 16437), ('you', 16398), ('be', 16115), ('as', 14528), ('but', 13944), ('all', 13727), ('they', 13104)]

Chat Room Corpus

[Figures: Chat Room Corpus log plot; Chat Room Corpus frequency distribution]

Most common words in the Chat Room Corpus:
[('i', 1224), ('part', 1022), ('join', 1021), ('lol', 822), ('you', 686), ('to', 665), ('the', 660), ('hi', 656), ('a', 580), ('me', 428), ('is', 380), ('in', 364), ('and', 357), ('it', 355), ('action', 347), ('hey', 292), ('that', 284), ('my', 259), ('of', 207), ('u', 204), ('what', 201), ("'s", 195), ('for', 189), ('on', 189), ('here', 185), ('no', 181), ('are', 181), ('do', 181), ('not', 179), ('have', 171)]

Inaugural Address Corpus

[Figure: Inaugural Address Corpus frequency distribution]

Most common words in the Inaugural Address Corpus:
[('the', 9906), ('of', 6986), ('and', 5139), ('to', 4432), ('in', 2749), ('a', 2193), ('our', 2058), ('that', 1726), ('we', 1625), ('be', 1460), ('is', 1416), ('it', 1367), ('for', 1154), ('by', 1066), ('which', 1002), ('have', 997), ('with', 937), ('as', 931), ('not', 924), ('will', 851), ('i', 832), ('this', 812), ('all', 794), ('are', 779), ('their', 738), ('but', 628), ('has', 612), ('government', 593)]

Results

Applying Zipf's law to a variety of different corpora has allowed a result set to be generated that specifies the statistical relationship a word has between its frequency and its rank, as well as various factors that may contribute to Zipf's law being upheld. For example, in the Brown Corpus the words in the table below each occur exactly once:

Word            Frequency
kurigalzu       1
pache           1
chain-reaction  1
recusant        1
rubric'         1
sicurella       1
booster         1
archaism        1
truth-packed    1
takeing         1
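Words that occur exactly once, like those in the table above, are known as hapax legomena, and they can be extracted with a short helper. This is a sketch using only the standard library; the function name hapax_legomena is my own.

```python
from collections import Counter


def hapax_legomena(words):
    """Return, sorted alphabetically, the words that occur exactly once."""
    counts = Counter(words)
    return sorted(word for word, count in counts.items() if count == 1)
```

With the NLTK Brown Corpus this would be called as hapax_legomena(brown.words()); NLTK's FreqDist class also provides a hapaxes() method for the same purpose.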