
Assignment

Zipf's Law
Zipf's Law is an empirical observation from computational linguistics which states that the frequency with which a word appears within a corpus of natural language is inversely proportional to its rank in the corpus's frequency table. The most frequent word therefore appears roughly twice as often as the second most frequent word, and three times as often as the third most frequent word. This statistical pattern continues throughout a corpus of natural language: ranking the words empirically by frequency of occurrence, starting with the most frequent word and continuing down to the least frequent, the frequency of each word falls away in proportion to its rank.
Zipf's Law Formula
$f(w) = \dfrac{C}{r(w)^{a}}$
f(w) = the frequency of the word w
r(w) = the rank of the word w
C = the number of times the most frequent word occurs
a = the exponent of the law, approximately 1
Zipf's Law is a power law with constant exponent k = -1. Applying a log scale to both the x- and y-axes therefore produces a linear relationship between the two axes: log(y) = log(C) + k log(x).
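As a quick numerical illustration of the formula (a minimal sketch; the value of C below is the Brown Corpus count of 'the' reported later in this report, and a is assumed to be exactly 1):

```python
import math

C = 69971   # count of the most frequent word ('the' in the Brown Corpus)
a = 1       # Zipf exponent, assumed to be exactly 1

for rank in range(1, 6):
    predicted = C / rank ** a
    # On log axes the points lie on a straight line: log(f) = log(C) - a*log(rank)
    print(f"rank {rank}: predicted frequency {predicted:,.0f}, "
          f"log(f) = {math.log(predicted):.3f}")
```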
Write a function to process a large text and plot word frequency against word rank
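One possible implementation is sketched below using NLTK's FreqDist and Matplotlib. The token filtering (keeping only tokens that contain at least one letter or digit, lower-cased) is an assumption and may differ slightly from the filtering used to produce the output shown in this report.

```python
import matplotlib.pyplot as plt
from nltk import FreqDist


def plot_zipf(words, title="Zipf plot"):
    """Plot word frequency against word rank on a log-log scale."""
    # Lower-case the tokens and drop punctuation-only tokens
    fdist = FreqDist(w.lower() for w in words if any(c.isalnum() for c in w))
    # Sorting counts from most to least common gives the rank ordering
    frequencies = [count for _, count in fdist.most_common()]
    ranks = range(1, len(frequencies) + 1)

    plt.loglog(ranks, frequencies, marker=".", linestyle="none")
    plt.xlabel("log(rank)")
    plt.ylabel("log(frequency)")
    plt.title(title)
    plt.show()
    return fdist


# Example usage: the Brown Corpus (requires the NLTK corpus data to be downloaded)
if __name__ == "__main__":
    from nltk.corpus import brown
    plot_zipf(brown.words(), "Brown Corpus")
```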
Function Output
Results
To conclude, the generated result set shows that words do not all occur at an equal rate: in general, most corpora contain a few words that occur frequently and a large selection of words that occur infrequently. This is exemplified in the work undertaken by Christina Sterbenz, who states that the most frequently occurring words are generally prepositions and comprise about 25% of English text. Going further, the top 100 words comprise about 50% of English text, whilst 50,000 words comprise a total of 95%; to account for the remaining 5%, a vocabulary of over a million words is required (Sterbenz, 2013). The table below shows the most frequent words for a range of corpora, supporting the statement above that a few select words account for most occurrences.
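Coverage figures of this kind can also be checked directly against a single corpus. The sketch below is an illustration against the Brown Corpus only, so its percentages will differ from Sterbenz's figures for English as a whole; it computes how much of the text is covered by the top N word types.

```python
from nltk import FreqDist
from nltk.corpus import brown

# Count lower-cased tokens, ignoring punctuation-only tokens
fdist = FreqDist(w.lower() for w in brown.words() if any(c.isalnum() for c in w))
total_tokens = fdist.N()   # total number of word tokens counted

for n in (10, 100, 1000, 10000):
    covered = sum(count for _, count in fdist.most_common(n))
    print(f"top {n:>5} word types cover {covered / total_tokens:.1%} of the corpus")
```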
Word Comparison of Corpus Analysis
Results
To test the validity of Zipf's Law and of the developed function, a series of tests was run on a range of different corpora. To generate an unbiased and accurate result set, each of the corpora chosen has idiosyncratic attributes unique to that specific body of text.
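The per-corpus results below could be generated by applying the same frequency-distribution analysis to each NLTK corpus reader in turn, roughly as follows. This is a sketch and assumes the standard NLTK corpus readers, each of which exposes a words() method.

```python
from nltk import FreqDist
from nltk.corpus import brown, webtext, gutenberg, nps_chat, inaugural

# The corpora analysed in the sections below
corpora = {
    "Brown Corpus": brown,
    "Web Text Corpus": webtext,
    "Gutenberg Corpus": gutenberg,
    "Chat Room Corpus": nps_chat,
    "Inaugural Address Corpus": inaugural,
}

for name, reader in corpora.items():
    fdist = FreqDist(w.lower() for w in reader.words()
                     if any(c.isalnum() for c in w))
    # Print the head of each frequency table; plot_zipf from the sketch
    # above could be called here instead to produce the log-log plots.
    print(name, fdist.most_common(10))
```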
Brown Corpus
The first corpus analysed with the NLTK frequency distribution function was the Brown Corpus. Examining the logarithmic scale shows that the Brown Corpus holds true to Zipf's Law: a few select words occur frequently, whilst many words occur infrequently.
Whilst the shape of the Zipf plot holds true to Zipf's Law, the plot generated does not show an accurately linear relationship between log(x) and log(y). To achieve a closer fit, a Mandelbrot distribution can be implemented, using the formula $f = C(r + p)^{k}$. The additional shift parameter p accounts for the very frequent stop words at the head of the distribution, allowing more meaningful statistics to be derived from the plot.
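A fit of this form could be obtained with a least-squares routine such as scipy.optimize.curve_fit, as sketched below. This is an illustration of the Zipf-Mandelbrot formula rather than the method actually used in this report, and the initial parameter guesses are arbitrary.

```python
import numpy as np
from scipy.optimize import curve_fit
from nltk import FreqDist
from nltk.corpus import brown


def zipf_mandelbrot(rank, C, p, k):
    # f = C * (rank + p) ** k, where k is expected to come out close to -1
    return C * (rank + p) ** k


fdist = FreqDist(w.lower() for w in brown.words() if any(c.isalnum() for c in w))
freqs = np.array([count for _, count in fdist.most_common()], dtype=float)
ranks = np.arange(1, len(freqs) + 1, dtype=float)

# Initial guesses: C near the top word's count, a small positive shift p, k = -1
params, _ = curve_fit(zipf_mandelbrot, ranks, freqs,
                      p0=(freqs[0], 2.0, -1.0), maxfev=10000)
C, p, k = params
print(f"fitted parameters: C = {C:,.0f}, p = {p:.2f}, k = {k:.2f}")
```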
10 most frequent words in the Brown Corpus
1. ('the', 69971)
2. ('of', 36412)
3. ('and', 28853)
4. ('to', 26158)
5. ('a', 23195)
6. ('in', 21337)
7. ('that', 10594)
8. ('is', 10109)
9. ('was', 9815)
10. ('he', 9548)
Zipf plots for randomly generated text: 5000 strings of 5 letters from the alphabet 'abcdefghijk'; 5000 words of 5 letters from the alphabet 'abcde'; 10000 words of 5 letters from the alphabet 'abcde'.
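The randomly generated test corpora referenced above could be produced along the following lines. This is a sketch: the word counts, word length, and alphabets mirror the plot captions above, while the uniform letter sampling is an assumption.

```python
import random
from collections import Counter


def random_corpus(n_words, word_length, alphabet):
    """Build a list of random fixed-length 'words' by sampling letters uniformly."""
    return ["".join(random.choices(alphabet, k=word_length)) for _ in range(n_words)]


# (word count, word length, alphabet) for each test corpus
tests = [
    (5000, 5, "abcdefghijk"),
    (5000, 5, "abcde"),
    (10000, 5, "abcde"),
]

for n_words, length, alphabet in tests:
    words = random_corpus(n_words, length, alphabet)
    counts = Counter(words)
    print(f"{n_words} words over '{alphabet}':", counts.most_common(5))
```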
Web Text Corpus log-log plot and frequency distribution
Most common words in the Web Text Corpus
[('i', 7925), ('the', 7909), ('to', 6325), ('a', 6199), ('you', 5812), ('and', 4516), ('in', 4293), ('t', 3547), ('it',
3385), ('s', 3320), ('on', 3281), ('of', 3186), ('is', 3024), ('girl', 2956), ('that', 2792), ('guy', 2751), ('not',
2665), ('1', 2261), ('with', 1943), ('for', 1908), ('when', 1833), ('2', 1709), ('like', 1696), ('my', 1558), ('no',
1539), ('this', 1491), ('but', 1447), ('have', 1410), ('so', 1381),
Gutenberg Corpus log-log plot and frequency distribution
Most common words in the Gutenberg Corpus
[('the', 133606), ('and', 95452), ('of', 71273), ('to', 48063), ('a', 33977), ('in', 33584), ('i', 30265), ('that',
28798), ('he', 25857), ('it', 22303), ('his', 21402), ('for', 19536), ('was', 18717), ('with', 17599), ('not',
17373), ('is', 16437), ('you', 16398), ('be', 16115), ('as', 14528), ('but', 13944), ('all', 13727), ('they',
13104)
Chat Room Corpus log-log plot and frequency distribution
Most common words in the Chat Room Corpus
[('i', 1224), ('part', 1022), ('join', 1021), ('lol', 822), ('you', 686), ('to', 665), ('the', 660), ('hi', 656), ('a',
580), ('me', 428), ('is', 380), ('in', 364), ('and', 357), ('it', 355), ('action', 347), ('hey', 292), ('that', 284),
('my', 259), ('of', 207), ('u', 204), ('what', 201), ("'s", 195), ('for', 189), ('on', 189), ('here', 185), ('no', 181),
('are', 181), ('do', 181), ('not', 179), ('have', 171)
Inaugural Address Corpus frequency distribution
Most frequent words in the Inaugural Address Corpus
[('the', 9906), ('of', 6986), ('and', 5139), ('to', 4432), ('in', 2749), ('a', 2193), ('our', 2058), ('that', 1726),
('we', 1625), ('be', 1460), ('is', 1416), ('it', 1367), ('for', 1154), ('by', 1066), ('which', 1002), ('have', 997),
('with', 937), ('as', 931), ('not', 924), ('will', 851), ('i', 832), ('this', 812), ('all', 794), ('are', 779), ('their',
738), ('but', 628), ('has', 612), ('government', 593),
Results
Implementing Zipf's Law across a variety of different corpora has allowed a result set to be generated that specifies the statistical relationship between a word's frequency and its rank, as well as various factors that may contribute to Zipf's Law being upheld. For example, in the Brown Corpus the following words each occur only once:
Word            Frequency
kurigalzu       1
pache           1
chain-reaction  1
recusant        1
rubric'         1
sicurella       1
booster         1
archaism        1
truth-packed    1
takeing         1
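Words that occur exactly once (hapax legomena), such as those in the table above, can be listed directly; NLTK's FreqDist provides a hapaxes() method, so a sketch might look like the following (the sample of ten words printed is arbitrary):

```python
from nltk import FreqDist
from nltk.corpus import brown

fdist = FreqDist(w.lower() for w in brown.words() if any(c.isalnum() for c in w))

# hapaxes() returns every word type that occurs exactly once in the corpus
once_only = fdist.hapaxes()
print(len(once_only), "word types occur exactly once in the Brown Corpus")
print(once_only[:10])
```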