NLTK & Python
Day 5
LING 681.02 Computational Linguistics
Harry Howard
Tulane University

Course organization
I have requested that Python and NLTK be installed on the computers in this room.

31-Aug-2009 LING 681.02, Prof. Howard, Tulane University

NLPP §1.2 A Closer Look at Python: Texts as Lists of Words

Variables
variable = expression
    >>> my_sent = ['Bravely', 'bold', 'Sir', 'Robin', ',', 'rode',
    ... 'forth', 'from', 'Camelot', '.']
    >>> noun_phrase = my_sent[1:4]
    >>> noun_phrase
    ['bold', 'Sir', 'Robin']
    >>> wOrDs = sorted(noun_phrase)
    >>> wOrDs
    ['Robin', 'Sir', 'bold']

How to name variables
Valid names (or identifiers) …
    must start with a letter or an underscore, optionally followed by letters, digits, or underscores;
    are case-sensitive;
    cannot contain whitespace (use an underscore instead) or a dash (it means minus);
    cannot be a reserved word.

Strings
A string is a sequence of characters, e.g. an individual word. Many of the operations used on lists also work on strings.
Some operations on strings
    >>> name = 'Monty'
    >>> name[0]
    'M'
    >>> name[:4]
    'Mont'
    >>> name * 2
    'MontyMonty'
    >>> name + '!'
    'Monty!'
    >>> ' '.join(['Monty', 'Python'])
    'Monty Python'
    >>> 'Monty Python'.split()
    ['Monty', 'Python']

NLPP §1.3. Computing with Language: Simple Statistics

Frequency distribution
What is a frequency distribution?
    It tells us the frequency of each vocabulary item in a text.
    It is a "distribution" because it tells us how the total number of word tokens in the text is distributed across the vocabulary items.
What function in NLTK calculates it?
    FreqDist(text_name)
What expression lists the vocabulary items of the distribution?
    fdist.keys(), where fdist = FreqDist(text_name)

Very frequent words
How would you describe the 50 most frequent elements in Moby Dick?
    >>> fdist1 = FreqDist(text1)
    >>> fdist1.plot(50, cumulative=True)
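The idea of a frequency distribution can be tried without NLTK or its corpora installed. The following is a minimal sketch that uses `collections.Counter` from the standard library as a stand-in for `FreqDist`; the token list is made up for illustration and is not taken from Moby Dick. It also computes the share of all tokens covered by the most frequent items, which is roughly what the cumulative plot shows.

```python
from collections import Counter

# Toy token list standing in for an NLTK text such as text1
# (an assumption for illustration, not real corpus data).
tokens = ['the', 'whale', 'and', 'the', 'sea', 'and', 'the', 'ship']

# Count how often each vocabulary item occurs.
fdist = Counter(tokens)

# The two most frequent vocabulary items, paired with their counts.
top_two = fdist.most_common(2)

# Share of all tokens accounted for by those two items.
coverage = sum(count for _, count in top_two) / len(tokens)

print(top_two)
print(coverage)
```

Even in this tiny example, two short, general words account for more than half of the tokens.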
Very infrequent words
Words that occur only once are called hapaxes.
    >>> fdist1.hapaxes()
In Moby Dick: lexicographer, cetological, contraband, expostulations, and about 9,000 others.
How would you describe them?

Summary
                        Most frequent       Least frequent
    Length              short               long
    Meaning             very general        very specific
    Coverage of text    large proportion    small proportion

Question
Which group would you look in to find words that help you understand what the text is about?
    Neither.

Fine-grained word selection
Some Python expressions are based on set theory.
    a) {w | w ∈ V & P(w)}
    b) [w for w in V if p(w)], though this returns a list, not a set. (What's the difference?)
Real NLTK
    >>> V = set(text1)
    >>> long_words = [w for w in V if len(w) > 15]

Finding words that characterize a text
Not too short (>?) and not too infrequent (>?)
    >>> fdist1 = FreqDist(text1)
    >>> informative_words = [w for w in V if len(w) > 7 and fdist1[w] > 7]

Finding groups of words
What is the name for a sequence of two words?
    Bigram ~ bigrams()
    >>> list(bigrams(['more', 'is', 'said', 'than', 'done']))
    [('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
What is the name for a sequence of words that occur together unusually often?
    Collocation ~ collocations()
    They are essentially bigrams that occur more often than we would expect based on the frequency of individual words.
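To see what a bigram list contains, it can be built by hand. The following is a minimal sketch in plain Python; `make_bigrams` is a hypothetical helper name, not an NLTK function, and the second token list is made up for illustration. It only counts repeated pairs; it does not score them against the frequency of the individual words, so it is a first step toward collocations rather than a reimplementation of `collocations()`.

```python
from collections import Counter

def make_bigrams(words):
    """Return the list of adjacent word pairs in `words` (hypothetical helper)."""
    return list(zip(words, words[1:]))

# Same input as the bigrams() example above.
pairs = make_bigrams(['more', 'is', 'said', 'than', 'done'])
print(pairs)

# Toy token list (an assumption for illustration, not real corpus data).
tokens = ['united', 'states', 'of', 'america', 'and', 'the',
          'united', 'states', 'of', 'mexico']

# Count each bigram and keep those that occur more than once.
bigram_counts = Counter(make_bigrams(tokens))
repeated = [pair for pair, count in bigram_counts.items() if count > 1]
print(repeated)
```

A list of n words yields n - 1 bigrams, which is why the five-word example produces four pairs.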
Example
    >>> text4.collocations()
    Building collocations list
    United States; fellow citizens; years ago; Federal Government;
    General Government; American people; Vice President; Almighty God;
    Fellow citizens; Chief Magistrate; Chief Justice; God bless;
    Indian tribes; public debt; foreign nations; political parties;
    State governments; National Government; United Nations; public money

Counting Other Things

Next time
First quiz/project
NLPP: finish §1 and do all exercises; do up to Ex 8 in §2