Programming for Linguists
An Introduction to Python
13/12/2012

Dictionaries

Like a list, but more general:
in a list the index has to be an integer, e.g. words[4]
in a dictionary the index (the key) can be almost any type
A dictionary is a mapping between two sets: keys and values.

To create an empty list:       my_list = [ ]
To create an empty dictionary: my_dict = { }

e.g. a dictionary containing English and Spanish words:

    >>> eng2sp = { }
    >>> eng2sp['one'] = 'uno'
    >>> print eng2sp
    {'one': 'uno'}

In this case both the keys and the values are strings.

As with lists, you can also create a dictionary yourself, e.g.

    eng2sp = {'one': 'uno', 'two': 'dos', 'three': 'tres'}
    print eng2sp

Note: in general, the order of items in a dictionary is unpredictable.

You can use the keys to look up the corresponding values, e.g.

    >>> print eng2sp['two']

The key 'two' always maps to the value 'dos', so the order of the items does not matter.

If the key is not in the dictionary, you get an error message, e.g.

    >>> print eng2sp['ten']
    KeyError: 'ten'

The len( ) function returns the number of key-value pairs:

    len(eng2sp)

The in operator tells you whether something appears as a key in the dictionary:

    >>> 'one' in eng2sp
    True

BUT

    >>> 'uno' in eng2sp
    False

To see whether something appears as a value in a dictionary, you can use the values( ) function, which returns the values as a list, and then use the in operator, e.g.

    >>> 'uno' in eng2sp.values( )
    True

Lists can be values, but never keys!

Default dictionary

Try this:

    words = ['een', 'twee', 'drie']
    frequencyDict = { }
    for w in words:
        frequencyDict[w] += 1

This fails with a KeyError: the first time a word is seen, there is no value yet to add 1 to.

Possible solution:

    for w in words:
        if w in frequencyDict:
            frequencyDict[w] += 1
        else:
            frequencyDict[w] = 1

The easy solution:

    >>> from collections import defaultdict
    >>> frequencyDict = defaultdict(int)
    >>> for w in words:
            frequencyDict[w] += 1

You can use int, float, str, … as the default type in the defaultdict.

A Dictionary as a Set of Counters

Suppose you want to count the number of times each letter occurs in a string. You could:
create 26 variables, traverse the string and, for each letter, add 1 to the corresponding counter
or create a dictionary with letters as keys and counters as the corresponding values

    def frequencies(sent):
        freq_dict = defaultdict(int)
        for let in sent:
            freq_dict[let] += 1
        return freq_dict

    dictA = frequencies("abracadabra")
    list_keys = dictA.keys( )
    list_values = dictA.values( )
    z_value = dictA['z']

The first line of the function creates an empty default dictionary.
The for loop traverses the string.
The first time a letter is seen, it is added as a new key with the default value 0 and 1 is added to it; if the letter is already in the dictionary, we simply add 1 to its corresponding value.

Write a function that counts the word frequencies in a sentence instead of the letter frequencies, using a dictionary.

    def words(sent):
        word_freq = defaultdict(int)
        wordlist = sent.split( )
        for word in wordlist:
            word_freq[word] += 1
        return word_freq

    words("this is is a a test sentence")

Dictionary Lookup

Given a dictionary word_freq and a key "is", finding the corresponding value:

    word_freq["is"]

This operation is called a lookup.
What if you know the value and want to look up the corresponding key?
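One way to answer this (a small sketch, not part of the original hand-out): a dictionary only supports fast lookup by key, so to go from a value back to its key(s) you have to loop over the items yourself. The function name reverse_lookup is just an illustrative choice, and the example assumes the eng2sp dictionary from above.

    def reverse_lookup(dictionary, value):
        # collect every key whose value matches; several keys may share one value
        matching_keys = [ ]
        for key in dictionary:
            if dictionary[key] == value:
                matching_keys.append(key)
        return matching_keys

    >>> reverse_lookup(eng2sp, 'dos')
    ['two']

Because values need not be unique, the sketch returns a list of keys rather than a single key.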
Sorting a Dictionary According to its Values

First you need to import itemgetter:

    from operator import itemgetter

To go over each item in a dictionary you can use .iteritems( ).
To sort the dictionary according to its values, use key = itemgetter(1).
To sort it in decreasing order: reverse = True.

    >>> from collections import defaultdict
    >>> from operator import itemgetter
    >>> def getValues(sent):
            w_fr = defaultdict(int)
            wordlist = sent.split( )
            for word in wordlist:
                w_fr[word] += 1
            byVals = sorted(w_fr.iteritems( ), key=itemgetter(1), reverse=True)
            return byVals

    >>> getValues('this is a a a sentence')

Write a function that takes a sentence as an argument and returns all words that occur only once in the sentence.

    def getHapax(sent):
        words = sent.split( )
        freqs = defaultdict(int)
        for w in words:
            freqs[w] += 1
        hapaxlist = [ ]
        for item in freqs:
            value = freqs[item]
            if value == 1:
                hapaxlist.append(item)
        return hapaxlist

Getting Started with NLTK

In IDLE:

    import nltk
    nltk.download()

Searching Texts

Start your script by importing all the texts in NLTK:

    from nltk.book import *

text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G. K. Chesterton 1908

Any time you want to find out about these texts, just enter their names at the Python prompt:

    >>> text1
    <Text: Moby Dick by Herman Melville 1851>

A concordance view shows every occurrence of a given word, together with some context, e.g. "monstrous" in Moby Dick:

    text1.concordance("monstrous")

Try looking up the context of "lol" in the chat corpus (text5).

If you have a corpus that contains texts spread over time, you can look up how some words are used differently over time, e.g. the Inaugural Address Corpus (dates back to 1789): words like "nation", "terror", "God", …

You can also examine what other words appear in a similar context, e.g.

    text1.similar("monstrous")

common_contexts( ) allows you to examine the contexts that are shared by two or more words, e.g.

    text1.common_contexts(["very", "monstrous"])

You can also determine the location of a word in the text. This positional information can be displayed using a dispersion plot: each stripe represents an instance of the word, and each row represents the entire text, e.g.

    text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])

Counting Tokens

To count the number of tokens (words + punctuation marks), just use the len( ) function, e.g.

    len(text5)

To count the number of unique tokens, you have to make a set, e.g.

    set(text5)

If you want them sorted alphabetically, try this:

    sorted(set(text5))

Note: in Python all capitalized words precede lowercase words (you can use .lower( ) first to avoid this).

Now you can calculate the lexical diversity of a text, e.g. the chat corpus (text5):
45010 tokens
6066 unique tokens or types
The lexical diversity = nr of types / nr of tokens

Use the Python functions to calculate the lexical diversity of text5:

    len(set(text5))/float(len(text5))
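As a small illustration (not in the original slides), the same calculation can be wrapped in a reusable function; lexical_diversity is just an illustrative name, and the sketch assumes the NLTK book texts have been imported as above.

    def lexical_diversity(text):
        # number of unique tokens (types) divided by the total number of tokens
        return len(set(text)) / float(len(text))

    >>> lexical_diversity(text5)   # roughly 0.13 for the chat corpus (6066 types / 45010 tokens)

The same function works for any of the texts, so you can compare, for instance, text1 and text5 directly.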
Frequency Distributions

To find the n most frequent tokens, use FreqDist( ), e.g.

    >>> fdist = FreqDist(text1)
    >>> fdist["have"]
    760
    >>> all_tokens = fdist.keys( )
    >>> all_tokens[:50]

The function .keys( ) combined with FreqDist( ) also gives you a list of all the unique tokens in the text.

Frequency distributions can be informative, BUT the most frequent words usually are function words (the, of, and, …).
What proportion of the text is taken up by such words? Use a cumulative frequency plot:

    fdist.plot(50, cumulative=True)

If frequent tokens do not give enough information, what about infrequent tokens?
Hapaxes = tokens which occur only once:

    fdist.hapaxes( )

Without their context, you do not get much information either.

Fine-grained Selection of Tokens

Extract tokens of a certain minimum length:

    tokens = set(text1)
    long_tokens = [ ]
    for token in tokens:
        if len(token) >= 15:
            long_tokens.append(token)

    # OR shorter:
    long_tokens = list(token for token in tokens if len(token) >= 15)

BUT: very long words are often hapaxes.
You can also extract frequently occurring long words of a certain length:

    words = set(text1)
    fdist = FreqDist(text1)
    # short version
    freq_long_words = list(word for word in words if len(word) >= 7 and fdist[word] >= 7)

Collocations and Bigrams

A collocation is a sequence of words that occur together unusually often, e.g. "red wine" is a collocation, "yellow wine" is not.
Collocations are essentially just frequent bigrams (word pairs), but you want the bigrams that occur more often than is to be expected based on the frequency of the individual words:

    text8.collocations( )

Some Functions for NLTK's Frequency Distributions

    fdist = FreqDist(samples)      create a frequency distribution over the given samples
    fdist["word"]                  count of "word"
    fdist.freq("word")             relative frequency (proportion) of "word"
    fdist.N( )                     total number of samples
    fdist.keys( )                  the samples sorted in order of decreasing frequency
    for sample in fdist:           iterates over the samples in order of decreasing frequency
    fdist.max( )                   sample with the greatest count
    fdist.plot( )                  graphical plot of the frequency distribution
    fdist.plot(cumulative=True)    cumulative plot of the frequency distribution
    fdist1 < fdist2                tests if the samples in fdist1 occur less frequently than in fdist2

Accessing Corpora

NLTK also contains entire corpora, e.g.:
Brown Corpus
NPS Chat
Gutenberg Corpus
…
A complete list can be found at http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml

Each of these corpora contains dozens of individual texts. To see which files are e.g. in the Gutenberg corpus in NLTK:

    nltk.corpus.gutenberg.fileids()

Do not forget the dot notation nltk.corpus. This tells Python the location of the corpus.

You can use the dot notation to work with a corpus from NLTK, or you can import a corpus at the beginning of your script:

    from nltk.corpus import gutenberg

After that you just have to use the name of the corpus and the dot notation before a function:

    gutenberg.fileids( )

If you want to examine a particular text, e.g. Shakespeare's Hamlet, you can use the .words( ) function:

    hamlet = gutenberg.words("shakespeare-hamlet.txt")

Note that "shakespeare-hamlet.txt" is the file name that is found with the .fileids( ) function above.

You can use some of the previously mentioned functions (corpus methods) on this text, e.g.

    fdist_hamlet = FreqDist(hamlet)
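For instance (a short sketch that is not part of the original hand-out), the fine-grained selection shown earlier can be reused on Hamlet, assuming hamlet and fdist_hamlet have been created as above:

    # words of at least 7 letters that occur at least 7 times in Hamlet
    frequent_long_words = list(word for word in set(hamlet)
                               if len(word) >= 7 and fdist_hamlet[word] >= 7)
    print sorted(frequent_long_words)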
Some Corpus Methods in NLTK

    brown.raw( )               the raw data from the corpus file(s)
    brown.categories( )        fileids( ) grouped per predefined category
    brown.words( )             a list of words and punctuation tokens
    brown.sents( )             words( ) grouped into sentences
    brown.tagged_words( )      a list of (word, tag) pairs
    brown.tagged_sents( )      tagged_words( ) grouped into sentences
    treebank.parsed_sents( )   a list of parse trees

    def statistics(corpus):
        for fileid in corpus.fileids( ):
            nr_chars = len(corpus.raw(fileid))
            nr_words = len(corpus.words(fileid))
            nr_sents = len(corpus.sents(fileid))
            nr_vocab = len(set([word.lower() for word in corpus.words(fileid)]))
            print fileid, "average word length:", nr_chars/nr_words, "average sentence length:", nr_words/nr_sents, "lexical diversity:", nr_words/nr_vocab

Some corpora contain several subcategories, e.g. the Brown Corpus contains "news", "religion", …
You can optionally specify these particular categories or files from a corpus, e.g.:

    from nltk.corpus import brown
    brown.categories( )
    brown.words(categories='news')
    brown.words(fileids=['cg22'])
    brown.sents(categories=['news', 'editorial', 'reviews'])

Some linguistic research: comparing genres in the Brown Corpus in their usage of modal verbs.

    from nltk.corpus import brown
    cfd = nltk.ConditionalFreqDist(
              (genre, word)
              for genre in brown.categories( )
              for word in brown.words(categories=genre))
    # Do not press enter in the middle of the for statements!

    genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
    modal_verbs = ['can', 'could', 'may', 'might', 'must', 'will']
    cfd.tabulate(conditions=genres, samples=modal_verbs)

                      can  could  may  might  must  will
    news               93     86   66     38    50   389
    religion           82     59   78     12    54    71
    hobbies           268     58  131     22    83   264
    science_fiction    16     49    4     12     8    16
    romance            74    193   11     51    45    43
    humor              16     30    8      8     9    13

A conditional frequency distribution is a collection of frequency distributions, each one for a different "condition".
The condition is usually the category of the text (news, religion, …).

Loading Your Own Text or Corpus

Make sure that the texts/files of your corpus are in plain text format (convert them, do not just change the file extension from e.g. .docx to .txt).
Make a folder with the name of your corpus that contains all the text files.

Reading a text in Python: open your file

    f = open("/Users/claudia/text1.txt", "r")

then read in the text:

    text1 = f.read( )         reads the text entirely
    text1 = f.readlines( )    reads in all lines that end with \n and makes a list
    text1 = f.readline( )     reads in one line

Loading your own corpus in NLTK with no subcategories:

    import nltk
    from nltk.corpus import PlaintextCorpusReader
    loc = "/Users/claudia/my_corpus"       # Mac
    loc = r"C:\Users\claudia\my_corpus"    # Windows
    my_corpus = PlaintextCorpusReader(loc, ".*")

Now you can use the corpus methods of NLTK on your own corpus, e.g.

    my_corpus.words( )
    my_corpus.sents( )
    …
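A short illustrative sketch (not part of the original slides), assuming my_corpus has been loaded as above and contains at least one text file; it follows the fdist.keys( ) idiom used earlier to list the most frequent tokens of your own corpus:

    import nltk
    fdist_corpus = nltk.FreqDist(my_corpus.words( ))
    # .keys( ) is sorted by decreasing frequency, so slicing gives the most frequent tokens
    print fdist_corpus.keys( )[:20]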
Loading your own corpus in NLTK with subcategories:

    import nltk
    from nltk.corpus import CategorizedPlaintextCorpusReader
    loc = "/Users/claudia/my_corpus"       # Mac
    loc = r"C:\Users\claudia\my_corpus"    # Windows 7
    my_corpus = CategorizedPlaintextCorpusReader(loc, r'(?!\.svn).*\.txt', cat_pattern=r'(cat1|cat2)/.*')

If your corpus is loaded correctly, you should get a list of all files in your corpus by using:

    my_corpus.fileids( )

For a corpus with subcategories, you can access the files in the subcategories by passing the name of the subcategory as an argument:

    my_corpus.fileids(categories="cat1")
    my_corpus.words(categories="cat2")

Writing Results to a File

It is often useful to write output to files.
First you have to open/create a file for your output:

    output_file = open('(path)/output.txt', 'w')   # 'w' opens the file for writing (overwrites existing content)
    output_file = open('(path)/output.txt', 'a')   # 'a' opens the file for appending

    output_file.write('hello')
    output_file.close()

To download and install NLTK: http://www.nltk.org/download
Note: you need to have Python's NumPy and Matplotlib packages installed in order to produce the graphical plots.
See http://www.nltk.org/ for installation instructions.

Thank you