Programming for Linguists
An Introduction to Python
13/12/2012
Dictionaries
 Like a list, but more general
 In a list the index has to be an integer,
e.g. words[4]
 In a dictionary the index can be almost
any type
 A dictionary is like a mapping between 2
sets: keys and values
 To create an empty list:
my_list = [ ]
 To create an empty dictionary:
dictionary = { }
e.g. a dictionary containing English and
Spanish words:
>>>eng2sp = { }
>>>eng2sp['one'] = 'uno'
>>>print eng2sp
{'one': 'uno'}
 In this case both the keys and the values
are of the string type
 As with lists, you can also create a dictionary
with items already in it, e.g.
eng2sp = {'one': 'uno', 'two': 'dos', 'three':
'tres'}
print eng2sp
 Note: in general, the order of items in a
dictionary is unpredictable
 You can use the keys to look up the
corresponding values, e.g.
>>>print eng2sp['two']
 The key ‘two’ always maps to the value
‘dos’ so the order of the items does not
matter
 If the key is not in the dictionary you get an
error message, e.g.
>>>print eng2sp['ten']
KeyError: 'ten'
 The len( ) function returns the number of
key-value pairs
len(eng2sp)
 The in operator tells you whether
something appears as a key in the
dictionary
>>>'one' in eng2sp
True
 BUT
>>>'uno' in eng2sp
False
 To see whether something appears as a
value in a dictionary, you can use the
values( ) function, which returns the values
as a list, and then use the in operator, e.g.
>>>'uno' in eng2sp.values( )
True
 Lists can be values, but never keys!
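 A quick sketch of that difference (the error message is what Python raises when an unhashable type such as a list is used as a key):
>>>eng2sp['numbers'] = ['uno', 'dos', 'tres'] #a list as a value is fine
>>>eng2sp[['one', 'two']] = 'uno dos'
TypeError: unhashable type: 'list'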
Default dictionary
Try this:
words = ['een', 'twee', 'drie']
frequencyDict = { }
for w in words:
    frequencyDict[w] += 1
This fails with KeyError: 'een', because the key
does not exist yet when you try to add 1 to its value
 Possible solution:
for w in words:
    if w in frequencyDict:
        frequencyDict[w] += 1
    else:
        frequencyDict[w] = 1
 The easy solution:
>>>from collections import defaultdict
>>>frequencyDict = defaultdict(int)
>>>for w in words:
       frequencyDict[w] += 1
 you can pass int, float, str, list, … to
defaultdict; this determines the default value
for missing keys (0, 0.0, '', [ ], …)
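 A minimal sketch of those default values (the variable names are only illustrations):
>>>from collections import defaultdict
>>>counts = defaultdict(int)
>>>counts['x']
0
>>>groups = defaultdict(list)
>>>groups['vowels'].append('a')
>>>groups['vowels']
['a']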
A Dictionary as a Set of Counters
 Suppose you want to count the number of
times each letter occurs in a string, you
could:
 create 26 variables, traverse the string and,
for each letter, add 1 to the corresponding
counter
 create a dictionary with letters as keys and
counters as the corresponding values
def frequencies(sent):
    freq_dict = defaultdict(int)
    for let in sent:
        freq_dict[let] += 1
    return freq_dict
dictA = frequencies("abracadabra")
list_keys = dictA.keys( )
list_values = dictA.values( )
z_value = dictA['z']
 The first line of the function creates an
empty default dictionary
 The for loop traverses the string
 Each time through the loop, if the letter is
not yet in the dictionary, a new item is
created with the default value 0 and then
1 is added to it
 If the letter is already in the dictionary, we
add 1 to its corresponding value
Write a function that counts the word
frequencies in a sentence instead of the
letter frequencies using a dictionary
def words(sent):
    word_freq = defaultdict(int)
    wordlist = sent.split( )
    for word in wordlist:
        word_freq[word] += 1
    return word_freq
words("this is is a a test sentence")
Dictionary Lookup
 Given a dictionary “word_freq” and a key
“is”, finding the corresponding value:
word_freq["is"]
 This operation is called a lookup
 What if you know the value and want to
look up the corresponding key?
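 There is no built-in reverse lookup; a common approach is to loop over the keys and compare the values. A sketch, assuming the first matching key is enough (the function name is made up):
def lookup_key(d, target_value):
    for key in d:
        if d[key] == target_value:
            return key
    return None #no key maps to this value
>>>lookup_key(eng2sp, 'dos')
'two'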
Sorting a Dictionary According to its
Values
 First you need to import itemgetter:
from operator import itemgetter
 To go over each item in a dictionary you
can use .iteritems( )
 To sort the dictionary according to the
values, you need to use
key = itemgetter(1)
 To sort it decreasingly: reverse = True
>>>from operator import itemgetter
>>>def getValues(sent):
       w_fr = defaultdict(int)
       wordlist = sent.split( )
       for word in wordlist:
           w_fr[word] += 1
       byVals = sorted(w_fr.iteritems( ),
                       key=itemgetter(1),
                       reverse=True)
       return byVals
>>>getValues('this is a a a sentence')
Write a function that takes a sentence as
an argument and returns all words that
occur only once in the sentence.
def getHapax(sent):
    words = sent.split( )
    freqs = defaultdict(int)
    for w in words:
        freqs[w] += 1
    hapaxlist = [ ]
    for item in freqs:
        value = freqs[item]
        if value == 1:
            hapaxlist.append(item)
    return hapaxlist
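 Calling it on a short sentence (the order of the result may differ, since dictionaries are unordered):
>>>getHapax('this is is a test')
['this', 'a', 'test']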
Getting Started with NLTK
 In IDLE:
import nltk
nltk.download()
Searching Texts
 Start your script with importing all texts in
NLTK:
from nltk.book import *
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
 Any time you want to find out about these
texts, just enter their names at the Python
prompt:
>>> text1
<Text: Moby Dick by Herman Melville
1851>
 A concordance view shows every
occurrence of a given word, together with
some context:
e.g. "monstrous" in Moby Dick
text1.concordance("monstrous")
 Try looking up the context of “lol” in the
chat corpus (text 5)
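 One way to do this, assuming the texts from nltk.book have been imported:
>>>text5.concordance("lol")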
 If you have a corpus that contains texts
that are spread over time, you can look up
how some words are used differently over
time:
e.g. the Inaugural Address Corpus (dates
back to 1789): words like “nation”, “terror”,
“God”…
 You can also examine what other words
appear in a similar context, e.g.
text1.similar("monstrous")
 common_contexts( ) allows you to
examine the contexts that are shared by
two or more words, e.g.
text1.common_contexts(["very", "monstrous"])
 You can also determine the location of a
word in the text
 This positional information can be
displayed using a dispersion plot
 Each stripe represents an instance of a
word, and each row represents the entire
text, e.g.
text4.dispersion_plot(["citizens",
"democracy", "freedom", "duties",
"America"])
Counting Tokens
 To count the number of tokens (words +
punctuation marks), just use the len( )
function, e.g. len(text5)
 To count the number of unique tokens,
you have to make a set, e.g.
set(text5)
 If you want them sorted alphabetically,
try this:
sorted(set(text5))
 Note: in Python all capitalized words
precede lowercase words (you can
use .lower( ) first to avoid this)
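 A sketch of that normalisation step, lowercasing every token before building the set:
>>>sorted(set(w.lower() for w in text5))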
 Now you can calculate the lexical diversity
of a text, e.g.
the chat corpus (text5):
45010 tokens
6066 unique tokens or types
The lexical diversity =
nr of types/nr of tokens
 Use the Python functions to calculate the
lexical diversity of text 5
len(set(text5))/float(len(text5))
Frequency Distributions
 To find the most frequent tokens, use
FreqDist( ), e.g.
fdist = FreqDist(text1)
fdist["have"]
760
all_tokens = fdist.keys( )
all_tokens[:50]
 The function .keys( ) combined with the
FreqDist( ) also gives you a list of all the
unique tokens in the text
 Frequency distributions can be
informative, BUT the most frequent words
usually are function words (the, of, and,
…)
 What proportion of the text is taken up with
such words?
Cumulative frequency plot
fdist.plot(50, cumulative=True)
 If frequent tokens do not give enough
information, what about infrequent tokens?
Hapaxes = tokens which occur only once
fdist.hapaxes( )
 Without their context, you do not get much
information either
Fine-grained Selection of Tokens
 Extract tokens of a certain minimum
length:
tokens = set(text1)
long_tokens = [ ]
for token in tokens:
    if len(token) >= 15:
        long_tokens.append(token)
#OR shorter:
long_tokens = list(token for token in tokens
                   if len(token) >= 15)
 BUT: very long words are often hapaxes
 You can also extract frequently occurring
long words of a certain length:
words = set(text1)
fdist = FreqDist(text1)
#short version
freq_long_words = list(word for word in words
                       if len(word) >= 7 and fdist[word] >= 7)
Collocations and Bigrams
 A collocation is a sequence of words that
occur together unusually often, e.g. "red
wine" is a collocation, but "yellow wine" is not
 Collocations are essentially just frequent
bigrams (word pairs), but you can find
bigrams that occur more often than is to
be expected based on the frequency of the
individual words:
text8.collocations( )
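 To extract the bigrams themselves you can use nltk.bigrams( ), e.g. on a toy word list (a small sketch):
>>>list(nltk.bigrams(['more', 'is', 'said', 'than', 'done']))
[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]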
Some Functions for NLTK's Frequency
Distributions
fdist = FreqDist(samples)  →  create a frequency distribution for the given samples
fdist["word"]  →  number of times "word" occurs in the samples
fdist.freq("word")  →  relative frequency (proportion) of "word"
fdist.N( )  →  total number of samples
fdist.keys( )  →  the samples sorted in order of decreasing frequency
for sample in fdist:  →  iterates over the samples in order of decreasing frequency
fdist.max( )  →  sample with the greatest count
fdist.plot( )  →  graphical plot of the frequency distribution
fdist.plot(cumulative=True)  →  cumulative plot of the frequency distribution
fdist1 < fdist2  →  tests if the samples in fdist1 occur less frequently than in fdist2
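 A minimal sketch on a toy list of samples, to show what the main methods return:
>>>fd = FreqDist(['a', 'b', 'a', 'c', 'a'])
>>>fd['a']
3
>>>fd.freq('a')
0.6
>>>fd.N( )
5
>>>fd.max( )
'a'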
Accessing Corpora
 NLTK also contains entire corpora, e.g.:
Brown Corpus
NPS Chat
Gutenberg Corpus
…
A complete list can be found on
http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml
 Each of these corpora contains dozens of
individual texts
 To see which files are e.g. in the
Gutenberg corpus in NLTK:
nltk.corpus.gutenberg.fileids()
 Do not forget the dot notation nltk.corpus.
This tells Python the location of the corpus
 You can use the dot notation to work with
a corpus from NLTK or you can import a
corpus at the beginning of your script:
from nltk.corpus import gutenberg
 After that you just have to use the name of
the corpus and the dot notation before a
function
gutenberg.fileids( )
 If you want to examine a particular text, e.g.
Shakespeare’s Hamlet, you can use the
.words( ) function
hamlet = gutenberg.words("shakespeare-hamlet.txt")
 Note that “shakespeare-hamlet.txt” is the file
name that is to be found using the previous
.fileids( ) function
 You can use some of the previously
mentioned functions (corpus methods) on this
text, e.g.
fdist_hamlet = FreqDist(hamlet)
Some Corpus Methods in NLTK
 brown.raw( )  →  raw data from the corpus file(s)
 brown.categories( )  →  the predefined categories into which the fileids( ) are grouped
 brown.words( )  →  a list of words and punctuation tokens
 brown.sents( )  →  words( ) grouped into sentences
 brown.tagged_words( )  →  a list of (word, tag) pairs
 brown.tagged_sents( )  →  tagged_words( ) grouped into sentences
 treebank.parsed_sents( )  →  a list of parse trees
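 A quick sketch of what these methods return (no output shown; the comments describe the shape of the result):
>>>from nltk.corpus import brown
>>>brown.words()[:5] #first five word/punctuation tokens
>>>brown.tagged_words()[:5] #the same tokens as (word, tag) pairs
>>>brown.sents()[0] #the first sentence as a list of words
 The function below combines raw( ), words( ) and sents( ) to print some basic statistics for each file in a corpus: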
def statistics(corpus):
    for fileid in corpus.fileids( ):
        nr_chars = len(corpus.raw(fileid))
        nr_words = len(corpus.words(fileid))
        nr_sents = len(corpus.sents(fileid))
        nr_vocab = len(set([word.lower() for word in corpus.words(fileid)]))
        print fileid, "average word length:", nr_chars/nr_words, \
            "average sentence length:", nr_words/nr_sents, \
            "lexical diversity:", nr_words/nr_vocab
 Some corpora contain several
subcategories, e.g. the Brown Corpus
contains “news”, “religion”,…
 You can optionally specify these particular
categories or files from a corpus, e.g.:
from nltk.corpus import brown
brown.categories( )
brown.words(categories='news')
brown.words(fileids=['cg22'])
brown.sents(categories=['news', 'editorial', 'reviews'])
 Some linguistic research: comparing
genres in the Brown corpus in their usage
of modal verbs
import nltk
from nltk.corpus import brown
cfd = nltk.ConditionalFreqDist(
          (genre, word)
          for genre in brown.categories( )
          for word in brown.words(categories=genre))
#Do not press enter to type in the for
#statements!
genres = ['news', 'religion', 'hobbies',
          'science_fiction', 'romance', 'humor']
modal_verbs = ['can', 'could', 'may',
               'might', 'must', 'will']
cfd.tabulate(conditions=genres,
             samples=modal_verbs)
                 can  could  may  might  must  will
news              93     86   66     38    50   389
religion          82     59   78     12    54    71
hobbies          268     58  131     22    83   264
science_fiction   16     49    4     12     8    16
romance           74    193   11     51    45    43
humor             16     30    8      8     9    13
 A conditional frequency distribution is a
collection of frequency distributions, each one
for a different "condition”
 The condition is usually the category of the text
(news, religion,…)
Loading Your Own Text or Corpus
 Make sure that the texts/files of your
corpus are in plaintext format (convert
them, do not just change the file
extensions from e.g. .docx to .txt)
 Make a folder (directory) with the name of your
corpus which contains all the text files
A text in Python:
 open your file
f = open("/Users/claudia/text1.txt", "r")
 read in the text
 text1 = f.read( )  →  reads the text entirely
 text1 = f.readlines( )  →  reads in all lines that
end with \n and makes a list
 text1 = f.readline( )  →  reads in one line
Loading your own corpus in NLTK with no
subcategories:
import nltk
from nltk.corpus import PlaintextCorpusReader
loc = "/Users/claudia/my_corpus" #Mac
loc = r"C:\Users\claudia\my_corpus" #Windows
my_corpus = PlaintextCorpusReader(loc, ".*")
 Now you can use the corpus methods of
NLTK on your own corpus, e.g.
my_corpus.words( )
my_corpus.sents( )
…
Loading your own corpus in NLTK with
subcategories:
import nltk
from nltk.corpus import CategorizedPlaintextCorpusReader
loc = "/Users/claudia/my_corpus" #Mac
loc = r"C:\Users\claudia\my_corpus" #Windows 7
my_corpus = CategorizedPlaintextCorpusReader(loc,
    '(?!\.svn).*\.txt', cat_pattern=r'(cat1|cat2)/.*')
 If your corpus is loaded correctly, you
should get a list of all files in your corpus
by using:
my_corpus.fileids( )
 For a corpus with subcategories, you can
access the files in the subcategories by
taking the name of the subcategory as an
argument:
my_corpus.fileids(categories="cat1")
my_corpus.words(categories="cat2")
Writing Results to a File
 It is often useful to write output to files
 First you have to open/create a file for your
output
output_file = open('(path)/output.txt', 'w') #'w' overwrites the file
output_file = open('(path)/output.txt', 'a') #'a' appends to the file
output_file.write('hello')
output_file.close()
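 A small sketch tying this together (the file name is made up): write the word frequencies computed with the words( ) function above, one word per line:
output_file = open('frequencies.txt', 'w')
word_freq = words('this is is a a test sentence')
for word in word_freq:
    output_file.write(word + '\t' + str(word_freq[word]) + '\n')
output_file.close()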
 To download and install NLTK:
http://www.nltk.org/download
 Note: you need to have Python's NumPy
and Matplotlib packages installed in order
to produce the graphical plots
 See http://www.nltk.org/ for installation
instructions
Thank you