LING 681 Intro to Comp Ling

NLTK & Python
Day 5
LING 681.02
Computational Linguistics
Harry Howard
Tulane University
Course organization
 I have requested that Python and NLTK be
installed on the computers in this room.
31-Aug-2009
LING 681.02, Prof. Howard, Tulane University
NLPP
§1.2 A Closer Look at Python: Texts
as Lists of Words
Variables
 variable = expression
>>> my_sent = ['Bravely', 'bold', 'Sir', 'Robin', ',', 'rode',
... 'forth', 'from', 'Camelot', '.']
>>> noun_phrase = my_sent[1:4]
>>> noun_phrase
['bold', 'Sir', 'Robin']
>>> wOrDs = sorted(noun_phrase)
>>> wOrDs
['Robin', 'Sir', 'bold']
How to name variables
 Valid names (or identifiers) …
must start with a letter or an underscore, optionally followed by
letters, digits, or underscores;
are case-sensitive;
cannot contain whitespace (use an underscore)
or a dash (means minus);
cannot be a reserved word.
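These naming rules can be checked mechanically. A minimal sketch, assuming Python 3; the helper name `is_valid_name` is hypothetical, introduced only for illustration, and relies on the standard `str.isidentifier()` and `keyword.iskeyword()`.

```python
import keyword

def is_valid_name(name):
    """Return True if `name` can be used as a Python variable name."""
    # isidentifier() checks the character rules;
    # iskeyword() rules out reserved words such as 'not' or 'for'.
    return name.isidentifier() and not keyword.iskeyword(name)

# Try a few candidates: the last four break one rule each.
for candidate in ['my_sent', 'wOrDs', 'text2', '2text', 'my sent', 'my-sent', 'not']:
    print(candidate, is_valid_name(candidate))
```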
Strings
 A string is a sequence of characters, e.g. an individual word; many list operations also work on strings.
 Some operations and methods for strings
>>> name = 'Monty'
>>> name[0]
'M'
>>> name[:4]
'Mont'
>>> name * 2
'MontyMonty'
>>> name + '!'
'Monty!'
>>> ' '.join(['Monty', 'Python'])
'Monty Python'
>>> 'Monty Python'.split()
['Monty', 'Python']
NLPP
§1.3. Computing with Language:
Simple Statistics
Frequency distribution
 What is a frequency distribution?
 It tells us the frequency of each vocabulary item in a
text.
 It is a "distribution" because it tells us how the total
number of word tokens in the text is distributed across
the vocabulary items.
 What function in NLTK calculates it?
 FreqDist(text_name)
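 What expression lists the vocabulary items in the
distribution?
 FreqDist(text_name).keys()
 In NLTK 3, FreqDist(text_name).most_common(n) lists the n most
frequent items together with their counts.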
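The idea of a frequency distribution can be tried without NLTK. A minimal sketch, assuming a made-up token list and using the standard library's `collections.Counter` as a stand-in for `FreqDist`; the names `tokens` and `fdist` are illustrative only.

```python
from collections import Counter

# Toy text: 8 word tokens, 5 vocabulary items (assumed data, not from a corpus).
tokens = ['the', 'whale', 'and', 'the', 'sea', 'and', 'the', 'ship']

fdist = Counter(tokens)               # maps each vocabulary item to its count
total_tokens = sum(fdist.values())    # the counts add up to the number of tokens
vocabulary = sorted(fdist)            # the vocabulary items, alphabetically

print(fdist['the'])           # 3
print(total_tokens)           # 8
print(fdist.most_common(2))   # [('the', 3), ('and', 2)]
```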
Very frequent words
 How would you describe the 50 most frequent elements in
Moby Dick?
>>> fdist1 = FreqDist(text1)
>>> fdist1.plot(50, cumulative=True)
Very infrequent words
 Words that occur only once are called
hapaxes.
>>> fdist1.hapaxes()
In Moby Dick, "lexicographer, cetological,
contraband, expostulations", and about 9,000
others.
How would you describe them?
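What counts as a hapax can be spelled out by hand. A minimal sketch on a made-up token list, with `collections.Counter` standing in for `FreqDist`; it is meant to show the definition, not how NLTK implements `hapaxes()`.

```python
from collections import Counter

# Toy text (assumed data): 'the' and 'and' repeat, the rest occur once.
tokens = ['the', 'whale', 'and', 'the', 'sea', 'and', 'the', 'ship']
fdist = Counter(tokens)

# A hapax is a vocabulary item whose count is exactly 1.
hapaxes = sorted(w for w, count in fdist.items() if count == 1)
print(hapaxes)   # ['sea', 'ship', 'whale']
```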
Summary
                   Most frequent      Least frequent
Length             short              long
Meaning            very general       very specific
Coverage of text   large proportion   small proportion
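The length and coverage rows of the summary can be measured directly. A minimal sketch on a made-up sentence, with `collections.Counter` standing in for `FreqDist`; the helper names `coverage` and `mean_length` are hypothetical.

```python
from collections import Counter

def coverage(words, fdist):
    """Proportion of all tokens accounted for by the given vocabulary items."""
    return sum(fdist[w] for w in words) / sum(fdist.values())

def mean_length(words):
    """Average number of characters per vocabulary item."""
    return sum(len(w) for w in words) / len(words)

# Toy text (assumed data): 11 tokens.
tokens = 'the whale and the sea and the ship of the lexicographer'.split()
fdist = Counter(tokens)

most_frequent = [w for w, _ in fdist.most_common(2)]   # 'the' and 'and'
hapaxes = [w for w, count in fdist.items() if count == 1]

print(coverage(most_frequent, fdist), mean_length(most_frequent))
print(coverage(hapaxes, fdist), mean_length(hapaxes))
```

A sample this small is too short to show the coverage contrast sharply; the same two functions could be applied to the counts of a full text.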
Question
 Which group would you look in to find
words that help you understand what the
text is about?
Neither.
Fine-grained word
selection

Some Python expressions are based on set
theory.
a) {w | w ∈ V & P(w)}
b) [w for w in V if p(w)], though this returns a list,
not a set. (What's the difference?)

Real NLTK
>>> V = set(text1)
>>> long_words = [w for w in V if len(w) > 15]
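One way to approach the list-versus-set question is to run the same condition through both kinds of comprehension. A minimal sketch on a made-up token list; the names `as_list` and `as_set` are illustrative.

```python
# Toy text (assumed data) with a repeated word.
tokens = ['the', 'whale', 'and', 'the', 'whale', 'ship']

as_list = [w for w in tokens if len(w) > 3]   # keeps order and duplicates
as_set = {w for w in tokens if len(w) > 3}    # no duplicates, no fixed order

print(as_list)          # ['whale', 'whale', 'ship']
print(sorted(as_set))   # ['ship', 'whale']
```

When the comprehension iterates over a set such as `V`, the resulting list has no duplicates either; what remains different is that a list can be indexed and sliced, while a set cannot.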
Finding words that
characterize a text
 Not too short (>?) and not too infrequent
(>?)
 >>> fdist1 = FreqDist(text1)
 >>> informative_words = [w for w in V
... if len(w) > 7 and fdist1[w] > 7]
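The two-condition filter can be tried on a toy example without NLTK. A minimal sketch, with `collections.Counter` standing in for `FreqDist`; the token list and the thresholds `min_len` and `min_count` are assumptions chosen to suit such a small sample.

```python
from collections import Counter

# Toy text (assumed data): counts are question=3, the=5, lexicographer=1, remember=2.
tokens = ['question'] * 3 + ['the'] * 5 + ['lexicographer'] + ['remember'] * 2
fdist = Counter(tokens)
V = set(tokens)

min_len, min_count = 7, 1   # hypothetical thresholds for this tiny sample

# Keep words that are long enough AND frequent enough.
informative_words = sorted(w for w in V
                           if len(w) > min_len and fdist[w] > min_count)
print(informative_words)   # ['question', 'remember']
```

Here 'the' fails the length test and 'lexicographer' fails the frequency test, so only the two mid-range words survive.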
Finding groups of words
 What is the name for a sequence of two words?
 Bigram ~ bigrams()
>>> list(bigrams(['more', 'is', 'said', 'than', 'done']))
[('more', 'is'), ('is', 'said'), ('said', 'than'),
('than', 'done')]
 What is the name for a sequence of words that
occur together unusually often?
 Collocation ~ collocations()
 They are essentially bigrams that occur more often than
we would expect based on the frequency of individual
words.
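The notion of "more often than we would expect" can be made concrete with a simple association score. A minimal sketch using pointwise mutual information on a made-up token list; this is one possible measure chosen for illustration, not necessarily the one `collocations()` uses, and the helper names `bigram_pairs` and `pmi` are hypothetical.

```python
import math
from collections import Counter

def bigram_pairs(tokens):
    """Return the list of adjacent word pairs in `tokens`."""
    return list(zip(tokens, tokens[1:]))

def pmi(pair, unigram_counts, bigram_counts):
    """Pointwise mutual information of a bigram that occurs at least once.

    Compares the observed probability of the pair with the probability
    expected if the two words occurred independently of each other.
    """
    n_tokens = sum(unigram_counts.values())
    n_bigrams = sum(bigram_counts.values())
    p_pair = bigram_counts[pair] / n_bigrams
    p_first = unigram_counts[pair[0]] / n_tokens
    p_second = unigram_counts[pair[1]] / n_tokens
    return math.log2(p_pair / (p_first * p_second))

# Toy text (assumed data).
tokens = 'the united states and the united nations and the people'.split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(bigram_pairs(tokens))

# A positive score means the pair occurs more often than chance predicts.
for pair in [('the', 'united'), ('and', 'the')]:
    print(pair, round(pmi(pair, unigram_counts, bigram_counts), 2))
```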
Example
 >>> text4.collocations()
 Building collocations list
 United States; fellow citizens; years
ago; Federal Government; General
Government; American people; Vice
President; Almighty God; Fellow
citizens; Chief Magistrate; Chief
Justice; God bless; Indian tribes;
public debt; foreign nations; political
parties; State governments; National
Government; United Nations; public
money
Counting Other Things
Next time
First quiz/project
NLPP: finish §1 and do all exercises;
do up to Ex 8 in §2