Getting started with Python/NLTK

CIS 530 Fall 2010
Annie Louis
20 Sept. 2010

Basics to help you get started
◦ Data structures, functions, loops, running a script…
◦ NLTK corpus reader, probability distributions


You will find more resources on the course
webpage
Some examples in this tutorial are based on
◦ Mark Lutz & David Ascher, ‘Learning Python’
◦ Brad Dayley, ‘Python Phrasebook’

Portable

Can wrap code written in other languages, e.g. C

Many built-in types: lists, dictionaries…

Easy to learn quickly, and very concise

NLTK toolkit comes with a lot of NLP utilities

Read
http://www.cis.upenn.edu/~cis530/hw_2010/generalinfo.pdf

Both already installed on eniac
> python
Python 2.6.2 (r262:71600, Jun 17 2010, 13:37:45)
[GCC 4.4.1 [gcc-4_4-branch revision 150839]] on linux2
Type "help", "copyright", "credits" or "license" for more
information.
>>> import nltk
>>> nltk.corpus.brown.words()[0]
'The'
> python
>>> 2+3
5
>>> word = "python"
>>> word[2]
't'
>>> print word
python
No type declaration
>>> x=3
>>> y=9
>>> z = (x + y + 0.0)/3
>>> z
4.0
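
The + 0.0 in the average above matters because Python 2.6 truncates when it divides one integer by another. A minimal sketch of that point and of names taking the type of whatever value they hold; float() is shown as an assumed alternative to adding 0.0, and print is called with one parenthesised argument so the lines run under Python 2.6 as well as later versions.

```python
# Names carry no declared type; the type belongs to the value.
x = 3
y = 9
print(type(x).__name__)          # int

# Under Python 2.6, int / int truncates (10 / 4 gives 2), so one
# operand is made a float to get an exact average.
z = (x + y + 0.0) / 3
print(z)                         # 4.0

# float() is a more explicit way to do the same thing.
z2 = float(x + y) / 3

# The same name can later be bound to a value of another type.
x = "three"
print(type(x).__name__)          # str
```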

Unix command line
python myscript.py

Within the interpreter
>>> execfile("myscript.py")
>>> sentence = "Stock prices fell. They did so last week too."
>>> sentence.split()
['Stock', 'prices', 'fell.', 'They', 'did', 'so', 'last', 'week', 'too.']


split() with no separator => split at
whitespace characters (space, tab, newline)
With a different separator
>>> sentence.split(".")
['Stock prices fell', ' They did so last week too', '']
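
Splitting at "." leaves a leading space on the second piece and an empty string at the end. One possible way to tidy that up, sketched with a list comprehension (the variable names are illustrative):

```python
sentence = "Stock prices fell. They did so last week too."

# Strip whitespace from each piece and drop the empty ones.
parts = [p.strip() for p in sentence.split(".") if p.strip()]
print(parts)

# Split each sentence into words at whitespace.
words = [p.split() for p in parts]
print(words[0])
```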

Method 1: using the ‘+’ operator
>>> word1 = "This"
>>> word2 = "is"
>>> word3 = "a"
>>> word4 = "sentence"
>>> combined1 = word1+" "+word2+" "+word3+" "+word4
>>> combined1
'This is a sentence'

Method 2: using the join() method
>>> wordlist = ["This", "is", "a", "sentence"]
>>> ' '.join(wordlist)
'This is a sentence'

Case sensitive
>>> string1 = "apple"
>>> string2 = "orange"
>>> string3 = "aPPle"
>>> string1 == string2
False
>>> string1 == string3
False

Case insensitive
>>> string1.lower() == string3.lower()
True


Search and replace
Trimming
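
A short sketch of search, replace and trimming with the built-in string methods; the example sentence and variable names are illustrative.

```python
sentence = "  Stock prices fell. They did so last week too.  \n"

# Trimming: strip() removes leading and trailing whitespace.
trimmed = sentence.strip()
# rstrip() with an argument removes those trailing characters instead.
no_period = trimmed.rstrip(".")

# Search: 'in' tests membership, find() returns the first index or -1.
has_fell = "fell" in trimmed
position = trimmed.find("prices")
missing = trimmed.find("rose")

# Replace: returns a new string; the original is left unchanged.
replaced = trimmed.replace("fell", "rose")

print(trimmed)
print(position)
print(replaced)
```

Strings cannot be modified in place, so each of these methods returns a new string.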

Ordered collection of items

Can contain items of any type
>>> digits = [0,1,2,3,4,5,6,7,8,9]
>>> strings = ["the", "dog", "ran"]

Indices start from 0

Items in a range

Negative indices work backwards
>>> strings[0]
'the'
>>> strings[2]
'ran'
>>> digits[2:4]
[2, 3]
>>> digits[-1]
9
>>> digits[-2]
8

Add/remove
>>> strings.append("fast")
>>> strings.insert(1, "brown")
>>> strings
['the', 'brown', 'dog', 'ran', 'fast']
>>> digits.remove(8)
>>> digits
[0, 1, 2, 3, 4, 5, 6, 7, 9]

Sort
>>> digits.sort(reverse=True)
>>> digits
[9, 7, 6, 5, 4, 3, 2, 1, 0]
>>> digits.sort()
>>> digits
[0, 1, 2, 3, 4, 5, 6, 7, 9]
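
sort() reorders the list itself. When the original order should be kept, the built-in sorted() is one alternative; a small sketch, with an assumed word list:

```python
words = ["the", "brown", "dog", "ran", "fast"]

# sorted() returns a new list and leaves the original untouched,
# whereas list.sort() reorders the list in place and returns None.
by_alpha = sorted(words)

# key= chooses what to compare on; here, longest words first.
by_length = sorted(words, key=len, reverse=True)

print(by_alpha)
print(by_length[0])
print(words)
```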

Similar to lists

But cannot be modified
>>> first_five = (1,2,3,4,5)
>>> first_five[2]
3
>>> first_five.append(6)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'tuple' object has no attribute 'append'

Tuple to list

List to tuple
>>> newlist = list(first_five)
>>> newlist.append(6)
>>> newlist
[1, 2, 3, 4, 5, 6]
>>> first_six = tuple(newlist)
>>> first_six.append(7)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'tuple' object has no attribute 'append'
>>> first_six
(1, 2, 3, 4, 5, 6)

<key, value> pairs
>>> numbers = {1:"one", 2:"two", 3:"three"}
>>> letters = {"vowel":['a','e','i','o','u'],"consonant":['b','c','d','f','g']}

Get value given key
>>> numbers[2]
'two'
>>> letters["consonant"]
['b', 'c', 'd', 'f', 'g']

Changing the value associated with a key
>>> letters["consonant"].append('h')
>>> letters["consonant"]
['b', 'c', 'd', 'f', 'g', 'h']
>>> numbers[2]="twosome"
>>> numbers[2]
'twosome'
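
A few more dictionary operations that tend to come up: testing for a key, looking up with a default, adding and deleting pairs. This is a sketch on the numbers example, not an exhaustive list.

```python
numbers = {1: "one", 2: "two", 3: "three"}

# Looking up a missing key with [] raises KeyError, so test first...
if 4 not in numbers:
    numbers[4] = "four"          # assigning to a new key adds a pair

# ...or use get(), which returns a default instead of raising.
five = numbers.get(5, "unknown")

# del removes a pair.
del numbers[1]

# Iterate over the keys in sorted order (a dictionary's own order
# is not guaranteed in Python 2.6).
for key in sorted(numbers):
    print(str(key) + " -> " + numbers[key])
```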
Universal newline mode ("rU")
(handles all newline
variations: \n, \r, \r\n)
>>> file1 = open("topics.txt", "rU")
>>> file1_lines = file1.readlines()
>>> file1_lines[1:3]
['Finance\n', 'Computers and the internet\n']
>>> file2 = open("topics_copy.txt", "w")
>>> file2.write("This is a copy of topics.txt\n")    # write a string
>>> file2.writelines(file1_lines)                    # write a list of strings
>>> file2.close()
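
The same write-then-read pattern can also be sketched with the with statement, which closes the file automatically. The temporary directory and the two topic lines are assumptions so that the example does not depend on topics.txt being present.

```python
import os
import shutil
import tempfile

# Hypothetical location: a scratch directory created just for this sketch.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "topics_copy.txt")

lines = ["Finance\n", "Computers and the internet\n"]

# 'with' closes the file when the block ends, even after an error.
with open(path, "w") as out_file:
    out_file.write("This is a copy of topics.txt\n")   # one string
    out_file.writelines(lines)                          # list of strings

# Iterating over a file object yields one line at a time.
read_back = []
with open(path, "r") as in_file:
    for line in in_file:
        read_back.append(line.rstrip("\n"))

print(read_back[1:3])

shutil.rmtree(tmp_dir)    # remove the scratch directory
```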
Begin blocks indicated by :
Indentation is important
Blank line when done
>>> file1 = open("topics.txt", "rU")
>>> file1_lines = file1.readlines()
>>> if len(file1_lines) < 2:
...     print "fewer than 2 lines"
... elif len(file1_lines) > 10:
...     print "more than 10 lines"
... else:
...     print "between 2 and 10 lines"
...
between 2 and 10 lines
>>>
>>> word = "dog"
>>> for letter in word:
...     print letter
...
d
o
g
>>> pets = ["dog", "cat", "fish"]
>>> for i in range(len(pets)):
...     print pets[i]
...
dog
cat
fish

range(n): 0 up to n-1
‘break’ and ‘continue’ statements are available as usual
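
A minimal sketch of break and continue inside a for loop; the list contents, including the "STOP" marker, are made up for illustration.

```python
pets = ["dog", "cat", "", "fish", "STOP", "bird"]

kept = []
for pet in pets:              # iterate over the items directly
    if pet == "":
        continue              # skip this item, go on to the next one
    if pet == "STOP":
        break                 # leave the loop entirely
    kept.append(pet)

print(kept)

# enumerate() gives (index, item) pairs when the position is needed.
for i, pet in enumerate(kept):
    print(str(i) + ": " + pet)
```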
>>> def get_length(listx):
...     list_len = len(listx)
...     return list_len
...
>>> pets = ["dogs", "cats", "fish"]
>>> print "I have "+ str(get_length(pets)) + " pets"
I have 3 pets

str(): integer to string
◦ Data (specific to each object)
◦ Functions that can be performed on this data

Apple tree
◦ Data
   ▪ Fruit
   ▪ Leaf
◦ Functions
   ▪ Pick_fruit()
   ▪ Pick_leaf()

Can abstract into a Tree class
◦ Data
   ▪ Fruit
   ▪ Leaf
◦ Functions
   ▪ Pick_fruit()
   ▪ Pick_leaf()
◦ Instances: apple tree, maple tree, palm tree…
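
The Tree abstraction could be sketched in Python roughly as follows. The attribute names, the counts, and the choice to return None when nothing is left are all assumptions made for illustration.

```python
class Tree(object):
    def __init__(self, kind, fruit, leaves):
        # Data specific to each instance
        self.kind = kind
        self.fruit = fruit        # number of fruit on the tree
        self.leaves = leaves      # number of leaves on the tree

    def pick_fruit(self):
        # Remove and return one fruit, or None when none are left
        if self.fruit > 0:
            self.fruit -= 1
            return self.kind + " fruit"
        return None

    def pick_leaf(self):
        # Remove and return one leaf, or None when none are left
        if self.leaves > 0:
            self.leaves -= 1
            return self.kind + " leaf"
        return None

# Two instances of the same class, each with its own data
apple_tree = Tree("apple", 2, 100)
maple_tree = Tree("maple", 0, 50)

print(apple_tree.pick_fruit())
print(maple_tree.pick_fruit())
print(apple_tree.fruit)
```

Picking from apple_tree changes only that instance; maple_tree keeps its own counts.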


Lists, tuples, dictionaries, files: all were
objects of their respective classes
The functions we used on them were the
member functions of those classes
◦ list1.append('a')
class roster:
    course = "cis530"

    # Called when an object is 'instantiated'
    def __init__(self, name, dept):
        self.student_name = name
        self.student_dept = dept

    # Another member function
    def print_details(self):
        print "Name: " + self.student_name
        print "Dept: " + self.student_dept
        print "Course: " + self.course

# Creating an instance
student1 = roster("annie", "cis")
# Calling methods of an object
student1.print_details()

Suite of classes for several NLP tasks

Parsing, POS tagging, classifiers…

Several text processing utilities, corpora
◦ Brown, Penn Treebank corpus…
◦ Your data was divided into sentences using ‘punkt’

Basics – skim chapters 1-4

For this homework, be familiar with
◦ Corpus utilities
   ▪ Simplified in NLTK
◦ Probability distributions: FreqDist, ConditionalFreqDist
   ▪ Read definitions of all member functions
   ▪ Look at the code to see how it is implemented
◦ Smoothing techniques

You will need to import the necessary
modules to create objects and call member
functions
◦ import: roughly, ‘include’ objects from pre-built packages


FreqDist, ConditionalFreqDist are in
nltk.probability
PlaintextCorpusReader is in nltk.corpus
import nltk
from nltk.corpus import PlaintextCorpusReader

def get_files_from_category(category):
    # "Health#Cancer" -> category "Health", subcategory "Cancer"
    subcat = category.split('#')
    if len(subcat) == 1:
        corpus_root = '/home1/c/cis530/data/' + subcat[0]
    else:
        corpus_root = '/home1/c/cis530/data/' + subcat[0] + '/' + subcat[1]
    # Read every file under corpus_root
    files = PlaintextCorpusReader(corpus_root, '.*')
    return files

finance_files = get_files_from_category("Finance")
cancer_files = get_files_from_category("Health#Cancer")
def get_num_tokens(topic):
    categ_files = get_files_from_category(topic)
    all_words = categ_files.words()
    return len(all_words)

print get_num_tokens("Health#Diet_and_Nutrition")
print get_num_tokens("Computers_and_the_Internet")
from nltk import FreqDist

def get_top_word(topic):
    categ_files = get_files_from_category(topic)
    all_words = categ_files.words()
    fdist1 = FreqDist(all_words)
    # keys() returns samples in decreasing order of frequency
    return fdist1.keys()[0]

print get_top_word("Finance")
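
The idea of counting samples and ranking them by frequency can be sketched in plain Python, without NLTK. top_words and the toy token list are hypothetical names for illustration, and this is not how FreqDist is implemented.

```python
def top_words(tokens, n=1):
    # Count occurrences of each token with a plain dictionary.
    counts = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1
    # Sort by decreasing count; ties are broken alphabetically
    # so that the result is deterministic.
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return [word for word, count in ranked[:n]]

tokens = "the dog saw the cat and the cat ran".split()
print(top_words(tokens))
print(top_words(tokens, 2))
```

Because it sorts explicitly, the sketch does not rely on the order in which a dictionary or a particular NLTK version happens to return its keys.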