An Introduction to Python

Programming for Linguists
15/12/2011
Tuples
• A sequence of values
• They are similar to lists:
  - the values can be of any type
  - they are indexed by integers
• Syntactically, a tuple is a comma-separated list of values:
t = 'a', 'b', 'c', 'd', 'e'
• Although it is not necessary, it is common to enclose tuples in parentheses:
t = ('a', 'b', 'c', 'd', 'e')
• To create a tuple with a single element, you have to include a final comma:
t1 = 'a',
type(t1)    # <type 'tuple'>
• Note: a value in parentheses is not a tuple!
t2 = ('a')
type(t2)    # <type 'str'>
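The difference between the two cases can be checked directly. A minimal sketch (the name t3 is only for illustration; print(...) with a single argument is used so the snippet runs under both Python 2 and Python 3):

```python
# A trailing comma makes a tuple; parentheses alone do not.
t1 = 'a',
t2 = ('a')
t3 = ('a',)   # same as t1, written with the optional parentheses

print(type(t1) is tuple)   # True
print(type(t2) is str)     # True: just a string in parentheses
print(t1 == t3)            # True
print(len(t1))             # 1
```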
• With no argument, the tuple() function creates a new empty tuple:
t = tuple()
• If the argument is a sequence (string, list or tuple), the result is a tuple with the elements of the sequence:
t = tuple('lupins')
print t
• Most list operators also work on tuples:
print t[0]
print t[1:3]
• BUT if you try to modify one of the elements of the tuple, you get an error message:
t[0] = 'A'    # TypeError: 'tuple' object does not support item assignment
• You can’t modify the elements of a tuple: a tuple is immutable!
• You can replace one tuple with another:
t = ('A',) + t[1:]
print t
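To see that "replacing" really builds a new tuple rather than changing the old one, the following sketch keeps a second name (original, an illustrative name) pointing at the first tuple:

```python
t = tuple('lupins')
original = t          # second name for the same tuple object

caught = False
try:
    t[0] = 'L'        # item assignment is not allowed on tuples
except TypeError:
    caught = True

# Build a new tuple and rebind the name t to it
t = ('L',) + t[1:]

print(caught)               # True
print(''.join(t))           # Lupins
print(''.join(original))    # lupins (the old tuple is unchanged)
```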
Tuple Assignment
• It is often useful to swap the values of two variables, e.g. swap "a" with "b":
temp = a
a = b
b = temp
• This is more elegant with a tuple assignment:
a, b = b, a
• The number of variables on the left and the number of values on the right have to be the same!
a, b = 1, 2, 3
ValueError: too many values to unpack
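A small sketch of both points: the swap, and one possible way to deal with a sequence that has more values than variables (the names values, first and rest are illustrative, not from the slides):

```python
a, b = 1, 2
a, b = b, a              # swap without a temporary variable
swapped = (a, b)
print(swapped)           # (2, 1)

values = (1, 2, 3)
mismatch = False
try:
    a, b = values        # three values, two variables
except ValueError:
    mismatch = True
print(mismatch)          # True

# If only the first value matters, slice before unpacking
first, rest = values[0], values[1:]
print(first)             # 1
print(rest)              # (2, 3)
```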
• For example: split an email address into a user name and a domain
address = 'joske@ua.ac.be'
username, domain = address.split('@')
print username
print domain
• The return value from split('@') is a list with two elements
• The first element is assigned to username, the second to domain.
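The unpacking only works when the split yields exactly two parts. As a sketch, a hypothetical helper split_address (not part of the slides) can wrap the idea and show what happens with a string that contains no '@':

```python
def split_address(address):
    """Hypothetical helper: split at the last '@' into (username, domain)."""
    # rsplit with maxsplit=1 returns at most two parts
    username, domain = address.rsplit('@', 1)
    return username, domain

username, domain = split_address('joske@ua.ac.be')
print(username)   # joske
print(domain)     # ua.ac.be

try:
    split_address('joske')    # no '@': only one part, so unpacking fails
    malformed = False
except ValueError:
    malformed = True
print(malformed)  # True
```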
Tuples as Return Values
• Strictly speaking, a function can only return one value
• If the value is a tuple, the effect is the same as returning multiple values
For example:
def min_max(t):
    return min(t), max(t)
• max() and min() are built-in functions that find the largest and smallest elements of a sequence
• min_max(t) computes both and returns a tuple of two values
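A short usage sketch: the returned tuple can be kept as one value or unpacked with a tuple assignment. The example sentence and the names lengths, shortest and longest are made up for illustration:

```python
def min_max(t):
    return min(t), max(t)

# Word lengths of a toy sentence
lengths = [len(w) for w in 'the quick brown fox jumps'.split()]

result = min_max(lengths)            # one value: a tuple
shortest, longest = min_max(lengths) # unpacked into two variables

print(result)     # (3, 5)
print(shortest)   # 3
print(longest)    # 5
```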
Dictionaries and Tuples
• The .items() method for dictionaries that we saw last week actually returns a list of tuples, e.g.
>>> d = {'a':0, 'b':1, 'c':2}
>>> d.items()
[('a', 0), ('c', 2), ('b', 1)]
• This way you can easily access both keys and values separately:
d = {'a':0, 'b':1, 'c':2}
for letter, number in d.items():
    print letter
    print number
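The pairs in the output above come back in no particular order. If a predictable order is needed, one option is to sort the (key, value) tuples first; a sketch (the names pairs and by_value are illustrative):

```python
d = {'a': 0, 'b': 1, 'c': 2}

# Tuples are compared element by element, so this sorts by key
pairs = sorted(d.items())
for letter, number in pairs:
    print(letter + ' ' + str(number))

# Sort by the second element of each tuple (the value), largest first
by_value = sorted(d.items(), key=lambda pair: pair[1], reverse=True)
print(by_value)   # [('c', 2), ('b', 1), ('a', 0)]
```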
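• Example: sorting a list of words by their word length
def sort_by_length(words):
    # pair each word with its length
    list1 = []
    for word in words:
        list1.append((len(word), word))
    # longest first; words of equal length end up in reverse alphabetical order
    list1.sort(reverse=True)
    # drop the lengths again
    ordered_list = []
    for length, word in list1:
        ordered_list.append(word)
    return ordered_list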
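For comparison, the same task can be sketched with sorted() and a key function. This is an alternative, not the version from the slides, and it treats ties differently: words of equal length keep their original order instead of being sorted in reverse alphabetical order. The name sort_by_length_key is made up for this sketch.

```python
def sort_by_length(words):
    # version from the slide: sort (length, word) tuples
    list1 = [(len(word), word) for word in words]
    list1.sort(reverse=True)
    return [word for length, word in list1]

def sort_by_length_key(words):
    # alternative: let sorted() call len() on each word
    return sorted(words, key=len, reverse=True)

words = ['tuple', 'is', 'immutable', 'list']
print(sort_by_length(words))       # ['immutable', 'tuple', 'list', 'is']
print(sort_by_length_key(words))   # ['immutable', 'tuple', 'list', 'is']

# The two versions differ only when lengths are tied
ties = ['ab', 'cd', 'a']
print(sort_by_length(ties))        # ['cd', 'ab', 'a']
print(sort_by_length_key(ties))    # ['ab', 'cd', 'a']
```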
NLTK and the Internet
• A lot of text on the web is in the form of HTML documents
• To access them, you first need to specify the correct location:
url = "http://nltk.googlecode.com/svn/trunk/doc/book/ch03.html"
• Then use the urlopen() function:
from urllib import urlopen
htmltext = urlopen(url).read()
• NLTK provides a function nltk.clean_html(), which takes an HTML string and returns raw text, e.g.
import nltk
rawtext = nltk.clean_html(htmltext)
• In order to use other NLTK methods, you can then tokenize the raw text:
tokens = nltk.wordpunct_tokenize(rawtext)
• NLTK’s WordPunctTokenizer takes raw text as an argument and returns a list of tokens (words + punctuation marks)
• If you want to use the functions we used on the texts from nltk.book on your own texts, use the nltk.Text() function:
my_text = nltk.Text(tokens)
my_text.collocations()
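To make the two processing steps concrete without a network connection, here is a rough standard-library sketch of what "remove the mark-up, then tokenize" amounts to. It is not NLTK's implementation: TextExtractor, strip_tags and simple_wordpunct are hypothetical helpers, the HTML string is made up, and the tag stripping is naive (it would also keep the contents of script and style elements). The regular expression is the pattern NLTK's WordPunctTokenizer is based on. Note that NLTK 3.x no longer implements nltk.clean_html() and points to BeautifulSoup instead.

```python
import re

try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects the text found between tags."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # join the pieces and normalize whitespace
    return ' '.join(' '.join(parser.chunks).split())


def simple_wordpunct(text):
    # runs of word characters, or runs of punctuation
    return re.findall(r"\w+|[^\w\s]+", text)


htmltext = "<html><body><h1>Tuples</h1><p>A tuple is immutable!</p></body></html>"
rawtext = strip_tags(htmltext)
tokens = simple_wordpunct(rawtext)
print(rawtext)   # Tuples A tuple is immutable!
print(tokens)    # ['Tuples', 'A', 'tuple', 'is', 'immutable', '!']
```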
• Note: if you are used to working with characters in a particular local encoding (ë, è, …), you need to include the line '# -*- coding: <coding> -*-' as the first or second line of your script, e.g.
# -*- coding: utf-8 -*-
Writing Results to a File
• It is often useful to write output to files
• First you have to open/create a file for your output, either in write mode ('w', which overwrites an existing file) or in append mode ('a', which adds to the end of it):
output_file = open('(path)/output.txt', 'w')
output_file = open('(path)/output.txt', 'a')
• Now you have to write your output to the file you just opened:
numbers = [1, 2, 3]
output_file.write(str(numbers) + "\n")
• When you write non-text data to a file you must convert it to a string first
• Do not forget to close the file when you are done:
output_file.close()
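A common alternative to calling close() yourself is the with statement, which closes the file automatically when the block ends. The sketch below writes to a temporary directory (an assumption made so the example does not touch any real path) and shows the difference between write and append mode:

```python
import os
import tempfile

numbers = [1, 2, 3]
path = os.path.join(tempfile.mkdtemp(), 'output.txt')

# 'w' creates the file (or empties an existing one)
with open(path, 'w') as output_file:
    output_file.write(str(numbers) + "\n")

# 'a' keeps what is there and adds to the end
with open(path, 'a') as output_file:
    output_file.write("done\n")

# read the file back to check the result
with open(path) as input_file:
    lines = input_file.read().splitlines()

print(lines)   # ['[1, 2, 3]', 'done']
```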
NLTK and Automatic Text Classification
• Classification is the computational task of choosing the correct class label for a given input text, e.g.
  - deciding whether an email is spam or not
  - deciding what the topic of a news article is (e.g. sports, politics, finance, …)
  - authorship attribution
Framework (1)
• Gather a training corpus:
  - in which a categorization is possible using metadata, e.g.
    - information about the author(s): name, age, gender, location
    - information about the texts’ genre: sports, humor, romance, scientific
Framework (2)
• Or gather a training corpus:
  - for which you need to add the metadata yourself, e.g.
    - annotation of content-specific information: add sentiment labels to utterances
    - annotation of linguistic features: add POS tags to text
• Result: a dataset with predefined categories
Framework (3)
• Pre-processing of the dataset, e.g. tokenization, removing stop words
• Feature selection: which features of the text could be informative for your classification task, e.g.
  - lexical features: words, word bigrams, ...
  - character features: n-grams
  - syntactic features: POS tags
  - semantic features: role labels
  - others: readability scores, type-token ratio (TTR), word length (wl), sentence length (sl), …
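A few of these feature types are easy to compute by hand. The following is a minimal sketch on a toy sentence; the function names word_bigrams, char_ngrams and type_token_ratio are made up for illustration and are not NLTK functions:

```python
def word_bigrams(tokens):
    # pair each token with its right neighbour
    return list(zip(tokens, tokens[1:]))

def char_ngrams(text, n):
    # all substrings of length n
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def type_token_ratio(tokens):
    # number of distinct words divided by the total number of words
    return float(len(set(tokens))) / len(tokens)

tokens = 'the cat saw the dog'.split()

print(word_bigrams(tokens))
# [('the', 'cat'), ('cat', 'saw'), ('saw', 'the'), ('the', 'dog')]
print(char_ngrams('tuple', 3))    # ['tup', 'upl', 'ple']
print(type_token_ratio(tokens))   # 0.8
```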
Framework (4)
• Divide your dataset into a training set and a test set (usually 90% vs 10%)
• Feature selection metrics:
  - based on frequencies: most frequent features
  - based on frequency distributions per category: most informative features
  - in NLTK: Chi-square, Student's t test, Pointwise Mutual Information, Likelihood Ratio, Poisson-Stirling, Jaccard index, Information Gain
  - use them only on training data! (otherwise you risk overfitting)
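The split and the simplest selection metric (raw frequency) can be sketched in plain Python. The toy documents below are invented, and the point to notice is that the counts are taken from train_docs only, never from test_docs:

```python
import random
from collections import Counter

# Ten invented (wordlist, category) pairs
documents = [(['great', 'film'], 'pos'), (['boring', 'film'], 'neg')] * 5

random.seed(0)              # fixed seed so the shuffle is repeatable
random.shuffle(documents)

# 90% training, 10% test
cutoff = len(documents) * 9 // 10
train_docs = documents[:cutoff]
test_docs = documents[cutoff:]

# Frequency-based selection, computed on the training data only
counts = Counter(w.lower() for wordlist, cat in train_docs for w in wordlist)
top_features = [word for word, n in counts.most_common(1)]

print(len(train_docs))   # 9
print(len(test_docs))    # 1
print(top_features)      # ['film']
```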
Framework (5)
• For document classification: each document in the dataset is represented by a separate instance containing the features extracted from the training data
• The format of your instances depends on the classifier you want to use
• Select your classifier. In NLTK: Naive Bayes, Decision Tree, Maximum Entropy, link to Weka
Framework (6)
• Train the classifier using the training instances you created in the previous step
• Test your trained model on previously unseen data: the test set
• Evaluate your classifier’s performance: accuracy, precision, recall and F-scores, confusion matrix
• Perform error analysis
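The evaluation measures named above can be computed from two parallel lists of labels. A minimal sketch, assuming a binary task and invented gold and predicted labels; evaluate is a hypothetical helper, not an NLTK function:

```python
from collections import Counter

def evaluate(gold, predicted, positive):
    """Return (accuracy, precision, recall, F1) for the class `positive`."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    correct = sum(1 for g, p in pairs if g == p)

    accuracy = float(correct) / len(pairs)
    precision = float(tp) / (tp + fp) if tp + fp else 0.0
    recall = float(tp) / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return accuracy, precision, recall, f1

gold      = ['pos', 'pos', 'pos', 'neg', 'neg', 'neg']
predicted = ['pos', 'pos', 'neg', 'pos', 'neg', 'neg']

accuracy, precision, recall, f1 = evaluate(gold, predicted, 'pos')
print(round(accuracy, 3))    # 0.667
print(round(precision, 3))   # 0.667
print(round(recall, 3))      # 0.667

# Confusion matrix as counts of (gold label, predicted label) pairs
confusion = Counter(zip(gold, predicted))
print(confusion[('pos', 'neg')])   # 1
```

The off-diagonal cells of the confusion matrix (here one 'pos' review predicted as 'neg' and one 'neg' predicted as 'pos') are a natural starting point for error analysis.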
A Case Study
Classification task: classifying movie reviews into positive and negative reviews
1. Import the corpus
from nltk.corpus import movie_reviews
2. Create a list of categorized documents
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
print documents[:2]
3. Shuffle your list of documents randomly
from random import shuffle
shuffle(documents)
4. Divide your data into a training and a test set
train_docs = documents[:1800]
test_docs = documents[1800:]
5. We only consider word unigram features here, so make a dictionary of all (normalized) words from the training data
train_words = {}
for (wordlist, cat) in train_docs:
    for w in wordlist:
        w = w.lower()
        if w not in train_words:
            train_words[w] = 1
        else:
            train_words[w] += 1
print len(train_words)
6. Define a feature extraction function
def extract_features(wordlist):
    # lowercase first, so the words are compared in the same form as in train_words
    document_words = set(w.lower() for w in wordlist)
    features = {}
    for word in document_words:
        # keep only words that were seen in the training data
        if word in train_words:
            features[word] = True
    return features
print extract_features(movie_reviews.words('pos/cv957_8737.txt'))
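To see what such an instance looks like without loading the corpus, here is a sketch on toy data. The train_words dictionary and the word list are invented, and the function is a compact variant that lowercases the words before looking them up:

```python
# Invented training vocabulary (word -> frequency)
train_words = {'good': 3, 'film': 5, 'boring': 2}

def extract_features(wordlist):
    # normalize, then keep only words known from the training data
    document_words = set(w.lower() for w in wordlist)
    return dict((word, True) for word in document_words
                if word in train_words)

features = extract_features(['A', 'Good', 'film', 'good'])
print(sorted(features.items()))   # [('film', True), ('good', True)]
```

Words that never occurred in the training data ('a' in this toy case) do not appear in the instance at all.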
7. Use your feature extraction function to extract all features from your training and test set
train_feats = [(extract_features(wordlist), cat) for (wordlist, cat) in train_docs]
test_feats = [(extract_features(wordlist), cat) for (wordlist, cat) in test_docs]
8. Train e.g. NLTK’s Naïve Bayes classifier on the training set
from nltk.classify import NaiveBayesClassifier
classifier = NaiveBayesClassifier.train(train_feats)
# batch_classify was renamed classify_many in NLTK 3
predicted_labels = classifier.batch_classify([fs for (fs, cat) in test_feats])
9. Evaluate the model on the test set
import nltk
print nltk.classify.accuracy(classifier, test_feats)
classifier.show_most_informative_features(20)
For Next Week
• Feedback on the past exercises
• Some extra exercises
• If you have additional questions or problems, please e-mail me by Wednesday
• The evaluation assignment will be announced
• Ex 1) Choose a website. Read it into Python using the urlopen function, remove all HTML mark-up and tokenize it. Make a frequency dictionary of all words ending in 'ing' and sort it by its values (in decreasing order).
• Ex 2) Write the raw text from the previous exercise to an output file.
• Ex 3) Write a script that performs the same classification task as we saw today, using word bigrams as features instead of single words.
Thank you