An Introduction to Python

Programming for Linguists
15/12/2011
Tuples
• A sequence of values
• They are similar to lists:
  • the values can be of any type
  • they are indexed by integers
• Syntactically, a tuple is a comma-separated list of values:
t = 'a', 'b', 'c', 'd', 'e'
• Although it is not necessary, it is common to enclose tuples in parentheses:
t = ('a', 'b', 'c', 'd', 'e')
• To create a tuple with a single element, you have to include a final comma:
t1 = 'a',
type(t1)    # <type 'tuple'>
• Note: a value in parentheses without a comma is not a tuple!
t2 = ('a')
type(t2)    # <type 'str'>
• With no argument, the tuple() function creates a new empty tuple:
t = tuple()
• If the argument is a sequence (string, list or tuple), the result is a tuple with the elements of the sequence:
t = tuple('lupins')
print t     # ('l', 'u', 'p', 'i', 'n', 's')
• Most list operators also work on tuples:
print t[0]
print t[1:3]
• But if you try to modify one of the elements of the tuple, you get an error message:
t[0] = 'A'
TypeError: 'tuple' object does not support item assignment
• You cannot modify the elements of a tuple: a tuple is immutable!
• You can, however, replace one tuple with another:
t = ('A',) + t[1:]
print t
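The points above about creating tuples and about immutability can be tried out in a few lines. This is only a minimal sketch: the variable names are illustrative and do not come from the slides, and it uses print as a function (Python 3 syntax) rather than the Python 2 print statement used elsewhere in this lecture.

```python
# Illustrative values only; none of these names come from the slides.
letters = tuple('lupins')          # ('l', 'u', 'p', 'i', 'n', 's')
single = ('a',)                    # the trailing comma makes this a tuple
not_a_tuple = ('a')                # just the string 'a' in parentheses

try:
    letters[0] = 'L'               # tuples do not support item assignment
except TypeError as error:
    message = str(error)

# Build a new tuple instead of modifying the old one.
capitalized = ('L',) + letters[1:]
print(capitalized)                 # ('L', 'u', 'p', 'i', 'n', 's')
print(type(single).__name__)       # tuple
print(type(not_a_tuple).__name__)  # str
```

Catching the TypeError is only there so that the sketch runs to the end; in an interactive session you would simply see the error message.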
Tuple Assignment
• It is often useful to swap the values of two variables, e.g. swap a with b:
temp = a
a = b
b = temp
• Tuple assignment is more elegant:
a, b = b, a
• The number of variables on the left and the number of values on the right have to be the same!
a, b = 1, 2, 3
ValueError: too many values to unpack
• For example: split an email address into a user name and a domain
address = '[email protected]'
username, domain = address.split('@')
print username
print domain
• The return value from split('@') is a list with two elements
• The first element is assigned to username, the second to domain.
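Tuple assignment, as described above, can be sketched as follows. The address is a hypothetical placeholder chosen for illustration, and print is written as a function (Python 3 syntax), unlike the Python 2 statements on the slides.

```python
a, b = 1, 2
a, b = b, a                        # swap without a temporary variable
print((a, b))                      # (2, 1)

# Hypothetical address, used only for illustration.
address = 'user@example.org'
username, domain = address.split('@')
print(username)                    # user
print(domain)                      # example.org

# Unpacking fails when the number of names and values differ.
try:
    x, y = 1, 2, 3
except ValueError as error:
    print('ValueError: ' + str(error))
```

The right-hand side is evaluated completely before any assignment happens, which is why the swap works without a temporary variable.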
Tuples as Return Values
• Strictly speaking, a function can only return one value
• If the value is a tuple, the effect is the same as returning multiple values
• For example:
def min_max(t):
    return min(t), max(t)
• max() and min() are built-in functions that find the largest and smallest elements of a sequence
• min_max(t) computes both and returns a tuple of two values
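A short usage sketch may make the "one value that is a tuple" idea more concrete. The list of lengths is made up for illustration, and print is used as a function (Python 3 syntax).

```python
def min_max(values):
    """Return the smallest and largest element as a 2-tuple."""
    return min(values), max(values)


lengths = [5, 2, 9, 4]                 # made-up word lengths
result = min_max(lengths)              # a single tuple object
shortest, longest = min_max(lengths)   # the same tuple, unpacked
print(result)                          # (2, 9)
print(shortest)                        # 2
print(longest)                         # 9
```

Whether to keep the result as one tuple or to unpack it right away is a matter of taste; unpacking tends to read better when the two values are used separately.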
Dictionaries and Tuples
• The .items() method of dictionaries, which we saw last week, actually returns a list of tuples, e.g.
>>> d = {'a':0, 'b':1, 'c':2}
>>> d.items()
[('a', 0), ('c', 2), ('b', 1)]
• This way you can easily access keys and values separately:
d = {'a':0, 'b':1, 'c':2}
for letter, number in d.items():
    print letter
    print number
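• Example: sorting a list of words by their length (longest first)
def sort_by_length(words):
    # pair each word with its length so that the tuples sort by length
    list1 = []
    for word in words:
        list1.append((len(word), word))
    list1.sort(reverse=True)
    # drop the lengths again and keep only the words
    ordered_list = []
    for length, word in list1:
        ordered_list.append(word)
    return ordered_list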
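The same ordering can probably be obtained more compactly with the built-in sorted() and a key function. This is a sketch of an alternative, not the version from the slides: the key returns a (length, word) tuple so that ties are broken exactly as in the tuple-list approach, and the sample words are invented. print is used as a function (Python 3 syntax).

```python
def sort_by_length(words):
    """Return the words from longest to shortest.

    Ties are broken in reverse alphabetical order, which matches
    sorting (length, word) tuples with reverse=True.
    """
    return sorted(words, key=lambda word: (len(word), word), reverse=True)


words = ['a', 'tuple', 'is', 'immutable']    # made-up sample
print(sort_by_length(words))
# ['immutable', 'tuple', 'is', 'a']

# Unpacking (key, value) tuples while looping over a dictionary.
d = {'a': 0, 'b': 1, 'c': 2}
for letter, number in sorted(d.items()):
    print((letter, number))
```

Sorting the items before looping is optional; it is done here only to make the order of the output predictable.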
NLTK and the Internet
• A lot of text on the web is in the form of HTML documents
• To access them, you first need to specify the correct location:
url = "http://nltk.googlecode.com/svn/trunk/doc/book/ch03.html"
• Then use the urlopen() function:
from urllib import urlopen
htmltext = urlopen(url).read()
• NLTK provides a function nltk.clean_html(), which takes an HTML string and returns raw text, e.g.
import nltk
rawtext = nltk.clean_html(htmltext)
• In order to use other NLTK methods, you can then tokenize the raw text:
tokens = nltk.wordpunct_tokenize(rawtext)
• NLTK's WordPunctTokenizer takes raw text as its argument and returns a list of tokens (words + punctuation marks)
• If you want to use the functions we used on the texts from nltk.book on your own texts, use the nltk.Text() function:
my_text = nltk.Text(tokens)
my_text.collocations()
• Note: if your script contains characters outside ASCII (ë, è, …), you need to declare the encoding of the file by including the comment '# -*- coding: <coding> -*-' as the first or second line of your script, e.g.
# -*- coding: utf-8 -*-
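To experiment with the "strip the mark-up, then tokenize" pipeline without network access or NLTK, the steps can be approximated with the standard library alone. This is a rough stand-in written for Python 3, not NLTK's implementation: TextExtractor, strip_html and tokenize are hypothetical helper names, the HTML string is an inline sample, and the regular expression only approximates what a word-and-punctuation tokenizer does.

```python
import re
from html.parser import HTMLParser   # Python 3 standard library


class TextExtractor(HTMLParser):
    """Collect the text between tags, skipping <script> and <style>."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def strip_html(htmltext):
    """Hypothetical helper: return the visible text of an HTML string."""
    extractor = TextExtractor()
    extractor.feed(htmltext)
    extractor.close()
    # Join the pieces and collapse runs of whitespace.
    return ' '.join(' '.join(extractor.parts).split())


def tokenize(rawtext):
    """Split into runs of word characters and runs of punctuation."""
    return re.findall(r'\w+|[^\w\s]+', rawtext)


# Inline sample page, so the sketch runs without network access.
htmltext = '<html><body><h1>Tuples</h1><p>A tuple is immutable!</p></body></html>'
rawtext = strip_html(htmltext)
tokens = tokenize(rawtext)
print(rawtext)    # Tuples A tuple is immutable!
print(tokens)     # ['Tuples', 'A', 'tuple', 'is', 'immutable', '!']
```

The resulting token list has the same shape as the one passed to nltk.Text() above, so it could be used in the same way once NLTK is available.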
Writing Results to a File
• It is often useful to write output to files
• First you have to open (or create) a file for your output: use mode 'w' to create a new file or overwrite an existing one, or mode 'a' to append to an existing file
output_file = open('(path)/output.txt', 'w')
output_file = open('(path)/output.txt', 'a')
• Now you have to write your output to the file you just opened:
numbers = [1, 2, 3]
output_file.write(str(numbers) + "\n")
• When you write non-text data to a file, you must convert it to a string first
• Do not forget to close the file when you are done:
output_file.close()
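The difference between the 'w' and 'a' modes can be seen in a small sketch. It assumes a temporary directory in place of the "(path)" placeholder and uses the with statement, which closes the file automatically; both choices are illustrative and not taken from the slides. print is used as a function (Python 3 syntax).

```python
import os
import tempfile

numbers = [1, 2, 3]

# A temporary directory stands in for the "(path)" placeholder.
output_dir = tempfile.mkdtemp()
output_path = os.path.join(output_dir, 'output.txt')

# 'w' creates the file or overwrites an existing one.
with open(output_path, 'w') as output_file:
    output_file.write(str(numbers) + '\n')

# 'a' appends to the end of the existing file.
with open(output_path, 'a') as output_file:
    output_file.write('second line\n')

# The with statement has closed the file; read it back to check.
with open(output_path) as input_file:
    contents = input_file.read()
print(contents)
```

With the with statement there is no separate close() call to forget, which is why it is often preferred over an explicit open/close pair.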
NLTK and Automatic Text Classification
• Classification is the computational task of choosing the correct class label for a given input text, e.g.
  • deciding whether an email is spam or not
  • deciding what the topic of a news article is (e.g. sports, politics, finance, …)
  • authorship attribution
Framework (1)
• Gather a training corpus:
  • in which a categorization is possible using metadata, e.g.
    • information about the author(s): name, age, gender, location
    • information about the texts' genre: sports, humor, romance, scientific
Framework (2)
• Gather a training corpus:
  • for which you need to add the metadata yourself, e.g.
    • annotation of content-specific information: add sentiment labels to utterances
    • annotation of linguistic features: add POS tags to text
• Result: a dataset with predefined categories
Framework (3)
• Pre-processing of the dataset, e.g. tokenization, removing stop words
• Feature selection: which features of the text could be informative for your classification task, e.g.
  • lexical features: words, word bigrams, ...
  • character features: n-grams
  • syntactic features: POS tags
  • semantic features: role labels
  • others: readability scores, type-token ratio (TTR), word length (wl), sentence length (sl), …
Framework (4)
• Divide your dataset into a training set and a test set (usually 90% vs. 10%)
• Feature selection metrics:
  • based on frequencies: most frequent features
  • based on frequency distributions per category: most informative features
  • in NLTK: Chi-square, Student's t-test, Pointwise Mutual Information, Likelihood Ratio, Poisson-Stirling, Jaccard index, Information Gain
  • apply them to the training data only: selecting features with the help of the test data leads to overfitting
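The split and the simplest, frequency-based selection metric can be sketched with the standard library. The documents below are invented toy data, the cut-off of three features is arbitrary, and the fixed random seed is only there to make the example repeatable. The important point is that the counts are taken from the training portion only. print is used as a function (Python 3 syntax).

```python
import random
from collections import Counter

# Toy labelled documents: (list of tokens, category). Invented data.
documents = [
    (['great', 'film'], 'pos'), (['boring', 'plot'], 'neg'),
    (['great', 'acting'], 'pos'), (['bad', 'film'], 'neg'),
    (['great', 'fun'], 'pos'), (['bad', 'acting'], 'neg'),
    (['fun', 'film'], 'pos'), (['boring', 'film'], 'neg'),
    (['great', 'plot'], 'pos'), (['bad', 'plot'], 'neg'),
]

random.seed(0)                      # fixed seed so the split is repeatable
random.shuffle(documents)

split_point = int(len(documents) * 0.9)
train_docs = documents[:split_point]    # 90% for training
test_docs = documents[split_point:]     # 10% held out for testing

# Count words in the training documents only.
train_counts = Counter(
    word.lower() for wordlist, category in train_docs for word in wordlist
)
top_features = [word for word, count in train_counts.most_common(3)]
print(len(train_docs))
print(len(test_docs))
print(top_features)
```

A per-category metric such as chi-square would replace the plain counts here, but it would still be computed on train_docs alone.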
Framework (5)
• For document classification: each document in the dataset is represented by a separate instance containing the features extracted from the training data
• The format of your instances depends on the classifier you want to use
• Select your classifier. In NLTK: Naive Bayes, Decision Tree, Maximum Entropy, or a link to Weka
Framework (6)
• Train the classifier using the training instances you created in the previous step
• Test your trained model on previously unseen data: the test set
• Evaluate your classifier's performance: accuracy, precision, recall and F-scores, confusion matrix
• Perform error analysis
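The evaluation measures named above can be computed by hand from a list of gold labels and a list of predicted labels. The following is a minimal sketch under stated assumptions: evaluate is a hypothetical helper, the labels are invented, the F-score shown is the balanced F1, and the lists are assumed to be non-empty and of equal length. print is used as a function (Python 3 syntax).

```python
def evaluate(gold, predicted, positive):
    """Return accuracy, precision, recall and F1 for one class label."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    correct = sum(1 for g, p in pairs if g == p)

    accuracy = correct / float(len(pairs))
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return accuracy, precision, recall, f1


# Invented labels for six test documents.
gold = ['pos', 'pos', 'pos', 'neg', 'neg', 'neg']
predicted = ['pos', 'pos', 'neg', 'neg', 'neg', 'pos']
accuracy, precision, recall, f1 = evaluate(gold, predicted, 'pos')
print(accuracy)
print(precision)
print(recall)
print(f1)
```

In this toy case four of the six predictions are correct, and for the 'pos' class there are two true positives, one false positive and one false negative, so all four measures come out at two thirds.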
A Case Study
Classification task: classifying movie reviews into positive and negative reviews
1. Import the corpus
from nltk.corpus import movie_reviews
2. Create a list of categorized documents
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
print documents[:2]
3. Shuffle your list of documents randomly
from random import shuffle
shuffle(documents)
4. Divide your data into a training set and a test set
train_docs = documents[:1800]
test_docs = documents[1800:]
5. We only consider word unigram features here, so make a dictionary of all (normalized) words from the training data
train_words = {}
for (wordlist, cat) in train_docs:
    for w in wordlist:
        w = w.lower()
        if w not in train_words:
            train_words[w] = 1
        else:
            train_words[w] += 1
print len(train_words)
6. Define a feature extraction function
def extract_features(wordlist):
    # normalize first, so that all lookups below use lower-case forms
    document_words = set(w.lower() for w in wordlist)
    features = {}
    for word in document_words:
        # keep only words that were seen in the training data
        if word in train_words:
            features[word] = True
    return features
print extract_features(movie_reviews.words('pos/cv957_8737.txt'))
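To see what such a feature dictionary looks like without loading the corpus, the function can be tried on a toy vocabulary. This is a sketch under stated assumptions: train_words holds invented counts, and this variant lowercases the words before building the set, so that the membership test always compares lower-case forms. print is used as a function (Python 3 syntax).

```python
# Toy stand-in for the train_words frequency dictionary (invented counts).
train_words = {'great': 12, 'film': 30, 'boring': 7}


def extract_features(wordlist):
    """Map each known word that occurs in the document to True."""
    document_words = set(word.lower() for word in wordlist)
    features = {}
    for word in document_words:
        if word in train_words:
            features[word] = True
    return features


features = extract_features(['Great', 'film', ',', 'great', 'cast'])
print(sorted(features.items()))
# [('film', True), ('great', True)]
```

Words that do not occur in train_words, such as 'cast' and the comma here, simply leave no trace in the feature dictionary.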
7. Use your feature extraction function to extract all features from your training and test sets
train_feats = [(extract_features(wordlist), cat) for (wordlist, cat) in train_docs]
test_feats = [(extract_features(wordlist), cat) for (wordlist, cat) in test_docs]
8. Train e.g. NLTK's Naïve Bayes classifier on the training set
from nltk.classify import NaiveBayesClassifier
classifier = NaiveBayesClassifier.train(train_feats)
# batch_classify was renamed classify_many in NLTK 3
predicted_labels = classifier.batch_classify([fs for (fs, cat) in test_feats])
9. Evaluate the model on the test set
import nltk
print nltk.classify.accuracy(classifier, test_feats)
classifier.show_most_informative_features(20)
For Next Week
• Feedback on the past exercises
• Some extra exercises
• If you have additional questions or problems, please e-mail me by Wednesday
• The evaluation assignment will be announced
• Ex 1)
Choose a website. Read it into Python using the urlopen function, remove all HTML mark-up and tokenize it. Make a frequency dictionary of all words ending in 'ing' and sort it by its values (in decreasing order).
• Ex 2)
Write the raw text from the previous exercise to an output file.
• Ex 3)
Write a script that performs the same classification task as we saw today, using word bigrams as features instead of single words.
Thank you