Slide 1

advertisement
Old Business
def fact(x):
if(x <= 1): return 1
else: return x* fact(x-1)
def binomial_coef(n,k): return fact(n) / (fact(k) * fact(n-k))
for n in range(9): print [binomial_coef(n,k) for k in range(n+1)]
[1]
[1, 1]
[1, 2, 1]
[1, 3, 3, 1]
[1, 4, 6, 4, 1]
[1, 5, 10, 10, 5, 1]
[1, 6, 15, 20, 15, 6, 1]
[1, 7, 21, 35, 35, 21, 7, 1]
[1, 8, 28, 56, 70, 56, 28, 8, 1]
Homework for next time
• What are the 10 most common words in
– Moby Dick?
• Make a concordance of the 3rd most common
word
• Do the same for
• https://jshare.johnshopkins.edu/kchurch4/public_html/teaching/103/Spring2011/
– What are the 10 most common words on this page?
– Make a concordance of the 3rd most common word
• No need to buy the book
– Free online at http://www.nltk.org/book
• Read Chapter 1
– http://nltk.googlecode.com/svn/trunk/doc/book/ch01.html
• Install NLTK (see next slide)
– Warning: It might not be easy (and it might not be your fault)
– Let us know how it goes
• (both positive and negative responses are more appreciated)
Installing NLTK
http://nltk.googlecode.com/svn/trunk/doc/book/ch01.html
• Chapter 01: pp. 1 - 4
– Python
– NLTK
– Data
Concordances
Python Objects
Lists
>>> sent1
['Call', 'me', 'Ishmael', '.']
>>> type(sent1)
<type 'list'>
>>> sent1[0]
First
'Call'
>>> sent1[1:len(sent1)]
['me', 'Ishmael', '.']
Rest
Strings
>>> sent1[0]
'Call'
>>> type(sent1[0])
<type 'str'>
>>> sent1[0][0]
'C'
>>> sent1[0][1:len(sent1[0])]
'all'
Types & Tokens
Polymorphism
Tokens
Types
Tokens
Types
FreqDist
URLs (Chapter 3)
HTML
Works with almost any URL!
>>>url="https://jshare.johnshopkins.edu/kchurch4/public_ht
ml/teaching/103/Lecture07/WebProgramming/javascript_
example_with_sounds.html"
>>> def url2text(url):
html = urlopen(url).read()
raw = nltk.clean_html(html)
tokens = nltk.word_tokenize(raw)
return nltk.Text(tokens)
>>> text=url2text(url)
>>> text.concordance('Nonsense')
Hints for Homework
import nltk
from urllib import urlopen
def url2text(url):
html = urlopen(url).read()
raw = nltk.clean_html(html)
tokens = nltk.word_tokenize(raw)
return nltk.Text(tokens)
url =
“https://jshare.johnshopkins.edu/kchurch4/public_html/te
aching/103/Spring2011/”
t = url2text(url)
t.concordance("web”)
Download