CIS 530 Fall 2010
Annie Louis
20 Sept. 2010

Basics to help you get started
◦ Data structures, functions, loops, running a script…
◦ NLTK corpus reader, probability distributions
You will find more resources on the course webpage
Some examples in this tutorial are based on
◦ Mark Lutz & David Ascher, 'Learning Python'
◦ Brad Dayley, 'Python Phrasebook'

Portable
Can contain wrappers to other code, e.g. C
Various built-in types: lists, dictionaries…
Easy to learn quickly: the language is very concise
The NLTK toolkit comes with a lot of NLP utilities

Read http://www.cis.upenn.edu/~cis530/hw_2010/generalinfo.pdf
Python and NLTK are both already installed on eniac
> python
Python 2.6.2 (r262:71600, Jun 17 2010, 13:37:45)
[GCC 4.4.1 [gcc-4_4-branch revision 150839]] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import nltk
>>> nltk.corpus.brown.words()[0]
'The'

> python
>>> 2+3
5
>>> word = "python"
>>> word[2]
't'
>>> print word
python
>>> x = 3
>>> y = 9
>>> z = (x + y + 0.0)/3
>>> z
4.0
No type declarations are needed

Unix command line
> python myscript.py
Within the interpreter
>>> execfile("myscript.py")

>>> sentence = "Stock prices fell. They did so last week too."
>>> sentence.split()
['Stock', 'prices', 'fell.', 'They', 'did', 'so', 'last', 'week', 'too.']
split() with no separator => split at whitespace characters (space, tab, newline)
With a different separator
>>> sentence.split(".")
['Stock prices fell', ' They did so last week too', '']

Method 1: using the '+' operator
>>> word1 = "This"
>>> word2 = "is"
>>> word3 = "a"
>>> word4 = "sentence"
>>> combined1 = word1+" "+word2+" "+word3+" "+word4
>>> combined1
'This is a sentence'
Method 2: using the 'join' method
>>> wordlist = ["This", "is", "a", "sentence"]
>>> ' '.join(wordlist)
'This is a sentence'

Case sensitive
>>> string1 = "apple"
>>> string2 = "orange"
>>> string3 = "aPPle"
>>> string1 == string2
False
>>> string1 == string3
False
Case insensitive
>>> string1.lower() == string3.lower()
True

Search and replace
Trimming

Ordered collection of items
Can contain items of any type
>>> digits = [0,1,2,3,4,5,6,7,8,9]
>>> strings = ["the", "dog", "ran"]
Indices start from 0
>>> strings[0]
'the'
>>> strings[2]
'ran'
Items in a range (the end index is excluded)
>>> digits[2:4]
[2, 3]
Negative indices work backwards
>>> digits[-1]
9
>>> digits[-2]
8

Add/remove
>>> strings.append("fast")
>>> strings.insert(1, "brown")
>>> strings
['the', 'brown', 'dog', 'ran', 'fast']
>>> digits.remove(8)
>>> digits
[0, 1, 2, 3, 4, 5, 6, 7, 9]
Sort
>>> digits.sort(reverse=True)
>>> digits
[9, 7, 6, 5, 4, 3, 2, 1, 0]
>>> digits.sort()
>>> digits
[0, 1, 2, 3, 4, 5, 6, 7, 9]

Similar to lists
But cannot be modified
>>> first_five = (1,2,3,4,5)
>>> first_five[2]
3
>>> first_five.append(6)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'tuple' object has no attribute 'append'

Tuple to list
>>> newlist = list(first_five)
>>> newlist.append(6)
>>> newlist
[1, 2, 3, 4, 5, 6]
List to tuple
>>> first_six = tuple(newlist)
>>> first_six.append(7)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'tuple' object has no attribute 'append'
>>> first_six
(1, 2, 3, 4, 5, 6)

<key, value> pairs
>>> numbers = {1:"one", 2:"two", 3:"three"}
>>> letters = {"vowel":['a','e','i','o','u'], "consonant":['b','c','d','f','g']}
Get value given key
>>> numbers[2]
'two'
>>> letters["consonant"]
['b', 'c', 'd', 'f', 'g']
Changing the value associated with a key
>>> letters["consonant"].append('h')
>>> letters["consonant"]
['b', 'c', 'd', 'f', 'g', 'h']
>>> numbers[2]="twosome"
>>> numbers[2]
'twosome'

Universal newline mode (handles all newline variations: \n, \r, \r\n)
>>> file1 = open("topics.txt", "rU")
>>> file1_lines = file1.readlines()
>>> file1_lines[1:3]
['Finance\n', 'Computers and the internet\n']
Write string
>>> file2 = open("topics_copy.txt", "w")
>>> file2.write("This is a copy of topics.txt\n")
Write list of strings
>>> file2.writelines(file1_lines)
>>> file2.close()

Indentation is important
Blocks begin after a line ending in ':'
Enter a blank line when the block is done
>>> file1 = open("topics.txt", "rU")
>>> file1_lines = file1.readlines()
>>> if len(file1_lines) < 2:
...     print "fewer than 2 lines"
... elif len(file1_lines) > 10:
...     print "more than 10 lines"
... else:
...     print "between 2 and 10 lines"
...
between 2 and 10 lines

>>> word = "dog"
>>> for letter in word:
...     print letter
...
d
o
g
>>> pets = ["dog", "cat", "fish"]
>>> for i in range(len(pets)):
...     print pets[i]
...
dog
cat
fish
range(n): 0 up to, but not including, n
'break' and 'continue' statements are available as usual

>>> def get_length(listx):
...     list_len = len(listx)
...     return list_len
...
>>> pets = ["dogs", "cats", "fish"]
>>> print "I have "+ str(get_length(pets)) + " pets"
I have 3 pets
str() converts the integer to a string

Data (specific to each object)
Functions that can be performed on this data
Apple tree
◦ Data: Fruit, Leaf
◦ Functions: Pick_fruit(), Pick_leaf()
Can abstract into a Tree class
◦ Data: Fruit, Leaf
◦ Functions: Pick_fruit(), Pick_leaf()
Instances: apple tree, maple tree, palm tree…
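The Tree abstraction can be sketched as a small class. This is only an illustration, not code from the course: the names Tree, species, pick_fruit and pick_leaf, and the fruit and leaf counts, are hypothetical. It is written so that it should run under both Python 2 and Python 3.

```python
# Hypothetical sketch of the Tree abstraction; all names and counts are illustrative.
class Tree(object):
    def __init__(self, species, fruit, leaves):
        # Data: specific to each instance
        self.species = species
        self.fruit = fruit      # number of fruit currently on the tree
        self.leaves = leaves    # number of leaves currently on the tree

    # Functions: operate on the instance's own data
    def pick_fruit(self):
        """Remove one fruit if any remain; return True on success."""
        if self.fruit > 0:
            self.fruit -= 1
            return True
        return False

    def pick_leaf(self):
        """Remove one leaf if any remain; return True on success."""
        if self.leaves > 0:
            self.leaves -= 1
            return True
        return False


# Instances share the same functions but hold separate data
apple_tree = Tree("apple", fruit=2, leaves=100)
maple_tree = Tree("maple", fruit=0, leaves=250)

apple_tree.pick_fruit()
maple_tree.pick_leaf()

print(apple_tree.fruit)    # 1
print(maple_tree.leaves)   # 249
```

Picking a fruit from apple_tree leaves maple_tree untouched, which is the point of keeping data per instance.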
Lists, tuples, dictionaries, files were all objects of their respective classes
The functions we used on them were the member functions of those classes
◦ list1.append('a')

class roster:
    course = "cis530"

    # Called when an object is 'instantiated'
    def __init__(self, name, dept):
        self.student_name = name
        self.student_dept = dept

    # Another member function
    def print_details(self):
        print "Name: " + self.student_name
        print "Dept: " + self.student_dept
        print "Course: " + self.course

# Creating an instance
student1 = roster("annie", "cis")
# Calling methods of an object
student1.print_details()

Suite of classes for several NLP tasks
Parsing, POS tagging, classifiers…
Several text processing utilities, corpora
◦ Brown, Penn Treebank corpus…
◦ Your data was divided into sentences using 'punkt'

Basics: skim chapters 1-4
For this homework, be familiar with
◦ Corpus utilities (simplified in NLTK)
◦ Probability distributions: FreqDist, ConditionalFreqDist
  Read definitions of all member functions
  Look at the code to see how it is implemented
◦ Smoothing techniques

You will need to import the necessary modules to create objects and call member functions
◦ import is roughly like 'include': it brings in objects from pre-built packages
FreqDist, ConditionalFreqDist are in nltk.probability
PlaintextCorpusReader is in nltk.corpus

import nltk
from nltk.corpus import PlaintextCorpusReader

def get_files_from_category(category):
    # "Health#Cancer" names the subdirectory Health/Cancer
    subcat = category.split('#')
    if len(subcat) == 1:
        corpus_root = '/home1/c/cis530/data/' + subcat[0]
    else:
        corpus_root = '/home1/c/cis530/data/' + subcat[0] + '/' + subcat[1]
    # '.*' matches every file under corpus_root
    files = PlaintextCorpusReader(corpus_root, '.*')
    return files

finance_files = get_files_from_category("Finance")
cancer_files = get_files_from_category("Health#Cancer")

def get_num_tokens(topic):
    categ_files = get_files_from_category(topic)
    all_words = categ_files.words()
    return len(all_words)

print get_num_tokens("Health#Diet_and_Nutrition")
print get_num_tokens("Computers_and_the_Internet")

from nltk import FreqDist

def get_top_word(topic):
    categ_files = get_files_from_category(topic)
    all_words = categ_files.words()
    fdist1 = FreqDist(all_words)
    return fdist1.keys()[0]

print get_top_word("Finance")

In the NLTK version used for this course, keys() returns samples in decreasing order of frequency (NLTK 3 provides most_common() for this instead)
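To see what a frequency distribution computes without NLTK or the course data, the counting can be sketched with a plain dictionary. This is a stand-in for illustration only, not how FreqDist is implemented: the function names count_tokens and samples_by_frequency and the sample sentence are made up, and ties are broken alphabetically as an arbitrary choice.

```python
# Illustrative stand-in for a frequency distribution, using only built-in types.
# count_tokens and samples_by_frequency are hypothetical names, not NLTK functions.

def count_tokens(tokens):
    """Map each distinct token to the number of times it occurs."""
    counts = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1
    return counts


def samples_by_frequency(counts):
    """Return tokens in decreasing order of count; ties are sorted alphabetically."""
    return sorted(counts, key=lambda token: (-counts[token], token))


sentence = "the dog saw the cat and the cat ran"
tokens = sentence.split()            # split at whitespace, as shown earlier

counts = count_tokens(tokens)
ranked = samples_by_frequency(counts)

print(ranked[0])            # the
print(counts[ranked[0]])    # 3
```

The first element of ranked plays the role of the value returned by get_top_word, computed here over a single sentence rather than a corpus.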