Parsing

• Obtain text from somewhere (file, user input, web page, ..)
• Analyze the text: split it into meaningful tokens
• Extract the relevant information, disregard the rest
• What counts as ‘meaningful’ and ‘relevant’ depends on the application: what are we looking for?
  – Search the phone book for all people named “Ole Hansen”
  – Search the phone book for all phone numbers starting with 86
  – Search the phone book for all people living in Ny Munkegade

Example: Torleif game

Sort of like Master Mind with words and letters:
• Two players, each picks a hidden 5-letter noun
• They take turns guessing
• Each guess is scored by
  – the number of letters that sit in the same position in the hidden word
  – the number of letters that occur in the hidden word, but in a different position

Hidden word: sport
  trofæ   1 correct, 2 incorrect
  frygt   1 correct, 1 incorrect
  ..

Let’s write a computer player:

1. Pick a random word (from the homepage of Dansk Sprognævn).
2. Ask for a guess.
3. Was the guess correct? If so, stop.
4. Otherwise score the guess.
5. Go to 2.

Dansk Sprognævn, dictionary web page

• Ask for all words starting with ..
• We are looking for 5-letter strings in bold followed by the string “sb” in italics (“sb.” abbreviates substantiv, Danish for noun)
• The page displays at most 50 words at a time

Parsing the web page

The source code of the dynamically generated web page has 370 lines.
Some of it looks like this:

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
    <HTML>
    <HEAD>
    <META http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
    <TITLE>Retskrivningsordbogen på nettet fra Dansk Sprogn&aelig;vn</TITLE>
    <META name="Author" content="Erik Bo Krantz Simonsen, www.progresso.dk">
    <META name="Description" content="The official Danish orthography dictionary on the web">
    <META name="KeyWords" content="RO2001, Retskrivningsordbogen, ordbog, dictionary, orthography, Dansk Sprognævn">
    <LINK rel="STYLESHEET" href="http://www.dsn.dk/ordbog.aux/ro2001ie.css" type="text/css">
    <SCRIPT language="JavaScript" type="text/javascript">
    <!--
    self.focus(); // frame focus
    if (document.searchForm && document.searchForm.P)
    ..
    src="http://www.dsn.dk/ordbog.aux/lowerRight.gif"></td></tr></table></TD>
    <TABLE BORDER=0 CELLSPACING=0 CELLPADDING=0>
    <TR><TD rowspan="2" valign="top"><TABLE BORDER=0 CELLSPACING=0 CELLPADDING=7 WIDTH=390>
    <TR BGCOLOR="#d0e0d0"><TD> <B>spond&aelig;isk </B><I>adj., itk. d.s.</I> </TD></TR>
    <TR><TD> <B>sporstof </B><I>sb., </I>-fet, -fer. </TD></TR>
    <TR BGCOLOR="#d0e0d0"><TD> <B>sport </B><I>sb., </I>-en, <I>i sms. </I>sports-, <I>fx </I>sportsst&aelig;vne. </TD></TR>
    ..
    </HTML>

Algorithm for picking a random word

• Pick a random initial letter x (weighted: count the total number of words beginning with each letter)
• Pick a random index (in the list of all words starting with x)
• Ask the website for the page with the next 50 x-words starting at the chosen index
• Parse the page and look for the first 5-letter noun
• If none is found, ask for the next 50 (wrap-around)

get_random_word.py module

    import urllib
    import sys
    import re
    import random

    def getRandom5letterNoun():
        # q has weight 0 since there are no 5-letter Danish nouns starting with q!
        # hyppighed (frequency): number of words per initial letter a-z, æ, ø, å
        hyppighed = (3194, 4540, 759, 2651, 1556, 5221, 2658, 3141, 1890, 526,
                     4979, 2327, 3086, 1665, 2074, 3480, 0, 2455, 8460, 3845,
                     2315, 2230, 78, 20, 102, 77, 262, 252, 175)   # sum: 64018
        r = random.randrange(0, 64018)
        sum = hyppighed[0]
        startbogstav = 0                 # startbogstav: index of the initial letter
        while sum <= r:                  # pick random (weighted) starting letter
            startbogstav += 1
            sum += hyppighed[startbogstav]
        bogstavhyppighed = hyppighed[startbogstav]
        startindex = random.randrange(0, bogstavhyppighed)   # pick random index

        # translate from chosen character code into actual letter
        if startbogstav == 26:
            startbogstav = 'æ'
        elif startbogstav == 27:
            startbogstav = 'ø'
        elif startbogstav == 28:
            startbogstav = 'å'
        else:
            startbogstav = chr(startbogstav + 97)

        found_word = 0
        while not found_word:
            try:
                # get next 50 words, starting from chosen index, from website:
                myurl = "http://www.dsn.dk/cgi-bin/ordbog/ronet?M=1&P=%s&L=50&F=%d&T=%d" \
                        % (startbogstav, startindex, bogstavhyppighed)
                tempfile = urllib.urlopen(myurl)
                tekst = tempfile.read()
                tempfile.close()
            except IOError:
                print "Kan ikke få fat på Dansk Sprognævn"   # "Cannot reach Dansk Sprognævn"
                sys.exit(1)

            # replace special codes with corresponding letters
            tekst = tekst.replace("&aelig;", "æ")
            tekst = tekst.replace("&oslash;", "ø")
            tekst = tekst.replace("&aring;", "å")

            wordRE = "<B>([a-zæøå]{5}) </B><I>sb"   # look for 5-letter noun
            compiled_word = re.compile(wordRE)
            resultat = compiled_word.search(tekst)
            if resultat:
                word = resultat.group(1)
                found_word = 1
            else:
                # get next 50 words from website
                startindex += 50
                if startindex >= bogstavhyppighed:
                    startindex = 0
        return word

Sample words returned:

    fatwa pligt areal intet synål ceder tvist

Game program

    from get_random_word import getRandom5letterNoun

    ord = getRandom5letterNoun()   # ord: the hidden word
    g = ""
    svar = "\n"                    # svar: all scored guesses so far

    while ord != g:
        g = ""
        while len(g) != 5:
            g = raw_input("Dit 5-bogstavers bud? ").strip()   # "Your 5-letter guess?"

        guess = g
        kopi = ord   # kopi: working copy of the hidden word
        r = 0        # number of correctly placed matching letters
        f = 0        # number of incorrectly placed matching letters

        # first pass: mark exact matches so they are not counted twice
        for b in range(5):
            if guess[b] == kopi[b]:
                r += 1
                kopi = kopi[0:b] + '*' + kopi[b+1:]
                guess = guess[0:b] + '@' + guess[b+1:]

        # second pass: remaining letters that occur elsewhere in the hidden word
        for b in range(5):
            index = kopi.find(guess[b])
            if index >= 0:
                f += 1
                kopi = kopi[0:index] + '*' + kopi[index+1:]
                guess = guess[0:b] + '@' + guess[b+1:]

        svar = svar + "%s %dr %df\n" % (g, r, f)
        print svar

Sample session (the hidden word is salon):

    Dit 5-bogstavers bud? sport

    sport 1r 1f

    Dit 5-bogstavers bud? stang

    sport 1r 1f
    stang 1r 2f

    Dit 5-bogstavers bud? satin

    sport 1r 1f
    stang 1r 2f
    satin 3r 0f

    Dit 5-bogstavers bud? salon

    sport 1r 1f
    stang 1r 2f
    satin 3r 0f
    salon 5r 0f

Intermezzo 1 – find it on the web: http://www.daimi.au.dk/~chili/CSS/Intermezzi/9.10.1.html

1. Copy the get-random-word module: /users/chili/CSS.E03/ExamplePrograms/get_random_word.py
2. Make a new program that imports this module and prints out 5 random words.
3. Make a new version of the get_random_word module so that it returns a random noun of between 5 and 10 letters. Import this module and print out 5 random words.
4. Make a new version of the get_random_word module so that it finds a random word which has an alternative spelling and returns a tuple of both versions, e.g. (sponsering, sponsorering). Import this module and print out 5 random such word pairs using e.g. print "%s or %s" %getWords()

(Hint: See this sample webpage generated by Dansk Sprognævn's website and find the word sponsorering, which has an alternative spelling ("el." is short for "eller", which means "or"). Then look at the source code of this page, which you can find here: /users/chili/CSS.E03/ExamplePrograms/dsn_page.txt. Check exactly how the words sponsering and sponsorering appear in the html. Use that example to write a new regular expression.)

solution

    # 5-10 letter nouns:
    wordRE = "<B>([a-zæøå]{5,10}) </B><I>sb"

    # words with alternative spelling (look at html first):
    ..
    <TR><TD> <B>sponsere </B><I>(el. </I>sponsorere<I>) vb., </I>-ede. </TD></TR>
    ..
    wordRE = "<B>([a-zæøå]+) </B><I>\(el. </I>([a-zæøå]+)"
    compiled_word = re.compile(wordRE)
    resultat = compiled_word.search(tekst)
    if resultat:
        word = resultat.group(1)
        word2 = resultat.group(2)
    ..
    return (word, word2)

Sequence formats

Say we get this sequence in fasta format from some database:

    >FOSB_MOUSE Protein fosB. 338 bp
    MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTA
    ITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTSYSTPGLSAYSTGGASGS
    GGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRRRELT
    DRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRD
    LPGSTSAKEDGFGWLLPPPPPPPLPFQSSRDAPPNLTASLFTHSEVQVLGDPFPVVSPSY
    TSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPSLLAL

Now we need to compare this sequence to all sequences in some other database. Unfortunately this database uses the phylip format, so we need to translate:

Phylip Format: The first line of the input file contains the number of species, the number of sequences and their length (in characters) separated by blanks. The next line contains the sequence name, followed by the sequence in blocks of 10 characters.

Sequence formats

So we copy and paste and translate the fasta sequence into phylip:

    1 1 338
    FOSB_MOUSE MFQAFPGDYD SGSRCSSSPS AESQYLSSVD SFGSPPTAAA SQECAGLGEM
               PGSFVPTVTA ITTSQDLQWL VQPTLISSMA QSQGQPLASQ PPAVDPYDMP
               GTSYSTPGLS AYSTGGASGS GGPSTSTTTS GPVSARPARA RPRRPREETL
               TPEEEEKRRV RRERNKLAAA KCRNRRRELT DRLQAETDQL EEEKAELESE
               IAELQKEKER LEFVLVAHKP GCKIPYEEGP GPGPLAEVRD LPGSTSAKED
               GFGWLLPPPP PPPLPFQSSR DAPPNLTASL FTHSEVQVLG DPFPVVSPSY
               TSSFVLTCPE VSAFAGAQRT SGSEQPSDPL NSPSLLAL

and all is well.
Then our boss says “Do it for these 5000 sequences.”

We need an automatic filter!

• We need a program that reads any number of fasta sequences and converts them into phylip format (we want to run sequences through a filter)
• Program structure:
  1. Open fasta file
  2. Parse file to extract needed information
  3. Create and save phylip file
• We will use this definition of the fasta format (and assume only one sequence per file):
  – The description line starts with a greater-than symbol (">").
  – The word immediately following the ">" is the "ID" (name) of the sequence; the rest of the line is the description.
  – The "ID" and the description are optional.
  – All lines of text should be shorter than 80 characters.
  – The sequence ends where another ">" appears at the beginning of a line; there the next sequence begins.

Pseudo-code fasta→phylip filter

1. Open fasta file
2. Find line starting with >
3. Parse this line and extract first word after the > (sequence name)
4. Read the sequence (count its length)
5. Open phylip file
6. Write “1 1” followed by seq. length
7. Write seq. name
8. Write sequence in blocks of 10
9. Close files

The other way too: pseudo-code phylip→fasta filter

1. Open phylip file
2. Find first non-empty line, ignore!
3. Parse next line and extract first word (sequence name)
4. Read rest of line and following lines to get the sequence
5. Open fasta file
6. Write “>” followed by seq. name
7. Write sequence in lines of 80
8. Close files

More formats?

[Diagram: phylip→fasta and fasta→phylip filters]

• Boss: “Great! What about EMBL and GDE formats?”
• Coding, coding, ..: 12 filters! (one per ordered pair of the 4 formats)

More formats?

• Boss: “Super. And GenBank and ClustalW..?”
• Coding, coding, coding, ..: 30 filters
• Next new format: 12 new filters!

I.e., this doesn’t scale.
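The filter counts above follow from needing one filter per ordered (source, target) pair of formats. A minimal sketch that reproduces the numbers; the helper direct_filters and the lower-case format labels are invented for this illustration, and new_format stands for a hypothetical seventh format. It should run under both Python 2 and Python 3.

```python
from itertools import permutations

def direct_filters(formats):
    """Return every ordered (source, target) pair: one direct filter each."""
    return list(permutations(formats, 2))

four = ["fasta", "phylip", "embl", "gde"]
six = four + ["genbank", "clustalw"]
seven = six + ["new_format"]   # hypothetical seventh format

print(len(direct_filters(four)))                              # 12
print(len(direct_filters(six)))                               # 30
print(len(direct_filters(seven)) - len(direct_filters(six)))  # 12
```

With n formats this is n*(n-1) filters, and adding one more format costs 2n new ones, so the work per new format keeps growing.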
Intermediate format

• Use our own internal format as an intermediate step:

[Diagram: filters phylip→internal, internal→phylip, fasta→internal, internal→fasta]

• Two formats: four filters

Intermediate format

[Diagram: all formats connected through the i-format]

• Six formats: 12 filters (not 30)
• New format: always only two new filters

Let’s build a structured program!

• Each x2internal filter module: parse file in x format, extract information, return sequence(s) in internal format
• Each internal2y filter module: save each i-format sequence in a separate file in y format
• Example: overall phylip-fasta filter:
  – import the phylip2i and i2fasta modules
  – obtain the filenames to load from and save to from the command line
  – call the parse_file method of the phylip2i module
  – call the save_to_files method of the i2fasta module

Our internal format revisited

    class Isequence:
        """Definition of abstract data type representing a sequence in I-format - internal format"""

        def __init__(self, t="unknown", n="unknown", i="unknown"):
            """Initialize fields to given values"""
            self.type = t
            self.name = n
            self.id = i
            self.sequence = ""   # represent the sequence itself as a string

Thus, besides the sequence itself, the only information we keep about a sequence is type, name and id; all other information is disregarded.

Example: fasta/phylip filter

• Each x2internal filter module: parse file in x format, extract information, return sequence(s) in internal format

    # fasta2i module
    from Isequence import Isequence

    class Parser:
        # loads and parses fasta file into list of i-sequences

        def __init__(self):
            self.iseqlist = []   # initialize empty list
            self.iseq = None     # no sequence is being built yet

        def parse_file(self, loadfilename):
            # load file, save content in variable lines
            loadfile = open(loadfilename)
            lines = loadfile.readlines()
            loadfile.close()
            for line in lines:
                if line.startswith('>'):   # new sequence starts
                    items = line.split()
                    # assume: dna, first word after > is the id, next two words are the name.
                    self.iseq = Isequence("dna", " ".join(items[1:3]), items[0][1:])
                    self.iseqlist.append(self.iseq)   # put new Isequence object in list
                elif self.iseq:
                    # we are currently building an iseq object, extend its sequence
                    self.iseq.extend_sequence(line.strip())   # skip trailing newline
            return self.iseqlist

• Each internal2y filter module: save each i-format sequence in a separate file in y format

    # i2phylip module
    import sys
    from Isequence import Isequence

    class SaveToFiles:
        # save i-sequences in phylip format

        def save_to_files(self, iseqlist, savefilename):
            try:
                for seq in iseqlist:
                    # create appropriate suffix for the savefilename (a unique file per sequence)
                    suffix = ".phylip"
                    if len(iseqlist) > 1:
                        suffix = "_" + seq.get_id() + suffix
                    savefile = open(savefilename + suffix, "w")
                    seqstring = seq.get_sequence()
                    print >> savefile, "1 1 %d" % len(seqstring)
                    prefix = "%-10s " % seq.get_name()   # write name
                    savefile.write(prefix)
                    prefix = " " * len(prefix)   # on remaining lines write spaces instead of name
                    counter = 1
                    for char in seqstring:
                        savefile.write(char)
                        if counter % 10 == 0:
                            savefile.write(" ")
                        if counter % 50 == 0:
                            savefile.write("\n%s" % prefix)
                        counter += 1
                    savefile.close()
            except IOError, message:
                sys.exit(message)

Command-line arguments

• Python stores command-line arguments in a list called sys.argv
• The first element, sys.argv[0], is the name of the program that the user is running from the command line

    # filename: command_line_arguments.py
    import sys

    print "first argument is program name:", sys.argv[0]
    print "arguments for the program start at index 1:"
    for arg in sys.argv[1:]:
        print arg

    threonine:~...ExamplePrograms% python command_line_arguments.py 1 2 3 qq
    first argument is program name: command_line_arguments.py
    arguments for the program start at index 1:
    1
    2
    3
    qq

Overall fasta/phylip filter

1. import the fasta2i and i2phylip modules
2. obtain the filenames to load from and save to from the command line
3. call the parse_file method of the fasta2i module
4. call the save_to_files method of the i2phylip module

    import Isequence
    from i2phylip import SaveToFiles
    from fasta2i import Parser
    import sys

    # Now SaveToFiles is a class that can save i-format sequences in phylip format,
    # and Parser is a class that reads a fasta file and parses it into i-format.

    # load a fasta file, save each sequence in its own file in phylip format
    if len(sys.argv) != 3:
        sys.exit("""Program takes two arguments:
    file to load fasta sequence(s) from and
    file (prefix) to save phylip sequences in.""")

    # NB: nothing about phylip and fasta below this point..
    loadfilename = sys.argv[1]
    savefilename = sys.argv[2]

    # parse file and store each sequence in Isequence object:
    input_parser = Parser()
    iseq_list = input_parser.parse_file(loadfilename)

    # save each Isequence in required format in separate files:
    save_object = SaveToFiles()
    save_object.save_to_files(iseq_list, savefilename)

i2embl filter module..?

    import sys
    from Isequence import Isequence

    class SaveToFiles:                                      # same class name
        def save_to_files(self, iseqlist, savefilename):    # same method name
            try:
                for seq in iseqlist:
                    <<create appropriate suffix for the savefilename (a unique file per sequence)>>
                    savefile = open(savefilename + suffix, "w")
                    <<convert i-sequence to embl format and write to file>>
                    savefile.close()
            except IOError, message:
                sys.exit(message)

Fasta/embl filter..?

    import Isequence
    from i2embl import SaveToFiles   # import same class name from a different module
    from fasta2i import Parser
    import sys

    # Now SaveToFiles is a class that can save i-format sequences in embl format,
    # and Parser is a class that reads a fasta file and parses it into i-format.
    # load a fasta file, save each sequence in its own file in embl format
    if len(sys.argv) != 3:
        sys.exit("""Program takes two arguments:
    file to load fasta sequence(s) from and
    file (prefix) to save embl sequences in.""")

    loadfilename = sys.argv[1]
    savefilename = sys.argv[2]

    # parse file and store each sequence in Isequence object:
    input_parser = Parser()
    iseq_list = input_parser.parse_file(loadfilename)

    # save each Isequence in required format in separate files:
    save_object = SaveToFiles()
    save_object.save_to_files(iseq_list, savefilename)

Intermezzo 2, on the web: http://www.daimi.au.dk/~chili/CSS/Intermezzi/9.10.2.html

Oh no, the phylip format has been changed by its designers!

• The first line of a file with a sequence in the new phylip format is a comment line and begins with "@@". This comment line should contain the name of the author of the file, the year of creation, and the name of the author's favorite football player, separated by commas.
• In the next lines the sequence is written, starting with "##".
• In the final line (starting with "!!"), the sequence name is written.

Thus, a phylip format file might look like this:

    @@Jakob Fredslund, 2003, Zinedine Zidane
    ##cgactaagcttagcacggatcgatcggaattctagagcgacgacgtctagcagcgcgtaacgtatagctcgcgaggaaagctctgtaggggactg
    cgagaagatgg
    !!Tyrannosaurus Rex

Rewrite the fasta/phylip filter to incorporate the changed phylip format. Find all needed files here. I.e.:

1. Copy the needed files (remember Isequence.py).
2. Run the overall fasta/phylip filter on the given example fasta file and check the resulting phylip files to see how it works.
3. Make the necessary changes in the right places.

solution

• We only need to modify the i2phylip module. Neat: another good reason to use an intermediate format.
    # save each sequence in separate file:
    try:
        for seq in iseqlist:
            suffix = ".phylip"
            if len(iseqlist) > 1:
                suffix = "_" + seq.get_id() + suffix
            savefile = open(savefilename + suffix, "w")
            seqstring = seq.get_sequence()
            print >> savefile, "@@Jakob Fredslund, 2003, Zidane"
            print >> savefile, "##%s" % seqstring
            print >> savefile, "!!%s" % seq.get_name()
            savefile.close()
    except IOError, message:
        sys.exit(message)
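The solution only covers writing the new format. For the opposite direction, a parser might look roughly like the sketch below. It is not part of the course files: the function name parse_new_phylip and the returned dictionary are invented here, and it assumes that any lines between the "##" line and the "!!" line continue the sequence. The example sequence is shortened. It should run under both Python 2 and Python 3.

```python
def parse_new_phylip(text):
    """Parse one sequence in the revised phylip format into a dictionary."""
    record = {"author": None, "year": None, "player": None,
              "sequence": "", "name": None}
    in_sequence = False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("@@"):
            # comment line: author, year, favorite football player
            fields = [field.strip() for field in line[2:].split(",")]
            record["author"], record["year"], record["player"] = fields[:3]
        elif line.startswith("##"):
            record["sequence"] = line[2:]
            in_sequence = True
        elif line.startswith("!!"):
            record["name"] = line[2:]
            in_sequence = False
        elif in_sequence:
            # assumption: an unmarked line continues the sequence
            record["sequence"] += line
    return record

example = """@@Jakob Fredslund, 2003, Zinedine Zidane
##cgactaagcttagcacgg
cgagaagatgg
!!Tyrannosaurus Rex"""

record = parse_new_phylip(example)
print(record["name"])
print(record["sequence"])
```

In the structure used here, such a function would sit inside the parse_file method of a phylip2i module and fill an Isequence object instead of a dictionary.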