Phylip Format

Parsing
• Obtain text from somewhere (file, user input, web
page, ..)
• Analyze text: split it into meaningful tokens
• Extract relevant information, disregard irrelevant
information
• ‘Meaningful’, ‘relevant’ depend on application:
what are we looking for?
– Search phone book for all people named “Ole Hansen”
– Search phone book for all phone numbers starting with 86
– Search phone book for all people living in Ny Munkegade
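The three phone-book searches above can be sketched as follows. This is a minimal illustration, assuming a made-up entry layout of "name;street;number" per line; the entries and the helper name `search` are hypothetical and not part of the lecture material. It is written in Python 3 syntax.

```python
# Hypothetical phone book: one "name;street;number" entry per line.
PHONE_BOOK = """\
Ole Hansen;Ny Munkegade 12;86123456
Anne Jensen;Vestergade 3;87654321
Ole Hansen;Parkvej 7;40112233
Per Olsen;Ny Munkegade 40;86998877
"""

def search(text, field, wanted):
    """Return entries whose field (0=name, 1=street, 2=number) starts with wanted."""
    hits = []
    for line in text.splitlines():
        # Analyze the text: split each line into meaningful tokens.
        tokens = [t.strip() for t in line.split(";")]
        # Keep the relevant entries, disregard everything else.
        if len(tokens) == 3 and tokens[field].startswith(wanted):
            hits.append(tuple(tokens))
    return hits

people_named_ole = search(PHONE_BOOK, 0, "Ole Hansen")
numbers_with_86 = search(PHONE_BOOK, 2, "86")
living_in_ny_munkegade = search(PHONE_BOOK, 1, "Ny Munkegade")
```

The same text is parsed the same way each time; only the notion of 'relevant' changes with the question being asked.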
Example: Torleif game
Sort of like Master Mind with words and letters:
• Two players, each finds 5-letter noun
• Take turns in guessing
• Score each guess by
– Number of correctly placed letters also present in the hidden
word
– Number of incorrectly placed letters also present in hidden
word
sport
trofæ
frygt
..
1 correct, 2 incorrect
1 correct, 1 incorrect
Let’s write a computer player:
1. Pick random word (from homepage of Dansk
Sprognævn).
2. Ask for a guess
3. Was the guess correct?
4. Otherwise score the guess
5. Go to 2.
[Screenshot: Dansk Sprognævn dictionary web page. Annotations: ask for all words starting with ..; we are looking for 5-letter strings in bold followed by the string “sb” in italics; the page displays at most 50 words at a time.]
Parsing the web page
The source code of the dynamically generated web page has 370 lines.
Some of it looks like this:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<HTML>
<HEAD>
<META http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<TITLE>Retskrivningsordbogen på nettet fra Dansk Sprognævn</TITLE>
<META name="Author" content="Erik Bo Krantz Simonsen, www.progresso.dk">
<META name="Description" content="The official Danish orthography dictionary on the web">
<META name="KeyWords" content="RO2001, Retskrivningsordbogen, ordbog, dictionary,
orthography, Dansk Sprognævn">
<LINK rel="STYLESHEET" href="http://www.dsn.dk/ordbog.aux/ro2001ie.css" type="text/css">
<SCRIPT language="JavaScript" type="text/javascript"> <!-self.focus(); // frame focus
if (document.searchForm && document.searchForm.P)
src="http://www.dsn.dk/ordbog.aux/lowerRight.gif"></td></tr></table></TD>
<TABLE BORDER=0 CELLSPACING=0 CELLPADDING=0>
<TR><TD rowspan="2" valign="top"><TABLE BORDER=0 CELLSPACING=0 CELLPADDING=7 WIDTH=390>
<TR BGCOLOR="#d0e0d0"><TD>
<B>spondæisk </B><I>adj., itk. d.s.</I>
</TD></TR>
<TR><TD>
<B>sporstof </B><I>sb., </I>-fet, -fer.
</TD></TR>
<TR BGCOLOR="#d0e0d0"><TD>
<B>sport </B><I>sb., </I>-en, <I>i sms. </I>sports-, <I>fx </I>sportsstævne.
</TD></TR>
</HTML>
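As a quick check of the idea, the following sketch runs a regular expression over a few lines shaped like the page source above. It assumes Python 3, where strings are Unicode so æ, ø and å can appear directly in the character class; the variable names are illustrative only.

```python
import re

# A few lines shaped like the page source shown above.
html = """<TR BGCOLOR="#d0e0d0"><TD>
<B>spondæisk </B><I>adj., itk. d.s.</I>
</TD></TR>
<TR><TD>
<B>sporstof </B><I>sb., </I>-fet, -fer.
</TD></TR>
<TR BGCOLOR="#d0e0d0"><TD>
<B>sport </B><I>sb., </I>-en, <I>i sms. </I>sports-, <I>fx </I>sportsstævne.
</TD></TR>"""

# A bold word of exactly 5 letters, followed by "sb" (noun) in italics.
noun_pattern = re.compile(r"<B>([a-zæøå]{5}) </B><I>sb")
five_letter_nouns = noun_pattern.findall(html)
print(five_letter_nouns)
```

Only "sport" qualifies here: "spondæisk" is an adjective and "sporstof" has eight letters.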
Algorithm for picking a random word
• Pick a random initial letter x (weighted – count
total number of words beginning with each letter)
• Pick random index (in the list of all words starting
with x)
• Ask website for webpage with next 50 x-words
starting at chosen index
• Parse webpage and look for first 5-letter noun
• If none is found, ask for next 50 (wrap-around)
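The weighted choice in the first step can be sketched with running totals. This is one possible way to do it, not necessarily how the module below does it; the function name `pick_weighted_index` and the toy weights are assumptions made for illustration (Python 3).

```python
import bisect
import itertools
import random

def pick_weighted_index(weights, rng=random):
    """Pick index i with probability weights[i] / sum(weights)."""
    cumulative = list(itertools.accumulate(weights))  # running totals
    r = rng.randrange(cumulative[-1])                 # 0 <= r < total
    # First position whose running total exceeds r.
    # Entries with weight 0 can never be chosen.
    return bisect.bisect_right(cumulative, r)

ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"
toy_weights = [3, 0, 1]   # toy numbers, not the real word counts
letter = ALPHABET[pick_weighted_index(toy_weights)]
print(letter)
```

With the toy weights, "a" is picked three times as often as "c", and "b" is never picked.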
get_random_word.py module
import urllib
import sys
import re
import random
def getRandom5letterNoun():
    # q has weight 0 since there are no 5-letter Danish nouns starting with q!
    hyppighed = (3194, 4540, 759, 2651, 1556, 5221, 2658, 3141, 1890, 526, 4979,
                 2327, 3086, 1665, 2074, 3480, 0, 2455, 8460, 3845, 2315, 2230,
                 78, 20, 102, 77, 262, 252, 175)
    # sum: 64018
    r = random.randrange(0, 64018)
    sum = hyppighed[0]
    startbogstav = 0
    while sum <= r:  # pick random (weighted) starting letter
        startbogstav += 1
        sum += hyppighed[startbogstav]
    bogstavhyppighed = hyppighed[startbogstav]
    startindex = random.randrange(0, bogstavhyppighed)  # pick random index
    # translate from chosen character code into actual letter
    if startbogstav == 26:
        startbogstav = 'æ'
    elif startbogstav == 27:
        startbogstav = 'ø'
    elif startbogstav == 28:
        startbogstav = 'å'
    else:
        startbogstav = chr(startbogstav+97)
    found_word = 0
    while not found_word:
        try:
            # get next 50 words, starting from chosen index, from website:
            myurl = "http://www.dsn.dk/cgi-bin/ordbog/ronet?M=1&P=%s&L=50&F=%d&T=%d" \
                    %(startbogstav, startindex, bogstavhyppighed)
            tempfile = urllib.urlopen(myurl)
            tekst = tempfile.read()
            tempfile.close()
        except IOError:
            print "Kan ikke få fat på Dansk Sprognævn"  # "Cannot reach Dansk Sprognævn"
            sys.exit(1)
        # replace special codes with corresponding letters
        tekst = tekst.replace("&aelig;", "æ")
        tekst = tekst.replace("&oslash;", "ø")
        tekst = tekst.replace("&aring;", "å")
        wordRE = "<B>([a-zæøå]{5}) </B><I>sb"  # look for 5-letter noun
        compiled_word = re.compile( wordRE )
        resultat = compiled_word.search( tekst )
        if resultat:
            word = resultat.group(1)
            found_word = 1
        else:
            # get next 50 words from website
            startindex += 50
            if startindex > bogstavhyppighed:
                startindex = 0
    return word
fatwa
pligt
areal
intet
synål
ceder
tvist
Game program
from get_random_word import getRandom5letterNoun
ord = getRandom5letterNoun()
g = ""
svar = "\n"
while ord != g:
    g = ""
    while len(g) != 5:
        g = raw_input("Dit 5-bogstavers bud? ").strip()  # "Your 5-letter guess?"
    guess = g
    kopi = ord
    r = 0  # number of correctly placed matching letters
    f = 0  # number of incorrectly placed matching letters
    for b in range(5):
        if guess[b] == kopi[b]:
            r += 1
            kopi = kopi[0:b] + '*' + kopi[b+1:]
            guess = guess[0:b] + '@' + guess[b+1:]
    for b in range(5):
        index = kopi.find(guess[b])
        if index >= 0:
            f += 1
            kopi = kopi[0:index] + '*' + kopi[index+1:]
            guess = guess[0:b] + '@' + guess[b+1:]
    svar = svar + "%s %dr %df\n"%(g, r, f)
    print svar

Sample session:
Dit 5-bogstavers bud? sport

sport 1r 1f

Dit 5-bogstavers bud? stang

sport 1r 1f
stang 1r 2f

Dit 5-bogstavers bud? satin

sport 1r 1f
stang 1r 2f
satin 3r 0f

Dit 5-bogstavers bud? salon

sport 1r 1f
stang 1r 2f
satin 3r 0f
salon 5r 0f
Intermezzo 1 – find it on the web:
http://www.daimi.au.dk/~chili/CSS/Intermezzi/9.10.1.html
1. Copy the get-random-word module:
/users/chili/CSS.E03/ExamplePrograms/get_random_word.py
2. Make a new program that imports this module and prints out 5 random words.
3. Make a new version of the get_random_word module so that it returns a random
noun of between 5 and 10 letters. Import this module and print out 5 random
words.
4. Make a new version of the get_random_word module so that it finds a random
word which has an alternative spelling and returns a tuple of both versions, e.g.
(sponsering, sponsorering). Import this module and print out 5 random such
word pairs using e.g. print "%s or %s" %getWords()
(Hint: See this sample webpage generated by Dansk Sprognævn's website and
find the word sponsorering, which has an alternative spelling ("el." is short for
"eller", which means "or"). Then look at the source code of this page, which you
can find here:
/users/chili/CSS.E03/ExamplePrograms/dsn_page.txt.
Check how exactly the words sponsering and sponsorering appear in the html.
Use that example to write a new regular expression.)
solution
# 5-10 letter nouns:
wordRE = "<B>([a-zæøå]{5,10}) </B><I>sb"
# words with alternative spelling (look at html first):
..
<TR><TD>
<B>sponsere </B><I>(el. </I>sponsorere<I>) vb., </I>-ede.
</TD></TR>
..
wordRE = "<B>([a-zæøå]+) </B><I>\(el. </I>([a-zæøå]+)"
compiled_word = re.compile(wordRE)
resultat = compiled_word.search(tekst)
if resultat:
    word = resultat.group(1)
    word2 = resultat.group(2)
..
return (word, word2)
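The alternative-spelling pattern can be tried on the html fragment quoted in the solution. The sketch below assumes Python 3 and uses a raw string; unlike the pattern above it also escapes the dot in "el.", which is a small tightening and not part of the original solution.

```python
import re

# The html fragment quoted in the solution above.
html = """<TR><TD>
<B>sponsere </B><I>(el. </I>sponsorere<I>) vb., </I>-ede.
</TD></TR>"""

# Group 1: the main spelling in bold; group 2: the alternative after "(el. ".
alt_pattern = re.compile(r"<B>([a-zæøå]+) </B><I>\(el\. </I>([a-zæøå]+)")
match = alt_pattern.search(html)
pair = match.groups() if match else None
if pair:
    print("%s or %s" % pair)
```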
Sequence formats
Say we get this sequence in fasta format from some database:
>FOSB_MOUSE Protein fosB. 338 bp
MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTA
ITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTSYSTPGLSAYSTGGASGS
GGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRRRELT
DRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRD
LPGSTSAKEDGFGWLLPPPPPPPLPFQSSRDAPPNLTASLFTHSEVQVLGDPFPVVSPSY
TSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPSLLAL
Now we need to compare this sequence to all sequences in
some other database. Unfortunately this database uses the
phylip format, so we need to translate:
Phylip Format:
The first line of the input file contains the number of species, the number of sequences and
their length (in characters) separated by blanks.
The next line contains the sequence name, followed by the sequence in blocks of 10
characters.
So we copy and paste and translate the sequence:
1 1 338
FOSB_MOUSE MFQAFPGDYD SGSRCSSSPS AESQYLSSVD SFGSPPTAAA SQECAGLGEM
           PGSFVPTVTA ITTSQDLQWL VQPTLISSMA QSQGQPLASQ PPAVDPYDMP
           GTSYSTPGLS AYSTGGASGS GGPSTSTTTS GPVSARPARA RPRRPREETL
           TPEEEEKRRV RRERNKLAAA KCRNRRRELT DRLQAETDQL EEEKAELESE
           IAELQKEKER LEFVLVAHKP GCKIPYEEGP GPGPLAEVRD LPGSTSAKED
           GFGWLLPPPP PPPLPFQSSR DAPPNLTASL FTHSEVQVLG DPFPVVSPSY
           TSSFVLTCPE VSAFAGAQRT SGSEQPSDPL NSPSLLAL
and all is well.
Then our boss says “Do it for these 5000 sequences.”
We need an automatic filter!
• Need a program that reads any number of fasta sequences
and converts them into phylip format (want to run
sequences through a filter)
• Program structure:
1. Open fasta file
2. Parse file to extract needed information
3. Create and save phylip file
• We will use this definition for the fasta format (and
assume only one sequence per file):
– The description line starts with a greater-than symbol (">").
– The word immediately following the greater-than symbol is
the "ID" (name) of the sequence; the rest of the line is the
description.
– The "ID" and the description are optional.
– All lines of text should be shorter than 80 characters.
– The sequence ends if another greater-than symbol (">") appears
at the beginning of a line; there, another sequence begins.
Pseudo-code fasta→phylip filter
1. Open fasta file
2. Find line starting with >
3. Parse this line and extract first
word after the > (sequence name)
4. Read the sequence (count its
length)
5. Open phylip file
6. Write “1 1” followed by seq. length
7. Write seq. name
8. Write sequence in blocks of 10
9. Close files
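The steps above can be sketched as a single function that works on strings, leaving the file opening and closing (steps 1, 5 and 9) to the caller. This is an illustrative Python 3 sketch under the one-sequence-per-file assumption; the function name `fasta_to_phylip` and the choice of five blocks per line are assumptions, not part of the lecture code.

```python
def fasta_to_phylip(fasta_text, blocks_per_line=5):
    """Convert a single-sequence fasta string to the phylip layout used here."""
    name = "unknown"
    parts = []
    in_record = False
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if in_record:
                break                    # a second sequence starts: stop reading
            in_record = True
            words = line[1:].split()
            if words:
                name = words[0]          # first word after ">" is the name
        elif in_record and line:
            parts.append(line)           # collect the sequence lines
    sequence = "".join(parts)

    # Cut the sequence into blocks of 10 characters.
    blocks = [sequence[i:i + 10] for i in range(0, len(sequence), 10)]
    prefix = "%-10s " % name
    out = ["1 1 %d" % len(sequence)]
    for i in range(0, max(len(blocks), 1), blocks_per_line):
        out.append((prefix + " ".join(blocks[i:i + blocks_per_line])).rstrip())
        prefix = " " * len(prefix)       # later lines: spaces instead of the name
    return "\n".join(out) + "\n"

print(fasta_to_phylip(">seq1 demo sequence\nACGTACGTACGT\nACG\n"))
```

Working on strings keeps the conversion easy to test; a caller would read the fasta file into a string and write the returned text to the phylip file.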
The other way too:
pseudo-code phylip→fasta filter
1. Open phylip file
2. Find first non-empty line, ignore!
3. Parse next line and extract first
word (sequence name)
4. Read rest of line and following
lines to get the sequence
5. Open fasta file
6. Write “>” followed by seq. name
7. Write sequence in lines of 80
8. Close files
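The reverse direction can be sketched the same way. The Python 3 function below assumes that the sequence name contains no spaces and is separated from the sequence by whitespace; the name `phylip_to_fasta` and the `width` parameter are illustrative assumptions. The default of 80 follows the pseudo-code; a smaller width keeps every line strictly shorter than 80 characters.

```python
def phylip_to_fasta(phylip_text, width=80):
    """Convert a single-sequence phylip string to fasta."""
    lines = [ln for ln in phylip_text.splitlines() if ln.strip()]
    if len(lines) < 2:
        raise ValueError("expected a header line and at least one sequence line")
    # lines[0] is the first non-empty line (the "1 1 <length>" header): ignored.
    first = lines[1].split()
    name = first[0]                       # first word is the sequence name
    # Rest of that line plus all following lines, with the blanks removed.
    sequence = "".join(first[1:])
    sequence += "".join("".join(ln.split()) for ln in lines[2:])
    out = [">" + name]
    out.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(out) + "\n"

print(phylip_to_fasta("1 1 15\nseq1       ACGTACGTAC GTACG\n"))
```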
More formats?
[Diagram: two filters, phylip → fasta and fasta → phylip]
• Boss: “Great! What about EMBL and GDE formats?”
• Coding, coding, ..: 12 filters!
• Boss: “Super. And GenBank and ClustalW..?”
• Coding, coding, coding, ..: 30 filters
• Next new format: 12 new filters! I.e., this doesn’t scale.
Intermediate format
• Use our own internal format as intermediate step:
[Diagram: phylip ↔ internal ↔ fasta, with the filters phylip→internal, internal→phylip, fasta→internal and internal→fasta]
• Two formats: four filters
• Six formats: 12 filters (not 30)
[Diagram: formats connected through the central i-format]
• New format: always two new filters only
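The filter counts quoted on these slides follow from simple arithmetic: direct conversion needs one filter per ordered pair of formats, n(n-1), while an intermediate format needs one filter into it and one out of it per format, 2n. A small Python 3 sketch (the function names are made up for illustration):

```python
def direct_filters(n_formats):
    """One filter for every ordered pair of distinct formats."""
    return n_formats * (n_formats - 1)

def filters_via_internal(n_formats):
    """One x2internal and one internal2x filter per format."""
    return 2 * n_formats

for n in (2, 4, 6, 7):
    print("%d formats: %d direct filters, %d via internal format"
          % (n, direct_filters(n), filters_via_internal(n)))
```

Going from six to seven formats costs 42 - 30 = 12 new direct filters, but only two new filters with the intermediate format. For just two formats the intermediate step is the more expensive option (four filters instead of two).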
Let’s build a structured program!
• Each x2internal filter module: parse file in x format,
extract information, return sequence(s) in internal format
• Each internal2y filter module: save each i-format sequence
in separate file in y format
• Example: Overall phylip-fasta filter:
– import phylip2i and i2fasta modules
– obtain filenames to load from and save to from
command line
– call parse_file method of the phylip2i module
– call the save_to_files method of the i2fasta
module
Our internal format revisited
class Isequence:
    """Definition of abstract data type representing a sequence in I-format
    - internal format"""
    def __init__(self, t = "unknown", n = "unknown", i = "unknown" ):
        """Initialize fields to given values"""
        self.type = t
        self.name = n
        self.id = i
        self.sequence = "" # represent the sequence itself as a string
Thus, the information we keep about a sequence is type, name,
id; all other information is disregarded
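The filter modules in this lecture call extend_sequence, get_sequence, get_name and get_id on Isequence objects, but the slides show only the constructor. The Python 3 sketch below fills in what those methods might look like; the method bodies are assumptions, not the lecture's actual Isequence.py.

```python
class Isequence:
    """A sequence in I-format (internal format)."""

    def __init__(self, t="unknown", n="unknown", i="unknown"):
        self.type = t
        self.name = n
        self.id = i
        self.sequence = ""   # the sequence itself, as a string

    # The methods below are assumed; only their names appear in the slides.
    def extend_sequence(self, fragment):
        """Append another piece of sequence, e.g. one line of a fasta file."""
        self.sequence += fragment

    def get_sequence(self):
        return self.sequence

    def get_name(self):
        return self.name

    def get_id(self):
        return self.id

seq = Isequence("protein", "Protein fosB.", "FOSB_MOUSE")
seq.extend_sequence("MFQAFPGDYD")
seq.extend_sequence("SGSRCSSSPS")
print(seq.get_id(), len(seq.get_sequence()))
```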
Example: fasta/phylip filter
• Each x2internal filter module: parse file in x format, extract
information, return sequence(s) in internal format
from Isequence import Isequence
class Parser:
    # loads and parses fasta file into list of i-sequences
    def __init__(self):
        self.iseqlist = [] # initialize empty list
        self.iseq = None   # no sequence is being built yet
    def parse_file(self, loadfilename):
        <<load file, save content in variable lines>>
        for line in lines:
            if line[0] == '>':
                # new sequence starts
                items = line.split()
                # assume: dna, first word after > is the id, next two words are the name.
                self.iseq = Isequence("dna", " ".join(items[1:3]), items[0][1:])
                self.iseqlist.append(self.iseq) # put new Isequence object in list
            elif self.iseq:
                # we are currently building an iseq object, extend its sequence
                self.iseq.extend_sequence(line.strip()) # skip trailing newline
        return self.iseqlist
• Each internal2y filter module: save each i-format sequence in
separate file in y format
import sys
from Isequence import Isequence
class SaveToFiles:
    # save i-sequences in phylip format
    def save_to_files(self, iseqlist, savefilename):
        try:
            for seq in iseqlist:
                <<create appropriate suffix for the savefilename (a unique file per sequence)>>
                savefile = open(savefilename + suffix, "w")
                seqstring = seq.get_sequence()
                print >> savefile, "1 1 %d" %len( seqstring )
                prefix = "%-10s " %seq.get_name() # write name
                savefile.write( prefix )
                prefix = " " * len( prefix ) # on remaining lines write spaces instead of name
                counter = 1
                for char in seqstring:
                    savefile.write( char )
                    if counter%10 == 0:
                        savefile.write( " " )
                    if counter%50 == 0:
                        savefile.write( "\n%s" %prefix )
                    counter += 1
                savefile.close()
        except IOError, message:
            sys.exit(message)
Command-line arguments
• Python stores command-line arguments in a list
called sys.argv
• The first argument is the name of the program that
the user is running from the command-line
# filename: command_line_arguments.py
import sys
print "first argument is program name:", sys.argv[0]
print "arguments for the program start at index 1:"
for arg in sys.argv[1:]:
    print arg
threonine:~...ExamplePrograms% python command_line_arguments.py 1 2 3 qq
first argument is program name: command_line_arguments.py
arguments for the program start at index 1:
1
2
3
qq
Overall fasta/phylip filter
1. import fasta2i and i2phylip modules
2. obtain filenames to load from and save to from command line
3. call parse_file method of the fasta2i module
4. call the save_to_files method of the i2phylip module
import Isequence
from i2phylip import SaveToFiles
from fasta2i import Parser
import sys
# Now SaveToFiles is a class that can save i-format sequences in phylip format,
# and Parser is a class that reads a fasta file and parses it into i-format.
# load a fasta file, save each sequence in its own file in phylip format
if len(sys.argv) != 3:
    sys.exit("""Program takes two arguments: file to load fasta sequence(s)
from and file (prefix) to save phylip sequences in.""")
loadfilename = sys.argv[1]
savefilename = sys.argv[2]
# NB: nothing about phylip and fasta below this point..
# parse file and store each sequence in Isequence object:
input_parser = Parser()
iseq_list = input_parser.parse_file(loadfilename)
# save each Isequence in required format in separate files:
save_object = SaveToFiles()
save_object.save_to_files(iseq_list, savefilename)
i2embl filter module..?
import sys
from Isequence import Isequence
class SaveToFiles:                                    # same class name
    def save_to_files(self, iseqlist, savefilename):  # same method name
        try:
            for seq in iseqlist:
                <<create appropriate suffix for the savefilename (a unique file per sequence)>>
                savefile = open(savefilename + suffix, "w")
                <<convert i-sequence to embl format and write to file>>
                savefile.close()
        except IOError, message:
            sys.exit(message)
Fasta/embl filter..?
import Isequence
from i2embl import SaveToFiles  # import same class name from different module
from fasta2i import Parser
import sys
# Now SaveToFiles is a class that can save i-format sequences in embl format,
# and Parser is a class that reads a fasta file and parses it into i-format.
# load a fasta file, save each sequence in its own file in embl format
if len(sys.argv) != 3:
    sys.exit("""Program takes two arguments: file to load fasta sequence(s)
from and file (prefix) to save embl sequences in.""")
loadfilename = sys.argv[1]
savefilename = sys.argv[2]
# parse file and store each sequence in Isequence object:
input_parser = Parser()
iseq_list = input_parser.parse_file(loadfilename)
# save each Isequence in required format in separate files:
save_object = SaveToFiles()
save_object.save_to_files(iseq_list, savefilename)
Intermezzo 2, on the web:
http://www.daimi.au.dk/~chili/CSS/Intermezzi/9.10.2.html
Oh no, the phylip format has been changed by its designers!
• The first line of a file with a sequence in the new phylip format is a comment line and begins with
"@@". In this comment line the name of the author of the file should appear, the year of creation,
and the name of the author's favorite football player, separated by commas.
• In the next lines the sequence is written, starting with "##".
• In the final line (starting with "!!"), the sequence name is written.
Thus, a phylip format file might look like this:
@@Jakob Fredslund, 2003, Zinedine Zidane
##cgactaagcttagcacggatcgatcggaattctagagcgacgacgtctagcagcgcgtaacgtatagctcgcgaggaaagctctgtaggggactg
cgagaagatgg
!!Tyrannosaurus Rex
Rewrite the fasta/phylip filter to incorporate the changed phylip format. Find all needed files here. I.e.:
1. Copy the needed files – remember Isequence.py
2. Run the overall fasta/phylip filter on the given example fasta file and check the resulting phylip
files to see how it works.
3. Make the necessary changes in the right places.
solution
• We only need to modify the i2phylip module
(neat! – another good reason to use an
intermediate format).
# save each sequence in separate file:
try:
    for seq in iseqlist:
        suffix = ".phylip"
        if len(iseqlist) > 1:
            suffix = "_" + seq.get_id() + suffix
        savefile = open(savefilename + suffix, "w")
        seqstring = seq.get_sequence()
        print >> savefile, "@@Jakob Fredslund, 2003, Zidane"
        print >> savefile, "##%s" %seqstring
        print >> savefile, "!!%s" %seq.get_name()
        savefile.close()
except IOError, message:
    sys.exit(message)