Slides - Edwards Lab

Sequence File Parsing
using Biopython
BCHB524
2013
Lecture 11
10/7/2013
BCHB524 - 2013 - Edwards
Review

Modules in the Python standard library:
  - sys, os, os.path – access files, program environment
  - zipfile, gzip – access compressed files directly
  - urllib – access web-resources (URLs) as files
  - csv – read delimited line based records from files
Plus lots, lots more.
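As a quick refresher, here is a minimal sketch that combines three of the reviewed modules (os.path, gzip, csv). It is written for Python 3, unlike the Python 2 code in these slides; the function name and the file name table.tsv.gz are hypothetical, chosen only for illustration.

```python
import csv
import gzip
import os
import os.path
import tempfile


def write_and_read_table(rows):
    """Write rows to a gzip-compressed, tab-delimited file and read them back."""
    tmpdir = tempfile.mkdtemp()
    path = os.path.join(tmpdir, "table.tsv.gz")  # hypothetical file name
    # In text mode ("wt"/"rt") a gzip file behaves like an ordinary text file
    with gzip.open(path, "wt", newline="") as handle:
        csv.writer(handle, delimiter="\t").writerows(rows)
    with gzip.open(path, "rt", newline="") as handle:
        result = list(csv.reader(handle, delimiter="\t"))
    # Clean up the temporary file and directory
    os.remove(path)
    os.rmdir(tmpdir)
    return result


if __name__ == "__main__":
    print(write_and_read_table([["accession", "length"], ["P12345", "430"]]))
```

Note that csv returns every field as a string, so numeric columns need an explicit int() or float() conversion.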
BioPython

Additional modules that make many common bioinformatics tasks easier
  - File parsing (many formats) & web-retrieval
  - Formal biological alphabets, codon tables, etc.
  - Lots of other stuff…
Have to install separately
  - Not part of standard Python, or Enthought
  - biopython.org
Biopython: Fasta format


Most common biological sequence data format
Header/Description line
  - >accession description
  - Multi-accession sometimes represented: accession1|accession2|accession3
  - Lots of variations, no standardization
  - No prescribed format for the description
Other lines
  - Sequence, one chunk per line.
  - Usually all lines, except the last, are the same length.
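To make the layout above concrete, here is a minimal sketch of a FASTA reader in plain Python 3. It is not how Biopython parses FASTA; the function names read_fasta and split_header are hypothetical, and the sketch assumes the accession is everything up to the first whitespace on the header line.

```python
def split_header(header):
    """Split a header (without the leading '>') into accession and description."""
    parts = header.split(None, 1)
    accession = parts[0] if parts else ""
    description = parts[1].strip() if len(parts) > 1 else ""
    return (accession, description)


def read_fasta(lines):
    """Yield (accession, description, sequence) from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                yield split_header(header) + ("".join(chunks),)
            header, chunks = line[1:], []
        elif header is not None and line.strip():
            # Sequence lines are concatenated, one chunk per line
            chunks.append(line.strip())
    if header is not None:
        yield split_header(header) + ("".join(chunks),)


if __name__ == "__main__":
    example = [">acc1 an example protein\n", "MKV\n", "LLA\n", ">acc2\n", "GG\n"]
    for record in read_fasta(example):
        print(record)
```

Because the description has no prescribed format, the sketch returns it as one uninterpreted string.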
BioPython: Bio.SeqIO
import Bio.SeqIO
import sys

# Check the input
if len(sys.argv) < 2:
    print >>sys.stderr, "Please provide a sequence file"
    sys.exit(1)

# Get the sequence filename
seqfilename = sys.argv[1]

# Open the FASTA file and iterate through its sequences
seqfile = open(seqfilename)
for seq_record in Bio.SeqIO.parse(seqfile, "fasta"):
    # Print out the various elements of the SeqRecord
    print "\n------NEW SEQRECORD------\n"
    print "seq_record.id:\n\t", seq_record.id
    print "seq_record.description:\n\t", seq_record.description
    print "seq_record.seq:\n\t", seq_record.seq
seqfile.close()
Biopython: Other formats

GenBank format
  - From NCBI, also format for RefSeq sequence
UniProt/SwissProt flat-file format
  - From UniProt for SwissProt and TrEMBL
UniProt-XML format
  - From UniProt for SwissProt and TrEMBL
Use the gzip module to handle compressed sequence databases
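One way to keep track of which format name goes with which file is a small lookup table. The sketch below (Python 3) maps file extensions to the format names used with Bio.SeqIO.parse in these slides ("fasta", "genbank", "swiss", "uniprot-xml"). The extensions themselves and the helper name guess_format are assumptions for illustration; real downloads may be named differently.

```python
import os.path

# Hypothetical extension table: the extensions are assumptions, the values
# are the format names passed to Bio.SeqIO.parse in these slides.
FORMAT_BY_EXTENSION = {
    ".fasta": "fasta",
    ".fa": "fasta",
    ".gb": "genbank",
    ".gbk": "genbank",
    ".dat": "swiss",
    ".xml": "uniprot-xml",
}


def guess_format(filename):
    """Return a format name for filename, or None if the extension is unknown."""
    base = filename
    # Look past a trailing .gz so compressed databases are recognized too
    if base.endswith(".gz"):
        base = base[:-3]
    extension = os.path.splitext(base)[1].lower()
    return FORMAT_BY_EXTENSION.get(extension)


if __name__ == "__main__":
    for name in ["human.fasta", "human.dat.gz", "human.xml", "notes.txt"]:
        print(name, guess_format(name))
```

Returning None for unknown extensions leaves it to the caller to ask the user for the format rather than guessing.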
BioPython: Bio.SeqIO
import Bio.SeqIO
import sys

# Check the input
if len(sys.argv) < 2:
    print >>sys.stderr, "Please provide a sequence file"
    sys.exit(1)

# Get the sequence filename
seqfilename = sys.argv[1]

# Open the GenBank file and iterate through its sequences
seqfile = open(seqfilename)
for seq_record in Bio.SeqIO.parse(seqfile, "genbank"):
    # Print out the various elements of the SeqRecord
    print "\n------NEW SEQRECORD------\n"
    print "seq_record.id:\n\t", seq_record.id
    print "seq_record.description:\n\t", seq_record.description
    print "seq_record.seq:\n\t", seq_record.seq
seqfile.close()
BioPython: Bio.SeqIO
import Bio.SeqIO
import sys

# Check the input
if len(sys.argv) < 2:
    print >>sys.stderr, "Please provide a sequence file"
    sys.exit(1)

# Get the sequence filename
seqfilename = sys.argv[1]

# Open the SwissProt flat file and iterate through its sequences
seqfile = open(seqfilename)
for seq_record in Bio.SeqIO.parse(seqfile, "swiss"):
    # Print out the various elements of the SeqRecord
    print "\n------NEW SEQRECORD------\n"
    print "seq_record.id:\n\t", seq_record.id
    print "seq_record.description:\n\t", seq_record.description
    print "seq_record.seq:\n\t", seq_record.seq
seqfile.close()
BioPython: Bio.SeqIO
import Bio.SeqIO
import sys

# Check the input
if len(sys.argv) < 2:
    print >>sys.stderr, "Please provide a sequence file"
    sys.exit(1)

# Get the sequence filename
seqfilename = sys.argv[1]

# Open the UniProt-XML file and iterate through its sequences
seqfile = open(seqfilename)
for seq_record in Bio.SeqIO.parse(seqfile, "uniprot-xml"):
    # Print out the various elements of the SeqRecord
    print "\n------NEW SEQRECORD------\n"
    print "seq_record.id:\n\t", seq_record.id
    print "seq_record.description:\n\t", seq_record.description
    print "seq_record.seq:\n\t", seq_record.seq
seqfile.close()
BioPython: Bio.SeqIO and gzip
import Bio.SeqIO
import sys
import gzip

# Check the input
if len(sys.argv) < 2:
    print >>sys.stderr, "Please provide a sequence file"
    sys.exit(1)

# Get the sequence filename
seqfilename = sys.argv[1]

# Open the gzip-compressed FASTA file and iterate through its sequences
seqfile = gzip.open(seqfilename)
for seq_record in Bio.SeqIO.parse(seqfile, "fasta"):
    # Print out the various elements of the SeqRecord
    print "\n------NEW SEQRECORD------\n"
    print "seq_record.id:\n\t", seq_record.id
    print "seq_record.description:\n\t", seq_record.description
    print "seq_record.seq:\n\t", seq_record.seq
seqfile.close()
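A script often has to accept both plain and compressed files. The following is a minimal Python 3 sketch of a helper that picks open or gzip.open from the file name; the helper name open_sequence_file and the reliance on a .gz suffix are assumptions. Under Python 3, gzip.open defaults to binary mode, so the sketch asks for text mode ("rt") explicitly, which suits line-oriented text formats such as FASTA.

```python
import gzip


def open_sequence_file(filename):
    """Open filename for reading as text, decompressing if it ends in .gz."""
    if filename.endswith(".gz"):
        # "rt" = read, text mode; the default for gzip.open is binary
        return gzip.open(filename, "rt")
    return open(filename, "r")


if __name__ == "__main__":
    import sys

    if len(sys.argv) < 2:
        print("Please provide a sequence file", file=sys.stderr)
        sys.exit(1)
    # Print the first line of the (possibly compressed) file
    with open_sequence_file(sys.argv[1]) as handle:
        print(handle.readline().rstrip("\n"))
```

Deciding by suffix is the simplest choice; it fails for compressed files without the .gz extension.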
What about the other "stuff"

BioPython makes it easy to get access to non-sequence information stored in "rich" sequence databases
  - Annotations
  - Cross-References
  - Sequence Features
  - Literature
BioPython: Bio.SeqIO
import Bio.SeqIO
import sys
import gzip

# Check the input
if len(sys.argv) < 2:
    print >>sys.stderr, "Please provide a sequence file"
    sys.exit(1)

# Get the sequence filename
seqfilename = sys.argv[1]

# Open the gzip-compressed UniProt-XML file and iterate through its sequences
seqfile = gzip.open(seqfilename)
for seq_record in Bio.SeqIO.parse(seqfile, "uniprot-xml"):
    # What else is available in the SeqRecord?
    print "\n------NEW SEQRECORD------\n"
    print "repr(seq_record)\n\t", repr(seq_record)
    print "dir(seq_record)\n\t", dir(seq_record)
    # Stop after the first record
    break
seqfile.close()
BioPython: Bio.SeqRecord
import Bio.SeqIO
import sys
import gzip

# Check the input
if len(sys.argv) < 2:
    print >>sys.stderr, "Please provide a sequence file"
    sys.exit(1)

# Get the sequence filename
seqfilename = sys.argv[1]

# Open the gzip-compressed UniProt-XML file and iterate through its sequences
seqfile = gzip.open(seqfilename)
for seq_record in Bio.SeqIO.parse(seqfile, "uniprot-xml"):
    # Print out the non-sequence elements of the SeqRecord
    print "\n------NEW SEQRECORD------\n"
    print "seq_record.annotations\n\t", seq_record.annotations
    print "seq_record.features\n\t", seq_record.features
    print "seq_record.dbxrefs\n\t", seq_record.dbxrefs
    print "seq_record.format('fasta')\n", seq_record.format('fasta')
    # Stop after the first record
    break
seqfile.close()
BioPython: Random access

Sometimes you want to access the sequence records "randomly"…
  - …to pick out the ones you want (by accession)
Why not make a dictionary, with accessions as keys, and SeqRecord values?
  - Use SeqIO.to_dict(…)
What if you don't want to hold it all in memory?
  - Use SeqIO.index(…)
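The difference between the two approaches can be illustrated without Biopython. A dictionary of records keeps every sequence in memory; an index keeps only each record's position in the file and reads a record when it is requested. The Python 3 sketch below shows the second idea for a plain FASTA file. It is an illustration of the general technique, not Biopython's implementation, and the function names are hypothetical.

```python
def index_fasta_offsets(filename):
    """Map each accession to the byte offset of its header line."""
    offsets = {}
    with open(filename, "rb") as handle:
        while True:
            position = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                fields = line[1:].split(None, 1)
                if fields:
                    # Only the accession and an integer are kept in memory
                    offsets[fields[0].decode("ascii")] = position
    return offsets


def fetch_record(filename, offsets, accession):
    """Return one record's text by seeking straight to its header line."""
    with open(filename, "rb") as handle:
        handle.seek(offsets[accession])
        lines = [handle.readline()]
        # Collect sequence lines until the next header or the end of file
        for line in iter(handle.readline, b""):
            if line.startswith(b">"):
                break
            lines.append(line)
    return b"".join(lines).decode("ascii")
```

The index costs one pass over the file up front; after that, each lookup reads only the lines of the requested record.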
BioPython: Bio.SeqIO.to_dict(…)
import Bio.SeqIO
import sys

# Check the input
if len(sys.argv) < 2:
    print >>sys.stderr, "Please provide a sequence file"
    sys.exit(1)

# Get the sequence filename
seqfilename = sys.argv[1]

# Open the sequence database
seqfile = open(seqfilename)

# Use to_dict to make a dictionary of sequence records
sprot_dict = Bio.SeqIO.to_dict(Bio.SeqIO.parse(seqfile, "uniprot-xml"))

# Close the file
seqfile.close()

# Access and print a sequence record
print sprot_dict['Q6GZV8']
BioPython: Bio.SeqIO.index(…)
import Bio.SeqIO
import sys

# Check the input
if len(sys.argv) < 2:
    print >>sys.stderr, "Please provide a sequence file"
    sys.exit(1)

# Get the sequence filename
seqfilename = sys.argv[1]

# Use index to make an out-of-core dict of seq records
sprot_index = Bio.SeqIO.index(seqfilename, "uniprot-xml")

# Access and print a sequence record
print sprot_index['Q6GZV8']
Exercises


Read through and try the examples from Chapters 2-5 of BioPython's Tutorial.
Download human proteins from RefSeq and compute amino-acid frequencies for the (RefSeq) human proteome.
  - Which amino acid occurs the most? The least?
  - Hint: access RefSeq human proteins from ftp://ftp.ncbi.nih.gov/refseq
Download human proteins from SwissProt and compute amino-acid frequencies for the SwissProt human proteome.
  - Which amino acid occurs the most? The least?
  - Hint: access SwissProt human proteins from http://www.uniprot.org/downloads -> “Taxonomic divisions”
How similar are the human amino-acid frequencies in RefSeq and SwissProt?
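The counting step of these exercises can be sketched on its own with collections.Counter (Python 3). The function name amino_acid_frequencies is hypothetical, and the toy sequences in the demo are made up. In the exercises the sequences would come from the parsed records, for example str(seq_record.seq) for each record.

```python
from collections import Counter


def amino_acid_frequencies(sequences):
    """Return a dict mapping each residue letter to its relative frequency."""
    counts = Counter()
    for sequence in sequences:
        # Counter.update counts each character of the string
        counts.update(str(sequence).upper())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {residue: n / total for residue, n in counts.items()}


if __name__ == "__main__":
    # Toy input, not real proteome data
    frequencies = amino_acid_frequencies(["MKKLA", "MLLG"])
    for residue in sorted(frequencies, key=frequencies.get, reverse=True):
        print(residue, round(frequencies[residue], 3))
```

Pooling counts across all sequences before dividing weights long proteins more heavily than short ones, which is what a proteome-wide frequency calls for.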