Python for Scientific Computing
Lecture 8: Biopython
Antonio M. Ferreira, PhD
Center for Simulation and Modeling
October 2, 2013
What is Biopython?
http://www.biopython.org
- Automatically parses files into Python data structures:
  - BLAST
  - Clustalw
  - FASTA
  - PubMed and Medline
  - SwissProt
  - UniGene
  - ExPASy files
  - SCOP (including ’dom’ and ’lin’ files)
- Interfaces for calling popular bioinformatics tools:
  - BLAST (network and standalone versions)
  - Clustalw
  - EMBOSS
  - BioSQL
  - Many others
Basically, we just like to program in Python and want to make it
as easy as possible to use Python for bioinformatics by creating
high-quality, reusable modules and scripts.
Getting Started
- Get the workshop data
    $ cd ~
    $ mkdir -p biopython/data
    $ cd biopython/data
    $ cp -v /pan/genomics/data/Biopython_workshop/* .
- Set up your environment
    $ module purge
    $ module load genomics/ver6
- Now you can start Python
    $ cd ~/biopython
    $ python
    Enthought Python Distribution -- www.enthought.com
    Version: 7.2-2 (64-bit)
    Python 2.7.2 |EPD 7.2-2 (64-bit)| (default, Jul 3 2011, 15:17:51)
    [GCC 4.1.2 20080704 (Red Hat 4.1.2-44)] on linux2
    Type "packages", "demo" or "enthought" for more information.
    >>>
Quick Start
- Let’s have some fun with sequences
    >>> from Bio.Seq import Seq
    >>> my_seq = Seq("AGTACACTGGT")
    >>> type(my_seq)
    >>> my_seq
    >>> print my_seq
    >>> my_seq.alphabet
    >>> my_seq.complement()
    >>> my_seq.reverse_complement()
    >>> print my_seq.lower()
    >>> print my_seq.upper()
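
To make the complement and reverse-complement calls concrete, here is a minimal pure-Python sketch of what they compute for unambiguous DNA. It is an illustration only, not Biopython's implementation; the COMPLEMENT table and both function names are invented for this example.

```python
# Pairing rules for unambiguous DNA (assumption: upper-case A, C, G, T only).
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(dna):
    """Return the base-by-base complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in dna)


def reverse_complement(dna):
    """Return the complement read in the opposite direction."""
    return complement(dna)[::-1]


seq = "AGTACACTGGT"          # same sequence as on the slide
comp = complement(seq)
revcomp = reverse_complement(seq)
```

Here comp holds the complementary strand and revcomp the same strand read 5' to 3'.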
Quick Start (cont.)
- Now let’s play with a file
    >>> from Bio import SeqIO
    >>> for seq_record in SeqIO.parse("data/ls_orchid.fasta", "fasta"):
    ...     print seq_record.id
    ...     print repr(seq_record.seq)
    ...     print len(seq_record)
- How about a GenBank file?
    >>> for seq_record in SeqIO.parse("data/ls_orchid.gbk", "genbank"):
    ...     print seq_record.id
    ...     print repr(seq_record.seq)
    ...     print len(seq_record)
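
For readers curious about what a FASTA parser has to do, the following is a rough pure-Python sketch of the idea: split the input on ">" header lines and join the sequence lines that follow. It is not how SeqIO.parse is implemented; the function names and the two in-memory records are hypothetical.

```python
import io


def _make_record(header, chunks):
    """Build an (identifier, description, sequence) tuple from raw pieces."""
    parts = header.split(None, 1)
    identifier = parts[0] if parts else ""
    return identifier, header, "".join(chunks)


def parse_fasta(handle):
    """Yield one (identifier, description, sequence) tuple per FASTA record."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)
            header = line[1:]             # drop the leading ">"
            chunks = []
        elif header is not None:
            chunks.append(line)           # sequence may span several lines
    if header is not None:
        yield _make_record(header, chunks)


# Hypothetical two-record file held in memory instead of on disk.
example = io.StringIO(
    u">seq1 first example\nACGT\nACGT\n>seq2 second example\nGGCC\n"
)
records = list(parse_fasta(example))
```

Because parse_fasta is a generator, records are produced one at a time, which is the same usage pattern as the for loops over SeqIO.parse.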
Sequences and Alphabets
- Sequence objects are more than just strings
- Bio.Alphabet.IUPAC provides basic DNA, RNA, and protein functionality
  - ExtendedIUPACProtein class
    - U/Sec for selenocysteine
    - O/Pyl for pyrrolysine
    - J/Xle for leucine or isoleucine
    - Z/Glx for glutamine or glutamic acid
    - X/Xxx for unknown amino acid
  - IUPACAmbiguousRNA or IUPACUnambiguousRNA
  - IUPACAmbiguousDNA or IUPACUnambiguousDNA
  - ExtendedIUPAC[DNA/RNA] for modified bases
Sequences and Alphabets (cont.)
- Sequences act like strings
    >>> from Bio.Seq import Seq
    >>> from Bio.Alphabet import IUPAC
    >>> my_seq = Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC',
    ...              IUPAC.unambiguous_dna)
    >>> my_seq = Seq("GATCG", IUPAC.unambiguous_dna)
    >>> for index, letter in enumerate(my_seq):
    ...     print index, letter
Transcription
- Biopython can simplify the mundane tasks
    >>> from Bio.Seq import Seq
    >>> from Bio.Alphabet import IUPAC
    >>> coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG",
    ...                  IUPAC.unambiguous_dna)
    >>> coding_dna
    >>> template_dna = coding_dna.reverse_complement()
    >>> template_dna
    >>> messenger_rna = coding_dna.transcribe()
    >>> messenger_rna
    >>> template_dna.reverse_complement().transcribe()
    >>> messenger_rna.back_transcribe()
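
Starting from the coding strand, transcription amounts to replacing T with U, and back-transcription reverses that. The sketch below shows this on plain strings; it is only an illustration of the idea, and the two function names are invented here rather than taken from Biopython.

```python
def transcribe(coding_dna):
    """Coding-strand DNA to mRNA: same letters, with T replaced by U."""
    return coding_dna.replace("T", "U")


def back_transcribe(rna):
    """mRNA back to coding-strand DNA: U replaced by T."""
    return rna.replace("U", "T")


coding_dna = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"   # sequence from the slide
messenger_rna = transcribe(coding_dna)
round_trip = back_transcribe(messenger_rna)
```

The round trip returns the original coding strand, which is what the last two lines of the slide check interactively.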
Translation
- Now we can generate the corresponding amino acid sequence
    >>> messenger_rna.translate()
- You can even do it directly from the DNA sequence
    >>> coding_dna.translate()
    >>> coding_dna.translate(table="Vertebrate Mitochondrial")
    >>> coding_dna.translate(table=2, to_stop=True)
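
The effect of the table and to_stop arguments can be seen in a small pure-Python sketch. It builds the standard genetic code from the usual compact 64-letter string and derives the vertebrate mitochondrial code by overriding the four codons that differ. This is a simplified illustration, not Biopython's translate; the names STANDARD, VERTEBRATE_MITO and translate are chosen for this example, and only complete upper-case DNA codons are handled.

```python
from itertools import product

# Standard genetic code; codons are ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Vertebrate mitochondrial code: four codons differ from the standard code.
VERTEBRATE_MITO = dict(STANDARD, AGA="*", AGG="*", ATA="M", TGA="W")


def translate(dna, table=STANDARD, to_stop=False):
    """Translate DNA codon by codon; '*' marks a stop codon."""
    protein = []
    usable = len(dna) - len(dna) % 3      # ignore a trailing partial codon
    for i in range(0, usable, 3):
        amino_acid = table[dna[i:i + 3]]
        if amino_acid == "*" and to_stop:
            break                         # stop at the first in-frame stop
        protein.append(amino_acid)
    return "".join(protein)


coding_dna = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
standard_protein = translate(coding_dna)
mito_protein = translate(coding_dna, table=VERTEBRATE_MITO)
mito_to_stop = translate(coding_dna, table=VERTEBRATE_MITO, to_stop=True)
```

The internal TGA codon is a stop in the standard code but tryptophan (W) in the mitochondrial code, so the two tables give different proteins for the same DNA.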
Translation Tables
    >>> from Bio.Data import CodonTable
    >>> standard_table = CodonTable.unambiguous_dna_by_name["Standard"]
    >>> mito_table = CodonTable.unambiguous_dna_by_name["Vertebrate Mi
    >>> print standard_table
    >>> print mito_table
    >>> mito_table.stop_codons
    >>> mito_table.start_codons
    >>> mito_table.forward_table["ACG"]
Mutable Sequence Objects
Try:
    >>> from Bio.Seq import Seq
    >>> from Bio.Alphabet import IUPAC
    >>> my_seq = Seq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA", IUPAC.unambi
    >>> my_seq[5] = "G"
Did it work?
Now try:
    >>> mutable_seq = my_seq.tomutable()
    >>> mutable_seq
    >>> mutable_seq[5] = "G"
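
Plain Python strings show the same pattern and make a useful analogy: a str cannot be changed in place, while a mutable copy can. The sketch below uses only built-in types and is not a statement about how Biopython implements its sequence classes; the variable names are chosen for this example.

```python
my_seq = "GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA"   # sequence from the slide

# Item assignment on an immutable string is rejected.
try:
    my_seq[5] = "G"
    assignment_worked = True
except TypeError:
    assignment_worked = False

# Work on a mutable copy instead, then convert back when finished.
mutable_seq = list(my_seq)
mutable_seq[5] = "G"                # position 5 was "T"
new_seq = "".join(mutable_seq)
```

The original string is untouched; only the mutable copy carries the change.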
Exercise
- Create a sequence
- Translate the sequence using the standard table
- Change two of the sequence elements
- Translate the new sequence
The SeqRecord Object
- Defined in the Bio.SeqRecord module
- Easy to create from a FASTA file
    >>> from Bio import SeqIO
    >>> record = SeqIO.read("data/NC_005816.fna", "fasta")
    >>> record
    >>> record.seq
    >>> record.id
    >>> record.name
    >>> record.description
- Now try with a GenBank file
    >>> from Bio import SeqIO
    >>> record = SeqIO.read("data/NC_005816.gb", "genbank")
    >>> record
    >>> record.seq
    >>> print record.features[20]
    >>> print record.features[21]
    >>> rc = record.reverse_complement(id="TESTING")
    >>> print rc.id, len(rc), len(rc.features), len(rc.dbxrefs), len(rc.annotations)
Restriction Enzymes
Biopython has over 600 built-in restriction enzymes
    >>> from Bio.Restriction import Restriction
    >>> print Restriction.Sau3AI.site
    >>> digest = Restriction.ApaI.catalyse(record.seq)
    >>> print "Number of fragments is", len(digest)
    >>> print digest
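
Conceptually, a digest finds every occurrence of the recognition site and cuts the sequence at a fixed offset within it. The sketch below illustrates that idea on a plain string. It assumes a site of GATC cut immediately before the G, which is the usual description of Sau3AI; the function name, the cut_offset parameter and the test sequence are hypothetical, and only one strand of a linear molecule is considered.

```python
def digest_linear(sequence, site, cut_offset):
    """Cut a linear sequence at every occurrence of `site`.

    `cut_offset` is the number of site bases left on the upstream
    fragment (0 cuts immediately before the site).
    """
    fragments = []
    start = 0
    pos = sequence.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(sequence[start:cut])
        start = cut
        pos = sequence.find(site, pos + 1)   # look for the next site
    fragments.append(sequence[start:])       # whatever follows the last cut
    return fragments


# Hypothetical 16-base sequence containing two GATC sites.
fragments = digest_linear("AAGATCTTTTGATCCC", "GATC", 0)
```

Joining the fragments gives back the input, which is a quick sanity check on any digest routine.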
Reading Sequence Files
    from Bio import SeqIO
    for seq_record in SeqIO.parse("data/ls_orchid.fasta", "fasta"):
        print seq_record.id
        print repr(seq_record.seq)
        print len(seq_record)
or
    for seq_record in SeqIO.parse("ls_orchid.gbk", "genbank"):
        print seq_record.id
        print seq_record.seq
        print len(seq_record)
Iterating Over Records in a Sequence File
    >>> record_iterator = SeqIO.parse("ls_orchid.fasta", "fasta")
    >>> first_record = record_iterator.next()
    >>> print first_record.id
    >>> print first_record.description
    >>> second_record = record_iterator.next()
    >>> print second_record.id
    >>> print second_record.description
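
Stepping through records by hand works the same way for any Python iterator. The sketch below uses a small generator with two made-up records to show the pattern. It calls the built-in next() function, which is available from Python 2.6 onward and in Python 3, whereas the .next() method exists only in Python 2.

```python
def hypothetical_records():
    """Stand-in for a record iterator; yields (id, description) pairs."""
    yield ("rec1", "first hypothetical record")
    yield ("rec2", "second hypothetical record")


record_iterator = hypothetical_records()
first_record = next(record_iterator)      # advances to the first record
second_record = next(record_iterator)     # advances to the second record

# Once the iterator is exhausted, next() with a default avoids StopIteration.
third_record = next(record_iterator, None)
```

Each call consumes one record, so an iterator cannot be rewound; create a new one to start over.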
Getting Data from Entrez
    >>> from Bio import Entrez, SeqIO
    >>> Entrez.email = "amf@pitt.edu"
    >>> handle = Entrez.efetch(db="nucleotide", rettype="fasta",
    ...                        retmode="text", id="6273291")
    >>> seq_record = SeqIO.read(handle, "fasta")
    >>> handle.close()
    >>> print "%s with %i features" % (seq_record.id, len(seq_record.features))
Getting Data from SwissProt
    >>> from Bio import ExPASy, SeqIO
    >>> handle = ExPASy.get_sprot_raw("O23729")
    >>> seq_record = SeqIO.read(handle, "swiss")
    >>> handle.close()
    >>> print seq_record.id
    >>> print seq_record.name
    >>> print seq_record.description
    >>> print repr(seq_record.seq)
    >>> print "Length %i" % len(seq_record)
    >>> print seq_record.annotations["keywords"]
Writing Sequence Files
    from data.sample_seqs import *
    SeqIO.write(my_records, "my_example.faa", "fasta")
Converting Between Sequence File Formats
Explicit
    from Bio import SeqIO
    records = SeqIO.parse("ls_orchid.gbk", "genbank")
    count = SeqIO.write(records, "my_example.fasta", "fasta")
    print "Converted %i records" % count
Using Biopython
    help(SeqIO.convert)
    count = SeqIO.convert("ls_orchid.gbk", "genbank", "my_example.fast
    print "Converted %i records" % count
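
Both approaches hand back a record count after writing. As a rough illustration of what the writing half of a conversion involves, the sketch below writes (identifier, sequence) pairs as FASTA text with wrapped lines and returns how many records it wrote. It is not Biopython's writer; the function name, the width parameter and the sample records are hypothetical.

```python
import io


def write_fasta(records, handle, width=60):
    """Write (identifier, sequence) pairs as FASTA; return the record count."""
    count = 0
    for identifier, sequence in records:
        handle.write(u">%s\n" % identifier)
        # Wrap the sequence so that no line exceeds `width` letters.
        for i in range(0, len(sequence), width):
            handle.write(sequence[i:i + width] + u"\n")
        count += 1
    return count


# Hypothetical records written to an in-memory buffer instead of a file.
output = io.StringIO()
count = write_fasta([(u"a", u"ACGTAC"), (u"b", u"GG")], output, width=4)
fasta_text = output.getvalue()
```

Because the function accepts any iterable, it can consume records lazily, one at a time, without holding the whole input in memory.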
Network-based NCBI BLAST
We can download data directly from NCBI in FASTA format
    >>> from Bio.Blast import NCBIWWW
    >>> help(NCBIWWW.qblast)
    >>> result_handle = NCBIWWW.qblast("blastn", "nt", "8332116")
    >>> fasta_string = open("m_cold.fasta").read()
    >>> result_handle = NCBIWWW.qblast("blastn", "nt", fasta_string)
Could also read in the FASTA file as a SeqRecord
    >>> from Bio.Blast import NCBIWWW
    >>> from Bio import SeqIO
    >>> record = SeqIO.read("m_cold.fasta", format="fasta")
    >>> result_handle = NCBIWWW.qblast("blastn", "nt", record.seq)
Network-based NCBI BLAST (cont.)
You can also use SeqRecord to make a FASTA string including the identifier
    >>> from Bio.Blast import NCBIWWW
    >>> from Bio import SeqIO
    >>> record = SeqIO.read("m_cold.fasta", format="fasta")
    >>> result_handle = NCBIWWW.qblast("blastn", "nt", record.format("
    >>> save_file = open("my_blast.xml", "w")
    >>> save_file.write(result_handle.read())
    >>> save_file.close()
    >>> result_handle.close()
my_blast.xml now has the results. We can parse it directly:
    >>> result_handle = open("my_blast.xml")
    >>> from Bio.Blast import NCBIXML
    >>> blast_records = NCBIXML.parse(result_handle)
Graphics from Biopython
Biopython uses the ReportLab module to render graphics directly
    from reportlab.lib import colors
    from reportlab.lib.units import cm
    from Bio.Graphics import GenomeDiagram
    from Bio import SeqIO
    record = SeqIO.read("data/NC_005816.gb", "genbank")
Next we create an empty track and feature set
    gd_diagram = GenomeDiagram.Diagram(
        "Yersinia pestis biovar Microtus plasmid pPCP1")
    gd_track_for_features = gd_diagram.new_track(1,
                                                 name="Annotated Features")
    gd_feature_set = gd_track_for_features.new_set()
Graphics from Biopython (cont.)
Next we extract the features and color them blue and lightblue
    for feature in record.features:
        if feature.type != "gene":
            # Exclude this feature
            continue
        if len(gd_feature_set) % 2 == 0:
            color = colors.blue
        else:
            color = colors.lightblue
        gd_feature_set.add_feature(feature, color=color,
                                   label=True)
Graphics from Biopython (cont.)
Now we create the drawing object
    gd_diagram.draw(format="linear", orientation="landscape",
                    pagesize="A4", fragments=4, start=0,
                    end=len(record))
We can write it in nearly any format we choose
    gd_diagram.write("plasmid_linear.pdf", "PDF")
    gd_diagram.write("plasmid_linear.eps", "EPS")
    gd_diagram.write("plasmid_linear.svg", "SVG")
    gd_diagram.write("plasmid_linear.png", "PNG")
Let’s make a circular one:
    gd_diagram.draw(format="circular", circular=True,
                    pagesize=(20*cm, 20*cm), start=0, end=len(record))
    gd_diagram.write("plasmid_circular.pdf", "PDF")
Other Capabilities
- Read/Write PDB files (Bio.PDB)
- Population Genetics (Bio.PopGen)
- Phylogenetics (Bio.Phylo)
- Sequence motif analysis (Bio.motifs)
- Cluster Analysis (Bio.Cluster)
- Supervised Learning (from Bio import ...)
  - Logistic Regression (LogisticRegression)
  - k-nearest neighbors (kNN)
  - Naive Bayes (Bio.NaiveBayes)
  - Maximum Entropy (Bio.MaximumEntropy)
  - Markov Models (Bio.MarkovModel and/or Bio.HMM.MarkovModel)
Further Information
- http://elbo.gs.washington.edu/courses/GS_559_11_wi/
- http://biopython.org/wiki/Category:Cookbook
- http://biopython.org/DIST/docs/tutorial/Tutorial.html
- http://www.bio-cloud.info/Biopython/en/index.html
Feel free to contact me for further help: amf@pitt.edu