Python for Scientific Computing
Lecture 8: Biopython

Antonio M. Ferreira, PhD
Center for Simulation and Modeling
October 2, 2013


What is Biopython?

http://www.biopython.org

- Automatically parses files into Python data structures:
    - BLAST
    - Clustalw
    - FASTA
    - PubMed and Medline
    - SwissProt
    - UniGene
    - ExPASy files
    - SCOP (including 'dom' and 'lin' files)
- Interfaces for calling popular bioinformatics tools:
    - BLAST (network and standalone versions)
    - Clustalw
    - EMBOSS
    - BioSQL
    - Many others

Basically, we just like to program in Python and want to make it as easy as possible to use Python for bioinformatics by creating high-quality, reusable modules and scripts.


Getting Started

- Get the workshop data

    $ cd ~
    $ mkdir -p biopython/data
    $ cd biopython/data
    $ cp -v /pan/genomics/data/Biopython_workshop/* .

- Set up your environment

    $ module purge
    $ module load genomics/ver6

- Now you can start Python

    $ cd ~/biopython
    $ python
    Enthought Python Distribution -- www.enthought.com
    Version: 7.2-2 (64-bit)

    Python 2.7.2 |EPD 7.2-2 (64-bit)| (default, Jul 3 2011, 15:17:51)
    [GCC 4.1.2 20080704 (Red Hat 4.1.2-44)] on linux2
    Type "packages", "demo" or "enthought" for more information.
    >>>


Quick Start

- Let's have some fun with sequences

    >>> from Bio.Seq import Seq
    >>> my_seq = Seq("AGTACACTGGT")
    >>> type(my_seq)
    >>> my_seq
    >>> print my_seq
    >>> my_seq.alphabet
    >>> my_seq.complement()
    >>> my_seq.reverse_complement()
    >>> print my_seq.lower()
    >>> print my_seq.upper()

- Now let's play with a file

    >>> from Bio import SeqIO
    >>> for seq_record in SeqIO.parse("data/ls_orchid.fasta", "fasta"):
    ...     print seq_record.id
    ...     print repr(seq_record.seq)
    ...     print len(seq_record)

- How about a GenBank file?

    >>> for seq_record in SeqIO.parse("data/ls_orchid.gbk", "genbank"):
    ...     print seq_record.id
    ...     print repr(seq_record.seq)
    ...     print len(seq_record)


Sequences and Alphabets

- Sequence objects are more than just strings
- Bio.Alphabet.IUPAC provides basic DNA, RNA, and protein functionality
    - ExtendedIUPACProtein class
        - U/Sec for selenocysteine
        - O/Pyl for pyrrolysine
        - J/Xle for leucine or isoleucine
        - Z/Glx for glutamine or glutamic acid
        - X/Xxx for unknown amino acid
    - IUPACAmbiguousRNA or IUPACUnambiguousRNA
    - IUPACAmbiguousDNA or IUPACUnambiguousDNA
    - ExtendedIUPAC[DNA/RNA] for modified bases
- Sequences act like strings

    >>> from Bio.Seq import Seq
    >>> from Bio.Alphabet import IUPAC
    >>> my_seq = Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPAC.unambiguous_dna)
    >>> my_seq = Seq("GATCG", IUPAC.unambiguous_dna)
    >>> for index, letter in enumerate(my_seq):
    ...     print index, letter


Transcription

- Biopython can simplify the mundane tasks

    >>> from Bio.Seq import Seq
    >>> from Bio.Alphabet import IUPAC
    >>> coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna)
    >>> coding_dna
    >>> template_dna = coding_dna.reverse_complement()
    >>> template_dna
    >>> messenger_rna = coding_dna.transcribe()
    >>> messenger_rna
    >>> template_dna.reverse_complement().transcribe()
    >>> messenger_rna.back_transcribe()


Translation

- Now we can generate the corresponding amino acid sequence

    >>> messenger_rna.translate()

- You can even do it directly from the DNA sequence

    >>> coding_dna.translate()
    >>> coding_dna.translate(table="Vertebrate Mitochondrial")
    >>> coding_dna.translate(table=2, to_stop=True)


Translation Tables

    >>> from Bio.Data import CodonTable
    >>> standard_table = CodonTable.unambiguous_dna_by_name["Standard"]
    >>> mito_table = CodonTable.unambiguous_dna_by_name["Vertebrate Mitochondrial"]
    >>> print standard_table
    >>> print mito_table
    >>> mito_table.stop_codons
    >>> mito_table.start_codons
    >>> mito_table.forward_table["ACG"]


Mutable Sequence Objects

Try:

    >>> from Bio.Seq import Seq
    >>> from Bio.Alphabet import IUPAC
    >>> my_seq = Seq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA", IUPAC.unambiguous_dna)
    >>> my_seq[5] = "G"

Did it work? Seq objects are immutable, like Python strings, so the assignment fails.

Now try:

    >>> mutable_seq = my_seq.tomutable()
    >>> mutable_seq
    >>> mutable_seq[5] = "G"


Exercise

- Create a sequence
- Translate the sequence using the standard table
- Change two of the sequence elements
- Translate the new sequence


The SeqRecord Object

- Defined in the Bio.SeqRecord module
- Easy to create from a FASTA file

    >>> from Bio import SeqIO
    >>> record = SeqIO.read("data/NC_005816.fna", "fasta")
    >>> record
    >>> record.seq
    >>> record.id
    >>> record.name
    >>> record.description

- Now try with a GenBank file

    >>> from Bio import SeqIO
    >>> record = SeqIO.read("data/NC_005816.gb", "genbank")
    >>> record
    >>> record.seq
    >>> print record.features[20]
    >>> print record.features[21]
    >>> rc = record.reverse_complement(id="TESTING")
    >>> print rc.id, len(rc), len(rc.features), len(rc.dbxrefs), len(rc.annotations)


Restriction Enzymes

Biopython has over 600 built-in restriction enzymes. The example below digests the GenBank record loaded on the previous slide.

    >>> from Bio.Restriction import Restriction
    >>> print Restriction.Sau3AI.site
    >>> digest = Restriction.ApaI.catalyse(record.seq)
    >>> print "Number of fragments is", len(digest)
    >>> print digest


Reading Sequence Files

    from Bio import SeqIO
    for seq_record in SeqIO.parse("data/ls_orchid.fasta", "fasta"):
        print seq_record.id
        print repr(seq_record.seq)
        print len(seq_record)

or

    for seq_record in SeqIO.parse("data/ls_orchid.gbk", "genbank"):
        print seq_record.id
        print seq_record.seq
        print len(seq_record)


Iterating Over Records in a Sequence File

    >>> record_iterator = SeqIO.parse("data/ls_orchid.fasta", "fasta")
    >>> first_record = record_iterator.next()
    >>> print first_record.id
    >>> print first_record.description
    >>> second_record = record_iterator.next()
    >>> print second_record.id
    >>> print second_record.description


Getting Data from Entrez

    >>> from Bio import Entrez
    >>> from Bio import SeqIO
    >>> Entrez.email = "amf@pitt.edu"
    >>> handle = Entrez.efetch(db="nucleotide", rettype="fasta",
    ...                        retmode="text", id="6273291")
    >>> seq_record = SeqIO.read(handle, "fasta")
    >>> handle.close()
    >>> print "%s with %i features" % (seq_record.id, len(seq_record.features))


Getting Data from SwissProt

    >>> from Bio import ExPASy
    >>> from Bio import SeqIO
    >>> handle = ExPASy.get_sprot_raw("O23729")
    >>> seq_record = SeqIO.read(handle, "swiss")
    >>> handle.close()
    >>> print seq_record.id
    >>> print seq_record.name
    >>> print seq_record.description
    >>> print repr(seq_record.seq)
    >>> print "Length %i" % len(seq_record)
    >>> print seq_record.annotations["keywords"]


Writing Sequence Files

    from Bio import SeqIO
    # Workshop module; assumed to define the list my_records
    from data.sample_seqs import *
    SeqIO.write(my_records, "my_example.faa", "fasta")


Converting Between Sequence File Formats

Explicit:

    from Bio import SeqIO
    records = SeqIO.parse("data/ls_orchid.gbk", "genbank")
    count = SeqIO.write(records, "my_example.fasta", "fasta")
    print "Converted %i records" % count

Using Biopython's built-in converter:

    help(SeqIO.convert)
    count = SeqIO.convert("data/ls_orchid.gbk", "genbank", "my_example.fasta", "fasta")
    print "Converted %i records" % count


Network-based NCBI BLAST

We can query NCBI BLAST over the network, using either a GI number or the contents of a FASTA file as the query

    >>> from Bio.Blast import NCBIWWW
    >>> help(NCBIWWW.qblast)
    >>> result_handle = NCBIWWW.qblast("blastn", "nt", "8332116")
    >>> fasta_string = open("m_cold.fasta").read()
    >>> result_handle = NCBIWWW.qblast("blastn", "nt", fasta_string)

Could also read in the FASTA file as a SeqRecord

    >>> from Bio.Blast import NCBIWWW
    >>> from Bio import SeqIO
    >>> record = SeqIO.read("m_cold.fasta", format="fasta")
    >>> result_handle = NCBIWWW.qblast("blastn", "nt", record.seq)


Network-based NCBI BLAST (cont.)

You can also use SeqRecord to make a FASTA string including the identifier

    >>> from Bio.Blast import NCBIWWW
    >>> from Bio import SeqIO
    >>> record = SeqIO.read("m_cold.fasta", format="fasta")
    >>> result_handle = NCBIWWW.qblast("blastn", "nt", record.format("fasta"))
    >>> save_file = open("my_blast.xml", "w")
    >>> save_file.write(result_handle.read())
    >>> save_file.close()
    >>> result_handle.close()

my_blast.xml now has the results. We can parse it directly:

    >>> result_handle = open("my_blast.xml")
    >>> from Bio.Blast import NCBIXML
    >>> blast_records = NCBIXML.parse(result_handle)


Graphics from Biopython

Biopython uses the ReportLab module to render graphics directly

    from reportlab.lib import colors
    from reportlab.lib.units import cm
    from Bio.Graphics import GenomeDiagram
    from Bio import SeqIO
    record = SeqIO.read("data/NC_005816.gb", "genbank")

Next we create an empty track and feature set

    gd_diagram = GenomeDiagram.Diagram("Yersinia pestis biovar Microtus plasmid pPCP1")
    gd_track_for_features = gd_diagram.new_track(1, name="Annotated Features")
    gd_feature_set = gd_track_for_features.new_set()


Graphics from Biopython (cont.)

Next we extract the gene features and color them alternately blue and light blue

    for feature in record.features:
        if feature.type != "gene":
            # Exclude this feature
            continue
        if len(gd_feature_set) % 2 == 0:
            color = colors.blue
        else:
            color = colors.lightblue
        gd_feature_set.add_feature(feature, color=color, label=True)


Graphics from Biopython (cont.)

Now we create the drawing object

    gd_diagram.draw(format="linear", orientation="landscape", pagesize="A4",
                    fragments=4, start=0, end=len(record))

We can write it in nearly any format we choose

    gd_diagram.write("plasmid_linear.pdf", "PDF")
    gd_diagram.write("plasmid_linear.eps", "EPS")
    gd_diagram.write("plasmid_linear.svg", "SVG")
    gd_diagram.write("plasmid_linear.png", "PNG")

Let's make a circular one:

    gd_diagram.draw(format="circular", circular=True, pagesize=(20*cm, 20*cm),
                    start=0, end=len(record))
    gd_diagram.write("plasmid_circular.pdf", "PDF")


Other Capabilities

- Read/Write PDB files (Bio.PDB)
- Population Genetics (Bio.PopGen)
- Phylogenetics (Bio.Phylo)
- Sequence motif analysis (Bio.motifs)
- Cluster Analysis (Bio.Cluster)
- Supervised Learning (from Bio import ...)
    - Logistic Regression (LogisticRegression)
    - k-nearest neighbors (kNN)
    - Naive Bayes (Bio.NaiveBayes)
    - Maximum Entropy (Bio.MaximumEntropy)
    - Markov Models (Bio.MarkovModel and/or Bio.HMM.MarkovModel)


Further Information

- http://elbo.gs.washington.edu/courses/GS_559_11_wi/
- http://biopython.org/wiki/Category:Cookbook
- http://biopython.org/DIST/docs/tutorial/Tutorial.html
- http://www.bio-cloud.info/Biopython/en/index.html

Feel free to contact me for further help: amf@pitt.edu
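
To make the Transcription and Translation slides more concrete, here is a minimal pure-Python sketch of what complement(), reverse_complement(), transcribe(), back_transcribe(), and translate() compute. It is not Biopython's implementation: every function and table name below is a hypothetical stand-in, it assumes unambiguous DNA or RNA letters only, and it targets Python 3 with nothing but the standard library. The vertebrate mitochondrial table is built from the standard one by applying the four codon changes that distinguish NCBI table 2 from table 1.

```python
# Illustrative stand-ins for the Seq methods used in the slides.
# Hypothetical helpers, not Biopython code; unambiguous letters only.
from itertools import product

# Watson-Crick pairs for unambiguous DNA.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

# Standard genetic code (NCBI table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"   # first base T
    "LLLLPPPPHHQQRRRR"   # first base C
    "IIIMTTTTNNKKSSRR"   # first base A
    "VVVVAAAADDEEGGGG"   # first base G
)
STANDARD_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Vertebrate mitochondrial code (NCBI table 2): four codons differ.
MITO_TABLE = dict(STANDARD_TABLE, AGA="*", AGG="*", ATA="M", TGA="W")


def complement(dna):
    """Swap each base for its pairing partner, keeping the order."""
    return "".join(COMPLEMENT[base] for base in dna)


def reverse_complement(dna):
    """Complement the strand, then read it in the opposite direction."""
    return complement(dna)[::-1]


def transcribe(dna):
    """Coding-strand DNA to mRNA: replace T with U."""
    return dna.replace("T", "U")


def back_transcribe(rna):
    """mRNA back to coding-strand DNA: replace U with T."""
    return rna.replace("U", "T")


def translate(seq, table=STANDARD_TABLE, to_stop=False):
    """Translate DNA or RNA codon by codon; '*' marks a stop codon.

    A trailing partial codon is ignored. With to_stop=True, translation
    ends before the first stop codon.
    """
    dna = seq.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        amino_acid = table[dna[i:i + 3]]
        if amino_acid == "*" and to_stop:
            break
        protein.append(amino_acid)
    return "".join(protein)


coding_dna = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
print(reverse_complement("AGTACACTGGT"))               # ACCAGTGTACT
print(transcribe(coding_dna))
print(translate(coding_dna))                           # MAIVMGR*KGAR*
print(translate(coding_dna, MITO_TABLE))               # MAIVMGRWKGAR*
print(translate(coding_dna, MITO_TABLE, to_stop=True)) # MAIVMGRWKGAR
```

Running the same calls on a Seq object in the interactive session is a useful cross-check: the internal TGA codon is a stop under the standard table but codes for tryptophan under the vertebrate mitochondrial table, which is why the two translations differ at one position.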