Tel Aviv University
BioPython Workshop
Gershon Celniker
Introduction
• The Biopython Project is an international association of developers of freely available Python
(http://www.python.org) tools for computational molecular biology.
• Python is an object-oriented, interpreted, flexible language that is becoming increasingly
popular for scientific computing.
• Python is easy to learn, has a very clear syntax and can easily be extended with modules.
• The Biopython web site (http://www.biopython.org) provides an online resource for modules,
scripts, and web links for developers of Python-based software for bioinformatics use and
research.
• Basically, the goal of Biopython is to make it as easy as possible to use Python for
bioinformatics by creating high-quality, reusable modules and classes.
• Biopython features include parsers for various bioinformatics file formats (BLAST, Clustalw,
FASTA, GenBank, ...), access to online services (NCBI, ExPASy, ...), and interfaces to
external programs (Clustalw, DSSP, MSMS, ...).
• Basically, we just like to program in Python and want to make it as easy as possible to use
Python for bioinformatics by creating high-quality, reusable modules and scripts.
Introduction
• The full tutorial is located here:
• http://biopython.org/DIST/docs/tutorial/Tutorial.html
• Example files are located here:
• https://github.com/biopython/biopython/tree/master/Doc/examples
BioPython, let's try it!
FASTA format
http://en.wikipedia.org/wiki/FASTA_format
FASTA is pronounced "fast A" and stands for "FAST-All", because it works with any alphabet. It is an extension of the "FAST-P" (protein) and
"FAST-N" (nucleotide) alignment tools.
Let's write our first parsing script
Parsing sequence file formats
A search for Cypripedioideae (the subfamily of lady slipper orchids) gave only 94 hits,
which I saved as a FASTA file, ls_orchid.fasta:
>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACG
ATCGAGTGAATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTG
ATTTGTTGTTGGG
Notice that the FASTA format does not specify the alphabet, so Bio.SeqIO has defaulted to the rather generic
SingleLetterAlphabet() instead of something DNA-specific.
Let's write our first parsing script
Output:
gi|2765658|emb|Z78533.1|CIZ78533
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', SingleLetterAlphabet())
740
...
gi|2765564|emb|Z78439.1|PBZ78439
Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACT...GCC', SingleLetterAlphabet())
592
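The script shown on this slide is not preserved in the text. A minimal sketch of a parsing loop that would produce a listing like the one above, assuming Biopython is installed: the helper name summarize_fasta and the in-memory excerpt are illustrative and not from the original slides. Recent Biopython releases no longer attach alphabets to sequences, so repr() there prints Seq('...') without SingleLetterAlphabet().

```python
from io import StringIO

from Bio import SeqIO

# Excerpt of the first ls_orchid.fasta record, truncated as on the slide.
FASTA_EXCERPT = """>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACG
ATCGAGTGAATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTG
ATTTGTTGTTGGG
"""


def summarize_fasta(source):
    """Return (id, sequence, length) for every record in a FASTA file or handle.

    Hypothetical helper: `source` may be a file name or an open text handle.
    """
    return [(rec.id, rec.seq, len(rec)) for rec in SeqIO.parse(source, "fasta")]


summaries = summarize_fasta(StringIO(FASTA_EXCERPT))
for record_id, seq, length in summaries:
    print(record_id)   # identifier: the header text up to the first space
    print(repr(seq))   # the Seq object
    print(length)      # number of letters in the sequence
```

Passing the file name "ls_orchid.fasta" instead of the StringIO handle should iterate over all 94 records of the full file.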
Sequence slicing
Output:
gi|2765658|emb|Z78533.1|CIZ78533
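The slicing code itself is not preserved here. As a hedged sketch, the parsed records can be collected into a list and indexed like any Python list, and a Seq object can be sliced like a string. The one-line excerpt and the variable names below are assumptions for illustration.

```python
from io import StringIO

from Bio import SeqIO

# First line of the first ls_orchid.fasta record (illustrative excerpt only).
FASTA_EXCERPT = (
    ">gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA\n"
    "CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACG\n"
)

# Turn the record iterator into a list so records can be indexed.
records = list(SeqIO.parse(StringIO(FASTA_EXCERPT), "fasta"))
first_record = records[0]
print(first_record.id)

# Seq objects support the usual Python slice syntax and return new Seq objects.
first_ten = first_record.seq[:10]       # first ten letters
last_five = first_record.seq[-5:]       # last five letters
every_third = first_record.seq[0::3]    # first letter of each codon-sized step
reversed_seq = first_record.seq[::-1]   # reversed (not complemented)
print(first_ten, last_five, every_third, reversed_seq)
```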
GC content exercise
Output:
My seq length:
32
G:
9
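One way to work the exercise, as a sketch rather than the workshop's own solution: count G and C with Seq.count() and divide by the sequence length. The example sequence is an assumption; it was chosen because it has length 32 and contains 9 G's, matching the output above.

```python
from Bio.Seq import Seq

# Assumed example sequence (32 letters, 9 of them G).
my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC")

g_count = my_seq.count("G")
c_count = my_seq.count("C")

# GC content as a percentage of all letters in the sequence.
gc_percent = 100 * (g_count + c_count) / len(my_seq)

print("My seq length:")
print(len(my_seq))
print("G:")
print(g_count)
print("GC content (%):")
print(gc_percent)
```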
Transcription
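The slide's code is not preserved. A minimal sketch of transcription with Bio.Seq, assuming a short coding-strand example sequence that is not taken from the workshop: transcribe() replaces T with U on the coding strand, and back_transcribe() reverses that.

```python
from Bio.Seq import Seq

# Assumed example: a short coding-strand DNA sequence.
coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")

# Transcription of the coding strand: T becomes U.
messenger_rna = coding_dna.transcribe()

# The template strand is the reverse complement of the coding strand.
template_dna = coding_dna.reverse_complement()

# Back-transcription recovers the coding strand from the mRNA.
recovered_dna = messenger_rna.back_transcribe()

print(messenger_rna)
print(template_dna)
print(recovered_dna)
```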
Translation
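The slide's code is not preserved. As a hedged sketch, Seq.translate() converts a nucleotide sequence into protein using the standard genetic code by default; the example sequence is an assumption. Stop codons appear as "*", and to_stop=True ends the protein at the first in-frame stop.

```python
from Bio.Seq import Seq

# Assumed example: a short coding-strand DNA sequence with an internal stop codon.
coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")

# Standard genetic code; stop codons are shown as "*".
protein = coding_dna.translate()

# Stop translating at the first in-frame stop codon.
protein_to_stop = coding_dna.translate(to_stop=True)

# A different genetic code can be selected by name or by NCBI table number.
mito_protein = coding_dna.translate(table="Vertebrate Mitochondrial")
mito_protein_by_id = coding_dna.translate(table=2)

print(protein)
print(protein_to_stop)
print(mito_protein)
```

In the vertebrate mitochondrial code TGA encodes tryptophan rather than a stop, which is why the two results differ at that codon.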
Translation tables
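The table shown on the slide is not preserved. A small sketch of how the genetic code tables can be inspected through Bio.Data.CodonTable, looked up either by name or by NCBI table number; the variable names are illustrative.

```python
from Bio.Data import CodonTable

# Look up tables by name or by NCBI identifier.
standard_table = CodonTable.unambiguous_dna_by_name["Standard"]
mito_table = CodonTable.unambiguous_dna_by_id[2]  # Vertebrate Mitochondrial

# Each table lists its start and stop codons.
print(standard_table.start_codons)
print(standard_table.stop_codons)
print(mito_table.stop_codons)

# forward_table maps each sense codon to its one-letter amino acid.
tga_in_mito = mito_table.forward_table["TGA"]
print(tga_in_mito)
```

Calling print(standard_table) should display the full codon grid for that table.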
Translation – continued
Retrieving data from the net
Output:
O23729
CHS3_BROFI
RecName: Full=Chalcone synthase 3; EC=2.3.1.74; AltName: Full=Naringenin-chalcone synthase 3;
Seq('MAPAMEEIRQAQRAEGPAAVLAIGTSTPPNALYQADYPDYYFRITKSEHLTELK...GAE', ProteinAlphabet())
Length 394
['Acyltransferase', 'Flavonoid biosynthesis', 'Transferase']
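The retrieval code is not preserved on this slide. A hedged sketch of how a Swiss-Prot entry such as O23729 could be fetched and summarized with Bio.ExPASy and Bio.SeqIO: the helper names fetch_swissprot_record and describe are hypothetical, and the download step needs network access, so it is only shown as a comment.

```python
from Bio import ExPASy, SeqIO


def fetch_swissprot_record(accession):
    """Download one Swiss-Prot entry and parse it into a SeqRecord.

    Hypothetical helper; requires network access when called.
    """
    with ExPASy.get_sprot_raw(accession) as handle:
        return SeqIO.read(handle, "swiss")


def describe(record):
    """Return the fields of a SeqRecord in the order of the output listing."""
    return [
        record.id,
        record.name,
        record.description,
        repr(record.seq),
        "Length %i" % len(record),
        record.annotations["keywords"],
    ]


# Usage (requires network access):
#   for field in describe(fetch_swissprot_record("O23729")):
#       print(field)
```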
Parsing data from fasta – part B
Alignment
Blast
Plots
Plots - result
Going 3D: The PDB module
Bio.