pptx

advertisement
Programming for Engineers in Python
Biopython
1
Classes
class <classname>:
statement_1
.
.
statement_n
 The methods of a class get the instance as the
first parameter
 traditionally named self
 The method __init__ is called upon object
construction (if available)
2
Classes
 Reminder: type = data representation + behavior.
 Classes are user-defined types.
class <classname>:
statement_1
.
.
statement_n
Like a mini-program:
• Variables.
• Function Definitions.
• Even arbitrary commands.
 Objects of a class are called class instances.
3
Classes – Attributes and Methods
class Vector2D:
def __init__ (self, x, y):
self.x, self.y = x, y
Instance
def size (self):
return (self.x ** 2 + self.y ** 2) ** 0.5
Methods
4
Attributes
(each instance
has its own copy)
Classes – Instantiate and Use
>>> v = Vector2D(3, 4) # Make instance.
>>> v
<__main__.Vector2D object at 0x00000000031A2828>
>>> v.size() # Call method on instance.
5.0
Example – Multimap
 A dictionary with more than one value for each
key
 We already needed it once or twice and used:
>>> lst = d.get(key, [ ])
>>> lst.append(value)
>>> d[key] = lst
 We will now write a new class that will be a
wrapper around a dict
 The class will have methods that allow us to
keep multiple values for each key
6
Multimap. partial code
class Multimap:
def __init__(self):
'''Create an empty Multimap'''
self.inner = inner
def get(self, key):
'''Return list of values associated with key'''
return self.inner.get(key, [])
def put(self, key, value):
'''Adds value to the list of values associated with key'''
value_list = self.get(key)
if value not in value_list:
value_list.append(value)
self.inner[key] = value_list
7
Multimap put_all and remove
def put_all(self, key, values):
for v in values:
self.put(key, v)
def remove(self, key, value):
value_list = self.get(key)
if value in value_list:
value_list.remove(value)
self.inner[key] = value_list
return True
return False
8
Multimap. Partial code
def __len__(self):
'''Returns the number of keys in the map'''
return len(self.inner)
def __str__(self):
'''Converts the map to a string'''
return str(self.inner)
def __cmp__(self, other):
'''Compares the map with another map'''
return self.inner.cmp(other)
def __contains__(self, key):
'''Returns True if key exists in the map'''
return self.has_key(k)
9
Multimap
 Use case – a dictionary of countries and their
cities:
>>> m = Multimap()
>>> m.put('Israel', 'Tel-Aviv')
>>> m.put('Israel', 'Jerusalem')
>>> m.put('France', 'Paris')
>>> m.put_all('England',('London', 'Manchester',
'Moscow'))
>>> m.remove('England', 'Moscow')
10
>>> print m.get('Israel')
['Tel-Aviv', 'Jerusalem']
11
Biopython
 An international association of developers of
freely available Python (http://www.python.org)
tools for computational molecular biology
 Provides tools for
 Parsing files (fasta, clustalw, GenBank,…)
 Interface to common softwares
 Operations on sequences
 Simple machine learning applications
 BLAST
 And many more
12
Installing Biopython
 Go to http://biopython.org/wiki/Download
 Windows
 Unix
 Select python 2.7
 NumPy is required
13
SeqIO
 The standard Sequence Input/Output interface for
BioPython
 Provides a simple uniform interface to input and
output assorted sequence file formats
 Deals with sequences as SeqRecord objects
 There is a sister interface Bio.AlignIO for working
directly with sequence alignment files as
Alignment objects
14
Parsing a FASTA file
# Parse a simple fasta file
from Bio import SeqIO
for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
print seq_record.id
print repr(seq_record.seq)
print len(seq_record)
Why repr and not str?
15
16
GenBank files
# genbank files
from Bio import SeqIO
for seq_record in SeqIO.parse("ls_orchid.gbk", "genbank"):
print seq_record
# added to print just one record example
break
17
GenBank files
from Bio import SeqIO
for seq_record in SeqIO.parse("ls_orchid.gbk", "genbank"):
print seq_record.id
print repr(seq_record.seq)
print len(seq_record)
18
Sequence objects
 Support similar methods as standard strings
 Provide additional methods
 Translate
 Reverse complement
 Support different alphabets
 AGTAGTTAAA can be
 DNA
 Protein
19
Sequences and alphabets
 Bio.Alphabet.IUPAC provides basic definitions for
proteins, DNA and RNA, but additionally provides
the ability to extend and customize the basic
definitions
 For example:
 Adding ambiguous symbols
 Adding special new characters
20
Example – generic alphabet
>>> from Bio.Seq import Seq
>>> my_seq = Seq("AGTACACTGGT")
>>> my_seq
Seq('AGTACACTGGT', Alphabet())
>>> my_seq.alphabet
Non-specific
Alphabet()
alphabet
21
Example – specific sequences
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("AGTACACTGGT", IUPAC.unambiguous_dna)
>>> my_seq
Seq('AGTACACTGGT', IUPACUnambiguousDNA())
>>> my_seq.alphabet
IUPACUnambiguousDNA()
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_prot = Seq("AGTACACTGGT", IUPAC.protein)
>>> my_prot
Seq('AGTACACTGGT', IUPACProtein())
>>> my_prot.alphabet
IUPACProtein()
22
Sequences act like strings
 Access elements
>>> print my_seq[0] #first letter
G
>>> print my_seq[2] #third letter
T
>>> print my_seq[-1] #last letter
G
 Count without overlaps
>>> from Bio.Seq import Seq
>>> "AAAA".count("AA")
2
>>> Seq("AAAA").count("AA")
2
23
Calculate GC content
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> from Bio.SeqUtils import GC
>>> my_seq =
Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC',
IUPAC.unambiguous_dna)
>>> GC(my_seq)
46.875
24
Slicing
 Simple slicing
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC",
IUPAC.unambiguous_dna)
>>> my_seq[4:12]
Seq('GATGGGCC', IUPACUnambiguousDNA())
 Start, stop, stride
>>> my_seq[0::3]
Seq('GCTGTAGTAAG', IUPACUnambiguousDNA())
>>> my_seq[1::3]
Seq('AGGCATGCATC', IUPACUnambiguousDNA())
>>> my_seq[2::3]
Seq('TAGCTAAGAC', IUPACUnambiguousDNA())
25
Concatenation
 Simple addition as in Python
 But, alphabets must fit
>>> from Bio.Alphabet import IUPAC
>>> from Bio.Seq import Seq
>>> protein_seq = Seq("EVRNAK", IUPAC.protein)
>>> dna_seq = Seq("ACGT", IUPAC.unambiguous_dna)
>>> protein_seq + dna_seq
Traceback (most recent call last):
…
26
Changing case
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_dna
>>> dna_seq = Seq("acgtACGT", generic_dna)
>>> dna_seq
Seq('acgtACGT', DNAAlphabet())
>>> dna_seq.upper()
Seq('ACGTACGT', DNAAlphabet())
>>> dna_seq.lower()
Seq('acgtacgt', DNAAlphabet())
27
Changing case
 Case is important for matching
>>> "GTAC" in dna_seq
False
>>> "GTAC" in dna_seq.upper()
True
 IUPAC names are upper case
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> dna_seq = Seq("ACGT", IUPAC.unambiguous_dna)
>>> dna_seq
Seq('ACGT', IUPACUnambiguousDNA())
>>> dna_seq.lower()
Seq('acgt', DNAAlphabet())
28
Reverse complement
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq =
Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC",
IUPAC.unambiguous_dna)
>>> my_seq.complement()
Seq('CTAGCTACCCGGATATATCCTAGCTTTTAGCG',
IUPACUnambiguousDNA())
>>> my_seq.reverse_complement()
Seq('GCGATTTTCGATCCTATATAGGCCCATCGATC',
IUPACUnambiguousDNA())
29
Transcription
>>> coding_dna =
Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATA
G", IUPAC.unambiguous_dna)
>>> template_dna = coding_dna.reverse_complement()
>>> messenger_rna = coding_dna.transcribe()
>>> messenger_rna
Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUA
G', IUPACUnambiguousRNA())
As you can see, all this does is switch T → U,
and adjust the alphabet.
30
Translation
Simple example
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> messenger_rna =
Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG",
IUPAC.unambiguous_rna)
>>> messenger_rna
Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG',
IUPACUnambiguousRNA())
>>> messenger_rna.translate()
Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))
31
Stop codon!
Translation from the DNA
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> coding_dna =
Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG",
IUPAC.unambiguous_dna)
>>> coding_dna
Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG',
IUPACUnambiguousDNA())
>>> coding_dna.translate()
Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))
32
Using different translation tables
 In several cases we may want to use different
translation tables
 Translation tables are given IDs in GenBank
(standard=1)
 Vertebrate Mitochondrial is table 2
 More details in
33
http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi
Using different translation tables
>>> coding_dna =
Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG",
IUPAC.unambiguous_dna)
>>> coding_dna.translate()
Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))
>>> coding_dna.translate(table="Vertebrate Mitochondrial")
Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*'))
>>> coding_dna.translate(table=2)
Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*'))
34
Translation tables in biopython
35
Translate up to the first stop in frame
>>> coding_dna.translate()
Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))
>>> coding_dna.translate(to_stop=True)
Seq('MAIVMGR', IUPACProtein())
>>> coding_dna.translate(table=2)
Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*'))
>>> coding_dna.translate(table=2, to_stop=True)
Seq('MAIVMGRWKGAR', IUPACProtein())
36
Comparing sequences
Standard “==“ comparison is done by comparing
the references (!), hence:
>>> seq1 = Seq("ACGT", IUPAC.unambiguous_dna)
>>> seq2 = Seq("ACGT", IUPAC.unambiguous_dna)
>>> seq1==seq2
Warning (from warnings module):
… FutureWarning: In future comparing Seq objects will use string
comparison (not object comparison). Incompatible alphabets will
trigger a warning (not an exception)… please use
str(seq1)==str(seq2) to make your code explicit and to avoid
this warning.
False
>>> seq1==seq1
True
37
Mutable vs. Immutable
 Like strings standard seq objects are immutable
 If you want to create a mutable object you need
to write it by either:
 Use the “tomutable()” method
 Use the mutable constructor
mutable_seq =
MutableSeq("GCCATTGTAATGGGCCGCTGAAAG
GGTGCCCGA", IUPAC.unambiguous_dna)
38
Unknown sequences example
 In many biological cases we deal with unknown
sequences
>>> from Bio.Seq import UnknownSeq
>>> from Bio.Alphabet import IUPAC
>>> unk_dna = UnknownSeq(20,
alphabet=IUPAC.ambiguous_dna)
>>> my_seq =
Seq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA",
IUPAC.unambiguous_dna)
>>> unk_dna+my_seq
Seq('NNNNNNNNNNNNNNNNNNNNGCCATTGTAATGGGC
CGCTGAAAGGGTGCCCGA', IUPACAmbiguousDNA())
39
MSA
40
Read MSA
 Use Bio.AlignIO.read(file, format)
 File – the file path
 Format support:
 “stockholm”
 “fasta”
 “clustal”
…
 Use help(AlignIO) for details
41
Example
 We want to parse this file from PFAM
42
Example
from Bio import AlignIO
alignment = AlignIO.read("PF05371.sth", "stockholm")
print alignment
43
Alignment object example
>>> from Bio import AlignIO
>>> alignment = AlignIO.read("PF05371_seed.sth", "stockholm")
>>> print alignment[1]
ID: Q9T0Q8_BPIKE/1-52
Name: Q9T0Q8_BPIKE
Description: Q9T0Q8_BPIKE/1-52
Number of features: 0
/start=1
/end=52
/accession=Q9T0Q8.1
Seq('AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVI
KLFKKFVSRA', SingleLetterAlphabet())
44
Alignment object example
>>> print "Alignment length %i" % alignment.get_alignment_length()
Alignment length 52
>>> for record in alignment:
print "%s - %s" % (record.seq, record.id)
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSSKA
- COATB_BPIKE/30-81
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVSRA
- Q9T0Q8_BPIKE/1-52
DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSSK
A - COATB_BPI22/32-83
AEGDDP---AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA COATB_BPM13/24-72
AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFASKA COATB_BPZJ2/1-49
AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA Q9T0Q9_BPFD/1-49
FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVS
RA - COATB_BPIF1/22-73
45
Cross-references example
Did you notice in the raw file above that several of the
sequences include database cross-references to the
PDB and the associated known secondary structure?
>>> for record in alignment:
if record.dbxrefs:
print record.id, record.dbxrefs
COATB_BPIKE/30-81 ['PDB; 1ifl ; 1-52;']
COATB_BPM13/24-72 ['PDB; 2cpb ; 1-49;', 'PDB; 2cps ; 1-49;']
Q9T0Q9_BPFD/1-49 ['PDB; 1nh4 A; 1-49;']
COATB_BPIF1/22-73 ['PDB; 1ifk ; 1-50;']
46
Comments
 Remember that almost all MSA formats are
supported
 When you have more than one MSA in your files
use AlignIO.parse()
 Common example is PHYLIP’s output
 Use AlignIO.parse("resampled.phy", "phylip")
 The result is an iterator object that contains all
MSAs
47
Write alignment to file
from Bio.Alphabet import generic_dna
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Align import MultipleSeqAlignment
align1 = MultipleSeqAlignment([
SeqRecord(Seq("ACTGCTAGCTAG", generic_dna), id="Alpha"),
SeqRecord(Seq("ACT-CTAGCTAG", generic_dna), id="Beta"),
SeqRecord(Seq("ACTGCTAGDTAG", generic_dna), id="Gamma"),])
from Bio import AlignIO
AlignIO.write(align1, "my_example.phy", "phylip")
48
3 12
Alpha
Beta
Gamma
39
Delta
Epislon
Zeta
3 13
Eta
Theta
ACTGCTAGCT AG
ACT-CTAGCT AG
ACTGCTAGDT AG
GTCAGC-AG
GACAGCTAG
GTCAGCTAG
ACTAGTACAG CTG
ACTAGTACAG CT-
Slicing
Alignments work like numpy matrices
>>> print alignment[2,6]
T
# You can pull out a single column as a string like this:
>>> print alignment[:,6]
TTT---T
>>> print alignment[3:6,:6]
SingleLetterAlphabet() alignment with 3 rows and 6 columns
AEGDDP COATB_BPM13/24-72
AEGDDP COATB_BPZJ2/1-49
AEGDDP Q9T0Q9_BPFD/1-49
49
>>> print alignment[:,:6]
SingleLetterAlphabet() alignment with 7 rows and 6 columns
AEPNAA COATB_BPIKE/30-81
AEPNAA Q9T0Q8_BPIKE/1-52
DGTSTA COATB_BPI22/32-83
AEGDDP COATB_BPM13/24-72
AEGDDP COATB_BPZJ2/1-49
AEGDDP Q9T0Q9_BPFD/1-49
FAADDA COATB_BPIF1/22-73
External applications
 How do we call MSA algorithms on unaligned set
of sequences?
 Biopython provides wrappers
 The idea:
 Create a command line object with the algorithm
options
 Invoke the command (Python uses subprocesses)
 Bio.Align.Applications module:
 >>> import Bio.Align.Applications
 >>> dir(Bio.Align.Applications)

50
['ClustalwCommandline', 'DialignCommandline', 'MafftCommandline',
'MuscleCommandline', 'PrankCommandline', 'ProbconsCommandline',
'TCoffeeCommandline' ]
ClustalW example
 First step: download ClustalW from
ftp://ftp.ebi.ac.uk/pub/software/clustalw2/2.1/
 Second step: install
 Third step: look for clustal exe files
 Now you can run ClustalW from your Python code
51
Run example
>>> import os
>>> from Bio.Align.Applications import ClustalwCommandline
>>> clustalw_exe = r"C:\Program Files\new clustal\clustalw2.exe"
>>> clustalw_cline = ClustalwCommandline(clustalw_exe,
infile="opuntia.fasta")
>>> assert os.path.isfile(clustalw_exe), "Clustal W executable missing"
>>> stdout, stderr = clustalw_cline()
The command line is actually a function
we can run!
52
ClustalW
>>> from Bio import AlignIO
>>> align = AlignIO.read("opuntia.aln", "clustal")
>>> print align
SingleLetterAlphabet() alignment with 7 rows and 906 columns
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAGAGA
gi|6273285|gb|AF191659.1|AF191
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAGAGA
gi|6273284|gb|AF191658.1|AF191
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAGAGA
gi|6273287|gb|AF191661.1|AF191
TATACATAAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAGAGA
gi|6273286|gb|AF191660.1|AF191
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAGAGA
gi|6273290|gb|AF191664.1|AF191
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAGAGA
gi|6273289|gb|AF191663.1|AF191
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAGAGA
gi|6273291|gb|AF191665.1|AF191
53
ClustalW - tree
In case you are interested, the opuntia.dnd file ClustalW creates is
just a standard Newick tree file, and Bio.Phylo can parse these:
>>> from Bio import Phylo
>>> tree = Phylo.read("opuntia.dnd", "newick")
>>> Phylo.draw_ascii(tree)
54
BLAST
55
Running BLAST over the internet
 We use the function qblast() in
the Bio.Blast.NCBIWWW module. This has three nonoptional arguments:
 The blast program to use for the search, as a lower case
string: works with blastn, blastp, blastx, tblast and
tblastx.
 The databases to search against. The options for this
are available on the NCBI web pages
at http://www.ncbi.nlm.nih.gov/BLAST/blast_databases.s
html.
 A string containing your query sequence. This can either
be the sequence itself, the sequence in fasta format, or
an identifier like a GI number.
56
qblast additional parameters
 qblast can receive other parameters, analogous
to the parameters of the actual server
 Important examples:
 format_type: "HTML", "Text", "ASN.1", or "XML".
The default is "XML", as that is the format expected
by the parser (see next examples)
 expect sets the expectation or e-value threshold.
57
Step 1: call BLAST
>>> from Bio.Blast import NCBIWWW
# Option 1 - Use GI ID
>>> result_handle = NCBIWWW.qblast("blastn", "nt", "8332116")
# Option 2 – read a fasta file
>>> fasta_string = open("m_cold.fasta").read()
>>> result_handle = NCBIWWW.qblast("blastn", "nt", fasta_string)
# option 3 – parse file to seq object
>>> record = SeqIO.read(open("m_cold.fasta"), format="fasta")
>>> result_handle = NCBIWWW.qblast("blastn", "nt", record.seq)
58
Step2: parse the results
>>> from Bio.Blast import NCBIXML
>>> blast_record = NCBIXML.read(result_handle)
 Read can be used only once!
 blast_record object keeps the actual results
59
Remarks
 Basically, Biopython supports reading BLAST
results from HTMLs and text files.
 These methods are not stable and sometimes fail
because the servers change the format.
 XML is stable
 You can save XML files
 In the server
 From result_handle objects (next slide)
60
Save results as XML
>>> save_file = open("my_blast.xml", "w")
>>> save_file.write(result_handle.read())
>>> save_file.close()
>>> result_handle.close()
 Read can be used only once!
61
BLAST records
 A BLAST Record contains everything you might ever want to
extract from the BLAST output.
 Example:
62
>>> E_VALUE_THRESH = 0.04
>>> for alignment in blast_record.alignments:
for hsp in alignment.hsps:
if hsp.expect < E_VALUE_THRESH:
print '****Alignment****'
print 'sequence:', alignment.title
print 'length:', alignment.length
print 'e value:', hsp.expect
print hsp.query[0:75] + ''
print hsp.match[0:75] + ''
print hsp.sbjct[0:75] + ''
BLAST records
63
More functions
 We cover here very basic functions
 To get more details use
>>> import Bio.Blast.Record
>>> help(Bio.Blast.Record)
Help on module Bio.Blast.Record in Bio.Blast:
NAME
Bio.Blast.Record - Record classes to hold BLAST output.
FILE
d:\python27\lib\site-packages\bio\blast\record.py
DESCRIPTION
Classes:
Blast
Holds all the information from a blast search.
PSIBlast
Holds all the information from a psi-blast search.
64
Header
Holds information from the header.
Description
Holds information about one hit description.
Alignment
Holds information about one alignment hit.
HSP
Holds information about one HSP.
MultipleAlignment Holds information about a multiple alignment.
DatabaseReport Holds information from the database report.
Parameters
Holds information from the parameters.
Accessing
NCBI’s
Entrez
Databases
65
Bio.Entrez
 Module for programmatic access to Entrez
 Example: search PubMed or download GenBank
records from within a Python script
 Makes use of the Entrez Programming Utilities
http://www.ncbi.nlm.nih.gov/entrez/utils/
 Makes sure that the correct URL is used for the
queries, and that not more than one request is
made every three seconds, as required by NCBI
 Note! If the NCBI finds you are abusing their
systems, they can and will ban your access!
66
ESearch example
>>> handle =
Entrez.esearch(db="nucleotide",term="Cypripedioideae[Orgn] AND
matK[Gene]")
>>> record = Entrez.read(handle)
# Each of the IDs is a GenBank identifier.
>>> print (record["IdList"])
['126789333', '442591189', '442591187', '442591185', '442591183',
'442591181', '442591179', '442591177', '442591175', '442591173',
'442591171', '442591169', '442591167', '442591165', '442591163',
'442591161', '442591159', '442591157', '442591155', '442591153']
67
Explanation
Entrez.read
Transforms the actual results (retrieved as XML) to a
usable object of type
Bio.Entrez.Parser.DictionaryElement
>>> record
{u'Count': '158', u'RetMax': '20', u'IdList': ['126789333', '442591189', '442591187',
'442591185', '442591183', '442591181', '442591179', '442591177', '442591175',
'442591173', '442591171', '442591169', '442591167', '442591165', '442591163',
'442591161', '442591159', '442591157', '442591155', '442591153'],
u'TranslationStack': [{u'Count': '2482', u'Field': 'Organism', u'Term':
'"Cypripedioideae"[Organism]', u'Explode': 'Y'}, {u'Count': '71514', u'Field': 'Gene',
u'Term': 'matK[Gene]', u'Explode': 'N'}, 'AND'], u'TranslationSet': [{u'To':
'"Cypripedioideae"[Organism]', u'From': 'Cypripedioideae[Orgn]'}], u'RetStart': '0',
u'QueryTranslation': '"Cypripedioideae"[Organism] AND matK[Gene]'}
68
Database options
'pubmed', 'protein', 'nucleotide', 'nuccore',
'nucgss', 'nucest', 'structure', 'genome',
'books', 'cancerchromosomes', 'cdd', 'gap',
'domains', 'gene', 'genomeprj', 'gensat',
'geo', 'gds', 'homologene', 'journals', 'mesh',
'ncbisearch', 'nlmcatalog', 'omia', 'omim',
'pmc', 'popset', 'probe', 'proteinclusters',
'pcassay', 'pccompound', 'pcsubstance',
'snp', 'taxonomy', 'toolkit', 'unigene', 'unists'
69
Download a full record
>>> from Bio import Entrez
# Always tell NCBI who you are
>>> Entrez.email = A.N.Other@example.com
# rettype: get a GenBank record
>>> handle = Entrez.efetch(db="nucleotide",
id="186972394", rettype="gb", retmode="text")
>>> print handle.read()
70
71
Change ‘gb’ to ‘fasta’
72
Read directly to Seq.IO object
>>> from Bio import Entrez, SeqIO
>>> handle = Entrez.efetch(db="nucleotide",
id="186972394",rettype="gb", retmode="text")
>>> record = SeqIO.read(handle, "genbank")
>>> handle.close()
>>> print record
ID: EU490707.1
Name: EU490707
Description: Selenipedium aequinoctiale maturase K (matK) gene,
partial cds; chloroplast.
Number of features: 3
...
Seq('ATTTTTTACGAACCTGTGGAAATTTTTGGTTATGACAATAA
ATCTAGTTTAGTA...GAA', IUPACAmbiguousDNA())
73
Download directly from a URL
 Suppose we know how the database URLs
look like
 Example: GEO (gene expression omnibus)
"http://www.ncbi.nlm.nih.gov/geo/download/?ac
c=GSE6609&format=file"
74
Use the urlib2 module
>>> import urllib2
>>> u =
urllib2.urlopen('http://www.ncbi.nlm.nih.gov/geo/dow
nload/?acc=GSE6609&format=file')
>>> localFile = open('gse6609_raw.tar', 'w')
>>> for x in u:
localFile.write(x)
>>> localFile.close()
75
More details
 We covered only a few concepts
 For more details on Biopython options, including
dealing with specialized parsers, see
http://biopython.org/DIST/docs/tutorial/Tutorial.ht
ml#sec:parsing-blast
 Chapter 9
 Look at the urllib2 manual
 http://docs.python.org/2/library/urllib2.html
76
Sequence
Motifs
77
Gene expression regulation
 Transcription is regulated mainly by transcription factors
(TFs) - proteins that bind to DNA subsequences, called
binding sites (BSs)
 TFBSs are located mainly in the gene’s promoter – the
DNA sequence upstream the gene’s transcription start
site (TSS)
 TFs can promote or repress transcription
 Other regulators: micro-RNAs (miRNAs)
Ab-initio motif discovery
 You are given a set of strings
 You want to find a motif that is significantly
represented in the strings
 For example: TF\miRNA binding site
79
TFBS models
The BSs of a particular TF share a common pattern, or
motif, which is often modeled using:
 Degenerate string
GGWATB (W={A,T}, B={C,G,T})
ATCGGAATTCTGCAG
GGCAATTCGGGAATG
AGGTATTCTCAGATTA
 PWM = Position weight matrix
1
2
3
4
5
6
A
0.1
0.8
0
0.7
0.2
0
C
0
0.1
0.5
0.1
0.4
0.6
G
0
0
0.5
0.1
0.4
0.1
T
0.9
0.1
0
0.1
0
0.3
Cutoff = 0.009
AGCTACACCCATTTAT 0.06
AGTAGAGCCTTCGTG 0.06
CGATTCTACAATATGA 0.01
Motif discovery:
The typical two-step pipeline
Promoter/3’UTR
sequences
Co-regulated gene set
Cluster I
Gene expression
microarrays
Clustering
Cluster II
Cluster III
Location analysis
(ChIP-chip, …)
Functional group
(e.g., GO term)
Motif
discovery
Motif discovery: Goals and challenges
 Goal: Reverse-engineer the transcriptional
regulatory network
 Challenges:





BSs are short and degenerate (non-specific)
Promoters are long + complex (hard to model)
Search space is huge (motif and sequence)
Data is noisy
What to look for? (enriched?, localized?, conserved?)
 Problem is still considered very difficult despite
extensive research
Biopython motif objects
from Bio import motifs
from Bio.Seq import Seq
instances =
[Seq("TACAA"),Seq("TACGC"),Seq("TACAC"),Seq("TACCC"
),Seq("AACCC"),Seq("AATGC"),Seq("AATGC")]
m = motifs.create(instances)
print m
TACAA
TACGC
TACAC
TACCC
AACCC
AATGC
AATGC
83
Biopython motif objects
>>> print m.counts
0 1 2 3 4
A: 3.00 7.00 0.00 2.00 1.00
C: 0.00 0.00 5.00 2.00 6.00
G: 0.00 0.00 0.00 3.00 0.00
T: 4.00 0.00 2.00 0.00 0.00
84
Biopython motif objects
>>> m.consensus
Seq('TACGC', IUPACUnambiguousDNA())
#The anticonsensus sequence, corresponding to the smallest
values in the columns of the .counts matrix:
>>> m.anticonsensus
Seq('GGGTG', IUPACUnambiguousDNA())
85
Motif database
(http://jaspar.genereg.net/)
86
87
88
89
90
Read records
from Bio import motifs
arnt = motifs.read(open("Arnt.sites"), "sites")
print arnt.counts
0 1 2 3 4 5
A: 4.00 19.00 0.00 0.00 0.00 0.00
C: 16.00 0.00 20.00 0.00 0.00 0.00
G: 0.00 1.00 0.00 20.00 0.00 20.00
T: 0.00 0.00 0.00 0.00 20.00 0.00
91
MEME
 MEME is a tool for discovering motifs in a group
of related DNA or protein sequences.
 It takes as input a group of DNA or protein
sequences and outputs as many motifs as
requested.
 Therefore, in contrast to JASPAR files, MEME
output files typically contain multiple motifs.
92
Assumptions
 The number of motifs is known
 Assume this number is 1
 The size of the motif is known
 Biologically, we have estimates for the size for TFs
and miRNA
 Missing information
 PWM of the motif
 PWM of the background
 Motif locations
93
Assumptions
 Given a sequence X and a PWM Y, of the same
length we can calculate P(X|Y)
 Assume independence of motif positions
P( X | Y )   P( xi , Y,i )
i
94
Assumptions
 Given a sequence X and a PWM Y, of the same
length we can calculate P(X|Y)
 Assume independence of motif positions
P( X | Y )   P( xi , Y,i )
i
 Given a PWM we can now calculate for each
position K in each sequence J the probability the
motif starts at K in the sequence J.
95
Expectation Maximization (EM) Algorithm
•Start with initial guess for the PWMs
•The EM algorithm consists of the two steps, which are repeated
consecutively.
• Step 1, estimate the probability of finding the site at any position in
each of the sequences. These probabilities are used to provide new
information as to expected base or aa distribution for each column in
the site.
•Step 2, the maximization step, the new counts for bases or aa for each
position in the site found in the step 1 are substituted for the previous
set.
Expectation Maximization (EM) Algorithm
OOOOOOOOXXXXOOOOOOOO
OOOOOOOOXXXXOOOOOOOO
o o o o o o o o o o o o o o o o o o o o o o o o
OOOOOOOOXXXXOOOOOOOO
OOOOOOOOXXXXOOOOOOOO
IIII
IIIIIIII
Columns defined by a preliminary
alignment of the sequences provide
initial estimates of frequencies of aa in
each motif column
IIIIIII
Columns not in motif provide
background frequencies
Bases
Background
Site column 1
Site column 2
……
G
0.27
0.4
0.1
……
C
0.25
0.4
0.1
……
A
0.25
0.2
0.1
……
T
0.23
0.2
0.7
……
Total
1.00
1.00
1.00
……
Expectation Maximization (EM) Algorithm
XXXXOOOOOOOOOOOOOOOO
XXXX
A
IIII
IIIIIIIIIIIIIIII
OXXXXOOOOOOOOOOOOOOO
XXXX
B
IIII
I
IIIIIIIIIIIIIII
Use previous estimates of aa or
nucleotide frequencies for each
column in the motif to calculate
probability of motif in this
position, and multiply by……..
X
…background frequencies in the
remaining positions.
The resulting score gives the likelihood that the
motif matches positions A, B or other in seq 1. Repeat
for all other positions and find most likely locator.
Then repeat for the remaining seq’s.
EM Algorithm 2nd optimisation step: calculations
•The site probabilities for each seq calculated at the 1st step are then used to create a new
table of expected values for base counts for each of the site positions using the site
probabilities as weights.
• Suppose that P (site 1 in seq 1) = Psite1,seq1 / (Psite1,seq1 + Psite2,seq1 + …+ Psite78,seq1 ) = 0.01
and P (site 2 in seq 1) = 0.02.
•Then this values are added to the previous table as shown in the table below.
•This procedure is repeated for every other possible first columns in seq1 and then the
process continues for all other sequences resulting in a new version of the table.
•The expectation and maximization steps are repeated until the estimates of base frequencies
do not change.
Bases
Background
Site column 1
Site column 2
……
G
0.27 + …
0.4 + …
0.1 + …
……
C
0.25 + …
0.4 + …
0.1 + …
……
A
0.25 + …
0.2 + 0.01
0.1 + …
……
T
0.23 + …
0.2 + …
0.7 + 0.02
……
Total/
weighted
1.00
1.00
1.00
……
Run MEME (
100
http://meme.nbcr.net/meme/cgi-bin/meme.cgi
)
Results
101
Parse results
>>> handle = open("meme.dna.oops.txt")
>>> record = motifs.parse(handle, "meme")
>>> handle.close()
>>> len(record)
2
>>> motif = record[0]
>>> print motif.consensus
TTCACATGCCGC
>>> print motif.degenerate_consensus
TTCACATGSCNC
102
Motif attributes
>>> motif.num_occurrences
7
>>> motif.length
12
>>> evalue = motif.evalue
>>> print "%3.1g" % evalue
0.2
>>> motif.name
'Motif 1'
103
Where the motif was found
>>> motif = record['Motif 1']
# Each motif has an attribute .instances with the sequence instances in
which the motif was found, providing some information on each instance
>>> len(motif.instances)
7
>>> motif.instances[0]
Instance('TTCACATGCCGC', IUPACUnambiguousDNA())
>>> motif.instances[0].start
620
>>> motif.instances[0].strand
'-'
>>> motif.instances[0].length
12
>>> pvalue = motif.instances[0].pvalue
>>> print "%5.3g" % pvalue
1.85e-08
104
Amadeus
 Advanced algorithms improve upon MEME
 This is an algorithm for motif finding
 Appears to be one of the top algorithms in




105
many tests
Java based tool
Easy to use GUI
Supports analysis of TFs and miRNAs
Developed here in TAU
Amadeus
A Motif Algorithm for Detecting Enrichment in
mUltiple Species
Supports diverse motif discovery tasks:

1.
2.
3.
Finding over-represented motifs in one or more given sets of
genes.
Identifying motifs with global spatial features given only the
genomic sequences.
Simultaneous inference of motifs and their associated
expression profiles given genome-wide expression datasets.
How?



A general pipeline architecture for enumerating motifs.
Different statistical scoring schemes of motifs for different
motif discovery tasks.
 Input: ~350 genes expressed in the
human G2+M cell-cycle phases
[Whitfield
et al. ’02]
Pairs
analysis
CHR
NF-Y
(CCAAT-box)
Clustering
analysis
108
Clustering - reminder
 Cluster analysis is the grouping of items into
clusters based on the similarity of the items to
each other.
 Bio.Cluster module
 Kmeans
 SOM
 Hierarchical clustering
 PCA
109
K-means clustering
MacQueen, 65
 Input: a set of observations (x1, x2, …, xn)
 For example, each observation is a gene, and x is
the values
 Goal: partition the observation to K clusters
S = {S1, S2, …, Sk}
 Objective function:
110
K-means clustering
MacQueen, 65
 Initialize an arbitrary partition P into k clusters C1 ,…,
C k.
 For cluster Cj, element i Cj, EP(i, Cj) = cost of soln. if i
is moved to cluster Cj. Pick EP(r, Cs) if the new partition
is better
 Repeat until no improvement possible
 Requires knowledge of k
111
K-means variations
 Compute a centroid cp for each cluster Cp, e.g.,
gravity center = average vector
 Solution cost: clusters pi in cluster pd(vi,cp)
 Parallel version: move each to the cluster with
the closest centroid simultaneously
 Sequential version: one at a time
 “moving centers” approach
 Objective = homogeneity only (k fixed)
112
113
114
Data representation
 The data to be clustered are represented by
a n × m Numerical Python array data.
 Within the context of gene expression data
clustering, typically the rows correspond to
different genes whereas the columns correspond
to different experimental conditions.
 The clustering algorithms in Bio.Cluster can be
applied both to rows (genes) and to columns
(experiments).
115
Distance\Similarity functions
 'e': Euclidean distance
 'c': Pearson correlation coefficient
 'a': Absolute value of the Pearson correlation
coefficient
 'u': cosine of the angle between two data vectors
 'x': Absolute uncentered Pearson correlation
 's': Spearman’s rank correlation
116
Calculating distance matrices
>>> from Bio.Cluster import distancematrix
>>> matrix = distancematrix(data)
 data - required
 Additional options:
 transpose (default: 0)
Determines if the distances between the rows
of data are to be calculated (transpose==0), or
between the columns of data (transpose==1).
 dist (default: 'e', Euclidean distance)
117
Distancematrix
To save space Biopython keeps only the
lower\upper triangle of the matrix
118
Partitioning algorithms
 Algorithms that receive the number of clusters K
as an argument
 Kmeans
 Kmedians
 Often referred to as EM variations
119
Analysis example
120
Analysis example
# Read the data
import csv
file = open('ge_data_example.txt', 'rb')
data = csv.reader(file, delimiter='\t')
table = [row for row in data]
>>> len(table)
100
>>> table[1][1]
'9.412'
>>> table[0][0]
'sample'
>>> len(table[1])
17
121
Analysis example
# Transform the data to numpy matrix
from numpy import *
mat = matrix(table[1:][1:],dtype='float')
print len(mat)
# Create the distance matrix
from Bio.Cluster import distancematrix
dist_matrix = distancematrix(mat)
# Cluster
from Bio.Cluster import kcluster
clusterid, error, nfound = kcluster(mat)
122
Analysis example
# Cluster
from Bio.Cluster import kcluster
clusterid, error, nfound = kcluster(mat)
 Clusterid: array with cluster assignments
 Error: the within cluster sum of distances
 Nfound: the number of times the returned solution
was found
123
Analysis example
>>> clusterid
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1])
>>> error
15988.118370804612
>>> nfound
1
124
Kcluster: other options
 nclusters (default: 2): the number of clusters k.
 transpose (default: 0): Determines if rows (transpose is 0) or
columns (transpose is 1) are to be clustered.
 npass (default: 1): the number of times the k-means/-medians
clustering algorithm is performed
 method (default: a): describes how the center of a cluster is
found:
 method=='a': arithmetic mean (k-means clustering);
 method=='m': median (k-medians clustering).
 dist (default: 'e', Euclidean distance)
 initialid (default: None)
Specifies the initial clustering to be used for the algorithm.
125
Hierarchical clustering
from Bio.Cluster import treecluster
tree1 = treecluster(mat)
# Can be applied to a precalculated distance matrix
tree2 = treecluster(distancematrix=dist_matrix)
# Get the cluster assignments
clusterid = tree1.cut(3)
126
Hierarchical clustering using SciPy
 Better visualizations!
# Create a distance matrix
X=mat
D = scipy.zeros([len(x),len(x)])
for i in range(len(x)):
for j in range(len(x)):
D[i,j] = sum(abs(x[i] - x[j]))
127
Hierarchical clustering using SciPy
# Compute and plot first dendrogram.
fig = pylab.figure(figsize=(8,8))
# Add an axes at
position rect [left, bottom, width, height] where all
quantities are in fractions of figure width and height.
ax1 = fig.add_axes([0.09,0.1,0.2,0.6])
# Clustering analysis
Y = sch.linkage(D, method='centroid')
Z1 = sch.dendrogram(Y, orientation='right')
ax1.set_xticks([])
ax1.set_yticks([])
128
Hierarchical clustering using SciPy
# Plot distance matrix.
axmatrix = fig.add_axes([0.3,0.1,0.6,0.6])
idx1 = Z1['leaves']
D = D[idx1,:]
im = axmatrix.matshow(D, aspect='auto',
origin='lower', cmap=pylab.cm.YlGnBu)
axmatrix.set_xticks([])
axmatrix.set_yticks([])
129
Hierarchical clustering using SciPy
# Plot colorbar.
axcolor = fig.add_axes([0.91,0.1,0.02,0.6])
pylab.colorbar(im, cax=axcolor)
fig.show()
130
Phylogenetic
trees
131
Remember the Newick format?
 Simple example without branch length
(((A,B),(C,D)),(E,F,G))
132
Visualizing trees
>>> localFile.close()
>>> from Bio import Phylo
>>> tree = Phylo.read("simple.dnd", "newick")
>>> print tree
133
Tree(weight=1.0, rooted=False)
Clade(branch_length=1.0)
Clade(branch_length=1.0)
Clade(branch_length=1.0)
Clade(branch_length=1.0, name='A')
Clade(branch_length=1.0, name='B')
Clade(branch_length=1.0)
Clade(branch_length=1.0, name='C')
Clade(branch_length=1.0, name='D')
Clade(branch_length=1.0)
Clade(branch_length=1.0, name='E')
Clade(branch_length=1.0, name='F')
Clade(branch_length=1.0, name='G')
Visualizing trees
134
Use matplotlib
>>> import matplotlib
>>> tree.rooted = True
>>> Phylo.draw(tree)
135
Phylo IO
 Phylo.read() reads a tree with exactly one tree
 If you have many trees use a loop over the
returned object of Phylo.parse()
 Write to file using Phylo.write(treeObj,format)
 Popular formats: “nwk”, “xml”
 Convert tree formats using Phylo.convert
 Phylo.convert("tree1.xml", "phyloxml", "tree1.dnd",
"newick")
136
Download