Programming for Engineers in Python

Biopython

Classes
Reminder: type = data representation + behavior. Classes are user-defined types.

    class <classname>:
        statement_1
        .
        .
        statement_n

Like a mini-program:
• Variables.
• Function definitions.
• Even arbitrary commands.
The methods of a class get the instance as the first parameter, traditionally named self.
The method __init__ is called upon object construction (if available).
Objects of a class are called class instances.

Classes – Attributes and Methods

    class Vector2D:
        def __init__(self, x, y):
            # Instance attributes (each instance has its own copy)
            self.x, self.y = x, y

        # Method
        def size(self):
            return (self.x ** 2 + self.y ** 2) ** 0.5

Classes – Instantiate and Use

    >>> v = Vector2D(3, 4)  # Make instance.
    >>> v
    <__main__.Vector2D object at 0x00000000031A2828>
    >>> v.size()  # Call method on instance.
    5.0

Example – Multimap
• A dictionary with more than one value for each key.
• We already needed it once or twice and used:

    >>> lst = d.get(key, [])
    >>> lst.append(value)
    >>> d[key] = lst

• We will now write a new class that is a wrapper around a dict.
• The class will have methods that allow us to keep multiple values for each key.

Multimap: partial code

    class Multimap:
        def __init__(self):
            '''Create an empty Multimap'''
            self.inner = {}

        def get(self, key):
            '''Return list of values associated with key'''
            return self.inner.get(key, [])

        def put(self, key, value):
            '''Adds value to the list of values associated with key'''
            value_list = self.get(key)
            if value not in value_list:
                value_list.append(value)
                self.inner[key] = value_list

Multimap: put_all and remove

        def put_all(self, key, values):
            for v in values:
                self.put(key, v)

        def remove(self, key, value):
            value_list = self.get(key)
            if value in value_list:
                value_list.remove(value)
                self.inner[key] = value_list
                return True
            return False

Multimap: partial code

        def __len__(self):
            '''Returns the number of keys in the map'''
            return len(self.inner)

        def __str__(self):
            '''Converts the map to a string'''
            return str(self.inner)

        def __cmp__(self, other):
            '''Compares the map with another map'''
            return cmp(self.inner, other.inner)

        def __contains__(self, key):
            '''Returns True if key exists in the map'''
            return key in self.inner

Multimap
Use case: a dictionary of countries and their cities.

    >>> m = Multimap()
    >>> m.put('Israel', 'Tel-Aviv')
    >>> m.put('Israel', 'Jerusalem')
    >>> m.put('France', 'Paris')
    >>> m.put_all('England', ('London', 'Manchester', 'Moscow'))
    >>> m.remove('England', 'Moscow')
    True
    >>> print m.get('Israel')
    ['Tel-Aviv', 'Jerusalem']

Biopython
• An international association of developers of freely available Python (http://www.python.org) tools for computational molecular biology.
• Provides tools for:
  • Parsing files (FASTA, ClustalW, GenBank, …)
  • Interfaces to common software
  • Operations on sequences
  • Simple machine learning applications
  • BLAST
  • And many more

Installing Biopython
• Go to http://biopython.org/wiki/Download
• Available for Windows and Unix
• Select Python 2.7
• NumPy is required

SeqIO
• The standard sequence input/output interface for Biopython.
• Provides a simple, uniform interface to input and output assorted sequence file formats.
• Deals with sequences as SeqRecord objects.
• There is a sister interface, Bio.AlignIO, for working directly with sequence alignment files as Alignment objects.

Parsing a FASTA file

    # Parse a simple FASTA file
    from Bio import SeqIO
    for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
        print seq_record.id
        print repr(seq_record.seq)
        print len(seq_record)

Why repr and not str?
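To make the parsing loop and the repr-versus-str question concrete, here is a minimal pure-Python sketch of what a FASTA reader does. It is not Biopython's implementation: FastaRecord and parse_fasta are hypothetical names, and the abbreviation rule in __repr__ is an assumption chosen only to show that repr gives a short summary of a long sequence while the full text stays available in seq.

```python
import io


class FastaRecord:
    """Hypothetical stand-in for a sequence record (not Bio.SeqRecord)."""

    def __init__(self, id, seq):
        self.id = id
        self.seq = seq

    def __len__(self):
        return len(self.seq)

    def __repr__(self):
        # Abbreviate long sequences so that printing a record stays readable.
        shown = self.seq
        if len(shown) > 20:
            shown = shown[:10] + "..." + shown[-3:]
        return "FastaRecord(id=%r, seq=%r)" % (self.id, shown)


def parse_fasta(handle):
    """Yield one FastaRecord for every '>' header found in a text handle."""
    seq_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield FastaRecord(seq_id, "".join(chunks))
            parts = line[1:].split()
            seq_id, chunks = (parts[0] if parts else ""), []
        else:
            chunks.append(line)
    if seq_id is not None:
        yield FastaRecord(seq_id, "".join(chunks))


example = io.StringIO(
    ">seq1 first example\nACGTACGTAC\nGTACGTACGTACGT\n>seq2\nGGCC\n"
)
records = list(parse_fasta(example))
for rec in records:
    print(rec.id, repr(rec), len(rec))
```

In this sketch, repr(rec) is a compact summary that is safe to print for very long sequences, whereas rec.seq (or str of a Biopython Seq) returns every letter.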
GenBank files

    # GenBank files
    from Bio import SeqIO
    for seq_record in SeqIO.parse("ls_orchid.gbk", "genbank"):
        print seq_record
        # added to print just one record example
        break

GenBank files

    from Bio import SeqIO
    for seq_record in SeqIO.parse("ls_orchid.gbk", "genbank"):
        print seq_record.id
        print repr(seq_record.seq)
        print len(seq_record)

Sequence objects
• Support methods similar to those of standard strings.
• Provide additional methods:
  • Translate
  • Reverse complement
• Support different alphabets: AGTAGTTAAA can be DNA or protein.

Sequences and alphabets
Bio.Alphabet.IUPAC provides basic definitions for proteins, DNA and RNA, and additionally lets you extend and customize the basic definitions. For example:
• Adding ambiguous symbols
• Adding special new characters

Example – generic alphabet

    >>> from Bio.Seq import Seq
    >>> my_seq = Seq("AGTACACTGGT")
    >>> my_seq
    Seq('AGTACACTGGT', Alphabet())
    >>> my_seq.alphabet
    Alphabet()

Example – specific sequences

    >>> from Bio.Seq import Seq
    >>> from Bio.Alphabet import IUPAC
    >>> my_seq = Seq("AGTACACTGGT", IUPAC.unambiguous_dna)
    >>> my_seq
    Seq('AGTACACTGGT', IUPACUnambiguousDNA())
    >>> my_seq.alphabet
    IUPACUnambiguousDNA()

    >>> from Bio.Seq import Seq
    >>> from Bio.Alphabet import IUPAC
    >>> my_prot = Seq("AGTACACTGGT", IUPAC.protein)
    >>> my_prot
    Seq('AGTACACTGGT', IUPACProtein())
    >>> my_prot.alphabet
    IUPACProtein()

Sequences act like strings
Access elements:

    >>> my_seq = Seq("GATCG")
    >>> print my_seq[0]   # first letter
    G
    >>> print my_seq[2]   # third letter
    T
    >>> print my_seq[-1]  # last letter
    G

Count without overlaps:

    >>> from Bio.Seq import Seq
    >>> "AAAA".count("AA")
    2
    >>> Seq("AAAA").count("AA")
    2

Calculate GC content

    >>> from Bio.Seq import Seq
    >>> from Bio.Alphabet import IUPAC
    >>> from Bio.SeqUtils import GC
    >>> my_seq = Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPAC.unambiguous_dna)
    >>> GC(my_seq)
    46.875

Slicing
Simple slicing:

    >>> from Bio.Seq import Seq
    >>> from Bio.Alphabet import IUPAC
    >>> my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC.unambiguous_dna)
    >>> my_seq[4:12]
    Seq('GATGGGCC', IUPACUnambiguousDNA())

Start, stop, stride:

    >>> my_seq[0::3]
    Seq('GCTGTAGTAAG', IUPACUnambiguousDNA())
    >>> my_seq[1::3]
    Seq('AGGCATGCATC', IUPACUnambiguousDNA())
    >>> my_seq[2::3]
    Seq('TAGCTAAGAC', IUPACUnambiguousDNA())

Concatenation
• Simple addition, as with Python strings.
• But the alphabets must be compatible:

    >>> from Bio.Alphabet import IUPAC
    >>> from Bio.Seq import Seq
    >>> protein_seq = Seq("EVRNAK", IUPAC.protein)
    >>> dna_seq = Seq("ACGT", IUPAC.unambiguous_dna)
    >>> protein_seq + dna_seq
    Traceback (most recent call last):
    …

Changing case

    >>> from Bio.Seq import Seq
    >>> from Bio.Alphabet import generic_dna
    >>> dna_seq = Seq("acgtACGT", generic_dna)
    >>> dna_seq
    Seq('acgtACGT', DNAAlphabet())
    >>> dna_seq.upper()
    Seq('ACGTACGT', DNAAlphabet())
    >>> dna_seq.lower()
    Seq('acgtacgt', DNAAlphabet())

Changing case
Case is important for matching:

    >>> "GTAC" in dna_seq
    False
    >>> "GTAC" in dna_seq.upper()
    True

IUPAC alphabets are upper case only, so lower() falls back to a generic alphabet:

    >>> from Bio.Seq import Seq
    >>> from Bio.Alphabet import IUPAC
    >>> dna_seq = Seq("ACGT", IUPAC.unambiguous_dna)
    >>> dna_seq
    Seq('ACGT', IUPACUnambiguousDNA())
    >>> dna_seq.lower()
    Seq('acgt', DNAAlphabet())

Reverse complement

    >>> from Bio.Seq import Seq
    >>> from Bio.Alphabet import IUPAC
    >>> my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC.unambiguous_dna)
    >>> my_seq.complement()
    Seq('CTAGCTACCCGGATATATCCTAGCTTTTAGCG', IUPACUnambiguousDNA())
    >>> my_seq.reverse_complement()
    Seq('GCGATTTTCGATCCTATATAGGCCCATCGATC', IUPACUnambiguousDNA())

Transcription

    >>> coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna)
    >>> template_dna = coding_dna.reverse_complement()
    >>> messenger_rna = coding_dna.transcribe()
    >>> messenger_rna
    Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA())

As you can see, all this does is switch T → U and adjust the alphabet.
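The sequence operations above are simple enough to sketch with plain strings, which may help when reading the Biopython output. This is an illustrative sketch, not how Bio.Seq is implemented: the function names are hypothetical, alphabets are ignored, and only the four unambiguous DNA letters are handled.

```python
# Base-pairing table for unambiguous, upper-case DNA only (assumption).
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def complement(dna):
    """Replace every base by its pairing partner."""
    return "".join(COMPLEMENT[base] for base in dna)


def reverse_complement(dna):
    """Complement the strand and read it in the opposite direction."""
    return complement(dna)[::-1]


def transcribe(coding_dna):
    """Coding-strand transcription: switch T to U, nothing else."""
    return coding_dna.replace("T", "U")


def gc_percent(dna):
    """Percentage of positions that are G or C."""
    return 100.0 * sum(base in "GC" for base in dna) / len(dna)


my_seq = "GATCGATGGGCCTATATAGGATCGAAAATCGC"
coding_dna = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"

print(complement(my_seq))
print(reverse_complement(my_seq))
print(gc_percent(my_seq))
print(transcribe(coding_dna))
```

The printed values can be compared with the complement(), reverse_complement(), GC() and transcribe() results shown in the slides for the same input strings.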
Translation
Simple example:

    >>> from Bio.Seq import Seq
    >>> from Bio.Alphabet import IUPAC
    >>> messenger_rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG", IUPAC.unambiguous_rna)
    >>> messenger_rna
    Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA())
    >>> messenger_rna.translate()
    Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))

The '*' characters mark stop codons!

Translation from the DNA

    >>> from Bio.Seq import Seq
    >>> from Bio.Alphabet import IUPAC
    >>> coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna)
    >>> coding_dna
    Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA())
    >>> coding_dna.translate()
    Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))

Using different translation tables
• In several cases we may want to use different translation tables.
• Translation tables are given IDs in GenBank (standard = 1).
• Vertebrate Mitochondrial is table 2.
• More details at http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi

Using different translation tables

    >>> coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna)
    >>> coding_dna.translate()
    Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))
    >>> coding_dna.translate(table="Vertebrate Mitochondrial")
    Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*'))
    >>> coding_dna.translate(table=2)
    Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*'))

Translation tables in Biopython

Translate up to the first stop in frame

    >>> coding_dna.translate()
    Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))
    >>> coding_dna.translate(to_stop=True)
    Seq('MAIVMGR', IUPACProtein())
    >>> coding_dna.translate(table=2)
    Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*'))
    >>> coding_dna.translate(table=2, to_stop=True)
    Seq('MAIVMGRWKGAR', IUPACProtein())

Comparing sequences
The standard "==" comparison is done by comparing the references (!), hence:

    >>> seq1 = Seq("ACGT", IUPAC.unambiguous_dna)
    >>> seq2 = Seq("ACGT", IUPAC.unambiguous_dna)
    >>> seq1 == seq2
    Warning (from warnings module): …
    FutureWarning: In future comparing Seq objects will use string comparison (not object comparison). Incompatible alphabets will trigger a warning (not an exception)… please use str(seq1)==str(seq2) to make your code explicit and to avoid this warning.
    False
    >>> seq1 == seq1
    True

Mutable vs. Immutable
• Like strings, standard Seq objects are immutable.
• To get a mutable sequence, either:
  • Use the tomutable() method of an existing Seq, or
  • Use the MutableSeq constructor:

    from Bio.Seq import MutableSeq
    from Bio.Alphabet import IUPAC
    mutable_seq = MutableSeq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA", IUPAC.unambiguous_dna)

Unknown sequences example
In many biological cases we deal with unknown sequences.

    >>> from Bio.Seq import Seq, UnknownSeq
    >>> from Bio.Alphabet import IUPAC
    >>> unk_dna = UnknownSeq(20, alphabet=IUPAC.ambiguous_dna)
    >>> my_seq = Seq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA", IUPAC.unambiguous_dna)
    >>> unk_dna + my_seq
    Seq('NNNNNNNNNNNNNNNNNNNNGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACAmbiguousDNA())

MSA

Read MSA
• Use Bio.AlignIO.read(file, format)
  • file: the file path
  • format: "stockholm", "fasta", "clustal", …
• Use help(AlignIO) for details.

Example
We want to parse this file from PFAM.

Example

    from Bio import AlignIO
    alignment = AlignIO.read("PF05371_seed.sth", "stockholm")
    print alignment

Alignment object example

    >>> from Bio import AlignIO
    >>> alignment = AlignIO.read("PF05371_seed.sth", "stockholm")
    >>> print alignment[1]
    ID: Q9T0Q8_BPIKE/1-52
    Name: Q9T0Q8_BPIKE
    Description: Q9T0Q8_BPIKE/1-52
    Number of features: 0
    /start=1
    /end=52
    /accession=Q9T0Q8.1
    Seq('AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVSRA', SingleLetterAlphabet())

Alignment object example

    >>> print "Alignment length %i" % alignment.get_alignment_length()
    Alignment length 52
    >>> for record in alignment:
    ...     print "%s - %s" % (record.seq, record.id)
    AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSSKA - COATB_BPIKE/30-81
    AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVSRA - Q9T0Q8_BPIKE/1-52
    DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSSKA - COATB_BPI22/32-83
    AEGDDP---AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA - COATB_BPM13/24-72
    AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFASKA - COATB_BPZJ2/1-49
    AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA - Q9T0Q9_BPFD/1-49
    FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVSRA - COATB_BPIF1/22-73

Cross-references example
Did you notice in the raw file above that several of the sequences include database cross-references to the PDB and the associated known secondary structure?

    >>> for record in alignment:
    ...     if record.dbxrefs:
    ...         print record.id, record.dbxrefs
    COATB_BPIKE/30-81 ['PDB; 1ifl ; 1-52;']
    COATB_BPM13/24-72 ['PDB; 2cpb ; 1-49;', 'PDB; 2cps ; 1-49;']
    Q9T0Q9_BPFD/1-49 ['PDB; 1nh4 A; 1-49;']
    COATB_BPIF1/22-73 ['PDB; 1ifk ; 1-50;']

Comments
• Remember that almost all MSA formats are supported.
• When a file holds more than one MSA, use AlignIO.parse().
• A common example is PHYLIP's output: use AlignIO.parse("resampled.phy", "phylip").
• The result is an iterator over all the MSAs in the file.

Write alignment to file

    from Bio.Alphabet import generic_dna
    from Bio.Seq import Seq
    from Bio.SeqRecord import SeqRecord
    from Bio.Align import MultipleSeqAlignment
    align1 = MultipleSeqAlignment([
        SeqRecord(Seq("ACTGCTAGCTAG", generic_dna), id="Alpha"),
        SeqRecord(Seq("ACT-CTAGCTAG", generic_dna), id="Beta"),
        SeqRecord(Seq("ACTGCTAGDTAG", generic_dna), id="Gamma"),
    ])
    from Bio import AlignIO
    AlignIO.write(align1, "my_example.phy", "phylip")

The slide's file listing shows several PHYLIP blocks. Only the first one (Alpha, Beta, Gamma) is produced by the code above, and the last block is cut off on the slide:

     3 12
    Alpha      ACTGCTAGCT AG
    Beta       ACT-CTAGCT AG
    Gamma      ACTGCTAGDT AG
     3 9
    Delta      GTCAGC-AG
    Epislon    GACAGCTAG
    Zeta       GTCAGCTAG
     3 13
    Eta        ACTAGTACAG CTG
    Theta      ACTAGTACAG CT-

Slicing
Alignments work like NumPy matrices:

    >>> print alignment[2,6]
    T
    # You can pull out a single column as a string like this:
    >>> print alignment[:,6]
    TTT---T
    >>> print alignment[3:6,:6]
    SingleLetterAlphabet() alignment with 3 rows and 6 columns
    AEGDDP COATB_BPM13/24-72
    AEGDDP COATB_BPZJ2/1-49
    AEGDDP Q9T0Q9_BPFD/1-49
    >>> print alignment[:,:6]
    SingleLetterAlphabet() alignment with 7 rows and 6 columns
    AEPNAA COATB_BPIKE/30-81
    AEPNAA Q9T0Q8_BPIKE/1-52
    DGTSTA COATB_BPI22/32-83
    AEGDDP COATB_BPM13/24-72
    AEGDDP COATB_BPZJ2/1-49
    AEGDDP Q9T0Q9_BPFD/1-49
    FAADDA COATB_BPIF1/22-73

External applications
• How do we call MSA algorithms on an unaligned set of sequences?
• Biopython provides wrappers.
• The idea:
  • Create a command line object with the algorithm options.
  • Invoke the command (Python uses subprocesses).
• The Bio.Align.Applications module:

    >>> import Bio.Align.Applications
    >>> dir(Bio.Align.Applications)
    ['ClustalwCommandline', 'DialignCommandline', 'MafftCommandline',
     'MuscleCommandline', 'PrankCommandline', 'ProbconsCommandline',
     'TCoffeeCommandline']

ClustalW example
• First step: download ClustalW from ftp://ftp.ebi.ac.uk/pub/software/clustalw2/2.1/
• Second step: install.
• Third step: look for the ClustalW exe files.
• Now you can run ClustalW from your Python code.

Run example

    >>> import os
    >>> from Bio.Align.Applications import ClustalwCommandline
    >>> clustalw_exe = r"C:\Program Files\new clustal\clustalw2.exe"
    >>> assert os.path.isfile(clustalw_exe), "Clustal W executable missing"
    >>> clustalw_cline = ClustalwCommandline(clustalw_exe, infile="opuntia.fasta")
    >>> stdout, stderr = clustalw_cline()

The command line is actually a function we can run!
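The "command line object that is also a function" idea can be sketched in a few lines. The class below is a hypothetical illustration, not Biopython's ClustalwCommandline: the name CommandLine, the "-name=value" option style and the use of subprocess.run are all assumptions. To keep the example runnable anywhere, it launches the Python interpreter itself instead of ClustalW.

```python
import subprocess
import sys


class CommandLine:
    """Hypothetical mini-wrapper: stores a program plus options, runs when called."""

    def __init__(self, executable, *args, **options):
        self.executable = executable
        self.args = list(args)
        self.options = options

    def as_list(self):
        """Build the argument list that will be handed to the subprocess."""
        cmd = [self.executable] + self.args
        for name, value in sorted(self.options.items()):
            cmd.append("-%s=%s" % (name, value))
        return cmd

    def __str__(self):
        return " ".join(self.as_list())

    def __call__(self):
        # Defining __call__ is what lets us write cline() as if it were a function.
        proc = subprocess.run(
            self.as_list(), capture_output=True, text=True, check=True
        )
        return proc.stdout, proc.stderr


# Stand-in "tool": a Python one-liner that echoes the options it received.
cline = CommandLine(
    sys.executable, "-c", "import sys; print(sys.argv[1:])", infile="opuntia.fasta"
)
stdout, stderr = cline()
print(stdout)
```

Separating "describe the command" from "run the command" means the object can be printed, logged or modified before anything is executed, which is the design the slide points at.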
ClustalW

    >>> from Bio import AlignIO
    >>> align = AlignIO.read("opuntia.aln", "clustal")
    >>> print align
    SingleLetterAlphabet() alignment with 7 rows and 906 columns
    TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAGAGA gi|6273285|gb|AF191659.1|AF191
    TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAGAGA gi|6273284|gb|AF191658.1|AF191
    TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAGAGA gi|6273287|gb|AF191661.1|AF191
    TATACATAAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAGAGA gi|6273286|gb|AF191660.1|AF191
    TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAGAGA gi|6273290|gb|AF191664.1|AF191
    TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAGAGA gi|6273289|gb|AF191663.1|AF191
    TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAGAGA gi|6273291|gb|AF191665.1|AF191

ClustalW - tree
In case you are interested, the opuntia.dnd file ClustalW creates is just a standard Newick tree file, and Bio.Phylo can parse these:

    >>> from Bio import Phylo
    >>> tree = Phylo.read("opuntia.dnd", "newick")
    >>> Phylo.draw_ascii(tree)

BLAST

Running BLAST over the internet
We use the function qblast() in the Bio.Blast.NCBIWWW module. It has three mandatory arguments:
• The BLAST program to use for the search, as a lower-case string: blastn, blastp, blastx, tblastn or tblastx.
• The database to search against. The options are listed on the NCBI web pages at http://www.ncbi.nlm.nih.gov/BLAST/blast_databases.shtml.
• A string containing your query sequence. This can be the sequence itself, the sequence in FASTA format, or an identifier such as a GI number.

qblast additional parameters
qblast can receive other parameters, analogous to the parameters of the actual server. Important examples:
• format_type: "HTML", "Text", "ASN.1", or "XML". The default is "XML", as that is the format expected by the parser (see next examples).
• expect: sets the expectation, or e-value, threshold.
Step 1: call BLAST

    >>> from Bio.Blast import NCBIWWW
    # Option 1 - use a GI ID
    >>> result_handle = NCBIWWW.qblast("blastn", "nt", "8332116")
    # Option 2 - read a FASTA file
    >>> fasta_string = open("m_cold.fasta").read()
    >>> result_handle = NCBIWWW.qblast("blastn", "nt", fasta_string)
    # Option 3 - parse the file into a Seq object
    >>> from Bio import SeqIO
    >>> record = SeqIO.read(open("m_cold.fasta"), format="fasta")
    >>> result_handle = NCBIWWW.qblast("blastn", "nt", record.seq)

Step 2: parse the results

    >>> from Bio.Blast import NCBIXML
    >>> blast_record = NCBIXML.read(result_handle)

• The handle can be read only once!
• The blast_record object keeps the actual results.

Remarks
• Biopython can also read BLAST results from HTML and plain-text files.
• These parsers are not stable and sometimes fail because the servers change the format.
• XML is stable.
• You can save XML files:
  • On the server
  • From result_handle objects (next slide)

Save results as XML

    >>> save_file = open("my_blast.xml", "w")
    >>> save_file.write(result_handle.read())
    >>> save_file.close()
    >>> result_handle.close()

The handle can be read only once!

BLAST records
A BLAST Record contains everything you might ever want to extract from the BLAST output. Example:

    >>> E_VALUE_THRESH = 0.04
    >>> for alignment in blast_record.alignments:
    ...     for hsp in alignment.hsps:
    ...         if hsp.expect < E_VALUE_THRESH:
    ...             print '****Alignment****'
    ...             print 'sequence:', alignment.title
    ...             print 'length:', alignment.length
    ...             print 'e value:', hsp.expect
    ...             print hsp.query[0:75] + ''
    ...             print hsp.match[0:75] + ''
    ...             print hsp.sbjct[0:75] + ''

More functions
We cover only very basic functions here. To get more details use:

    >>> import Bio.Blast.Record
    >>> help(Bio.Blast.Record)
    Help on module Bio.Blast.Record in Bio.Blast:
    NAME
        Bio.Blast.Record - Record classes to hold BLAST output.
    FILE
        d:\python27\lib\site-packages\bio\blast\record.py
    DESCRIPTION
        Classes:
        Blast              Holds all the information from a blast search.
        PSIBlast           Holds all the information from a psi-blast search.
        Header             Holds information from the header.
        Description        Holds information about one hit description.
        Alignment          Holds information about one alignment hit.
        HSP                Holds information about one HSP.
        MultipleAlignment  Holds information about a multiple alignment.
        DatabaseReport     Holds information from the database report.
        Parameters         Holds information from the parameters.

Accessing NCBI’s Entrez Databases

Bio.Entrez
• Module for programmatic access to Entrez.
• Example: search PubMed or download GenBank records from within a Python script.
• Makes use of the Entrez Programming Utilities: http://www.ncbi.nlm.nih.gov/entrez/utils/
• Makes sure that the correct URL is used for the queries, and that not more than one request is made every three seconds, as required by NCBI.
• Note! If the NCBI finds you are abusing their systems, they can and will ban your access!

ESearch example

    >>> from Bio import Entrez
    >>> handle = Entrez.esearch(db="nucleotide", term="Cypripedioideae[Orgn] AND matK[Gene]")
    >>> record = Entrez.read(handle)
    # Each of the IDs is a GenBank identifier.
    >>> print (record["IdList"])
    ['126789333', '442591189', '442591187', '442591185', '442591183',
     '442591181', '442591179', '442591177', '442591175', '442591173',
     '442591171', '442591169', '442591167', '442591165', '442591163',
     '442591161', '442591159', '442591157', '442591155', '442591153']

Explanation
Entrez.read transforms the actual results (retrieved as XML) into a usable object of type Bio.Entrez.Parser.DictionaryElement:

    >>> record
    {u'Count': '158', u'RetMax': '20',
     u'IdList': ['126789333', '442591189', '442591187', '442591185', '442591183',
                 '442591181', '442591179', '442591177', '442591175', '442591173',
                 '442591171', '442591169', '442591167', '442591165', '442591163',
                 '442591161', '442591159', '442591157', '442591155', '442591153'],
     u'TranslationStack': [{u'Count': '2482', u'Field': 'Organism', u'Term': '"Cypripedioideae"[Organism]', u'Explode': 'Y'},
                           {u'Count': '71514', u'Field': 'Gene', u'Term': 'matK[Gene]', u'Explode': 'N'},
                           'AND'],
     u'TranslationSet': [{u'To': '"Cypripedioideae"[Organism]', u'From': 'Cypripedioideae[Orgn]'}],
     u'RetStart': '0',
     u'QueryTranslation': '"Cypripedioideae"[Organism] AND matK[Gene]'}

Database options
'pubmed', 'protein', 'nucleotide', 'nuccore', 'nucgss', 'nucest', 'structure', 'genome', 'books', 'cancerchromosomes', 'cdd', 'gap', 'domains', 'gene', 'genomeprj', 'gensat', 'geo', 'gds', 'homologene', 'journals', 'mesh', 'ncbisearch', 'nlmcatalog', 'omia', 'omim', 'pmc', 'popset', 'probe', 'proteinclusters', 'pcassay', 'pccompound', 'pcsubstance', 'snp', 'taxonomy', 'toolkit', 'unigene', 'unists'

Download a full record

    >>> from Bio import Entrez
    # Always tell NCBI who you are
    >>> Entrez.email = "A.N.Other@example.com"
    # rettype: get a GenBank record
    >>> handle = Entrez.efetch(db="nucleotide", id="186972394", rettype="gb", retmode="text")
    >>> print handle.read()

To get FASTA output instead, change 'gb' to 'fasta'.

Read directly into a SeqRecord with SeqIO

    >>> from Bio import Entrez, SeqIO
    >>> handle = Entrez.efetch(db="nucleotide", id="186972394", rettype="gb", retmode="text")
    >>> record = SeqIO.read(handle, "genbank")
    >>> handle.close()
    >>> print record
    ID: EU490707.1
    Name: EU490707
    Description: Selenipedium aequinoctiale maturase K (matK) gene, partial cds; chloroplast.
    Number of features: 3
    ...
    Seq('ATTTTTTACGAACCTGTGGAAATTTTTGGTTATGACAATAAATCTAGTTTAGTA...GAA', IUPACAmbiguousDNA())

Download directly from a URL
• Suppose we know what the database URLs look like.
• Example: GEO (Gene Expression Omnibus)
  "http://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE6609&format=file"

Use the urllib2 module

    >>> import urllib2
    >>> u = urllib2.urlopen('http://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE6609&format=file')
    >>> localFile = open('gse6609_raw.tar', 'wb')  # binary mode: the download is a tar archive
    >>> for x in u:
    ...     localFile.write(x)
    >>> localFile.close()

More details
• We covered only a few concepts.
• For more details on Biopython options, including dealing with specialized parsers, see http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec:parsing-blast (Chapter 9).
• Look at the urllib2 manual: http://docs.python.org/2/library/urllib2.html

Sequence Motifs

Gene expression regulation
• Transcription is regulated mainly by transcription factors (TFs): proteins that bind to DNA subsequences called binding sites (BSs).
• TFBSs are located mainly in the gene’s promoter, the DNA sequence upstream of the gene’s transcription start site (TSS).
• TFs can promote or repress transcription.
• Other regulators: micro-RNAs (miRNAs).

Ab-initio motif discovery
• You are given a set of strings.
• You want to find a motif that is significantly represented in the strings.
• For example: a TF/miRNA binding site.

TFBS models
The BSs of a particular TF share a common pattern, or motif, which is often modeled using:
• A degenerate string: GGWATB (W={A,T}, B={C,G,T})
    ATCGGAATTCTGCAG
    GGCAATTCGGGAATG
    AGGTATTCTCAGATTA
• A PWM (position weight matrix):

         1    2    3    4    5    6
    A   0.1  0.8  0    0.7  0.2  0
    C   0    0.1  0.5  0.1  0.4  0.6
    G   0    0    0.5  0.1  0.4  0.1
    T   0.9  0.1  0    0.1  0    0.3

  Cutoff = 0.009
    AGCTACACCCATTTAT   0.06
    AGTAGAGCCTTCGTG    0.06
    CGATTCTACAATATGA   0.01

Motif discovery: The typical two-step pipeline
[Figure: step 1 obtains a co-regulated gene set, either by clustering gene expression microarrays (Cluster I, Cluster II, Cluster III), by location analysis (ChIP-chip, …), or from a functional group (e.g., GO term); step 2 runs motif discovery on the promoter/3’UTR sequences of that gene set.]

Motif discovery: Goals and challenges
• Goal: reverse-engineer the transcriptional regulatory network.
• Challenges:
  • BSs are short and degenerate (non-specific).
  • Promoters are long and complex (hard to model).
  • The search space is huge (motif and sequence).
  • Data is noisy.
  • What to look for? (enriched? localized? conserved?)
• The problem is still considered very difficult despite extensive research.

Biopython motif objects

    from Bio import motifs
    from Bio.Seq import Seq
    instances = [Seq("TACAA"), Seq("TACGC"), Seq("TACAC"), Seq("TACCC"),
                 Seq("AACCC"), Seq("AATGC"), Seq("AATGC")]
    m = motifs.create(instances)
    print m
    TACAA
    TACGC
    TACAC
    TACCC
    AACCC
    AATGC
    AATGC

Biopython motif objects

    >>> print m.counts
            0      1      2      3      4
    A:   3.00   7.00   0.00   2.00   1.00
    C:   0.00   0.00   5.00   2.00   6.00
    G:   0.00   0.00   0.00   3.00   0.00
    T:   4.00   0.00   2.00   0.00   0.00

Biopython motif objects

    >>> m.consensus
    Seq('TACGC', IUPACUnambiguousDNA())
    # The anticonsensus sequence, corresponding to the smallest values
    # in the columns of the .counts matrix:
    >>> m.anticonsensus
    Seq('GGGTG', IUPACUnambiguousDNA())

Motif database (http://jaspar.genereg.net/)

Read records

    from Bio import motifs
    arnt = motifs.read(open("Arnt.sites"), "sites")
    print arnt.counts
            0      1      2      3      4      5
    A:   4.00  19.00   0.00   0.00   0.00   0.00
    C:  16.00   0.00  20.00   0.00   0.00   0.00
    G:   0.00   1.00   0.00  20.00   0.00  20.00
    T:   0.00   0.00   0.00   0.00  20.00   0.00

MEME
MEME is a tool for discovering motifs in a group of related DNA or protein sequences. It takes as input a group of DNA or protein sequences and outputs as many motifs as requested. Therefore, in contrast to JASPAR files, MEME output files typically contain multiple motifs.
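The counts matrix and consensus shown in the motif-object slides can be reproduced with plain Python, which makes clear what those attributes contain. This is only a sketch and not Biopython's code: count_matrix and consensus are hypothetical helper names, and ties in a column are broken by alphabetical order, which is an assumption that does not matter for this particular input.

```python
ALPHABET = "ACGT"

# The seven motif instances used in the slides, as plain strings.
instances = ["TACAA", "TACGC", "TACAC", "TACCC", "AACCC", "AATGC", "AATGC"]


def count_matrix(sites):
    """Count how often each letter occurs in each column of equal-length sites."""
    width = len(sites[0])
    counts = {letter: [0] * width for letter in ALPHABET}
    for site in sites:
        for position, letter in enumerate(site):
            counts[letter][position] += 1
    return counts


def consensus(counts):
    """Pick the most frequent letter of every column."""
    width = len(counts["A"])
    return "".join(
        max(ALPHABET, key=lambda letter: counts[letter][i]) for i in range(width)
    )


counts = count_matrix(instances)
for letter in ALPHABET:
    print(letter, counts[letter])
print(consensus(counts))
```

Dividing every column of the counts by the number of instances would turn the matrix into per-position frequencies, i.e. a simple PWM of the kind described in the TFBS models slide.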
Assumptions
• The number of motifs is known.
  • Assume this number is 1.
• The size of the motif is known.
  • Biologically, we have estimates of the size for TFs and miRNAs.
• Missing information:
  • PWM of the motif
  • PWM of the background
  • Motif locations

Assumptions
• Given a sequence X and a PWM Y of the same length, we can calculate P(X|Y).
• Assume independence of motif positions, so the probability is a product over the positions i:

$$P(X \mid Y) = \prod_i P(x_i, Y_{\cdot,i})$$

• Given a PWM, we can now calculate, for each position K in each sequence J, the probability that the motif starts at K in sequence J.

Expectation Maximization (EM) Algorithm
• Start with an initial guess for the PWMs.
• The EM algorithm consists of two steps, which are repeated consecutively.
• Step 1, the expectation step: estimate the probability of finding the site at any position in each of the sequences. These probabilities are used to provide new information about the expected base or aa distribution for each column in the site.
• Step 2, the maximization step: the new counts for bases or aa for each position in the site found in step 1 are substituted for the previous set.
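The product formula and the per-position start probabilities used in step 1 can be sketched as follows. This is a simplified illustration under stated assumptions, not MEME's implementation: exactly one motif occurrence per sequence is assumed, the background is taken to be uniform (0.25 per base), the PWM values are the ones from the TFBS models slide, and the function names are hypothetical.

```python
# PWM from the TFBS models slide: PWM[base][i] is the probability of base at column i.
PWM = {
    "A": [0.1, 0.8, 0.0, 0.7, 0.2, 0.0],
    "C": [0.0, 0.1, 0.5, 0.1, 0.4, 0.6],
    "G": [0.0, 0.0, 0.5, 0.1, 0.4, 0.1],
    "T": [0.9, 0.1, 0.0, 0.1, 0.0, 0.3],
}
# Assumed uniform background model.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def window_probability(window, pwm):
    """P(X | Y): product over positions, assuming independent columns."""
    p = 1.0
    for i, base in enumerate(window):
        p *= pwm[base][i]
    return p


def start_probabilities(seq, pwm, background):
    """E-step for one sequence: probability that the motif starts at each position."""
    width = len(pwm["A"])
    weights = []
    for k in range(len(seq) - width + 1):
        motif_part = window_probability(seq[k:k + width], pwm)
        background_part = 1.0
        for base in seq[:k] + seq[k + width:]:
            background_part *= background[base]
        weights.append(motif_part * background_part)
    total = sum(weights)
    if total == 0.0:
        # No window is possible under this PWM: fall back to a uniform guess.
        return [1.0 / len(weights)] * len(weights)
    return [w / total for w in weights]


sequence = "AGCTACACCCATTTAT"
probs = start_probabilities(sequence, PWM, BACKGROUND)
best = max(range(len(probs)), key=lambda k: probs[k])
print(window_probability("TACACC", PWM))
print(best, sequence[best:best + 6])
```

In a full EM loop these start probabilities would then be used as weights when re-counting bases per motif column (the maximization step); that part is deliberately left out of the sketch.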
Expectation Maximization (EM) Algorithm
[Figure: four aligned sequences, each with a 4-column motif block (X) flanked by background positions (O).]
• Columns defined by a preliminary alignment of the sequences provide initial estimates of the frequencies of aa in each motif column.
• Columns not in the motif provide background frequencies.

    Bases   Background   Site column 1   Site column 2   ……
    G       0.27         0.4             0.1             ……
    C       0.25         0.4             0.1             ……
    A       0.25         0.2             0.1             ……
    T       0.23         0.2             0.7             ……
    Total   1.00         1.00            1.00            ……

Expectation Maximization (EM) Algorithm
[Figure: the motif window (XXXX) placed at position A, the start of sequence 1, and at position B, shifted by one.]
• Use the previous estimates of aa or nucleotide frequencies for each column in the motif to calculate the probability of the motif in this position, and multiply by the background frequencies in the remaining positions.
• The resulting score gives the likelihood that the motif matches positions A, B or other in seq 1.
• Repeat for all other positions and find the most likely location. Then repeat for the remaining seqs.

EM Algorithm, 2nd optimisation step: calculations
• The site probabilities for each seq calculated at the 1st step are then used to create a new table of expected values for base counts for each of the site positions, using the site probabilities as weights.
• Suppose that P(site 1 in seq 1) = Psite1,seq1 / (Psite1,seq1 + Psite2,seq1 + … + Psite78,seq1) = 0.01 and P(site 2 in seq 1) = 0.02.
• Then these values are added to the previous table, as shown in the table below.
• This procedure is repeated for every other possible first column in seq 1, and then the process continues for all other sequences, resulting in a new version of the table.
• The expectation and maximization steps are repeated until the estimates of base frequencies do not change.

    Bases            Background   Site column 1   Site column 2   ……
    G                0.27 + …     0.4 + …         0.1 + …         ……
    C                0.25 + …     0.4 + …         0.1 + …         ……
    A                0.25 + …     0.2 + 0.01      0.1 + …         ……
    T                0.23 + …     0.2 + …         0.7 + 0.02      ……
    Total/weighted   1.00         1.00            1.00            ……

Run MEME (http://meme.nbcr.net/meme/cgi-bin/meme.cgi)

Results

Parse results

    >>> from Bio import motifs
    >>> handle = open("meme.dna.oops.txt")
    >>> record = motifs.parse(handle, "meme")
    >>> handle.close()
    >>> len(record)
    2
    >>> motif = record[0]
    >>> print motif.consensus
    TTCACATGCCGC
    >>> print motif.degenerate_consensus
    TTCACATGSCNC

Motif attributes

    >>> motif.num_occurrences
    7
    >>> motif.length
    12
    >>> evalue = motif.evalue
    >>> print "%3.1g" % evalue
    0.2
    >>> motif.name
    'Motif 1'

Where the motif was found

    >>> motif = record['Motif 1']
    # Each motif has an attribute .instances with the sequence instances
    # in which the motif was found, providing some information on each instance
    >>> len(motif.instances)
    7
    >>> motif.instances[0]
    Instance('TTCACATGCCGC', IUPACUnambiguousDNA())
    >>> motif.instances[0].start
    620
    >>> motif.instances[0].strand
    '-'
    >>> motif.instances[0].length
    12
    >>> pvalue = motif.instances[0].pvalue
    >>> print "%5.3g" % pvalue
    1.85e-08

Amadeus
• Advanced algorithms improve upon MEME.
• Amadeus is an algorithm for motif finding.
• Appears to be one of the top algorithms in many tests.
• Java-based tool.
• Easy-to-use GUI.
• Supports analysis of TFs and miRNAs.
• Developed here in TAU.

Amadeus
A Motif Algorithm for Detecting Enrichment in mUltiple Species
Supports diverse motif discovery tasks:
1. Finding over-represented motifs in one or more given sets of genes.
2. Identifying motifs with global spatial features given only the genomic sequences.
3. Simultaneous inference of motifs and their associated expression profiles given genome-wide expression datasets.
How?
• A general pipeline architecture for enumerating motifs.
• Different statistical scoring schemes of motifs for different motif discovery tasks.

Input: ~350 genes expressed in the human G2+M cell-cycle phases [Whitfield et al. ’02]
[Figure: pairs analysis of the CHR and NF-Y (CCAAT-box) motifs.]

Clustering analysis

Clustering - reminder
• Cluster analysis is the grouping of items into clusters based on the similarity of the items to each other.
• The Bio.Cluster module provides:
  • K-means
  • SOM
  • Hierarchical clustering
  • PCA

K-means clustering (MacQueen, 65)
• Input: a set of observations (x1, x2, …, xn).
  • For example, each observation is a gene, and x holds its values.
• Goal: partition the observations into K clusters S = {S1, S2, …, Sk}.
• Objective function (the standard within-cluster sum of squares, where \mu_i is the mean of the points in S_i):

$$\arg\min_{S} \sum_{i=1}^{K} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$$

K-means clustering (MacQueen, 65)
• Initialize an arbitrary partition P into k clusters C1, …, Ck.
• For cluster Cj and element i ∉ Cj, EP(i, Cj) = cost of the solution if i is moved to cluster Cj.
• Pick EP(r, Cs) and make the move if the new partition is better.
• Repeat until no improvement is possible.
• Requires knowledge of k.

K-means variations
• Compute a centroid cp for each cluster Cp, e.g., gravity center = average vector.
• Solution cost:

$$\sum_{\text{clusters } p} \; \sum_{i \in \text{cluster } p} d(v_i, c_p)$$

• Parallel version: move each element to the cluster with the closest centroid simultaneously.
• Sequential version: one at a time.
• "Moving centers" approach.
• Objective = homogeneity only (k fixed).

Data representation
The data to be clustered are represented by an n × m Numerical Python array, data. Within the context of gene expression data clustering, typically the rows correspond to different genes whereas the columns correspond to different experimental conditions. The clustering algorithms in Bio.Cluster can be applied both to rows (genes) and to columns (experiments).
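The "parallel version" of k-means described above (assign every point to its closest centroid, then recompute all centroids) can be sketched in pure Python. This is an illustrative sketch, not Bio.Cluster's kcluster: the function names are hypothetical, squared Euclidean distance is assumed as d, and the centroids are naively initialized from the first k rows so that the run is deterministic.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Gravity center: the average vector of a non-empty list of points."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def kmeans(data, k, max_iter=100):
    """Alternate assignment and centroid updates until nothing changes."""
    centers = [list(p) for p in data[:k]]  # naive deterministic start (assumption)
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in data
        ]
        if new_assignment == assignment:
            break  # no element moved, so the partition is stable
        assignment = new_assignment
        for j in range(k):
            members = [p for p, a in zip(data, assignment) if a == j]
            if members:  # keep the old center if a cluster became empty
                centers[j] = centroid(members)
    cost = sum(squared_distance(p, centers[a]) for p, a in zip(data, assignment))
    return assignment, centers, cost


# Rows are items (e.g. genes), columns are measurements (e.g. conditions).
data = [
    [1.0, 1.1], [0.9, 1.0], [1.2, 0.9],
    [5.0, 5.2], [5.1, 4.9], [4.8, 5.0],
]
assignment, centers, cost = kmeans(data, k=2)
print(assignment, cost)
```

Because the result depends on the starting centroids, practical implementations usually repeat the procedure from several random starts and keep the lowest-cost solution.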
Distance/similarity functions
• 'e': Euclidean distance
• 'c': Pearson correlation coefficient
• 'a': Absolute value of the Pearson correlation coefficient
• 'u': Uncentered Pearson correlation (cosine of the angle between two data vectors)
• 'x': Absolute uncentered Pearson correlation
• 's': Spearman’s rank correlation

Calculating distance matrices
    >>> from Bio.Cluster import distancematrix
    >>> matrix = distancematrix(data)
data: required
Additional options:
• transpose (default: 0): determines whether the distances are calculated between the rows of data (transpose==0) or between the columns of data (transpose==1).
• dist (default: 'e', Euclidean distance)

Distancematrix
To save space, Biopython keeps only the lower triangle of the symmetric matrix

Partitioning algorithms
Algorithms that receive the number of clusters k as an argument:
• K-means
• K-medians
Often referred to as EM variations

Analysis example
    # Read the data (tab-delimited; first row and first column are labels)
    import csv
    infile = open('ge_data_example.txt', 'rb')
    data = csv.reader(infile, delimiter='\t')
    table = [row for row in data]
    infile.close()
    >>> len(table)
    100
    >>> table[1][1]
    '9.412'
    >>> table[0][0]
    'sample'
    >>> len(table[1])
    17

Analysis example
    # Transform the data to a numpy array,
    # dropping the header row and the label column
    import numpy
    mat = numpy.array([row[1:] for row in table[1:]], dtype='float')
    print len(mat)
    # Create the distance matrix
    from Bio.Cluster import distancematrix
    dist_matrix = distancematrix(mat)

Analysis example
    # Cluster
    from Bio.Cluster import kcluster
    clusterid, error, nfound = kcluster(mat)
• clusterid: array with the cluster assignment of each row
• error: the within-cluster sum of distances
• nfound: the number of times the returned solution was found

Analysis example
    >>> clusterid
    array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
    >>> error
    15988.118370804612
    >>> nfound
    1

Kcluster: other options
• nclusters (default: 2): the number of clusters k.
• transpose (default: 0): determines whether rows (transpose is 0) or columns (transpose is 1) are clustered.
• npass (default: 1): the number of times the k-means/k-medians clustering algorithm is performed.
• method (default: 'a'): describes how the center of a cluster is found:
    method=='a': arithmetic mean (k-means clustering);
    method=='m': median (k-medians clustering).
• dist (default: 'e', Euclidean distance)
• initialid (default: None): specifies the initial clustering to be used by the algorithm.

Hierarchical clustering
    from Bio.Cluster import treecluster
    tree1 = treecluster(mat)
    # Can be applied to a precalculated distance matrix
    tree2 = treecluster(distancematrix=dist_matrix)
    # Get the cluster assignments
    clusterid = tree1.cut(3)

Hierarchical clustering using SciPy
Better visualizations!
    import numpy
    import pylab
    import scipy.cluster.hierarchy as sch
    # Create a distance matrix (city-block distance between rows)
    x = mat
    D = numpy.zeros([len(x), len(x)])
    for i in range(len(x)):
        for j in range(len(x)):
            D[i, j] = numpy.sum(numpy.abs(x[i] - x[j]))

Hierarchical clustering using SciPy
    # Compute and plot first dendrogram.
    fig = pylab.figure(figsize=(8,8))
    # Add an axes at position rect [left, bottom, width, height], where all
    # quantities are in fractions of figure width and height.
    ax1 = fig.add_axes([0.09,0.1,0.2,0.6])
    # Clustering analysis
    # Note: linkage() treats a 2-D array as raw observation vectors; to
    # cluster on the distances in D themselves, pass the condensed form
    # scipy.spatial.distance.squareform(D) instead.
    Y = sch.linkage(D, method='centroid')
    Z1 = sch.dendrogram(Y, orientation='right')
    ax1.set_xticks([])
    ax1.set_yticks([])

Hierarchical clustering using SciPy
    # Plot distance matrix.
    axmatrix = fig.add_axes([0.3,0.1,0.6,0.6])
    idx1 = Z1['leaves']
    # Reorder rows and columns to follow the dendrogram leaves
    D = D[idx1,:][:,idx1]
    im = axmatrix.matshow(D, aspect='auto', origin='lower', cmap=pylab.cm.YlGnBu)
    axmatrix.set_xticks([])
    axmatrix.set_yticks([])

Hierarchical clustering using SciPy
    # Plot colorbar.
    axcolor = fig.add_axes([0.91,0.1,0.02,0.6])
    pylab.colorbar(im, cax=axcolor)
    fig.show()

Phylogenetic trees

Remember the Newick format?
Simple example without branch lengths:
    (((A,B),(C,D)),(E,F,G));

Visualizing trees
    # simple.dnd contains the Newick string shown above
    >>> from Bio import Phylo
    >>> tree = Phylo.read("simple.dnd", "newick")
    >>> print tree
    Tree(weight=1.0, rooted=False)
        Clade(branch_length=1.0)
            Clade(branch_length=1.0)
                Clade(branch_length=1.0)
                    Clade(branch_length=1.0, name='A')
                    Clade(branch_length=1.0, name='B')
                Clade(branch_length=1.0)
                    Clade(branch_length=1.0, name='C')
                    Clade(branch_length=1.0, name='D')
            Clade(branch_length=1.0)
                Clade(branch_length=1.0, name='E')
                Clade(branch_length=1.0, name='F')
                Clade(branch_length=1.0, name='G')

Visualizing trees
Use matplotlib
    >>> import matplotlib
    >>> tree.rooted = True
    >>> Phylo.draw(tree)

Phylo IO
• Phylo.read() reads a file that contains exactly one tree
• If the file holds many trees, loop over the object returned by Phylo.parse()
• Write to a file using Phylo.write(treeObj, file, format)
• Popular formats: "newick", "phyloxml"
• Convert between tree formats using Phylo.convert:
    Phylo.convert("tree1.xml", "phyloxml", "tree1.dnd", "newick")
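To make the Newick format read by the calls above more concrete, here is a minimal sketch of a parser for the simple case shown earlier: names only, no branch lengths. It is written in plain Python for illustration and is not how Bio.Phylo works internally; parse_newick and leaves are hypothetical helper names, the tree is represented as nested lists, and there is no error handling for malformed input.

```python
def parse_newick(text):
    """Parse a Newick string without branch lengths into nested lists.

    Internal nodes become lists of children; leaves become name strings.
    """
    text = text.strip().rstrip(";")
    pos = [0]  # current position, kept in a list so the helper can update it

    def parse_node():
        if text[pos[0]] == "(":
            pos[0] += 1                      # skip "("
            children = [parse_node()]
            while text[pos[0]] == ",":
                pos[0] += 1                  # skip ","
                children.append(parse_node())
            pos[0] += 1                      # skip ")"
            return children
        # Otherwise read a leaf name up to the next delimiter.
        start = pos[0]
        while pos[0] < len(text) and text[pos[0]] not in ",()":
            pos[0] += 1
        return text[start:pos[0]]

    return parse_node()


def leaves(node):
    """Return the leaf names of a parsed tree, left to right."""
    if isinstance(node, list):
        names = []
        for child in node:
            names.extend(leaves(child))
        return names
    return [node]


tree = parse_newick("(((A,B),(C,D)),(E,F,G));")
print(tree)
print(leaves(tree))
```

The nesting of the resulting lists mirrors the indentation of the printed Tree/Clade structure: one list for the root, one for each internal clade, and a string for each named leaf. For real files, Phylo.read() and Phylo.parse() remain the right tools, since they also handle branch lengths, quoting, and comments.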