Functional Annotation
Background + Strategy
The Group
27th Feb 2012
Lavanya Rishishwar, Artika Nath, Lu Wang, Haozheng Tian, Shengyun Peng, Ashwath Kumar, Hamidreza Hassanzadeh

Outline
• What is Functional Annotation
• The Importance of Functional Annotation
• The Biology of H. haemolyticus
• Background for Functional Annotation
• Pros/Cons of Available Approaches
• Planned Approach
  – Breadth
  – Depth

Functional Annotation
THE ‘WHAT?’

Genome Assembly
Assemble the Pieces Right

Gene Prediction
Identify the words
When on board HMS Beagle, as naturalist, I was much struck with certain facts in the distribution of the inhabitants of South America, and in the geological relations of the present to the past inhabitants of that continent. These facts seemed to me to throw some light on the origin of species - that mystery of mysteries, as it has been called by one of our greatest philosophers.

Functional Annotation
nat·u·ral·ist [nach-er-uh-list, nach-ruh-] noun
1. a person who studies or is an expert in natural history, especially a zoologist or botanist.
2. an adherent of naturalism in literature or art.
Origin: 1580–90; natural + -ist
DATABASES
Identify the function (i.e., meaning) of each word
When on board HMS Beagle, as naturalist, I was much struck with certain facts in the distribution of the inhabitants of South America, and in the geological relations of the present to the past inhabitants of that continent. These facts seemed to me to throw some light on the origin of species - that mystery of mysteries, as it has been called by one of our greatest philosophers.
PROFILES
Origin of Species, The noun (On the Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life) a treatise (1859) by Charles Darwin setting forth his theory of evolution.

THE GRAVITY OF THE ANNOTATION PROCESS
Not just Newtonian

“Ultimately, one wishes to determine how genes – and the proteins they encode – function in the intact organism.”
Alberts B, et al. (2002) Molecular Biology of the Cell. New York: Garland Science.

Function? What is it?
• To a cell biologist, function might refer to the network of interactions in which the protein participates or to its localization to a certain cellular compartment.
• To a biochemist, function refers to the metabolic process in which a protein is involved or to the reaction catalyzed by an enzyme.

Functional Annotation
Functional annotation consists of attaching biological information to genomic elements.
• Biochemical function
• Biological function
• Involved regulation and interactions
• Expression

Whatever happened to wet-lab?
“Experimentally annotating one complete bacterial genome varies from organism to organism.
Roughly speaking, it could take as much as $25,000 and a period of 6-12 months for completing the process” - Alejandro Caro

The Naked Truth
[Figure: No. of Genomes in KEGG (y-axis, 0 to 2000) over time (x-axis, 7/98 to 1/11). KEGG Genome: Release Update of Jan 2012]

How Do Genes Perform Their Function? Operon
• Operon: Several genes with related functions that are regulated together, because one piece of mRNA codes for several related proteins.
• Polycistronic mRNA, an mRNA coding for more than one polypeptide, is characteristic of prokaryotes

Coding and Non-coding RNAs
• Protein coding
  – Enzymes
  – Structural
  – Regulatory
  – Signal transduction
  – Receptors
  – Toxins
  – Virulence factors
  – Membrane/transmembrane
• Non-coding
  – Riboswitches
  – CRISPR
  – sRNAs
• Pathway prediction

Domain/Motif
• Domain: A discrete structural unit that is assumed to fold independently of the rest of the protein and to have its own function. ~20-100 aa
• Motif: Short, conserved regions, frequently the most conserved regions of domains. Motifs are critical for the domain to function.

Haemophilus haemolyticus - The Biography
Understanding the Target

Haemophilus haemolyticus
• Gram-negative
• Facultative anaerobe
• Known to colonize the human respiratory tract.
• Out of the 8 Haemophilus species found to colonize the respiratory tract, H. influenzae and H. haemolyticus are the most prevalent ones.
• H. haemolyticus is an emerging pathogen
  – 5 cases of invasive disease reported between 2009-10.

Strains of H. haemolyticus
Strain   Species          Disease State  State Isolated  Hemolysis  Hpd  fucK
M19107   H. haemolyticus  Asymptomatic   Minnesota       Y          -    -
M19501   H. haemolyticus  Asymptomatic   Minnesota       N          +    -
M21127   H. haemolyticus  Pathogenic     Georgia         Y          -    -
M21621   H. haemolyticus  Pathogenic     Texas           Y          -    -
M21639   H. haemolyticus  Pathogenic     Illinois        N          -    -
M21709   H. influenzae    Pathogenic     NY              N          -    +
fucK: encodes fuculose kinase. fucK deletion has been observed in some Hi isolates.
Hpd: encodes lipoprotein protein D.

Phylogeny
Nørskov-Lauritsen, N., et al. (2005). Multilocus sequence phylogenetic study of the genus Haemophilus with description of Haemophilus pittmaniae sp. nov. International Journal of Systematic and Evolutionary Microbiology, 55, 449–456.

View from 300 ft and a brief time travel

Ontology
• An ontology is a “formal, explicit specification of a shared conceptualization”
• Two major formal ontology schemes:
  – EC – Enzyme Commission Number
  – GO – Gene Ontology

Enzyme Commission (EC)
• A large-scale, comprehensive attempt to organize and classify enzymes according to their function
• For inclusion in the list, direct experimental evidence must be provided for the claimed activity
• Organizes the list of enzymes in four levels of hierarchy, starting with the 6 top-level classes:
  1. Oxidoreductases
  2. Transferases
  3. Hydrolases
  4. Lyases
  5. Isomerases
  6. Ligases

Chronology: Enzyme Commission (EC)
• Cons of EC:
  – Hierarchy only provides parent-to-child relationships
  – Specific to enzymes only (doesn't cover all proteins)

Chronology: Gene Ontology (GO)
Or in other words "give this protein a name and stick to it!!"

What is the GO?
• Molecular Function
• Biological Process
• Cellular Component
• Relations between the terms
  – ‘is_a’
  – ‘part_of’, ‘has_part’
  – ‘regulates’

Structure of GO
du Plessis L, Skunca N, Dessimoz C (2011). The what, where, how and why of gene ontology–a primer for bioinformaticians. Brief Bioinform. doi: 10.1093/bib/bbr002

General Rule To Apply Evidence Code

Where Do Annotations Come From?
• Inferred from experiment
  – Most reliable
  – Basis for computational methods
• Inferred from computational method
  – Sequence similarity, structural similarity, etc.
• Inferred from author statement
• Curator statement and obsolete evidence codes

Why use the GO?
• The ‘GO Consortium’ consists of a number of large databases working together to define standardized ontologies and provide annotations to the GO.
• Search for interacting genes
• Reason across the relations
• Analyze the results of high-throughput experiments
• Infer function of un-annotated genes and infer protein-protein interactions.

CAUTION!
PROS AND CONS OF CONVENTIONAL APPROACHES
Choosing The Right Function Prediction Tool

“Perutz et al. showed in 1960 that myoglobin and hemoglobin, the first two protein structures to be solved at atomic resolution using X-ray crystallography, have similar structures even though their sequences differ.”

Pros and Cons: There are no free lunches!
• Homology: useful, but different from “same” function
  – Simply implies common ancestry

Pros and Cons: There are no free lunches!
• The quality of a prediction is only as good as the quality of annotation in the database
• A eukaryotic function predictor cannot be used for prokaryotes, and vice versa

BREADTH AND DEPTH OF THE ANALYSIS
A Snapshot of the Iceberg Named Functional Annotation

Spectrum of Methods Selected
BREADTH

Criteria for selecting methods
1. Currently being maintained
2. Applicable to prokaryotic sequences
3. Can be installed locally (supporting batch jobs if GUI-based) OR can be included in a pipeline, i.e., has a command-line interface

Categories of Approaches
• Sequence similarity-based
• Phylogenomics-based
• Domain/pattern/profile-based
  – Domain-based
  – Pattern-based
  – Profile-based
• Sequence clustering-based
• Machine learning-based
• Network-based

Breadth: Options
Approach                      Resources
Sequence similarity based     GOtcha, PFP, GOsling, OntoBlast, GOblet, Blast2GO
Phylogenomics based           SIFTER, AFAWE, RIO, OrthoStrapper
Domain/pattern/profile based  InterProScan, TMHMM, HMMTOP, HMMER, Pfam, SUPERFAMILY, PROSITE, PRINTS, SMART, Gene3D, PANTHER, TIGRFAMs, SCOP, CATH, CatFam, PIRSF, PRODOM, EFICAz, PRIAM
Sequence clustering based     ProtoNet, CluSTr, eggNOG, COGs, InParanoid, MultiParanoid, OrthoMCL
Machine learning based        ProtFun, GOPET, SVM-Prot, ffPred, EzyPred
Network based                 MCODE, AGeS, SAMBA, RNSC, PRODISTIN, Cytoscape, STRING, VisANT, VIRGO
Pipelines                     RAST, MultiParanoid, AGMIAL, MicroScope
Legend: Dead; GUI; Proprietary; Eukaryotic Model; External Servers; InterPro; Web-based Servers

Flowchart

Description of Selected Methods
DEPTH

Level 1
The building blocks!

PanGenome Analysis
• The pangenome is the full complement of genes in a species.
• It includes the core genome, the set of genes present in all strains; the dispensable genome, genes present in 2 or more (but not all) strains; and unique genes, which are specific to single strains.
• In this case, we will be using the pangenome of Haemophilus influenzae.
• This database will be used as the reference database in BLAST.
• This method gives high-confidence annotations since the strains selected are very closely related to the organism in question.

BLAST: How does it work?
1. Divide the query sequence into short chunks called words
2. Look for exact matches in the database
3. In case of a hit, try extending the alignment

Statistical assessment
E-value: $E = m \times n \times P$
where
$m$ = total number of residues in the database
$n$ = number of residues in the query sequence
$P$ = probability that an HSP alignment is a result of random chance
For example, $m = 1 \times 10^{20}$, $n = 100$, $P = 1 \times 10^{-20}$ $\Rightarrow$ $E = 10^{20} \times 10^{2} \times 10^{-20} = 1 \times 10^{2}$

Different flavors!
• BLASTN – Queries nucleotide vs. nucleotide sequences
• BLASTP – Queries protein vs. protein sequences
• BLASTX – Queries 6 possible frames of nucleotide sequences vs. protein sequences
• TBLASTN – Reciprocal of BLASTX: queries protein sequences vs. 6 possible frames of nucleotide sequences inside the database
• TBLASTX – Queries 6 possible frames of nucleotide sequences vs. 6 possible frames of nucleotide sequences inside the database

InterPro
"InterPro provides functional analysis of proteins by classifying them into families and predicting domains and important sites."
• Combines protein signatures from a number of member databases into a single searchable resource
• Capitalizes on their individual strengths to produce an integrated database and diagnostic tool.
Current release: 36.0, 23 February 2012
New features:
• An update to Pfam (26.0) and PIRSF (2.78).
• The integration of 755 new methods from the GENE3D, PANTHER, PIRSF, Pfam and SUPERFAMILY databases.
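
Stepping back to the BLAST statistical assessment described earlier, the simplified relation E = m × n × P can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `expected_hits`, the input values, and the 1e-3 cutoff are all hypothetical choices made for this example, and BLAST itself derives E-values from alignment scores via Karlin-Altschul statistics rather than from a directly supplied P.

```python
import math


def expected_hits(db_residues: float, query_residues: float, p_random: float) -> float:
    """Return E = m * n * P, the simplified E-value relation.

    db_residues    -- m, total number of residues in the database
    query_residues -- n, number of residues in the query sequence
    p_random       -- P, probability that an HSP alignment arises by chance
    """
    if db_residues <= 0 or query_residues <= 0:
        raise ValueError("m and n must be positive")
    if not 0.0 <= p_random <= 1.0:
        raise ValueError("P must be a probability between 0 and 1")
    return db_residues * query_residues * p_random


if __name__ == "__main__":
    # Illustrative values only (assumed, not taken from a real search).
    m, n, p = 1e9, 100, 1e-17
    e_value = expected_hits(m, n, p)
    print(f"E = {e_value:.1e}")
    # Hypothetical cutoff: a lower E-value means fewer chance hits are expected.
    print("likely significant" if e_value < 1e-3 else "could be a chance hit")
```

Because E scales linearly with both database size and query length, the same alignment looks less significant when searched against a larger database, which is one reason a small, closely related reference set can be attractive.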
Member database information
Signature Database  Version  Signatures*  Integrated Signatures**
GENE3D              3.3.0    2386         1441
HAMAP               140911   1702         1686
PANTHER             7        69566        2392
PIRSF               2.78     2983         2983
PRINTS              41.1     2050         2001
PROSITE patterns    20.72    1308         1291
PROSITE profiles    20.72    922          897
Pfam                26       13672        12672
PfamB               26       20000        0
ProDom              2006.1   1894         1105
SMART               6.2      1008         1002
SUPERFAMILY         1.73     1774         1208
TIGRFAMs            10.1     4023         4002
* Some signatures may not have matches to UniProtKB proteins.
** Not all signatures of a member database may be integrated at the time of an InterPro release.

• PANTHER: Protein ANalysis THrough Evolutionary Relationships
• SMART: Simple Modular Architecture Research Tool
• HAMAP: High-quality Automated and Manual Annotation of microbial Proteomes
• PIRSF: Evolutionary relationships of proteins from super- to sub-families
• PROSITE: Database of protein domains, families and functional sites
• “SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.”
• “The Gene3D database is a large collection of CATH (Class, Architecture, Topology, Homologous superfamily) protein domain assignments for ENSEMBL genomes and Uniprot sequences.”
• “PRINTS is a database of protein family ‘fingerprints’ offering a diagnostic resource for newly-determined sequences.”

Integration into InterPro

Features of Member Databases
• ProDom: provider of sequence clusters built from UniProtKB using PSI-BLAST.
• PROSITE patterns: provider of simple regular expressions.
• PROSITE and HAMAP profiles: providers of sequence matrices.
• PRINTS: provider of fingerprints, which are groups of aligned, unweighted Position Specific Sequence Matrices (PSSMs).
• PANTHER, PIRSF, Pfam, SMART, TIGRFAMs, Gene3D and SUPERFAMILY: providers of hidden Markov models (HMMs).

Querying with InterProScan
“Sequence-based queries are performed using InterProScan, a tool that combines the different protein signature recognition methods native to the InterPro member databases into one resource.”
[Figure: query sequence submitted to InterProScan]

Querying with InterProScan
• Web version
• Stand-alone version
  – A wrapper of sequence analysis apps
  – Database and output files scanning
  – Bulk data processing

Member Databases & Scanning Methods
Member Database    Scanning Method     Software Package
PROSITE patterns   -                   -
PROSITE profiles   pfscan              Pftools
HAMAP profiles     pfscan              Pftools
PRINTS             FingerPRINTScan     -
PFAM               hmmscan             HMMER3.0b3
PRODOM             ProDomBlast         -
SMART              hmmpfam             HMMER2.3.2
TIGRFAMs           hmmscan             HMMER3.0b3
PIR SuperFamily    hmmpfam             HMMER2.3.2
SUPERFAMILY        hmmpfam/hmmsearch   HMMER2.3.2
GENE3D             hmmpfam             HMMER2.3.2
The TMHMM and SignalP prediction search algorithms are provided through the web interface at EBI. However, they are not integrated into InterPro.

Blast2GO
• B2G has been designed to (1) allow automatic and high-throughput sequence annotation and (2) integrate functionality for annotation-based data mining.

Why Blast2GO?
• Blast2GO is designed for high-throughput sequence annotation.
• Strong data mining and visualization capabilities
• Good at utilizing annotated sequences already deposited in public databases.

How does Blast2GO work?
• Basically, Blast2GO uses local or remote BLAST searches to find sequences similar to one or several input sequences.
• The program extracts the GO terms associated with each of the obtained hits and returns an evaluated GO annotation for the query sequence(s).
• Enzyme codes are obtained by mapping from equivalent GOs, while InterPro motifs are queried directly at the InterProScan web service.
• GO annotation can be visualized by reconstructing the structure of the Gene Ontology relationships, and ECs are highlighted on KEGG maps
• OBTAINING GO TERMS
  – The first step is to find sequences similar to a query set by BLAST searching. Homology search can be done either against public databases or against custom databases when a local BLAST installation is available.
  – Using BLAST hit gene identifiers (gi) and gene accessions, B2G retrieves all GO annotations for the hit sequences, together with their evidence codes (abbreviated EC here, not to be confused with Enzyme Commission numbers).
• ANNOTATION ASSIGNMENT
  – annotation score (AS), direct term (DT)
• STATISTICS
  – Statistical assessment of GO term enrichment in a group of interesting genes compared with a reference group (Blüthgen et al., 2004).
  – Gossip computes Fisher’s Exact Test, applying robust FDR (false discovery rate) correction for multiple testing, and returns a list of significant GO terms ranked by their corrected or one-test P-values
• VISUALIZATION

Systems for Functional Annotation
• Clusters of Orthologous Groups (COGs)
• euKaryote Orthologous Groups (KOGs)
• Gene Ontology (GO)
• Enzyme Commission no. (EC)

Clusters of Orthologous Groups of Genes (KOGs, COGs) – Why?
• Orthologs typically retain the same function during evolution and hence have a critical role in functional annotation. COGs provide a framework for functional analysis.
• They are also important for phylogenetic and evolutionary analysis of genomes. Interpretable phylogenetic trees generally can be constructed only within sets of orthologs.

How to find Orthologous genes?
• Naive approach: for a query gene and a target genome, the hit with the highest similarity score is taken to indicate the orthologous relationship
  – Gives good results for not-so-distant species
  – What about larger phylogenetic distances?
• Gene duplications: suggest that a many-to-many relationship is required
• What if several hits with not-so-high scores emerge? A stringent threshold may lead to false negatives
• COG approach: any two genes inside a COG are either orthologous genes or members of orthologous groups of paralogs

How to create COGs
1. Choose all 2-permutations of available genes and perform pairwise comparisons between genes from different clades (in this case 5 clades)
   Genes (n)   2-permutations, n(n-1)
   2           2
   10          90
   3000        ~9.0e6
   17967       ~3.2e8
2. Best hits (BeTs) in other organisms are identified
3. Build the graph of consistent relations (does not depend on an absolute threshold level)
4. The simplest case is a triangle: if a gene yields best hits in two other genomes, and those two genes are also best hits of each other, the three genes form a triangle of consistent BeTs and are likely orthologs
5. Merge all triangles that share a common side

How to create COGs - continued
6. Due to the existence of paralogs, BeTs are not necessarily symmetrical (i.e., not always Reciprocal Best Blast Hits, RBBH)
Tatusov, Koonin & Lipman, Science 278, 631 (1997)

Facing challenges when creating COGs
• The clusters, however, are subject to ambiguity:
  – Proteins with distinct regions (multi-domain proteins), each belonging to a different conserved family.
    • Solution: further inspection of domains
  – When one gene in a pair of paralogs is lost in one lineage (but not in the other), the two COGs may be artificially merged.
    • Solution: similarity measures

COGs vs. Gene Function
• Each COG includes proteins from at least 3 major clades with an estimated divergence time of over a billion years.
  Hence they are ancient conserved families with important (if not essential) functions
• Accordingly, the proteins belonging to mysterious COGs are good candidates for further analysis

Clusters of Orthologous Groups (COGs)
http://www.ncbi.nlm.nih.gov/COG/

Classification of COGs by functional categories
INFORMATION STORAGE AND PROCESSING
[J] Translation, ribosomal structure and biogenesis
[A] RNA processing and modification
[K] Transcription
[L] Replication, recombination and repair
[B] Chromatin structure and dynamics
CELLULAR PROCESSES AND SIGNALING
[D] Cell cycle control, cell division, chromosome partitioning
[Y] Nuclear structure
[V] Defense mechanisms
[T] Signal transduction mechanisms
[M] Cell wall/membrane/envelope biogenesis
[N] Cell motility
[Z] Cytoskeleton
[W] Extracellular structures
[U] Intracellular trafficking, secretion, and vesicular transport
[O] Posttranslational modification, protein turnover, chaperones
METABOLISM
[C] Energy production and conversion
[G] Carbohydrate transport and metabolism
[E] Amino acid transport and metabolism
[F] Nucleotide transport and metabolism
[H] Coenzyme transport and metabolism
[I] Lipid transport and metabolism
[P] Inorganic ion transport and metabolism
[Q] Secondary metabolites biosynthesis, transport and catabolism
POORLY CHARACTERIZED
[R] General function prediction only
[S] Function unknown

LipoP
• A tool used mainly to predict lipoprotein signal peptides.
• It is most suitable for Gram-negative bacteria but has been shown to have considerable accuracy for Gram-positive bacteria as well.
• It uses hidden Markov models to distinguish between lipoproteins (SPaseII-cleaved proteins), SPaseI-cleaved proteins, cytoplasmic proteins, and transmembrane proteins.

Thank You!
To be continued…