Blast2GO presentation @ StatSeq COST workshop, 21st-23rd April, Helsinki, Finland
Friday 25th January 2013, Royal Melbourne Hospital

Why Blast2GO
- Functional characterization of novel sequence data
- Adapted to the high-throughput needs of biological laboratories
- Extracting knowledge about the functioning of genomes

Blast2GO Impact

Outline
- Concepts of Functional Annotation
- The Blast2GO annotation framework
- Visualization of functional data
- Pathway analysis with Blast2GO

Concepts of Functional Annotation
- What is functional annotation?
- How to annotate a large dataset?

The Gene Ontology
- Three branches: Biological Process, Molecular Function, Cellular Component
- Annotations are given at the most specific (lowest) level
- True path rule: annotation at a given term implies annotation to all its parent terms
- Annotation is given with an Evidence Code:
  o IDA: inferred from direct assay
  o TAS: traceable author statement
  o ISS: inferred from sequence similarity
  o IEA: inferred from electronic annotation
  o …
[Figure: GO hierarchy, from more general to more specific terms]

Functional assignment
- Annotation: Empirical, Transference
- Literature reference
- Phylogeny
- Molecular interactions
- Gene/protein expression
- Biochemical assay
- Structure comparison
- Identification of folds
- Sequence analysis
- Sequence homology
- Motif identification

Annotation by similarity: concerns
[Figure: the GO terms of the HIT (GO1, GO2, GO3, GO4) are transferred to the QUERY]
- Level of homology (transfer is possible from about 40-60%)
- The overlap between hit and query; the association between function and structure
- The paralog problem: genes with similar sequences might have different functional specifications
- The evidence for the original annotation
- Balance between quality and quantity: depends on the use

The Blast2GO annotation framework

Application scheme
[Figure: application scheme; visible labels: Fasta, biological process, cellular component]

Basic annotation procedure
[Figure, Blast step: each query sequence (Sq1-Sq4) is matched against its hits (Hit1-Hit4)]
[Figure, Mapping and Annotation steps: the GO terms of each hit (go1, go2, ...) are collected per query sequence (Sq1-Sq4), and the annotations are then selected from these candidates]

Annotation Rule
- Let GO1…n be the candidate annotations for sequence S1, obtained from hits H1…k
- We compute an annotation score (AS) for each GOi that depends on:
  - The similarity between sequence S1 and Hj
  - The evidence code of GOi
  - The existence of other neighbouring GO candidates
  - The structure of the Gene Ontology
- We define an arbitrary annotation threshold (AT)
- S1 is annotated with GOi if AS(GOi) > AT

Annotation Rule
[Figure: candidate terms GO1-GO4 in the GO graph; labels: Possibility of abstraction; Similarity Requirement; Quality of source annotation (IEA = 0.7, IDA = 1, NR = 0.0, ...); Annotation Score; Cut-Off Value; new annotation; True-Path-Rule]
- Selectivity vs. specificity

Blast2GO annotation rule
- When a GO term has ECw = 1 and abstraction is not allowed (GOw = 0), the Annotation Score = % similarity
- If ECw < 1, a higher similarity is required to obtain the same Annotation Score
- If abstraction is allowed (GOw > 0), the required Annotation Score can be reached at a parent node with less similarity

www.blast2go.com

Start Blast2GO

Blast2GO Application
(1) Blast
(2) Mapping
(3) Annotation
Main Sequence Table: any operation will only affect the selected sequences!
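As a rough illustration of the annotation rule described earlier, the three statements about ECw and GOw can be played through in a few lines of Python. This is only a sketch: the slides do not give the exact formula, so the additive abstraction term, the names `annotation_score` and `n_merged`, and the threshold of 55 are assumptions made for this example. Only the evidence-code weights are taken from the slide.

```python
# Evidence-code weights (ECw) quoted on the slide: IDA = 1, IEA = 0.7, NR = 0.0.
EC_WEIGHTS = {"IDA": 1.0, "IEA": 0.7, "NR": 0.0}


def annotation_score(hits, n_merged=1, go_weight=0.0):
    """Score one candidate GO term (illustrative, not the tool's actual code).

    hits      -- list of (percent_similarity, evidence_code) pairs supporting the term
    n_merged  -- hypothetical count of candidate terms merged at this node
                 (1 means the term itself, i.e. no abstraction)
    go_weight -- GOw; 0 disables abstraction
    """
    # Direct part: best hit similarity, down-weighted by the evidence code.
    direct = max(sim * EC_WEIGHTS[code] for sim, code in hits)
    # Assumed abstraction part: every extra candidate merged at a parent node adds GOw.
    abstraction = (n_merged - 1) * go_weight
    return direct + abstraction


THRESHOLD = 55  # hypothetical annotation threshold (AT)

# ECw = 1 and GOw = 0: the score equals the % similarity.
print(annotation_score([(80, "IDA")]))
# ECw < 1: the same similarity yields a lower score.
print(annotation_score([(80, "IEA")]))
# 50% similarity alone does not pass the threshold for the term itself ...
print(annotation_score([(50, "IDA")]) > THRESHOLD)
# ... but with abstraction (GOw > 0) a parent node can still pass.
print(annotation_score([(50, "IDA")], n_merged=3, go_weight=5) > THRESHOLD)
```

The point of the sketch is the trade-off named on the slide: lowering ECw raises the similarity needed for a given score, while a positive GOw lets a more general parent term be annotated when the specific term falls short.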
Application statistics
Blast results
Application messages
Graph visualisation

Load sequences
Input data (in FASTA format, AA or nt)
>my_favourite_species_seq1 | still unknown
gtgatggaaaagaaaagttttgttatcgtcgacgcatatgggtttctttttcgcgcgtattatgcgctgcctggattaagcacctcatacaattttcctgtaggaggtgtatatggtttt
ataaacatacttttgaaacatctctctttccacgatgcagattatttagttgtggtatttgattcggggtcgaaaaattttcgtcacactatgtattccgaatacaaaactaatcgccct
aaagcaccagaggatctgtcactacaatgtgctccgctacgtgaggctgttgaagcgtttaatattgtaagtgaagaagtgcttaactacgaagcagacgacgtaatagcta
cactctgtacaaaatatgcatctagtaatgttggagtgagaatactgtcagcagataaggatttactacaactcctaaatgataatgttcaagtttacgaccctataaaaagca
gatacctcaccaatgaatacgttttagaaaaatttggtgtttcatcagataagttgcatattgatacggttgcatcgagttataatgagaaaattattctcagctaagctgtacacc
gtttattacacactcgaaaggccgttag as
>my_favourite_species_seq2 | no clue df
ttgttagctaaaaaggaagactttcacacctttggtaatggtgttggctctgctggaacaggtggagttgtagtttctgcatccatgttgtctgcggatttttcaaatcttagagaaga
gatagcagcggttagtacggctggtgcagattggttacacattgatgtgatggatgggtgcttcgtccccagtttgactatgggtcctgtggtgatttccggcattaggaaatgta
caaatatgtttcttgatgtgcatttgatgattaatcgcccaggcgatcatctgaagagtgtggtagatgctggagctgataagatagagcacattcgcaagatgatagaggaa asdf
agctcatcaaccgcgaaaatcgctgttgatggtggtgtttcaacggataatgcccgggctgttatcgaggcaggtgcgaatatactcgttgttggaacggcgctgtttgctgctg
acgatatgagtaaagttgtaagaactttaaaatcattttaa
>my_favourite_species_seq3 | just sequenced
gtgggactgctcatccctgtaggcagggtggctattttttgtgtaaaggcagtctttcatagtcttgtaccgccatactatctatggataactacaaagcagttttttgaggtgtggttt
ttctctcttcctatagtagcagttacatctttgtttacgggaggcgcgttagcccttcaggataccctcgtgggaagcgctaaagtatcagggtaatggagtttttactcctgcaag
atgtaatagagggtctggtaaaagctgtatcgtttgggctggtaatttcgctagttgggtgttacaacgggtatcactgtgagataggcgcaaggggtgtaggaacagcgaca
acaaaaacttcggtagcagcttctatgctcataattttgttaaactatataattactgttttttacgcgta
>my_favourite_species_seq4 | we will see soon...
atgtacgctgtatctctttcaaatttgcatgtctctttcaacaacaaggaggttttgaaaggtgttgacttggacatagcatggggggattccctggttatactgggagaatctggta
gtggaaagtctgtactaacaaaggttgtattgggtctaatagtgccccaagagggaagtgttactgtagatggcaccaatattcttgagaataggcagggcatcaagaatttt
agtgttttgtttcaaaactgtgcgttatttgacagtcttacgatttgggaaaatgtagtattcaatttccgtaggaggcttcgtttagataaggataatgccaaggctttggctttacgg
ggattggagcttgtgggattggacgccagtgtaatgaacgtgtatcctgtggagctatcaggcgggatgaaaaagcgcgtagctttggcaagagctattataggtagtccca
aaattctaattttggatgagccaacttcgggattggatcctataatgtcttcagtggt

BLAST
- Your email address
- BLAST program (normally blastx)
- BLAST database (many options)
- E-Value (depends on the DB)
- Number of HITs (use <= 20)
- Recommended to save as XML
- Human-readable seq. descriptions via BDA

Additional BLAST params
- Set word size and filter
- Use your own server
- Minimum HSP length
- Filter by description
- Parsing options for own databases

BLAST Results (RED)

Blast Distribution Charts
- Evaluate the similarity of your sequences with public DBs

Single Sequence Menu

Mapping Results (GREEN)

Annotation Menu
- BLAST-based annotation
- Other annotation modes
- Validation and Annex

Annotation
- Allows setting a minimum percentage of the HIT sequence that must be spanned by the QUERY sequence
- This helps to avoid the problem of cis-annotation

Annotation Result (BLUE)

Annotation Charts
- Commonly, level 5 is the most abundant specificity level in the Gene Ontology

Additional Annotation: ANNEX
- Recovers implicit biological process and cellular component GO terms based on molecular function annotations
- Molecular Function "is involved in" Biological Process
- Molecular Function "acts in" Cellular Component
- Myhre et al., Bioinformatics 2006

Additional Annotation: InterProScan
- Runs InterProScan searches at the EBI through Blast2GO
- Once you have completed your InterPro annotation, results can be transformed to GO terms and merged with the Blast annotation
- Results are stored on your computer as XML files.
- You can upload them later

InterProScan Results
- Column with InterProScan results

Additional Annotation: GOSlim
- GOSlim is a cut-down version of the Gene Ontology with a reduced vocabulary → helps to summarize information
- After GOSlim transformation, sequences turn YELLOW
- Different GOSlims are available in Blast2GO

Enzyme annotation and KEGG maps
- GO
- Enzyme Codes
- KEGG maps

Manual Curation
- You can manually modify the annotation of particular sequences
- If you tick this box, curated sequences turn purple

Export Results
- Saves the complete B2G project (heavy)
- Export annotation results in different formats

Export formats

.annot (also for import!)
C04018C10  GO:0004707    mitogen-activated protein kinase 3
C04018C10  EC:2.7.11.24
C04018A12  GO:0016798    class iv chitinase
C04018A12  GO:0000272

GeneSpring Format
C04013E10 | response to water deprivation; regulation of transcription; multicellular organismal development; response to abscisic acid stimulus; | nucleus; | transcription factor activity;
C04013A12 | translation; | ribosome; plastid; | structural constituent of ribosome;
C04013C12 | galactose metabolic process; | plastid; | aldose 1-epimerase activity; carbohydrate binding;

GoStat
C04018C10  4707,9409,6979,10200,5524,169
C04018A12  16798,272,44248
C04018C12  4869,12505,8233

By Seq
C04018A02  glyoxalase i                  GO:0004462  F:lactoylglutathione lyase activity
C04018C02  metallothionein-like protein  GO:0046872  F:metal ion binding
C04018G02  protein phosphatase           GO:0008287  C:protein serine/threonine phosphatase complex

More export formats

Export Sequence Table
Seq. Name | Seq. Description | Seq. Length | #Hits | min. eValue | mean Similarity | #GOs | GOs | Enzyme Codes | InterProScan
C04018C12 | cysteine proteinase inhibitor | 663 | 20 | 25 | 80.00% | 3 | F:GO:0004869; C:GO:0012505; F:GO:0008233 | | IPR000010; IPR01807
C04018E12 | protein phosphatase 2c | 663 | 20 | 77 | 85.00% | 2 | N:GO:0015071; F:GO:0003824 | | IPR001932; IPR01404
C04018G12 | alpha beta fold family protein | 578 | 20 | 84 | 79.00% | 4 | F:GO:0016787; C:GO:0005739; C:GO:0009507; P:GO:00 | | noIPR
C04018A02 | glyoxalase i | 600 | 20 | 64 | 74.00% | 2 | P:GO:0005975; F:GO:0004462 | EC:4.4.1.5 | IPR004360; noIPR
C04018C02 | metallothionein-like protein | 625 | 18 | 14 | 74.00% | 1 | F:GO:0046872 | | IPR000347
C04018E02 | haemolysin-iii related family expressed | 612 | 20 | 32 | 72.00% | 1 | C:GO:0016020 | | noIPR
C04018G02 | protein phosphatase expressed | 645 | 20 | 97 | 81.00% | 5 | C:GO:0008287; N:GO:0015071; P:GO:0006470; C:GO:00 | | no IPS match
C04018C04 | phosphoglycerate bisphosphoglycerate mutase family protein | 780 | 20 | 63 | 66.00% | 2 | P:GO:0008152; F:GO:0003824 | | IPR001345; IPR01307
C04018E04 | polyubiquitin | 707 | 20 | 115 | 99.00% | 2 | P:GO:0006464; C:GO:0005622 | | IPR000626; IPR01995
C04018G04 | meiotic recombination 11 | 575 | 20 | 45 | 89.00% | 21 | C:GO:0019013; P:GO:0007126; F:GO:0004519; F:GO:000 | | IPR003701; IPR00484
C04018A06 | late embryogenesis-abundant protein | 648 | 20 | 43 | 68.00% | 2 | P:GO:0009737; P:GO:0009409 | | no IPS match

Export BestHit Data
Sequence name | Sequence desc. | Sequence length | Hit desc. | Hit ACC | E-Value | Similarity | Score | Alignment length | Positives
C04018C10 | mitogen-activated protein kinase 3 | 717 | gi|122894104|gb|ABM67698.1|mitogen-activated protein kinase [Citrus sinensis] | ABM67698 | 1.35E-123 | 99 | 445.28 | 222 | 221
C04018E10 | ---NA--- | 706 | gi|157356307|emb|CAO62459.1|unnamed protein product [Vitis vinifera] | CAO62459 | 2.69E-036 | 83 | 155.22 | 119 | 99
C04018G10 | protein | 620 | gi|114153154|gb|ABI52743.1|10 kDa putative secreted protein [Argas monolakensis] | ABI52743 | 7.47E-015 | 63 | 83.57 | 90 | 57
C04018A12 | class iv chitinase | 715 | gi|3608477|gb|AAC35981.1|chitinase CHI1 [Citrus sinensis] | AAC35981 | 1.45E-061 | 78 | 239.2 | 171 | 134
C04018C12 | cysteine proteinase inhibitor | 663 | gi|8099682|gb|AAF72202.1|AF265551_1 cysteine protease inhibitor [Manihot esculenta] | AAF72202 | 9.33E-025 | 83 | 116.7 | 99 | 83
C04018E12 | protein phosphatase 2c | 663 | gi|46277128|gb|AAS86762.1|protein phosphatase 2C [Lycopersicon esculentum] | AAS86762 | 2.76E-077 | 91 | 291.2 | 180 | 164
C04018G12 | alpha beta fold family protein | 578 | gi|147865769|emb|CAN83251.1|hypothetical protein [Vitis vinifera] >gi|157339464|emb|CAO44005.1| unn | CAN83251 | 1.67E-084 | 94 | 314.69 | 179 | 169
C04018A02 | glyoxalase i | 600 | gi|2213425|emb|CAB09799.1|hypothetical protein [Citrus x paradisi] | CAB09799 | 2.16E-064 | 81 | 248.05 | 114 | 93
C04018C02 | metallothionein-like protein | 625 | gi|3308980|dbj|BAA31561.1|metallothionein-like protein [Citrus unshiu] | BAA31561 | 2.23E-014 | 100 | 82.03 | 40 | 40

Sequence Selection
- Tool to obtain a selection based on annotation status
- By Name/Description
- By Function

View Menu
- Functions to switch between displaying IDs or descriptions for GO annotation or InterPro results

Hands-on I
- Annotate 10 seqs with Blast2GO

Visualization
- How to understand the functional context of an annotated dataset

Combined Graph
- Each term has a number of sequences associated
- Nodes can be coloured to indicate relevance
- Each term is displayed within its biological context
- Node shape differentiates between direct and indirect annotation

Combined Graph
- Different GO branches
- Reduces nodes by number of annotated sequences
- Node data to be displayed
- Criterion for highlighting and filtering nodes

Node information content
- Accumulated by GO term: Sequence Count (5 in the example figure)
- Incoming information: Node Score (2.5 in the example figure)
  $\mathrm{score}(g') = \sum_{g \in \mathrm{desc}(g')} \mathrm{seq}(g) \cdot \alpha^{\mathrm{dist}(g, g')}$
[Figure: example GO graphs showing sequence counts and node scores per node]

Compacting Graphs by GO-Slim

Saving Options
- Save as picture and as txt

Graph Charts
• Sequence Distribution/GO as Bar-Chart
• Sequence Distribution/GO as Level-Pie (level selection)
• Sequence Distribution/GO as Multilevel-Pie (#score or #seq cutoff)

Analysis of a specific function
- How many sequences are annotated to the function “photosynthesis”?
- Option 1: Find in the GO graph -> direct & indirect annotation
- Option 2: Find through the Select function. Two sub-options:
  - Option 2.1: Direct annotation (use GO-ID or description)
  - Option 2.2: Direct & indirect (use GO-ID and “include GO parents”)
- Find a function on the graph (export, search)
- In the exported sequence table you see the sequences annotated to the function
- Select all sequences annotated to this function and its descendants
- Locate these sequences

Hands-on II
- Summary statistics
- Visualize & Search

Pathway analysis with Blast2GO
- Which cellular functions are important in my experiment?

Functional Enrichment Analysis
- One gene list (responsive genes): Biosynthesis 54%, Sporulation 18%
- The other list (non-responsive genes): Biosynthesis 18%, Sporulation 18%
- Are these two groups of genes carrying out different biological roles?
- Are pathway frequencies different?

Fisher's Exact Test
Contingency table:
     Biosynthesis  No biosynthesis
A    6             5
B    2             9
p-value for Biosynthesis = 0.0913
- Genes in group A are not significantly associated with either biosynthesis or sporulation.

Multiple testing correction
- We do this for every GO term in our dataset!
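The single test behind the contingency table above (A: 6 / 5, B: 2 / 9) can be reproduced with a short, dependency-free sketch. The function name `fisher_exact` and its return convention are choices made for this example, not Blast2GO's implementation. In this sketch the slide's p-value of 0.0913 corresponds to the one-sided (over-representation) tail.

```python
from math import comb


def fisher_exact(a, b, c, d):
    """Return (one_sided_greater, two_sided) p-values for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability of k first-column counts in row 1, margins fixed.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # One-sided: tables with at least as many first-column counts in row 1.
    greater = sum(prob(k) for k in range(a, highest + 1))
    # Two-sided: all tables that are no more probable than the observed one.
    two_sided = sum(
        prob(k)
        for k in range(lowest, highest + 1)
        if prob(k) <= p_observed * (1 + 1e-9)
    )
    return greater, two_sided


# Rows: group A and group B; columns: biosynthesis / no biosynthesis.
greater, two_sided = fisher_exact(6, 5, 2, 9)
print(f"one-sided p = {greater:.4f}")
print(f"two-sided p = {two_sided:.4f}")
```

Because both groups contain 11 genes, the table is symmetric and the two-sided p-value here is exactly twice the one-sided one. In an enrichment analysis this calculation would be repeated once per GO term.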
- Many tests => many false positives => we need correction!
- FDR control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons. In a list of rejected hypotheses, FDR controls the expected proportion of incorrectly rejected null hypotheses.
- FWER control: the familywise error rate is the probability of making one or more false discoveries among all the hypotheses when performing multiple pairwise tests (more conservative).

Different types of comparisons
- Compare two equivalent conditions (root vs leaves)
  - Remove common IDs
  - Test-Set and Ref-Set are interchangeable
- Compare a subset against the total
  - Common IDs removed from reference
  - Test-Set and Ref-Set are NOT interchangeable
[Figure: overlapping sets labelled Set 1 / Common IDs / Set 2 and TestSet / Common IDs / RefSet]

FET in Blast2GO
- The two-tailed test identifies not only over-represented but also under-represented functions.
- If no Ref-Set is chosen, all annotations are used as reference

FatiGO Results
- Result table with link-outs to sequence lists

Most specific terms
- Retains only the lowest, most specific enriched term per GO branch

Enriched Graph
- View enriched terms data as DAG graphs!
- reduce => to draw all nodes, set filter to 1

Hands-on III
- Enrichment Analysis

Concluding Remarks
- Blast2GO is a versatile tool for the annotation of sequence data
- Blast2GO uses controlled vocabularies and an elaborate annotation rule to generate GO labels
- Visualization and data mining functions help to understand the functional content of your dataset