Large-scale knowledge aggregation for infectious diseases
ASEAN-China International Bioinformatics Workshop, Singapore, 17th April 2008
Olivo Miotto
Institute of Systems Science and Yong Loo Lin School of Medicine, National University of Singapore

Large-scale Research Questions
What can we learn from large-scale studies of pathogens?
- Does H5N1 avian influenza have pandemic potential?
- What makes human flu different from avian flu?
- What are stable potential immune epitopes to use as vaccine candidates for influenza?
- How does each serotype of dengue differ from all others?
Callouts added to these questions on the slide: large scale, statistical evidence, historical data, systematic analysis.

We need Metadata!
Metadata = descriptive data about sequences
- To compare avian vs human viruses, you need host organism information
- For conservation analysis, you need serotype and host information
- To study a period of virus evolution, you need date information
- For a balanced dataset, you may need to filter by country, date and subtype

Knowledge Mining
[Workflow diagram with four tasks]
- Knowledge Aggregation: extract desired source knowledge from public databases
- Characteristic Mutations Analysis: identify mutations in H5N1 that characterize transmissibility amongst humans
- Conservation Analysis: identify evolutionarily stable regions across subgroups
- Active Text Mining: identify biomedical literature with cross-reactivity information
Other labels in the diagram: Public Database Records; User-defined Queries; User-defined Extraction Rules and Priorities; User-defined Dictionaries; User-defined Patterns; Cross-reference Identifiers; Viral Protein References; Viral Sequence and Metadata; Evidence of strain cocirculation; H5N1 mutation map; Epitope Vaccine Candidates; Biomedical Text; Documents with Cross-reactivity information; Curator's Knowledge; Previous Annotations.

Scalability in Bioinformatics Knowledge Mining
- Integrative scalability: we need to integrate heterogeneous information from multiple data repositories with multiple purposes
- Quantitative scalability: we need methods that can leverage and effectively explore large-scale data sets
- Hierarchical scalability: we need to cascade analysis tasks, flowing knowledge from one task to the next

Obstacles to Scalability
- Heterogeneity of biological databases
  - Systemic: access to data in different databases
  - Syntactic: data formats, use of free text
  - Structural: different table structures in different databases
  - Semantic: data with different meaning and intent
- Semantic heterogeneity is particularly insidious
  - Data is rarely used in the way it was originally intended
- Low level of end-user technical expertise
  - Biologists, not computer scientists
  - Excel spreadsheets, Web page "scraping"
  - Does not scale up

Semantic Heterogeneity in GenBank
[Figure: example GenBank records labelled "Good", "Not so Good" and "Pretty Bad"]

Semantic Heterogeneity in GenBank
- Fields (e.g. country/date) are inconsistently encoded
- Inconsistent level of detail between databases
- Inconsistent field location within different records of the same database
- Implicit encoding of the data (e.g. within the title of a publication)
- Multiple usage of the same field
Usage of the isolation_source field in different GenPept records:
  BAC77216   /isolation_source="Samoa"
  AAN74539   /isolation_source="isolated in 1993"
  AAT85667   /isolation_source="Homo sapiens"

Influenza Large-Scale Studies
- Analyze all influenza protein sequences available
  - GenBank + GenPept = 92,343 documents
  - Final dataset comprises 40,169 unique sequences
- Various types of analysis, e.g.
  - Identify amino acid mutation sites that characterize human-transmissible strains
  - Compare the diversity of viral sequences over different periods of time and geographical areas
- Several metadata fields required: protein name, host, subtype, country, isolate, year
- Manual curation is not an option!
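The three isolation_source values above hold a country, a date and a host in the same field. The sketch below shows one way such a field could be sorted into the metadata it actually carries. It is illustrative only and is not the tool presented in this talk: the function name `classify_isolation_source` and the tiny `COUNTRIES` and `HOSTS` vocabularies are assumptions made up for the example, and a real study would need far larger dictionaries.

```python
import re

# Hypothetical, deliberately tiny vocabularies (assumptions for this example).
COUNTRIES = {"samoa", "vietnam", "thailand", "indonesia"}
HOSTS = {"homo sapiens": "human", "gallus gallus": "chicken"}
# A four-digit year starting with 19 or 20, anywhere in the text.
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")


def classify_isolation_source(value):
    """Guess which metadata fields a free-text isolation_source value carries."""
    text = value.strip().lower()
    fields = {}
    if text in HOSTS:
        fields["host"] = HOSTS[text]
    if text in COUNTRIES:
        fields["country"] = value.strip()
    match = YEAR.search(text)
    if match:
        fields["year"] = int(match.group(0))
    return fields


if __name__ == "__main__":
    for raw in ("Samoa", "isolated in 1993", "Homo sapiens"):
        print(raw, "->", classify_isolation_source(raw))
```

Even this toy version shows why the same qualifier cannot be mapped to a single metadata field: each record has to be inspected before its value can be used.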
The Aggregator of Biological Knowledge
- An end-user environment for data retrieval, extraction and analysis
- Uses XML technology and structural rules to allow biologists to extract and reconcile the data needed
- Wrapper framework provides access to multiple sources
- Manages extracted results
- Offers plug-in architecture for analysis tools
[Diagram labels: Public Repositories, Researcher, Data Collection, Data Management, Data Analysis, KDD, ABK System; connector labels: input, augment, query, manage, control, filter]

ABK Structural Rules
- Hierarchical value reconciliation
- Automatic formation of XML Structural Rule
- Concise visualization of XML as name/value tree
- Familiar presentation of metadata for biologists
- Point-and-click selection of location and constraints
- Tabulated visualization and manual curation
- RDF storage and output

Data Extraction and Cleaning
[Screenshot of DENV-1 sequences, with callouts:]
- Values produced by user-defined rules
- Different rules (or different documents) produced conflicting values
- User can fill in or override values

Rule performance
[Figure: five panels (Year, Origin, Host, Isolate Name, Subtype), each with GenBank and GenPept bars scaled from 0% to 100% and broken down by extraction rule (rule1 to rule6)]
- Multiple rules often needed
- Some properties are very fragmented

Can H5N1 viruses spread amongst humans?

The Antigenic Variability Analyzer (AVANA)

Using MI to detect Characteristic Sites
- At a characteristic site, the residue observed is strongly associated with a set of sequences. E.g.
  Arg -> Avian, Thr -> Human
- This association is explored by measuring the mutual information (MI) between
  - the residue observed at a site
  - the label of the set in which it is observed
- MI is in the range 0 to 1.0
  - MI = 0.0 -> no statistical association between the residue observed and the set in which it is observed
  - MI = 1.0 -> residues observed in one set are never observed in the other, and vice versa

PB2 Protein
[Figure: MI and entropy plotted along the PB2 protein for A2A (719 sequences) and H2H (1650 sequences)]
- Spikes indicate characteristic sites

RNP proteins: PB2
[Figure: map of PB2 (759 aa) with 17 characteristic sites, showing the A2A and H2H residues at each site]
- Characteristic sites: 9, 44, 64, 81, 105, 199, 271, 292, 368, 475, 567, 588, 613, 627, 661, 674, 702
- Annotated regions: PB1 binding, NP binding, RNA cap binding, Nuclear Localization Signal
- Source: http://www-micro.msb.le.ac.uk/3035/Orthomyxoviruses.html

H2H characteristic mutations in H5N1
[Table: residue at each characteristic site, with columns grouped by protein (M1, M2, NP, NS1, NS2, PA, PB1, PB2) and one row for the A2A set, one for the H2H set and one for each H5N1 isolate listed below; the per-site residue columns are not reproduced here]
Isolates (year, country, isolate name):
  1997,HONG KONG,A/Hong Kong/156/97
  1997,HONG KONG,A/Hong Kong/481/97
  1997,HONG KONG,A/Hong Kong/482/97
  1997,HONG KONG,A/Hong Kong/483/97
  1997,HONG KONG,A/Hong Kong/486/97
  1997,HONG KONG,A/Hong Kong/532/97
  1997,HONG KONG,A/Hong Kong/538/97
  1997,HONG KONG,A/Hong Kong/542/97
  1998,HONG KONG,A/HongKong/97/98
  2003,HONG KONG,A/HK/212/03
  2003,HONG KONG,A/HK/213/03
  2004,THAILAND,A/THAILAND/5(KK-494)/2004
  2004,VIETNAM,A/Viet Nam/1194/2004
  2004,VIETNAM,A/Viet Nam/1203/2004
  2004,VIETNAM,A/Viet Nam/3046/2004
  2004,VIETNAM,A/Viet Nam/3062/2004
  2004,VIETNAM,A/Vietnam/CL01/2004
  2004,VIETNAM,A/Vietnam/CL26/2004
  2005,INDONESIA,A/Indonesia/5/2005
  2005,INDONESIA,A/Indonesia/CDC184/2005
  2005,INDONESIA,A/Indonesia/CDC287E/2005
  2005,INDONESIA,A/Indonesia/CDC292T/2005
  2005,INDONESIA,A/Indonesia/CDC7/2005
  2005,THAILAND,A/Thailand/676/2005
  2005,VIETNAM,A/Vietnam/CL105/2005
  2005,VIETNAM,A/Vietnam/CL115/2005
  2005,VIETNAM,A/Vietnam/CL2009/2005
  2006,CHINA,A/human/Zhejiang/16/2006
  2006,INDONESIA,A/Indonesia/CDC326/2006
  2006,INDONESIA,A/Indonesia/CDC329/2006
  2006,INDONESIA,A/Indonesia/CDC357/2006
  2006,INDONESIA,A/Indonesia/CDC390/2006
  2006,INDONESIA,A/Indonesia/CDC523/2006
  2006,INDONESIA,A/Indonesia/CDC582/2006
  2006,INDONESIA,A/Indonesia/CDC594/2006
  2006,INDONESIA,A/Indonesia/CDC595/2006
  2006,INDONESIA,A/Indonesia/CDC623/2006
  2006,INDONESIA,A/Indonesia/CDC624E/2006
  2006,INDONESIA,A/Indonesia/CDC625/2006
  2006,INDONESIA,A/Indonesia/CDC634/2006
  2006,INDONESIA,A/Indonesia/CDC699/2006
  2006,INDONESIA,A/Indonesia/CDC742/2006

Ongoing Projects at ISS
- InViDiA (Integrated Virus Diversity Analysis): Web-based tool for metadata-enabled diversity analysis
- WADE (Web-based Aggregation and Display of Epitopes): Web-based tool for aggregating epitope predictions from multiple prediction systems

Thanks to
- Johns Hopkins University: Prof. J Thomas August
- Dana-Farber Cancer Institute, Harvard: Dr. Vladimir Brusic
- Dept. of Biochemistry, NUS: Prof. Tan Tin Wee
  AT Heiny, Asif M Khan, Hu Yong Li
- Institut Pasteur: Dr. Hervé Bourhy
Partial Grant Support: National Institute of Allergy and Infectious Diseases, NIH Grant No. 5 U19 AI56541, Contract No. HHSN2662-00400085C
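
Backup: mutual information at a single site
As a companion to the "Using MI to detect Characteristic Sites" slide, here is a minimal sketch of how MI between the residue at one alignment position and the set label could be computed. It is not the AVANA implementation. The function name `mutual_information` is hypothetical, probabilities are taken as raw frequencies, and logarithms are base 2, so the value is in bits. The talk does not say how unequal set sizes (719 vs 1650 sequences) were handled. With raw frequencies the maximum equals the entropy of the set label, which reaches 1.0 only when the two sets are the same size.

```python
from collections import Counter
from math import log2


def mutual_information(residues_a, residues_b):
    """MI in bits between the residue at one site and the set label (A or B).

    residues_a / residues_b: the residues observed at this site in each set,
    one character per sequence (for example one alignment column per set).
    """
    n_a, n_b = len(residues_a), len(residues_b)
    if n_a == 0 or n_b == 0:
        raise ValueError("both sets must be non-empty")
    total = n_a + n_b
    counts_a, counts_b = Counter(residues_a), Counter(residues_b)
    mi = 0.0
    for counts, n_set in ((counts_a, n_a), (counts_b, n_b)):
        p_label = n_set / total
        for residue, count in counts.items():
            p_joint = count / total  # P(residue, label)
            p_residue = (counts_a[residue] + counts_b[residue]) / total
            mi += p_joint * log2(p_joint / (p_residue * p_label))
    return mi


if __name__ == "__main__":
    # Toy columns: Arg (R) in one set, Thr (T) in the other.
    print(mutual_information("RRRR", "TTTT"))  # fully characteristic site
    print(mutual_information("RRTT", "RRTT"))  # no association
    print(mutual_information("RRRT", "TTTR"))  # partial association
```

Running this over every column of an alignment and plotting the values would give a profile of the kind shown on the PB2 slide, where spikes mark candidate characteristic sites.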