Building Gene Networks by Information Extraction, Cleansing, & Integration

Limsoon Wong
Institute for Infocomm Research, Singapore

Copyright © 2005 by Limsoon Wong

Plan
• Motivation for study of gene networks
• Example efforts at I2R
  – Disease Pathweaver
  – Dragon ERG Solution
• Technical challenges involved
  – Named entity recognition
  – Co-reference resolution
  – Protein interaction extraction
  – Database cleansing

Motivation

Some Useful Information Sources
• Disease-centric resources
  – OMIM [NCBI OMIM, 2004]
  – Gene2Disease [Perez-Iratxeta et al., Nature, 2002]
  – MedGene [Hu et al., J. Proteome Res., 2003]
• These emphasize direct gene-disease relationships
  – Provide lists of disease-related genes
  – Do not provide info on gene-gene interactions & their networks
• Related interaction resources
  – KEGG [Kanehisa, NAR, 2000]
    • Manually constructed protein interaction networks
    • Mostly metabolic pathways and few disease pathways (only 7 disease pathways)
Adapted w/ permission from Zhuo Zhang, Suisheng Tang, & See-Kiong Ng

Why Gene Network?
• Many common diseases are
  – not caused by a genetic variation within a single gene
• But are influenced by
  – complex interactions among multiple genes
  – environmental & lifestyle factors
[Figure labels: Monogenic, Heterogenic]

Desired Outcome of Gene Network Study
• Help scientists understand the mechanism of complex diseases by
  – Greatly reducing workload for primary study of genetic diseases & broadening the scope of molecular studies
  – Easily identifying key players in the gene network, which helps in finding potential drug targets
• Scalable framework
  – Extend to many genetic diseases
  – Include other resources of gene interactions

Some Gene Network Study Efforts at I2R
• Disease Pathweaver
• Dragon ERG Solution

Disease Pathweaver, Zhang et al., APBC, 2005
• Automatically constructing disease pathways
  – Identify core genes
  – Mine info on core genes
  – Construct interaction network betw core genes
• Data sources:
  – Online literature
  – High-thru'put biological data
Adapted w/ permission from Zhuo Zhang, Suisheng Tang, & See-Kiong Ng

Disease Pathweaver: The Portal
• Portal for gene networks of human nervous system diseases
  – http://research.i2r.a-star.edu.sg/NSDPath
• Statistics
  – 37 human nervous system disorders
  – 7 ~ 60 core genes per disease
  – 2 ~ 320 core interactions per disease

Disease Pathweaver: A Tour

Disease Pathweaver: DPW vs KEGG on Huntington Disease

Estrogen-Responsive Genes
• Why
  – Affects human physiology in many aspects
  – Related to many diseases
  – Widely used in clinic
• Challenges
  – Multiple pathways
  – Difficult to predict ERE
  – Many estrogen-responsive genes but only a few are well-studied
  – Difficult to keep up w/ speed of knowledge accumulation
• Needed
  – Tools to predict ERE & estrogen-responsive genes
  – Database of useful info
  – Systems to predict impt regulatory units, associate gene functions, & generate global view of gene network
Adapted w/ permission from Suisheng Tang & Vlad Bajic

Dragon ERG Solution
[Diagram labels: ERE dependent, ERE independent, E2 dependent, E2 independent, ERs bind to other TFs, Membrane receptors, ERE Finder, ERG Finder, ERGDB, ERG Explorer, Coming soon!]

Dragon ERG Solution: Dragon ERE Finder, Bajic et al., NAR, 2003
• Predict functional ERE in genomic DNA
• One prediction in 13.3k bp
• Allow further analysis
• http://sdmc.i2r.a-star.edu.sg/ERE-V2/index

Dragon ERG Solution: Dragon Estrogen-Responsive Gene Finder, Tang et al., NAR, 2004a
• Only for human genome
• Using 117 bp ERE frame
• Evaluated by PubMed & microarray data
• http://sdmc.i2r.a-star.edu.sg/DRAGON/ERGP1_0

Dragon ERG Solution: Dragon Estrogen-Responsive Genes Database, Tang et al., NAR, 2004b
• Contains >1000 genes
• Manually curated
• Basic gene info
• Experimental evidence
• Full set of references
• ERE sites annotated
• http://sdmc.i2r.a-star.edu.sg/promoter/Ergdb-v11

Dragon ERG Solution: DEERGF
Adapted w/ permission from Suisheng Tang & Vlad Bajic

Dragon ERG Solution: Case-Specific TF Relation Networks, Pan et al., NAR, 2004
• Analyse abstracts
• Stemming, POS tagging
• Use ANNs, SVM, discriminant analysis
• Simplified rules for sentence analysis
• Constraints on the forms of sentences
• Sensitivity ~75%
• Precision ~82%
• Produce reports, direct links to PubMed docs, & graphical presentations of entity links
Adapted w/ permission from Vlad Bajic

Technical Challenges
• Named entity recognition
• Co-reference resolution
• Protein interaction extraction
• Data cleansing

Bio Entity Name Recognition, Zhou et al., BioCreAtIvE, 2004
Adapted w/ permission from Jian Su

Bio Entity Name Recognition: Ensemble Classification Approach
• Features considered
  – Orthographic, POS, morphologic, surface word, trigger words (TW1: receptor, enhancer, etc.; TW2: activation, stimulation, etc.)
  – Orthographic features: capital letters, numeric characters, & their combinations
  – Morphologic features: prefix, suffix, etc.
• SVM (high precision & low recall)
  – Context of 7 words
  – Each word gives 5 features, plus its position
• HMMs
  – 3 features used (orthographic, POS, surface word)
  – HMM1 & HMM2 use POS taggers trained on diff corpora
  – HMM1: balanced precision & recall
  – HMM2: low precision & high recall
• Majority voting over HMM1, HMM2, & SVM

Bio Entity Name Recognition: Performance at BioCreAtIvE 2004
Adapted w/ permission from Zhou et al., BioCreAtIvE 2004

Co-Reference Resolution, Yang et al., IJCNLP, 2004
Adapted w/ permission from Yang et al., IJCNLP 2004

Co-Reference Resolution: Baseline Features Used

Co-Reference Resolution: New Features Used & Performance
• Base classifier: C5.0

Protein Interaction Extraction, Xiao et al., submitted
• The max entropy model:
    p(o \mid h) = \frac{1}{Z(h)} \exp\Big( \sum_j \lambda_j f_j(h, o) \Big)
• where
  – o is the outcome
  – h is the feature vector
  – Z(h) is the normalization function
  – f_j are feature functions
  – \lambda_j are feature weights
Adapted w/ permission from Xiao et al., IJCNLP 2004

Protein Interaction Extraction: Features Used

Protein Interaction Extraction: Performance on IEPA Corpus
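
The max entropy model on the Protein Interaction Extraction slide can be illustrated with a small sketch. It assumes the standard log-linear form, in which the probability of outcome o given h is exp of the weighted feature sum divided by Z(h). The feature functions, weights, and the dictionary layout of h below are hypothetical and are not taken from Xiao et al.

```python
import math

def maxent_prob(features, weights, outcomes, h):
    """Return p(o | h) for each outcome o under a log-linear (max entropy) model."""
    # Unnormalized score per outcome: exp(sum_j lambda_j * f_j(h, o))
    scores = {
        o: math.exp(sum(w * f(h, o) for f, w in zip(features, weights)))
        for o in outcomes
    }
    z = sum(scores.values())  # normalization function Z(h)
    return {o: s / z for o, s in scores.items()}

# Hypothetical binary feature functions f_j(h, o); h is a dict describing a protein pair.
def f_binding_verb(h, o):
    return 1.0 if o == "interact" and "binds" in h["words_between"] else 0.0

def f_negation(h, o):
    return 1.0 if o == "no_interact" and "not" in h["words_between"] else 0.0

features = [f_binding_verb, f_negation]
weights = [1.5, 2.0]  # hypothetical feature weights lambda_j
h = {"words_between": ["binds", "to"]}
probs = maxent_prob(features, weights, ["interact", "no_interact"], h)
```

In practice the weights would be estimated from an annotated corpus rather than set by hand; the sketch only shows how the outcome probabilities are computed once the weights are known.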

Data Cleansing, Koh et al., DBiBD, 2005
• 11 types & 28 subtypes of data artifacts
  – Critical artifacts (vector contaminated sequences, duplicates, sequence structure violations)
  – Non-critical artifacts (misspellings, synonyms)
• > 20,000 seq records in public databases contain artifacts
• Identification of these artifacts is impt for accurate knowledge discovery
• Sources of artifacts
  – Diverse sources of data
    • Repeated submissions of seqs to db's
    • Cross-updating of db's
  – Data annotation
    • Db's have diff ways for data annotation
    • Data entry errors can be introduced
    • Different interpretations
  – Lack of standardized nomenclature
    • Variations in naming
    • Synonyms, homonyms, & abbreviations
  – Inadequacy of data quality control mechanisms
    • Systematic approaches to data cleaning are lacking
Adapted w/ permission from Judice Koh, Mong Li Lee, & Vladimir Brusic

Data Cleansing: A Classification of Errors

Data Cleansing: Example Spelling Errors
[Classification sidebar. Levels: ATTRIBUTE, RECORD, SINGLE SOURCE DATABASE, MULTIPLE SOURCE DATABASE. Labels: Invalid values, Spelling errors, Format violation, Ambiguity, Dubious sequences, Vector contaminated sequence, Cross-annotation error, Annotation error, Sequence structure violation, Sequence redundancy, Data provenance flaws, Erroneous data transformation, Incompatible schema]
• Usually typo errors
• Occur in different fields of the record
• We identified 569 possible misspelled words affecting up to 20,505 nucleotide records in Entrez.

Misspellings    Corrections
Immunoglobin    Immunoglobulin
Cassete         Cassette
tranmembrane    transmembrane
asociated       associated

Context of the misspellings
• GenBank:AAD26534 nectin-1 [Rattus norvegicus]
  TITLE Nectin/PRR: An Immunogloblin-like Cell Adhesion Molecule Recruited to Cadherin-based Adherens Junctions through Interaction with Afadin, a PDZ Domain-containing Protein
  gi|4590334|gb|AAD26534.1
• Patent Database:A76783 Sequence 11 from Patent WO9315210
  CDS <1..150 /note="gene cassete encoding intercalating jun-zipper and linker"
  gi|6088638|emb|A76783.1||pat|WO|9315210|11[6088638]
• Swiss-Prot:P03385 Env polyprotein precursor
  DEFINITION Env polyprotein precursor [Contains: Surface protein (SU) (GP70); Tranmembrane protein (TM) (p15E); R protein].
  gi|119478|sp|P03385|ENV_MLVMO
• EMBL:Y18050 E.faecium pbp5 gene
  TITLE Modification of penicillin-binding protein 5 asociated with high level ampicillin resistance in Enterococcus faecium
  gi|1143442|emb|X92687.1|EFPBP5G
Adapted w/ permission from Judice Koh, Mong Li Lee, & Vladimir Brusic

Data Cleansing: Example Meaningless Seqs
[Classification sidebar repeated; additional labels: Uninformative sequences, Undersized sequences]
• Among the 5,146,255 protein records queried using Entrez across the major protein or translated nucleotide databases, 3,327 protein sequences were shorter than four residues (as of Sep 2004).
• In Nov 2004, the total number of undersized protein sequences increased to 3,350.
• Among the 43,026,887 nucleotide records queried using Entrez across the major nucleotide databases, 1,448 records contained sequences shorter than six bases (as of Sep 2004).
• In Nov 2004, the total number of undersized nucleotide sequences increased to 1,711.
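
The undersized-sequence counts above correspond to a simple length check. A minimal sketch follows, using the two thresholds stated on the slide (fewer than four residues for proteins, fewer than six bases for nucleotides). The record tuples and accession strings are hypothetical placeholders, not real database entries.

```python
# Thresholds taken from the slide: proteins shorter than four residues and
# nucleotide sequences shorter than six bases are treated as undersized.
MIN_LENGTH = {"protein": 4, "nucleotide": 6}

def is_undersized(sequence: str, kind: str) -> bool:
    """Return True if the sequence is shorter than the threshold for its kind."""
    return len(sequence.strip()) < MIN_LENGTH[kind]

def flag_undersized(records):
    """records: iterable of (accession, kind, sequence); return flagged accessions."""
    return [acc for acc, kind, seq in records if is_undersized(seq, kind)]

# Hypothetical records for illustration only.
records = [
    ("REC001", "protein", "MK"),
    ("REC002", "protein", "MKVLAAGIVG"),
    ("REC003", "nucleotide", "ATGC"),
    ("REC004", "nucleotide", "ATGCGTACGT"),
]
flagged = flag_undersized(records)
```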

[Chart: Undersized protein sequences in major databases. x-axis: sequence length (1 to 3); y-axis: number of records; series: DDBJ, EMBL, GenBank, SwissProt, PDB, PIR]
[Chart: Undersized nucleotide sequences in major databases. x-axis: sequence length (1 to 5); y-axis: number of records; series: DDBJ, EMBL, GenBank, PDB]
Adapted w/ permission from Judice Koh, Mong Li Lee, & Vladimir Brusic

Data Cleansing: Example Overlapping Intron/Exon
[Classification sidebar repeated; additional label: Overlapping intron/exon]
• Syn7 gene of putative polyketide synthase in NCBI TPA record BN000507 has overlapping intron 5 and exon 6.
• rpb7+ RNA polymerase II subunit in GenBank record AF055916 has overlapping exon 1 and exon 2.
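
A sequence structure violation such as an overlapping intron and exon can be detected with an interval check over the annotated features. The sketch below assumes features are given as (name, start, end) with inclusive coordinates; the coordinates are made up for illustration and are not those of BN000507 or AF055916.

```python
def find_overlaps(features):
    """Return pairs of feature names whose inclusive [start, end] ranges overlap.

    features: list of (name, start, end) tuples.
    """
    ordered = sorted(features, key=lambda f: (f[1], f[2]))
    overlaps = []
    for i, (name_a, _start_a, end_a) in enumerate(ordered):
        for name_b, start_b, _end_b in ordered[i + 1:]:
            if start_b > end_a:
                # Features are sorted by start, so nothing later can overlap name_a.
                break
            overlaps.append((name_a, name_b))
    return overlaps

# Hypothetical gene structure: intron 5 runs into exon 6.
gene_features = [
    ("exon 5", 100, 200),
    ("intron 5", 201, 300),
    ("exon 6", 290, 400),
]
violations = find_overlaps(gene_features)
```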

Data Cleansing: Example Seqs w/ Identical Info
[Classification sidebar repeated; additional labels: Replication of sequence information, Different views, Overlapping annotations of the same sequence]
• Submission of the same sequence to different databases
• Repeated submission of the same sequence to the same database
• Initially submitted by different groups
• Protein sequences may be translated from duplicate nucleotide sequences
• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=protein&list_uids=11692005&dopt=GenPept
Adapted w/ permission from Judice Koh, Mong Li Lee, & Vladimir Brusic

Data Cleansing: Overlapping Annotations of the Same Seqs
• http://www.expasy.org/cgi-bin/niceprot.pl?Q95P69
• http://www.expasy.org/cgi-bin/niceprot.pl?Q9GNG8

Trying our hands on a data cleansing problem...

[Diagram: artifact types at each level alongside candidate cleansing methods.
 Levels: ATTRIBUTE, RECORD, SINGLE SOURCE DATABASE, MULTI-SOURCE DATABASE.
 Artifact labels: Spelling errors, Synonyms, Homonyms/Abbreviations, Uninformative sequences, Undersized sequences, Format violation, Misuse of fields, Vector contaminated sequences, Features do not correspond with sequence, Sequence structure violation, Concatenated values, Mis-fielded values, Replication of sequence information, Different views, Overlapping annotations of the same sequence, Fragments, Putative features, Cross-annotation error.
 Method labels: Dictionary lookup, Integrity constraints, Vector screening, Sequence Structure Parser, Schema remapping, Duplicate detection, Comparative analysis]
Adapted w/ permission from Judice Koh, Mong Li Lee, & Vladimir Brusic

Association Rule Mining Approach
• Select matching criteria
• Compute similarity scores from known duplicate pairs
• Generate association rules
• Detect duplicates using the rules

Features to Match

Association Rule Mining
Similarity scores of known duplicate pairs:
  AAG39642  AAG39643  AC0.9 LE1.0 DE1.0 DB1 SP1 RF1.0 PD0 FT1.0 SQ1.0
  AAG39642  Q9GNG8    AC0.1 LE1.0 DE0.4 DB0 SP1 RF1.0 PD0 FT0.1 SQ1.0
  P00599    PSNJ1W    AC0.2 LE1.0 DE0.4 DB0 SP1 RF1.0 PD0 FT1.0 SQ1.0
  P01486    NTSREB    AC0.0 LE1.0 DE0.3 DB0 SP1 RF1.0 PD0 FT1.0 SQ1.0
  O57385    CAA11159  AC0.1 LE1.0 DE0.5 DB0 SP1 RF0.0 PD0 FT0.1 SQ1.0
  S32792    P24663    AC0.0 LE1.0 DE0.4 DB0 SP1 RF0.5 PD0 FT1.0 SQ1.0
  P45629    S53330    AC0.0 LE1.0 DE0.2 DB0 SP1 RF1.0 PD0 FT1.0 SQ1.0
Frequent item-sets with support (from association rule mining):
  LE1.0 PD0 SQ1.0 (99.7%)
  SP1 PD0 SQ1.0 (97.1%)
  SP1 LE1.0 PD0 SQ1.0 (96.8%)
  DB0 PD0 SQ1.0 (93.1%)
  DB0 LE1.0 PD0 SQ1.0 (92.8%)
  DB0 SP1 PD0 SQ1.0 (90.4%)
  DB0 SP1 LE1.0 PD0 SQ1.0 (90.1%)
  RF1.0 SP1 LE1.0 PD0 SQ1.0 (47.6%)
  RF1.0 DB0 LE1.0 PD0 SQ1.0 (44.0%)
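
The step from similarity scores to frequent item-sets can be sketched as follows. Each known duplicate pair is treated as a transaction of items such as "LE1.0" or "PD0", and the support of an item-set is the fraction of transactions containing it. This is a brute-force enumeration for illustration, not necessarily the mining algorithm used by Koh et al. It runs only on the seven sample pairs shown on the slide, so the resulting supports differ from the reported percentages, which come from the full set of duplicate pairs.

```python
from collections import Counter
from itertools import combinations

# The seven sample duplicate pairs from the slide, one transaction per pair.
sample_pairs = [
    "AC0.9 LE1.0 DE1.0 DB1 SP1 RF1.0 PD0 FT1.0 SQ1.0",
    "AC0.1 LE1.0 DE0.4 DB0 SP1 RF1.0 PD0 FT0.1 SQ1.0",
    "AC0.2 LE1.0 DE0.4 DB0 SP1 RF1.0 PD0 FT1.0 SQ1.0",
    "AC0.0 LE1.0 DE0.3 DB0 SP1 RF1.0 PD0 FT1.0 SQ1.0",
    "AC0.1 LE1.0 DE0.5 DB0 SP1 RF0.0 PD0 FT0.1 SQ1.0",
    "AC0.0 LE1.0 DE0.4 DB0 SP1 RF0.5 PD0 FT1.0 SQ1.0",
    "AC0.0 LE1.0 DE0.2 DB0 SP1 RF1.0 PD0 FT1.0 SQ1.0",
]
transactions = [frozenset(p.split()) for p in sample_pairs]

def frequent_itemsets(transactions, min_support, max_size=3):
    """Return {itemset (sorted tuple): support} for item-sets meeting min_support."""
    counts = Counter()
    for t in transactions:
        for k in range(1, max_size + 1):
            for combo in combinations(sorted(t), k):
                counts[combo] += 1
    n = len(transactions)
    return {items: c / n for items, c in counts.items() if c / n >= min_support}

itemsets = frequent_itemsets(transactions, min_support=0.8)
```

Enumerating all combinations is exponential in the number of items, so a level-wise method such as Apriori would be the usual choice beyond toy inputs.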

Dataset
• Source: Entrez (GenBank, GenPept, SwissProt, DDBJ, PIR, PDB)
• Query "scorpion AND (venom OR toxin)"
  – Scorpion venom dataset containing 520 records
  – Expert annotation: 251 duplicate pairs
• Query "serpentes AND venom AND PLA2"
  – Snake PLA2 venom dataset containing 780 records
  – Expert annotation: 444 duplicate pairs
• 695 duplicate pairs are collectively identified.
Adapted w/ permission from Judice Koh, Mong Li Lee, & Vladimir Brusic

Results
[Chart: Duplicates detected by association rules. x-axis: association rules (Rule 1 to Rule 7); y-axis: FP% and FN%; legend: FP%, FN% (a "x 1000" scale label also appears in the legend)]
• Examine 3 rules w/ highest support
  Rule 1  S(Seq)=1 ^ N(Seq Length)=1 ^ M(PDB)=0 (99.7%)
  Rule 2  S(Seq)=1 ^ M(PDB)=0 ^ M(Species)=1 (97.1%)
  Rule 3  S(Seq)=1 ^ N(Seq Length)=1 ^ M(Species)=1 ^ M(PDB)=0 (96.8%)
  Rule 4  S(Seq)=1 ^ M(PDB)=0 ^ M(DB)=0 (93.1%)
  Rule 5  S(Seq)=1 ^ N(Seq Length)=1 ^ M(PDB)=0 ^ M(DB)=0 (92.8%)
  Rule 6  S(Seq)=1 ^ M(Species)=1 ^ M(PDB)=0 ^ M(DB)=0 (90.4%)
  Rule 7  S(Seq)=1 ^ N(Seq Length)=1 ^ M(Species)=1 ^ M(PDB)=0 ^ M(DB)=0 (90.1%)

Results
• Rule 1  S(Seq)=1 ^ N(Seq Length)=1 ^ M(PDB)=0 (99.7%)
  – Identical sequences with the same sequence length and not originating from PDB are 99.7% likely to be duplicates.
• Rule 2  S(Seq)=1 ^ M(PDB)=0 ^ M(Species)=1 (97.1%)
  – Identical sequences of the same species and not originating from PDB are 97.1% likely to be duplicates.
• Rule 3  S(Seq)=1 ^ N(Seq Length)=1 ^ M(Species)=1 ^ M(PDB)=0 (96.8%)
  – Identical sequences with the same sequence length, of the same species, and not originating from PDB are 96.8% likely to be duplicates.
• What else do we learn?
  – The definition field of the sequence records does not play a role in identifying duplicate records
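
Applying a mined rule to candidate record pairs, and scoring it against expert annotation, might look like the sketch below. It encodes Rule 1 from the slide. The slide does not say how FP% and FN% are defined, so the sketch assumes FP% is the share of non-duplicate pairs the rule accepts and FN% is the share of true duplicate pairs it misses. The example pairs are hypothetical.

```python
def rule1(scores):
    """Rule 1 from the slide: S(Seq)=1 ^ N(Seq Length)=1 ^ M(PDB)=0."""
    return scores["SQ"] == 1.0 and scores["LE"] == 1.0 and scores["PD"] == 0

def fp_fn_percent(pairs, rule):
    """Return (FP%, FN%) of a rule against expert labels (assumed definitions)."""
    duplicates = [p for p in pairs if p["is_duplicate"]]
    non_duplicates = [p for p in pairs if not p["is_duplicate"]]
    fp = sum(1 for p in non_duplicates if rule(p["scores"]))
    fn = sum(1 for p in duplicates if not rule(p["scores"]))
    fp_pct = 100.0 * fp / len(non_duplicates) if non_duplicates else 0.0
    fn_pct = 100.0 * fn / len(duplicates) if duplicates else 0.0
    return fp_pct, fn_pct

# Hypothetical labelled pairs for illustration only.
pairs = [
    {"scores": {"SQ": 1.0, "LE": 1.0, "PD": 0}, "is_duplicate": True},
    {"scores": {"SQ": 1.0, "LE": 1.0, "PD": 1}, "is_duplicate": True},   # missed by Rule 1
    {"scores": {"SQ": 0.6, "LE": 0.9, "PD": 0}, "is_duplicate": False},
    {"scores": {"SQ": 1.0, "LE": 1.0, "PD": 0}, "is_duplicate": False},  # wrongly accepted
]
fp_pct, fn_pct = fp_fn_percent(pairs, rule1)
```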

References
• Zhang et al., "Toward discovering disease-specific gene networks from online literature", APBC, 3:161-169, 2005
• Pan et al., "Dragon TF association miner: A system for exploring transcription factor associations through text mining", NAR, 32:W230-W234, 2004
• Tang et al., "Computational method for discovery of estrogen-responsive genes", NAR, 32:6212-6217, 2004a
• Tang et al., "ERGDB: Estrogen-responsive genes database", NAR, 32:D533-D536, 2004b
• Bajic et al., "Dragon ERE Finder ver.2: A tool for accurate detection and analysis of estrogen-response elements in vertebrate genomes", NAR, 31:3605-3607, 2003
• Koh et al., "A Classification of Biological Data Artifacts", DBiBD, 2005
• Zhou et al., "Recognition of protein and gene names from text using an ensemble of classifiers and effective abbreviation resolution", Proc. BioCreAtIvE Workshop, pp 26-30, 2004
• Yang et al., "Improving Noun Phrase Co-reference Resolution by Matching Strings", IJCNLP, 1:226-333, 2004
• Xiao et al., "Protein-protein interaction extraction: A supervised learning approach", submitted

Acknowledgements
• Data Cleansing: Judice Koh, Vladimir Brusic, Mong Li Lee, Asif M. Khan, Paul T.J. Tan, Heiny Tan, Kenneth Lee, Wilson Goh, Songsak Tongchusak, Kavitha Gopalakrishnan
• Pathweaver: Zhuo Zhang, See-Kiong Ng, Suisheng Tang, Chris Tan
• Dragon ERG & TF Miner: Suisheng Tang, Vlad Bajic, Zuo Li, Pan Hong, Vidhu Chaudhary, Raja Kangasa
• Info Extraction: Guodong Zhou, Jian Su, ChewLim Tan, Juan Xiao, Xiaofeng Yang, Chris Tan, Dan Shen, Jie Zhang