Some Interesting Issues in Constructing Gene/Protein Networks
Limsoon Wong
Institute for Infocomm Research, Singapore
Copyright © 2005 by Limsoon Wong

Issues
• Sound:
  – Are the contents of our databases correct?
  – Trying our hands on a data cleansing problem
• Complete:
  – Is the structure of our databases expressive enough to capture critical information explicitly?
• Understandable:
  – Are our databases or search results understandable?
• Other issues relating to NLP/IE

Soundness: Are the contents of our databases correct?
This part is based on work of Judice Koh and Vladimir Brusic

Categories of errors found
[Figure: error categories by level (attribute, record, single-source database, multiple-source database). Error types shown: spelling errors, invalid values, format violation, ambiguity, dubious sequences, vector-contaminated sequence, cross-annotation error, annotation error, sequence structure violation, sequence redundancy, data provenance flaws, erroneous data transformation, incompatible schema.]

Example: Spelling Errors
• Usually typographical errors
• Occur in different fields of the record
• We identified 569 possibly misspelled words affecting up to 20,505 nucleotide records in Entrez.

Misspelling      Correction
Immunoglobin     Immunoglobulin
Cassete          Cassette
tranmembrane     transmembrane
asociated        associated

Context of the misspellings

GenBank:AAD26534 nectin-1 [Rattus norvegicus]
TITLE Nectin/PRR: An Immunogloblin-like Cell Adhesion Molecule Recruited to Cadherin-based Adherens Junctions through Interaction with Afadin, a PDZ Domain-containing Protein
gi|4590334|gb|AAD26534.1

Patent Database:A76783 Sequence 11 from Patent WO9315210
CDS <1..150 /note="gene cassete encoding intercalating jun-zipper and linker"
gi|6088638|emb|A76783.1||pat|WO|9315210|11[6088638]

Swiss-Prot:P03385 Env polyprotein precursor
DEFINITION Env polyprotein precursor [Contains: Surface protein (SU) (GP70); Tranmembrane protein (TM) (p15E); R protein].
gi|119478|sp|P03385|ENV_MLVMO

EMBL:Y18050 E.faecium pbp5 gene
TITLE Modification of penicillin-binding protein 5 asociated with high level ampicillin resistance in Enterococcus faecium
gi|1143442|emb|X92687.1|EFPBP5G

Example: Overlapping Intron/Exon Errors
• The Syn7 gene of putative polyketide synthase in NCBI TPA record BN000507 has overlapping intron 5 and exon 6.
• The rpb7+ RNA polymerase II subunit in GenBank record AF055916 has overlapping exon 1 and exon 2.

Example: Sequences with Identical Information
• Submission of the same sequence to different databases
• Repeated submission of the same sequence to the same database
• Initially submitted by different groups
• Protein sequences may be translated from duplicate nucleotide sequences
[Figure labels: replication of sequence information; different views; overlapping annotations of the same sequence]
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=protein&list_uids=11692005&dopt=GenPept

Soundness: Trying our hands on a data cleansing problem
This part is based on work of Judice Koh

[Figure labels: spelling errors; dictionary lookup; synonyms; homonyms/abbreviations]
[Figure labels, by level (attribute, record, single-source database, multi-source database): uninformative sequences; undersized sequences; format violation; misuse of fields; vector-contaminated sequences; features do not correspond with sequence; sequence structure violation; concatenated values; mis-fielded values; replication of sequence information; different views; overlapping annotations of the same sequence; fragments; putative features; cross-annotation error. Cleansing techniques shown: integrity constraints; vector screening; sequence structure parser; schema remapping; duplicate detection; comparative analysis.]

Dataset
Source: Entrez (GenBank, GenPept, SwissProt, DDBJ, PIR, PDB)

Query                            Dataset                         Duplicate pairs (expert annotation)
scorpion AND (venom OR toxin)    Scorpion venom, 520 records     251
serpentes AND venom AND PLA2     Snake PLA2 venom, 780 records   444

In total, 695 duplicate pairs are identified across the two datasets.

Results
• Rule 1: S(Seq)=1 ^ N(Seq Length)=1 ^ M(PDB)=0 (99.7%)
  – Identical sequences with the same sequence length that did not originate from PDB are 99.7% likely to be duplicates.
• Rule 2: S(Seq)=1 ^ M(PDB)=0 ^ M(Species)=1 (97.1%)
  – Identical sequences with the same sequence length and of the same species are 97.1% likely to be duplicates.
• Rule 3: S(Seq)=1 ^ N(Seq Length)=1 ^ M(Species)=1 ^ M(PDB)=0 (96.8%)
  – Identical sequences with the same sequence length, of the same species, that did not originate from PDB are 96.8% likely to be duplicates.

What else do we learn?
• The definition of a sequence record does not play a role in identifying duplicate records.

Completeness: Is the structure of our databases expressive enough to capture critical information explicitly?

Expressive Power
• Take a key paper such as the Kohn paper that summarises current knowledge on p53 regulation.
• Is there a structured database that is able to capture all info in that paper explicitly?
• Is there a semi-structured database that is able to capture all info in that paper explicitly?
• How well does this (semi-)structured database generalize to other papers of a similar type?

Understandability: Are our databases or search results understandable?

Self-Organization
• A search for p53 on MEDLINE returns more than 300k hits, or some number like that.
• It is not feasible for anyone to go through all of them to find what they want, and the problem is growing as MEDLINE doubles every 1-2 years.
• We need to organize the database and/or the search results into a hierarchy or "semantic" net to make the results easier for users to understand and browse.
• How do we define this hierarchy/net?
• Can this hierarchy/net be self-organized?

Problems relating to NLP/IE
This part is mostly based on work of Chris Tan and See-Kiong Ng

Handling full-length papers
• Source document structure parsing
• Hyper-linked file tracking
• Figure and table processing
• Special symbol handling

Information retrieval
• Document and sentence retrieval
• Relevant interaction filtering

Bio name recognition
• Nomenclature is only loosely followed
• Frequent use of conjunction and disjunction in bio names, with multiple bio-entity names sharing one head noun
• Long descriptive names
• Names of genes and proteins are used interchangeably

Bio-interaction extraction
• Inherent complexity of biological interactions
  – Sentences describing them also tend to be complicated
• Domain knowledge is often needed for interaction template filling
Extraction of other relevant info
• Contextual information
  – Species, cell type, cellular localisation, etc.
• Negative information
• Speculative & incomplete facts

Information integration
• Bio-name mapping
• Bio-interaction mapping
  – How do you know two complex sentences are talking about the same interaction?

Resource for training & benchmarking
• Is there such a good resource, especially for the more complex tasks?

Acknowledgements
• Data Cleansing: Judice Koh, Vladimir Brusic, Mong Li Lee, Asif M. Khan, Paul T.J. Tan, Heiny Tan, Kenneth Lee, Wilson Goh, Songsak Tongchusak, Kavitha Gopalakrishnan
• NLP/IE Issues: See-Kiong Ng, Chris Tan
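
Backup: a sketch of the duplicate-detection rules

As a rough illustration of Rule 1 and Rule 3 from the Results slide, the predicates can be written as simple checks on a pair of records. This is only a sketch under stated assumptions, not the method used in the study: the `Entry` class and its field names are hypothetical, S(Seq)=1 is read as exact sequence identity, N(Seq Length)=1 as equal lengths, and M(PDB)=0 as "neither record comes from PDB", following the plain-language glosses on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical minimal sequence record; all field names are assumptions."""
    accession: str
    database: str   # e.g. "GenBank", "SwissProt", "PDB"
    species: str
    sequence: str


def same_sequence(a: Entry, b: Entry) -> bool:
    # S(Seq)=1, read here as exact (case-insensitive) sequence identity.
    return a.sequence.upper() == b.sequence.upper()


def same_length(a: Entry, b: Entry) -> bool:
    # N(Seq Length)=1, read here as equal sequence lengths.
    return len(a.sequence) == len(b.sequence)


def involves_pdb(a: Entry, b: Entry) -> bool:
    # Assumption: M(PDB)=0 means neither record originated from PDB.
    return "PDB" in (a.database.upper(), b.database.upper())


def same_species(a: Entry, b: Entry) -> bool:
    # M(Species)=1, read here as a case-insensitive match of species names.
    return a.species.strip().lower() == b.species.strip().lower()


def rule1(a: Entry, b: Entry) -> bool:
    # S(Seq)=1 ^ N(Seq Length)=1 ^ M(PDB)=0
    return same_sequence(a, b) and same_length(a, b) and not involves_pdb(a, b)


def rule3(a: Entry, b: Entry) -> bool:
    # S(Seq)=1 ^ N(Seq Length)=1 ^ M(Species)=1 ^ M(PDB)=0
    return rule1(a, b) and same_species(a, b)


# Toy records with made-up accessions and sequences, for illustration only.
a = Entry("A1", "GenBank", "Toy species", "MKTLLV")
b = Entry("B1", "SwissProt", "toy species", "mktllv")
c = Entry("C1", "PDB", "Toy species", "MKTLLV")

candidate_pairs = [(x.accession, y.accession)
                   for x, y in [(a, b), (a, c), (b, c)]
                   if rule1(x, y)]
```

A pair that satisfies a rule is only a candidate duplicate: the percentages on the slide are likelihoods, so flagged pairs would still need expert review.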