Some interesting issues in constructing gene/protein networks

Limsoon Wong
Institute for Infocomm Research
Singapore
Copyright © 2005 by Limsoon Wong
Issues
• Sound:
– Are the contents of our databases correct?
– Trying our hand at a data cleansing problem
• Complete:
– Is the structure of our databases expressive enough to capture critical information explicitly?
• Understandable:
– Are our databases or search results understandable?
• Other issues relating to NLP/IE
Soundness: Are the contents of our databases correct?
This part is based on the work of Judice Koh and Vladimir Brusic
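Categories of errors found
[Figure: taxonomy of error categories at four levels (attribute, record, single-source database, multiple-source database). Categories shown: spelling errors, invalid values, format violation, ambiguity, dubious sequences, vector-contaminated sequence, cross-annotation error, annotation error, sequence structure violation, sequence redundancy, data provenance flaws, erroneous data transformation, incompatible schema.]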
Example Spelling Errors
• Usually typographical errors
• Occur in different fields of the record
• We identified 569 possibly misspelled words affecting up to 20,505 nucleotide records in Entrez.
Misspelling      Correction
Immunoglobin     Immunoglobulin
Cassete          Cassette
tranmembrane     transmembrane
asociated        associated
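A minimal sketch of how misspellings like these might be flagged by dictionary lookup. The vocabulary, the function name `suggest_correction` and the similarity cutoff are illustrative assumptions, not part of the original work; a real system would draw on a curated biomedical dictionary.

```python
import difflib

# Hypothetical reference vocabulary (assumption for illustration only).
VOCABULARY = ["immunoglobulin", "cassette", "transmembrane", "associated"]

def suggest_correction(word, vocabulary=VOCABULARY, cutoff=0.8):
    """Return the closest vocabulary term for an unknown word, else None."""
    lowered = word.lower()
    if lowered in vocabulary:
        return None  # already a known term, nothing to suggest
    # get_close_matches ranks candidates by difflib's similarity ratio.
    matches = difflib.get_close_matches(lowered, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None

for word in ["Immunoglobin", "Cassete", "tranmembrane", "asociated", "protein"]:
    print(word, "->", suggest_correction(word))
```

Words that are neither in the vocabulary nor close to any entry (such as "protein" here) yield no suggestion, so they would need a larger dictionary or manual review.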
Context of the misspellings
GenBank:AAD26534
nectin-1 [Rattus norvegicus]
TITLE Nectin/PRR: An Immunogloblin-like Cell Adhesion Molecule Recruited
to Cadherin-based Adherens Junctions through Interaction with
Afadin, a PDZ Domain-containing Protein
gi|4590334|gb|AAD26534.1
Patent Database:A76783
Sequence 11 from Patent WO9315210
CDS <1..150
/note="gene cassete encoding intercalating jun-zipper and
linker"
gi|6088638|emb|A76783.1||pat|WO|9315210|11[6088638]
Swiss-Prot:P03385
Env polyprotein precursor
DEFINITION Env polyprotein precursor [Contains: Surface protein (SU) (GP70);
Tranmembrane protein (TM) (p15E); R protein].
gi|119478|sp|P03385|ENV_MLVMO
EMBL:Y18050
E.faecium pbp5 gene
TITLE Modification of penicillin-binding protein 5 asociated with high
level ampicillin resistance in Enterococcus faecium
gi|1143442|emb|X92687.1|EFPBP5G
Example Overlapping Intron/Exon Errors
• Syn7 gene of putative polyketide synthase in NCBI TPA record BN000507 has overlapping intron 5 and exon 6.
• rpb7+ RNA polymerase II subunit in GenBank record AF055916 has overlapping exon 1 and exon 2.
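Errors of this kind can be checked mechanically once feature coordinates are parsed. The sketch below assumes each feature is a `(name, start, end)` tuple with 1-based inclusive coordinates; the function name and the coordinates are made up for illustration and are not taken from BN000507 or AF055916.

```python
def find_overlaps(features):
    """Return pairs of feature names whose inclusive coordinate ranges overlap."""
    ordered = sorted(features, key=lambda f: f[1])  # sort by start position
    overlaps = []
    for i, (name_a, _start_a, end_a) in enumerate(ordered):
        for name_b, start_b, _end_b in ordered[i + 1:]:
            if start_b > end_a:
                break  # every later feature starts even further to the right
            overlaps.append((name_a, name_b))
    return overlaps

# Hypothetical coordinates: intron 5 runs past the start of exon 6.
features = [("exon 5", 100, 200), ("intron 5", 201, 350), ("exon 6", 340, 420)]
print(find_overlaps(features))
```

In a well-formed gene structure, consecutive exons and introns abut without sharing positions, so any pair reported here points to a record worth inspecting.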
Example Sequences with Identical Information
[Figure: error taxonomy, with the labels: replication of sequence information; different views; overlapping annotations of the same sequence.]
• Submission of the same sequence to different databases
• Repeated submission of the same sequence to the same database
• Initially submitted by different groups
• Protein sequences may be translated from duplicate nucleotide sequences
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=protein&list_uids=11692005&dopt=GenPept
Soundness: Trying our hand at a data cleansing problem
This part is based on the work of Judice Koh
[Figure: error types and cleansing methods, arranged by level (attribute, record, single-source database, multi-source database).]
• Cleansing methods: dictionary lookup, integrity constraints, vector screening, sequence structure parser, schema remapping, duplicate detection, comparative analysis
• Error types: spelling errors, synonyms, homonyms/abbreviations, uninformative sequences, undersized sequences, format violation, misuse of fields, vector contaminated sequences, features do not correspond with sequence, sequence structure violation, concatenated values, mis-fielded values, replication of sequence information, different views, overlapping annotations of the same sequence, fragments, putative features, cross-annotation error
Dataset
• Source: Entrez (GenBank, GenPept, SwissProt, DDBJ, PIR, PDB)
• Query "scorpion AND (venom OR toxin)": scorpion venom dataset containing 520 records
• Query "serpentes AND venom AND PLA2": snake PLA2 venom dataset containing 780 records
• Expert annotation: 251 duplicate pairs and 444 duplicate pairs; 695 duplicate pairs are collectively identified.
Results
Rule 1 S(Seq)=1 ^ N(Seq Length)=1 ^ M(PDB)=0 (99.7%)
Identical sequences with the same sequence length and not originated from PDB
are 99.7% likely to be duplicates.
Rule 2 S(Seq)=1 ^ M(PDB)=0 ^ M(Species)=1 (97.1%)
Identical sequences with the same sequence length and of the same species are
97.1% likely to be duplicates.
Rule 3 S(Seq)=1 ^ N(Seq Length)=1 ^ M(Species)=1 ^ M(PDB)=0 (96.8%)
Identical sequences with the same sequence length, of the same species and not
originated from PDB are 96.8% likely to be duplicates.
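The percentage attached to each rule can be read as a confidence: among record pairs that satisfy the rule's conditions, the fraction labelled as duplicates. A minimal sketch of that calculation follows, assuming each record pair has already been reduced to a dictionary of feature values. The field names, the toy pairs and the reading of S, N and M as precomputed per-pair features are assumptions for illustration, not data or definitions from the study.

```python
def rule_confidence(pairs, conditions):
    """Fraction of pairs meeting all conditions that are labelled duplicates.

    Returns None when no pair satisfies the conditions.
    """
    matching = [p for p in pairs
                if all(p[field] == value for field, value in conditions.items())]
    if not matching:
        return None
    return sum(p["duplicate"] for p in matching) / len(matching)

# Toy, made-up feature vectors for record pairs (not data from the study).
pairs = [
    {"S_seq": 1, "N_len": 1, "M_pdb": 0, "M_species": 1, "duplicate": True},
    {"S_seq": 1, "N_len": 1, "M_pdb": 0, "M_species": 0, "duplicate": True},
    {"S_seq": 1, "N_len": 1, "M_pdb": 0, "M_species": 1, "duplicate": False},
    {"S_seq": 1, "N_len": 1, "M_pdb": 1, "M_species": 1, "duplicate": False},
    {"S_seq": 0, "N_len": 0, "M_pdb": 0, "M_species": 1, "duplicate": False},
]

# Conditions shaped like Rule 1 and Rule 3 above.
rule_1 = {"S_seq": 1, "N_len": 1, "M_pdb": 0}
rule_3 = {"S_seq": 1, "N_len": 1, "M_species": 1, "M_pdb": 0}
print(rule_confidence(pairs, rule_1), rule_confidence(pairs, rule_3))
```

As the toy data shows, adding a condition narrows the set of matching pairs and can lower the confidence as well as raise it, which is consistent with Rule 3 scoring below Rule 1.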
What else do we learn?
The definitions of the sequence records do not play a role in identifying duplicate records
Completeness: Is the structure of our databases expressive enough to capture critical information explicitly?
Expressive Power
• Take a key paper such as the Kohn paper that summarises current knowledge on p53 regulation.
• Is there a structured database that is able to capture all info in that paper explicitly?
• Is there a semi-structured database that is able to capture all info in that paper explicitly?
• How well does this (semi-)structured database generalize to other papers of a similar type?
Understandability: Are our databases or search results understandable?
Self-Organization
• A search on p53 on MEDLINE returns on the order of 300k hits or more
• It is not feasible for anyone to go through all of that to find what they want! And this problem is growing as MEDLINE doubles every 1-2 years.
• Need to organize the database and/or the search results into a hierarchy or “semantic” net to make it easier for users to understand or browse the results
• How do we define this hierarchy/net?
• Can this hierarchy/net be self-organized?
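One naive way to let a hierarchy emerge from the results themselves is agglomerative clustering on shared terms. The sketch below is only an illustration of that idea, not the approach proposed in these slides: the document ids, term sets, Jaccard similarity and single-linkage merging are all assumptions chosen for brevity.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of terms."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def build_hierarchy(docs):
    """Greedy single-linkage agglomerative clustering.

    docs maps a document id to its set of terms and must be non-empty.
    Returns a nested tuple whose leaves are document ids.
    """
    # Each cluster is (tree, set of member ids).
    clusters = [(doc_id, {doc_id}) for doc_id in docs]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: similarity of the closest pair of members.
                sim = max(jaccard(docs[x], docs[y])
                          for x in clusters[i][1] for y in clusters[j][1])
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] | clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]

# Hypothetical search hits, each reduced to a few index terms.
docs = {
    "d1": {"p53", "mdm2", "degradation"},
    "d2": {"p53", "mdm2", "ubiquitination"},
    "d3": {"p53", "apoptosis", "bax"},
    "d4": {"p53", "apoptosis", "puma"},
}
print(build_hierarchy(docs))
```

This pairwise search is far too slow for hundreds of thousands of hits and says nothing about how to label the resulting groups, which is part of why the questions above remain open.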
Problems relating to NLP/IE
This part is mostly based on the work of Chris Tan and See-Kiong Ng
Handling full-length papers
• Source document structure parsing
• Hyper-linked file tracking
• Figure and table processing
• Special symbol handling
Information retrieval
• Document and sentence retrieval
• Relevant interaction filtering
Bio name recognition
• Nomenclature loosely followed
• Frequent use of conjunction and disjunction in bio names with multiple bio-entity names sharing one head noun
• Long descriptive names
• Names of genes and proteins used interchangeably
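To make the shared-head-noun problem concrete: a phrase such as "IL-2 and IL-4 receptors" names two entities with one head noun. The sketch below expands such a phrase under strong simplifying assumptions (the head noun is the single last word, and modifiers are separated by commas, "and" or "or"); the function name and the example phrases are illustrative, and real bio names break these assumptions often.

```python
import re

def expand_shared_head(phrase):
    """Expand 'A, B and C head' into ['A head', 'B head', 'C head'].

    Assumes the last whitespace-separated token is the shared head noun.
    """
    *modifier_tokens, head = phrase.split()
    # Split the modifier part on commas and on the words 'and' / 'or'.
    modifiers = re.split(r",\s*|\s+(?:and|or)\s+", " ".join(modifier_tokens))
    return [f"{m.strip()} {head}" for m in modifiers if m.strip()]

print(expand_shared_head("IL-2 and IL-4 receptors"))
print(expand_shared_head("alpha, beta or gamma subunits"))
```

Multi-word heads, nested coordination and names that themselves contain "and" would all defeat this rule, which illustrates why name recognition is listed as a problem here.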
Bio-interaction extraction
• Inherent complexity of biological interactions → sentences describing them also tend to be complicated
• Domain knowledge is often needed for interaction template filling
Extraction of other relevant info
• Contextual information
– Species, cell type, cellular localisation, etc.
• Negative information
• Speculative & incomplete facts
Information integration
• Bio-name mapping
• Bio-interaction mapping
– How do you know that two complex sentences are talking about the same interaction?
Resource for training & benchmarking
• Is there a good resource for this, especially for the more complex tasks?
Acknowledgements
Data Cleansing: Judice Koh, Vladimir Brusic, Mong Li Lee, Asif M. Khan, Paul T.J. Tan, Heiny Tan, Kenneth Lee, Wilson Goh, Songsak Tongchusak, Kavitha Gopalakrishnan
NLP/IE Issues: See-Kiong Ng, Chris Tan
[Figure: I2R organisation chart]