A Classification of Biological Data Artifacts
Judice L.Y. Koh (1,2), Mong Li Lee (2), Vladimir Brusic (1)
(1) Knowledge Discovery Department, Institute for Infocomm Research
(2) School of Computing, National University of Singapore
DBiBD: Workshop on Database Issues in Biological Databases
http://research.i2r.a-star.edu.sg/Templar/
GenPept, EMBL, TrEMBL, NCBI RefSeq, Patent database
…
Data warehousing process
Analysing and publishing a specialist database (SSDW) takes 2 weeks to 1 month, and up to ¼ of that time is typically spent cleaning biological data records.
Biological data quality
Public molecular databases (GenBank, Swiss-Prot, DDBJ, EMBL, PIR,
among others) provide rich sources of biological data.
They supply the information for data analysis and the data sources for knowledge discovery in the BioWare data warehouse.
The accuracy of data analysis and the ability to produce correct results from data mining rely on the quality of the data.
But how can we ensure high quality data in our data warehouse?
DBiBD: A classification of biological data artifacts
Objectives of our study
• Classification of biological data artifacts in biological databases.
• Critical assessment of the quality of the data that biologists and computer scientists have been using for data analysis and data mining.
• Roadmap to improving the quality of data in molecular databases.
• Form the basis of biological data cleaning.
Biological data artifacts
Errors, discrepancies, redundancies, ambiguities, and incompleteness in molecular databases that reduce the quality of the biological data.
Outline of presentation
• Sources of Biological Data Artifacts
• HEADER Artifacts
• FEATURE Artifacts
• SEQUENCE Artifacts
• Data Cleaning Framework
• Conclusion
Sources of biological data artifacts
(1) Diverse sources of data
• Extensive duplication
• Repeated submissions of the sequences to the same or different databases
• Cross-updating of databases (propagation of errors)
(2) Data annotation
• Enrichment of sequences with descriptions of their structural and functional features, related references and other sequence information
• By database annotators or sequence submitters
• Databases have different mechanisms for data annotation (GENBANK – only direct submission; SWISS-PROT – all sequence records)
• Data entry errors can be introduced
• Different interpretations
(3) Lack of standardized nomenclature
• Variations in naming conventions
• Synonyms, homonyms, and abbreviations
(4) Inadequacy of data quality control mechanisms
• Systematic approaches to data cleaning are lacking
Classification of data artifacts
HEADER – General information of the record.
FEATURE – Descriptions of the structural, functional, and other physico-chemical properties of the sequence and regions of interest.
SEQUENCE – Nucleotide or protein sequence.
[Figure: classification tree of biological data artifacts]
HEADER
• Invalid values: spelling errors; format violation (numerical names, undersized or oversized names)
• Ambiguity: synonyms; homonyms/abbreviations; misuse of fields
• Incompatible schema: concatenated values; mis-fielded values
FEATURE
• Cross-annotation error: conflicting features across different database records
• Annotation error: features do not correspond with sequence; putative features (over-prediction, under-prediction)
SEQUENCE
• Sequence structure violation
• Dubious sequences: uninformative sequences; undersized sequences; vector contaminated sequences
• Fragments
• Duplicates: replication of sequence information; different views; overlapping annotations of the same sequence
Other leaf labels shown under SEQUENCE in the figure: CDS miscoding; sequence entry error; annotation error; dubious records.
HEADER – Invalid values: Spelling errors
• Usually typographical errors.
• Occur in different fields of the record.
• We identified 569 possible misspelled words affecting up to 20,505 nucleotide records in Entrez.

Misspelling      Correction
Immunoglobin     Immunoglobulin
Cassete          Cassette
tranmembrane     transmembrane
asociated        associated
immunoglobin     immunoglobulin
Context of the misspellings
GenBank:AAD26534
nectin-1 [Rattus norvegicus]
TITLE Nectin/PRR: An Immunogloblin-like Cell Adhesion Molecule Recruited
to Cadherin-based Adherens Junctions through Interaction with
Afadin, a PDZ Domain-containing Protein
gi|4590334|gb|AAD26534.1
Patent Database:A76783
Sequence 11 from Patent WO9315210
CDS <1..150
/note="gene cassete encoding intercalating jun-zipper and
linker"
gi|6088638|emb|A76783.1||pat|WO|9315210|11[6088638]
Swiss-Prot:P03385
Env polyprotein precursor
DEFINITION Env polyprotein precursor [Contains: Surface protein (SU) (GP70);
Tranmembrane protein (TM) (p15E); R protein].
gi|119478|sp|P03385|ENV_MLVMO
EMBL:Y18050
E.faecium pbp5 gene
TITLE Modification of penicillin-binding protein 5 asociated with high
level ampicillin resistance in Enterococcus faecium
gi|1143442|emb|X92687.1|EFPBP5G
PIR:S02083
Ig lambda chain V-IV region - human (tentative sequence) (fragments)
TITLE The primary structure of the variable region of an immunoglobin IV
light-chain amyloid-fibril protein (AL GIL)
gi|87901|pir||S02083[87901]
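A dictionary lookup over free-text fields is one simple way to flag misspellings like those above. The sketch below is illustrative only: the function name and the tiny lookup table are assumptions built from the pairs listed on the slide, not the authors' tool, and a practical version would need a much larger curated vocabulary.

```python
import re

# Misspelling -> correction pairs taken from the slide; a real lookup
# would use a much larger, curated biological vocabulary.
KNOWN_MISSPELLINGS = {
    "immunoglobin": "immunoglobulin",
    "cassete": "cassette",
    "tranmembrane": "transmembrane",
    "asociated": "associated",
}

def find_misspellings(text):
    """Return (word, suggested correction) pairs found in a free-text field."""
    hits = []
    for word in re.findall(r"[A-Za-z]+", text):
        correction = KNOWN_MISSPELLINGS.get(word.lower())
        if correction is not None:
            hits.append((word, correction))
    return hits

title = ("Modification of penicillin-binding protein 5 asociated with high "
         "level ampicillin resistance in Enterococcus faecium")
print(find_misspellings(title))
```

Matching is case-insensitive, so "Tranmembrane" in a DEFINITION line is caught by the same entry as "tranmembrane".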
HEADER – Invalid values (format violation): Undersized/Oversized fields
• 0.05% or 83 protein names gathered from the UniProt data records (Release 2.3) are longer than 400 characters.
• Protein record DCB2_HUMAN in UniProt
  Definition “Discoidin, CUB and LCCL domain containing protein 2 precursor”
  Synonym 1 “Endothelial and smooth muscle cell-derived neuropilin-like protein”
  Synonym 2 “CUB, LCCL and coagulation factor V/VIII-homology domains protein 1”
• Protein records ACTM_LELER and ACTM_HELTB
  Synonym “M”
[Figure: two histograms, “String length of protein definitions” and “String length of protein synonyms”; x-axis: string length, y-axis: no. of records]
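A length check on name fields can be sketched as a simple integrity constraint. The 400-character upper bound comes from the slide; the lower bound of 2 characters is an assumption chosen so that a one-letter synonym such as “M” is flagged, and the function name is hypothetical.

```python
def flag_name_lengths(names, min_len=2, max_len=400):
    """Split names into undersized and oversized lists by string length.

    max_len=400 follows the slide; min_len=2 is an assumed threshold.
    """
    undersized = [n for n in names if len(n.strip()) < min_len]
    oversized = [n for n in names if len(n.strip()) > max_len]
    return undersized, oversized

names = [
    "M",
    "Discoidin, CUB and LCCL domain containing protein 2 precursor",
    "x" * 401,  # synthetic stand-in for an oversized name
]
undersized, oversized = flag_name_lengths(names)
```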
HEADER – Ambiguity: Synonyms and homonyms
Synonyms: different names given to the same sequence.
Homonyms: different sequences given the same name.
The scorpion neurotoxin BmK-X precursor has many synonyms.
• It is also known as “BmKX”, “BmK10”, “BmK-M10”, “Bmk M10”, “Neurotoxin M10”, “Alpha-neurotoxin TX9”, and “BmKalphaTx9”.
http://www.expasy.org/cgi-bin/niceprot.pl?P45697
HEADER – Ambiguity: Abbreviations
Different types of sequences can have the same abbreviation.
• BMK stands for “Big Map Kinase”, “B-cell/myeloid kinase”, “bovine midkine”, as well as “Bradykinin-potentiating peptide”.
• GK is the abbreviation for both “Glycerol Kinase” and the “Geko” gene of Drosophila melanogaster (fruit fly).
http://www.expasy.org/cgi-bin/niceprot.pl?Q9R1D9
The manifestation of synonyms, homonyms and abbreviations results in information ambiguities, which cause problems in sequence identification and keyword searching.
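Ambiguous abbreviations can be surfaced by grouping every expansion seen for each abbreviation and keeping those with more than one. This is a minimal sketch under the assumption that (abbreviation, full name) pairs have already been extracted from the records; the function name and the single-expansion URA3 row are illustrative.

```python
from collections import defaultdict

def find_ambiguous_abbreviations(pairs):
    """Map each abbreviation to its expansions; keep those with more than one."""
    expansions = defaultdict(set)
    for abbreviation, full_name in pairs:
        expansions[abbreviation.upper()].add(full_name)
    return {abbr: sorted(names)
            for abbr, names in expansions.items() if len(names) > 1}

pairs = [
    ("BMK", "Big Map Kinase"),
    ("BMK", "B-cell/myeloid kinase"),
    ("BMK", "bovine midkine"),
    ("GK", "Glycerol Kinase"),
    ("GK", "Geko"),
    ("URA3", "orotidine-5'-monophosphate decarboxylase"),  # unambiguous
]
ambiguous = find_ambiguous_abbreviations(pairs)
```

The same grouping, keyed on a sequence identifier instead of the abbreviation, would list synonyms of one sequence.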
HEADER – Ambiguity: Misuse of fields
Ambiguous field values
• In the example record, the definition includes species, length of sequence, etc.
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=protein&list_uids=639947&dopt=GenPept
HEADER – Incompatible schema: Concatenated values
Field concatenations occur during data transformations.
• When data fields of finer granularity are transformed into a schema with corresponding data fields of coarser granularity, the field values are concatenated.
• Multiple field values can be concatenated using “and” or “or”.
The gene name of the Swiss-Prot entry P29834 was “GRP 0.9 or GRP-1”. This was recently corrected.
http://www.expasy.org/cgi-bin/niceprot.pl?P15228
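Concatenated values joined by “and” or “or” can be flagged with a small pattern match. The sketch below is only a heuristic and the function name is hypothetical: it splits on a standalone connective, which will also fire on legitimate names that contain those words, so hits would need manual review.

```python
import re

# Standalone "and"/"or" surrounded by whitespace, case-insensitive.
_CONNECTIVE = re.compile(r"\s+(?:and|or)\s+", flags=re.IGNORECASE)

def split_concatenated(value):
    """Return the component values if the field looks concatenated, else []."""
    parts = [p.strip() for p in _CONNECTIVE.split(value) if p.strip()]
    return parts if len(parts) > 1 else []

print(split_concatenated("GRP 0.9 or GRP-1"))
```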
HEADER – Incompatible schema: Mis-fielded values
Flaws in schema mapping
Source fields not taken into account in the transformed data schema may be mapped to the wrong field.
• In the example record: “Sequence is directly submitted to GENBANK”.
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=protein&list_uids=18071172&dopt=GenPept
FEATURE – Cross-annotation error: Conflicting features across different databases
Multiple database records of the same nucleotide or protein sequence contain inconsistent or conflicting feature annotations. Causes include:
• data entry errors,
• mis-annotation of sequence functions,
• different expert interpretations, and
• inference of features or annotation transfer based on best matches of low sequence similarity.
Different annotation groups:
A comparative study of the annotations by three different groups of 340 genes of the Mycoplasma genitalium genome showed that incompatible descriptions were assigned to 8% of these genes.
Brenner SE (1999) Errors in genome annotation. TIG 15: 132-133.
Same annotation group:
http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=11692004
http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=11692006
FEATURE – Annotation error: Putative features
• Functional annotation sometimes involves searching for the highest matching annotated sequence in the database.
• Features are then extrapolated from the most similar known sequences found by the search.
• In some cases, even the highest matching sequence from a database search may have weak sequence similarity and therefore may not share the function of the query sequence (Bork, 2000 and Guigo et al., 2000).
• “Blind” inference can cause erroneous functional assignment.
• A study found that 24% of the Chlamydia trachomatis sequences contained erroneous functional assignments (Iliopoulos et al., 2003).
FEATURE – Intron/Exon overlaps
• Illogical feature entities that do not correspond to the logical constraints of the gene structure.
• 12 out of 42,359 nucleotide sequences have overlapping intron/exon regions.
Introns and exons must be non-overlapping except in cases of alternative splicing.
Examples:
• Syn7 gene of putative polyketide synthase in NCBI TPA record BN000507 has overlapping intron 5 and exon 6.
• rpb7+ RNA polymerase II subunit in GENBANK record AF055916 has overlapping exon 1 and exon 2.
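The non-overlap constraint on introns and exons can be checked with a pairwise interval test. This sketch assumes the feature coordinates have already been parsed into (label, start, end) tuples with 1-based inclusive positions; the coordinates shown are made up for illustration and are not taken from BN000507 or AF055916. It does not model the alternative-splicing exception, so flagged pairs are candidates for review rather than confirmed errors.

```python
from itertools import combinations

def overlapping_features(features):
    """Return label pairs whose (start, end) intervals overlap.

    features: list of (label, start, end) with 1-based inclusive coordinates.
    """
    overlaps = []
    for (name_a, start_a, end_a), (name_b, start_b, end_b) in combinations(features, 2):
        # Two closed intervals overlap when each starts before the other ends.
        if start_a <= end_b and start_b <= end_a:
            overlaps.append((name_a, name_b))
    return overlaps

# Hypothetical gene structure: exon 2 starts before intron 1 ends.
features = [("exon 1", 1, 120), ("intron 1", 121, 300), ("exon 2", 290, 450)]
print(overlapping_features(features))
```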
SEQUENCE – Dubious sequences: Uninformative sequence
Sequences have meaningless content
• A high percentage of unknown residues (“X”) or unknown bases (“N”) reduces the complexity of the sequence and thus its information content.
• Three out of the nine residues of the unknown protein CP19 “XXFESXEMR” in UniProt record UN19_CLOPA are unknown.
• Chain C of an MHC protein, “XFVKQNAXALX”, in PDB contains roughly 30% unknown residues (3 of 11).
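The share of unknown symbols in a sequence is straightforward to compute. A minimal sketch, with a hypothetical function name; what fraction counts as “uninformative” is left to the caller because the slide does not state a cut-off.

```python
def unknown_fraction(sequence, unknown_symbol):
    """Fraction of positions equal to the unknown symbol ("X" or "N")."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return seq.count(unknown_symbol.upper()) / len(seq)

cp19 = unknown_fraction("XXFESXEMR", "X")       # 3 of 9 residues
mhc_c = unknown_fraction("XFVKQNAXALX", "X")    # 3 of 11 residues
```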
SEQUENCE – Dubious sequences: Undersized sequence
Sequences have meaningless content
• Among the 5,146,255 protein records queried using Entrez across the major protein or translated nucleotide databases, 3,327 protein sequences are shorter than four residues (as of Sep 2004).
• By Nov 2004, the total number of undersized protein sequences had increased to 3,350.
• Among 43,026,887 nucleotide records queried using Entrez across the major nucleotide databases, 1,448 records contain sequences shorter than six bases (as of Sep 2004).
• By Nov 2004, the total number of undersized nucleotide sequences had increased to 1,711.
[Figure: two bar charts, “Undersized protein sequences in major databases” (DDBJ, EMBL, GenBank, SwissProt, PIR, PDB) and “Undersized nucleotide sequences in major databases” (DDBJ, EMBL, GenBank, PDB); x-axis: sequence length, y-axis: number of records]
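The undersized criterion used in the counts above (proteins shorter than four residues, nucleotide sequences shorter than six bases) can be written as a simple integrity constraint. The function and constant names are assumptions for illustration.

```python
# Minimum lengths from the slide: a sequence below these is "undersized".
MIN_LENGTH = {"protein": 4, "nucleotide": 6}

def is_undersized(sequence, kind):
    """True if the sequence is shorter than the minimum for its kind."""
    return len(sequence) < MIN_LENGTH[kind]

print(is_undersized("MKT", "protein"), is_undersized("ACGTAC", "nucleotide"))
```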
SEQUENCE – Dubious sequences: Vector contaminated sequence
• Vectors are agents that carry DNA fragments into a host cell.
• The vector sequences probe and bind the DNA fragments at the 5’ and 3’ sites.
• The DNA fragment is then isolated from its vectors by cutting at the restriction enzyme sites.
8 out of 8,850 Candida albicans sequences are possibly contaminated with vectors commonly used for the cloning of Candida albicans sequences.
We used BLAST to search for regions in the Candida albicans sequences that match any of the 18 cloning vectors. From the matched results, we selected those with matches at the 3’ or 5’ ends of the Candida albicans sequences. Matching sections of the sequences extend from 30 bases to 1,154 bases.
Sequence: Candida albicans orotidine-5'-monophosphate decarboxylase (URA3) gene, complete cds (GI: 6468322, GenBank accession: AF109400.1)
Length: 3,891
Cloning vector: pGT-GFP-URA3-14 (GI: 50363243, GenBank accession: AY656808.1)
Contaminated region: 1,900 – 3,047 (1,154 bases)

Sequence: Candida albicans URA3 gene for orotidine-5'-monophosphate decarboxylase (GI: 2523, EMBL accession: X14198.1)
Length: 1,365
Cloning vector: pGT-GFP-URA3-14 (GI: 50363243, GenBank accession: AY656808.1)
Contaminated region: 133 – 1,280 (1,154 bases)
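The selection step after the BLAST search, keeping only matches that lie at the 5’ or 3’ end of a sequence, could look roughly like the sketch below. It assumes the BLAST hits have already been reduced to (start, end) positions on the query sequence, 1-based and inclusive. The `max_gap` tolerance and the function name are assumptions: the slide does not say how close to an end a match had to be, and the coordinates in the example are invented.

```python
def matches_at_ends(hits, seq_len, max_gap=10):
    """Keep hits that start near the 5' end or finish near the 3' end.

    hits: list of (start, end) positions on the query, 1-based inclusive.
    max_gap: assumed tolerance, in bases, between a hit and the sequence end.
    """
    selected = []
    for start, end in hits:
        near_5_prime = start - 1 <= max_gap
        near_3_prime = seq_len - end <= max_gap
        if near_5_prime or near_3_prime:
            selected.append((start, end))
    return selected

# Hypothetical hits on a 1,000-base sequence.
hits = [(1, 40), (500, 560), (960, 1000)]
print(matches_at_ends(hits, seq_len=1000))
```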
SEQUENCE – Fragments: Fragmented sequences in different records
• Extensive redundancy arises when records contain fragments of, or sequences overlapping with, more complete sequences held in other records.
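The simplest case of a fragment, a sequence contained entirely inside a longer sequence from another record, can be detected with a substring test. This is a brute-force sketch with hypothetical record identifiers and a hypothetical function name; it does not cover partially overlapping sequences, and at database scale an index or alignment tool would be needed instead.

```python
def find_fragments(records):
    """Return (fragment_id, container_id) pairs for exact containment.

    records: dict mapping record id -> sequence string.
    """
    pairs = []
    for frag_id, frag_seq in records.items():
        for full_id, full_seq in records.items():
            if (frag_id != full_id
                    and len(frag_seq) < len(full_seq)
                    and frag_seq in full_seq):
                pairs.append((frag_id, full_id))
    return pairs

records = {"r1": "MKTAYIAKQRQISFVK", "r2": "AYIAKQ", "r3": "GGGG"}
print(find_fragments(records))
```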
SEQUENCE – Duplicates: Replication of sequence information
Identical sequences with the same annotations
• Submission of the same sequence to different databases
• Repeated submission of the same sequence to the same database
• Initially submitted by different groups
• Protein sequences may be translated from duplicate nucleotide sequences
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=protein&list_uids=11692005&dopt=GenPept
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=protein&list_uids=11692005&dopt=GenPept
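Records carrying identical sequences can be grouped by using the normalised sequence as a key. The sketch below only finds exact duplicates after removing whitespace and case differences; comparing the annotations of the grouped records is a separate step that is not shown. Record identifiers and the function name are hypothetical.

```python
from collections import defaultdict

def group_identical_sequences(records):
    """Return lists of record ids that share exactly the same sequence.

    records: dict mapping record id -> sequence string.
    """
    groups = defaultdict(list)
    for record_id, sequence in records.items():
        key = "".join(sequence.split()).upper()  # drop whitespace, ignore case
        groups[key].append(record_id)
    return [ids for ids in groups.values() if len(ids) > 1]

records = {"a": "MKTAYI", "b": "mkt ayi", "c": "MKTAYL"}
print(group_identical_sequences(records))
```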
SEQUENCE – Duplicates: Different views; overlapping annotations of the same sequence
http://www.expasy.org/cgi-bin/niceprot.pl?Q95P69
http://www.expasy.org/cgi-bin/niceprot.pl?Q9GNG8
Data Cleaning Framework
[Figure: artifact types grouped by the level at which they are handled and by cleaning method]
ATTRIBUTE
• Dictionary lookup: spelling errors; synonyms; homonyms/abbreviations
• Integrity constraints: uninformative sequences; undersized sequences; format violation; misuse of fields
RECORD
• Vector screening: vector contaminated sequences
• Sequence structure parser: features do not correspond with sequence; sequence structure violation
• Schema remapping: concatenated values; mis-fielded values
SINGLE-SOURCE DATABASE
• Duplicate detection: replication of sequence information; different views; overlapping annotations of the same sequence
MULTI-SOURCE DATABASE
• Comparative analysis: fragments; putative features; cross-annotation error
Conclusion
• 9 types of data artifacts.
• A combination of critical artifacts (vector contaminated sequences, duplicates, sequence structure violations) and non-critical artifacts (misspellings, synonyms).
• At least 20,000 sequence records in public databases contain some form of artifact.
• Deteriorating data quality requires more attention.
• Identifying these artifacts is an important preliminary step toward accurate data mining and knowledge discovery.
• This classification provides a basis for the design of biological data cleaning methods.
Acknowledgement
Supervisors: Prof. Vladimir Brusic, Dr. Lee Mong Li
Biologists: Asif M. Khan, Paul T.J. Tan, Heiny Tan, Kenneth Lee, Songsak Tongchusak,
Wilson Goh
Engineer: Kavitha Gopalakrishnan