A Classification of Biological Data Artifacts
Judice L.Y. Koh (1,2), Mong Li Lee (2), Vladimir Brusic (1)
(1) Knowledge Discovery Department, Institute for Infocomm Research
(2) School of Computing, National University of Singapore
DBiBD: Workshop on Database Issues in Biological Databases
http://research.i2r.a-star.edu.sg/Templar/

Data warehousing process
Sources: GenPept, EMBL, TrEMBL, NCBI RefSeq, Patent database …
• The analysis and publishing of a specialist database (SSDW) takes 2 weeks to 1 month.
• Those building it typically spend up to ¼ of their time cleaning biological data records.

Biological data quality
• Public molecular databases (GenBank, Swiss-Prot, DDBJ, EMBL, PIR, among others) provide rich sources of biological data.
• They supply the information for data analysis and the data sources for knowledge discovery in the BioWare data warehouse.
• The accuracy of data analysis and the ability to produce correct results from data mining rely on the quality of the data.
• But how can we ensure high-quality data in our data warehouse?

DBiBD: A classification of biological data artifacts

Objectives of our study of data artifacts in biological databases
• Critically assess the quality of the data that biologists and computer scientists have been using for data analysis and data mining.
• Provide a roadmap for improving the quality of data in molecular databases.
• Form the basis of biological data cleaning.

Biological data artifacts
Errors, discrepancies, redundancies, ambiguities, and incompleteness in molecular databases that reduce the quality of the biological data.
Outline of presentation
• Sources of Biological Data Artifacts
• HEADER Artifacts
• FEATURE Artifacts
• SEQUENCE Artifacts
• Data Cleaning Framework
• Conclusion

Sources of biological data artifacts
(1) Diverse sources of data
• Extensive duplication
• Repeated submissions of the same sequences to the same or different databases
• Cross-updating of databases (propagation of errors)
(2) Data annotation
• Enrichment of sequences with descriptions of their structural and functional features, related references, and other sequence information
• Performed by database annotators or sequence submitters
• Databases have different mechanisms for data annotation (GENBANK – only direct submission; SWISS-PROT – all sequence records)
• Data entry errors can be introduced
• Different interpretations
(3) Lack of standardized nomenclature
• Variations in naming conventions
• Synonyms, homonyms, and abbreviations
(4) Inadequacy of data quality control mechanisms
• Systematic approaches to data cleaning are lacking

Classification of data artifacts
HEADER – General information about the record.
FEATURE – Descriptions of the structural, functional, and other physico-chemical properties of the sequence and regions of interest.
SEQUENCE – Nucleotide or protein sequence.
Classification of data artifacts (diagram)
HEADER: Invalid values; Ambiguity; Incompatible schema
FEATURE: Cross-annotation error; Annotation error; Sequence structure violation
SEQUENCE: Dubious sequences; Fragments; Duplicates
Subtypes named in the diagram: Spelling errors, Numerical names, Format violation, Undersized or oversized names, Synonyms, Homonyms/Abbreviations, Misuse of fields, Concatenated values, Mis-fielded values, Conflicting features across different database records, Features do not correspond with sequence, Over-prediction, Putative features, Under-prediction, Uninformative sequences, CDS miscoding, Sequence entry error, Undersized sequences, Vector contaminated sequences, Dubious records, Replication of sequence information, Different views, Overlapping annotations of the same sequence

Spelling errors (HEADER: Invalid values)
• Usually typographical errors
• Occur in different fields of the record
• We identified 569 possible misspelled words affecting up to 20,505 nucleotide records in Entrez.
Misspelling      Correction
Immunoglobin     Immunoglobulin
Cassete          Cassette
tranmembrane     transmembrane
asociated        associated
immunoglobin     immunoglobulin

Context of the misspellings
• GenBank:AAD26534 nectin-1 [Rattus norvegicus]
  TITLE Nectin/PRR: An Immunogloblin-like Cell Adhesion Molecule Recruited to Cadherin-based Adherens Junctions through Interaction with Afadin, a PDZ Domain-containing Protein
  gi|4590334|gb|AAD26534.1
• Patent Database:A76783 Sequence 11 from Patent WO9315210
  CDS <1..150 /note="gene cassete encoding intercalating jun-zipper and linker"
  gi|6088638|emb|A76783.1||pat|WO|9315210|11[6088638]
• Swiss-Prot:P03385 Env polyprotein precursor
  DEFINITION Env polyprotein precursor [Contains: Surface protein (SU) (GP70); Tranmembrane protein (TM) (p15E); R protein].
  gi|119478|sp|P03385|ENV_MLVMO
• EMBL:Y18050 E.faecium pbp5 gene
  TITLE Modification of penicillin-binding protein 5 asociated with high level ampicillin resistance in Enterococcus faecium
  gi|1143442|emb|X92687.1|EFPBP5G
• PIR:S02083 Ig lambda chain V-IV region - human (tentative sequence) (fragments)
  TITLE The primary structure of the variable region of an immunoglobin IV light-chain amyloid-fibril protein (AL GIL)
  gi|87901|pir||S02083[87901]

Undersized/Oversized fields (HEADER: Invalid values)
• 83 protein names (0.05%) gathered from the UniProt data records (Release 2.3) are longer than 400 characters.
• Protein record DCB2_HUMAN in UniProt:
  Definition “Discoidin, CUB and LCCL domain containing protein 2 precursor”
  Synonym 1 “Endothelial and smooth muscle cell-derived neuropilin-like protein”
  Synonym 2 “CUB, LCCL and coagulation factor V/VIII-homology domains protein 1”
• Protein records ACTM_LELER and ACTM_HELTB:
  Synonym “M”
[Figures: String length of protein definitions; String length of protein synonyms. Axes: String length vs. No. of records.]

Synonyms and homonyms (HEADER: Ambiguity)
Synonyms: Different names given to the same sequence
Homonyms: Different sequences given the same name
• The scorpion neurotoxin BmK-X precursor has a range of synonyms.
• It is also known as “BmKX”, “BmK10”, “BmK-M10”, “Bmk M10”, “Neurotoxin M10”, “Alpha-neurotoxin TX9”, and “BmKalphaTx9”.
http://www.expasy.org/cgi-bin/niceprot.pl?P45697

Abbreviations (HEADER: Ambiguity)
Different types of sequences can have the same abbreviation.
• BMK stands for “Big Map Kinase”, “B-cell/myeloid kinase”, “bovine midkine”, as well as for “Bradykinin-potentiating peptide”.
• GK is the abbreviation for both “Glycerol Kinase” and the “Geko” gene of Drosophila melanogaster (fruit fly).
http://www.expasy.org/cgi-bin/niceprot.pl?Q9R1D9

Synonyms, homonyms, and abbreviations create ambiguities that cause problems in sequence identification and keyword searching.
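The ambiguity described above can be made concrete with a small lookup table. The following is a minimal sketch, not the authors' method: it assumes a hand-built synonym table seeded with the names quoted on the slides, and the canonical label "BmK-X" and all function names are illustrative choices rather than a curated standard.

```python
# Synonyms quoted on the slide for the scorpion neurotoxin BmK-X precursor.
# Choosing "BmK-X" as the canonical label is an assumption for illustration.
SYNONYMS = {
    "BmK-X": ["BmKX", "BmK10", "BmK-M10", "Bmk M10", "Neurotoxin M10",
              "Alpha-neurotoxin TX9", "BmKalphaTx9"],
}

# Abbreviations and their expansions, as listed on the slide.
ABBREVIATIONS = {
    "BMK": ["Big Map Kinase", "B-cell/myeloid kinase", "bovine midkine",
            "Bradykinin-potentiating peptide"],
    "GK": ["Glycerol Kinase", "Geko"],
}


def _key(name):
    """Lower-case the name and keep only letters and digits before comparing."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def build_synonym_index(synonyms):
    """Map every normalised name (canonical or synonym) to its canonical label."""
    index = {}
    for canonical, names in synonyms.items():
        for name in [canonical, *names]:
            index[_key(name)] = canonical
    return index


def canonical_name(name, index):
    """Return the canonical label for a name, or None if it is not in the table."""
    return index.get(_key(name))


def ambiguous_abbreviations(abbreviations):
    """Return only the abbreviations that have more than one expansion."""
    return {abbr: exp for abbr, exp in abbreviations.items() if len(exp) > 1}


index = build_synonym_index(SYNONYMS)
print(canonical_name("bmk m10", index))
print(sorted(ambiguous_abbreviations(ABBREVIATIONS)))
```

Normalising case, spaces, and hyphens catches variants such as “BmK-M10” and “Bmk M10”, but it cannot resolve homonyms: an abbreviation like BMK still needs context (for example the organism or the record's features) to pick the right expansion.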
Misuse of fields (HEADER: Ambiguity)
Ambiguous field values
• Definition includes species, length of sequence, etc.
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=protein&list_uids=639947&dopt=GenPept

Concatenated values (HEADER: Incompatible schema)
Field concatenations occur during data transformations.
• When data fields of finer granularity are transformed into a schema with corresponding fields of coarser granularity, the field values are concatenated.
• Multiple field values can be concatenated using “and” or “or”.
• The gene name of the Swiss-Prot entry P29834 was “GRP 0.9 or GRP-1”. This was recently corrected.
http://www.expasy.org/cgi-bin/niceprot.pl?P15228

Mis-fielded values (HEADER: Incompatible schema)
Flaws in schema mapping
• Source fields that the transformed data schema does not account for may be mapped to the wrong field.
• Example: a sequence directly submitted to GENBANK
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=protein&list_uids=18071172&dopt=GenPept

Conflicting features across different databases (FEATURE: Cross-annotation error)
Multiple database records of the same nucleotide or protein sequence contain inconsistent or conflicting feature annotations. Causes include:
• data entry errors,
• mis-annotation of sequence functions,
• different expert interpretations, and
• inference of features or annotation transfer based on best matches of low sequence similarity.
Different annotation groups:
• A comparative study of the annotations of 340 genes of the Mycoplasma genitalium genome by three different groups showed that incompatible descriptions were assigned to 8% of these genes.
  Brenner SE (1999) Errors in genome annotation. TIG 15: 132-133.
Same annotation group:
http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=11692004
http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=11692006

Putative features (FEATURE: Cross-annotation error)
• Functional annotation sometimes involves searching for the highest-matching annotated sequence in the database.
• Features are then extrapolated from the most similar known sequences found.
• In some cases, even the highest-matching sequence from a database search may have weak sequence similarity and therefore may not share the function of the query sequence (Bork, 2000 and Guigo et al., 2000).
• “Blind” inference can cause erroneous functional assignment.
• A study found that 24% of the Chlamydia trachomatis sequences contained erroneous functional assignments (Iliopoulos et al., 2003).

Intron/Exon overlaps (FEATURE: Sequence structure violation)
• Feature entries that violate the logical constraints of the gene structure.
• 12 out of 42,359 nucleotide sequences have overlapping intron/exon regions.
• Introns and exons must be non-overlapping except in cases of alternative splicing.
Examples:
• The Syn7 gene of putative polyketide synthase in NCBI TPA record BN000507 has overlapping intron 5 and exon 6.
• The rpb7+ RNA polymerase II subunit in GENBANK record AF055916 has overlapping exon 1 and exon 2.

Uninformative sequences (SEQUENCE: Dubious sequences)
Sequences have meaningless content
• A high proportion of unknown residues (“X”) or unknown bases (“N”) reduces the complexity of the sequence and thus its information content.
• Three of the nine residues of the unknown protein CP19 “XXFESXEMR” in UniProt record UN19_CLOPA are unknown.
• Chain C of an MHC protein “XFVKQNAXALX” in PDB contains roughly 30% unknown residues (3 of 11).
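The proportion of unknown symbols in the two examples above is straightforward to compute. The following is a minimal sketch of such a check; the function names and the 25% cut-off are assumptions chosen for illustration and do not come from the slides.

```python
def unknown_fraction(sequence, unknown="X"):
    """Fraction of positions holding the unknown symbol.

    Use unknown="X" for protein sequences and unknown="N" for nucleotides.
    """
    if not sequence:
        raise ValueError("sequence must not be empty")
    seq = sequence.upper()
    return seq.count(unknown.upper()) / len(seq)


def is_uninformative(sequence, unknown="X", threshold=0.25):
    """Flag a sequence whose unknown fraction reaches the (hypothetical) threshold."""
    return unknown_fraction(sequence, unknown) >= threshold


# The two protein examples quoted on the slide.
for seq in ("XXFESXEMR", "XFVKQNAXALX"):
    print(seq, round(unknown_fraction(seq), 3), is_uninformative(seq))
```

A fixed threshold is a blunt instrument: a long sequence with a short run of unknown bases may still be useful, so in practice the cut-off would need to depend on sequence length and intended use.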
Undersized sequences (SEQUENCE: Dubious sequences)
Sequences have meaningless content
• Among the 5,146,255 protein records queried using Entrez across the major protein or translated nucleotide databases, 3,327 protein sequences are shorter than four residues (as of Sep 2004).
• By Nov 2004, the total number of undersized protein sequences had increased to 3,350.
• Among the 43,026,887 nucleotide records queried using Entrez across the major nucleotide databases, 1,448 records contain sequences shorter than six bases (as of Sep 2004).
• By Nov 2004, the total number of undersized nucleotide sequences had increased to 1,711.
[Figures: Undersized protein sequences in major databases (DDBJ, EMBL, GenBank, SwissProt, PIR, PDB); Undersized nucleotide sequences in major databases (DDBJ, EMBL, GenBank, PDB). Axes: Sequence length vs. Number of records.]

Vector contaminated sequences (SEQUENCE: Dubious sequences)
• Vectors are agents that carry DNA fragments into a host cell.
• The vector sequences bind the DNA fragments at the 5’ and 3’ sites.
• The DNA fragment is then isolated from its vector by cutting at the restriction enzyme sites.
8 out of 8,850 Candida albicans sequences are possibly contaminated with vectors commonly used for the cloning of Candida albicans sequences. We used BLAST to search for regions in the Candida albicans sequences that match any of the 18 cloning vectors.
From the matched results, we selected those with matches at the 3’ or 5’ ends of the Candida albicans sequences. The matching sections range from 30 bases to 1,154 bases.

Sequence: Candida albicans orotidine-5'-monophosphate decarboxylase (URA3) gene, complete cds; GI: 6468322; GenBank accession: AF109400.1
Length: 3,891
Cloning vector: pGT-GFP-URA3-14; GI: 50363243; GenBank accession: AY656808.1
Contaminated region: 1,900 – 3,047 (1,154 bases)

Sequence: Candida albicans URA3 gene for orotidine-5'-monophosphate decarboxylase; GI: 2523; EMBL accession: X14198.1
Length: 1,365
Cloning vector: pGT-GFP-URA3-14; GI: 50363243; GenBank accession: AY656808.1
Contaminated region: 133 – 1,280 (1,154 bases)

Fragments (SEQUENCE)
Fragmented sequences in different records
• Extensive redundancy arises when records contain fragments of, or sequences overlapping with, more complete sequences held in other records.

Replication of sequence information (SEQUENCE: Duplicates)
Identical sequences with the same annotations
• Submission of the same sequence to different databases
• Repeated submission of the same sequence to the same database
• Initial submission by different groups
• Protein sequences may be translated from duplicate nucleotide sequences
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=protein&list_uids=11692005&dopt=GenPept
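Exact replication of sequence information, as described above, can be detected by grouping records on their sequence. The following is a minimal sketch under simplifying assumptions: records are plain (identifier, sequence) pairs, only the sequence is compared (annotations are ignored), and the identifiers in the usage example are made up.

```python
from collections import defaultdict


def find_duplicate_groups(records):
    """Group record identifiers whose sequences are identical.

    `records` is an iterable of (record_id, sequence) pairs. Whitespace and
    letter case are ignored when comparing sequences.
    """
    groups = defaultdict(list)
    for record_id, sequence in records:
        key = "".join(sequence.split()).upper()
        groups[key].append(record_id)
    # Keep only sequences that occur in more than one record.
    return [ids for ids in groups.values() if len(ids) > 1]


# Hypothetical records: recA and recB hold the same sequence.
records = [
    ("recA", "MKT LLVA"),
    ("recB", "mktllva"),
    ("recC", "MKTLLVG"),
]
print(find_duplicate_groups(records))
```

Exact matching only finds verbatim replication. Fragments and overlapping sequences need a similarity search instead, since a fragment is by definition not identical to the more complete record.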
Different views; Overlapping annotations of the same sequence (SEQUENCE: Duplicates)
http://www.expasy.org/cgi-bin/niceprot.pl?Q95P69
http://www.expasy.org/cgi-bin/niceprot.pl?Q9GNG8

Data Cleaning Framework (diagram)
Levels: ATTRIBUTE, RECORD, SINGLE-SOURCE DATABASE, MULTI-SOURCE DATABASE
Cleaning methods: Dictionary lookup, Integrity constraints, Vector screening, Sequence structure parser, Schema remapping, Duplicate detection, Comparative analysis
Artifacts addressed: Spelling errors, Synonyms, Homonyms/Abbreviations, Uninformative sequences, Undersized sequences, Format violation, Misuse of fields, Vector contaminated sequences, Features do not correspond with sequence, Sequence structure violation, Concatenated values, Mis-fielded values, Replication of sequence information, Different views, Overlapping annotations of the same sequence, Fragments, Putative features, Cross-annotation error

Conclusion
• 9 types of data artifacts.
• A combination of critical artifacts (vector contaminated sequences, duplicates, sequence structure violations) and non-critical artifacts (misspellings, synonyms).
• At least 20,000 sequence records in public databases contain some form of artifact.
• Declining data quality requires more attention.
• Identifying these artifacts is an important preliminary step toward accurate data mining and knowledge discovery.
• This classification provides a basis for the design of biological data cleaning methods.

Acknowledgement
Supervisors: Prof. Vladimir Brusic, Dr. Lee Mong Li
Biologists: Asif M. Khan, Paul T.J. Tan, Heiny Tan, Kenneth Lee, Songsak Tongchusak, Wilson Goh
Engineer: Kavitha Gopalakrishnan