Strategies towards improving the utility of scientific big data

Evan Bolton, PhD
National Center for Biotechnology Information (NCBI)
National Library of Medicine (NLM)
National Institutes of Health (NIH)
Sep. 4, 2014

http://www.nlm.nih.gov/
U.S. National Center for Biotechnology Information: https://www.ncbi.nlm.nih.gov/
PubChem website: https://pubchem.ncbi.nlm.nih.gov/

PubChem primary goal
… to be an on-line resource providing comprehensive information on the biological activities of substances, where “substance” means any biologically testable entity: small molecules, RNAs, carbohydrates, peptides, plant extracts, etc.

PubChem data growth over ten years
[Charts: Contributors; Chemicals; Protein Targets; Tested Chemicals; Biological Assays; Bioactivity Results]
+280 substance contributors, +60 assay contributors, +150M substances, +50M compounds, +1.0M bioassays, +6.1T protein targets, +2.9M tested substances, +2.0M tested compounds, +225M bioactivity result sets
[M = millions, T = thousands, MLP = Molecular Libraries Program]

CAVEAT!
All data has “errors”
Big data has “big errors”

Hypothetical
If your average data error rate is 1 in 1,000,000, you have 99.9999% data accuracy.
If you have one trillion facts (10^12), can you accept one million errors (10^6)?

Strategies to mitigate errors?
Manual curation has its limits (accuracy, cost, time)
So … what do you do?

Error suppression strategies for scientific big data
1. Identify quality {un}known known/unknowns: use to formulate an error suppression strategy
2. Perform data normalization: improves utility by helping to refine identification
3. “Trust but verify”: cross compare authoritative and curated data
4. Consistency filtering: improves precision by removal of outliers
5. Address error feedback loops: use “is”, “can be”, and, if all else fails, “is not” lists

1. Identify quality {un}known known/unknowns
use to formulate an error suppression strategy
“… there are known knowns; there are things that we know that we know. We also know there are known unknowns; that is to say, we know there are some things we do not know. But there are also unknown unknowns, the ones we don't know we don't know.”
Donald Rumsfeld, Feb. 2002 news briefing
Image credit: http://en.wikipedia.org/wiki/Donald_Rumsfeld
• Salt-form variations are common
• The chemical meaning or drawing of a substance may change upon context
• Tautomers and resonance forms of the same chemical structure are prolific
[Figure: (+)-iridodial, ring closed and ring open]
[Figure: Defense chemicals from abdominal glands of 13 rove beetle species of subtribe Staphylinina]

2. Perform data normalization
improves utility by helping to refine identification
• Verify chemical content
  - Atoms defined/real
  - Implicit hydrogen
  - Functional group
  - Atom valence sanity
• Calculate
  - Coordinates
  - Properties
  - Descriptors
• Normalize representation
  - Tautomer invariance
  - Aromaticity detection
  - Stereochemistry
  - Explicit hydrogen
• Detect components
  - Isolate covalent units
  - Neutralize (+/- proton)
  - Reprocess
  - Detect unique

3. “Trust but verify”
cross compare authoritative and curated data
Доверяй, но проверяй (doveryai, no proveryai; “trust, but verify”): Russian proverb used extensively by Ronald Reagan when discussing relations with the Soviet Union, or John Kerry's more recent adaption of the phrase when discussing Syria's chemical weapons disposal: “Verify and verify”
Image credit: http://en.wikipedia.org/wiki/Ronald_Reagan
Image credit: http://en.wikipedia.org/wiki/John_Kerry

Cross concept count %
       CTD    HDO    KEG    MED    NDF    ORD
CTD  100.0   14.3   79.1   40.7   49.7   35.8
HDO   26.0  100.0   38.7   52.4   48.3   26.2
KEG   24.8    6.7  100.0   10.7    6.4   25.2
MED   97.2   68.9   81.6  100.0   93.8   79.6
NDF   30.4   16.3   12.5   24.0  100.0   10.8
ORD   31.9   12.8   71.6   29.7   15.7  100.0
Cross-reference overlaps between various disease resources: Human Disease Ontology (HDO), NCBI MedGen (MED), CTD MEDIC (CTD), KEGG Disease (KEG), NDF-RT (NDF), and OrphaNet (ORD), using NLM Medical Subject Headings (MeSH) as the basis of comparison.

4. Consistency filtering
improves precision by removal of outliers
[Figure: Histogram of MNIDs per CID; series: Original; Many votes, 70%; Many votes, 60%; One Vote, 70%; One Vote, 60%]
[Figure: Histogram of Fate of CID-MNID Pairs; series: Many votes, 70%; Many votes, 60%; One Vote, 70%; One Vote, 60%]
Keep consensus, remove the rest
Image credit: http://withfriendship.com/images/c/11229/Accuracy-and-precision-picture.png

5. Address error feedback loops
use “is”, “can be”, and, if all else fails, “is not” lists
Prevent error proliferation at the data source, when possible

Okay … now what?
… you have cleaned up your data
… but it is huge, unwieldy, unstructured
How can it be made more useful?

Data organization strategies for scientific big data
1. Crosslink and annotate data: provides context and identifies associated concepts
2. Establish similarity schemes: enables identification of related records
3. Associate to concept hierarchies: improves navigation between related records
4. Perform data reduction: suppresses “redundant” information
5. Be succinct: simplifies presentation by hiding details

1. Crosslink and annotate data
provides context and identifies associated concepts
[Diagram: linked record types: Substance, Compound, Protein, Gene, Pathway, Disease, Patent, Publication, Drug; link labels: treat, cites, ingredient]

2. Establish similarity schemes
enables identification of related records
[Figure: example, Vioxx]

3. Associate to concept hierarchies
improves navigation between related records
[Diagram: Match to concept; Independent hierarchy = chemical, protein, gene, patent, publication, pathway, …; Organized records]

4. Perform data reduction
suppresses “redundant” information

5. Be succinct
simplifies presentation by hiding details
“subject-predicate-object”
“atorvastatin may treat hypercholesterolemia”
• subject: atorvastatin
• predicate: may treat
• object: hypercholesterolemia
Provenance information
• Evidence citation (PMID)
• From whom? (Data Source)

Concluding remarks
Scientific “big data” …
… contains an amazing amount of information
… provides opportunities to make discoveries
… benefits from strategies to massage it
PubChem is doing its part …
… making chemical substance data broadly accessible
… cross-integrating it to key scientific resources
… suppressing errors and their propagation
… organizing the data and making it available
https://pubchem.ncbi.nlm.nih.gov

PubChem Crew …
Steve Bryant, Tiejun Chen, Gang Fu, Lewis Geer, Renata Geer, Asta Gindulyte, Volker Hahnke, Lianyi Han, Jane He, Siqian He, Sunghwan Kim, Ben Shoemaker, Paul Thiessen, Jiyao Wang, Yanli Wang, Bo Yu, Jian Zhang
Special thanks to the NCBI Help Desk, especially Rana Morris

Any questions?
If you think of one later, email me: bolton@ncbi.nlm.nih.gov
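
Worked example for the “Hypothetical” slide: a minimal Python sketch of the error arithmetic. The function names are illustrative and not from the talk. At an average error rate of 1 in 1,000,000 the accuracy is 99.9999%, and one trillion (10^12) facts would carry about one million (10^6) errors.

```python
from fractions import Fraction


def accuracy_percent(error_rate: Fraction) -> float:
    """Share of facts expected to be correct, as a percentage."""
    return float((1 - error_rate) * 100)


def expected_errors(total_facts: int, error_rate: Fraction) -> int:
    """Expected number of erroneous facts at the given average error rate."""
    return int(total_facts * error_rate)


# Numbers from the hypothetical: 1 error per million facts, one trillion facts.
error_rate = Fraction(1, 1_000_000)
total_facts = 10**12

print(f"accuracy: {accuracy_percent(error_rate):.4f}%")
print(f"expected errors: {expected_errors(total_facts, error_rate):,}")
```

Exact fractions are used so that the small error rate is not blurred by floating-point rounding before the final formatting step.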
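
Worked example for “Trust but verify”: a minimal sketch of how a cross-resource overlap table could be computed once every resource's concepts are mapped to a shared vocabulary (MeSH in the talk). The function name, the toy resource names and identifiers, and the choice to divide by the size of the column resource are assumptions for illustration; the talk does not state how its percentages were normalized.

```python
def overlap_percent(resources: dict) -> dict:
    """For each (row, column) pair, return the percentage of the column
    resource's concepts that also appear in the row resource.

    `resources` maps a resource name to the set of shared-vocabulary
    identifiers it references. Dividing by the column size is an assumption.
    """
    table = {}
    for row_name, row_ids in resources.items():
        table[row_name] = {}
        for col_name, col_ids in resources.items():
            if not col_ids:
                table[row_name][col_name] = 0.0
                continue
            shared = len(row_ids & col_ids)
            table[row_name][col_name] = round(100.0 * shared / len(col_ids), 1)
    return table


# Hypothetical toy data: three resources referencing made-up identifiers.
resources = {
    "A": {"id1", "id2", "id3", "id4"},
    "B": {"id2", "id3"},
    "C": {"id3", "id5"},
}

table = overlap_percent(resources)
for row_name, row in table.items():
    print(row_name, row)
```

A large, asymmetric gap between the two directions of a pair (here A covers all of B, while B covers only half of A) is the kind of signal that can point to resources worth checking against each other.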
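
Worked example for consistency filtering (“keep consensus, remove the rest”): a minimal sketch of threshold voting. The voting rule used here is an assumption: each source gets one vote per record-concept pair, and a record keeps its leading concept only when that concept holds at least the threshold share of the votes. The talk names 60% and 70% thresholds and “one vote” / “many votes” schemes without defining them, and all identifiers below are made up.

```python
from collections import Counter, defaultdict


def consensus_filter(assertions, threshold=0.7):
    """Keep, per record, the concept that wins at least `threshold` of votes.

    `assertions` is an iterable of (source, record_id, concept_id) tuples.
    Records without a sufficiently dominant concept are dropped entirely.
    """
    votes = defaultdict(Counter)
    seen = set()
    for source, record_id, concept_id in assertions:
        key = (source, record_id, concept_id)
        if key in seen:
            continue  # assumption: a source votes once per record-concept pair
        seen.add(key)
        votes[record_id][concept_id] += 1

    kept = {}
    for record_id, counter in votes.items():
        total = sum(counter.values())
        concept_id, count = counter.most_common(1)[0]
        if count / total >= threshold:
            kept[record_id] = concept_id
    return kept


# Hypothetical assertions: four sources agree 3-to-1 on rec1, two disagree on rec2.
assertions = [
    ("s1", "rec1", "conceptA"),
    ("s2", "rec1", "conceptA"),
    ("s3", "rec1", "conceptA"),
    ("s4", "rec1", "conceptB"),
    ("s1", "rec2", "conceptC"),
    ("s2", "rec2", "conceptD"),
]

print(consensus_filter(assertions, threshold=0.7))
```

Raising the threshold trades recall for precision: fewer record-concept links survive, but the survivors have stronger agreement behind them.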
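
Worked example for data reduction and “be succinct”: a minimal sketch that collapses repeated subject-predicate-object statements into one record while retaining who said it and the supporting citations. The class and function names, source labels, and citation ids are hypothetical placeholders; only the example statement comes from the talk.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Statement:
    """A subject-predicate-object assertion, usable as a dictionary key."""
    subject: str
    predicate: str
    obj: str

    def text(self) -> str:
        return f"{self.subject} {self.predicate} {self.obj}"


@dataclass
class Provenance:
    """Who asserted a statement and which citations support it."""
    sources: set = field(default_factory=set)
    citations: set = field(default_factory=set)


def reduce_rows(rows):
    """Merge duplicate statements, accumulating their provenance."""
    merged = {}
    for subject, predicate, obj, source, citation in rows:
        prov = merged.setdefault(Statement(subject, predicate, obj), Provenance())
        prov.sources.add(source)
        if citation is not None:
            prov.citations.add(citation)
    return merged


# Hypothetical input: the same statement contributed three times.
rows = [
    ("atorvastatin", "may treat", "hypercholesterolemia", "source A", "citation-1"),
    ("atorvastatin", "may treat", "hypercholesterolemia", "source B", "citation-2"),
    ("atorvastatin", "may treat", "hypercholesterolemia", "source A", None),
]

merged = reduce_rows(rows)
for statement, prov in merged.items():
    # Succinct view first; provenance stays available as detail.
    print(statement.text(), "|", sorted(prov.sources), "|", sorted(prov.citations))
```

The short sentence is what a reader sees first; the merged provenance is the detail that can stay hidden until someone asks for the evidence.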
