Alternative Tools for Mining the Biomedical Literature

Rolando Garcia-Milian
Rolando.milian@ufl.edu
Biomedical & Health Information Services Department
Health Sciences Center Library
February 14, 2014
In this session
Introduction
Novel online tools for mining the literature
Unified Medical Language System
Quertle
NextBio
Semantic MEDLINE
Problem – Rapid Growth of Biomedical data
[Figure: Samples Submitted to Gene Expression Omnibus Database, in millions (axis 0.00 to 3.50), by year, 2000 to 2012]
GenBank Statistics
http://www.ncbi.nlm.nih.gov/genbank/genbankstats-2008/
Compiled from GEO historic data
http://www.ncbi.nlm.nih.gov/geo/summary/?type=history
Problem – Growth of the Biomedical Literature
[Figure: Number of Records in PubMed, in millions (axis 0.00 to 25.00), by year, 1940 to 2020]
Compiled from PubMed
Biomedical Literature
• Huge volume (PubMed: 23,132,342 citations)
• High diversity
• High quality (peer review)
http://www.ncbi.nlm.nih.gov/pubmed
• Users are overwhelmed by long lists of search results
• One third of PubMed queries returned 100 or more citations (Islamaj Dogan et al., 2009)
Problem – Querying the Biomedical Literature
Querying the biomedical literature becomes more difficult
Boolean operators
Filters
Medical Subject Headings
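The three query aids listed above can be combined in a single PubMed search string. The sketch below shows one minimal way to assemble such a string in Python and turn it into an NCBI E-utilities ESearch URL. The helper names (`mesh`, `build_query`, `esearch_url`) and the example topic are assumptions made for illustration and are not part of the original slides; no request is actually sent.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities ESearch service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def mesh(term):
    """Tag a term as a Medical Subject Heading (MeSH) search field."""
    return f'"{term}"[MeSH Terms]'


def build_query(all_of, any_of=(), filters=()):
    """Join required terms with AND, alternatives with OR, then add filters."""
    parts = list(all_of)
    if any_of:
        parts.append("(" + " OR ".join(any_of) + ")")
    parts.extend(filters)
    return " AND ".join(parts)


def esearch_url(query, retmax=20):
    """Build (but do not send) an ESearch request URL for PubMed."""
    params = {"db": "pubmed", "term": query, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)


# Hypothetical example topic, chosen only for illustration.
query = build_query(
    all_of=[mesh("Breast Neoplasms")],
    any_of=["BRCA1[Title/Abstract]", "BRCA2[Title/Abstract]"],
    filters=["english[Language]", "humans[MeSH Terms]"],
)
print(query)
print(esearch_url(query))
```

Even this small query needs field tags, parentheses, and a controlled vocabulary term, which hints at why hand-built searches become harder as the topic grows more specific.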
Alternative Tools for Mining the Biomedical Literature
Alternative tools for mining the biomedical literature combine:
Statistical methods
Ontologies
Natural Language Processing tools
Visualization tools
Result: reduced time for discovering meaningful results.
Information Retrieval and Information Extraction
Information Retrieval: retrieves documents/records
EGFR → records
Information Extraction: extracts facts
records →
T14D inhibited EGF receptor internalization
EGFR regulates tumor cell proliferation
EGFR is expressed in SCCHN
Modified from OpenHelix
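The contrast between the two approaches can be sketched with a few lines of Python. The example below is a toy illustration only: the first three records are the sentences from the slide, the fourth is a hypothetical filler, and the fixed list of relation phrases is an assumption standing in for real natural language processing.

```python
import re

# The first three sentences come from the slide; the last record is a
# hypothetical filler added only to show a non-matching case.
RECORDS = [
    "T14D inhibited EGF receptor internalization.",
    "EGFR regulates tumor cell proliferation.",
    "EGFR is expressed in SCCHN.",
    "Aspirin is a common analgesic.",
]

# Hand-written relation phrases; a real system would rely on NLP tools
# rather than a fixed list like this one.
FACT = re.compile(
    r"^(?P<subject>.+?)\s+"
    r"(?P<relation>inhibited|regulates|is expressed in)\s+"
    r"(?P<object>.+?)\.?$"
)


def retrieve(query, records):
    """Information retrieval: return whole records that mention the query."""
    return [r for r in records if query.lower() in r.lower()]


def extract(records):
    """Information extraction: return (subject, relation, object) facts."""
    facts = []
    for record in records:
        match = FACT.match(record)
        if match:
            facts.append((match["subject"], match["relation"], match["object"]))
    return facts


for record in retrieve("EGFR", RECORDS):
    print("record:", record)
for fact in extract(RECORDS):
    print("fact:", fact)
```

Note that the plain keyword search for "EGFR" does not return the "EGF receptor" sentence, which is one reason literature mining tools lean on vocabularies and ontologies to recognize synonyms.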
Text Processing
[Figure: a paper is split into sentences (Sentence 1 to Sentence 6) and each sentence into words. Words are marked with ontology category tags: association, molecular function, phenotype, anatomy, etc.]
Query = phenotype + anatomy
Extract = Sentence 1, Sentence 4
Modified from OpenHelix
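The category-based query above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method of any particular tool: the mini-lexicon, the six-sentence "paper", and the function names are all hypothetical, and a real system would take its category tags from curated ontologies.

```python
import re

# Hypothetical mini-lexicon mapping words to ontology categories.
LEXICON = {
    "binds": "association",
    "regulates": "association",
    "kinase": "molecular function",
    "paralyzed": "phenotype",
    "sterile": "phenotype",
    "pharynx": "anatomy",
    "gonad": "anatomy",
}

# Hypothetical six-sentence "paper" invented for this sketch.
PAPER = (
    "Mutants are paralyzed and show a malformed pharynx. "
    "The protein binds its receptor. "
    "The kinase regulates signaling. "
    "Animals are sterile with an abnormal gonad. "
    "The pharynx was imaged. "
    "Controls were normal."
)


def split_sentences(text):
    """Split text into sentences at whitespace that follows a period."""
    return [s.strip() for s in re.split(r"(?<=\.)\s+", text) if s.strip()]


def tag(sentence):
    """Return the set of ontology categories found in a sentence."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return {LEXICON[w] for w in words if w in LEXICON}


def query(text, categories):
    """Return (number, sentence) pairs whose tags cover every requested category."""
    wanted = set(categories)
    return [
        (number, sentence)
        for number, sentence in enumerate(split_sentences(text), start=1)
        if wanted <= tag(sentence)
    ]


for number, sentence in query(PAPER, ["phenotype", "anatomy"]):
    print(f"Sentence {number}: {sentence}")
```

Sentence 5 mentions an anatomy term but no phenotype, so it is not extracted; only sentences that carry both category tags match the query.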
The Process of Marking up a Sentence
From Müller H-M, Kenny EE, Sternberg PW (2004)
Unified Medical Language System (UMLS)
Started in 1986 - National Library of Medicine
A set of files and software that brings together many health and
biomedical vocabularies and standards to enable
interoperability between computer systems (e.g., doctor,
pharmacy, billing, biomedical literature mining).
Biomedical terminologies:
Anatomy (FMA)
Drugs (RxNorm)
Medical devices (UMD)
Clinical terms (SNOMED CT)
Information sciences (MeSH)
Administrative terminologies (ICD-9-CM, CPT)
Data exchange terminologies (HL7, LOINC)
From Fitzman, 2011 Presentation at Biomedical Informatics course, MBL Woods Hole
Unified Medical Language System - Integrating Terminologies
From Fitzman, 2011 Presentation at Biomedical Informatics course, MBL Woods Hole
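The idea of integrating terminologies can be illustrated with a small lookup structure in which strings from different vocabularies point to one shared concept. The sketch below is hypothetical throughout: the concept identifiers, the vocabulary labels, and the assignment of strings to vocabularies are placeholders for illustration and are not real UMLS content.

```python
# Hypothetical concept table. Identifiers and vocabulary labels are
# placeholders, not real UMLS identifiers or source assignments.
CONCEPTS = {
    "CONCEPT-001": {
        "preferred": "Myocardial Infarction",
        "terms": {
            "vocabulary A": ["Myocardial Infarction"],
            "vocabulary B": ["Heart attack"],
            "vocabulary C": ["Infarction of heart"],
        },
    },
    "CONCEPT-002": {
        "preferred": "Dyspnea",
        "terms": {
            "vocabulary A": ["Dyspnea"],
            "vocabulary B": ["Shortness of breath"],
            "vocabulary C": ["Breathlessness"],
        },
    },
}


def normalize(text):
    """Lowercase and collapse whitespace so string variants compare equal."""
    return " ".join(text.lower().split())


def build_index(concepts):
    """Map every normalized term string to its concept identifier."""
    index = {}
    for concept_id, entry in concepts.items():
        for strings in entry["terms"].values():
            for string in strings:
                index[normalize(string)] = concept_id
    return index


INDEX = build_index(CONCEPTS)


def lookup(term):
    """Return the concept identifier for a term, or None if it is unknown."""
    return INDEX.get(normalize(term))


def same_concept(term_a, term_b):
    """True when both terms are known and resolve to the same concept."""
    concept = lookup(term_a)
    return concept is not None and concept == lookup(term_b)


print(lookup("heart attack"))
print(same_concept("Shortness of  breath", "DYSPNEA"))
```

Once two systems agree on the concept rather than the string, a record coded with one vocabulary can be matched to a record or article indexed with another.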
Unified Medical Language System (UMLS) - Overview
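[Figure: processing pipeline. Text → Lexical Look-up → Syntactic Analysis → MetaMap → SemRep → Semantic Proposition]
UMLS resources shown in the figure: Specialist Lexicon (Lexical Look-up), Metathesaurus (MetaMap), Semantic Network (SemRep)
From Fitzman, 2011 Presentation at Biomedical Informatics course, MBL Woods Hole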
Unified Medical Language System (UMLS) - Overview
• Pharmacologic Substance TREATS Sign or Symptom
Albuterol (phsu) TREATS Dyspnea (sosy)
• Gene or Genome ASSOCIATED_WITH Disease or Syndrome
BRCA1 gene (gngm) ASSOCIATED_WITH Breast carcinoma (dsyn)
From Fitzman, 2011 Presentation at Biomedical Informatics course, MBL Woods Hole
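A semantic proposition of the kind shown above has a regular shape: subject, subject semantic type, predicate, object, object semantic type. The sketch below parses that textual form into a small Python record and checks it against the two type patterns from the slide. The class name, the regular expression, and the `ALLOWED` table are assumptions made for illustration; they do not reproduce the actual SemRep output format or the full Semantic Network.

```python
import re
from dataclasses import dataclass

# Semantic type abbreviations that appear on the slide.
SEMANTIC_TYPES = {
    "phsu": "Pharmacologic Substance",
    "sosy": "Sign or Symptom",
    "gngm": "Gene or Genome",
    "dsyn": "Disease or Syndrome",
}

# Allowed (subject type, predicate, object type) patterns, limited here
# to the two examples given on the slide.
ALLOWED = {
    ("phsu", "TREATS", "sosy"),
    ("gngm", "ASSOCIATED_WITH", "dsyn"),
}

# Assumed text layout: "Subject (type) PREDICATE Object (type)".
PATTERN = re.compile(
    r"^(?P<subj>.+?) \((?P<stype>\w+)\) (?P<pred>[A-Z_]+) "
    r"(?P<obj>.+?) \((?P<otype>\w+)\)$"
)


@dataclass(frozen=True)
class Predication:
    subj: str
    subj_type: str
    predicate: str
    obj: str
    obj_type: str

    def is_allowed(self):
        """Check the type pattern against the small ALLOWED table."""
        return (self.subj_type, self.predicate, self.obj_type) in ALLOWED


def parse(line):
    """Parse one proposition line; raise ValueError if it does not fit."""
    match = PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"not a predication: {line!r}")
    return Predication(
        match["subj"], match["stype"], match["pred"], match["obj"], match["otype"]
    )


for text in (
    "Albuterol (phsu) TREATS Dyspnea (sosy)",
    "BRCA1 gene (gngm) ASSOCIATED_WITH Breast carcinoma (dsyn)",
):
    p = parse(text)
    print(p.subj, "|", SEMANTIC_TYPES[p.subj_type], "|", p.predicate, "|", p.obj)
```

Storing propositions as structured records rather than sentences is what allows them to be filtered, counted, and linked into a network of concepts.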
Novel Online Tools for Mining the Biomedical Literature
From Lu, 2011
http://www.ncbi.nlm.nih.gov/pubmed/21245076
Comparison of three different literature mining tools

Tool: Quertle
Coverage: MEDLINE/PubMed; full-text publications from PubMed Central; NIH RePORTER database of grant applications; NLM TOXLINE database (biochemical, pharmacological, and toxicological effects of drugs/chemicals); news (FierceMarkets Life Sciences and Health Care); scientific whitepapers and research posters submitted to Quertle
Account: Not required
Presentation of results: Highlighted concepts in sentences

Tool: Semantic MEDLINE
Coverage: MEDLINE/PubMed
Account: Required (use of UMLS license)
Presentation of results: Network of concepts

Tool: NextBio
Coverage: MEDLINE/PubMed; full-text publications from PubMed Central; clinical trials from ClinicalTrials.gov; Elsevier full-text journal articles (23 million, available to NextBio Enterprise customers who subscribe to ScienceDirect); news sourced from publicly available biology- and health-related news publications
Account: Academic-recognized email required
Presentation of results: Tag cloud
References
Campillos M, Kuhn M, Gavin AC, Jensen LJ, Bork P (2008) Drug target identification using side-effect similarity. Science 321(5886): 263-6.
http://www.ncbi.nlm.nih.gov/pubmed/18621671
Islamaj Dogan R, Murray GC, Névéol A, Lu Z (2009) Understanding PubMed user search behavior. Database (Oxford).
http://www.ncbi.nlm.nih.gov/pubmed/20157491
Jensen LJ, Saric J, Bork P (2006) Literature mining for the biologist: from information retrieval to biological discovery. Nature Reviews Genetics 7: 119-129.
http://www.nature.com/nrg/journal/v7/n2/pdf/nrg1768.pdf
Kuhn M, Campillos M, Letunic I, Jensen LJ, Bork P (2010) A side effect resource to capture phenotypic effects of drugs. Mol Syst Biol 6: 343. Epub 2010 Jan 19.
http://sideeffects.embl.de/drugs/56338/
Lu Z (2011) PubMed and beyond: a survey of web tools for searching biomedical literature. Database (Oxford).
http://www.ncbi.nlm.nih.gov/pubmed/21245076
Müller H-M, Kenny EE, Sternberg PW (2004) Textpresso: An Ontology-Based Information Retrieval and Extraction System for Biological Literature. PLoS Biol 2(11): e309. doi:10.1371/journal.pbio.0020309
http://www.plosbiology.org/article/info:doi/10.1371/journal.pbio.0020309
Rindflesch TC et al. (2011) Semantic MEDLINE: An advanced information management application for biomedicine. Information Services & Use 31: 15-21.
http://lhncbc.nlm.nih.gov/system/files/pub-lhncbc-2011-109.pdf