Alternative Tools for Mining the Biomedical Literature
Rolando Garcia-Milian
Rolando.milian@ufl.edu
Biomedical & Health Information Services Department
Health Sciences Center Library
February 14, 2014

In this session
• Introduction
• Novel online tools for mining the literature
• Unified Medical Language System
• Quertle
• NextBio
• Semantic MEDLINE

Problem – Rapid Growth of Biomedical Data
[Figure: Samples Submitted to Gene Expression Omnibus Database, in millions (0.00 to 3.50), by year, 2000 to 2012]
GenBank Statistics: http://www.ncbi.nlm.nih.gov/genbank/genbankstats-2008/
Compiled from GEO historic data: http://www.ncbi.nlm.nih.gov/geo/summary/?type=history

Problem – Growth of the Biomedical Literature
[Figure: Number of Records in PubMed, in millions (0.00 to 25.00), by year, 1940 to 2020. Compiled from PubMed, http://www.ncbi.nlm.nih.gov/pubmed]
Biomedical literature:
• Huge volume (PubMed: 23,132,342 citations)
• High diversity
• High quality (peer review)
• Users are overwhelmed by long lists of search results
• 1/3 of PubMed queries resulted in 100 or more citations (Islamaj, 2009)

Problem – Querying the Biomedical Literature
Querying the biomedical literature becomes more difficult:
• Boolean operators
• Filters
• Medical Subject Headings

Alternative Tools for Mining the Biomedical Literature
Alternative tools for mining the biomedical literature combine:
• Statistical methods
• Ontologies
• Natural Language Processing tools
• Visualization tools
Result: reduced time for discovering meaningful results.
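To make the combination of ontologies and natural language processing more concrete, the following is a minimal sketch, not the method of any tool covered in this session. It splits text into sentences, tags each sentence with ontology categories, and returns only the sentences that contain every queried category. The mini-ontology, its category names, and the function names are assumptions made for illustration; real systems use far larger vocabularies and proper linguistic analysis.

```python
import re

# Hypothetical mini-ontology: category -> terms (illustrative only).
ONTOLOGY = {
    "gene": {"egfr", "egf receptor"},
    "disease": {"scchn"},
    "process": {"proliferation", "internalization"},
}


def split_sentences(text):
    """Naive splitter: break on whitespace that follows '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tag_sentence(sentence):
    """Return the set of ontology categories whose terms occur in the sentence."""
    lowered = sentence.lower()
    return {
        category
        for category, terms in ONTOLOGY.items()
        if any(re.search(r"\b" + re.escape(term) + r"\b", lowered) for term in terms)
    }


def extract(text, required_categories):
    """Keep only sentences tagged with every required category."""
    required = set(required_categories)
    return [s for s in split_sentences(text) if required <= tag_sentence(s)]


# Toy input text.
TEXT = (
    "EGFR regulates tumor cell proliferation. "
    "EGFR is expressed in SCCHN. "
    "T14D inhibited EGF receptor internalization."
)

if __name__ == "__main__":
    for sentence in extract(TEXT, ["gene", "disease"]):
        print(sentence)
```

Querying for a gene together with a disease returns a single sentence instead of the whole text, which is the kind of narrowing that reduces reading time.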

Information Retrieval and Information Extraction
• Information Retrieval retrieves documents/records
  Example: EGFR → records
• Information Extraction extracts facts
  Example: records → facts:
  - T14D inhibited EGF receptor internalization
  - EGFR regulates tumor cell proliferation
  - EGFR is expressed in SCCHN
Modified from OpenHelix

Text Processing
[Figure: a paper split into Sentences 1 to 6, each a sequence of words; words are marked with ontology category tags (association, molecular function, phenotype, anatomy, etc.)]
• Query = phenotype + anatomy (ontology category tags)
• Extract = Sentence 1, Sentence 4
Modified from OpenHelix

The Process of Marking up a Sentence
[Figure] From Müller H-M, Kenny EE, Sternberg PW (2004)

Unified Medical Language System (UMLS)
• Started in 1986 at the National Library of Medicine
• A set of files and software that brings together many health and biomedical vocabularies and standards to enable interoperability between computer systems (e.g. doctor, pharmacy, billing, biomedical literature mining).
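The interoperability idea can be sketched as a lookup that maps terms from different vocabularies to one shared concept, so that systems using different words can still recognize the same meaning. This is a toy illustration only: the vocabulary names, the concept identifiers, and the function names are hypothetical placeholders and are not real UMLS content or a real UMLS API.

```python
# Hypothetical data: vocabulary names, terms, and concept identifiers
# are placeholders, not real UMLS content.
TERM_TO_CONCEPT = {
    ("clinic_vocab", "shortness of breath"): "DEMO-0001",
    ("pharmacy_vocab", "dyspnea"): "DEMO-0001",
    ("literature_vocab", "breathlessness"): "DEMO-0001",
    ("pharmacy_vocab", "albuterol"): "DEMO-0002",
}


def concept_for(vocabulary, term):
    """Look up the shared concept identifier for a term in a given vocabulary."""
    return TERM_TO_CONCEPT.get((vocabulary, term.strip().lower()))


def same_concept(entry_a, entry_b):
    """True when two (vocabulary, term) pairs map to the same known concept."""
    concept_a = concept_for(*entry_a)
    concept_b = concept_for(*entry_b)
    return concept_a is not None and concept_a == concept_b


if __name__ == "__main__":
    print(same_concept(("clinic_vocab", "Shortness of breath"),
                       ("pharmacy_vocab", "Dyspnea")))
    print(same_concept(("pharmacy_vocab", "Dyspnea"),
                       ("pharmacy_vocab", "Albuterol")))
```

Keying the table on the pair (vocabulary, term) rather than on the term alone leaves room for the same string to mean different things in different source vocabularies.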
Biomedical terminologies:
• Anatomy (FMA)
• Drugs (RxNorm)
• Medical devices (UMD)
• Clinical terms (SNOMED CT)
• Information sciences (MeSH)
• Administrative terminologies (ICD-9-CM, CPT)
• Data exchange terminologies (HL7, LOINC)
From Fitzman, 2011, presentation at the Biomedical Informatics course, MBL Woods Hole

Unified Medical Language System - Integrating Terminologies
[Figure] From Fitzman, 2011, presentation at the Biomedical Informatics course, MBL Woods Hole

Unified Medical Language System (UMLS) - Overview
[Figure: processing pipeline from Text to Semantic Proposition. Processing steps shown: Lexical Look-up, Syntactic Analysis, MetaMap, SemRep. UMLS resources shown: Specialist Lexicon, Metathesaurus, Semantic Network]
From Fitzman, 2011, presentation at the Biomedical Informatics course, MBL Woods Hole

Unified Medical Language System (UMLS) - Overview
• Pharmacologic Substance TREATS Sign or Symptom
  Albuterol (phsu) TREATS Dyspnea (sosy)
• Gene or Genome ASSOCIATED_WITH Disease or Syndrome
  BRCA1 gene (gngm) ASSOCIATED_WITH Breast carcinoma (dsyn)
From Fitzman, 2011, presentation at the Biomedical Informatics course, MBL Woods Hole

Novel Online Tools for Mining the Biomedical Literature
[Figure] From Lu, 2011 http://www.ncbi.nlm.nih.gov/pubmed/21245076

Comparison of three different literature mining tools

Quertle
• Coverage: MEDLINE/PubMed; full-text publications from PubMed Central; NIH RePORTER database of grant applications; NLM TOXLINE database (biochemical, pharmacological, and toxicological effects of drugs/chemicals); news (FierceMarkets Life Sciences and Health Care); scientific whitepapers and research posters submitted to Quertle
• Account: not required
• Presentation of results: highlighted concepts in sentences

Semantic MEDLINE
• Coverage: MEDLINE/PubMed
• Account: required (use of UMLS license)
• Presentation of results: network of concepts

NextBio
• Coverage: MEDLINE/PubMed; full-text publications from PubMed Central; clinical trials from ClinicalTrials.gov; Elsevier full-text journal articles (23 million; available to NextBio Enterprise customers who subscribe to ScienceDirect); news sourced from publicly available biology- and health-related news publications
• Account: academic-recognized email required
• Presentation of results: tag cloud

References
Campillos M*, Kuhn M*, Gavin AC, Jensen LJ, Bork P. (2008) Drug target identification using side-effect similarity. Science 321(5886): 263-6. http://www.ncbi.nlm.nih.gov/pubmed/18621671
Islamaj Dogan R, Murray GC, Névéol A, Lu Z. (2009) Understanding PubMed user search behavior. Database (Oxford). http://www.ncbi.nlm.nih.gov/pubmed/20157491
Jensen LJ, Saric J, Bork P. (2006) Literature mining for the biologist: from information retrieval to biological discovery. Nature Reviews Genetics 7: 119-129. http://www.nature.com/nrg/journal/v7/n2/pdf/nrg1768.pdf
Kuhn M, Campillos M, Letunic I, Jensen LJ, Bork P. (2010) A side effect resource to capture phenotypic effects of drugs. Mol Syst Biol 6: 343. Epub 2010 Jan 19. http://sideeffects.embl.de/drugs/56338/
Lu Z. (2011) PubMed and beyond: a survey of web tools for searching biomedical literature. Database (Oxford). http://www.ncbi.nlm.nih.gov/pubmed/21245076
Müller H-M, Kenny EE, Sternberg PW. (2004) Textpresso: An Ontology-Based Information Retrieval and Extraction System for Biological Literature. PLoS Biol 2(11): e309. doi:10.1371/journal.pbio.0020309 http://www.plosbiology.org/article/info:doi/10.1371/journal.pbio.0020309
Rindflesch TC, et al. (2011) Semantic MEDLINE: An advanced information management application for biomedicine. Information Services & Use 31: 15-21. http://lhncbc.nlm.nih.gov/system/files/pub-lhncbc-2011-109.pdf