BioMedical Data Everywhere: Recent Developments in Data Management and Policy at NIH
Jerry Sheehan
Assistant Director for Policy Development
National Library of Medicine - National Institutes of Health
sheehanjr@nlm.nih.gov
CASC Fall Meeting
September 8, 2011, Arlington, VA

National Library of Medicine: More than a Library
www.nlm.nih.gov
• World’s largest medical library
  – >12 million physical artifacts (books, journals, technical reports, photographs)
  – >22,000 print and electronic serial subscriptions
  – Historical collection of rare and old medical works
• Intramural research laboratories
  – Lister Hill Nat’l Center for Biomedical Comms.
  – National Center for Biotechnology Information
• Extramural research and training
  – ~100 research projects per year, $36M
  – 18 funded research training sites, 250 trainees
• Health data standards and vocabularies
• Information resources and services
  – Publications and metadata
  – Genomic, chemical, clinical trial data
  – Environmental health and toxicology data
  – Disaster information services & systems
  – Medical images, analytical tools

NLM Information Resources
• Publications
  – Citations/metadata (PubMed)
  – Full-text articles (PubMed Central)
• Data
  – Genomic (GenBank, dbGaP, GEO, GeneTest)
  – Clinical trials (ClinicalTrials.gov)
  – Drug (RxNorm, DailyMed, Pillbox)
  – Chemical (PubChem)
  – Environmental & toxicology
• Images
  – Visible Human
  – Spine x-rays, cervical images
  – Historical photos
• Synthesized information
  – Evidence summaries
  – Guidelines
  – Consumer health information (MedlinePlus)
• Vocabulary resources
  – Unified Medical Language System
  – Standard clinical terms (SNOMED)
  – Health data interchange
  – Biomedical terms
• Software & Tools
  – APIs
  – Natural language processing
  – Image analysis
  – Mobile apps

PubMed/Medline: Journal Citations
http://www.pubmed.gov
CONTENT
• 21+ million citations and abstracts
  – 700,000 added per year
  – 50%+ link to full text
• 5500+ journals
  – 120-130 added per year
USAGE (2010)
• 120+ million visitors
• 2 million searches per day
• 2.4 billion page views
• Google, Bing, others
• Content used by outside developers
• Mobile version
[Graph: Growth in Medline, the fully indexed subset of PubMed, which accounts for approximately 90% of all PubMed citations. Original graph: http://www.nlm.nih.gov/bsd/stats/cit_added.html]
QUALITY

PubMed Central: Full-Text Articles
www.pubmedcentral.gov
• 2.2+ million full-text articles; 26 thousand more added per month
Typical weekday usage:
• 420,000 different users
• 740,000 articles retrieved
Annually:
• ~99% of articles downloaded at least once
• 28% downloaded more than 100 times

ClinicalTrials.gov
http://clinicaltrials.gov/
Registry and Results Database
• Federally and privately supported trials
• Conducted in the United States and 170+ countries
• Mandatory submission for some trials
Current content
• 100,000+ registered trials
• 330 new registrations/week
• 3,000+ results (summary) of approved products
  o Outcome measures
  o Statistical analyses
  o Adverse events
Usage (2010)
• 28,000 visitors per day
[Chart: Studies Registered at ClinicalTrials.gov since May 1, 2005]

dbGaP
• Repository for NIH-funded GWA studies
As of August 2011:
• 161 studies
• 2,045 data sets
• 2,727 documents
• 5,890 analyses
• 128,190 variables

PubChem
• Database of biological activities of small molecules
• Repository for data from NIH Molecular Libraries program
As of August 2011:
• 85 million deposited substance records
  o Representing more than 30 million chemically unique compounds
• 500 thousand bioassay records
  o Representing more than 130 million experimental bioactivity results

ToxMap: Environmental Health Maps

• Almost 900
• In English & Spanish
• >170 tutorials
• >75 anatomy videos
• >125 surgery videos
• ~40,000 links
• ~1,000 drugs
• 100 supplements
• >1,200 links to ClinicalTrials.gov
• 15-20 stories added daily
• Since 2006
• English & bilingual issues
• >40 languages
• >250 topics
• >3,300 links
• Over 100 directories of doctors, hospitals, clinics & libraries
• ~3,500 articles
• >2,000 images

MedlinePlus: Trusted Health Information
www.medlineplus.gov
[Map: 100+ million visits in the United States in 2010, by state. Labeled values: ME 298K, NH 270K, VT 240K, MA 2.2M, RI 307K, CT 834K, NJ 4.1M, DE 117K, MD 1.7M]
MEDLINEPLUS USAGE
• 150 million visitors in 2010
• 420,000 visitors per day
MEDLINEPLUS MOBILE
• Streamlined content tailored to the user's particular type of cell phone or tablet.
MEDLINEPLUS CONNECT
• Links from diagnosis, drug, and laboratory information in an EHR/PHR to relevant material in MedlinePlus.

Genetic test means an analysis of human DNA, RNA, chromosomes, proteins, or metabolites, if the analysis detects genotypes, mutations, or chromosomal changes. Genetic test does not include an analysis of proteins or metabolites that is directly related to a manifested disease, disorder, or pathological condition.

NLM is Not Alone: Growing interest in data at NIH
“[High throughput technologies] provide us with the opportunity to ask questions that have the word ‘ALL’ in them. What are ALL the transcripts in a cell? What are ALL the protein interactions? . . Those kinds of questions are now approachable, especially if we do the right job of making really powerful databases publicly accessible to all those who need them and empower investigators in small labs as well as big labs to plunge into that kind of mindset.”
- Francis S. Collins, MD, PhD [Director, NIH]

http://report.nih.gov/biennialreport/
http://report.nih.gov/UploadDocs/Biomed_Info_Resources_FY08_09.pdf

Select NIH Data Initiatives
• NDAR – National Database for Autism Research (NIMH)
  – Repository for NIH-funded autism studies and centers of excellence
  – Genomic, phenotypic, imaging data and associated information
• ADNI – Alzheimer’s Disease Neuroimaging Initiative (NIA)
  – Multisite study, public-private partnership, validated biomarkers
  – Centralized fMRI and PET data, linked clinical database
• NIDDK Data Repository
  – Archival datasets from NIDDK-funded studies (diabetes, digestive, kidney)
  – 29 datasets to date; more than 100 access requests in 2009-10
• BTRIS – Biomedical Translational Research Information System (CC)
  – Repository for data from NIH intramural clinical studies
  – Allows aggregation and analysis across multiple Institute studies

Data Sharing Policies
• NIH Public Access Policy (journal articles)
• NIH GWAS Policy: dbGaP
• NIH Sequence Data Sharing Policy: GenBank, GEO
• Clinical Trials Info: ClinicalTrials.gov
• IC or domain-specific policies
  – Autism Research: National Database for Autism Research
  – NIAAA Genetics of Alzheimer’s
  – Alzheimer’s Disease Neuroimaging Initiative (LONI Repository)
  – Others. . .
• NIH Data Sharing Policy (data sharing plan)

Recent Guidance for NIH Data Sharing Plans
http://grants.nih.gov/grants/sharing_key_elements_data_sharing_plan.pdf

NLM 175th Anniversary