Information Extraction
Group Health

David Carrell, PhD
Group Health Research Institute
June 29, 2010

David’s background ...

Group Health Research Institute (GHRI)
● Group Health (www.ghc.org)
  ● Founded 1947, Seattle, WA
  ● Integrated delivery system (“HMO”)
  ● ~600K patients in WA (some OR, ID)
  ● Comprehensive EMR & patient portal (2004+)
● GHRI (www.grouphealthresearch.org)
  ● Founded 1983
  ● 300 staff (50 investigators)
  ● 2009: >250 active grants ($39M)

Group Health Research Institute (GHRI)
● Applied research
  ● Epidemiology, health systems, clinical trials, economics ...
  ● Limited bio-informatics expertise
● Collaborative
  ● HMO-Research Network, Cancer-RN, ... MH-RN
  ● Federated data systems
● NLP vision
  ● NLP expertise through collaboration
  ● Bring NLP to the text, locally ... other network sites

GHRI & Research Consortia
HMO Research Network
• Large data repositories
• Common EMR platforms
• Virtual Data Warehouse (VDW)

GHRI & Virtual Data Warehouse (VDW)
• Structured data (legacy + Epic/EMR)
• Minimum 1990+
• Integrated care delivery (some claims)
• Diagnoses, procedures, pharmacy, tumor, vitals, census/geocode, etc.

Tables:
  Enrollment: MRN, enr_start, enr_end, ins_medicare, ins_medicaid, ins_commercial, ins_privatepay, ins_other, drugcov
  Demographics: MRN, birth_date, gender, race1-5, hispanic
  Tumor: MRN, dxdate, staging vars…, tumor vars…, treatment vars…, etc.
  Pharmacy: MRN, ndc, rxdate, rxsup, rxamt, rxmd
  NDC: ndc, GenericName, BrandName
  Procedures: MRN, provider, adate, enctype, px, codetype, performingprovider, pxcnt, origpx
  Encounters: MRN, provider, adate, enctype, ddate, encounter_subtype, facility_code, discharge_disposition, discharge_status, DRG, admitting_source, department
  Vital Signs: MRN, measure_date, ht, wt, bmi, days_diff, diastolic, systolic, position
  Census: MRN, block, blockgp, county, state, tract, zip, education vars..., income var..., race vars...
  Diagnoses: MRN, provider, adate, enctype, dx, pdx, diagprovider, origdx
  Provider Specialty: Provider
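The VDW tables above share keys (MRN across patient-level tables, ndc between Pharmacy and NDC), which is what lets them be linked. A minimal sketch of that linkage, assuming an in-memory SQLite database as a stand-in: the rows, the table name `ndc_lookup`, and both function names are hypothetical, and only the column names come from the slide.

```python
import sqlite3


def build_demo_vdw() -> sqlite3.Connection:
    """Create a tiny in-memory stand-in for three VDW tables (all rows are made up)."""
    con = sqlite3.connect(":memory:")
    con.executescript(
        """
        CREATE TABLE demographics (mrn TEXT PRIMARY KEY, birth_date TEXT, gender TEXT);
        CREATE TABLE pharmacy (mrn TEXT, ndc TEXT, rxdate TEXT, rxsup INTEGER);
        -- 'ndc_lookup' is a hypothetical name for the NDC table on the slide.
        CREATE TABLE ndc_lookup (ndc TEXT PRIMARY KEY, GenericName TEXT, BrandName TEXT);

        INSERT INTO demographics VALUES
            ('P001', '1950-02-03', 'F'),
            ('P002', '1962-11-30', 'M');
        INSERT INTO pharmacy VALUES
            ('P001', '0001', '2009-05-01', 30),
            ('P002', '0002', '2009-06-15', 90);
        INSERT INTO ndc_lookup VALUES
            ('0001', 'tamoxifen', 'ExampleBrandA'),
            ('0002', 'metformin', 'ExampleBrandB');
        """
    )
    return con


def fills_with_drug_names(con: sqlite3.Connection) -> list:
    """Link pharmacy fills to demographics (on MRN) and to drug names (on ndc)."""
    query = """
        SELECT p.mrn, d.gender, n.GenericName, p.rxdate, p.rxsup
        FROM pharmacy AS p
        JOIN demographics AS d ON d.mrn = p.mrn
        JOIN ndc_lookup AS n ON n.ndc = p.ndc
        ORDER BY p.rxdate
    """
    return con.execute(query).fetchall()


if __name__ == "__main__":
    for row in fills_with_drug_names(build_demo_vdw()):
        print(row)
```

The same pattern would extend to the other MRN-keyed tables (Encounters, Diagnoses, Procedures, and so on).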
GHRI & NLP Adoption
Structured data tables: Enrollment, Demographics, Tumor, Pharmacy, NDC, Procedures, Encounters, Vital Signs, Census, Diagnoses, Provider Specialty

Structured Information from Text
  Pathology: MRN, accession_number, collection_date, coding_date, thesaurus_version
  Imaging: MRN, image_number, image_date, provider, coding_date, thesaurus_version
  Pathology Concepts: accession_number, concept_code, concept_type, negated
  Imaging Concepts: image_number, concept_code, concept_type, negated
  Clinical Notes Concepts: MRN, provider, adate, enctype, concept_code, concept_type, negated

GHRI & NLP Adoption
• caBIG TBPT adoption proposal, Jun 2006
• caTIES for pathology & radiology text, ~2007
• Chart note text, May 2007
• GWAS (eMERGE) proposal, Aug 2007
• GATE experimentation, Feb 2008
• Strategic planning conference, Dec 2008
• ARRA Challenge Grant, Apr 2009
• UIMA/cTAKES adoption, Aug 2009
• Proposals... e.g., HMORN multi-site, Jan 2010

GHRI & NLP Adoption
● How to bring NLP capacity to clinical text?
  ● “Cookbooks” (SAS Java programmers)
  ● “Parachuted” hardware
  ● Parachuted virtual machine (?)
  ● Cloud-based processing
    ● Security issues
  ● Other?
GHRI & NLP Adoption
Challenges of Cloud-based Solutions:
• Unfamiliar technologies
• Responsibility sharing (e.g., security)
• Patient privacy
• Institutional risk
• De-identification
• Graduated adoption?

SHARP -- Exploring deployment strategies
SHARP Cloud Security Workshop
• Spring 2011
• Educational focus
• Challenges of processing clinical text in a novel security space (virtual firewall?)
• Security best practices
• IRB engagement
• Graduated adoption strategies

NLP Challenge Grant
Natural Language Processing for Cancer Research Network Surveillance Studies
• Aim 1:
  • Deploy open-source NLP software
  • Develop ETL connective tissue
  • Build “human capital” (Java, NLP)
• Aim 2:
  • NLP algorithm boot camp: Recurrent breast cancer diagnoses
  • >3000 existing gold standard cases (human reviewed)
• Approach:
  • Local deployment/programming support
  • High-level NLP/bioinformatics expertise via external collaboration
• Participants: GHRI (Carrell, Buist, Chubak), Mayo Clinic/Harvard (Savova), Pittsburgh (Chapman), Vanderbilt (Xu).

NLP Challenge Grant – Aim 1

  Document_Identifier        Concept_Code
  Radiology_Report_000001    2877143
  Radiology_Report_000001    8600231
  Radiology_Report_000001    3134988
  Radiology_Report_000001    5287109

NLP Challenge Grant – Aim 1

  Document Type    Available Documents    Percent NLP Concept-Coded
  Chart Notes      20M                    25%
  Radiology        4M                     33%
  Pathology        1.2M                   2%

NLP Challenge Grant – Aim 2
[Diagram: Progress Notes, Oncology Notes, Radiology Reports, Pathology Reports; components AE1, AE2, AE3; outcome “Rec Br Ca?”]

eMERGE consortium
• Vanderbilt, Mayo, Northwestern, Marshfield, Group Health
• Can EMRs from multiple institutions provide comparable phenotype data for GWAS?
• 14 phenotypes
• Group Health
  • Structured data
  • Adoption of NLP algorithms developed by others
  • “Low-tech” NLP
    • Text explorer, Assisted chart abstraction

Clinical Text Explorer
• Select text source (chart notes, radiology, pathology, etc.)
• Date range
• Sample spec’s
• Search: recurrent NEAR breast NEAR cancer
• N documents, N patients found
• Search terms highlighted

Assisted Chart Abstraction
SQL Server
• Data: chart notes (550K pts, 17M notes, 0.8B lines)
• Indexes: full-text (A-Z), date, ID (A-Z), etc.
Data Warehouse
• NLP Concept Codes
• Cohort Lists
Assisted Chart Abstraction GUI
• Outside EMR
• Pre-processed
• Point-and-click
• Text capture

Assisted Chart Abstraction
• Identify cohort: selection criteria applied to the patient (Pt Demog, Pt Visits, Pt Dx/Px/Rx)
• Selection criteria applied to the notes (Note Date, Note By, Note Type, Note Text)
• Assign note priority

Assisted Chart Abstraction
[Diagram: Data; Traditional chart abstraction; Assisted chart abstraction]

Assisted Chart Abstraction

  Stage                                         Patients       Chart Notes
  Initial cohort identification                 2903 (100%)    137,019 (100%)
  Inclusion criteria (demog., dx, px, etc.)     671 (23%)      70,119 (51%)
  Electronic text                               228 (8%)       28,186 (21%)
  Preprocessed text                             122 (4%)       284 (0.2%)

• Text: “CATARACT”
• Note: Op/Ophthal exam
• Near: Cataract procedure

Potential SHARP synergy ...
National Cancer Institute FOA: Tools for Electronic Data Extraction
• Funding: NCI Contract for software development
• Aim: Enhance/automate existing SEER cancer case identification (largely manual abstraction of EHR/paper charts)
• Approach: Assess, propose, test, modify, develop, deploy technologies that leverage NLP to automate some aspects of SEER workflow
• Participants: IMS, Inc., SEER sites (4), Group Health, Harvard

SHARP – NLP research lab

Questions – Discussion
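Backup sketch for the Clinical Text Explorer search (“recurrent NEAR breast NEAR cancer”, reporting N documents and N patients). This is an illustration only, not the tool's implementation: the 5-token window, the chained reading of NEAR (each term must fall within the window of the previous term), exact-token matching, and the names `near_match` and `search` are all assumptions.

```python
import re


def tokenize(text: str) -> list:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def near_match(text: str, terms: list, window: int = 5) -> bool:
    """True if each term occurs within `window` tokens of the previous term.

    Assumed semantics: 'a NEAR b NEAR c' holds when some occurrence of b lies
    within the window of an occurrence of a, and some occurrence of c lies
    within the window of that b. Matching is on exact tokens (no stemming).
    """
    if not terms:
        return False
    tokens = tokenize(text)
    # Positions of the first term seed the chain.
    reachable = [i for i, tok in enumerate(tokens) if tok == terms[0].lower()]
    for term in terms[1:]:
        positions = [i for i, tok in enumerate(tokens) if tok == term.lower()]
        # Keep only occurrences close enough to an already-matched position.
        reachable = [j for j in positions
                     if any(abs(j - i) <= window for i in reachable)]
    return bool(reachable)


def search(notes: list, terms: list, window: int = 5) -> tuple:
    """Return (number of matching documents, number of distinct patients)."""
    hits = [(mrn, text) for mrn, text in notes if near_match(text, terms, window)]
    return len(hits), len({mrn for mrn, _ in hits})


if __name__ == "__main__":
    # Made-up notes as (patient id, note text) pairs.
    demo_notes = [
        ("P001", "Findings consistent with recurrent left breast cancer."),
        ("P001", "Recurrent breast cancer, biopsy-proven."),
        ("P002", "Recurrent infection noted last year and treated with antibiotics "
                 "at that time; breast cancer screening is due."),
    ]
    n_docs, n_patients = search(demo_notes, ["recurrent", "breast", "cancer"])
    print(f"{n_docs} documents, {n_patients} patients found")
```

A production search over 17M notes would rely on the database's full-text index rather than scanning text in application code; the sketch only shows what a proximity criterion selects.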