Information Extraction
Group Health
David Carrell, PhD
Group Health Research Institute
June 29, 2010
David’s background ...
Group Health Research Institute (GHRI)
● Group Health (www.ghc.org)
    ● Founded 1947, Seattle, WA
    ● Integrated delivery system (“HMO”)
    ● ~600K patients in WA (some OR, ID)
    ● Comprehensive EMR & patient portal (2004+)
● GHRI (www.grouphealthresearch.org)
    ● Founded 1983
    ● 300 staff (50 investigators)
    ● 2009: >250 active grants ($39M)
Group Health Research Institute (GHRI)
● Applied research
    ● Epidemiology, health systems, clinical trials, economics ...
    ● Limited bio-informatics expertise
● Collaborative
    ● HMO-Research Network, Cancer-RN, ... MH-RN
    ● Federated data systems
● NLP vision
    ● NLP expertise through collaboration
    ● Bring NLP to the text: locally ... other network sites
GHRI & Research Consortia
HMO Research Network
• Large data repositories • Common EMR platforms
• Virtual Data Warehouse (VDW)
GHRI & Virtual Data Warehouse (VDW)
VDW tables and their columns:
Enrollment: MRN, enr_start, enr_end, ins_medicare, ins_medicaid, ins_commercial, ins_privatepay, ins_other, drugcov
Demographics: MRN, birth_date, gender, race1-5, hispanic
Tumor: MRN, dxdate, staging vars…, tumor vars…, treatment vars…, etc.
Pharmacy: MRN, ndc, rxdate, rxsup, rxamt, rxmd
NDC: ndc, GenericName, BrandName
Procedures: MRN, provider, adate, enctype, px, codetype, performingprovider, pxcnt, origpx
Encounters: MRN, provider, adate, enctype, ddate, encounter_subtype, facility_code, discharge_disposition, discharge_status, DRG, admitting_source, department
Vital Signs: MRN, measure_date, ht, wt, bmi, days_diff, diastolic, systolic, position
Census: MRN, block, blockgp, county, state, tract, zip, education vars..., income var..., race vars...
Diagnoses: MRN, provider, adate, enctype, dx, pdx, diagprovider, origdx
Provider: Provider, Specialty
• Structured data (legacy + Epic/EMR)
• Minimum 1990+
• Integrated care delivery (some claims)
• Diagnoses, procedures, pharmacy, tumor, vitals, census/geocode, etc.
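Because the VDW tables share MRN as a patient key (and Pharmacy links to NDC on ndc), questions that span tables reduce to joins. A minimal sketch, assuming an in-memory SQLite database, a small subset of the columns listed above, ISO-formatted date strings, and entirely synthetic rows (the MRNs, NDC codes, and brand names below are made up for illustration and are not VDW data):

```python
import sqlite3

# Hypothetical miniature of three VDW tables; only a few columns are kept.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE enrollment (mrn TEXT, enr_start TEXT, enr_end TEXT);
    CREATE TABLE pharmacy   (mrn TEXT, ndc TEXT, rxdate TEXT, rxsup INTEGER);
    CREATE TABLE ndc        (ndc TEXT, GenericName TEXT, BrandName TEXT);
""")
conn.executemany("INSERT INTO enrollment VALUES (?, ?, ?)", [
    ("A001", "2001-01-01", "2009-12-31"),
    ("A002", "2005-06-01", "2006-05-31"),
])
conn.executemany("INSERT INTO pharmacy VALUES (?, ?, ?, ?)", [
    ("A001", "00000000001", "2008-03-15", 30),
    ("A002", "00000000001", "2007-02-01", 30),  # dispensed after enrollment ended
    ("A002", "00000000002", "2005-09-10", 90),
])
conn.executemany("INSERT INTO ndc VALUES (?, ?, ?)", [
    ("00000000001", "tamoxifen", "ExampleBrandA"),
    ("00000000002", "metformin", "ExampleBrandB"),
])


def dispensed_while_enrolled(conn, generic_name):
    """Return (mrn, rxdate) for dispensings of a drug that fall inside an enrollment span."""
    sql = """
        SELECT p.mrn, p.rxdate
        FROM pharmacy AS p
        JOIN ndc AS n ON n.ndc = p.ndc
        JOIN enrollment AS e ON e.mrn = p.mrn
        WHERE n.GenericName = ?
          AND p.rxdate BETWEEN e.enr_start AND e.enr_end
        ORDER BY p.mrn, p.rxdate
    """
    return conn.execute(sql, (generic_name,)).fetchall()


rows = dispensed_while_enrolled(conn, "tamoxifen")
```

In this toy data only patient A001 qualifies for tamoxifen, because A002's dispensing falls outside the enrollment window.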
GHRI & NLP Adoption
VDW tables (columns as on the preceding VDW slide): Enrollment, Demographics, Tumor, Pharmacy, NDC, Procedures, Encounters, Vital Signs, Census, Diagnoses, Provider
Structured Information from Text
Pathology: MRN, accession_number, collection_date, coding_date, thesaurus_version
Imaging: MRN, image_number, image_date, provider, coding_date, thesaurus_version
Pathology Concepts: accession_number, concept_code, concept_type, negated
Imaging Concepts: image_number, concept_code, concept_type, negated
Clinical Notes Concepts: MRN, provider, adate, enctype, concept_code, concept_type, negated
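The negated column matters whenever the concept tables are queried: a note saying "no evidence of recurrence" still produces a recurrence concept, only flagged as negated. A minimal sketch of filtering on that flag, assuming negated is stored as a boolean; the MRNs, dates, concept type, and the concept code C0000001 are placeholders invented for this example, not real thesaurus entries:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NoteConcept:
    """One row of a Clinical Notes Concepts table (subset of columns)."""
    mrn: str
    adate: str
    concept_code: str
    concept_type: str
    negated: bool


def patients_with_affirmed_concept(rows, concept_code):
    """Return sorted MRNs having at least one non-negated mention of the concept."""
    return sorted(
        {r.mrn for r in rows if r.concept_code == concept_code and not r.negated}
    )


# Synthetic rows: P1 has an affirmed mention, P2 only a negated one.
concept_rows = [
    NoteConcept("P1", "2009-01-05", "C0000001", "finding", False),
    NoteConcept("P2", "2009-02-11", "C0000001", "finding", True),
    NoteConcept("P2", "2009-02-11", "C0000002", "finding", False),
]

affirmed = patients_with_affirmed_concept(concept_rows, "C0000001")
```

Ignoring the flag would count P2 as a case even though the only mention in that patient's notes is a denial.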
GHRI & NLP Adoption
• caBIG TBPT adoption proposal, Jun 2006
• caTIES for pathology & radiology text, ~2007
• Chart note text, May 2007
• GWAS (eMERGE) proposal, Aug 2007
• GATE experimentation, Feb 2008
• Strategic planning conference, Dec 2008
• ARRA Challenge Grant, Apr 2009
• UIMA/cTAKES adoption, Aug 2009
• Proposals ... e.g., HMORN multi-site, Jan 2010
GHRI & NLP Adoption
● How to bring NLP capacity to clinical text?
● “Cookbooks” (SAS → Java programmers)
● “Parachuted” hardware
● Parachuted virtual machine (?)
● Cloud-based processing
● Security issues
● Other?
GHRI & NLP Adoption
Challenges of Cloud-based Solutions:
● Unfamiliar technologies
● Responsibility sharing (e.g., security)
● Patient privacy
● Institutional risk
● De-identification
● Graduated adoption?
SHARP -- Exploring deployment strategies
SHARP Cloud Security Workshop
● Spring 2011
● Educational focus
● Challenges of processing clinical text in a novel security space (virtual firewall?)
● Security best practices
● IRB engagement
● Graduated adoption strategies
NLP Challenge Grant
Natural Language Processing for Cancer Research Network Surveillance Studies
• Aim 1:
Deploy open-source NLP software
Develop ETL connective tissue
Build “human capital” (Java, NLP)
• Aim 2:
NLP algorithm boot camp: Recurrent breast cancer diagnoses
>3000 existing gold standard cases (human reviewed)
• Approach:
Local deployment/programming support
High-level NLP/bioinformatics expertise via external collaboration
• Participants:
GHRI (Carrell, Buist, Chubak), Mayo Clinic/Harvard (Savova), Pittsburgh (Chapman), Vanderbilt (Xu).
NLP Challenge Grant – Aim 1
Document_Identifier        Concept_Code
Radiology_Report_000001    2877143
Radiology_Report_000001    8600231
Radiology_Report_000001    3134988
Radiology_Report_000001    5287109
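The table above is a "long" layout: one row per document-concept pair, which is easy to load into a warehouse table and join on the document identifier. A minimal sketch of that ETL step, assuming the NLP pipeline's output has already been collected into a dictionary of document identifier to concept codes (the function names and the CSV target are assumptions for illustration; the sample values are the ones shown on the slide):

```python
import csv
import io


def to_concept_rows(doc_concepts):
    """Flatten {document_id: [concept codes]} into (document_id, code) pairs."""
    for doc_id, codes in doc_concepts.items():
        for code in codes:
            yield (doc_id, code)


def write_concept_table(doc_concepts, out):
    """Write the long-format table, with a header row, to a file-like object."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Document_Identifier", "Concept_Code"])
    writer.writerows(to_concept_rows(doc_concepts))


nlp_output = {
    "Radiology_Report_000001": ["2877143", "8600231", "3134988", "5287109"],
}

buffer = io.StringIO()
write_concept_table(nlp_output, buffer)
table_text = buffer.getvalue()
```

Keeping one concept per row avoids variable-width records, so documents with four concepts and documents with four hundred use the same schema.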
NLP Challenge Grant – Aim 1
Document Type    Available Documents    Percent NLP Concept-Coded
Chart Notes      20M                    25%
Radiology        4M                     33%
Pathology        1.2M                   2%
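The percentages translate into approximate document counts by simple multiplication. A short worked calculation, treating the slide's rounded figures (20M, 4M, 1.2M and the whole-number percentages) as exact, which they are not, so the results are only order-of-magnitude estimates:

```python
# (available documents, percent concept-coded), as rounded on the slide.
corpus = {
    "Chart Notes": (20_000_000, 25),
    "Radiology": (4_000_000, 33),
    "Pathology": (1_200_000, 2),
}


def coded_counts(corpus):
    """Approximate number of concept-coded documents per document type."""
    return {
        doc_type: total * pct // 100
        for doc_type, (total, pct) in corpus.items()
    }


counts = coded_counts(corpus)
total_coded = sum(counts.values())
total_available = sum(total for total, _ in corpus.values())
overall_pct = round(100 * total_coded / total_available, 1)
```

On these inputs roughly 5.0M chart notes, 1.3M radiology reports, and 24K pathology reports are coded, about a quarter of the 25.2M documents overall.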
NLP Challenge Grant – Aim 2
[Diagram: components AE1, AE2, and AE3 applied to progress notes, oncology notes, radiology reports, and pathology reports, feeding the question “Rec Br Ca?” (recurrent breast cancer)]
eMERGE consortium
• Vanderbilt, Mayo, Northwestern, Marshfield, Group Health
• Can EMRs from multiple institutions provide comparable phenotype data for GWAS?
• 14 phenotypes
• Group Health
    • Structured data
    • Adoption of NLP algorithms developed by others
    • “Low-tech” NLP
    • Text explorer, Assisted chart abstraction
Clinical Text Explorer
• Select text source (chart notes, radiology, pathology, etc.)
• Date range
• Sample spec’s
• Search: recurrent NEAR breast NEAR cancer.
• N documents, N patients found
• Search terms highlighted
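A NEAR search like the one above matches documents in which all the terms occur close together, regardless of what words sit between them. A minimal sketch of one way to evaluate such a query, assuming whole-word, case-insensitive matching, any term order, and a span limit of 5 tokens; the limit and the function name are assumptions for illustration, and this is not necessarily how the Text Explorer or its full-text index implements NEAR:

```python
import re
from itertools import product


def near_match(text, terms, max_gap=5):
    """True if every term occurs and some set of occurrences spans <= max_gap tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    positions = []
    for term in terms:
        hits = [i for i, tok in enumerate(tokens) if tok == term.lower()]
        if not hits:
            return False  # a required term is missing entirely
        positions.append(hits)
    # Try every combination of one occurrence per term (fine for short notes).
    return any(
        max(combo) - min(combo) <= max_gap for combo in product(*positions)
    )


query = ["recurrent", "breast", "cancer"]
hit = near_match("Findings consistent with recurrent left breast cancer.", query)
miss = near_match(
    "Recurrent cough for two weeks; mother had a history of breast cancer.", query
)
```

The second note contains all three words but is rejected, since "recurrent" is eleven tokens away from "cancer"; that is the kind of false positive a plain keyword AND would return.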
Assisted Chart Abstraction
[Architecture diagram. Elements shown: chart notes data (550K pts, 17M notes, 0.8B lines); indexes (full-text, A-Z, date, ID, etc.); SQL Server; data warehouse; NLP concept codes; cohort lists; Assisted Chart Abstraction GUI]
• Outside EMR • Pre-processed • Point-and-click • Text capture
Assisted Chart Abstraction
[Workflow diagram contrasting traditional chart abstraction with assisted chart abstraction. Elements shown: Note Date, Note By, Pt Demog, Note Type, Pt Visits, Pt Dx/Px/Rx, Note Text]
● Identify cohort: selection criteria applied to the patient
● Selection criteria applied to the notes
● Assign note priority
● Assisted Chart Abstraction Data
Assisted Chart Abstraction
Stage                                          Patients       Chart Notes
Initial cohort identification                  2903 (100%)    137,019 (100%)
Inclusion criteria (demog., dx, px, etc.)      671 (23%)      70,119 (51%)
Electronic text                                228 (8%)       28,186 (21%)
Preprocessed text                              122 (4%)       284 (0.2%)
• Text: “CATARACT”
• Note: Op/Ophthal exam
• Near: Cataract procedure
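The three criteria above (a keyword in the text, a note type, and nearness to a procedure) can be read as a filter that narrows the notes an abstractor has to open. A minimal sketch under that reading, assuming "near" means the note date falls within 30 days of a cataract procedure date; the 30-day window, the field names, the note-type labels, and the sample records are all hypothetical:

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Note:
    mrn: str
    note_date: date
    note_type: str
    text: str


def prioritize_notes(notes, procedure_dates, note_types, keyword, window_days=30):
    """Keep notes that mention the keyword, have an allowed type,
    and are dated within window_days of one of the patient's procedures."""
    selected = []
    for note in notes:
        has_keyword = keyword.lower() in note.text.lower()
        right_type = note.note_type in note_types
        near_px = any(
            abs((note.note_date - px_date).days) <= window_days
            for px_date in procedure_dates.get(note.mrn, [])
        )
        if has_keyword and right_type and near_px:
            selected.append(note)
    return selected


# Synthetic example: one patient, one procedure, three notes.
procedures = {"P1": [date(2008, 5, 20)]}
notes = [
    Note("P1", date(2008, 5, 20), "Op", "Cataract extraction, right eye."),
    Note("P1", date(2008, 5, 22), "Phone", "Patient asks about cataract drops."),
    Note("P1", date(2006, 1, 10), "Ophthal exam", "Early cataract noted."),
]

selected = prioritize_notes(notes, procedures, {"Op", "Ophthal exam"}, "CATARACT")
```

Only the operative note survives here: the phone note has the wrong type and the 2006 exam is too far from the procedure date.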
Potential SHARP synergy ...
National Cancer Institute FOA: Tools for Electronic Data Extraction
• Funding:
NCI Contract for software development
• Aim:
Enhance/automate existing SEER cancer case identification (largely manual abstraction of EHR/paper charts)
• Approach:
Assess, propose, test, modify, develop, deploy technologies that leverage NLP to automate some aspects of SEER workflow
• Participants:
IMS, Inc., SEER sites (4), Group Health, Harvard
SHARP – NLP research lab
Questions – Discussion