Phillips_i2b2 - Buffalo Ontology Site

Developing i2b2 Ontologies for the Long Haul
Lori Phillips, MS
Partners HealthCare Systems, Inc
April 25, 2012
National Centers for Biomedical Computing
What is i2b2?
• Software for explicitly organizing and transforming person-oriented clinical data in a way that is optimized for research
   - Allows integration of clinical data, trials data, and genotypic data
• A portable and extensible application framework
   - Modular software architecture allows additions without disturbing core parts
   - Available as open source at https://www.i2b2.org
Where is it used?
CTSAs:

Boston University

Case Western Reserve University (including Cleveland Clinic)

Children's National Medical Center (GWU), Washington D.C.

Duke University

Emory University (including Morehouse School of Medicine and Georgia Tech)

Harvard University (including Beth Israel Deaconess Medical Center, Brigham and Women's Hospital, Children's Hospital Boston, Dana Farber Cancer Center, Joslin Diabetes Center, Massachusetts General Hospital)

Medical University of South Carolina

Medical College of Wisconsin

Oregon Health & Science University

Penn State Milton S. Hershey Medical Center

Tufts University

University of Alabama at Birmingham

University of Arkansas for Medical Sciences

University of California Davis

University of California, Irvine

University of California, Los Angeles*

University of California, San Diego*

University of California San Francisco

University of Chicago

University of Cincinnati (including Cincinnati Children's Hospital Medical Center)

University of Colorado Denver (including Children's Hospital Colorado)

University of Florida

University of Kansas Medical Center

University of Kentucky Research Foundation

University of Massachusetts Medical School, Worcester

University of Michigan

University of Pennsylvania (including Children's Hospital of Philadelphia)

University of Pittsburgh (including their Cancer Institute)

University of Rochester School of Medicine and Dentistry

University of Texas Health Sciences Center at Houston

University of Texas Health Sciences Center at San Antonio

University of Texas Medical Branch (Galveston)

University of Texas Southwestern Medical Center at Dallas

University of Utah

University of Washington

University of Wisconsin - Madison (including Marshfield Clinic)

Virginia Commonwealth University

Weill Cornell Medical College

Academic Health Centers (does not include AHCs that are part of a CTSA):
Arizona State University
City of Hope, Los Angeles
Georgia Health Sciences University, Augusta
Hartford Hospital, CT
HealthShare Montana
Massachusetts Veterans Epidemiology Research and Information Center (MAVERIC), Boston
Nemours
Phoenix Children's Hospital
Regenstrief Institute
Thomas Jefferson University
University of Connecticut Health Center
University of Missouri School of Medicine
University of Tennessee Health Sciences Center
Wake Forest University Baptist Medical Center
HMOs:
Group Health Cooperative
Kaiser Permanente
International:
Georges Pompidou Hospital, Paris, France
Hospital of the Free University of Brussels, Belgium
Inserm U936, Rennes, France
Institute for Data Technology and Informatics (IDI), NTNU, Norway
Institute for Molecular Medicine Finland (FIMM)
Karolinska Institute, Sweden
Landspitali University Hospital, Reykjavik, Iceland
Tokyo Medical and Dental University, Japan
University of Bordeaux Segalen, France
University of Erlangen-Nuremberg, Germany
University of Goettingen, Goettingen, Germany
University of Leicester and Hospitals, England (Biomed. Res. Informatics Ctr. for Clin. Sci)
University of Pavia, Pavia, Italy
University of Seoul, Seoul, Korea
Companies:
Johnson and Johnson (tranSMART)
GE Healthcare Clinical Data Services
Why use i2b2?
• Cohort discovery
   - Enables and simplifies research cohort discovery across an institution’s large, heterogeneous clinical datasets
• Hypothesis generation
   - Enables and simplifies analysis of data to support a hypothesis
• Retrospective data analysis
   - Enables the retrospective analysis of data to support/refute claims.
i2b2 Workbench
Data Model
• FACTS
   - The quantitative or factual data being queried
• DIMENSIONS
   - Groups of hierarchies and descriptors that define the facts.
• STAR SCHEMA
   - A single fact table surrounded by numerous dimension tables.
i2b2 Star Schema
[Diagram: observation_fact in the center; each dimension table joins to it in a one-to-many (1 : ∞) relationship.]
observation_fact
   PK: Patient_Num, Encounter_Num, Concept_CD, Observer_CD, Start_Date, Modifier_CD, Instance_Num
   End_Date, ValType_CD, TVal_Char, NVal_Num, ValueFlag_CD, Observation_Blob
patient_dimension
   PK: Patient_Num
   Birth_Date, Death_Date, Vital_Status_CD, Age_Num*, Gender_CD*, Race_CD*, Ethnicity_CD*
visit_dimension
   PK: Encounter_Num
   Start_Date, End_Date, Active_Status_CD, Location_CD*
concept_dimension
   PK: Concept_Path
   Concept_CD, Name_Char
observer_dimension
   PK: Observer_Path
   Observer_CD, Name_Char
modifier_dimension
   PK: Modifier_Path
   Modifier_CD, Name_Char
Observation (fact table) Primary Keys
Patient_num      Distinct number for every patient
Encounter_num    Distinct number for every visit
Concept_cd       Distinct code for every concept
Observer_cd      Distinct code for every observer
Start_date       Date-time observation began
Modifier_cd      Code to modify concept_cd
Instance_num     Mechanism to group concept modifiers
i2b2 Fact Table
• In i2b2, an atomic fact is an observation on a patient.
• Examples of facts
   - Diagnoses
   - Procedures
   - Lab data
   - Medications
   - Genetic data
i2b2 Dimension Tables
• Dimension tables contain descriptive information about the facts.
• Examples
   - Concept dimension describes the concepts stored in the concept_cd field.
   - Provider dimension contains information about the observer_cd field
   - Patient dimension contains information about the patient_num field
   - Visit dimension contains information about the encounter_num field
   - Modifier dimension contains information about the modifier_cd field
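To make the fact/dimension split concrete, here is a minimal sketch using Python's built-in sqlite3 module. It reproduces only a few of the columns from the star schema slide; the SQLite column types, the sample codes ('ICD9:492', 'DR1', 'NONE') and the dates are illustrative assumptions, not values from an actual i2b2 installation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient_dimension (
    patient_num INTEGER PRIMARY KEY,
    birth_date  TEXT,
    gender_cd   TEXT
);
CREATE TABLE concept_dimension (
    concept_path TEXT PRIMARY KEY,
    concept_cd   TEXT,
    name_char    TEXT
);
CREATE TABLE observation_fact (
    patient_num   INTEGER,
    encounter_num INTEGER,
    concept_cd    TEXT,
    observer_cd   TEXT,
    start_date    TEXT,
    modifier_cd   TEXT,
    instance_num  INTEGER,
    nval_num      REAL,
    PRIMARY KEY (patient_num, encounter_num, concept_cd, observer_cd,
                 start_date, modifier_cd, instance_num)
);
""")

# Illustrative rows only; every code and value here is made up.
conn.execute("INSERT INTO patient_dimension VALUES (1, '1950-01-01', 'F')")
conn.execute(
    "INSERT INTO concept_dimension VALUES (?, ?, ?)",
    ("\\Diagnoses\\Respiratory system\\Emphysema\\", "ICD9:492", "Emphysema"),
)
conn.execute(
    "INSERT INTO observation_fact "
    "VALUES (1, 100, 'ICD9:492', 'DR1', '2011-03-02', 'NONE', 1, NULL)"
)

# One fact row joined to the dimension rows that describe it.
rows = conn.execute("""
    SELECT p.patient_num, p.gender_cd, c.name_char, f.start_date
    FROM observation_fact f
    JOIN patient_dimension p ON p.patient_num = f.patient_num
    JOIN concept_dimension c ON c.concept_cd = f.concept_cd
""").fetchall()
print(rows)
```

The fact row carries only keys and the observed value; everything descriptive (patient gender, concept name) comes from the joined dimension tables.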
How does i2b2 use Ontologies?
• By and large, the concepts stored in the fact table come from clinical coding systems or ontologies.
• Largely dependent on data available to institution
   - Diagnoses: ICD9/ICD10/SNOMED
   - Procedures: CPT/ICD9
   - Medications: NDC/RXNORM
   - Lab results: LOINC
   - Molecular/genomic data
   - Custom or project specific data
• Ontologies are used to organize query terms (and concepts) hierarchically.
Metadata table
• Query terms are stored in a separate metadata table.
• There is a one-to-one mapping of terms in the metadata to concepts in the dimension table.
• The structure of the metadata table is integral to both the visualization of the query terms (tree) and the query mechanism itself.
Structure of Metadata Table
METADATA
C_HLEVEL              INT NULL
C_FULLNAME            VARCHAR(900) NULL
C_NAME                VARCHAR(2000) NULL
C_SYNONYM_CD          CHAR(1) NULL
C_VISUALATTRIBUTES    CHAR(3) NULL
C_TOTALNUM            INT NULL
C_BASECODE            VARCHAR(450) NULL
C_METADATAXML         TEXT NULL
C_FACTTABLECOLUMN     VARCHAR(50) NULL
C_TABLENAME           VARCHAR(50) NULL
C_COLUMNNAME          VARCHAR(50) NULL
C_COLUMNDATATYPE      VARCHAR(50) NULL
C_OPERATOR            VARCHAR(10) NULL
C_DIMCODE             VARCHAR(900) NULL
C_COMMENT             TEXT NULL
C_TOOLTIP             VARCHAR(900) NULL
UPDATE_DATE           DATETIME NULL
DOWNLOAD_DATE         DATETIME NULL
IMPORT_DATE           DATETIME NULL
SOURCESYSTEM_CD       VARCHAR(50) NULL
VALUETYPE_CD          VARCHAR(50) NULL
i2b2 Metadata Root Level Categories
• Terms with c_hlevel = 1
   - Display name is c_name
• Icon (folder or container) is determined by c_visualattributes
• Example c_fullname: \Diagnoses\
Query terms are visualized hierarchically in tree
c_hlevel   Path segment
1          \Diagnoses\
2             Respiratory system\
3                Chronic obstructive diseases\
4                   Emphysema\
Why are hierarchies so important for i2b2?
• Hierarchies form the basis of both the visualization of the terms and the query mechanism itself.
select * from metadata
where c_fullname like '\Diagnoses\Respiratory system\Chronic obstructive diseases\Emphysema\%'
and c_hlevel = 5
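The query above fetches the direct children of the Emphysema folder: everything whose path starts with the folder's c_fullname and sits one level deeper. A minimal sketch with Python's sqlite3 follows; the table holds only three of the metadata columns, and the two child terms and the `children` helper are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metadata (c_hlevel INT, c_fullname TEXT, c_name TEXT)")

EMPHYSEMA = ("\\Diagnoses\\Respiratory system\\"
             "Chronic obstructive diseases\\Emphysema\\")

# Illustrative terms; the two level-5 children are assumed for the example.
conn.executemany(
    "INSERT INTO metadata VALUES (?, ?, ?)",
    [
        (1, "\\Diagnoses\\", "Diagnoses"),
        (4, EMPHYSEMA, "Emphysema"),
        (5, EMPHYSEMA + "Emphysematous bleb\\", "Emphysematous bleb"),
        (5, EMPHYSEMA + "Other emphysema\\", "Other emphysema"),
    ],
)

def children(conn, parent_fullname, parent_level):
    """Return the names of terms directly below a folder in the tree."""
    cur = conn.execute(
        "SELECT c_name FROM metadata "
        "WHERE c_fullname LIKE ? AND c_hlevel = ? ORDER BY c_name",
        (parent_fullname + "%", parent_level + 1),
    )
    return [name for (name,) in cur]

print(children(conn, EMPHYSEMA, 4))
```

Filtering on c_hlevel keeps the result to one tree level, which is what a tree view needs when a user expands a single folder.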
Hierarchies in queries
select patient_num from observation_fact
where concept_cd IN
   (select concept_cd from concept_dimension
    where concept_path LIKE '\Diagnoses\Respiratory system\Chronic obstructive diseases\Emphysema\%')
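The query above picks up every patient with a fact coded anywhere beneath the Emphysema folder, because the LIKE pattern matches the folder and all of its descendants. Here is a small sqlite3 sketch of that behaviour; the tables are cut down to two columns each, and the concept codes (DX:...) and the `patients_under` helper are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept_dimension (concept_path TEXT, concept_cd TEXT);
CREATE TABLE observation_fact (patient_num INT, concept_cd TEXT);
""")

COPD = "\\Diagnoses\\Respiratory system\\Chronic obstructive diseases\\"

# Hypothetical concepts and facts, for illustration only.
conn.executemany("INSERT INTO concept_dimension VALUES (?, ?)", [
    (COPD + "Emphysema\\", "DX:EMPH"),
    (COPD + "Emphysema\\Emphysematous bleb\\", "DX:EMPH_BLEB"),
    (COPD + "Asthma\\", "DX:ASTHMA"),
])
conn.executemany("INSERT INTO observation_fact VALUES (?, ?)", [
    (1, "DX:EMPH"), (2, "DX:EMPH_BLEB"), (2, "DX:EMPH"), (3, "DX:ASTHMA"),
])

def patients_under(conn, folder_path):
    """Patients with at least one fact coded at or below folder_path."""
    cur = conn.execute(
        "SELECT DISTINCT patient_num FROM observation_fact "
        "WHERE concept_cd IN (SELECT concept_cd FROM concept_dimension "
        "                     WHERE concept_path LIKE ?) "
        "ORDER BY patient_num",
        (folder_path + "%",),
    )
    return [num for (num,) in cur]

print(patients_under(conn, COPD + "Emphysema\\"))
print(patients_under(conn, COPD))
```

Querying a higher folder widens the cohort without changing the query shape, which is why the path hierarchy doubles as the query mechanism.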
i2b2 Ontologies for the Long Haul
• How do I create i2b2 metadata for a known ontology?
   - ICD-10
• What happens to my legacy clinical data when I have to move to ICD-10?
   - Merging ICD-9 with ICD-10
• How do I handle genomic metadata?
• …. Custom metadata?
NCBO BioPortal ICD-10
Building an ICD-10 Ontology with NCBO services
• Pull data from NCBO via REST services.
• Reorganize information into i2b2 Metadata format
bioportal/concepts/46302/all
<data>
  <pageNum>1</pageNum>
  <numPages>1832</numPages>
  <pageSize>50</pageSize>
  <numResultsPage>50</numResultsPage>
  <numResultsTotal>91590</numResultsTotal>
  <contents class="org.ncbo.stanford.bean.concept.ClassBeanResultListBean">
    <classBeanResultList>
      <classBean>
        <id>0-ICD10CM</id>
        <fullId>http://purl.bioontology.org/ontology/ICD10CM/0-ICD10CM</fullId>
        <label>ICD-10-CM TABULAR LIST of DISEASES and INJURIES</label>
        <type>class</type>
        <relations>
          <entry>
            <string>ChildCount</string>
            <int>0</int>
          </entry> ……
Primary challenges
• i2b2 Metadata depends upon hierarchical information
   - c_fullname, c_tooltip maintain the hierarchy from root to leaves
     Diseases of the respiratory system \ Chronic lower respiratory diseases \ Emphysema
Challenges..
• NCBO REST service that enables pull of concepts includes immediate parent/child info only
   - Hierarchy must be computed
<data>
<classBean>
<id>J43</id>
<label>Emphysema</label>
<relations>
<entry>
<string>SuperClass</string>
<list>
<classBean>
<id>J40-J47</id>
<label>Chronic lower respiratory diseases</label>
</classBean>
</list>
</entry>
</relations>
</classBean>
</data>
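Since each response carries only the immediate parent, the root-to-leaf path has to be assembled by walking parent links. The sketch below parses the sample response shown above with xml.etree.ElementTree and then builds a c_fullname-style path. The helper names, the in-memory dictionaries and the chapter id "J00-J99" are assumptions for illustration; this is not the released extraction workflow.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<data><classBean><id>J43</id><label>Emphysema</label>
<relations><entry><string>SuperClass</string><list><classBean>
<id>J40-J47</id><label>Chronic lower respiratory diseases</label>
</classBean></list></entry></relations></classBean></data>"""

def parse_class_bean(xml_text):
    """Return (id, label, parent_id or None) for the top-level classBean."""
    bean = ET.fromstring(xml_text).find("classBean")
    parent_id = None
    for entry in bean.findall("relations/entry"):
        if entry.findtext("string") == "SuperClass":
            parent_id = entry.findtext("list/classBean/id")
    return bean.findtext("id"), bean.findtext("label"), parent_id

def full_path(term_id, labels, parents):
    """Walk parent links up to the root and join the labels into a path."""
    parts, seen = [], set()
    while term_id is not None:
        if term_id in seen:
            raise ValueError("cycle in parent links at " + term_id)
        seen.add(term_id)
        parts.append(labels[term_id])
        term_id = parents.get(term_id)
    return "\\" + "\\".join(reversed(parts)) + "\\"

# Lookup tables that a crawl over all concepts might accumulate.
# The "J00-J99" entry is assumed here; it is not in the sample response.
labels = {
    "J43": "Emphysema",
    "J40-J47": "Chronic lower respiratory diseases",
    "J00-J99": "Diseases of the respiratory system",
}
parents = {"J43": "J40-J47", "J40-J47": "J00-J99"}

print(parse_class_bean(SAMPLE))
print(full_path("J43", labels, parents))
```

The number of labels collected on the way up also gives the term's depth, which is the information c_hlevel needs.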
NCBO Extraction workflow
[Diagram. Components: NCBO REST (XML); Extraction Workflow (ICD-10 Process); Extracted Data; i2b2 Metadata. Arrow labels: Request to extract ontology; Extracted ICD-10 terms.]
Released deliverables
https://community.i2b2.org/wiki/display/NCBO
What about my legacy ICD-9 data?
• Ideally we would like an i2b2 ontology that integrates ICD-9 into ICD-10.
Mapping Tool
• Tool to verify/(re)assign ontology mappings.
Navigating the Mapping Tool Tree
• Displays terms mapped from one ontology within hierarchy of another
• Mapped terms are displayed adjacent to terms they are mapped to and appear in bold
Adding a new mapping
• ICD9:269.3, Mineral deficiency should appear for ICD10:E63 Other nutritional deficiencies
• Copy term ICD9:269.3
• Paste onto ICD10:E63 Other nutritional deficiencies
Move a mapping
• Ascorbic acid deficiency (ICD9:267) can be moved down one level to Ascorbic acid deficiency (ICD10:E54)
• Drag and drop the term down one level.
Unmap a mapping
• ICD9:416.8 Other chronic pulmonary heart diseases appears in two places: the one attached to ICD10:I27.2 appears incorrect and can be unmapped.
The Unmapped Terms List
• Free form list of terms to be mapped
• Locate term you wish to map to in the hierarchy tree. Drag from table to term in the tree.
• If you make a mistake you can either reassign the mapped term within the tree or unmap it from tree.
• Unmap will cause it to reappear in the unmapped terms list if the term has no other mappings.
Assigning an unmapped term
• Drag from unmapped terms list
• Drop onto term we are mapping to
Unmapping a term
• Drag term from tree
• Drop onto unmapped terms list
Search Unmapped Terms By Name
Search Unmapped Terms by Code
Mapped Terms Viewer
Search Mapped Terms By Code
Search Mapped Terms By Name
Merging Ontologies
• Mapping tool provides a visualization of what the merged ontologies would look like
• What if we could extract a single metadata table from this?
Integration tool
[Diagram. Components: Mapper Cell; Integration Workflow (ICD9 into ICD-10). Arrow labels: Request to integrate; Mapped ICD-9 terms; ICD-10 merged with ICD9 terms.]
• For each mapped ICD-9 term, compute ICD-10 hierarchy
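The integration step (for each mapped ICD-9 term, compute its place in the ICD-10 hierarchy) can be pictured as appending the ICD-9 term beneath the path of the ICD-10 term it maps to. The sketch below is only an illustration of that idea: the function name, the "\ICD10\..." paths, the "ICD9:" code prefix and the choice to put the code in the path segment are all assumptions, not the behaviour of the actual integration workflow.

```python
def merge_mapped_terms(icd10_paths, icd9_names, mappings):
    """Place each mapped ICD-9 term under the ICD-10 term it is mapped to.

    icd10_paths: ICD-10 code -> c_fullname of that term
    icd9_names:  ICD-9 code  -> display name
    mappings:    list of (icd9_code, icd10_code) pairs
    """
    rows = []
    for icd9_code, icd10_code in mappings:
        name = icd9_names[icd9_code]
        # Include the code in the path segment so an ICD-9 term never
        # collides with an ICD-10 term of the same name.
        fullname = "{}{} (ICD9:{})\\".format(
            icd10_paths[icd10_code], name, icd9_code)
        rows.append({
            # A path like \A\B\ has one more backslash than it has levels.
            "c_hlevel": fullname.count("\\") - 1,
            "c_fullname": fullname,
            "c_name": name,
            "c_basecode": "ICD9:" + icd9_code,
        })
    return rows

# Mappings taken from the mapping tool examples; the paths are shortened
# and hypothetical.
icd10_paths = {
    "E63": "\\ICD10\\Other nutritional deficiencies\\",
    "E54": "\\ICD10\\Ascorbic acid deficiency\\",
}
icd9_names = {"269.3": "Mineral deficiency", "267": "Ascorbic acid deficiency"}
rows = merge_mapped_terms(icd10_paths, icd9_names,
                          [("269.3", "E63"), ("267", "E54")])
for row in rows:
    print(row["c_hlevel"], row["c_fullname"], row["c_basecode"])
```

Because the legacy code stays in c_basecode, facts already stored with ICD-9 codes would remain reachable from the merged tree.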
How to handle genomic data
• Ability to organize the variants for ease of navigation
   - Needs may differ between geneticist, physician, research scientist
• Ability to query for the variant in the workbench
   - Genomic labs may report data differently
   - Define the variant so it may be reliably identified over time
   - Implication is that the identifier for the variant does not change over time or is maintainable.
How to (reliably) identify a genomic variant?
• HGVS name?
• RS #?
• Chr location, nucleotide subst?
• Gene name + flanking sequences?
• All of them??
RS number
• Uniquely identifies a variant over time ….but….
• Novel variants may not have rs number
   - User may not want to submit to dbSNP
Gene name + flanking sequences
• Not guaranteed if gene has several isoforms
   - EGFR
HGVS Name
• Uniquely identifies variant within a referenced and versioned accession and details the nucleotide substitution.
NM_005228.3:c.2155G>T
   NM_005228.3    RefSeq accession
   c.             Coding DNA
   2155           Position
   G>T            Nucleotide substitution
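The parts of the name labelled above can be pulled apart mechanically. The following sketch uses a regular expression that covers only simple coding-DNA substitutions of the form shown (accession.version:c.positionREF>ALT); it is not a general HGVS parser, and the function and field names are illustrative assumptions.

```python
import re

# Matches only simple coding-DNA substitutions such as NM_005228.3:c.2155G>T.
HGVS_CDNA_SUBST = re.compile(
    r"^(?P<accession>[A-Z]{2}_\d+)\.(?P<version>\d+)"
    r":c\.(?P<position>\d+)(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)

def parse_hgvs_substitution(name):
    """Split a simple HGVS coding substitution into its labelled parts."""
    match = HGVS_CDNA_SUBST.match(name)
    if match is None:
        raise ValueError("not a simple c. substitution: " + name)
    parts = match.groupdict()
    parts["version"] = int(parts["version"])
    parts["position"] = int(parts["position"])
    return parts

print(parse_hgvs_substitution("NM_005228.3:c.2155G>T"))
```

Keeping the accession version as its own field matters here, since the position is only meaningful relative to a specific version of the reference sequence.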
Is there a common denominator in all of this?
• Yes … all ultimately describe variant location on a chromosome.
• Nucleotide substitution defines the physical manifestation of the variant.
WE PROPOSE:
   - HGVS name (n/t subst, positional info)
   - Flanking sequences (a way to verify positional info)
AS A WAY TO UNEQUIVOCALLY EQUATE TWO VARIANTS
   - ACROSS DOMAINS
   - ACROSS VERSIONS
Genomic MetadataXML record
GenomicMetadata
   Version                      1.0
   ReferenceGenomeVersion       hg18
   SequenceVariant
      HGVSName                  NM_005228.3:c.2155G>T
      SystematicName            c.2155G>T
      SystematicNameProtein     p.Gly719Cys
      AaChange                  missense
      DnaChange                 substitution
   SequenceVariantLocation
      GeneName                  EGFR
      FlankingSeq_5             GAATTCAAAAAGATCAAAGTGCTG
      FlankingSeq_3             GCTCCGGTGCGTTCGGCACGGTGT
      RegionType                exon
      RegionName                Exon 18
   Accessions
      Accession
         Name                   NM_005228
         Type                   mrna (NCBI)
      Accession
         Name                   NP_005219
         Type                   protein (NCBI)
      Accession
         Name                   NT_004487
         Type                   contig (NCBI)
   ChromosomeLocation
      Chromosome                chr7
      Region                    7p12
      Orientation               +
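A record like the one above would be stored as XML text in the C_METADATAXML column. The sketch below builds a cut-down version with xml.etree.ElementTree, using element names taken from the slide; the exact nesting of the elements is an assumption, since the slide shows only names and values, and `build_genomic_metadata` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

def build_genomic_metadata(hgvs_name, gene, flank5, flank3, genome="hg18"):
    """Serialize a minimal GenomicMetadata record to an XML string.

    Element names follow the slide; the nesting is assumed.
    """
    root = ET.Element("GenomicMetadata")
    ET.SubElement(root, "Version").text = "1.0"
    ET.SubElement(root, "ReferenceGenomeVersion").text = genome
    variant = ET.SubElement(root, "SequenceVariant")
    ET.SubElement(variant, "HGVSName").text = hgvs_name
    location = ET.SubElement(root, "SequenceVariantLocation")
    ET.SubElement(location, "GeneName").text = gene
    ET.SubElement(location, "FlankingSeq_5").text = flank5
    ET.SubElement(location, "FlankingSeq_3").text = flank3
    return ET.tostring(root, encoding="unicode")

xml_text = build_genomic_metadata(
    "NM_005228.3:c.2155G>T", "EGFR",
    "GAATTCAAAAAGATCAAAGTGCTG", "GCTCCGGTGCGTTCGGCACGGTGT",
)
print(xml_text)

# Reading it back: the '>' in the HGVS name is escaped on output and
# restored on parse.
record = ET.fromstring(xml_text)
print(record.findtext("SequenceVariant/HGVSName"))
```

Storing the HGVS name together with the flanking sequences in one record follows the proposal above: the name identifies the variant and the flanks give a way to verify its position.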
Organizational challenges
• By Disease?
• By Gene?
Combining equivalent terms
How to handle custom (local) metadata
• Edit Tool ideal for creating small, non-standard ontology for a local project.
   - Consider the case for classifying patients as smokers, non-smokers or smoking status unknown
• The Custom Metadata folder is designed for use with the creation of local terms.
   - Create a “Smoking status” folder
   - Populate folder with “Smoker”, “Non-smoker”, etc
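In metadata terms, the smoking status example amounts to one folder row and a handful of leaf rows whose c_fullname values extend the folder's path. A minimal sketch follows, assuming a root folder named "\Custom Metadata\" at c_hlevel 1 and made-up "CUSTOM:" base codes; the `make_term` helper is hypothetical and fills only four of the metadata columns.

```python
def make_term(parent_fullname, parent_level, name, basecode=None):
    """Build a partial metadata row for a term placed under a parent."""
    return {
        "c_hlevel": parent_level + 1,
        "c_fullname": parent_fullname + name + "\\",
        "c_name": name,
        "c_basecode": basecode,
    }

# Assumed root for local terms; level 1 follows the root-level convention.
ROOT_FULLNAME, ROOT_LEVEL = "\\Custom Metadata\\", 1

folder = make_term(ROOT_FULLNAME, ROOT_LEVEL, "Smoking status")
leaves = [
    make_term(folder["c_fullname"], folder["c_hlevel"], name, code)
    for name, code in [
        ("Smoker", "CUSTOM:SMOKER"),              # codes are hypothetical
        ("Non-smoker", "CUSTOM:NONSMOKER"),
        ("Smoking status unknown", "CUSTOM:SMOKING_UNKNOWN"),
    ]
]

for row in [folder] + leaves:
    print(row["c_hlevel"], row["c_fullname"], row["c_basecode"])
```

The folder itself has no base code in this sketch, since only the leaf terms would be recorded as facts.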
Smoking status custom metadata
www.i2b2.org
https://community.i2b2.org/wiki
http://bioportal.bioontology.org