Developing i2b2 Ontologies for the Long Haul
Lori Phillips, MS
Partners HealthCare Systems, Inc
April 25, 2012
National Centers for Biomedical Computing

What is i2b2?
- Software for explicitly organizing and transforming person-oriented clinical data in a way that is optimized for research
- Allows integration of clinical data, trials data, and genotypic data
- A portable and extensible application framework
- Modular software architecture allows additions without disturbing core parts
- Available as open source at https://www.i2b2.org

Where is it used?

CTSAs:
- Boston University
- Case Western Reserve University (including Cleveland Clinic)
- Children's National Medical Center (GWU), Washington D.C.
- Duke University
- Emory University (including Morehouse School of Medicine and Georgia Tech)
- Harvard University (including Beth Israel Deaconess Medical Center, Brigham and Women's Hospital, Children's Hospital Boston, Dana Farber Cancer Center, Joslin Diabetes Center, Massachusetts General Hospital)
- Medical University of South Carolina
- Medical College of Wisconsin
- Oregon Health & Science University
- Penn State Milton S. Hershey Medical Center
- Tufts University
- University of Alabama at Birmingham
- University of Arkansas for Medical Sciences
- University of California Davis
- University of California, Irvine
- University of California, Los Angeles*
- University of California, San Diego*
- University of California San Francisco
- University of Chicago
- University of Cincinnati (including Cincinnati Children's Hospital Medical Center)
- University of Colorado Denver (including Children's Hospital Colorado)
- University of Florida
- University of Kansas Medical Center
- University of Kentucky Research Foundation
- University of Massachusetts Medical School, Worcester
- University of Michigan
- University of Pennsylvania (including Children's Hospital of Philadelphia)
- University of Pittsburgh (including their Cancer Institute)
- University of Rochester School of Medicine and Dentistry
- University of Texas Health Sciences Center at Houston
- University of Texas Health Sciences Center at San Antonio
- University of Texas Medical Branch (Galveston)
- University of Texas Southwestern Medical Center at Dallas
- University of Utah
- University of Washington
- University of Wisconsin - Madison (including Marshfield Clinic)
- Virginia Commonwealth University
- Weill Cornell Medical College

Academic Health Centers (does not include AHCs that are part of a CTSA):
- Arizona State University
- City of Hope, Los Angeles
- Georgia Health Sciences University, Augusta
- Hartford Hospital, CT
- HealthShare Montana
- Massachusetts Veterans Epidemiology Research and Information Center (MAVERIC), Boston
- Nemours
- Phoenix Children's Hospital
- Regenstrief Institute
- Thomas Jefferson University
- University of Connecticut Health Center
- University of Missouri School of Medicine
- University of Tennessee Health Sciences Center
- Wake Forest University Baptist Medical Center

HMOs:
- Group Health Cooperative
- Kaiser Permanente

International:
- Georges Pompidou Hospital, Paris, France
- Hospital of the Free University of Brussels, Belgium
- Inserm U936, Rennes, France
- Institute for Data Technology and Informatics (IDI), NTNU, Norway
- Institute for Molecular Medicine Finland (FIMM)
- Karolinska Institute, Sweden
- Landspitali University Hospital, Reykjavik, Iceland
- Tokyo Medical and Dental University, Japan
- University of Bordeaux Segalen, France
- University of Erlangen-Nuremberg, Germany
- University of Goettingen, Goettingen, Germany
- University of Leicester and Hospitals, England (Biomed. Res. Informatics Ctr. for Clin. Sci)
- University of Pavia, Pavia, Italy
- University of Seoul, Seoul, Korea

Companies:
- Johnson and Johnson (TransMART)
- GE Healthcare Clinical Data Services

Why use i2b2?
- Cohort discovery: enables and simplifies research cohort discovery across an institution's large, heterogeneous clinical datasets
- Hypothesis generation: enables and simplifies analysis of data to support a hypothesis
- Retrospective data analysis: enables the retrospective analysis of data to support or refute claims

i2b2 Workbench

Data Model
- FACTS: the quantitative or factual data being queried
- DIMENSIONS: groups of hierarchies and descriptors that define the facts
- STAR SCHEMA: a single fact table surrounded by numerous dimension tables
i2b2 Star Schema

Each dimension table relates to observation_fact one-to-many (1 to ∞).

observation_fact
  PK  Patient_Num
  PK  Encounter_Num
  PK  Concept_CD
  PK  Observer_CD
  PK  Start_Date
  PK  Modifier_CD
  PK  Instance_Num
      End_Date
      ValType_CD
      TVal_Char
      NVal_Num
      ValueFlag_CD
      Observation_Blob

patient_dimension
  PK  Patient_Num
      Birth_Date
      Death_Date
      Vital_Status_CD
      Age_Num*
      Gender_CD*
      Race_CD*
      Ethnicity_CD*

visit_dimension
  PK  Encounter_Num
      Start_Date
      End_Date
      Active_Status_CD
      Location_CD*

concept_dimension
  PK  Concept_Path
      Concept_CD
      Name_Char

observer_dimension
  PK  Observer_Path
      Observer_CD
      Name_Char

modifier_dimension
  PK  Modifier_Path
      Modifier_CD
      Name_Char

Observation (fact table) Primary Keys
  Patient_num     Distinct number for every patient
  Encounter_num   Distinct number for every visit
  Concept_cd      Distinct code for every concept
  Observer_cd     Distinct code for every observer
  Start_date      Date-time observation began
  Modifier_cd     Code to modify concept_cd
  Instance_num    Mechanism to group concept modifiers

i2b2 Fact Table
- In i2b2, an atomic fact is an observation on a patient.
- Examples of facts:
  - Diagnoses
  - Procedures
  - Lab data
  - Medications
  - Genetic data

i2b2 Dimension Tables
- Dimension tables contain descriptive information about the facts.
- Examples:
  - Concept dimension describes the concepts stored in the concept_cd field
  - Provider (observer) dimension contains information about the observer_cd field
  - Patient dimension contains information about the patient_num field
  - Visit dimension contains information about the encounter_num field
  - Modifier dimension contains information about the modifier_cd field

How does i2b2 use Ontologies?
- By and large, the concepts stored in the fact table come from clinical coding systems or ontologies.
- The choice is largely dependent on the data available to the institution:
  - Diagnoses: ICD9/ICD10/SNOMED
  - Procedures: CPT/ICD9
  - Medications: NDC/RXNORM
  - Lab results: LOINC
  - Molecular/genomic data
  - Custom or project-specific data
- Ontologies are used to organize query terms (and concepts) hierarchically.

Metadata table
- Query terms are stored in a separate metadata table.
- There is a one-to-one mapping of terms in the metadata to concepts in the dimension table.
- The structure of the metadata table is integral to both the visualization of the query terms (tree) and the query mechanism itself.

Structure of Metadata Table

METADATA
  C_HLEVEL             INT NULL
  C_FULLNAME           VARCHAR(900) NULL
  C_NAME               VARCHAR(2000) NULL
  C_SYNONYM_CD         CHAR(1) NULL
  C_VISUALATTRIBUTES   CHAR(3) NULL
  C_TOTALNUM           INT NULL
  C_BASECODE           VARCHAR(450) NULL
  C_METADATAXML        TEXT NULL
  C_FACTTABLECOLUMN    VARCHAR(50) NULL
  C_TABLENAME          VARCHAR(50) NULL
  C_COLUMNNAME         VARCHAR(50) NULL
  C_COLUMNDATATYPE     VARCHAR(50) NULL
  C_OPERATOR           VARCHAR(10) NULL
  C_DIMCODE            VARCHAR(900) NULL
  C_COMMENT            TEXT NULL
  C_TOOLTIP            VARCHAR(900) NULL
  UPDATE_DATE          DATETIME NULL
  DOWNLOAD_DATE        DATETIME NULL
  IMPORT_DATE          DATETIME NULL
  SOURCESYSTEM_CD      VARCHAR(50) NULL
  VALUETYPE_CD         VARCHAR(50) NULL

i2b2 Metadata: Root Level Categories
- Terms with c_hlevel = 1
- Display name is c_name
- Icon (folder or container) is determined by c_visualattributes
- Example c_fullname: \Diagnoses\

Query terms are visualized hierarchically in a tree

  \Diagnoses\                              level 1
    Respiratory system\                    level 2
      Chronic obstructive diseases\        level 3
        Emphysema\                         level 4

Why are hierarchies so important for i2b2?
- Hierarchies form the basis of both the visualization of the terms and the query mechanism itself.
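The relationship between a term's path and its level in the tree above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `hlevel` and `children` are hypothetical, and the sketch assumes that c_hlevel equals the number of path segments in c_fullname, which matches the numbering on the slide but may differ in a given installation.

```python
# Hypothetical helpers; paths are taken from the tree shown above.

def hlevel(c_fullname: str) -> int:
    """Count the path segments, e.g. the path for Diagnoses alone gives 1."""
    return len([segment for segment in c_fullname.split("\\") if segment])


def children(paths, parent_fullname):
    """Return paths exactly one level below the parent that share its prefix."""
    target_level = hlevel(parent_fullname) + 1
    return [
        p for p in paths
        if p.startswith(parent_fullname) and hlevel(p) == target_level
    ]


PATHS = [
    "\\Diagnoses\\",
    "\\Diagnoses\\Respiratory system\\",
    "\\Diagnoses\\Respiratory system\\Chronic obstructive diseases\\",
    "\\Diagnoses\\Respiratory system\\Chronic obstructive diseases\\Emphysema\\",
]

for path in PATHS:
    print(hlevel(path), path)
```

Expanding a folder in the tree then amounts to a prefix match on the path combined with a level filter, which is the same pairing of conditions a metadata query would use.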
Example: retrieving the terms directly under Emphysema (level 5) from the metadata table:

  select *
  from metadata
  where c_fullname like '\Diagnoses\Respiratory system\Chronic obstructive diseases\Emphysema\%'
    and c_hlevel = 5

Hierarchies in queries

  select patient_num
  from observation_fact
  where concept_cd IN (
    select concept_cd
    from concept_dimension
    where concept_path LIKE '\Diagnoses\Respiratory system\Chronic obstructive diseases\Emphysema\%')

i2b2 Ontologies for the Long Haul
- How do I create i2b2 metadata for a known ontology?
  - ICD-10
- What happens to my legacy clinical data when I have to move to ICD-10?
  - Merging ICD-9 with ICD-10
- How ... do I handle genomic metadata? Custom metadata?

NCBO BioPortal ICD-10

Building an ICD-10 Ontology with NCBO services
- Pull data from NCBO via REST services.
- Reorganize information into i2b2 Metadata format.

bioportal/concepts/46302/all

  <data>
    <pageNum>1</pageNum>
    <numPages>1832</numPages>
    <pageSize>50</pageSize>
    <numResultsPage>50</numResultsPage>
    <numResultsTotal>91590</numResultsTotal>
    <contents class="org.ncbo.stanford.bean.concept.ClassBeanResultListBean">
      <classBeanResultList>
        <classBean>
          <id>0-ICD10CM</id>
          <fullId>http://purl.bioontology.org/ontology/ICD10CM/0-ICD10CM</fullId>
          <label>ICD-10-CM TABULAR LIST of DISEASES and INJURIES</label>
          <type>class</type>
          <relations>
            <entry>
              <string>ChildCount</string>
              <int>0</int>
            </entry>
            ……

Primary challenges
- i2b2 Metadata depends upon hierarchical information.
- c_fullname and c_tooltip maintain the hierarchy from root to leaves:
  Diseases of the respiratory system \ Chronic lower respiratory diseases \ Emphysema

Challenges (continued)
- The NCBO REST service that enables the pull of concepts includes immediate parent/child info only.
- The hierarchy must be computed.

  <data>
    <classBean>
      <id>J43</id>
      <label>Emphysema</label>
      <relations>
        <entry>
          <string>SuperClass</string>
          <list>
            <classBean>
              <id>J40-J47</id>
              <label>Chronic lower respiratory diseases</label>
            </classBean>
          </list>
        </entry>
      </relations>
    </classBean>
  </data>

NCBO Extraction workflow
  Request to extract ontology -> NCBO REST (XML) -> Extraction Workflow (ICD-10): Process Extracted Data -> i2b2 Metadata (extracted ICD-10 terms)

Released deliverables
  https://community.i2b2.org/wiki/display/NCBO

What about my legacy ICD-9 data?
- Ideally we would like an i2b2 ontology that integrates ICD-9 into ICD-10.

Mapping Tool
- Tool to verify/(re)assign ontology mappings.
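Returning to the extraction challenge noted earlier (the REST response carries only the immediate parent, so the full hierarchy must be computed), the step can be sketched as follows. This is a minimal sketch, not the released extraction workflow: the function names are hypothetical, the XML shape is copied from the J43 sample on the slide, and the id J00-J99 for the respiratory chapter is an assumption added for the example.

```python
import xml.etree.ElementTree as ET

# Sample response in the shape shown on the slide.
SAMPLE = """<data>
  <classBean>
    <id>J43</id>
    <label>Emphysema</label>
    <relations>
      <entry>
        <string>SuperClass</string>
        <list>
          <classBean>
            <id>J40-J47</id>
            <label>Chronic lower respiratory diseases</label>
          </classBean>
        </list>
      </entry>
    </relations>
  </classBean>
</data>"""


def parse_concept(xml_text):
    """Return (id, label, parent_id or None) for the top-level classBean."""
    bean = ET.fromstring(xml_text).find("classBean")
    parent_id = None
    for entry in bean.findall("relations/entry"):
        if entry.findtext("string") == "SuperClass":
            parent = entry.find("list/classBean")
            if parent is not None:
                parent_id = parent.findtext("id")
    return bean.findtext("id"), bean.findtext("label"), parent_id


def full_name(concept_id, labels, parents):
    """Walk parent links up to the root and join the labels into a path."""
    parts, seen = [], set()
    while concept_id is not None:
        if concept_id in seen:
            raise ValueError("cycle in parent links at " + concept_id)
        seen.add(concept_id)
        parts.append(labels[concept_id])
        concept_id = parents.get(concept_id)
    return "\\" + "\\".join(reversed(parts)) + "\\"


# Labels and parent links as they might look after many single-concept pulls.
labels = {
    "J43": "Emphysema",
    "J40-J47": "Chronic lower respiratory diseases",
    "J00-J99": "Diseases of the respiratory system",  # assumed id
}
parents = {"J43": "J40-J47", "J40-J47": "J00-J99"}

print(parse_concept(SAMPLE))
print(full_name("J43", labels, parents))
```

The resulting path is the kind of value that c_fullname and c_tooltip need to carry from root to leaf.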
Navigating the Mapping Tool Tree
- Displays terms mapped from one ontology within the hierarchy of another
- Mapped terms are displayed adjacent to the terms they are mapped to and appear in bold

Adding a new mapping
- ICD9:269.3 Mineral deficiency should appear for ICD10:E63 Other nutritional deficiencies
- Copy term ICD9:269.3
- Paste onto ICD10:E63 Other nutritional deficiencies

Move a mapping
- Ascorbic acid deficiency (ICD9:267) can be moved down one level to Ascorbic acid deficiency (ICD10:E54)
- Drag and drop the term down one level.

Unmap a mapping
- ICD9:416.8 Other chronic pulmonary heart diseases appears in two places: the one attached to ICD10:I27.2 appears incorrect and can be unmapped.

The Unmapped Terms List
- Free-form list of terms to be mapped
- Locate the term you wish to map to in the hierarchy tree.
- Drag from the table to the term in the tree.
- If you make a mistake you can either reassign the mapped term within the tree or unmap it from the tree.
- Unmap will cause it to reappear in the unmapped terms list if the term has no other mappings.

Assigning an unmapped term
- Drag from the unmapped terms list
- Drop onto the term we are mapping to

Unmapping a term
- Drag the term from the tree
- Drop onto the unmapped terms list

Search Unmapped Terms By Name

Search Unmapped Terms By Code

Mapped Terms Viewer

Search Mapped Terms By Code

Search Mapped Terms By Name

Merging Ontologies
- The mapping tool provides a visualization of what the merged ontologies would look like.
- What if we could extract a single metadata table from this?
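One way to picture extracting a single merged table is to place each mapped ICD-9 term under the path of the ICD-10 term it maps to. The sketch below is illustrative only and is not the integration tool itself: `merge_mapped_terms` is a hypothetical helper, the ICD-10 path is shortened and made up for the example, and c_hlevel is assumed to equal the number of path segments.

```python
def merge_mapped_terms(icd10_paths, icd9_names, mappings):
    """Build metadata-style rows for mapped ICD-9 terms.

    icd10_paths: ICD-10 code -> c_fullname of that term
    icd9_names:  ICD-9 code -> display name
    mappings:    iterable of (icd9_code, icd10_code) pairs
    """
    rows = []
    for icd9_code, icd10_code in mappings:
        parent_path = icd10_paths[icd10_code]
        fullname = parent_path + icd9_names[icd9_code] + "\\"
        rows.append({
            # A path with n segments contains n + 1 backslashes.
            "c_hlevel": fullname.count("\\") - 1,
            "c_fullname": fullname,
            "c_name": icd9_names[icd9_code],
            "c_basecode": "ICD9:" + icd9_code,
        })
    return rows


# Hypothetical, shortened ICD-10 path; the mapping is the one from the slides.
icd10_paths = {"E63": "\\Diagnoses\\Other nutritional deficiencies\\"}
icd9_names = {"269.3": "Mineral deficiency"}
merged = merge_mapped_terms(icd10_paths, icd9_names, [("269.3", "E63")])

for row in merged:
    print(row)
```

Because the ICD-9 row inherits the ICD-10 path as a prefix, a query on the ICD-10 folder would also pick up legacy facts coded with the mapped ICD-9 term.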
Integration tool
  Request to integrate -> Mapper Cell -> Integration Workflow (ICD-9 into ICD-10): for each mapped ICD-9 term, compute the ICD-10 hierarchy -> ICD-10 merged with ICD-9 terms
  Input: mapped ICD-9 terms

How to handle genomic data
- Ability to organize the variants for ease of navigation
  - Needs may differ between geneticist, physician, research scientist
- Ability to query for the variant in the workbench
- Define the variant so it may be reliably identified over time
  - Genomic labs may report data differently
  - The implication is that the identifier for the variant does not change over time or is maintainable.

How to (reliably) identify a genomic variant?
- HGVS name?
- RS #?
- Chr location, nucleotide substitution?
- Gene name + flanking sequences?
- All of them?

RS number
- Uniquely identifies a variant over time ... but ...
  - Novel variants may not have an rs number
  - User may not want to submit to dbSNP

Gene name + flanking sequences
- Not guaranteed if the gene has several isoforms
- Example: EGFR

HGVS Name
- Uniquely identifies a variant within a referenced and versioned accession and details the nucleotide substitution.

  NM_005228.3:c.2155G>T
    NM_005228.3   RefSeq accession
    c.            Coding DNA
    2155          Position
    G>T           Nucleotide substitution

Is there a common denominator in all of this?
- Yes ... all ultimately describe the variant location on a chromosome.
- The nucleotide substitution defines the physical manifestation of the variant.
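The parts of an HGVS name called out above can be pulled apart with a small parser. This is a rough sketch under stated assumptions: it handles only simple single-nucleotide substitutions on a versioned accession such as the slide's example, not the full HGVS nomenclature, and the name `parse_hgvs_substitution` is hypothetical.

```python
import re

# Matches only names shaped like NM_005228.3:c.2155G>T (assumption: simple
# substitution, versioned accession, DNA-level coordinate type).
HGVS_SUBSTITUTION = re.compile(
    r"^(?P<accession>[A-Z]{2}_\d+)\.(?P<version>\d+):"
    r"(?P<coord>[cgmn])\.(?P<position>\d+)"
    r"(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)


def parse_hgvs_substitution(name):
    """Split a simple HGVS substitution name into its labelled parts."""
    match = HGVS_SUBSTITUTION.match(name)
    if match is None:
        raise ValueError("not a simple versioned HGVS substitution: " + name)
    parts = match.groupdict()
    parts["version"] = int(parts["version"])
    parts["position"] = int(parts["position"])
    return parts


variant = parse_hgvs_substitution("NM_005228.3:c.2155G>T")
print(variant)
```

Keeping the accession version as its own field makes it easier to notice when two names refer to different versions of the same reference sequence.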
WE PROPOSE:
- HGVS name (n/t subst, positional info)
- Flanking sequences (a way to verify positional info)
AS A WAY TO UNEQUIVOCALLY EQUATE TWO VARIANTS
- ACROSS DOMAINS
- ACROSS VERSIONS

Genomic MetadataXML record

GenomicMetadata
  Version: 1.0
  ReferenceGenomeVersion: hg18
  SequenceVariant
    HGVSName: NM_005228.3:c.2155G>T
    SystematicName: c.2155G>T
    SystematicNameProtein: p.Gly719Cys
    AaChange: missense
    DnaChange: substitution
  SequenceVariantLocation
    GeneName: EGFR
    FlankingSeq_5: GAATTCAAAAAGATCAAAGTGCTG
    FlankingSeq_3: GCTCCGGTGCGTTCGGCACGGTGT
    RegionType: exon
    RegionName: Exon 18
    Accessions
      Accession
        Name: NM_005228
        Type: mrna (NCBI)
      Accession
        Name: NP_005219
        Type: protein (NCBI)
      Accession
        Name: NT_004487
        Type: contig (NCBI)
    ChromosomeLocation
      Chromosome: chr7
      Region: 7p12
      Orientation: +

Organizational challenges
- By disease?
- By gene?
- Combining equivalent terms

How to handle custom (local) metadata
- The Edit Tool is ideal for creating a small, non-standard ontology for a local project.
- Consider the case of classifying patients as smokers, non-smokers, or smoking status unknown.
- The Custom Metadata folder is designed for the creation of local terms.
  - Create a "Smoking status" folder
  - Populate the folder with "Smoker", "Non-smoker", etc.

Smoking status custom metadata

Resources
- www.i2b2.org
- https://community.i2b2.org/wiki
- http://bioportal.bioontology.org