Ontology-Informed Knowledge Representation in the Immunology Database and Analysis Portal (ImmPort) Megan Kong1, Carl Dahlke2, Diane Xiang1, David Karp4, Richard H. Scheuermann1,3 1Department of Pathology, 3Division of Biomedical Informatics, 4Division of Rheumatology, U.T. Southwestern Medical Center, Dallas, TX, 2 Health Information Systems, Northrop Grumman, Inc., Rockville, MD 1 ImmPort Purpose and History • NIH/NIAID/DAIT would like to: – maximize the return on the public investment in basic, translational and clinical research – allow investigators to more effectively extract meaningful information from the vast amounts of data generated from advanced research technologies – => data sharing policy • Bioinformatics Integration Support Contract (BISC) to support data sharing for all DAIT-funded programs - basic, translational and clinical research • Immunology Database and Analysis Portal (ImmPort) - www.ImmPort.org – Archive and manage basic and clinical research data – Integrate these research data with extensive biological knowledge – Support analysis of these integrated data Home page • Support for many large projects that use a variety of different experiment methodologies, including FCM Browse Data/ ImmPort Research Data/ ImmPort Supported Programs ImmPort Research Data | My Work Bench Immune Function and Biodefense in Children, Elderly, and Immunocompromised Populations Program Population Genetics Analysis Program: Immunity to Vaccines/Infections Program HLA Region Genetics in Immune-Mediated Diseases Program Immune Modeling Centers Other Consortium Projects Browse Data/ ImmPort Research Data/ ImmPort Supported Programs ImmPort Research Data | My Work Bench Immune Function and Biodefense in Children, Elderly, and Immunocompromised Populations Program Grants/Contracts/Projects: Project Title An Improved Influenza A Vaccine for Rapid Protection of the Elderly Institution The Wistar Institute Principle Investigator Hildegund C. J. Ertl, M.D. Objective To conduct pre-clinical studies needed to develop a vaccine for the elderly that would provide protection in the event of a bioterror attack with influenza A virus. To determine the effects of chronic immunosuppressive therapies on adaptive, innate and specific immunity To propose a comprehensive analysis of the immunologic Kinetic analysis of immunologic repletion and Children´s Hospital of Sullivan, Kathleen, response to killed trivalent influenza vaccine in different influenza vaccine responsiveness Philadelphia Ph.D. immunocompromised populations in order to understand how to improve vaccine responses To define comprehensively and in molecular and cellular Immune Function and Biodefense in Children, detail differences in recognition and response to microbeElderly, and Immunocompromised Populations: Wilson, Christopher, derived danger signals between adults, neonates and infants, University of Washington TLRs in Innate Immunity and the Induction of M.D. and how these, in turn, contribute to differences in innate Adaptive Immunity in the Neonate and Infant immunity and the induction of antigen-specific (adaptive) immunity Oklahoma Medical To understand why many patients with systemic lupus Responses to Influenza Vaccination in Systemic Thompson, Linda, Research Foundation erythematosus (SLE) fail to make adequate responses to Lupus Ph.D. (OMRF) immunization with the influenza vaccine. To characterize immune markers and mechanisms in the Immune function and Biodefense in Children, Oregon Health and Science Nikolich-Zugich, elderly that determine their vulnerability to infectious and Elderly and Immunocompromised Populations University Janko, M.D., Ph.D. bioterrorism agents in categories A-C. To determine whether the different trimesters of pregnancy, characterized by unique hormonal environments, are Immune Response to Virus Infection During Mt. Sinai School of associated with (a) identifiable, discrete changes in maternal Moran, Thomas, Ph.D. Pregnancy Medicine systemic immunity and/or (b) recognizable alterations in susceptibility to select bio-defense pathogens and/or (c) differential responses to influenza vaccination To identify the specific immune defects that make Rochester Biodefense University of Rochester Sanz, Ignacio, M.D. immunocomprised populations specially susceptible at bioterrorists attack. This proposal will explore the hypothesis that altered innate Innate Immune Responsiveness in the Elderly immune responsiveness in the elderly and the Yale School of Medicine Fikrig, Erol, M.D. and the Immunosuppressed immunosuppressed contributes to vaccine unresponsiveness or disease susceptibility. Protective Immunity in Transplant Recipients Emory Transplant Center Larsen, Christian, Ph.D. Number of Subjects Mechanistic Assays 600 ELISA, ELISPOT, Flow Cytometry 90 ELISA, ELISPOT, Microarry, RT - PCR 54 ELISPOT, Sequencing, Flow Cytometry 17 adults, 17 neonates RT-PCR, ELISA, Flow Cytometry, Microarray 60 ELISPOT, ELISA, multiplex RT-PCR 130 Flow Cytometry, ELISA, RT-PCR, Gene expression, 75 Flow Cytometry, multiplex ELISA, RT-PCR 280 Flow Cytometry 1160 SNP genotyping, Flow Cytometry, ELISA ImmPort Overview - System Components • Semi-public web-based database and analysis portal – – • – Reference data • – • • • Interpreted results Analytical metadata Clinical research data • • Study design – from clinical protocol Participant characteristics/phenotype – from assessments captured in CRF’s and laboratory testing Query tools – – – – To support retrieval of reference and experiment data based on specified criteria Pre-defined QBE Customized semantic queries Statistical analysis of distributions Biological network analysis • • • Standard FACS statistics Novel population identification based on high dimensional data clustering Measurement of immune response (e.g., ELISA, ELISPOT) • – Filtering/normalization Clustering Classification Cell population analysis • • – LD analysis TagSNP selection Haplotype reconstruction Genotype-phenotype association Gene expression analysis • • • Methodologies – microarrays, SNP genotyping, flow cytometry, ELISA, ELISPOT, qPCR, imaging, etc. Metadata (defining how the experiment was performed) - common features of all experiments Primary results - from all experiment measurement techniques Processed results – – – – Types - Gene structure, protein function, polymorphisms, metabolic, regulatory, signaling and other networks, protein-protein, gene-gene, hostpathogen interactions Sources - NCBI, Uniprot, Swissprot, BIND, Reactome Experiment data • Genetic analysis • • • • Multi-level access control Data sharing • • Analysis tools Data – • • Quantification of topological parameters Module identification Visualization tools – – – – – Genome display, including genes, introns, exons, SNPs, tagSNPs, etc. Genetic analysis results Gene expression results Networks, pathways, and molecular interactions Graphing and charting of statistical results Ontology – – Thesaurus function Organize terms and define relationships 7 Varicella: Immune Response to Varicella Vaccination in Subjects with Atopic Dermatitis Compared to Nonatopic Controls Objectives: To determine if children with AD have Varicellaspecific cell mediated immune (CMI) responses to varicella vaccination that differ from those of nonatopic controls (Th2) Endpoint: 1. 2. 3. ELISPOT: number of Varicella-specific T-cell ELISA: levels of Varicella-specific antibodies Flow cytometry: perforin generation using 2color flow cytometry staining for both CD8 and perforin. (Perforin levels, which may contribute to defects in CD8+ cytotoxic T lymphocyte function). 8 9 Driving Projects - ADVN 10 Why OBX? • Challenges – – – – – Extensive phenotypic characteristics of interest Complex study designs of treatments, assessments, samplings, etc. Few standards for how they are described or captured Each supported project captures their data in a different format Integration with mechanistic studies of specimens 11 Why OBX? • Challenges – – – – – Extensive phenotypic characteristics of interest Complex study designs of treatments, assessments, samplings, etc. Few standards for how they are described or captured Each supported project captures their data in a different format Integration with mechanistic studies of specimens • Evaluation of existing clinical data standards – BRIDG – designed for supporting the conduct of clinical trials, not a study repository optimized for analysis – CDISC STDM – specification for SAS transport files to the FDA – CDISC CDASH – models clinical encounters, but not sequence of events; requires use of EPOCHs (contiguous blocks of time) – None designed to manage mechanistic studies 12 Why OBX? • Challenges – – – – – Extensive phenotypic characteristics of interest Complex study designs of treatments, assessments, samplings, etc. Few standards for how they are described or captured Each supported project captures their data in a different format Integration with mechanistic studies of specimens • Evaluation of existing clinical data standards – BRIDG – designed for supporting the conduct of clinical trials, not a study repository optimized for analysis – CDISC STDM – specification for SAS transport files to the FDA – CDISC CDASH – models clinical encounters, but not sequence of events; requires use of EPOCHs (contiguous blocks of time) – None designed to manage mechanistic studies • Ontology-Based eXtensible Data Model (OBX) – Logical ontology structure (BFO/OBI) => extensible data model – Data element value sets derived from ontology hierarchy 13 Entities of Interest • Objects – – – – • Human subjects Specimens Materials Plans Qualities – clinical phenotype – Malar rash – WBC count – Weight – Blood glucose levels – Atopic dermatitis • Processes – – – – Recruitment, enrollment, approval Assays, physical exams Blood draw, PBMC isolation Treatment, surgery (interventions) • Context – Time – Environment • Roles – Principal investigator, study coordinator, study participant – Reagent, therapeutic 14 OBX Framework 15 OBX Study Design 16 OBX Biomaterial Transformation 17 Assay Database Schema 18 Lab test mapping into OBX Blood Test: ADVN Fields ADVN Fields Description ImmPort Table ID Subject ID Lab Test SITE Study Site will not capture VISITDT VISIT Date Lab Test HGB Hemoglobin (g/dl): Lab Test CSHGB Hemoglobin Clinical Significant Lab Test HCT Hematocrit (%): Lab Test CSHCT Hematocrit Clinical Significant Lab Test WBC Total WBC (x1000/μl): Lab Test CSWBC Total WBC Clinical Significant Lab Test NEUT Neutrophils (x1000/microliter): Lab Test CSNEUT Neutrophils Clinical Significant Lab Test LYMP Lymphocytes (x1000/μl): Lab Test CSLYMP Lymphocytes Clinical Significant Lab Test EOSIN Eosinophils (x1000/μl): Lab Test CSEOS Eosinophils Clinical Significant Lab Test ImmPort Attribute Subject ID Visit Date Type = Blood Test Subtype = Hemoglobin Measurement Test Result Value = HGB Test Result Unit = grams per deciliter Clinical Significance = CSHGB Type = Blood Test Subtype = Hematocrit Measurement Test Result Value = HCT Test Result Unit = percentage Clinical Significance = CSHCT Type = Blood Test Subtype = Total White Blood Cell Measurement Test Result Value = HGB Test Result Unit = 1000/μl Clinical Significance = CSHGB Type = Blood Test Subtype = Neutrophils Measurement Test Result Value = NEUT Test Result Unit = 1000/μl Clinical Significance = CSNEUT Type = Blood Test Subtype = Lymphocytes Measurement Test Result Value = LYMP Test Result Unit = 1000/μl Clinical Significance = CSLYMP Type = Blood Test Subtype = Eosinophils Measurement Test Result Value = EOSIN Test Result Unit = 1000/μl 19 Clinical Significance = CSEOS Load File Examples 20 FLOCK v2.0 STEPS 1. File Conversion - Convert binary .fcs file into a data matrix 2. Data Cleansing - Remove boundary events (noise) in FSC and SSC dimensions 3. Data Shrinking - Collapse data toward distribution modes 4. Normalization - Z-score normalization for values in each dimension ((xi - µ)/SD) 5. Dimension Selection - Select most informative dimensions based on measures of dispersion and distortion 6. FLOCK LoD i. Partition each dimension to generate a hyper-grid ii. Identify dense hyper-regions in hyper-grid iii. Merge neighboring dense hyper-regions to define hyper-region groups (n) iv. Determine centroids for each hyper-region group v. Use n centroids to seed single round of distance-based clustering 7. FLOCK HiD - Refine population definition based on histogram partitioning 8. Group Merging - Merge close hyper-region groups based on [distance metric] 9. Centroid Calculation - Compute centroid for each hyper-region group 10. Clustering - Cluster events to nearest centroid 11. Population statistics - Summarize population proportions, intensity levels, etc. 12. Visualization 17 B Cell Populations in Blood A PB GSM GNSM UM3-4 CD38 B220 CD27 UM1-2 N1-3 DNM IgD CD24 IgG Population characteristics Populationa N1 N2 N3 Colorb Gray Magenta Purple CD27c - IgDc + + + IgGc - CD38c + + CD24c int + + B220c + + low Proportiond 48.94% 4.69% 4.41% Putative cell typea naïve (CD38+)[Bm2?] naïve (CD38-) naïve (CD38+B220low) UM1 UM2 UM3 UM4 Darkred Salmon Darkblue Green + + + + + + int int - + + - + + + + + + low low 1.55% 0.94% 6.16% 11.50% unswitched memory (CD38+) unswitched memory (CD38-)[Bm1?] IgDlow unswitched memory (CD38+) IgDlow unswitched memory (CD38-) GSM1 GSM2 GSM3 Grayishgreen Yellow Blue + + + + - + + + + + - + + + + low low 0.36% 4.05% 4.40% switching memory (IgD+IgG+CD38+) switched memory (CD38+)[early Bm5?] switched memory (CD38-)[late Bm5?] GNSM1 GNSM2 GNSM3 GNSM4 Cyan Darkgreen Teal Orange + + + + - - + + - + + + - low low + low 4.84% 3.84% 1.30% 0.51% IgD-IgG- memory IgD-IgG- memory IgD-IgG- memory IgD-IgG- memory DNSM1 DNSM2 Pink Darkgray - - + - - - + + 0.85% 0.91% double negative memory (IgG+) double negative memory (IgG-) PB Red high - - high - low 0.75% plasmablasts Summary Statistics B cell component of the Cell Ontology http://www.obofoundry.org/ Summary • OBX is an ontology-informed clinical research data model • Ontological framework incorporated provides for extensibility and straightforward incorporation of ontology terms as value sets • Ontological framework also facilitated distributed data model development and guides ongoing mapping strategies • Successfully loaded several large, distinct clinical studies • Developed user interfaces for data extraction and analysis • Refinement through use – mapping, loading, extracting, analysis • Term submission to OBO Foundry ontologies, especially OBI 45 Acknowledgments UT Southwestern Yu (Max) Qian Jamie Lee Megan Kong Jennifer Cai Jie Huang Nishanth Marthandan Diane Xiang Young Bun Kim Paula Guidry Eva Sadat David Karp Northrop Grumman Carl Dahlke John Campbell Liz Thompson Jeff Wiser Mike Attasi OBO Foundry BFO OBI Supported by NIH N01AI40076 46