Enabling Data Sharing in Biomedical Research
Integrating Data for analysis, Anonymization, and Sharing (iDASH)
Aziz A. Boxwala, MD, PhD
Division of Biomedical Informatics
UCSD
1U54GM095327
10/25/2010

Sharing Biomedical Data
– Today
  • Public repositories (mostly non-clinical)
  • Limited DUAs, public fear
  • Data ‘transmitted’ by FedEx
– Tomorrow
  • Annotated public databases
  • Certified trust network
  • Consented sharing and use

Sharing Computational Resources
– Today
  • Computer scientists looking for data, biomedical and behavioral scientists looking for analytics
  • Processed data not shared
  • Massive storage and high performance computing limited to a few institutions
– Tomorrow
  • Teams working to solve a problem (e.g., human genome project)
  • Processed anonymized data shared for verification and algorithmic improvement
  • Secure biomedical/behavioral cloud available to all

Challenges
• Data integration
• Maintenance of research subject’s privacy
• Respect for research subject’s autonomy
• Data analysis due to novel science
• Lack of infrastructure

Integrating Data (from different biological levels)
[Figure labels: Genotype (genome); transcription; RNA (transcriptome); translation; Protein (proteome); Phenotype; Population registries; clinical data; Biomarkers; labs]

Integrating Data (from different institutions)
[Figure labels: "Researcher is authorized to get data D about I for reason R"; Request about individual I; ID matching function; Data matching function: Map D onto data dictionaries; Request for data D; Return data D. Sites: UCSD (Epic), UC Irvine (Eclipsys), UC Davis (Epic), UCSF (GE), Community Partners, Remote Monitor DB. Each site holds its own MRN for the individual (MRN 23212, MRN 234512, MRN 6554, MRN 4433, MRN 43244).]

[Figure: NIF search architecture. Clients: Registration Client, Web Search Client. Layer labels: Application Layer, Application Logic, Search Engine Layer, Data Structures Layer, Computation / Query Layer, Data Layer. Components: User Request Manager, NIF Search Coordinator, Results Display Manager, Resource Registry Manager, Index Engine, W. Cat. Manager, Index Manager, Information Resource Registry, Web Index, Data Index, W. Result Postprocessor, Post Clustering Engine, Web Result Ranker, Keyword Query Processor, Mediator Registry, Data Integrator, Data Mediator, Source Query Wrappers, OntoQuest Ontology Manager. Sources: XML Source, NIF Literature (Textpresso), Relational DB, Web, RDF DB, Pathways DB, NIFSTD Ontology.]

Current Query Architecture
[Figure labels: Query Parser; Query Planner; Subquery Dispatcher; Execution Engine; Keyword Query Processor; Graph Query Processor; Tree Query Processor; Relational Query Processor; Result Ranking; Application-Level Post-processing; Model-Partitioned Data Store/Service; Semantic & Assn. Catalogs; Index Structures; Data Ingestion and Transformation (Data Readers); Ontology Repository; OntoQuest; Ontology Ingestion and Transformation (OWL Reader, OBO Reader, RDFS Reader)]
• How are data-ontology mappings specified?
• How to store, index and query ontologies efficiently?
• Managing different forms of ontology
• Managing multiple inter-mapped ontologies

The HIPAA Identifiers
1. Names
2. All geographical subdivisions smaller than a State, except for the initial three digits of a zip code
3. Dates (except year) directly related to an individual, including birth date, admission date, discharge date, date of death and all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older
4. Phone numbers
5. Fax numbers
6. Electronic mail addresses
7. Social Security numbers
8. Medical record numbers
9. Health plan beneficiary numbers
10. Account numbers
11. Certificate/license numbers
12. Vehicle identifiers and serial numbers, including license plate numbers
13. Device identifiers and serial numbers
14. Web Universal Resource Locators (URLs)
15. Internet Protocol (IP) address numbers
16. Biometric identifiers, including finger and voice prints
17. Full face photographic images and any comparable images
18. Any other unique identifying number, characteristic, or code

HIPAA data sets
• De-identified data set
  – Does not include 18 identifiers
• Limited data set
  – Can include the following identifiers:
    • Geographic data: town, city, State and zip code, but no street address.
    • Dates: A limited data set can include dates relating to an individual (e.g., birth date, admission and discharge date).
    • Other unique identifiers: A limited data set can include any unique identifying number, characteristic, or code other than those specified in the list of 16 identifiers that are expressly disallowed
• Fully identified data set
  – All identifiers allowed

IRB concerns

Limiting results to counts
• No inherent privacy:
  [Figure labels: Original; Reconstructed]

Serving result counts
• Allows:
  – Cohort finding
  – Exploration
• Need:
  – Perturbation
  [Figure labels: Q; Estimated Count + noise; Count returned]

Truly privacy preserving data
• Yields information about distribution independent of any individual data point
• How: Sampling from robust representation of joint probability distribution
  [Figure labels: Original; learn; Robust distribution; Sample; Privacy preserving]

Source Anonymization
• Multiple participating data sources (PDSs) contribute data to a central processing unit (CPU)
  – Cryptographic anonymization cloud:

Informed Consent
• Biospecimen and data repositories are creating archives for future, possibly unforeseen types of research
• Does this create challenges in adhering to the autonomy (right to self-determination) principle of biomedical ethics?
• We want to enable subjects to have better control over their participation in research
• Different consents within the same repository will create a challenge for investigators in selecting subjects
  – Matching research aims to consented uses
  – Selection biases

Electronic Informed Consent Management
• Create an informed consent ontology that can represent various dimensions of subject’s consent for research
• Develop an electronic informed consent registry that documents the subjects’ consents
  – Enables subjects to update consent
• Create a mediator that can resolve an investigator’s request for samples, data, or subject participation against the consented uses

Data Analysis Library
• Genome Data
  – Compression
  – Genome query language
• Pattern recognition
• Computing with streams
• Rare events

Data Publishing and Computational Resources
• Mismatches
  – Data availability
  – Computational resources and expertise
• iDASH services
  – Data acquisition, annotation, storage, dissemination
  – Scientific workflow execution
  – Governance and policy framework for data access control
  – Accessible via web portal and API

Biomedical CyberInfrastructure Architecture
Rich Services developed by Ingolf Krueger and colleagues

Driving Biological Projects
• Kawasaki Disease Research
• Anticoagulant Medication Safety
• Remote Monitoring of Behavior

Kawasaki Disease (PI: Jane Burns)
• Aim 1: To sequence size-selected cDNA from whole blood from KD patients and age-similar children with acute adenovirus infection to identify miRNA abundance patterns and to relate these patterns to disease state and to KD clinical outcome
• Aim 2: To selectively sequence genomic DNA regions in the pathway genes of interest to identify rare genetic variants that may play a functional role in disease susceptibility and outcome
• Aim 3: To create a KD data warehouse and web-based data analysis system aimed at facilitating discoveries using clinical and molecular data

Anticoagulant Medication Monitoring (PI: Fred Resnic)
• Aim 1: To determine baseline expectations for bleeding events for prasugrel and dabigatran, clopidogrel, and warfarin in eligible patients
• Aim 2: To evaluate the usefulness of aggregating information from 3 healthcare centers in an automated risk-adjusted medication safety monitoring tool that alerts for unsafe use of medications in particular cohorts of patients

Monitoring Sedentary Behavior (PI: Greg Norman)
• Phase 1 – physical activity behavior pattern recognition and feedback device and test for Device Limiting Failures (DLFs) with 12 adults for two-week cycles using a Phase I clinical trial approach.
• Phase 2 – efficacy testing of the prototype with iterative improvement/retesting in 30 sedentary adults with outcomes of accelerometer-measured activity and sedentary time evaluated against controls for a 6-week intervention period.
• Phase 3 – pilot randomized trial with 48 sedentary adults receiving either the intervention device or assessments only for a 3-month period evaluated with accelerometer-measured activity and sedentary time.
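
The outcome named in Phases 2 and 3, accelerometer-measured sedentary time, could be derived from raw activity counts roughly as follows. This is only an illustrative sketch: the function names, the one-minute epoch length, and the cut-point of 100 counts per minute (a convention often used for waist-worn accelerometers) are assumptions and are not taken from the project description.

```python
from typing import Sequence

# Assumed cut-point: epochs below this many counts per minute are treated
# as sedentary. This value is an illustrative assumption, not from the slides.
SEDENTARY_CPM = 100


def sedentary_minutes(counts_per_minute: Sequence[int],
                      threshold: int = SEDENTARY_CPM) -> int:
    """Return the number of one-minute epochs below the threshold."""
    return sum(1 for c in counts_per_minute if c < threshold)


def longest_sedentary_bout(counts_per_minute: Sequence[int],
                           threshold: int = SEDENTARY_CPM) -> int:
    """Return the length, in minutes, of the longest unbroken sedentary run."""
    longest = current = 0
    for c in counts_per_minute:
        if c < threshold:
            current += 1
            longest = max(longest, current)
        else:
            current = 0  # any epoch at or above the threshold ends the bout
    return longest


if __name__ == "__main__":
    # Hypothetical seven minutes of accelerometer counts.
    sample = [0, 50, 99, 100, 250, 10, 20]
    print(sedentary_minutes(sample))
    print(longest_sedentary_bout(sample))
```

Total sedentary minutes and the longest uninterrupted bout are two simple summaries; a feedback device of the kind described in Phase 1 would presumably compute something similar over a stream of epochs rather than a finished list.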

New science: new computational needs
• DBP1
  – Genetic data compression
  – Pattern recognition
  – Data integration from different biological levels
• DBP2
  – Data integration from different institutions
    • Aggregated results from three medical centers that serve different types of patients (BWH, VA TN, UCSD)
  – Rare event detection
• DBP3
  – Pattern recognition from streaming data from personal monitoring
  – Integration of spatial, temporal, physiological, and behavioral data

iDASH Team
PI (Ohno-Machado)
• Advisory Council
• Steering Committee
• Operations Committee
• Executive Committee
• Core 1 R&D (Bafna, Vinterbo)
  – Algorithms (Varghese)
  – Software Engineering (Krueger)
  – Statistical Methods (Messer)
• Core 2 Driving Projects (Ohno-Machado)
  – DBP 1 Kawasaki Genomics (Burns)
  – DBP 2 Pharmacosurveillance (Resnic)
  – DBP 3 Activity Patterns (Norman)
• Core 3 Infrastructure (Thornton)
  – System Administration
  – High Performance Computing
  – Helpdesk
• Core 4 Training (Pevzner)
  – San Diego State University Master’s (Valafar)
  – UCSD Doctoral Program
  – UCSD Medical Center Rotation
• Core 5 Dissemination (Patrick)
  – Annual Workshop
  – User Group
  – Technical Support
• Core 6 Administration (Boxwala, Balac)
  – Evaluation
  – DBP Selection Committee
  – NCBC consortium

Thank you
aboxwala@ucsd.edu