Data analysis due to novel science

Enabling Data Sharing in Biomedical Research
Integrating Data for Analysis, Anonymization, and Sharing (iDASH)
Aziz A. Boxwala, MD, PhD
Division of Biomedical Informatics
UCSD
1U54GM095327
10/25/2010
Sharing Biomedical Data
– Today
• Public repositories (mostly non-clinical)
• Limited DUAs, public fear
• Data ‘transmitted’ by FedEx
– Tomorrow
• Annotated public databases
• Certified trust network
• Consented sharing and use
Sharing Computational Resources
– Today
• Computer scientists looking for data, biomedical and behavioral
scientists looking for analytics
• Processed data not shared
• Massive storage and high performance computing limited to a few
institutions
– Tomorrow
• Teams working to solve a problem (e.g., human genome project)
• Processed anonymized data shared for verification and algorithmic
improvement
• Secure biomedical/behavioral cloud available to all
Challenges
• Data integration
• Maintenance of research subject’s privacy
• Respect for research subject’s autonomy
• Data analysis due to novel science
• Lack of infrastructure
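Integrating Data
(from different biological levels)
[Figure: biological levels paired with their data sources. Genotype (genome), via transcription, to RNA (transcriptome), via translation, to Protein (proteome). Also shown: Population (registries), Phenotype (clinical data), and Biomarkers (labs).]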
Integrating Data
(from different institutions)
• Researcher is authorized to get data D about I for reason R
• Request about individual I goes to an ID matching function
• Data matching function: Map D onto data dictionaries
• Request for data D is sent to the sites; data D is returned
[Figure: participating sources are UCSD (Epic), UC Irvine (Eclipsys), UC Davis (Epic), UCSF (GE), Community Partners, and a Remote Monitor DB. Example record numbers shown: MRN 23212, MRN 234512, MRN 6554, MRN 4433, MRN 43244.]
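The request flow on this slide can be sketched in Python. This is only an illustration under stated assumptions: the slide does not say which MRN belongs to which site, so the pairing in `ID_INDEX` is invented, and the concept name, local field names, and values are hypothetical as well.

```python
# Hypothetical master index: one person ID -> local MRN at each site.
# The MRN-to-site pairing is assumed; the slide does not specify it.
ID_INDEX = {
    "person-001": {"UCSD": "23212", "UC Irvine": "234512", "UC Davis": "43244"},
}

# Hypothetical data dictionaries: shared concept -> local field name per site.
DATA_DICTIONARIES = {
    "UCSD": {"systolic_bp": "BP_SYS"},
    "UC Irvine": {"systolic_bp": "SystolicPressure"},
    "UC Davis": {"systolic_bp": "BP_SYS"},
}

# Toy site databases keyed by local MRN (values are made up).
SITE_DATA = {
    "UCSD": {"23212": {"BP_SYS": 128}},
    "UC Irvine": {"234512": {"SystolicPressure": 131}},
    "UC Davis": {"43244": {"BP_SYS": 125}},
}


def match_ids(person_id):
    """ID matching function: find the individual's MRN at each site."""
    return dict(ID_INDEX.get(person_id, {}))


def match_data(concept, site):
    """Data matching function: map a requested concept onto a site's dictionary."""
    return DATA_DICTIONARIES.get(site, {}).get(concept)


def request_data(person_id, concept, authorized):
    """Return the requested data item from every site that holds it."""
    if not authorized:
        raise PermissionError("researcher is not authorized for this request")
    results = {}
    for site, mrn in match_ids(person_id).items():
        local_field = match_data(concept, site)
        record = SITE_DATA.get(site, {}).get(mrn, {})
        if local_field is not None and local_field in record:
            results[site] = record[local_field]
    return results
```

Calling `request_data("person-001", "systolic_bp", authorized=True)` gathers one value per site even though each site uses its own MRN and field name. A real deployment would also need to check the stated reason R, which this sketch leaves out.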
[Figure: layered NIF search architecture.
Clients: Registration Client; Web Search Client.
Layers: Application Layer; Search Engine Layer; Data Structures Layer; Computation / Query Layer; Data Layer.
Components shown: Application Logic; User Request Manager; NIF Search Coordinator; Results Display Manager; Resource Registry Manager; Index Engine; W. Cat. Manager; Index Manager; Information Resource Registry; Web Index; Data Index; W. Result Postprocessor; Post Clustering Engine; Web Result Ranker; Keyword Query Processor; Mediator Registry; Data Integrator; Data Mediator; Source Query Wrappers; OntoQuest Ontology Manager.
Sources: XML Source; NIF Literature (Textpresso); Relational DB; Web; RDF DB; Pathways DB; NIFSTD Ontology.]
Current Query Architecture
[Figure: query processing pipeline. Components shown: Query Parser; Query Planner; Execution Engine; Subquery Dispatcher; Keyword, Graph, Tree, and Relational Query Processors; Result Ranking; Application-Level Post-processing; Data Ingestion and Transformation (three Data Readers); Model-Partitioned Data Store/Service; Semantic & Assn. Catalogs; Index Structures; ...]
OntoQuest
[Figure: Ontology Repository fed by Ontology Ingestion and Transformation, with OWL, OBO, and RDFS Readers.]
• How are data-ontology mappings specified?
• How to store, index, and query ontologies efficiently?
• Managing different forms of ontology
• Managing multiple inter-mapped ontologies
The HIPAA Identifiers
1. Names
2. All geographical subdivisions smaller than a State, except for the initial three digits of a zip code
3. Dates (except year) directly related to an individual, including birth date, admission date, discharge date, and date of death; and all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older
4. Phone numbers
5. Fax numbers
6. Electronic mail addresses
7. Social Security numbers
8. Medical record numbers
9. Health plan beneficiary numbers
10. Account numbers
11. Certificate/license numbers
12. Vehicle identifiers and serial numbers, including license plate numbers
13. Device identifiers and serial numbers
14. Web Universal Resource Locators (URLs)
15. Internet Protocol (IP) address numbers
16. Biometric identifiers, including finger and voice prints
17. Full face photographic images and any comparable images
18. Any other unique identifying number, characteristic, or code
HIPAA data sets
• De-identified data set
– Does not include 18 identifiers
• Limited data set
– can include the following identifiers:
• Geographic data: town, city, State and zip code, but no street
address.
• Dates: A limited data set can include dates relating to an
individual (e.g., birth date, admission and discharge date).
• Other unique identifiers: A limited data set can include any
unique identifying number, characteristic or code other than
those specified in the list of 16 identifiers that are expressly
disallowed
• Fully identified data set
– All identifiers allowed
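As a rough illustration of how a de-identified data set could be derived from a fully identified record, the Python sketch below applies the identifier rules as worded on these slides. The field names (`name`, `mrn`, `zip`, `*_date`, `age`) and the ISO date format are assumptions made for the example. The full regulation adds conditions that are not modeled here, so this is not a compliance tool.

```python
# Hypothetical field names for direct identifiers that are simply dropped.
DIRECT_IDENTIFIER_FIELDS = {
    "name", "street_address", "city", "phone", "fax", "email", "ssn", "mrn",
    "health_plan_id", "account_number", "license_number", "vehicle_id",
    "device_id", "url", "ip_address", "biometric_id", "photo",
}


def deidentify(record):
    """Return a copy of `record` with the listed identifiers removed or coarsened."""
    is_90_plus = isinstance(record.get("age"), int) and record["age"] >= 90
    out = {}
    for key, value in record.items():
        if key in DIRECT_IDENTIFIER_FIELDS:
            continue  # drop direct identifiers entirely
        if key == "zip":
            out["zip3"] = str(value)[:3]  # keep only the initial three digits
        elif key.endswith("_date"):
            if key == "birth_date" and is_90_plus:
                continue  # a birth year would reveal an age over 89
            # Assumes ISO "YYYY-MM-DD" strings; keep the year only.
            out[key[: -len("_date")] + "_year"] = int(str(value)[:4])
        elif key == "age":
            out["age"] = "90+" if is_90_plus else value
        else:
            out[key] = value
    return out
```

For example, a record with `zip` "92093" and `birth_date` "1950-04-12" comes back with `zip3` "920" and `birth_year` 1950, while the name and medical record number are dropped.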
IRB concerns
Limiting results to counts
• No inherent privacy:
[Figure: original data compared with data reconstructed from counts]
Serving result counts
• Allows:
– Cohort finding
– Exploration
• Need:
– Perturbation
[Figure: a query Q produces a count; noise is added (+) before the count is returned. Labels: Q, noise, Estimated Count, Count returned.]
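One common way to perturb a count is to add Laplace-distributed noise before returning it. The slide only says "perturbation" and "noise", so the choice of the Laplace distribution and the `epsilon` parameter below are assumptions. This Python sketch is not a description of the iDASH method.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    # expovariate takes a rate; rate 1/scale gives mean `scale`.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def perturbed_count(true_count, epsilon=1.0, rng=None):
    """Return a noisy, non-negative integer count for a cohort query.

    `epsilon` is an assumed privacy parameter: smaller values add more noise.
    """
    rng = rng or random.Random()
    noisy = true_count + laplace_noise(1.0 / epsilon, rng)
    return max(0, round(noisy))
```

The returned counts still support cohort finding and exploration, because the noise averages out over large cohorts, while any single answer says less about an individual record.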
Truly privacy preserving data
• Yields information about distribution
independent of any individual data point
• How: Sampling from robust representation of
joint probability distribution
[Figure: Original data, from which a robust distribution is learned, from which privacy-preserving data are sampled.]
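The learn-then-sample idea can be sketched in Python for categorical data. Additive smoothing stands in here for the "robust representation" and is purely an assumption, as are the function names. Smoothing alone gives no formal privacy guarantee.

```python
import random
from collections import Counter
from itertools import product


def learn_joint(records, domains, alpha=1.0):
    """Estimate a smoothed joint distribution over all value combinations.

    records: list of tuples, one value per attribute.
    domains: list of allowed values for each attribute.
    alpha:   additive smoothing weight (assumed stand-in for robustness).
    """
    counts = Counter(records)
    cells = list(product(*domains))
    total = len(records) + alpha * len(cells)
    return {cell: (counts[cell] + alpha) / total for cell in cells}


def sample_synthetic(joint, n, rng=None):
    """Draw n synthetic records from the learned joint distribution."""
    rng = rng or random.Random()
    cells = list(joint)
    weights = [joint[cell] for cell in cells]
    return rng.choices(cells, weights=weights, k=n)
```

The synthetic records follow the learned distribution rather than copying any original row, which is the property the slide describes.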
Source Anonymization
• Multiple participating data sources (PDSs)
contribute data to a central processing unit
(CPU)
– Cryptographic anonymization cloud:
Informed Consent
• Biospecimen and data repositories are creating
archives for future, possibly unforeseen types of
research
• Does this create challenges in adhering to the
autonomy (right to self-determination) principle of
biomedical ethics?
• We want to enable subjects to have better control
on their participation in research
• Different consents within the same repository will
create a challenge for investigators in selecting
subjects
– Matching research aims to consented uses
– Selection biases
Electronic Informed Consent
Management
• Create an informed consent ontology that can
represent various dimensions of subject’s
consent for research
• Develop an electronic informed consent registry
that documents the subjects’ consents
– Enables subjects to update consent
• Create a mediator that can resolve an
investigator’s request for samples, data, or
subject participation against the consented uses
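The registry and mediator described above could look roughly like the Python sketch below. The consent dimensions used here (a set of resources and a set of research purposes) are hypothetical simplifications; an actual informed consent ontology would be richer.

```python
from dataclasses import dataclass


@dataclass
class Consent:
    subject_id: str
    resources: set   # assumed dimension, e.g. {"data", "samples", "recontact"}
    purposes: set    # assumed dimension, e.g. {"genetics", "cardiovascular"}


class ConsentRegistry:
    """Documents each subject's current consent and lets it be updated."""

    def __init__(self):
        self._consents = {}

    def update(self, consent):
        # A newer consent replaces the subject's earlier one.
        self._consents[consent.subject_id] = consent

    def all(self):
        return list(self._consents.values())


def resolve_request(registry, resource, purpose):
    """Mediator: list subjects whose consent covers the requested use."""
    return sorted(
        c.subject_id
        for c in registry.all()
        if resource in c.resources and purpose in c.purposes
    )
```

Because the mediator reads the registry at request time, a subject who updates a consent is included in or excluded from later requests without any change on the investigator's side.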
Data Analysis Library
• Genome Data
– Compression
– Genome query
language
• Pattern recognition
• Computing with streams
• Rare events
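As a baseline illustration of genome data compression, each nucleotide can be packed into two bits, four bases per byte. The slide does not say which compression methods the library uses, so this Python sketch is an assumed example only. It handles just the letters A, C, G, and T.

```python
_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
_BASES = "ACGT"


def pack(seq):
    """Pack a DNA string into bytes, 2 bits per base; also return its length."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | _CODE[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        out.append(byte)
    return bytes(out), len(seq)


def unpack(data, length):
    """Recover the original DNA string from packed bytes and its length."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(_BASES[(byte >> shift) & 3])
    return "".join(bases[:length])
```

The sequence length is stored alongside the bytes so that padding in the last byte is not mistaken for extra bases.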
Data Publishing and Computational
Resources
• Mismatches
– Data availability
– Computational resources and expertise
• iDASH services
– Data acquisition, annotation, storage, dissemination
– Scientific workflow execution
– Governance and policy framework for data access
control
– Accessible via web portal and API
Biomedical CyberInfrastructure
Architecture
Rich Services developed by Ingolf Krueger and colleagues
Driving Biological Projects
• Kawasaki Disease Research
• Anticoagulant Medication Safety
• Remote Monitoring of Behavior
Kawasaki Disease
(PI: Jane Burns)
• Aim 1: To sequence size-selected cDNA from whole
blood from KD patients and age-similar children with
acute adenovirus infection to identify miRNA
abundance patterns and to relate these patterns to
disease state and to KD clinical outcome
• Aim 2: To selectively sequence genomic DNA
regions in the pathway genes of interest to identify
rare genetic variants that may play a functional role
in disease susceptibility and outcome
• Aim 3: To create a KD data warehouse and web-based data
analysis system aimed at facilitating
discoveries using clinical and molecular data
Anticoagulant Medication Monitoring
(PI: Fred Resnic)
• Aim 1: To determine baseline expectations for
bleeding events for prasugrel, dabigatran,
clopidogrel, and warfarin in eligible patients
• Aim 2: To evaluate the usefulness of
aggregating information from 3 healthcare
centers in an automated risk-adjusted
medication safety monitoring tool that alerts for
unsafe use of medications in particular cohorts
of patients
Monitoring Sedentary Behavior
(PI: Greg Norman)
• Phase 1
– develop a physical activity behavior pattern recognition and feedback
device and test for Device Limiting Failures (DLFs) with 12 adults for
two-week cycles using a Phase I clinical trial approach.
• Phase 2
– efficacy testing of the prototype with iterative improvement and
retesting in 30 sedentary adults, with outcomes of
accelerometer-measured activity and sedentary time evaluated against
controls for a 6-week intervention period.
• Phase 3
– pilot randomized trial with 48 sedentary adults receiving either the
intervention device or assessments only for a 3-month period
evaluated with accelerometer-measured activity and sedentary
time.
New science: new computational needs
• DBP1
– Genetic data compression
– Pattern recognition
– Data integration from different biological levels
• DBP2
– Data integration from different institutions
• aggregated results from three medical centers that serve different types
of patients (BWH, VA TN, UCSD)
– Rare event detection
• DBP3
– Pattern recognition from streaming data from personal
monitoring
– Integration of spatial, temporal, physiological, and behavioral
data
iDASH Team
• PI (Ohno-Machado)
• Advisory Council
• Steering Committee
• Operations Committee
• Executive Committee
• Core 1 R&D (Bafna, Vinterbo)
– Algorithms (Varghese)
– Software Engineering (Krueger)
– Statistical Methods (Messer)
• Core 2 Driving Projects (Ohno-Machado)
– DBP 1 Kawasaki Genomics (Burns)
– DBP 2 Pharmacosurveillance (Resnic)
– DBP 3 Activity Patterns (Norman)
• Core 3 Infrastructure (Thornton)
– System Administration
– High Performance Computing
– Helpdesk
• Core 4 Training (Pevzner)
– San Diego State University Master’s (Valafar)
– UCSD Doctoral Program
– UCSD Medical Center Rotation
• Core 5 Dissemination (Patrick)
– Annual Workshop
– User Group
– Technical Support
• Core 6 Administration (Boxwala, Balac)
– Evaluation
– DBP Selection Committee
– NCBC consortium
Thank you
aboxwala@ucsd.edu