Athos Antoniades
Linked2Safety Project (FP7-ICT-2011-7 – 5.3)
A NEXT-GENERATION, SECURE LINKED DATA MEDICAL INFORMATION SPACE FOR
SEMANTICALLY-INTERCONNECTING ELECTRONIC HEALTH RECORDS
AND CLINICAL TRIALS SYSTEMS
ADVANCING PATIENTS SAFETY IN CLINICAL RESEARCH
12th International Conference on Bioinformatics and Bioengineering, Larnaka

Why Share Data?
What are the current legal and ethical limitations?
How have scientists shared medical data so far?
Key Problems
Perturbation
Cell Suppression
FP7, ICT-2011 – 5.3
Page 2
• Why share data:
  • Replication Testing
  • Statistical Power
  • Multiple Testing Problem
• Legal and Ethical Issues:
  • Anonymization vs Pseudonymization
  • Limitations derived from the consent form signed by subjects
  • Other regional, study-specific, or subject-specific issues
Page 3
          aa     aA     AA
Case      U00    U01    U02
Control   U10    U11    U12
Page 4
A paper that analyzes data from a specific study reports:

Marital Status by Age
Age      Married   Widowed   Single
0-16     0         1         50
18-24    10        5         50
25-34    40        7         40
35~      60        15        20
Page 5
Paper 1 that analyzes data from a specific study reports:

Marital Status by Age
Age      Married   Widowed   Single
0-16     NA        NA        50
18-24    10        7         50
25-34    40        7         40
35~      60        15        20

Paper 2 that analyzes data from the same study reports:

Marital Status by Age
Age      Married   Widowed   Single
0-16     NA        NA        50
18-25    10        8         50
26-35    45        7         40
36~      55        14        20
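One possible reading of the two tables, sketched here under the assumption that both papers count exactly the same subjects: because the age bins are shifted by one year, subtracting overlapping bins isolates the counts for a single year of age, which can expose small cells that were never published directly. The dictionary names and the `subtract` helper are illustrative and do not come from the slides.

```python
# Counts per age bin as (married, widowed, single).
# The 0-16 row is left out because two of its cells are suppressed (NA).
paper1 = {"18-24": (10, 7, 50), "25-34": (40, 7, 40), "35~": (60, 15, 20)}
paper2 = {"18-25": (10, 8, 50), "26-35": (45, 7, 40), "36~": (55, 14, 20)}

def subtract(a, b):
    """Element-wise difference of two count tuples."""
    return tuple(x - y for x, y in zip(a, b))

# Subjects aged exactly 25: in Paper 2's 18-25 bin but not in Paper 1's 18-24 bin.
age_25 = subtract(paper2["18-25"], paper1["18-24"])
# Subjects aged exactly 35: in Paper 1's 35~ bin but not in Paper 2's 36~ bin.
age_35 = subtract(paper1["35~"], paper2["36~"])

print(age_25)  # (0, 1, 0)
print(age_35)  # (5, 1, 0)
```

Under this reading, each single-year group contains exactly one widowed subject, even though neither paper reports a cell that small.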
Page 8
Original Data

Marital Status by Age
Age      Married   Widowed   Single
0-16     0         1         50
18-24    10        7         50
25-34    40        7         40
35~      60        15        20

Perturbation (±1) and Cell Suppression (<5)

Marital Status by Age
Age      Married   Widowed   Single
0-16     NA        NA        51
18-24    9         8         49
25-34    40        7         41
35~      61        14        21
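A minimal sketch of how perturbation and cell suppression could be applied to a table of counts like the one on this slide. It assumes that suppression is decided on the original count and that the noise is a uniformly drawn integer in [-noise, +noise]; the slides do not confirm either choice. The function name and its parameters are hypothetical.

```python
import random

def perturb_and_suppress(table, noise=1, threshold=5, seed=None):
    """Suppress cells below `threshold`, then add integer noise in [-noise, noise].

    Assumption: suppression is checked against the original count,
    and noise is drawn uniformly. Suppressed cells become None (NA).
    """
    rng = random.Random(seed)
    released = {}
    for age, counts in table.items():
        released[age] = {}
        for status, n in counts.items():
            if n < threshold:
                released[age][status] = None  # suppressed cell, shown as NA
            else:
                # Clamp at zero so a released count is never negative.
                released[age][status] = max(0, n + rng.randint(-noise, noise))
    return released

original = {
    "0-16":  {"Married": 0,  "Widowed": 1,  "Single": 50},
    "18-24": {"Married": 10, "Widowed": 7,  "Single": 50},
    "25-34": {"Married": 40, "Widowed": 7,  "Single": 40},
    "35~":   {"Married": 60, "Widowed": 15, "Single": 20},
}

released = perturb_and_suppress(original, noise=1, threshold=5, seed=0)
for age, counts in released.items():
    print(age, counts)
```

Because the noise is random, the released counts will generally differ from the slide's example, but every unsuppressed cell stays within 1 of its original value.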
Page 9
• Most common parameters tested:
  Perturbation: [0], [-1,1], [-3,3], [-5,5], [-10,10]
  Cell Suppression: <0, <=1, <=3, <=5, <=10
• Standard main effect test using Chi-Square.
• Pearson’s correlation coefficient used to evaluate the deviation of each parameter combination from the original results.
• A priori defined threshold for Pearson’s correlation coefficient: <=0.95.
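The evaluation described above can be sketched roughly as follows: compute a Chi-Square statistic on the original and on the released table, then correlate the two lists of statistics across all tests. This is only an illustration with hypothetical helper names. The slides do not say whether statistics or p-values were correlated, or exactly how the 0.95 threshold was applied, so the sketch stops at computing the coefficient.

```python
import math

def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:  # skip cells with no expected count
                stat += (observed - expected) ** 2 / expected
    return stat

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Age-by-marital-status counts from the previous slide, without the
# 0-16 row, whose suppressed cells cannot be used in the test.
original = [[10, 7, 50], [40, 7, 40], [60, 15, 20]]
released = [[9, 8, 49], [40, 7, 41], [61, 14, 21]]
print(chi_square_statistic(original), chi_square_statistic(released))
```

In a full evaluation, `pearson_r` would be applied to the list of statistics from every variable pair tested, once per perturbation and suppression setting.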
Page 10
Objectives:
• Design and develop the data mining techniques and the scalable infrastructure for the identification of phenotypic and genetic associations related to adverse events.
• Develop new and implement existing state-of-the-art analytical approaches for genetic data.
• Define and implement the knowledge extraction and filtering mechanisms and the knowledge base.
• Integrate the knowledge base into a lightweight decision support system (adverse events early detection mechanism).
Page 12
Provides the tools for identifying and removing erroneous
data or data that do not conform to the quality standards
that a user might define.
Tools:
• Hardy-Weinberg Equilibrium Test
• Allele Frequency Test
• Missing Data Test
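As an illustration of the first tool, here is a minimal sketch of the textbook Hardy-Weinberg equilibrium chi-square test for one biallelic marker. It is not taken from the Linked2Safety implementation. The function name is hypothetical, and the argument names follow the aa / aA / AA genotype labels used earlier in the deck.

```python
def hwe_chi_square(n_aa, n_aA, n_AA):
    """Chi-square statistic (1 degree of freedom) for deviation from HWE."""
    n = n_aa + n_aA + n_AA
    p = (2 * n_AA + n_aA) / (2 * n)  # frequency of allele A
    q = 1 - p                        # frequency of allele a
    observed = (n_aa, n_aA, n_AA)
    expected = (q * q * n, 2 * p * q * n, p * p * n)
    # Skip genotypes with zero expected count to avoid division by zero.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

print(hwe_chi_square(25, 50, 25))  # 0.0
print(hwe_chi_square(50, 0, 50))   # 100.0
```

A statistic above 3.84 corresponds to p < 0.05 at one degree of freedom. A quality-control step would typically flag or drop such markers, with the cut-off left to the user.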
Page 14
Provides the tools for removing redundant or irrelevant
features from a dataset.
Tools:
• Rough Set Feature Selection
• Information Gain Feature Selection
• Chi Squared Feature Selection
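To make the information gain criterion concrete, the sketch below scores a categorical feature by how much it reduces the entropy of the outcome labels. This is the standard definition, not the project's own code. The function names and the toy data are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(feature, labels):
    """Entropy of the labels minus their entropy conditional on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

# Toy data: the outcome is fully determined by the first feature
# and unrelated to the second.
outcome = ["case", "case", "control", "control"]
informative = ["yes", "yes", "no", "no"]
irrelevant = ["x", "y", "x", "y"]
print(information_gain(informative, outcome))  # 1.0
print(information_gain(irrelevant, outcome))   # 0.0
```

Features would then be ranked by this score, and the lowest-ranked ones dropped as redundant or irrelevant.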
Page 15
Provides the tools for performing single hypothesis testing on a dataset and testing for associations.
Tools:
• Pearson’s Chi Square Test
• Fisher’s Exact Test
• Odds Ratio
• Binomial Logistic Regression
• Linkage Disequilibrium
• Genetic Region Based Association Testing
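As one example from this list, the odds ratio for a 2x2 exposure-by-outcome table can be sketched as below, with a 95% confidence interval from the standard log-scale (Woolf) approximation. The function name and cell naming are assumptions made for illustration. The sketch also assumes no cell is zero.

```python
import math

def odds_ratio(a, b, c, d, z=1.96):
    """Odds ratio and approximate 95% CI for a 2x2 table.

    a = exposed cases,    b = unexposed cases,
    c = exposed controls, d = unexposed controls.
    Raises ZeroDivisionError if any cell is zero.
    """
    estimate = (a * d) / (b * c)
    # Standard error of log(OR) under the Woolf approximation.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(estimate) - z * se)
    upper = math.exp(math.log(estimate) + z * se)
    return estimate, lower, upper

estimate, lower, upper = odds_ratio(20, 10, 10, 20)
print(estimate)  # 4.0
print(lower, upper)
```

An interval that excludes 1 suggests an association between exposure and outcome at roughly the 5% level.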
Page 17
Provides the tools for performing data mining analyses on a dataset and extracting association rules.
Tools:
• Association Rules (Apriori)
• Decision Trees with Percentage Split (C4.5)
• Decision Trees with Cross Validation (C4.5)
• Random Forest with Percentage Split
• Random Forest with Cross Validation
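Apriori-style rule mining ranks candidate rules by support and confidence. The sketch below computes both measures for a single rule over a list of records. It illustrates the measures only, not the Apriori search itself. The function name and the toy records are hypothetical.

```python
def rule_metrics(records, antecedent, consequent):
    """Support and confidence of the rule antecedent -> consequent.

    Each record is a collection of the items (e.g. findings) present
    for one subject.
    """
    antecedent, consequent = set(antecedent), set(consequent)
    n_antecedent = sum(1 for r in records if antecedent <= set(r))
    n_both = sum(1 for r in records if (antecedent | consequent) <= set(r))
    support = n_both / len(records)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence

# Made-up records, for illustration only.
records = [
    {"smoker", "hypertension"},
    {"smoker"},
    {"hypertension"},
    {"smoker", "hypertension"},
]
support, confidence = rule_metrics(records, {"smoker"}, {"hypertension"})
print(support)     # 0.5
print(confidence)  # 2 of the 3 smoker records, about 0.667
```

A rule is usually kept only if both values exceed user-chosen minimums.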
Page 18
• Knowledge Extraction Mechanism
  • This mechanism is responsible for storing statistically significant associations and important association rules in the Linked2Safety knowledge database.
  • Has two steps:
    • Logging system
    • Storing important knowledge
• Filtering mechanism
  • This mechanism allows users to insert or delete associations and association rules.
Page 21
• Uses the knowledge in the L2S knowledge base
• Runs in the background to identify new associations and association rules
• Reruns analyses when updated datasets are available
• Creates alerts for patient profiles associated with adverse events
Page 22
Overlapping non-genetic data of at least 2 data providers:
Variables
Age
Gender
BMI
Smoking Ever
Dyslipidemia
Diabetes
Diabetes type I
Diabetes type II
Anemia
Depressive personality disorder
Major depressive disorder
Schizotypal personality disorder
Weight gain
Headaches
Gastrointestinal symptoms
Ophthalmological problems
Type of ophthalmological condition
High blood pressure
Heart conditions exist
Type of heart condition
Hypertension
Myocardial infarction
Stroke
Coronary heart disease
Page 25
We were able to identify, for a given dataset, the maximum noise that can be added to the data without significantly affecting the outcomes.
The results presented are relevant only to MASTOS; for any other dataset, the analytical approach described must be repeated to determine the maximum noise that can be added.
Further investigation is necessary to identify the minimum parameter settings that satisfy legal and ethical requirements.
Page 26
Athos Antoniades
University of Cyprus
email: athos@cs.ucy.ac.cy
Page 27