Athos Antoniades Linked2Safety Project (FP7-ICT-2011-7 – 5.3) A NEXT-GENERATION, SECURE LINKED DATA MEDICAL INFORMATION SPACE FOR SEMANTICALLY-INTERCONNECTING ELECTRONIC HEALTH RECORDS AND CLINICAL TRIALS SYSTEMS ADVANCING PATIENTS SAFETY IN CLINICAL RESEARCH 12th International Conference on Bioinformatics and Bioengineering, Larnaka Why Share Data? What are the current legal and ethical limitations? How have scientists shared medical data so far? Key Problems Perturbation Cell Suppression FP7, ICT-2011 – 5.3 Page 2 Why share data: Replication Testing Statistical Power Multiple Testing Problem Legal and Ethical Issues Anonymization vs Pseudoanonimization Limitations derived from consent form signed by subjects Other, regional, study, or subject specific issues. FP7, ICT-2011 – 5.3 Page 3 aa aA AA Case U00 U01 U02 Control U10 U11 U12 FP7, ICT-2011 – 5.3 Page 4 A paper that analyzes data from a specific study reports: Age Marital Status FP7, ICT-2011 – 5.3 Age 0-16 18-24 25-34 35~ Married 0 10 40 60 Widowed Single 1 50 5 50 7 40 15 20 Page 5 A paper that analyzes data from a specific study reports: Age Marital Status FP7, ICT-2011 – 5.3 Age 0-16 18-24 25-34 35~ Married 0 10 40 60 Widowed Single 1 50 5 50 7 40 15 20 Page 6 A paper that analyzes data from a specific study reports: Age Marital Status FP7, ICT-2011 – 5.3 Age 0-16 18-24 25-34 35~ Married 0 10 40 60 Widowed Single 1 50 5 50 7 40 15 20 Page 7 Paper 1 that analyzes data from a specific study reports: Paper 2 that analyzes data from the same study reports: Age Married Widowed Single NA 0-16 NA 50 18-24 10 7 50 25-34 40 7 40 35~ 60 15 20 FP7, ICT-2011 – 5.3 Marital Status Age Age Marital Status Age Married Widowed Single NA 0-16 NA 50 18-25 10 8 50 26-35 45 7 40 36~ 55 14 20 Page 8 Perturbation (+-1) and Cell Suppression (<5) Original Data Age Married Widowed Single 1 0-16 0 50 18-24 10 7 50 25-34 40 7 40 35~ 60 15 20 FP7, ICT-2011 – 5.3 Marital Status Age Age Marital Status Age Married Widowed Single NA 0-16 NA 51 18-24 9 8 49 25-34 40 7 41 35~ 61 14 21 Page 9 • Most common parameters tested Perturbation:[0], [-1,1], [-3,3], [-5,5], [-10,10] Cell Supression: <0, <=1, <=3,<=5,<=10 • Standard main effect test using Chi Square • Pearson’s Correlation Coefficient used to evaluate deviation of each parameter combination to original results. • A-priory defined threshold for Pearson’s correlation coefficient <=0.95. FP7, ICT-2011 – 5.3 Page 10 FP7, ICT-2011 – 5.3 Page 11 Objectives: Design and develop the data mining techniques and the scalable infrastructure for the identification of phenotypic and genetic associations related to adverse events. Develop new and implement existing state of the art analytical approaches for genetic data. Define and implement the knowledge extraction and filtering mechanisms and the knowledge base Integrate the knowledge base into a lightweight decision support system (Adverse events early detection mechanism) FP7, ICT-2011 – 5.3 Page 12 FP7, ICT-2011 – 5.3 Page 13 Provides the tools for identifying and removing erroneous data or data that do not conform to the quality standards that a user might define. Tools: Hardy-Weinberg Equilibrium Test Allele Frequency Test Missing Data Test FP7, ICT-2011 – 5.3 Page 14 Provides the tools for removing redundant or irrelevant features from a dataset. Tools: Rough Set Feature Selection Information Gain Feature Selection Chi Squared Feature Selection FP7, ICT-2011 – 5.3 Page 15 FP7, ICT-2011 – 5.3 Page 16 Provides the tools for performing single hypothesis testing on a dataset and test for associations. Tools: Pearson’s Chi Square Test Fisher’s Exact Test Odds Ratio Binomial Logistic Regression Linkage Disequilibrium Genetic Region Based Association Testing FP7, ICT-2011 – 5.3 Page 17 Provides the tools for performing data mining analyses on a dataset and extract association rules. Tools: Association Rules (apriori) Decision Trees with Percentage Split (C4.5) Decision Trees with Cross Validation (C4.5) Random Forest with Percentage Split Random Forest with Cross Validation FP7, ICT-2011 – 5.3 Page 18 FP7, ICT-2011 – 5.3 Page 19 FP7, ICT-2011 – 5.3 Page 20 Knowledge Extraction Mechanism This mechanism is responsible for storing statistically significant associations and important association rules in the Linked2Safety knowledge database Has two steps: Logging system Storing important knowledge Filtering mechanism This mechanism allows users to insert or delete associations and association rules FP7, ICT-2011 – 5.3 Page 21 Uses the knowledge in the L2S knowledge base Runs in the background to identify new associations and association rules Reruns analyses when updated datasets are available Creates alerts for patients profiles associated with adverse events FP7, ICT-2011 – 5.3 Page 22 FP7, ICT-2011 – 5.3 Page 23 FP7, ICT-2011 – 5.3 Page 24 Overlapping non genetic data of at least 2 data providers: Age Gender BMI Smoking Ever Dyslipidemia Diabetes Diabetes type I Diabetes type II Anemia Depressive personality disorder Major depressive disorder Schizotypal personality disorder FP7, ICT-2011 – 5.3 Variables Weight gain Headaches Gastrointestinal symptoms Ophthalmological problems Type of ophthalmological condition High blood pressure Heart conditions exist Type of heart condition Hypertension Myocardial infarction Stroke Coronary heart disease Page 25 We were able to identify for a given dataset the maximum noise that can be added to the data without significantly affecting the outcomes. Results presented are only relevant to MASTOS, all other datasets need to repeat the analytical approach described to determine the maximum noise that can be added to the results. Further investigation is necessary to identify the minimum parameter settings to satisfy legal and ethical requirements. FP7, ICT-2011 – 5.3 Page 26 Athos Antoniades University of Cyprus email: athos@cs.ucy.ac.cy FP7, ICT-2011 – 5.3 Page 27