Data quality issues in population-based studies; Impacts on harmonisation

09/04/2006
Data Quality/Usability Issues in e-Health: Practice, Process and Policy
Isabel Fortier, Ph.D.
P3G Observatory
Génome Québec

Data quality
• Numbers are always valid without context

Objectives
• Give a brief overview of the factors or contexts influencing data quality in individual studies as well as in aggregated analyses
• Discuss their potential impacts on results
• Initiate discussion on potential ways to increase data quality

Principal focus
• Selection and follow-up of participants
• Information collection and treatment
• Data management

1. Selection and follow-up of participants

Quality of the sample
• Potential bias if the disease or “exposure” status influences the selection of subjects or their participation in the study
• The study sample is “never” totally representative of the reference population

Example
• Participation in a study can give access to better clinical follow-up, or to a comprehensive health evaluation that is not easily accessible through the medical system in place.
• Impact: potentially higher participation of subjects who have health problems or are at risk of developing them. This can lead to an over-representation of unhealthy persons in the database.
• Depending on the context, selection bias can increase or decrease the strength of the association

Possible ways to limit the impact
• Develop good selection and follow-up designs, taking into account scientific validity but also feasibility (costs, infrastructure, organisation of medical care in the country, etc.)
• Identify and keep in mind the limits of the study
• If possible, compare the characteristics of participants and non-participants to identify discrepancies
• Limit the interpretation of results to the appropriate reference population

Aggregation: Selection criteria frequently used in population-based biobanks
• Age group (~100%)
• Country of residence (~100%)
• Residence in specific communities/geographic areas (> 75%)
• Sex (excluding pregnancy) (~10%)
• Pregnancy (~5-10%)
• Employment status (~5-10%)
• etc.
Extracted from P3G Observatory

Aggregation: a simple problem?
Age groups and sampling design; 8 population-based studies
[Figure: age range covered by each study (axis: age, 0 to 90 years) next to its sampling design. Sampling designs, in the order listed: 3 distinct procedures (CS); clinic patients (CS); clinic patients (CS); clinic patients (L); clinic patients (CS); random sampling (L); random sampling (CS); random sampling (L). CS = cross-sectional, L = longitudinal.]
* Different selection bias among studies
Extracted from P3G Observatory

2. Information collection

Quality of the information collected
• It is RARELY possible to classify participants perfectly: health conditions, exposures, socioeconomic status, genetic characteristics, etc.

Participants classification: Validity (J. Lacroix, MSO6027)

Sensitivity 100%, Specificity 100%. REAL prevalence: 10%

                 Disease (« gold standard »)
  Estimation     D        No D     Total
  D              100      0        100
  No D           0        900      900
  Total          100      900      1000

Sensitivity 95%, Specificity 95%. ESTIMATED prevalence: 14%

                 Disease (« gold standard »)
  Estimation     D        No D     Total
  D              95       45       140
  No D           5        855      860
  Total          100      900      1000

Differential and non-differential information bias
• Differential bias: the classification error varies between the groups under study
  Impact: increases OR decreases the strength of the association
• Non-differential bias: the classification error is similar between the groups under study
  Impact: decreases the strength of the association

Information bias: Definition of health status
• Heterogeneity in the expression of the health problem under study: increasing focus on complex diseases, often without efficient tools to define their expression (subtypes, co-morbidity, etc.)
• Evaluators: potential subjectivity of the interviewers, study staff, etc.
• Participants: subjectivity in the responses obtained from participants
• Diagnostic tools and information sources: validity (ability to distinguish between those who have a condition and those who do not)

Pros and cons of health information sources
• Questionnaires
  Pros: accessible, relatively low costs
  Cons: limited quality of information
• Physiological or biochemical measures
  Pros: relatively good validity
  Cons: costs, specific infrastructures needed, cross-sectional measures
• Medical evaluations
  Pros: relatively good validity
  Cons: costs, difficulties in standardising data collection
• Illness registries
  Pros: good validity
  Cons: limits in the completeness of the registries
• Governmental databases
  Pros: low costs, useful for longitudinal follow-up
  Cons: limited quality (administrative databases)

Information sources: 10 population-based studies
[Table: for each study, whether questionnaires, physiological/biological measures and medical records are collected cross-sectionally (CS), longitudinally (L) or both (CS/L), and which registries are used (birth/death, hospitalization/medical, illness registries, medication). The per-study cell values are not recoverable from the extracted text.]
Extracted from P3G Observatory

Governmental databases: An example
• Medical services (administrative DB): for 60% of the consultations: date, professional class and specialty, diagnostic classification (principal reason for the consultation, recorded for payment), etc.
• Hospitalizations (MED-ECHO): for all hospitalizations: date of admission, type of admission, duration of hospitalization, principal diagnosis (ICD-9/10; CIM-9/10 in French), secondary diagnoses, treatments, etc.
• Pharmaceutical services: for 50% of the prescriptions: date, medication class, code, form, dosage and quantity, duration of treatment, physician specialty.
• Deaths: date of death, location, cause of death, etc.

Potential definitions of asthma cases
• Asthma as principal reason for consultation (RAMQ)
  • At least once or several times?
  • Information on consultations available for only 60% of the population
  • Influence of age distribution on follow-up (Cartagène: 24-75 years)
  • Only limited information on secondary diagnoses
• Medication for asthma prescribed
  • Medications are not all specific to asthma
  • Medications are prescribed but not necessarily taken
  • Information available for only 50% of the population
• Asthma as cause of death
  • Asthma will rarely be coded as the cause of death
• Hospitalization for asthma (MED-ECHO)
• Or merge information between databases to define cases

How can this information be compared between studies?
What is the validity of these potential outcomes (sensitivity/specificity)?

Potential impact on results

Risk factor, “gold standard”: OR = 2

            E        Non E
  D         400      200
  No D      600      800
  Total     1000     1000

Study measures: non-differential bias
Sensitivity (E = Non E) = 0.8
Specificity (E = Non E) = 0.9

Risk factor, study measure: OR = 1.6

            E        Non E
  D         380      240
  No D      620      760
  Total     1000     1000

Each observed count follows from the gold-standard table, e.g. 380 = 0.8 × 400 + 0.1 × 600. Note that 2 and 1.6 are the ratios of disease risk between E and Non E in these tables (400/200 and 380/240); the odds ratios computed from the same counts are 2.67 and 1.94. With either measure, the association is weakened.

Aggregation of information from questionnaires: An example
Targeted outcome: Cancers
• Ever had cancer
• Type of cancer
• Onset of symptoms or diagnostic date

Ever had cancer
• Study 1: Have you ever had cancer? Yes, No, I don't know
• Study 2: Have you ever been told by a doctor or other health professional that you had cancer or a malignancy of any kind? Yes, No, Refused, Don't Know
• Study 3: Has a physician ever told you that you had any of the following cancers?
  (List of cancers)
Extracted from P3G Observatory

Type of cancer
• Study 1: What kind of cancer? ____________________________
• Study 2: Has a physician ever told you that you had any of the following cancers? Prostate cancer, Lung or bronchial cancer, Colon or rectal cancer, Bladder cancer, Lymphoma, Other cancer (define)
• Study 3: What kind of cancer was it? BLADDER, BLOOD, BONE, BRAIN, BREAST, CERVIX (CERVICAL), COLON, ESOPHAGUS (ESOPHAGEAL), GALLBLADDER, KIDNEY, LARYNX/WINDPIPE, LEUKEMIA, LIVER, LUNG, LYMPHOMA/HODGKINS' DISEASE, MELANOMA, MOUTH/TONGUE/LIP, NERVOUS SYSTEM, OVARY (OVARIAN), PANCREAS (PANCREATIC), PROSTATE, RECTUM (RECTAL), SKIN (NON-MELANOMA), SKIN (DON'T KNOW WHAT KIND), SOFT TISSUE (MUSCLE OR FAT), STOMACH, TESTIS (TESTICULAR), THYROID, UTERUS (UTERINE), OTHER, MORE THAN 3 KINDS, REFUSED, DON'T KNOW
Extracted from P3G Observatory

Onset of symptoms or diagnostic date
• Study 1: In which year was this ascertained? Year |_|_|_|_| or age at that time |_|_|
• Study 2: How old were you when the cancer was first diagnosed? |___|___|___| age in years, refused, don't know
• Study 3: Prostate cancer: O Never, O Before October 2001, O Oct. 2001 - July 2003, O After July 2003
Extracted from P3G Observatory

Information bias: Risk factors definition
• Tools and procedures used to estimate genetic and environmental factors (questionnaires, laboratory standard operating procedures, technology used, etc.): validity…
• Evaluators/technicians: potential subjectivity of interviewers, bias introduced by study staff, etc.
• Participants (questionnaires): subjectivity in the responses obtained from participants
• Definition of the environmental exposure in time: substantial variations in exposure level during follow-up

Aggregation: An example
Marital status

1. Please indicate your current marital status by ticking the appropriate box.
   SINGLE, MARRIED OR LIVING AS MARRIED, WIDOWED, SEPARATED, DIVORCED

1. Are you now married, widowed, divorced, separated, never married or living with a partner?
   MARRIED, WIDOWED, DIVORCED, SEPARATED, NEVER MARRIED, LIVING WITH PARTNER, REFUSED, DON'T KNOW

2. What is your marital status?
   MARRIED, SINGLE, DIVORCED, WIDOWED
Extracted from P3G Observatory

  STUDY 1                         STUDY 2                STUDY 3
  WIDOWED                         WIDOWED                WIDOWED
  DIVORCED                        DIVORCED               DIVORCED
  MARRIED OR LIVING AS MARRIED    MARRIED                MARRIED
                                  LIVING WITH PARTNER
  SEPARATED                       SEPARATED
  SINGLE                          NEVER MARRIED          SINGLE
Extracted from P3G Observatory

3. Data management

Data management: A complex task
Biobank data:
• Collected and produced in different centers: recruitment centers or clinics, genotyping laboratories, etc.
• Heterogeneous: biochemical and physiological measures, genealogies, genotypes, etc.
• Various formats: databases, paper, electronic, XML, etc.; various codification rules between centers
• Extensive: large number of participants, longitudinal data, new high-throughput genotyping technologies
• Confidential information

Data management: Impacts on quality
• Data entry: carried out by staff with various backgrounds, who are not necessarily aware of the consequences of inexact or incomplete data for the overall quality of the study
• Identification and manipulation of samples and questionnaires (DNA extraction, biochemical analysis, etc.): numerous manipulations, hence a substantial potential for errors in the processes
• Key generation and management (identification codes): a complex and crucial procedure to protect identity and avoid errors in the correspondence between identification numbers and individuals

Impacts on quality…
• Size of databases: millions of genotypes correlated with millions of epidemiological outcomes; increasing complexity of data transfer, storage, query and analysis
• Validation: essential to ensure continuous quality control, including cross-validations, statistical validations, etc.
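As one illustration of the kind of statistical check mentioned above, the misclassification arithmetic behind the validity and risk-factor tables in section 2 can be reproduced in a few lines. This is a minimal sketch, not part of the original presentation: the function names are hypothetical, and exact fractions are used only to avoid rounding noise. The figures 2 and 1.6 shown with the risk-factor tables correspond to ratios of risks; the odds ratios from the same counts are computed alongside for comparison.

```python
from fractions import Fraction


def observed_cases(diseased, healthy, sensitivity, specificity):
    """Expected number of subjects classified as diseased by an imperfect measure."""
    true_positives = sensitivity * diseased
    false_positives = (1 - specificity) * healthy
    return true_positives + false_positives


def risk_ratio(cases_e, total_e, cases_ne, total_ne):
    """Risk of disease among the exposed divided by risk among the non-exposed."""
    return (Fraction(cases_e) / total_e) / (Fraction(cases_ne) / total_ne)


def odds_ratio(cases_e, total_e, cases_ne, total_ne):
    """Odds of disease among the exposed divided by odds among the non-exposed."""
    odds_e = Fraction(cases_e) / (total_e - cases_e)
    odds_ne = Fraction(cases_ne) / (total_ne - cases_ne)
    return odds_e / odds_ne


# Validity table: 100 diseased and 900 non-diseased, measure with Se = Sp = 95%.
se95 = Fraction(95, 100)
estimated_cases = observed_cases(100, 900, se95, se95)   # 95 + 45 = 140
estimated_prevalence = estimated_cases / 1000            # 140 / 1000 = 14%

# Risk-factor table: non-differential error, Se = 0.8 and Sp = 0.9 in both groups.
se, sp = Fraction(8, 10), Fraction(9, 10)
obs_exposed = observed_cases(400, 600, se, sp)           # 320 + 60 = 380
obs_unexposed = observed_cases(200, 800, se, sp)         # 160 + 80 = 240

true_rr = risk_ratio(400, 1000, 200, 1000)
obs_rr = risk_ratio(obs_exposed, 1000, obs_unexposed, 1000)
true_or = odds_ratio(400, 1000, 200, 1000)
obs_or = odds_ratio(obs_exposed, 1000, obs_unexposed, 1000)

print(f"estimated prevalence: {float(estimated_prevalence):.0%}")
print(f"risk ratio: {float(true_rr):.2f} -> {float(obs_rr):.2f}")
print(f"odds ratio: {float(true_or):.2f} -> {float(obs_or):.2f}")
```

Both measures of association move towards 1 under this non-differential error, which is the attenuation described in section 2.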
Conclusion
• Carrying out a population-based study: a complex task, generally achieved with limited resources
• To obtain sufficient statistical power to investigate the impact of genes and environment on complex diseases, aggregation of data between studies will often be needed.

How to facilitate aggregation of information?
• Facilitate exchange of expertise and merging of efforts (networks of collaboration)
• Allow easy access to relevant protocols, procedures, questionnaires, etc.
• If relevant, facilitate the development of common standard operating procedures or tools
• Work prospectively, for aggregation of future studies
Different initiatives…

P3G Observatory: A tool among others
DESIGN OF STUDIES; ETHICS AND GOVERNANCE; INFORMATION COLLECTION/TREATMENT; INFORMATION TECHNOLOGY; DATA ANALYSIS

GENERAL TOOLS
Repository of reference tools and documentation to support the activities of population-based biobanks. For example:
• Repository of standard operating procedures or good practice guides
• Methodological documentation in epidemiology or genomics
• Reference tools in statistics
• Information technology applications (OpenBiobank)
• Repository of useful websites
• Etc.

TOOLS PERTAINING TO STUDIES
Descriptive information on worldwide population-based biobanks, and tools for comparison and harmonisation between specific biobanks.
• Description: for targeted studies, description of methods, data, ethics and governance rules, operating procedures, etc.
• Comparison: tools for comparing, between targeted studies, the information collected or produced and the procedures used.
• Harmonisation: specific models and procedures for harmonising the information collected or produced between subgroups of studies.
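To make the idea of a harmonisation procedure concrete, the marital status example from section 2 can be written as a recoding table per study. This is a minimal sketch under stated assumptions, not a P3G model: the target categories, the study keys and the function name are hypothetical, and treating SINGLE and NEVER MARRIED as equivalent is itself an assumption that a real harmonisation would need to justify.

```python
from typing import Optional

# Hypothetical common categories chosen for illustration only.
# None marks a response that carries no usable information (treated as missing).
RECODING = {
    "study1": {
        "SINGLE": "single",
        "MARRIED OR LIVING AS MARRIED": "married_or_partnered",
        "WIDOWED": "widowed",
        "SEPARATED": "separated",
        "DIVORCED": "divorced",
    },
    "study2": {
        "MARRIED": "married_or_partnered",
        "LIVING WITH PARTNER": "married_or_partnered",
        "WIDOWED": "widowed",
        "DIVORCED": "divorced",
        "SEPARATED": "separated",
        "NEVER MARRIED": "single",  # assumption: same meaning as SINGLE
        "REFUSED": None,
        "DON'T KNOW": None,
    },
    "study3": {
        # Study 3 offers no SEPARATED option, so that category cannot be recovered.
        "MARRIED": "married_or_partnered",
        "SINGLE": "single",
        "DIVORCED": "divorced",
        "WIDOWED": "widowed",
    },
}


def harmonise_marital_status(study: str, response: str) -> Optional[str]:
    """Map a study-specific response onto the common categories."""
    table = RECODING[study]
    key = response.strip().upper()
    if key not in table:
        # Fail loudly instead of guessing: unknown codes need a human decision.
        raise ValueError(f"{study}: no recoding rule for {response!r}")
    return table[key]


print(harmonise_marital_status("study1", "Married or living as married"))
print(harmonise_marital_status("study2", "Living with partner"))
print(harmonise_marital_status("study3", "Single"))
```

The mapping is lossy by construction: Study 1 cannot separate married from cohabiting participants, and Study 3 cannot identify separated participants, so any pooled analysis is limited to the coarsest categories the studies share.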
OpenBiobank Application Suite
[Diagram: numbered components]
1. Participants Recruitment and Evaluation Management Application
2. Epidemiological Data Application
3. Samples Management Application
4. Biochemical Analysis Application
5. Genotyping Application
6. Integration Application
7. Genetic Statistical Package
Other elements shown in the diagram: Projects & Contacts Management; Data Staging & Validation; Identification protection / Anonymization; DATA WAREHOUSE; Data Querying and Reporting; Data from different sources: Governmental DB, Project Results, etc.

Thank you!