Data quality issues in population-based studies; Impacts on harmonisation

09/04/2006
Data Quality/Usability Issues in e-Health:
Practice, Process and Policy
Isabel Fortier, Ph.D.
P3G Observatory
Génome Québec
Data quality
Numbers are never valid without context
Objectives



Give a brief overview of the factors or contexts
influencing data quality in studies as well as in
aggregated analysis
Discuss their potential impacts on results
Initiate discussion on potential ways to increase data
quality
Principal focus

Selection and follow-up of participants

Information collection and treatment

Data management
1. Selection and follow-up of participants
Quality of the sample

Potential bias if disease or “exposure” status influences the
selection of subjects or their participation in the study

The study sample is “never” totally representative of the reference
population
Example
Participation in a study can allow access to a better clinical follow-up
or to a comprehensive evaluation of health not easily accessible
through the medical system in place.
Impact: potentially higher participation of subjects who have health
problems or are at risk of developing them. This can lead to an
over-representation of unhealthy persons in the database.
According to the context, selection bias can
increase or decrease the strength of the association
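A minimal numerical sketch of this point, assuming a hypothetical source population and hypothetical participation rates (none of these figures come from a real study). When participation depends on both disease and exposure status, the odds ratio observed in the sample moves away from the population value, in either direction.

```python
def odds_ratio(exposed_cases, unexposed_cases, exposed_noncases, unexposed_noncases):
    """Odds ratio of a 2x2 table given its four cell counts."""
    return (exposed_cases * unexposed_noncases) / (unexposed_cases * exposed_noncases)


def apply_participation(table, rates):
    """Expected sample counts when each cell participates at its own rate."""
    return tuple(count * rate for count, rate in zip(table, rates))


# Hypothetical source population:
# (exposed cases, unexposed cases, exposed non-cases, unexposed non-cases)
population = (200, 100, 800, 900)

# Hypothetical scenario A: exposed cases participate more often (90% vs 50%).
sample_a = apply_participation(population, (0.9, 0.5, 0.5, 0.5))
# Hypothetical scenario B: unexposed cases participate more often (90% vs 50%).
sample_b = apply_participation(population, (0.5, 0.9, 0.5, 0.5))

or_population = odds_ratio(*population)
or_a = odds_ratio(*sample_a)  # association looks stronger than in the population
or_b = odds_ratio(*sample_b)  # association looks weaker than in the population

print(round(or_population, 2), round(or_a, 2), round(or_b, 2))
```

If participation depended on disease status only, or on exposure status only, the participation rates would cancel out of the odds ratio and it would be unchanged.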
Possible ways to limit the impact

Develop good selection and follow-up designs
Taking into account scientific validity but also feasibility (costs,
infrastructure, organisation of medical care in the country, etc.)

Identify and keep in mind the limits of the study
  • If possible, compare the characteristics of participants and
    non-participants to identify discrepancies
  • Limit the interpretation of results to the appropriate reference population
Aggregation: Selection criteria frequently
used in population-based biobanks







Age group (~100%)
Country of residence (~100%)
Residence in specific communities/geographic areas
(> 75%)
Sex (excluding pregnancy) (~10%)
Pregnancy (~ 5-10%)
Employment status (~ 5-10%)
etc.
Extracted from P3G Observatory
Aggregation: a simple problem?
Age groups and sampling design; 8 population-based studies
[Figure: age range covered by each study (age axis from 0 to 90 years) and its sampling design. Designs shown: 3 distinct procedures (CS); clinic patients (CS, CS, L, CS); random sampling (L, CS, L).]
* Different selection bias among studies
Extracted from P3G Observatory
2. Information collection
Quality of the information collected
It is RARELY possible to
perfectly classify participants :
Health conditions,
Exposures,
Socioeconomic status,
Genetic characteristics,
etc.
Participants classification
J. Lacroix, MSO6027
Validity

Sensitivity 100%, Specificity 100%
REAL Prevalence: 10%

                  Disease « gold standard »
Estimation        D        No D     Total
D                 100      0        100
No D              0        900      900
Total             100      900      1000

Sensitivity 95%, Specificity 95%
ESTIMATED Prevalence: 14%

                  Disease « gold standard »
Estimation        D        No D     Total
D                 95       45       140
No D              5        855      860
Total             100      900      1000
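The second table can be reproduced with a short calculation. This is a sketch only: the function and variable names are illustrative, and the inputs are the figures shown on the slide (100 diseased and 900 non-diseased subjects, sensitivity and specificity of 95%).

```python
def classify(n_diseased, n_healthy, sensitivity, specificity):
    """Expected 2x2 counts when an imperfect tool classifies a population."""
    true_pos = n_diseased * sensitivity   # diseased, classified D
    false_neg = n_diseased - true_pos     # diseased, classified No D
    true_neg = n_healthy * specificity    # healthy, classified No D
    false_pos = n_healthy - true_neg      # healthy, classified D
    total = n_diseased + n_healthy
    return {
        "classified_D": true_pos + false_pos,
        "classified_no_D": false_neg + true_neg,
        "real_prevalence": n_diseased / total,
        "estimated_prevalence": (true_pos + false_pos) / total,
    }


perfect = classify(100, 900, sensitivity=1.0, specificity=1.0)
imperfect = classify(100, 900, sensitivity=0.95, specificity=0.95)

print(round(perfect["estimated_prevalence"], 2))    # 0.1
print(round(imperfect["estimated_prevalence"], 2))  # 0.14
```

Because non-diseased subjects outnumber diseased subjects nine to one, the 45 false positives outweigh the 5 false negatives, and the estimated prevalence rises from 10% to 14%.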
Differential and non-differential information bias
Differential bias
Variation of classification error between groups under study
Impact: Increase OR decrease association strength
Non-differential bias
Similar classification error between groups under study
Impact: Decrease association strength
Information bias:
Definition of health status




Heterogeneity in the expression of the health problem under study
  • Increasing focus on complex diseases, often without efficient
    tools to define expression (subtypes, co-morbidity, etc.)
Evaluators
  • Potential subjectivity of the interviewers, study staff, etc.
Participants
  • Subjectivity in the responses obtained from participants
Diagnostic tools and information sources
  • Validity (ability to distinguish between who has a condition and
    who does not)
Pros and cons of health information sources
Questionnaires
  + Accessible, relatively low costs
  - Limited quality of information
Physiological or biochemical measures
  + Relatively good validity
  - Costs, specific infrastructures needed, cross-sectional measures
Medical evaluations
  + Relatively good validity
  - Costs, difficulties for standardisation of data collection
Illness registries
  + Good validity
  - Limits in completeness of the registries
Governmental databases
  + Low costs, useful for longitudinal follow-up
  - Limited quality (administrative databases)
Information sources: 10 population-based studies
[Table: codes (CS, L, CS/L) under the columns Questionnaires, Physiol/Biological Measures and Medical Records, plus a Registries column. Rows are listed in their original order; the codes cannot be assigned to individual columns with certainty.]
L, L, L; Registries: Birth / Death - Hospitalization / Medical
CS, CS, CS/L, CS/L, CS/L; Registries: Birth / Death - Hospitalization / Medical, Illness Registries - Medication
CS, CS, CS; Registries: Birth / Death - Hospitalization / Medical, Illness Registries - Medication
CS, CS, CS; Registries: Birth / Death - Hospitalization / Medical, Illness Registries - Medication
CS/L, CS/L, CS/L; Registries: Birth / Death - Illness Registries
L, L, L; Registries: Birth / Death - Hospitalization / Medical, Illness Registries - Medication
CS; Registries: Birth / Death - Hospitalization / Medical, Illness Registries - Medication
L, L, L, L; Registries: Birth / Death - Hospitalization / Medical
CS, CS, L; Registries: Birth / Death - Hospitalization / Medical, Illness Registries - Medication
Extracted from P3G Observatory
Governmental databases: An example




Medical services (administrative DB): for 60% of the consultations:
  • Date, professional class and specialty, diagnostic classification
    (principal reason for the consultation, recorded for payment), etc.
Hospitalizations (MED-ECHO): for all hospitalizations:
  • Date of admission, type of admission, hospitalization duration,
    principal diagnosis (ICD-9/10), secondary diagnoses,
    treatments, etc.
Pharmaceutical services: for 50% of the prescriptions:
  • Date, medication class, code, form, dosage and quantity,
    duration of treatment, physician specialty.
Deaths:
  • Date of death, location, cause of death, etc.
Potential definitions of asthma cases





Asthma as principal reason for consultation (RAMQ)
  • At least once, or several times?
  • Information on consultations available for only 60% of the population
  • Influence of age distribution on follow-up (Cartagène: 24-75 years)
  • Only limited information on secondary diagnoses
Medication for asthma prescribed
  • Medications not all specific to asthma
  • Medications prescribed but not necessarily taken
  • Information available for only 50% of the population
Asthma as cause of death
  • Asthma will rarely be coded as the cause of death
Hospitalization for asthma (MED-ECHO)
Or merge information between databases to define cases
How can this information be compared between studies?
What is the validity of these potential outcomes (sensitivity/specificity)?
Potential impact on results

“Gold standard”
Risk ratio = 0.40 / 0.20 = 2 (odds ratio = 2.67)

                  Risk Factor
Gold standard     E        Non E
D                 400      200
No D              600      800
Total             1000     1000

Study measures: non-differential bias
Sensitivity (E = Non E) = 0.8
Specificity (E = Non E) = 0.9
Risk ratio = 0.38 / 0.24 = 1.6 (odds ratio = 1.94)

                  Risk Factor
Study measure     E        Non E
D                 380      240
No D              620      760
Total             1000     1000
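The study-measure counts follow from applying the same sensitivity (0.8) and specificity (0.9) to the gold-standard counts in both exposure groups. The sketch below, with illustrative function names, reproduces them and computes both the risk ratio and the odds ratio. With these counts the ratio of risks is 2 under the gold standard and about 1.6 after misclassification; the corresponding odds ratios are about 2.67 and 1.94.

```python
def misclassify(diseased, healthy, sensitivity, specificity):
    """Expected observed (D, No D) counts for one exposure group."""
    observed_d = diseased * sensitivity + healthy * (1 - specificity)
    return observed_d, (diseased + healthy) - observed_d


def risk_ratio(d_exp, no_d_exp, d_unexp, no_d_unexp):
    """Risk of disease in the exposed divided by risk in the unexposed."""
    return (d_exp / (d_exp + no_d_exp)) / (d_unexp / (d_unexp + no_d_unexp))


def odds_ratio(d_exp, no_d_exp, d_unexp, no_d_unexp):
    """Odds of disease in the exposed divided by odds in the unexposed."""
    return (d_exp * no_d_unexp) / (no_d_exp * d_unexp)


# Gold-standard counts from the slide: (D, No D) for E and for Non E.
gold_e, gold_non_e = (400, 600), (200, 800)

# Non-differential error: the same sensitivity and specificity in both groups.
study_e = misclassify(*gold_e, sensitivity=0.8, specificity=0.9)
study_non_e = misclassify(*gold_non_e, sensitivity=0.8, specificity=0.9)

rr_gold = risk_ratio(*gold_e, *gold_non_e)
rr_study = risk_ratio(*study_e, *study_non_e)
or_gold = odds_ratio(*gold_e, *gold_non_e)
or_study = odds_ratio(*study_e, *study_non_e)

print(round(rr_gold, 2), round(rr_study, 2))
print(round(or_gold, 2), round(or_study, 2))
```

Both measures move toward 1, which is the attenuation expected from non-differential classification error.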
Aggregation of information
from questionnaires: An example
Targeted outcome: Cancers



Ever had cancer
Type of cancer
Onset of symptoms or date of diagnosis
Ever had Cancer
Study 1
Have you ever had cancer?
Yes, No, I don't know
Study 2
Have you ever been told by a doctor or other health professional
that you had cancer or a malignancy of any kind?
Yes, No, Refused, Don't Know
Study 3
Has a physician ever told you that you had any of the following
cancers?
[List of cancers]
Extracted from P3G Observatory
Type of cancer
Study 1
What kind of cancer?____________________________
Study 2
Has a physician ever told you that you had any of the following cancers?
Prostate cancer, Lung or bronchial cancer, Colon or rectal cancer,
Bladder cancer, Lymphoma, Other cancer (define)
Study 3
What kind of cancer was it?
BLADDER
BLOOD
BONE
BRAIN
BREAST
CERVIX (CERVICAL)
COLON
ESOPHAGUS (ESOPHAGEAL)
GALLBLADDER
KIDNEY
LARYNX/WINDPIPE
LEUKEMIA
LIVER
LUNG
LYMPHOMA/HODGKINS' DISEASE
MELANOMA
MOUTH/TONGUE/LIP
NERVOUS SYSTEM
OVARY (OVARIAN)
PANCREAS (PANCREATIC)
PROSTATE
RECTUM (RECTAL)
SKIN (NON-MELANOMA)
SKIN (DON'T KNOW WHAT KIND)
SOFT TISSUE (MUSCLE OR FAT)
STOMACH
TESTIS (TESTICULAR)
THYROID
UTERUS (UTERINE)
OTHER
MORE THAN 3 KINDS
REFUSED
DON'T KNOW
Extracted from P3G Observatory
Onset of symptoms or date of diagnosis
Study 1
In which year was this ascertained?
Year |_|_|_|_| or age at that time |_|_|
Study 2
How old were you when the cancer was first diagnosed?
|___|___|___| age in years, refused, don't know
Study 3
Prostate cancer
O Never
O Before October 2001
O Oct. 2001 - July 2003
O After July 2003
Extracted from P3G Observatory
Information bias:
Risk factors definition




Tools and procedures used for estimation of genetic and
environmental factors (questionnaires, laboratory standard
operating procedures, technology used, etc.)
  • Validity…
Evaluators/technicians
  • Potential subjectivity of interviewers, bias introduced by
    study staff, etc.
Participants (questionnaires)
  • Subjectivity in the responses obtained from participants
Definition of the environmental exposure in time
  • Substantial variations in exposure level during follow-up
Aggregation: An example
Marital Status
Study 1
Please indicate your current marital status by ticking the
appropriate box.
SINGLE, MARRIED OR LIVING AS MARRIED, WIDOWED,
SEPARATED, DIVORCED
Study 2
Are you now married, widowed, divorced, separated, never married
or living with a partner?
MARRIED, WIDOWED, DIVORCED, SEPARATED, NEVER
MARRIED, LIVING WITH PARTNER,
REFUSED, DON'T KNOW
Study 3
What is your marital status?
MARRIED, SINGLE, DIVORCED, WIDOWED
Extracted from P3G Observatory
STUDY 1                        STUDY 2               STUDY 3
WIDOWED                        WIDOWED               WIDOWED
DIVORCED                       DIVORCED              DIVORCED
MARRIED OR LIVING AS MARRIED   MARRIED               MARRIED
                               LIVING WITH PARTNER
SEPARATED                      SEPARATED
SINGLE                         NEVER MARRIED         SINGLE
Extracted from P3G Observatory
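One possible way to express this correspondence is a per-study recoding table. The sketch below is illustrative only: the harmonised category names, the study keys and the `harmonise` function are assumptions made for the example, not a P3G specification. The response options are the ones listed for Studies 1 to 3.

```python
# Harmonised target categories (an illustrative choice, not a P3G standard).
PARTNERED = "MARRIED_OR_PARTNERED"

MAPPINGS = {
    "study1": {
        "SINGLE": "SINGLE",
        "MARRIED OR LIVING AS MARRIED": PARTNERED,
        "WIDOWED": "WIDOWED",
        "SEPARATED": "SEPARATED",
        "DIVORCED": "DIVORCED",
    },
    "study2": {
        "MARRIED": PARTNERED,
        "LIVING WITH PARTNER": PARTNERED,
        "WIDOWED": "WIDOWED",
        "DIVORCED": "DIVORCED",
        "SEPARATED": "SEPARATED",
        "NEVER MARRIED": "SINGLE",
        "REFUSED": None,       # treated as missing
        "DON'T KNOW": None,    # treated as missing
    },
    # Study 3 offers no SEPARATED or partner option, so separated or
    # cohabiting participants cannot be identified in that study.
    "study3": {
        "MARRIED": PARTNERED,
        "SINGLE": "SINGLE",
        "DIVORCED": "DIVORCED",
        "WIDOWED": "WIDOWED",
    },
}


def harmonise(study, answer):
    """Return the harmonised category, or None if missing or unmapped."""
    return MAPPINGS[study].get(answer.strip().upper())


print(harmonise("study2", "Never married"))
print(harmonise("study1", "Married or living as married"))
```

The recoding can only go as fine as the coarsest questionnaire allows: married and cohabiting participants have to be pooled because Study 1 does not separate them.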
3. Data management
Data management: A complex task
Biobank data
  • Collected and produced in different centers
      Recruitment centers or clinics, genotyping laboratories, etc.
  • Heterogeneous
      Biochemical and physiological measures, genealogies,
      genotypes, etc.
  • Various formats
      Databases, papers, electronic, XML, etc.
      Various codification rules between centers
  • Extensive
      Large number of participants
      Longitudinal data
      New high-throughput genotyping technologies
  • Confidential information
Data Management:
Impacts on quality



Data entry
  • Carried out by staff with various backgrounds, not necessarily
    aware of the consequences of inexact or incomplete data for the
    overall quality of the study
Sample and questionnaire identification and handling (DNA
extraction, biochemical analysis, etc.)
  • High potential for errors in the processes, given the numerous
    manipulations
Key generation and management (identification codes)
  • A complex and crucial procedure to protect identity and avoid
    errors in the correspondence between identification numbers and
    individuals
Impacts on quality…


Size of databases
  • Millions of genotypes correlated with millions of epidemiological
    outcomes
  • Increasing complexity of data transfer, storage, query and
    analysis
Validation
  • Essential to ensure continuous quality control, including
    cross-validations, statistical validations, etc.
Conclusion
Carrying out a population-based study:
A complex task, generally achieved with limited resources
To obtain sufficient statistical power to investigate the impact of
genes and environment on complex diseases, aggregation of data
between studies will often be needed.
How to facilitate aggregation of information?




Facilitate exchange of expertise and merging of efforts (networks of
collaboration)
Allow easy access to relevant protocols, procedures,
questionnaires, etc.
If relevant, facilitate development of common standard operating
procedures or tools
Work prospectively, for aggregation of future studies
Different initiatives…
P3G Observatory
A tool among others
Domains covered: DESIGN OF STUDIES; ETHICS AND GOVERNANCE; INFORMATION COLLECTION/TREATMENT; INFORMATION TECHNOLOGY; DATA ANALYSIS

GENERAL TOOLS
Repository of reference tools and documentation to support
population-based biobank activities. For example:
• Repository of standard operating procedures or good practice guides
• Methodological documentation in epidemiology or genomics
• Reference tools in statistics
• Information technology applications (OpenBiobank)
• Repository of useful websites
• Etc.

TOOLS PERTAINING TO STUDIES
Descriptive information on worldwide population-based biobanks and tools
for comparison and harmonisation between specific biobanks.
Description:
For targeted studies, description of methods, data, ethics and
governance rules, operating procedures, etc.
Comparison:
Tools for comparison, between targeted studies, of the information
collected or produced and of procedures used.
Harmonisation:
Specific models and procedures for harmonisation of the information
collected or produced between subgroups of studies.
OpenBiobank Application Suite
[Diagram: numbered applications and the other elements shown]
1. Participants Recruitment and Evaluation Management Application
2. Epidemiological Data Application
3. Samples Management Application
4. Biochemical Analysis Application
5. Genotyping Application
6. Integration Application
7. Genetic Statistical Package
Other elements: Projects & Contacts Management; Data Staging & Validation; Identification protection / Anonymization; DATA WAREHOUSE; Data Querying and Reporting
Data from different sources: Governmental DB, Project Results, etc.
Thank you !