Powerpoint - The BIAS project

Hierarchical models for combining multiple data sources measured at individual and small area levels
Chris Jackson
With Nicky Best and Sylvia Richardson
Department of Epidemiology and Public Health
Imperial College, London
chris.jackson@imperial.ac.uk
BIAS project
http://www.bias-project.org.uk
Outline
Infer some individual-level relationship, e.g. influence of
individual socio-economic circumstances on risk of ill
health
 Use combination of datasets, individual and aggregate, to
answer the question.
 Multi-level models on multi-level data.
Examples:
 Hospital admission for cardiovascular disease and sociodemographic factors
 Low birth weight and air pollution

Combining different forms of observational data

AGGREGATE
Sources
• Census
• National registers
• Environmental monitors
Advantages
• Abundant, routinely collected
• Covers whole population
• Can study small-area variations
Disadvantages
• Ecological bias
• Distinguishing individual from area-level effects
• Not many variables

INDIVIDUAL
Sources
• Surveys
• Cohort studies
• Case-control
• Census SAR
Advantages
• Direct information on exposure-outcome relationship
• More variables available
Disadvantages
• Low power
• Little geographical information (confidentiality)

COMBINED
Advantages
• Reduce confounding and bias
• Maximise power
• Separate individual and area-level effects
Disadvantages
• Conflicts between information from each
Example 1: Cardiovascular hospitalisation
Question
 Socio-demographic predictors of hospitalisation for heart and
circulatory disease for individuals
 Is there any evidence of contextual effects (area-level as well as
individual predictors)?
Design
Data synthesis using
 Area-level administrative data: hospital episode statistics and census
small-area statistics
 Individual-level survey data: Health Survey for England.
Issue
 Reduce ecological bias and improve power, compared to using
datasets singly.
Example 2: Low birth weight and pollution
Question
 Influence of traffic-related air pollution (PM10, NO2, CO) on risk of
intrauterine growth retardation (low birth weight)
Design
Data synthesis using two individual-level datasets
 National births register, 2000. (~600,000 births)
 Millennium Cohort Study. (~20,000 births)
Issue
 Geographical identifiers (giving pollution exposure) and outcome are available
for both datasets
 Important confounders (maternal age, smoking, ethnicity…) only
available in the small dataset. Combine to increase power.
Multilevel models for individual and area data
Most commonly used to model
 individual-level outcomes $y_{ij}$ (individual j, area i)
in terms of
 individual-level predictors $x_{ij}$
 group-level (e.g. area-level) predictors $x_i$
 Allow baseline risk (possibly also covariate effects) to vary by area:
$y_{ij} \sim \alpha_i + \gamma x_{ij} + \beta x_i$
However
We want to model area-level outcomes $y_i$ as well as individual outcomes $y_{ij}$
Modelling the area-level outcome
[Diagram, two panels. First panel: individual exposure $x_{ij}$, aggregate exposure $x_i$, individual outcome $y_{ij}$. Second panel: the same three quantities plus the aggregate outcome $y_i$.]
Ecological inference


Determining individual-level exposure-outcome relationships using aggregate data.
A simple ecological model:
$Y_i \sim \mathrm{Binomial}(p_i, N_i), \quad \mathrm{logit}(p_i) = \alpha + \beta X_i$
$Y_i$ is the number of disease cases in area i
$N_i$ is the population in area i
$X_i$ is the proportion of individuals in area i with e.g. low social class
$p_i$ is the area-specific disease rate
$\exp(\beta)$ = odds ratio associated with exposure $X_i$
This is the group-level association. It is not necessarily equal to the individual-level association → ecological bias
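As a rough sketch of the simple ecological model, the binomial log-likelihood can be written in a few lines of Python. The function name `ecological_loglik`, the parameter values and the four areas below are illustrative assumptions, not data from the study.

```python
import math

def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))

def ecological_loglik(alpha, beta, cases, pops, props):
    """Binomial log-likelihood (up to an additive constant) of
    Y_i ~ Binomial(p_i, N_i), logit(p_i) = alpha + beta * X_i."""
    total = 0.0
    for y, n, x in zip(cases, pops, props):
        p = expit(alpha + beta * x)
        total += y * math.log(p) + (n - y) * math.log(1.0 - p)
    return total

# Hypothetical areas: proportion exposed X_i and population N_i.
props = [0.1, 0.3, 0.5, 0.7]
pops = [1000, 1200, 900, 1100]
# Case counts set to their rounded expected values under alpha = -3, beta = 1.
cases = [round(n * expit(-3.0 + 1.0 * x)) for n, x in zip(pops, props)]

ll_generating = ecological_loglik(-3.0, 1.0, cases, pops, props)
ll_no_effect = ecological_loglik(-3.0, 0.0, cases, pops, props)
print(ll_generating > ll_no_effect)
```

Maximising this function over alpha and beta gives the group-level association only. It says nothing by itself about whether that association matches the individual-level one.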
Ecological bias
Bias in ecological studies can be caused by:
• Confounding, as in all observational studies
   • Confounders can be area-level (between-area) or individual-level (within-area).
   • Solution: try to account for confounders.
• Non-linear exposure-response relationship, combined with within-area variability of exposure
   • No bias if exposure is constant in area (contextual effect)
   • Bias increases as within-area variability increases
   • …unless models are refined to account for this hidden variability
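The effect of within-area variability can be illustrated numerically. The sketch below assumes a single binary exposure and an individual-level logistic model with made-up parameters (alpha = -3, beta = 2). It compares the true area-level risk, obtained by averaging individual risks, with the risk implied by plugging the exposed proportion straight into the logistic model. All names and values are hypothetical.

```python
import math

def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))

ALPHA, BETA = -3.0, 2.0  # assumed individual-level baseline and log odds ratio

def true_area_risk(prop_exposed):
    """Average of the individual risks over a binary exposure within the area."""
    return (1.0 - prop_exposed) * expit(ALPHA) + prop_exposed * expit(ALPHA + BETA)

def naive_area_risk(prop_exposed):
    """Risk implied by applying the individual model to the exposed proportion."""
    return expit(ALPHA + BETA * prop_exposed)

for x in (0.0, 0.25, 0.5, 0.75, 1.0):
    gap = true_area_risk(x) - naive_area_risk(x)
    print(f"X_i = {x:.2f}  true = {true_area_risk(x):.4f}  "
          f"naive = {naive_area_risk(x):.4f}  gap = {gap:+.4f}")
```

The gap vanishes when the exposed proportion is 0 or 1, where exposure is constant within the area, and is non-zero in between.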
Improving ecological inference



Alleviate bias associated with within-area exposure variability.
Get some information on the within-area distribution $f_i(x)$ of exposures, e.g. from individual-level exposure data.
Use this to form a well-specified model for ecological data by integrating the underlying individual-level model:
$Y_i \sim \mathrm{Binomial}(p_i, N_i), \quad p_i = \int p_{ik}(x) f_i(x)\,dx$
$p_i$ is the average group-level risk
$p_{ik}(x)$ is the individual-level model (e.g. logistic regression)
$f_i(x)$ is the distribution of exposure x within area i (or the joint distribution of multiple exposures)
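The integral can be approximated numerically. The sketch below assumes, purely for illustration, a single continuous exposure that is normally distributed within the area and a logistic individual-level model, and uses a simple midpoint rule. The function `area_risk` and its arguments are hypothetical names, not part of the original work.

```python
import math

def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))

def area_risk(alpha, beta, mean, sd, n_steps=2000, width=8.0):
    """Approximate p_i = integral of expit(alpha + beta * x) * f_i(x) dx,
    with f_i assumed Normal(mean, sd^2), by a midpoint rule on mean +/- width * sd."""
    lower = mean - width * sd
    step = 2.0 * width * sd / n_steps
    total = 0.0
    for k in range(n_steps):
        x = lower + (k + 0.5) * step
        density = math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))
        total += expit(alpha + beta * x) * density * step
    return total

# With almost no within-area variability the result is close to the plug-in risk.
print(area_risk(-3.0, 1.0, 0.0, 0.001), expit(-3.0))
# With substantial variability the averaged risk moves away from it.
print(area_risk(-3.0, 1.0, 0.0, 1.0))
```

For several exposures the same idea applies to the joint distribution, although a discrete cross-classification, as used later in these slides, turns the integral into a weighted sum.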
When ecological inference can work


• Using a well-specified model
• Information on within-area distribution of exposure
   • Information, e.g. from a sample of individual exposures, to estimate the unbiased model that accounts for this distribution.
• High between-area contrasts in exposure
   • Information on the variation in outcome between areas with low exposure rates and high exposure rates
   • E.g. to determine ethnic differences in health, better to study areas in London (more diverse) than areas in a rural region.
When there is insufficient information in ecological data:
 May be able to incorporate individual-level exposure-outcome data…
Hierarchical related regression
Infer individual-level relationships using both individual and aggregate data
Individual-level model
 Logistic regression for individual-level outcome
 Includes individual or area-level predictors
 Use this to
 model the individual-level data
 construct correct model for aggregate data
Model for aggregate data
Based on averaging the individual model over the within-area joint
distribution of covariates.
 Alleviates ecological bias.

Combined model
Individual and aggregate data assumed to be generated by the
same baseline and relative risk parameters.
 Estimate these parameters using both datasets simultaneously
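A minimal sketch of the combined model, under simplifying assumptions that are not in the slides: one binary exposure, a single baseline alpha shared by all areas rather than an area-specific one, and individual data treated as independent of the aggregate counts. The combined log-likelihood is then the sum of an individual-level Bernoulli term and an aggregate binomial term that share alpha and beta. All data values are made up.

```python
import math

def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))

def individual_loglik(alpha, beta, individuals):
    """Bernoulli log-likelihood for (exposure, outcome) pairs, exposure in {0, 1}."""
    total = 0.0
    for x, y in individuals:
        p = expit(alpha + beta * x)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total

def aggregate_loglik(alpha, beta, areas):
    """Binomial log-likelihood (up to a constant) for (cases, population,
    proportion exposed), averaging the individual model within each area."""
    total = 0.0
    for cases, pop, prop in areas:
        p = (1.0 - prop) * expit(alpha) + prop * expit(alpha + beta)
        total += cases * math.log(p) + (pop - cases) * math.log(1.0 - p)
    return total

def combined_loglik(alpha, beta, individuals, areas):
    """Both data sources are driven by the same alpha and beta."""
    return individual_loglik(alpha, beta, individuals) + aggregate_loglik(alpha, beta, areas)

# Hypothetical survey sample: 200 unexposed (10 cases), 200 exposed (54 cases).
individuals = [(0, 0)] * 190 + [(0, 1)] * 10 + [(1, 0)] * 146 + [(1, 1)] * 54
# Hypothetical areas: (cases, population, proportion exposed).
areas = [(70, 1000, 0.1), (158, 1000, 0.5), (247, 1000, 0.9)]

print(combined_loglik(-3.0, 2.0, individuals, areas))
print(combined_loglik(-3.0, 0.0, individuals, areas))
```

In a Bayesian implementation the same two likelihood terms would be combined with priors on the shared parameters. The sketch stops at the likelihood.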

Combining ecological and case-control data



• If outcome is rare, individual-level data from surveys or cohorts will usually contain little information.
• Supplement ecological data with case-control data instead.
• Haneuse and Wakefield (2005) describe a hybrid likelihood for combination of ecological and case-control data
   • Even including individual data from the cases only can reduce ecological bias to acceptable levels.
Issues with combining data




• Some variables missing in one dataset
   • e.g. smoking, blood pressure available in survey but not administrative data
• Different but related information in each
   • e.g. self-reported disease versus hospital admission records.
• Conflicts between datasets in information on what is nominally the same variable
   • e.g. self-completed and interviewed responses to surveys
• Ideally the individual and aggregate data are from the same source (e.g. census small-area and SAR)
Example: Cardiovascular disease (CVD)
AGGREGATE
Hospital Episode Statistics
• number of CVD admissions in area in 1998, by age group/sex
Census small area statistics
• marginal proportions non-white, social class IV/V,…
Census Samples of Anonymised Records (2%)
• full within-area cross-classification of individuals by age/sex/ethnicity/social class/car ownership, required for correct aggregate model

INDIVIDUAL
Health Survey for England
• Self-reported admission to hospital for CVD (1998 only)
• Self-reported long-term CVD (1997, 1998, 1999, 2000, 2001)
   • Multiple imputation for missing hospital admission in years other than 1998.
• individual age and sex
• individual ethnicity
• individual social class
• individual car access

Baseline and relative risk of CVD admission for individual
Are aggregate and individual data consistent?
[Figure: Health Survey for England data aggregated over districts, shown with census covariates or Hospital Episode Statistics data.]
Basic illustration of combining individual and aggregate data
[Diagram 1. DATA: aggregate census data for areas i (exposure $x_i$, e.g. proportion low social class; disease $y_i$, area admissions count) and individual survey data for areas i, individuals j (exposure $x_{ij}$, individual social class; disease $y_{ij}$, CVD admission). UNKNOWNS: relative risk for individuals $\beta$; area baseline risk $\alpha_i$.]
More complex models for disease, more confounders, need another data source.
[Diagram 2. DATA: aggregate census data for areas i (disease $y_{ik}$; exposures $x_{ir}$, $x_{is}$, $x_{ik}$ for social class r, employment status s, age/sex strata k); Census Samples of Anonymised Records for areas i, individuals l ($x_{il}$), giving the cross-classification of individuals $x_{irsk}$; individual survey data for areas i, individuals j (exposures $x_{ij}$; disease $y_{ij}$, CVD admission). UNKNOWNS: relative risk for exposures $\beta$; area/stratum baseline risk $\alpha_{ik}$.]
Imputing missing outcomes in individual data
[Diagram 3. As diagram 2, with the survey data in two parts: survey data (1998) with self-reported CVD and CVD admissions ($y_{ij}^*$, $y_{ij}$); survey data (1997-2001) with exposures $x_{ij}$ and CVD admissions $y_{ij}$ including imputed values.]
Estimated coefficients (with 95% CI) for multiple regression model of the risk of hospitalisation
[Figure: log odds ratios (axis from -1.0 to 2.0) for Carstairs, No car, Social class IV/V and Non white, estimated from individual data only (Individual), aggregate data only (District, Ward), and models combining individual and aggregated data (District + individual, Ward + individual).]
Individual and area-level predictors



• Area-level covariates in underlying model for hospitalisation risk (Carstairs deprivation index)
   • No significant influence of Carstairs, after accounting for individual-level factors
• Random effects models
   • Random area-level baseline risk, quantifies remaining variability between areas.
   • After adjusting for covariates, variance partitioned into individual / area-level components
   • 4% of residual variance between wards attributable to unobserved area-level factors (2% for districts)
• Little evidence of contextual effects
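The slides do not say how the variance was partitioned. One common convention for logistic random-intercept models, assumed here purely for illustration, is the latent-variable formulation, in which the individual-level residual variance is fixed at pi^2 / 3. Under that assumption, the sketch below converts between an area-level variance and the share of residual variance attributable to areas. The function names are hypothetical.

```python
import math

LOGISTIC_RESIDUAL_VARIANCE = math.pi ** 2 / 3.0  # variance of a standard logistic variable

def variance_share(area_variance):
    """Share of residual variance at area level under the latent-variable convention."""
    return area_variance / (area_variance + LOGISTIC_RESIDUAL_VARIANCE)

def area_variance_for_share(share):
    """Area-level variance that would give the stated share (inverse of the above)."""
    return share * LOGISTIC_RESIDUAL_VARIANCE / (1.0 - share)

# Area-level variances that would correspond to shares of 4% and 2%.
ward_variance = area_variance_for_share(0.04)
district_variance = area_variance_for_share(0.02)
print(round(ward_variance, 3), round(district_variance, 3))
```

Under this convention a 4% share corresponds to a small random-intercept variance on the log-odds scale, which is in line with the conclusion of little evidence of contextual effects.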
Example: Low birth weight and pollution
Geographically complete individual dataset from national
register, with exposure, outcome but not confounders
 Geographically sparse survey dataset with all variables.
→ missing data problem
 Impute missing covariates that are likely to be confounded
with the pollution exposure.
 Information for this imputation
 from aggregate data (e.g. ethnicity, from census).
 from sparse survey dataset

[Diagram. National register data (large): low birth weight, pollution, sex and age observed; confounders (socioeconomic, smoking, ethnicity, maternal age, etc.) missing. Survey data (small): regression model of low birth weight on pollution and confounders, all observed. Aggregate census data: ethnicity, sex, age.]
Parallel regression models




• Desire unbiased inference on the effect of the primary exposure.
• Available from small dataset with all confounders, but with low power.
• Information for imputation comes from small dataset or ecological data: is the resulting uncertainty worth the precision gained?
• Work in progress, currently awaiting some data.
Summary



• Combining datasets can increase power and reduce bias, making use of strengths of each
• Problems may arise when data are incompatible or inconsistent.
• Bayesian hierarchical models useful in cases of conflicts.
• All our methods can be implemented in WinBUGS
• More applied studies needed to demonstrate the utility of the approach.
Publications
Our papers available from http://www.bias-project.org.uk
 C. Jackson, N. Best, S. Richardson. Hierarchical related
regression for combining aggregate and survey data in
studies of socio-economic disease risk factors. under
revision, Journal of the Royal Statistical Society, Series A.
 C. Jackson, N. Best, S. Richardson. Improving ecological
inference using individual-level data. Statistics in Medicine
(2006) 25(12):2136-2159.
 C. Jackson, S. Richardson, N. Best. Studying place
effects on health by synthesising area-level and individual
data. Submitted.
 S. Haneuse and J. Wakefield. The combination of
ecological and case-control data. Submitted.