PH240A
Epidemiology and the Curse of
Dimensionality
Alan Hubbard
Division of Biostatistics
U.C. Berkeley
hubbard@stat.berkeley.edu
What’s Wrong?

Not related to analysis of data – misclassification,
biased sample, unmeasured confounders, ...

Related to analysis of data – improper
modeling strategies, failure to account for
design in analysis, poorly defined
parameter of interest, improper inference
reported given the strategy used.
What’s Proper Inference?

For an estimate, the standard error is an estimate
of the variability of this estimate in repeated
experiments performed just as the one used to
derive the estimate.

So, every little decision one makes about how to
analyze the data, that comes from looking at the
data, should be included in such a variability
estimate.

If feedback between data and analysis decisions
is ignored (as it typically is), then inference can
be biased.
A nice, mindless way to
always get proper inference

Define an algorithm that one will use to get the
final estimate (how variables are chosen, how
they are entered in a model, etc., etc.).

Apply this algorithm to data to get estimate.

Re-sample the data with replacement to create a
new pseudo-experiment.

Apply this algorithm to this “new” data to get
another estimate.

Repeat 10,000 times - graph histogram of
estimates - BOOTSTRAPPING.
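A minimal sketch of this recipe in Python. The data and the analysis_algorithm below are hypothetical stand-ins; the essential point is that every data-driven choice (like the variable selection here) lives inside the function that is re-run on each resample.

import numpy as np

def analysis_algorithm(y, X):
    # The ENTIRE pre-specified pipeline goes here. As a stand-in, pick the
    # covariate most correlated with y and return its OLS slope; the
    # data-driven selection step is deliberately inside the function.
    corrs = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    j = int(np.argmax(corrs))
    design = np.column_stack([np.ones(len(y)), X[:, j]])
    beta = np.linalg.lstsq(design, y, rcond=None)[0]
    return beta[1]

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))                 # hypothetical covariates
y = 0.5 * X[:, 2] + rng.normal(size=n)      # hypothetical outcome

estimate = analysis_algorithm(y, X)

# Resample subjects with replacement; re-run the WHOLE algorithm each time.
boot = []
for _ in range(10_000):
    idx = rng.integers(0, n, size=n)        # one pseudo-experiment
    boot.append(analysis_algorithm(y[idx], X[idx, :]))
boot = np.array(boot)

print(f"estimate = {estimate:.3f}, bootstrap SE = {boot.std():.3f}")
# A histogram of `boot` is the bootstrap sampling distribution.

Because the variable-selection step is repeated inside every bootstrap replicate, the reported variability includes the feedback between data and analysis decisions that the previous slide warns about.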
Causal Inference and Curse of Dimensionality
Causal Model and Data
Text from Taubes NYTimes Article
[image of article excerpt not rendered]
Causal Inference and Curse of
Dimensionality
Causal Model and Data
VW
Full Data
•X  ((Ya: a),W)
Observed Data
•O  (W,A,YA)
A
treatment
(risk factor)
Y
outcome
The Curse of Dimensionality and Epidemiology

Can’t beat curse of dimensionality (unless one is lucky).

Consider the following simplistic scenario:
 W are the covariates (confounders) and all are
categorical with the same number of levels, e.g., 2 or 3
or 4...

In order to get nonparametric causal inference, one
must have a perfectly matched unexposed person
(A=0) for every exposed person (A=1).

Given the number of confounders, how many subjects
does one have to sample for every exposed person?
Number of unexposed per exposed subject one needs
to sample to get perfect matching.
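A back-of-the-envelope sketch of how these numbers grow, under the optimistic simplifying assumption that all confounder strata are equally likely (the specific m and k values below are illustrative):

# With k confounders, each with m levels, there are m**k covariate strata.
# Under the equally-likely-strata assumption, one needs on the order of
# m**k unexposed subjects to perfectly match one randomly drawn exposed
# subject.
for m in (2, 3, 4):             # levels per confounder
    for k in (5, 10, 20):       # number of confounders
        print(f"{k} confounders x {m} levels: {m**k:,} strata")

Already with 20 binary confounders there are over a million strata, far more than the size of most studies.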
Doing the best one can with the
Curse
Concepts and Models in Causal
Inference
Prediction vs. Explanation

Prediction – create a model that the
clinician will use to help predict risk
of a disease for the patient.

Explanation – trying to investigate
the causal association of a treatment
or risk factor and a disease outcome.

This talk concerns studies where the
goal is explanation.
Theoretical Experiment

Start with some hypothesis.

From a statistical/scientific perspective the first
step is to define a theoretical experiment that
would address the hypothesis of interest.

For most questions in social-epi, running such
an experiment will be unthinkable.

However, defining the experiment of interest:
1. Makes explicit the specific hypothesis
2. Defines the Full Data
3. Defines the specific parameter of interest
4. Leads ultimately to estimators from the Observed data
and the necessary identifiability assumptions
The problem with observational
studies: lack of randomization

If one has a treatment, or risk factor, with two
levels (A and B), no guarantee that study
populations (those getting A and B) will be
roughly equivalent (in risk of the disease of
interest).

In a perfect world one could give everyone in the
study level A, record the outcome, reset the clock and
then give level B.

Randomization means one can interpret
estimates as if this is precisely what was done.
Counterfactuals

Even defining statistically what a “causal” effect
is, is not trivial.

One way that leads to practical methods to
estimate causal effects is to define
COUNTERFACTUALS

Assume that the “full” data would be such that, for
every subject, one could observe the outcome of
interest for each possible level of the treatment
(or risk factor) of interest.
Counterfactuals, CONT.

So, if Y is the outcome, A is the tx of interest, then the
best statistical situation is one where one observes, for
each subject, Ya, for each treatment level A=a.

For example, if there are simply two levels of exposure
(e.g., cigarettes: A=1 means yes and A=0 means no), then
each subject has in theory two counterfactuals, Y0 and Y1.

These are called counterfactuals because they are the
outcomes one might observe if, counter to fact, one could
set the clock back and re-start the whole experiment with
a new a.

To estimate specific causal effects, we then define
parameters that relate, for instance, how the means of
these counterfactuals differ as one changes a.
Causal Parameters in Point Tx studies

Point treatment studies are those where the
effect of interest refers to a treatment, risk factor,
... at one point in time.

Time-dependent treatment studies are those
where the effect of interest is a time-course of
treatment or exposure (more in a bit).

Types of effects (or parameters) one might want
to estimate are: total effects, direct and indirect
effects, dynamic treatment regimes....
General Causal Graph For a Point
Treatment Study
V W
treatment
(or expsoure)
confounders
A
Z
Y
Intermediate
Variable
outcome
Total Effects in Point Tx
Studies

Parameters of the distribution of the
counterfactuals: Ya

Examples for binary A (0 or 1):
• E[Y1]-E[Y0]
• E[Y1]/E[Y0]

Examples for binary A and Y (Causal OR)
• E[Y1](1-E[Y0])/{E[Y0](1-E[Y1])} =
P[Y1=1](1-P[Y0=1])/{P[Y0=1](1-P[Y1=1])}
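A small sketch computing these three parameters from hypothetical simulated counterfactuals (which, of course, are never jointly observable in real data):

import numpy as np

rng = np.random.default_rng(3)
n = 100_000
# Hypothetical FULL data: both counterfactuals for every subject.
y0 = rng.binomial(1, 0.10, size=n)   # P(Y0 = 1) = 0.10
y1 = rng.binomial(1, 0.15, size=n)   # P(Y1 = 1) = 0.15

p0, p1 = y0.mean(), y1.mean()
print("causal risk difference E[Y1]-E[Y0]:", p1 - p0)
print("causal risk ratio      E[Y1]/E[Y0]:", p1 / p0)
print("causal odds ratio:", (p1 * (1 - p0)) / (p0 * (1 - p1)))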
Total Effects in Point Tx
Studies, cont.

Regression models (marginal structural models – MSM’s)
relating the mean of Ya to a: E[Ya] = m(a | β) (continuous or
ordered categorical a), e.g.,
m(a | β) = β0 + β1a
or
m(a | β) = 1 / (1 + exp(-(β0 + β1a)))

Stratified MSM’s:
m(a, V | β) = β0 + β1a + β2V + β3aV
Dynamic Treatment Regimes

D = {d : w → d(w) ∈ A} is the abstract set of
possible dynamic treatment regimes.

D = {dθ : w → dθ(w)} is slightly less abstract;
that is, your rule will be controlled by some
constant θ (or perhaps a vector if W is a
vector).

Finally, the specific rule could be:
dθ(w) = I(w > θ)
A Dynamic Treatment Regime Parameter
of Interest

Parameter of interest for this might be:
E[Yd ]=P(Yd = 1)
which represents the expected outcome in a
population where all subjects are subject to treatment
rule, d.

Example: put the patient on cholesterol-lowering
drugs (the A = yes or no) if cholesterol (the W) is >
200 (θ = 200), where the outcome is heart disease (the
Y).
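A minimal sketch of this rule as code (the threshold and data values are hypothetical):

import numpy as np

def d(w, theta=200.0):
    # Rule d_theta: treat (A = 1) exactly when cholesterol w exceeds theta.
    return (np.asarray(w) > theta).astype(int)

cholesterol = np.array([180.0, 195.0, 210.0, 250.0])  # hypothetical W values
print(d(cholesterol))  # -> [0 0 1 1]: drug only for those above 200

E[Yd] is then the heart-disease risk in a population in which everyone's treatment is assigned by this rule.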
Direct Effects

Need to define counterfactuals for
both endogenous variables, Z and
Y.

Ya = the counterfactual outcome if
receives A=a.

Yaz = the counterfactual if receives
A=a, Z=z.

Za = counterfactual Z if receives
A=a.

Note, Ya=YaZa

Finally, YaZa*, a* ≠ a, is the
counterfactual if the subject receives A=a but
Z is, counter to fact, the value it would have
taken under a different a (a*).
[causal diagram: W (confounders) → A → Z → Y]
Direct Effects

First, let A =0 be a reference group

Total Effect: E[Y1]-E[Y0]

Pure Direct Effect: the difference between
the exposed (A=a) and unexposed (A=0) if
the intermediate variable is “set” to its value
when A=0:
E[YaZ0] - E[Y0Z0] = E[YaZ0] - E[Y0]
Time-Dependent Treatments

Measurements made at regular
times, 0,1,2,...,K.

A(j) is the treatment (or
exposure) dose on the jth day.

Y is outcome measured at the
end (only once) at day K+1.

Ā(j) = (A(0), A(1), ..., A(j)) is the
history of treatment as
measured at time j.

L̄(j) = (L(0), L(1), ..., L(j)) is the
history of the potential
confounders measured at time j.
Time-dependent counterfactuals of
Interest
Yā, ā = (a(0), a(1), ..., a(K))

Must now define counterfactuals with regard
to a whole vector of possible treatments.

If A is binary (e.g., yes/no) there are 2^K
possible counterfactuals, e.g.,
ā = (0,0,...,0), ā = (1,1,...,1)
Example of MSM for time-dependent Tx

Choose a reasonable model that relates
counterfactual mean to treatment history.

Example:
E[Yā] = β0 + β1 sum(ā)
where
sum(ā) = a(1) + a(2) + ... + a(K)
Estimators
of MSM’s in Point Treatment Studies

A denotes a “treatment” or exposure of interest –
assume categorical.

W is a vector (set) of confounders

Y is an outcome

Define the observed data as O = (Y, W, A)

Ya are the counterfactual outcomes of interest

The “full data” is:
XFULL = (Xa : a ∈ A)
Key Assumptions
1. Consistency Assumption: the observed
data is O = (A, XA) – i.e., the data for a
subject is simply one of the
counterfactual outcomes from the full
data.
2. Randomization Assumption: A ⊥ Ya | W, for all a,
so no unmeasured confounders for
treatment. In other words: within strata
of W, A is randomized.
Key Assumptions,
cont.
3. Experimental Treatment Assignment:
all treatments are possible for all
members of the target population, or:
P(A = a | W) > 0,
for all W.
Likelihood of Data in simple
Point Treatment
• Given the assumptions, the likelihood of the data
simplifies to:
L(O) = P(Y | A, W) P(A | W)
• Factorizes into the distribution of interest and the
treatment assignment distribution.
• The G-computation formula works specifically
with parameters of P(Y|A,W) estimated with
maximum likelihood and ignores the treatment
assignment distribution.
G-computation Approach
• Given assumptions, note that P(Y|A=a,W) =
P(Ya|W) or E(Y|A=a,W) = E(Ya|W).
• Then,
E[Ya] = ∫w E[Ya | W = w] dP(w)
• which leads to our G-comp. estimate of the
counterfactual mean in this simple context:
Ê[Ya] = (1/n) Σi Ê[Y | A = a, W = Wi]
• Regress Ê[Ya] vs. a to get an estimate of the MSM.
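A minimal sketch of these steps on simulated data, assuming a single confounder and a correctly specified linear outcome model; with binary A, regressing Ê[Ya] on a reduces to the difference Ê[Y1] - Ê[Y0]:

import numpy as np

rng = np.random.default_rng(1)
n = 5_000
W = rng.normal(size=n)                        # a single confounder
A = rng.binomial(1, 1 / (1 + np.exp(-W)))     # treatment depends on W
Y = 1.0 * A + 0.5 * W + rng.normal(size=n)    # true causal effect of A is 1.0

# Step 1: model E[Y | A, W] (here an OLS fit of the correct linear model).
design = np.column_stack([np.ones(n), A, W])
beta = np.linalg.lstsq(design, Y, rcond=None)[0]

def Ehat_Ya(a):
    # Step 2: set A = a for EVERYONE, average predictions over observed W.
    Xa = np.column_stack([np.ones(n), np.full(n, float(a)), W])
    return float((Xa @ beta).mean())

# Step 3: with binary A, the MSM slope is just the difference.
print("naive E[Y|A=1] - E[Y|A=0]:", Y[A == 1].mean() - Y[A == 0].mean())
print("G-comp E[Y1] - E[Y0]     :", Ehat_Ya(1) - Ehat_Ya(0))

The naive contrast is biased by the confounder W; the G-computation contrast recovers the causal effect because it averages over the observed W distribution.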
Inverse Probability of Treatment
Weighted (IPTW) estimator
• The G-comp. approach is three steps:
1. Model E[Y|A,W]
2. Estimate E[Ya]
3. Regress Ê[Ya] vs. a to get an estimate of the
MSM (e.g., m(a | β) = β0 + β1a)
• The IPTW estimator uses a different approach that instead
models treatment assignment to adjust for
confounding and uses these probabilities as weights in
regression.
• Define g(a|W) to be P(A=a|W).
IPTW Estimating Function

General Estimating Function (for
MSM's stratified by V):
[h(A,V) / g(A|W)] (Y - m(A,V | β))

For unstratified MSM's:
[h(A) / g(A|W)] (Y - m(A | β))

Example (linear model, with h(A) = (1, A)'):
[(1, A)' / g(A|W)] (Y - m(A | β))
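A minimal IPTW sketch for the linear MSM above, on simulated data. To keep it short, the true treatment mechanism g(a|W) is plugged in rather than estimated; in practice g would itself be fit, e.g., by logistic regression. Solving the estimating equation with h(A) = (1, A)' is the same as weighted least squares of Y on (1, A) with weights 1/g(A|W):

import numpy as np

rng = np.random.default_rng(2)
n = 5_000
W = rng.normal(size=n)
gW = 1 / (1 + np.exp(-W))                   # true g(1 | W); in practice, fit it
A = rng.binomial(1, gW)
Y = 1.0 * A + 0.5 * W + rng.normal(size=n)  # true causal effect of A is 1.0

g_obs = np.where(A == 1, gW, 1 - gW)        # g(A_i | W_i)
wts = 1 / g_obs                             # IPTW weights

# Weighted least squares of Y on (1, A) solves the estimating equation
#   sum_i (1, A_i)' (Y_i - b0 - b1*A_i) / g(A_i | W_i) = 0.
Xd = np.column_stack([np.ones(n), A])
b = np.linalg.solve(Xd.T @ (wts[:, None] * Xd), Xd.T @ (wts * Y))
print("IPTW MSM estimate (b0, b1):", b)     # b1 estimates E[Y1] - E[Y0]

Note that no outcome model E[Y|A,W] is fit at all; confounding is removed entirely through the weights.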
Likelihood (or Bayesian)
Approach

Likelihood approach specifies the joint
distribution of data and can be thought of
as a model of the “whole” process (or at
least the relevant parts of it).

Likelihood-based inference relies on
getting the model right.

So, both estimates and standard errors
are derived under the unlikely
circumstance that you chose the correct
model.
What if the Model is not “Truth” but a
reasonable projection?
Estimating Function
Approach

Estimating function concentrates on the
part of the distribution of the data relevant
to the parameter of interest

Separates out estimation of the “nuisance
stuff” from what you care about.

And, if derived properly, the inference can
be derived for a parameter which is not
the true relationship, but some projection
of this truth onto a (hopefully) informative
sub-model.
Points of Talk

Curse of dimensionality poses an
insurmountable challenge to causal
inference from observational data without
some assumptions.

To attack curse, first ask - what is my
question and what parameter of interest
does it imply?

Use an estimator that separates out the
parameter of interest from the stuff you do
not care about (nuisance parameters).
Points of Talk, Cont.

Use an inferential technique that
accounts for any decisions made
while looking at the data (i.e.,
bootstrapping).

Think of the model not as truth but as a
projection, and thus derive inference
accordingly (again, bootstrapping).