PH240A: Epidemiology and the Curse of Dimensionality
Alan Hubbard, Division of Biostatistics, U.C. Berkeley
hubbard@stat.berkeley.edu

What's Wrong?
Not related to analysis of data: misclassification, biased sample, unmeasured confounders, ....
Related to analysis of data: improper modeling strategies, failure to account for the design in the analysis, poorly defined parameter of interest, improper inference reported given the strategy used.

What's Proper Inference?
For an estimate, the standard error is an estimate of the variability of that estimate in repeated experiments performed just as the one used to derive the estimate.
So, every little decision one makes about how to analyze the data, if it comes from looking at the data, should be included in such a variability estimate.
If feedback between data and analysis decisions is ignored (as it typically is), then inference can be biased.

A nice, mindless way to always get proper inference
Define an algorithm that one will use to get the final estimate (how variables are chosen, how they are entered in a model, etc.).
Apply this algorithm to the data to get the estimate.
Re-sample the data with replacement to create a new pseudo-experiment.
Apply this algorithm to this "new" data to get another estimate.
Repeat 10,000 times and graph the histogram of estimates: BOOTSTRAPPING. (A small code sketch of this recipe appears at the end of these notes.)

Causal Inference and the Curse of Dimensionality

Text from Taubes NYTimes Article
(Two slides reproducing text from the Taubes New York Times article; images not available in this extraction.)

Causal Model and Data
Causal graph: nodes V and W (covariates/confounders), A (treatment/risk factor), Y (outcome).
• Full Data: X = ((Ya : a), W)
• Observed Data: O = (W, A, YA)

The Curse of Dimensionality and Epidemiology
Can't beat the curse of dimensionality (unless one is lucky). Consider the following simplistic scenario.
W are the covariates (confounders), and all are categorical with the same number of levels, e.g., 2 or 3 or 4, ....
In order to get nonparametric causal inference, one must have a perfectly matched unexposed person (A=0) for every exposed person (A=1).
Given the number of confounders, how many subjects does one have to sample for every exposed person?

Number of unexposed per exposed subject one needs to sample to get perfect matching
(Original table of counts not reproduced; a small counting sketch appears just before the Counterfactuals section below.)

Doing the best one can with the Curse

Concepts and Models in Causal Inference

Prediction vs. Explanation
Prediction: create a model that the clinician will use to help predict the risk of a disease for a patient.
Explanation: investigate the causal association of a treatment or risk factor with a disease outcome.
This talk concerns studies where the goal is explanation.

Theoretical Experiment
Start with some hypothesis. From a statistical/scientific perspective, the first step is to define a theoretical experiment that would address the hypothesis of interest. For most questions in social epidemiology, running such an experiment will be unthinkable. However, defining the experiment of interest:
1. Makes explicit the specific hypothesis
2. Defines the Full Data
3. Defines the specific parameter of interest
4. Leads ultimately to estimators from the Observed Data and the necessary identifiability assumptions

The problem with observational studies: lack of randomization
If one has a treatment, or risk factor, with two levels (A and B), there is no guarantee that the study populations (those getting A and those getting B) will be roughly equivalent (in risk of the disease of interest).
In a perfect world one could give everyone in the study level A, record the outcome, reset the clock, and then give level B. Randomization means one can interpret estimates as if this is precisely what was done.
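Picking up the counting question from the curse-of-dimensionality slides above, here is a minimal sketch (Python; the particular numbers of confounders and levels are arbitrary examples, and the equal-frequency interpretation is a simplification not made on the original slide) of how fast the number of confounder strata grows. If covariate patterns were roughly equally frequent among the unexposed, this stratum count is also roughly the expected number of unexposed subjects one would have to sample to perfectly match a single exposed subject.

```python
# Sketch: how fast the matching requirement blows up with the number of
# categorical confounders.  Under the simplifying assumption that covariate
# patterns are about equally frequent among the unexposed, the expected number
# of unexposed subjects one must sample to find a perfect match for one
# exposed subject is the number of confounder strata: levels ** n_confounders.

def strata_count(n_confounders: int, levels: int) -> int:
    """Number of distinct covariate patterns (confounder strata)."""
    return levels ** n_confounders

for levels in (2, 3, 4):
    for n_confounders in (2, 5, 10, 15):
        print(f"{n_confounders:2d} confounders with {levels} levels each -> "
              f"{strata_count(n_confounders, levels):,} strata "
              f"(~ unexposed needed per exposed for a perfect match)")
```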
Counterfactuals
Even defining statistically what a "causal" effect is, is not trivial. One way that leads to practical methods for estimating causal effects is to define COUNTERFACTUALS.
Assume that the "full" data would be, for every subject, the outcome of interest observed under each possible level of the treatment (or risk factor) of interest.

Counterfactuals, cont.
So, if Y is the outcome and A is the treatment of interest, then the best statistical situation is one where one observes, for each subject, Ya for each treatment level A=a.
For example, if there are simply two levels of exposure (e.g., cigarettes: A=1 means yes and A=0 means no), then each subject has, in theory, two counterfactuals, Y0 and Y1.
These are called counterfactuals because they are the outcomes one might observe if, counter to fact, one could set the clock back and re-start the whole experiment with a new a.
To estimate specific causal effects, we then define parameters that describe, for instance, how the means of these counterfactuals differ as one changes a.

Causal Parameters in Point Treatment Studies
Point treatment studies are those where the effect of interest refers to a treatment, risk factor, ... at one point in time.
Time-dependent treatment studies are those where the effect of interest is a time-course of treatment or exposure (more in a bit).
Types of effects (or parameters) one might want to estimate: total effects, direct and indirect effects, dynamic treatment regimes, ....

General Causal Graph for a Point Treatment Study
Nodes: V and W (confounders), A (treatment or exposure), Z (intermediate variable), Y (outcome).

Total Effects in Point Treatment Studies
Parameters of the distribution of the counterfactuals Ya.
Examples for binary A (0 or 1):
• E[Y1] - E[Y0]
• E[Y1] / E[Y0]
Example for binary A and Y (causal OR):
• E[Y1](1 - E[Y0]) / {E[Y0](1 - E[Y1])} = P(Y1=1)(1 - P(Y0=1)) / {P(Y0=1)(1 - P(Y1=1))}

Total Effects in Point Treatment Studies, cont.
Regression models (marginal structural models, MSMs) relate the mean of Ya to a: E[Ya] = m(a | β) (continuous or ordered categorical a), e.g.,
m(a | β) = β0 + β1*a
or
m(a | β) = 1 / (1 + exp(-(β0 + β1*a)))
Stratified MSMs: m(a, V | β) = β0 + β1*a + β2*V + β3*a*V

Dynamic Treatment Regimes
D = {d : w → d(w) ∈ A} is the abstract set of possible dynamic treatment regimes (rules that assign a treatment based on the covariate value w).
D = {dθ : θ} is slightly less abstract; that is, the rule is controlled by some constant θ (or perhaps a vector, if W is a vector).
Finally, the specific rule could be: d(w) = I(w > θ).

Dynamic Treatment Regime Parameter of Interest
The parameter of interest might be E[Yd] = P(Yd = 1), which represents the expected outcome in a population where all subjects are subject to treatment rule d.
An example is to put a patient on cholesterol-lowering drugs (the A = yes or no) if cholesterol (the W) is > 200 (θ = 200), with heart disease as the outcome (the Y).

Direct Effects
Need to define counterfactuals for both endogenous variables, Z and Y.
Ya = the counterfactual outcome if the subject receives A=a.
Yaz = the counterfactual outcome if the subject receives A=a and Z=z.
Za = the counterfactual Z if the subject receives A=a.
Note: Ya = YaZa.
Finally, YaZa*, with a* ≠ a, is the counterfactual outcome if the subject receives A=a but Z, counter to fact, is set to the value it would have taken under a different treatment a*.

Direct Effects, cont.
(Causal graph: W, A, Z, Y as above.)
First, let A=0 be the reference group.
Total Effect: E[Y1] - E[Y0]
Pure Direct Effect: the difference between the exposed (A=a) and unexposed (A=0) if the intermediate variable is "set" to its value when A=0:
E[YaZ0] - E[Y0Z0] = E[YaZ0] - E[Y0]
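Because all of these counterfactuals are never observed together for a real subject, the definitions are easiest to see in a simulation where the full data are generated explicitly. Below is a minimal sketch (Python with numpy; the data-generating mechanism, coefficients, and sample size are invented purely for illustration and are not from the slides) that draws the counterfactual intermediate Za and outcomes Yaz for every subject and then computes the total effect E[Y1] - E[Y0] and the pure direct effect E[Y1Z0] - E[Y0] directly from the full data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Baseline covariate W and the counterfactual intermediates Z_a for a = 0, 1.
W = rng.normal(size=n)
Z0 = rng.binomial(1, 0.30, size=n)          # Z if A is set to 0
Z1 = rng.binomial(1, 0.60, size=n)          # Z if A is set to 1

def p_outcome(a, z, w):
    """Hypothetical P(Y_{a,z} = 1 | W): a direct effect of a plus an
    indirect effect through z (coefficients are made up)."""
    return 1 / (1 + np.exp(-(-1.0 + 0.8 * a + 0.7 * z + 0.3 * w)))

# Counterfactual outcomes Y_{a,z} for every (a, z) combination.
Y = {(a, z): rng.binomial(1, p_outcome(a, z, W)) for a in (0, 1) for z in (0, 1)}

# Y_a = Y_{a, Z_a}: the "natural" counterfactual under treatment a.
Y0 = np.where(Z0 == 1, Y[(0, 1)], Y[(0, 0)])
Y1 = np.where(Z1 == 1, Y[(1, 1)], Y[(1, 0)])
# Y_{1, Z0}: treated, but with the intermediate held at its untreated value.
Y1_Z0 = np.where(Z0 == 1, Y[(1, 1)], Y[(1, 0)])

print("Total effect        E[Y1] - E[Y0]      =", Y1.mean() - Y0.mean())
print("Pure direct effect  E[Y1Z0] - E[Y0]    =", Y1_Z0.mean() - Y0.mean())
```

In a real study only one of these counterfactuals is observed per subject, which is exactly why the identifying assumptions and estimators discussed next are needed.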
Time-Dependent Treatments
Measurements are made at regular times 0, 1, 2, ..., K.
A(j) is the treatment (or exposure) dose on the jth day.
Y is the outcome, measured once at the end, at day K+1.
Ā(j) = (A(0), A(1), ..., A(j)) is the history of treatment through time j.
L̄(j) = (L(0), L(1), ..., L(j)) is the history of the potential confounders through time j.

Time-Dependent Counterfactuals of Interest
Yā, where ā = (a(0), a(1), ..., a(K)).
We must now define counterfactuals with respect to a whole vector of possible treatments, e.g., ā = (0, 0, ..., 0) or ā = (1, 1, ..., 1).
If A is binary (e.g., yes/no) there are 2^(K+1) possible counterfactuals.

Example of an MSM for Time-Dependent Treatment
Choose a reasonable model that relates the counterfactual mean to the treatment history. Example:
E[Yā] = β0 + β1*sum(ā), where sum(ā) = Σj a(j).

Estimators of MSMs in Point Treatment Studies
A denotes a "treatment" or exposure of interest (assume categorical).
W is a vector (set) of confounders.
Y is an outcome.
The observed data are O = (Y, W, A).
Ya are the counterfactual outcomes of interest.
The "full data" are X_FULL = (Xa : a ∈ A).

Key Assumptions
1. Consistency Assumption: the observed data are O = (A, XA), i.e., the data for a subject are simply the counterfactual from the full data corresponding to the treatment actually received.
2. Randomization Assumption: A ⊥ Ya | W for all a, so there are no unmeasured confounders for treatment. In other words, within strata of W, A is randomized.

Key Assumptions, cont.
3. Experimental Treatment Assignment (ETA): all treatments are possible for all members of the target population, or P(A = a | W) > 0 for all W.

Likelihood of the Data in a Simple Point Treatment Study
• Given the assumptions, the likelihood of the data simplifies to: L(O) = P(Y | A, W) P(A | W).
• It factorizes into the distribution of interest and the treatment assignment distribution.
• The G-computation formula works specifically with parameters of P(Y | A, W), estimated with maximum likelihood, and ignores the treatment assignment distribution.

G-computation Approach
• Given the assumptions, note that P(Y | A=a, W) = P(Ya | W), or E(Y | A=a, W) = E(Ya | W).
• Then E[Ya] = ∫w E[Ya | W = w] dP(w).
• This leads to our G-computation estimate of the counterfactual mean in this simple context:
  Ê[Ya] = (1/n) Σi Ê[Y | A = a, W = Wi], i = 1, ..., n.
• Regress Ê[Ya] vs. a to get an estimate of the MSM.

Inverse Probability of Treatment Weighted (IPTW) Estimator
• The G-computation approach is three steps:
  1. Model E[Y | A, W].
  2. Estimate E[Ya].
  3. Regress Ê[Ya] vs. a to get an estimate of the MSM (e.g., m(a | β) = β0 + β1*a).
• IPTW uses a different approach: it instead models treatment assignment to adjust for confounding and uses those probabilities as weights in the regression.
• Define g(a | W) to be P(A = a | W).

IPTW Estimating Function
The general estimating function (for MSMs stratified by V) is:
[h(A, V) / g(A | W)] * (Y - m(A, V | β))
For unstratified MSMs:
[h(A) / g(A | W)] * (Y - m(A | β))
Example (linear model), with h(A) = (1, A)':
[(1, A)' / g(A | W)] * (Y - m(A | β))
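To make the two estimators concrete, here is a minimal simulation sketch (Python with numpy; the data-generating mechanism, the single binary confounder, and the saturated stratum-mean "models" are simplifications chosen for illustration, not part of the original slides). It computes Ê[Ya] by the G-computation formula and by solving the IPTW estimating equation, and then the implied MSM slope for binary a.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Hypothetical observed data O = (W, A, Y): one binary confounder W, a binary
# treatment A whose assignment depends on W, and a binary outcome Y.
W = rng.binomial(1, 0.5, size=n)
A = rng.binomial(1, np.where(W == 1, 0.7, 0.3))   # true g(1 | W)
Y = rng.binomial(1, 0.1 + 0.2 * A + 0.3 * W)      # true E[Y | A, W]

# --- G-computation -------------------------------------------------------
# Step 1: "model" E[Y | A, W]; with one binary confounder, the saturated
# model is simply the four stratum means.
def Ehat_Y(a, w):
    return Y[(A == a) & (W == w)].mean()

# Step 2: Ehat[Ya] = (1/n) * sum_i Ehat[Y | A = a, W = W_i]; for binary W this
# is the stratum means averaged over the empirical distribution of W.
def gcomp_mean(a):
    return sum(Ehat_Y(a, w) * np.mean(W == w) for w in (0, 1))

# --- IPTW ----------------------------------------------------------------
# Model treatment assignment: ghat(1 | W = w) = empirical P(A = 1 | W = w).
ghat1 = {w: A[W == w].mean() for w in (0, 1)}
p1 = np.where(W == 1, ghat1[1], ghat1[0])         # ghat(1 | W_i)
g_obs = np.where(A == 1, p1, 1 - p1)              # ghat(A_i | W_i)

def iptw_mean(a):
    # Solves the IPTW estimating equation for E[Ya]: a weighted mean of Y
    # among subjects with A = a, with weights 1 / ghat(a | W).
    return np.mean((A == a) * Y / g_obs) / np.mean((A == a) / g_obs)

for a in (0, 1):
    print(f"a = {a}:  G-comp Ehat[Ya] = {gcomp_mean(a):.3f}   "
          f"IPTW Ehat[Ya] = {iptw_mean(a):.3f}")

# Step 3, the MSM m(a | beta) = beta0 + beta1*a: with binary a, "regressing"
# Ehat[Ya] on a reduces to the difference, i.e., the causal risk difference.
print("beta1 (G-comp):", round(gcomp_mean(1) - gcomp_mean(0), 3),
      " beta1 (IPTW):", round(iptw_mean(1) - iptw_mean(0), 3))
```

Because both the outcome "model" and the treatment-assignment model are saturated here, both estimates should recover approximately the risk difference of 0.2 built into the simulation, despite the confounding by W.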
Likelihood (or Bayesian) Approach
The likelihood approach specifies the joint distribution of the data and can be thought of as a model of the "whole" process (or at least the relevant parts of it).
Likelihood-based inference relies on getting the model right.
So, both estimates and standard errors are derived under the unlikely circumstance that you chose the correct model.

What if the Model is not "Truth" but a Reasonable Projection?

Estimating Function Approach
The estimating function concentrates on the part of the distribution of the data relevant to the parameter of interest.
It separates out estimation of the "nuisance stuff" from what you care about.
And, if derived properly, the inference can be derived for a parameter which is not the true relationship, but some projection of this truth onto a (hopefully) informative sub-model.

Points of Talk
The curse of dimensionality poses an insurmountable challenge to causal inference from observational data without some assumptions.
To attack the curse, first ask: what is my question, and what parameter of interest does it imply?
Use an estimator that separates out the parameter of interest from the stuff you do not care about (nuisance parameters).

Points of Talk, cont.
Use an inferential technique that accounts for any decisions made while looking at the data (i.e., bootstrapping).
Think of the model not as truth but as a projection, and thus derive inference accordingly (again, bootstrapping).
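As promised near the start of these notes, here is a minimal sketch of the "mindless" bootstrap recipe (Python with numpy; the analysis function and data are placeholders invented for illustration). The key point is that the entire analysis algorithm, including every data-driven decision, sits inside one function that is re-applied to each resampled pseudo-experiment.

```python
import numpy as np

def analysis(data):
    """Placeholder for the ENTIRE analysis algorithm: variable selection,
    model choice, fitting -- every decision that is made by looking at the
    data must happen inside this function."""
    # Toy example: the estimate is simply the sample mean of one column.
    return data[:, 0].mean()

def bootstrap(data, n_boot=10_000, seed=0):
    """Nonparametric bootstrap: resample rows with replacement, re-run the
    full algorithm each time, and return the distribution of estimates."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # one "new" pseudo-experiment
        estimates[b] = analysis(data[idx])
    return estimates

# Usage with made-up data: the spread of the bootstrap distribution (its
# standard deviation, or a percentile interval) is the honest variability
# estimate the slides call for.
data = np.random.default_rng(1).normal(size=(500, 3))
boot = bootstrap(data, n_boot=10_000)
print("estimate:", analysis(data), " bootstrap SE:", boot.std(ddof=1))
print("95% percentile interval:", np.percentile(boot, [2.5, 97.5]))
```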