Using causal graphs to understand bias in the medical literature

About these slides
- This presentation was created for the Boston Less Wrong Meetup by Anders Huitfeldt (Anders_H)
- I have tried to optimize for intuitive understanding, not for technical precision or mathematical rigor
- The slides are inspired by courses taught by Miguel Hernan and Jamie Robins at the Harvard School of Public Health

Directed Acyclic Graphs
- This is a Directed Acyclic Graph:
- The nodes (letters) are variables
- The graph is "Directed" because the arrows have direction
- It is "Acyclic" because, if you move in the direction of the arrows, you can never get back to where you began
- We use these graphs to represent the assumptions we are making about the relationships between the individual variables on the graph

Directed Acyclic Graphs
- We can use DAGs to reason about which statements about independence are logical consequences of other statements about independence
- The rules for this type of reasoning are called "D-Separation" (Pearl, 1987)
- It is possible to do the same thing using algebra, but D-Separation saves a lot of time and energy

Directed Acyclic Graphs
- This DAG is complete: there is a direct arrow between all variables on the graph
- This means we are not making any assumptions about independence between any two variables

Directed Acyclic Graphs
- On this DAG, there are missing arrows
- Each missing arrow corresponds to an assumption about independence
- Specifically, when arrows are missing, we assume that every variable is independent of the past, given the joint distribution of its parents
- Other independencies may follow automatically

Directed Acyclic Graphs
- There is a "path" between two variables if you can get from one variable to the other by following arrows
- The direction of the arrows does not matter for determining whether a path exists (but it does matter for whether the path is open)
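The "acyclic" requirement is easy to check mechanically, since a DAG is just nodes and directed arrows. Here is a minimal sketch in plain Python (the example graph and the function name are my own, not from the slides):

```python
# A DAG as an adjacency dict: each node maps to the nodes its arrows point to.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}

def is_acyclic(g):
    # Depth-first search with three states per node:
    # 0 = unvisited, 1 = on the current path, 2 = fully explored.
    state = {v: 0 for v in g}

    def visit(v):
        if state[v] == 1:   # we returned to a node on the current path: a cycle
            return False
        if state[v] == 2:   # already explored from here, no cycle found
            return True
        state[v] = 1
        ok = all(visit(w) for w in g[v])
        state[v] = 2
        return ok

    return all(visit(v) for v in g)

print(is_acyclic(graph))                          # True: no way back to the start
print(is_acyclic({"A": ["B"], "B": ["A"]}))       # False: A -> B -> A is a cycle
```

The same adjacency-dict representation works for finding paths between variables; only there the direction of each arrow is ignored when asking whether a path exists at all.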
- We can tell whether two variables are independent by checking whether there is an open path between them

Colliders and Non-Colliders
- A path from A to C via B could be of four different types:
  - A → B → C
  - A ← B ← C
  - A ← B → C
  - A → B ← C
- The last of these types is different from the others: we call B a "Collider" on this path
- Notice how the arrows from A and C "collide" in B
- On the three other types of paths, B is a "Non-Collider"
- Note that the concept of "Collider" is path-dependent: B could be a collider on one path, and a non-collider on another path

Conditioning
- If we look at the data within levels of a covariate, that covariate has been "conditioned on"
- We represent that by drawing a box around the variable on the DAG

The Rules of D-Separation
- If variables have not been conditioned on:
  - Non-Colliders are open
  - Colliders are closed (unless a downstream consequence is conditioned on)
- If variables have been conditioned on:
  - Non-Colliders that have been conditioned on are closed
  - Colliders that have been conditioned on are open
  - Colliders that have downstream consequences that have been conditioned on are open

The Rules of D-Separation
- A path from A to B is open if all variables between A and B are open on that path
- Two variables are d-separated (independent) if all paths between them are closed
- Two variables A and B are d-separated conditional on a third variable if conditioning on the third variable closes all paths between A and B

Causal DAGs
- We can give the DAG a causal interpretation if we are willing to assume:
  - That the variables are in temporal (causal) order
  - That whenever two variables on our graph share a common ancestor, that ancestor is also shown on the graph
- If we have a causal DAG, we can use it as a map of the data generating mechanism: causal DAGs can be interpreted to say that if we change the value of A, that "change" will propagate to other variables in the direction of the arrows

Causal DAGs
- All scientific papers make assumptions about the data generating mechanism.
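The collider rule is the least intuitive one, but it is easy to see in simulated data. In this toy example (my own, not from the slides), A and C are independent causes of B, so the path A → B ← C is closed until we condition on B:

```python
# Conditioning on a collider opens the path A -> B <- C:
# A and C start out independent, but become correlated within levels of B.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
A = rng.normal(size=n)
C = rng.normal(size=n)
B = A + C + rng.normal(size=n)   # B is a collider: both arrows point into it

# Marginally, A and C are independent: correlation near zero.
print(np.corrcoef(A, C)[0, 1])

# Condition on B by residualizing A and C on B -- the linear analogue
# of looking at the data within levels of B.
resid_A = A - np.polyval(np.polyfit(B, A, 1), B)
resid_C = C - np.polyval(np.polyfit(B, C, 1), B)
print(np.corrcoef(resid_A, resid_C)[0, 1])   # clearly negative (about -0.5 here)
```

The intuition: within a fixed level of B, knowing that A was large tells you C was probably small, even though neither causes the other.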
- Causal DAGs are simply a very good way of being explicit about those assumptions. We use them because:
  - They assure us that our assumptions correspond to a plausible, logically consistent causal process
  - They make it easy to verify that our analysis matches the assumptions
  - They give us very precise definitions of different types of bias
  - They make it much easier to think about complicated data generating mechanisms

Causal DAGs
- Note that we can never know what the data generating mechanism actually looks like
- The best we can do is make arguments that our map fits the territory
- Sometimes it is very obvious that the map does not match the territory

Causal Inference
- A pathway is causal if every arrow on the path points in the forward direction
- If I intervene to change the value of A, this will lead to changes in all variables that are downstream from A
- The goal of causal inference is to predict how much the outcome Y will change if you change A
- In other words, we are quantifying the magnitude of the combination of all forward-going pathways from A to Y
- If we have data from the observed world, and we know that the only open pathway from the exposure to the outcome is in the forward direction, then the observed correlation is purely due to causation

Bias
- However, if there exists any open pathway between the exposure and the outcome where one or more of the arrows is in the backward direction, there is bias
- Open pathways that have arrows in the backward direction will lead to correlation in the observed data
- But that correlation will not be reproduced in an experiment where you intervened to change the exposure
- The two main types of bias are confounding and selection bias:
  - Confounding is a question of who gets exposed
  - Selection bias is a question of who gets to be part of the study

Confounding
- Confounding bias occurs when there is a common cause of the exposure and the outcome
- You can check for it using the "Backdoor Criterion"
- If there exists an open path between A and Y that goes into A (as opposed to out from A), we call that a "backdoor path"
- A backdoor path between A and Y will always have an arrow in the backward direction

Example of a DAG with Confounding

Confounding
- Notice that if we had randomized people to be smokers or non-smokers, the arrow from Sex to Smoking could not exist
- We would know it didn't exist, because the only cause of smoking would be our random number generator
- Therefore, there could be no confounding
- The best way to abolish confounding is to randomize the exposure. However, this is expensive, and is usually not feasible

Controlling for Confounding
- There are many ways to control for confounding if the data is observational instead of experimental
- Standard analyses (stratification, regression, matching) are based on looking at the effect within levels of a confounder
- If we do this, we put a box around the confounder on the DAG
- This closes the backdoor path
- If we condition on all the confounders, the only open pathways will be in the forward direction, and all remaining correlation between the exposure and the outcome is due to causation

Controlling for Confounding
- An alternative way to control for confounding is to simulate a world where there are no arrows into the treatment
- We do this by weighting each observation by the inverse probability of treatment, given the confounders
- We can represent this on the DAG by abolishing the arrows from the confounders to the treatment (in contrast to drawing a box around the confounder)
- In this simulated world we can run any analysis we want without considering confounding
- There are situations where this type of analysis is valid, whereas all conditioning-based methods, such as regression, are biased.
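Both confounding and the inverse-probability fix can be seen in a small simulation. This sketch uses the slides' Sex → Smoking example with effect sizes I made up; note that here the treatment probabilities are known because we simulated them, whereas in real data they would have to be estimated (e.g. by logistic regression):

```python
# Confounding by sex, and removing it with inverse probability of treatment weights.
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
sex = rng.binomial(1, 0.5, n)
p_smoke = 0.2 + 0.5 * sex                 # sex -> smoking: the backdoor arrow
smoke = rng.binomial(1, p_smoke)
p_y = 0.1 + 0.2 * smoke + 0.3 * sex       # true causal effect of smoking: +0.20
y = rng.binomial(1, p_y)

# Crude (confounded) risk difference: inflated, because smokers are more often male
# and sex independently raises the outcome risk.
crude = y[smoke == 1].mean() - y[smoke == 0].mean()

# Weight each observation by 1 / P(the treatment it actually got | sex).
# This simulates a world with no arrow from sex into smoking.
p_a = np.where(smoke == 1, p_smoke, 1 - p_smoke)
w = 1.0 / p_a
ipw = (np.average(y[smoke == 1], weights=w[smoke == 1])
       - np.average(y[smoke == 0], weights=w[smoke == 0]))

print(crude)   # around 0.35: backdoor path open
print(ipw)     # around 0.20: backdoor path abolished
```

Stratifying on sex and averaging the within-stratum differences would close the backdoor path here too; the weighting approach matters in the time-dependent settings discussed later.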
Controlling for Confounding
- Before you choose to control for a variable, make sure it actually is a confounder
- If you control for something that is not a confounder, you can introduce bias
- For example, this can happen if you control for a causal intermediate

Controlling for Confounding
- Make sure you never control for anything that is causally downstream from the exposure
- For example, in this situation, the investigators want to find the effect of eating at McDonalds on the risk of heart attacks. They have controlled for BMI
- This introduces bias by blocking part of the effect we are interested in

M-Bias
- Just because a variable is pre-treatment and correlated with the outcome does not make it safe to control for
- In fact, sometimes controlling for a pre-treatment variable introduces bias

M-Bias
- Consider the following DAG: you want to estimate the effect of smoking on cancer
- Should you control for coffee drinking or not?

Selection Bias
- Selection bias occurs when the investigators have accidentally conditioned on a collider

Selection Bias
- Imagine you are interested in the effect of socioeconomic status on cancer
- Since it is easier to get an exact diagnosis at autopsy, you decided to only enroll people who had an autopsy in your study
- This means you are looking at the effect within a single level of the autopsy variable: "Autopsy = 1"
- The variable has been conditioned on
- People of low socioeconomic status are less likely to have an autopsy
- People with cancer are also less likely to have an autopsy

Selection Bias
- There is now an open pathway from Socioeconomic Status to Cancer with a backward arrow:
- Socioeconomic Status → Autopsy ← Cancer

Evaluating a Scientific Paper
- If you are given a paper, and you want to know if the claims are likely to be true:
  1. First, make sure they are addressing a well-defined causal question
  2. Look at the analysis section and determine what map the authors have of the data generating mechanism
  3. Ask yourself whether you think the implied map captures the important features of the territory:
     - Is there confounding that has not been accounted for?
     - Did the authors accidentally condition on any variables to cause selection bias?

Evaluating a Scientific Paper
- Example: Prof Yudkowsky wants to estimate the effect of reading HPMOR on the probability of defeating dark lords
- He controls for sex

Evaluating a Scientific Paper
1. Draw the DAG that Prof Yudkowsky had in mind when he conducted this analysis
2. Do you think this DAG captures the most important aspects of the territory?

Time-Dependent Confounding
- In many situations, exposure varies with time
- We can picture this as having an exposure variable for every time point, labelled A0, A1, A2, etc.
- There may also be time-dependent confounding by L0, L1 and L2

Time-Dependent Confounding
- On this graph, L1 confounds the effect of A1 on Y
- However, it is also on the causal pathway from A0 to Y
- Do you control for it or not?

Time-Dependent Confounding
- In this situation, all stratification-based approaches, such as regression or matching, are biased.
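The dilemma can be made concrete with a toy linear simulation (all structure and coefficients here are my own choices, not from the slides): L1 is caused by A0, confounds A1, and affects Y. Regression then faces an impossible choice, because adjusting for L1 is necessary for A1 but harmful for A0:

```python
# Time-dependent confounding: no single regression gets both A0 and A1 right.
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
A0 = rng.binomial(1, 0.5, n)                    # earlier treatment (randomized here)
L1 = 1.0 * A0 + rng.normal(size=n)              # covariate affected by A0
A1 = 0.8 * L1 + rng.normal(size=n)              # later treatment, confounded by L1
Y = 0.5 * A0 + 1.0 * L1 + 0.7 * A1 + rng.normal(size=n)
# True total effect of A0 on Y: 0.5 + 1.0*1.0 + 1.0*0.8*0.7 = 2.06

def coef(y, *cols):
    """OLS coefficient on the first listed column (with intercept)."""
    X = np.column_stack([np.ones(len(y)), *cols])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

print(coef(Y, A0))       # ~2.06: A0 needs no adjustment (it was randomized)
print(coef(Y, A0, L1))   # ~0.50: boxing L1 blocks part of the effect of A0
print(coef(Y, A1))       # ~1.31: without L1, the effect of A1 is confounded
print(coef(Y, A1, L1))   # ~0.70: L1 is needed to deconfound A1
```

So any one regression that handles the whole treatment history is stuck: it must both condition on L1 and not condition on it.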
- This is because these methods put a box around the variable L1, blocking part of the effect we are studying
- Methods for controlling for confounding that do not rely on conditioning on L1 are still valid
- This includes inverse probability weighting (marginal structural models), the parametric g-formula, and g-estimation

Time-Dependent Confounding
- Time-dependent confounding is very common in real data-generating mechanisms
- Consider the following scenario:
  - If I don't take my pills this year, my health is likely to decrease next year
  - If my health has decreased next year, I am more likely to take my pills next year
  - Health predicts my risk of death
- In this situation, it is impossible to obtain the effect of the pills on mortality without using generalized (non-stratification-based) methods
- This is true whenever there is a feedback loop like the one described here

Time-Dependent Confounding
- There are many alternative "causal" models that do not recognize time-dependent confounding
- These models work fine if exposure is truly something that does not vary with time
- However, that is very rarely the case
- If we are not trained to draw maps that recognize this important feature of the territory, we will end up assuming that it does not exist
- This is often a bad assumption

Further Reading
- If you are a mathematician or computer scientist, and want a very formal understanding of the theory: Judea Pearl. Causality: Models, Reasoning, and Inference. (Cambridge University Press, 2000)
- If you are not a mathematician, but want to understand how to apply causal methods to analyze observational data: Miguel Hernan and James Robins. Causal Inference. (Chapman & Hall/CRC, 2013)
- Most of the book is available for free at http://www.hsph.harvard.edu/miguelhernan/causal-inference-book/