Research Seminar, Special Topic: Causal Inference in the Social Sciences

Statistics 711, Section 01 (Susan Murphy, samurphy@umich.edu)
Sociology 897, Section 02 (Yu Xie, yuxie@psc.isr.umich.edu)
Education 737, Section 02 (Steve Raudenbush, rauden@umich.edu)

Week 1 (January 6)
Introduction

Causal Inference Seminar, Class 1

I. Causal Questions

A causal question is a simple question involving only two theoretical concepts: a cause and an effect. The question is

    Cause => Effect?

Note that a cause is something we can manipulate. (Example from thermodynamics.)

Let us ignore measurement issues and represent the two concepts by two variables, X and Y. The question is

    X => Y?

How do we ascertain it?

Four examples in social science:
(1) Does pre-marital cohabitation decrease or increase the likelihood of divorce?
(2) Is it better to have more siblings or fewer siblings for educational attainment?
(3) Will children who attend Head Start programs have better life chances?
(4) Is the earnings return to college education overestimated?

Naïve Method: Simple Comparisons

That is, compare units of analysis that experience X with those that do not experience X.

Say that in a poor community, N1 children attended Head Start and N2 did not. 27 years later, you measure the educational attainment of the two groups: y1 for those who attended Head Start and y2 for those who did not. We can compute

    E(y1 - y2) = E(y1) - E(y2) = 13 - 14 = -1.

Should we conclude from this that Head Start has a negative effect on educational attainment? No. There are possible biases due to selectivity: those who did not go to Head Start tended to come from higher-SES families, and those from higher-SES families tend to have higher educational attainment.

Graphically, the true causal model could be:

    SES --(-)--> HS
    SES --(+)--> Edu
    HS  --(+)--> Edu

The appropriate research question is not to compare observed y1 and observed y2.

II.
Causal Effect as a Counter-Factual Question

Rather, causal inference asks the counter-factual question: for those who attended Head Start, what would have happened to them if they had not attended? Or, for those who did not attend Head Start, what would have happened to them if they had attended? The same applies to the cohabitation example, the sibling example, and the college return example.

III. Selectivity Bias

Thus, the most difficult problem for causal inference is selectivity bias. That is, how can we be assured that those who experience X are comparable to those who did not? The issue of comparability is one of potential omitted variable bias.

Let us take a look at a numerical example. The following table presents data on admissions to the Graduate School at the University of California-Berkeley:

    Admission by Sex of Applicants (row proportions, %)

    Sex of Applicants    Admitted: Yes    Admitted: No    Total
    Men                             44              56      100
    Women                           35              65      100

What factors might explain the sex difference? Discrimination? The reason is sex segregation across majors.

    Conditional Relationship between Sex and Admission for the Six Largest Majors

                    Men                        Women
    Major    No. Appl.   % Admitted     No. Appl.   % Admitted
    A              825           62           108           82
    B              560           63            25           68
    C              325           37           593           34
    D              417           33           375           35
    E              191           28           393           24
    F              373            6           341            7

Within each major, the admission rates of men and women are similar; most women applied to majors C through F, where admission rates are low for both sexes.

The causal chain is: Sex --> Major --> Admission. The effect of Sex on Admission is only indirect.

It is important to recall the following principle: for a potential variable (Z) to introduce bias to (or distort) the causal relationship between X and Y, two conditions must be satisfied:
(1) Z must affect Y (relevance condition).
(2) Z must be correlated with X (confounding condition).

It will be useful to remember these two conditions throughout the semester.

IV. Two Types of Selectivity Bias

1. Observable Selectivity

If subjects who experience X and those who do not are different in observed characteristics, this type of selectivity is called observable selectivity.
This problem can be handled by statistical controls that make the two groups comparable.

2. Unobservable Selectivity

The more difficult problem is to deal with selectivity in unmeasured characteristics. This could be true in the examples of Head Start, cohabitation, college return, and the sibship effect. This problem is also called the "endogeneity problem": the occurrence of X is endogenous to Y.

Potentially, at least, the researcher faces both types of selectivity whenever data are collected from an observational design.

IV. Experimental Approach

The two conditions about omitted variables give rise to two approaches: the experimental approach and the structural approach. The experimental approach breaks the confounding between X and all possible Z through randomization. The structural approach models all relevant Z that are confounded with X, so that the effect of X is seen within each level of Z.

The experimental approach takes care of both (observable and unobservable) types of selectivity. The structural approach can handle the first type of selectivity well and is quite limited with the second.

As pointed out by Manski and Garfinkel (1992), experimental designs suffer from shortcomings that are often overlooked. For example, researchers and policy makers are rarely concerned with the impact of feedback and changing environments when results from an experimental study are extended to the whole population. As a complement, Manski and Garfinkel recommend the continued use of the "structural" approach, i.e., modeling causal processes statistically based on observational data.

Manski and Garfinkel identified a major shortcoming of the experimental approach: we cannot always extrapolate results from an experimental setting to a natural setting.
This is commonly referred to as lacking "external validity."

On p. 17, Manski and Garfinkel argue, "In fact, reduced-form experimental evaluation actually requires that a highly specific and suspect structural assumption hold: Individuals and organizations must respond in the same way to the experimental version of a program as they would to the actual version."

In this paper, Manski and Garfinkel (intentionally) do not distinguish between external validity and internal validity. They believe that both types of invalidity derive from the same source, selection bias.

V. Structural Approach

Manski and Garfinkel refer to statistical methods that model observed or unobserved heterogeneity as the "structural approach." These methods are usually based on observational data, with strong theoretical assumptions about how the world works.

Difference between structural and reduced-form equations:

Definition: Exogenous variables are variables that are used only as independent variables in all equations. Endogenous variables are variables that are used as dependent variables in some equations and may be used as independent variables in other equations.

1. Structural Equations

Structural equations are theoretically derived equations that often have endogenous variables as independent variables.

2. Reduced Forms

Reduced-form equations are equations in which all independent variables are exogenous variables. In other words, in reduced-form equations, we purposely ignore intermediate (or relevant) variables. With reduced-form equations, only total effects are obtained.

We will discuss structural equation models in more detail.

VI. Comparison of the Two Approaches

1. Advantages of the Structural Approach:

(1) Since it is conducted in a natural setting, its findings are directly relevant to the whole population. In contrast, results from an experimental design need to be extrapolated.
(2) It is less costly. In contrast, experimental research is very expensive.
(3) It builds upon and contributes to theory. Since the structural approach needs strong theory, we gain a better theoretical understanding of the processes. The reduced-form approach only yields simple answers to simple questions.

2. Advantages of the Reduced-Form Approach:

(1) Endogeneity bias can be eliminated through randomization.
(2) It requires fewer assumptions.
(3) It does not require complicated statistical models that the public and government officials have difficulty understanding.
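
Illustration (not part of the original notes). The contrast between the two approaches can be sketched with a small simulation. This is a minimal sketch under stated assumptions: the function name simulate, the coefficients, the assignment probabilities, and the true effect of +1 are all hypothetical and chosen only to mimic the Head Start example, with a single confounder Z standing in for family SES. Z lowers the chance of receiving X and raises Y, so it satisfies both the relevance and the confounding conditions.

```python
import random
import statistics


def simulate(n=20000, randomized=False, true_effect=1.0, seed=0):
    """Return (naive, adjusted) estimates of the effect of X on Y.

    All numbers are hypothetical. Z plays the role of family SES.
    """
    rng = random.Random(seed)
    # Outcomes y, grouped by (stratum of Z, treatment X).
    cells = {(s, x): [] for s in ("low", "high") for x in (0, 1)}

    for _ in range(n):
        z = rng.gauss(0.0, 1.0)  # confounder Z
        stratum = "low" if z < 0 else "high"
        if randomized:
            # Experimental approach: a coin flip, unrelated to Z.
            p_attend = 0.5
        else:
            # Observational data: low-Z units select into X more often.
            p_attend = 0.8 if stratum == "low" else 0.2
        x = 1 if rng.random() < p_attend else 0
        # Z raises Y (relevance); X has a true effect of +true_effect.
        y = 12.0 + 2.0 * z + true_effect * x + rng.gauss(0.0, 1.0)
        cells[(stratum, x)].append(y)

    # Naive method: simple comparison of the two observed groups.
    treated = cells[("low", 1)] + cells[("high", 1)]
    control = cells[("low", 0)] + cells[("high", 0)]
    naive = statistics.mean(treated) - statistics.mean(control)

    # Statistical control: compare within each level of Z, then average.
    within = [
        statistics.mean(cells[(s, 1)]) - statistics.mean(cells[(s, 0)])
        for s in ("low", "high")
    ]
    adjusted = statistics.mean(within)
    return naive, adjusted


naive_obs, adjusted_obs = simulate(randomized=False)
naive_rct, adjusted_rct = simulate(randomized=True)
print("observational: naive =", round(naive_obs, 2), "adjusted =", round(adjusted_obs, 2))
print("randomized:    naive =", round(naive_rct, 2), "adjusted =", round(adjusted_rct, 2))
```

In this sketch the naive observational comparison has the wrong sign, while both the within-stratum comparison and the randomized comparison come out close to the assumed true effect. The within-stratum comparison works here only because Z is observed and, by construction, selection depends on nothing else; if Z were unmeasured, only the randomized design would remove the bias.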