Statistics for Social and Behavioral Sciences
Part IV: Causality
Association and Causality
Session 22
Prof. Amine Ouazad

Statistics Course Outline

PART I. INTRODUCTION AND RESEARCH DESIGN
Week 1
Four Steps of “Thinking Like a Statistician”
Study Design: Simple Random Sampling, Cluster Sampling, Stratified Sampling
Biases: Nonresponse bias, Response bias, Sampling bias

PART II. DESCRIBING DATA
Weeks 2-4
Sample statistics: Mean, Median, SD, Variance, Percentiles, IQR, Empirical Rule
Bivariate sample statistics: Correlation, Slope

PART III. DRAWING CONCLUSIONS FROM DATA: INFERENTIAL STATISTICS
Weeks 5-9
Estimating a parameter using sample statistics. Confidence Interval at 90%, 95%, 99%
Testing a hypothesis using the CI method and the t method.

PART IV. CORRELATION AND CAUSATION: TWO GROUPS, REGRESSION ANALYSIS
Weeks 10-14
Multivariate regression now!

Coming up
• “Comparison of Two Groups”. Last week.
• “Univariate Regression Analysis”. Last Saturday. (Section 9.5)
• “Association and Causality: Multivariate Regression”. Today, Monday, Tuesday. Chapters 10 and 11.
• “Randomized Experiments and ANOVA”. Wednesday. Chapter 12.
• “Robustness Checks and Wrap Up”. Last Thursday.

Outline
1. Correlation and Causation
2. Multiple Causes
   – Partly Spurious Association
   – Spurious Association
   – Chain Relationship
3. Interaction
Next time: Multivariate regression

What causes crime?
• National Neighborhood Crime Study (2002), Peterson, Ruth D., and Krivo, Lauren J. Ohio State University. N = 6,935 neighborhoods.
• Crime data from local police departments and the Federal Bureau of Investigation.
• Total crime rate per 1,000 residents.
• Number of police officers. Ethnicity of police officers.
• Demographics of the neighborhood: poverty, unemployment rate, education.

Regression of Crime Rate on the Unemployment Rate
• y : total crime per 1,000 residents.
• x : unemployment rate from 0 to 100.

Causation Matters
If the true relationship between X and Y is described by ….
X -> Y
• Changing (manipulating) X will affect Y.
• Examples:
  – If Poverty -> Crime, then addressing poverty (e.g. war on poverty, food stamps, welfare programs) will lower crime.
  – If CO2 emissions -> Global average temperature, then reducing CO2 emissions (e.g. through policies such as the Kyoto protocol) will lower global temperature.
  – If shoe size -> literacy, then changing shoe size will affect literacy! Nonsense.
  – If Hepatitis B vaccination -> autism, then reducing vaccination rates will lower the incidence of autism.

True Model vs Statistical Model
• X -> Y is your statistical model.
• But the true model may be different:
  1. Order is wrong.
     • Y causes X instead of X causing Y.
  2. Multiple causes.
     • X may not be the most practically significant determinant of Y.
  3. Spurious association.
     • X may not cause Y at all.
  4. Chain relationship.
     • The impact of X on Y may be mediated by another variable X2.
  5. Interaction.
     • The impact of X on Y may depend on the value of another variable X2.

Order is wrong?
True model: Y -> X
Statistical model: X -> Y
• Regression suggests that more police officers per 10,000 residents lead to a higher crime rate per capita!?!
• Beware of software and formulas. Use them wisely.

Multiple Causes
• Acknowledge that crime (Y) may be caused by a series of factors.
True model: X1, X2, X3, …, XK -> Y

Multiple Causes
• Acknowledge that the variable X1 that you were focused on may not be the most practically significant variable that determines Y.
• Crime: finding the most important determinants of crime.
  – Education? Poverty? Unemployment? Female-headed households? Ethnicity of police officers? Number of police officers per 10,000 residents? Incarceration rate?
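The “Order is wrong?” slide can be illustrated on simulated data. The sketch below is not taken from the course materials (the course uses Stata); it assumes a made-up world in which crime drives police staffing and police have no effect on crime. The function names and every number in it are hypothetical.

```python
import numpy as np


def ols_slope(x, y):
    """Slope b of the least-squares line y = a + b x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)


def simulate_reverse_causality(n=5000, seed=0):
    """Simulated neighborhoods where crime causes police staffing (Y -> X)."""
    rng = np.random.default_rng(seed)
    # Assumed true model: crime is determined by factors unrelated to police.
    crime = rng.normal(50.0, 10.0, size=n)
    # Assumed true model: cities send more officers where crime is higher.
    police = 5.0 + 0.3 * crime + rng.normal(0.0, 2.0, size=n)
    return police, crime


police, crime = simulate_reverse_causality()

# Statistical model with the order wrong: regress crime (y) on police (x).
slope = ols_slope(police, crime)
print(f"slope of crime on police: {slope:.2f}")
```

In this simulated world the slope of crime on police comes out positive even though police have no effect on crime at all. The regression only measures association; the direction of the arrow has to come from reasoning about the true model.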
From Univariate to Multivariate
• Univariate regression:
  True model: y = a + b x1 + e
  Statistical model: y = a + b x1 + e
  with E(y|x1) = a + b x1 and SD(y|x1) = SD(e).
• Multivariate regression:
  True model: y = a + b1 x1 + b2 x2 + b3 x3 + e
  Statistical model: y = a + b1 x1 + b2 x2 + b3 x3 + e
  with E(y|x1,x2,x3) = a + b1 x1 + b2 x2 + b3 x3 and SD(y|x1,x2,x3) = SD(e).

Including X2 may affect the coefficient b1 of X1
• Race has a negative, statistically significant impact on the crime rate.
Accounting for multiple variables avoids simplistic statements!

Partly Spurious Association between X1 and Y
True model: X2 -> X1, X2 -> Y, and X1 -> Y
Statistical model: X1 -> Y
• The statistical model does not include X2.
• When including X2 in the regression, the effect of X1 is lower in magnitude.
• X2 has both a direct effect on Y and an indirect effect on Y through X1.

Spurious Association
True model: X2 -> X1 and X2 -> Y
Statistical model: X1 -> Y
• A statistically significant slope coefficient b does not mean that X1 causes Y.
• Another factor X2 may be causing both X1 and Y.
• When including X2 in the regression, the effect of X1 should no longer be statistically significant.

Shoe size and Literacy
True model: Age -> Shoe size and Age -> Literacy
Statistical model: Shoe size -> Literacy
• Sample of N children from age 5 to age 16.
• Literacy measured in the Early Childhood Longitudinal Study.
• Including age in the regression will likely render the coefficient of shoe size not statistically significant.

Correct approach
True model: X1, X2 -> Y
Statistical model: X1, X2 -> Y
• Make the true model and the statistical model coincide.
• Regress Y on both X1 and X2.
• Include all determinants of crime in the regression.

What makes a good school?
True model (diagram): Funding, Teacher quality, Student test score
Statistical model: Funding -> Student test score
• Researchers had found that school funding is positively correlated (statistically significant and positive r and b) with student test scores….
• But when including measures of teacher quality, the amount of money a school spends has no statistically significant impact on student test scores.

Chain Relationship
True model: X1 -> X2 -> Y
Statistical model: X1 -> Y
• X1 causes Y, but the effect of X1 on Y is entirely due to its effect on X2.
• When not including X2 in the regression, the coefficient of X1 is statistically significant.
• When including X2 in the regression, the coefficient of X1 is not statistically significant.

Interaction
True model: X1 -> Y, with X2 changing the strength of the X1 -> Y effect
Statistical model: X1 -> Y
• X2 affects how X1 causes Y.
• For instance, unemployment causes crime, but the impact is much lower in neighborhoods that have a higher income.
• When not accounting for X2, the coefficient of X1 measures the average impact of X1 on Y.

Accounting for the Interaction of X1 and X2
• Include both X2 and the product of X1 and X2 in the regression.
  Model: y = a + b1 x1 + b2 x2 + b3 x1*x2 + e
• If b3 is positive, the impact of x1 on y is larger the higher the value of x2.
• If b3 is negative, the impact of x1 on y is smaller the higher the value of x2.

Accounting for the Interaction of unemployment and income
• Here, b3 is negative!
• T_HINC75: percentage in neighborhood with high income.

Wrap up
• Know the difference between the true model and the statistical model.
• Learn how to perform a multivariate regression in Stata.
• Order X and Y correctly.
• Account for multiple causes.
• Account for spurious correlations.
• Account for chain relationships.
• Account for interactions.

Coming up:
• Chapter on “Association and Causality”, and “Multivariate Regression”.
• Last online quiz sent last night, due Sunday 9am.
• Make sure you come to sessions and recitations.
• Schedule for next week:
Sunday: Recitation
Monday: Evening session, 7.30pm, West Administration 002
Tuesday: Usual class, 12.45pm, usual room
Wednesday: Evening session, 7.30pm, West Administration 001
Thursday: Usual class, 12.45pm, usual room
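The spurious-association and interaction slides can be sketched on simulated data. This is a minimal illustration in Python with NumPy rather than the Stata workflow used in the course. All data are simulated and every coefficient is an assumption chosen for the example; none of it comes from the National Neighborhood Crime Study or the Early Childhood Longitudinal Study. The helper `ols` is a hypothetical name.

```python
import numpy as np


def ols(y, *regressors):
    """Least-squares coefficients [a, b1, b2, ...] of y on the regressors."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


rng = np.random.default_rng(0)
n = 5000

# --- Spurious association: age (X2) causes both shoe size (X1) and literacy (Y).
age = rng.uniform(5, 16, size=n)
shoe = 15.0 + 1.0 * age + rng.normal(0.0, 1.0, size=n)      # no link to literacy
literacy = 10.0 + 5.0 * age + rng.normal(0.0, 5.0, size=n)  # depends on age only

b_shoe_alone = ols(literacy, shoe)[1]          # statistical model omits age
b_shoe_with_age = ols(literacy, shoe, age)[1]  # statistical model includes age
print(f"shoe size coefficient, age omitted:  {b_shoe_alone:.2f}")
print(f"shoe size coefficient, age included: {b_shoe_with_age:.2f}")

# --- Interaction: the impact of unemployment (x1) depends on high income (x2).
unemp = rng.uniform(0, 30, size=n)     # unemployment rate, percent
high_inc = rng.uniform(0, 50, size=n)  # percent of residents with high income
crime = (20.0 + 2.0 * unemp - 0.1 * high_inc
         - 0.03 * unemp * high_inc     # assumed negative interaction (b3 < 0)
         + rng.normal(0.0, 5.0, size=n))

# Model from the slides: y = a + b1 x1 + b2 x2 + b3 x1*x2 + e
a, b1, b2, b3 = ols(crime, unemp, high_inc, unemp * high_inc)


def unemployment_effect(h):
    """Estimated impact of one more point of unemployment when x2 = h."""
    return b1 + b3 * h


print(f"b3 (interaction): {b3:.3f}")
print(f"impact of unemployment at  0% high income: {unemployment_effect(0):.2f}")
print(f"impact of unemployment at 40% high income: {unemployment_effect(40):.2f}")
```

With age omitted, shoe size appears to have a large effect on literacy; once age is included, its coefficient falls close to zero. In the interaction part, the estimated b3 is negative, so the impact of unemployment on crime, b1 + b3 x2, shrinks as the share of high-income residents rises.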