Slides for Session #22

Statistics for Social and Behavioral Sciences
Part IV: Causality
Association and Causality
Session 22
Prof. Amine Ouazad
Statistics Course Outline
PART I. INTRODUCTION AND RESEARCH DESIGN
Week 1
Four Steps of “Thinking Like a Statistician”
Study Design: Simple Random Sampling, Cluster Sampling, Stratified Sampling
Biases: Nonresponse bias, Response bias, Sampling bias
PART II. DESCRIBING DATA
Weeks 2-4
Sample statistics: Mean, Median, SD, Variance, Percentiles, IQR, Empirical Rule
Bivariate sample statistics: Correlation, Slope
PART III. DRAWING CONCLUSIONS FROM DATA: INFERENTIAL STATISTICS
Weeks 5-9
Estimating a parameter using sample statistics. Confidence Interval at 90%, 95%, 99%
Testing a hypothesis using the CI method and the t method.
PART IV. CORRELATION AND CAUSATION: TWO GROUPS, REGRESSION ANALYSIS (now!)
Weeks 10-14
Multivariate regression
Coming up
• “Comparison of Two Groups”
Last week.
• “Univariate Regression Analysis”
Last Saturday. (Section 9.5)
• “Association and Causality: Multivariate Regression”
Today, Monday, Tuesday. Chapters 10 and 11.
• “Randomized Experiments and ANOVA”.
Wednesday. Chapter 12.
• “Robustness Checks and Wrap Up”.
Last Thursday.
Outline
1. Correlation and Causation
2. Multiple Causes
Partly Spurious Association
Spurious Association
Chain Relationship
3. Interaction
Next time: Multivariate regression
What causes crime?
• National Neighborhood Crime Study (2002),
Peterson, Ruth D., and Krivo, Lauren J. Ohio State
University.
N = 6,935 neighborhoods.
• Crime data from local police departments, and the
Federal Bureau of Investigation.
• Total crime rate per 1,000 residents.
• Number of police officers. Ethnicity of police officers.
• Demographics of the neighborhood: poverty,
unemployment rate, education.
Regression of Crime Rate on the Unemployment Rate
• y : total crime per 1,000 residents.
• x : unemployment rate from 0 to 100.
Causation Matters
If the true relationship between X and Y is described by X -> Y:
• Changing or manipulating X will affect Y.
• Example:
– If Poverty -> Crime, then addressing poverty (e.g. war on poverty, food stamps, welfare programs) will lower crime.
– If CO2 emissions -> Global average temperature, then reducing CO2 emissions (e.g. through policies such as the Kyoto Protocol) will lower global temperature.
– If Shoe size -> Literacy, then changing shoe size will affect literacy! Nonsense.
– If Hepatitis B vaccination -> Autism, then reducing vaccination rates will lower the incidence of autism.
True Model vs Statistical Model
• X -> Y is your statistical model.
• But the true model may be different:
1. Order is wrong.
• Y causes X instead of X causing Y.
2. Multiple causes.
• X may not be the most practically significant determinant of Y.
3. Spurious association.
• X may not cause Y at all.
4. Chain relationship.
• The impact of X on Y may be mediated by another variable X2.
5. Interaction.
• The impact of X on Y may depend on the value of another variable X2.
Order is wrong?
True model: Y -> X
Statistical model: X -> Y
• Regression suggests that having more police officers per 10,000 residents leads to a higher crime rate per capita!?
• Beware of software and formulas. Use them wisely.
Multiple Causes
• Acknowledge that crime (Y) may be caused by
a series of factors:
True model: X1, X2, X3, …, XK -> Y
Multiple Causes
• Acknowledge that the variable X1 that you were
focused on may not be the most practically
significant variable that determines Y.
• Crime: finding the most important determinants
of crime.
– Education? Poverty? Unemployment? Female-headed households? Ethnicity of police officers? Number of police officers per 10,000 residents? Incarceration rate?
From Univariate to Multivariate
• Univariate regression:
True model: y = a + b x1 + e
Statistical model: y = a + b x1 + e
with E(y|x1) = a + b x1 and SD(y|x1) = SD(e).
• Multivariate regression:
True model: y = a + b1 x1 + b2 x2 + b3 x3 + e
Statistical model: y = a + b1 x1 + b2 x2 + b3 x3 + e
with E(y|x1,x2,x3) = a + b1 x1 + b2 x2 + b3 x3 and SD(y|x1,x2,x3) = SD(e).
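As a rough illustration of the two models above, the following Python sketch simulates data from a three-variable true model and fits both the univariate and the multivariate statistical model by least squares. The data are synthetic, the coefficient values (1, 2, 3, -1) are arbitrary assumptions, and `ols` is a hypothetical helper built on NumPy's `np.linalg.lstsq`; it is not the course's Stata workflow.

```python
import numpy as np

def ols(y, *regressors):
    """Least-squares estimates [a, b1, b2, ...] for y = a + b1 x1 + b2 x2 + ... + e."""
    X = np.column_stack([np.ones(len(y)), *regressors])  # first column is the intercept
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(0)   # fixed seed so the run is reproducible
n = 5000

# Synthetic data (assumed values): x1 is correlated with x2, x3 is independent.
x2 = rng.normal(size=n)
x1 = 0.8 * x2 + rng.normal(size=n)
x3 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 3.0 * x2 - 1.0 * x3 + rng.normal(size=n)

uni = ols(y, x1)             # statistical model: y = a + b x1 + e
multi = ols(y, x1, x2, x3)   # statistical model: y = a + b1 x1 + b2 x2 + b3 x3 + e

print("univariate   b  =", round(uni[1], 2))
print("multivariate b1 =", round(multi[1], 2))
```

Because x1 and x2 are correlated in this simulation, the univariate slope absorbs part of the effect of x2 and overstates the coefficient of 2 used to generate the data; the multivariate fit should land close to the generating values.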
Including X2 may affect the coefficient b1 of X1
• Race has a negative statistically significant impact on the crime rate. Accounting for multiple variables avoids simplistic statements!
Partly Spurious Association between X1 and Y
True model: X2 -> X1, X2 -> Y, X1 -> Y
Statistical model: X1 -> Y
• The statistical model does not include X2.
• When including X2 in the regression, the effect of X1 is lower in magnitude.
• X2 has both a direct effect on Y and an indirect effect on Y through X1.
Spurious Association
True model: X2 -> X1, X2 -> Y
Statistical model: X1 -> Y
• A statistically significant slope coefficient b does not mean that X1 causes Y.
• Another factor X2 may be causing both X1 and Y.
• When including X2 in the regression, the effect of X1 on Y is no longer statistically significant.
Shoe size and Literacy
True model: Age -> Shoe size, Age -> Literacy
Statistical model: Shoe size -> Literacy
• Sample of N children from age 5 to age 16.
• Literacy measured in the Early Childhood Longitudinal Study.
• Including age in the regression will likely render the coefficient of shoe size statistically non-significant.
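The shoe size example can be checked with a small simulation. The sketch below is only an illustration under made-up assumptions: the children, the shoe sizes, and the literacy scores are all synthetic (this is not the Early Childhood Longitudinal Study data), and `slopes` is a hypothetical helper. In the simulated world, age drives both shoe size and literacy, and shoe size has no effect of its own.

```python
import numpy as np

rng = np.random.default_rng(1)   # fixed seed so the run is reproducible
n = 5000

# Synthetic children aged 5 to 16 (all numbers are made up for illustration).
age = rng.uniform(5, 16, size=n)
shoe = 20 + 1.5 * age + rng.normal(0, 1, size=n)   # age -> shoe size
literacy = 5 * age + rng.normal(0, 5, size=n)      # age -> literacy; shoe size plays no role

def slopes(y, *regressors):
    """Least-squares slope coefficients (intercept dropped) of y on the regressors."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1:]

(b_shoe_alone,) = slopes(literacy, shoe)      # statistical model: shoe size -> literacy
b_shoe, b_age = slopes(literacy, shoe, age)   # statistical model that also includes age

print("shoe size alone:    ", round(b_shoe_alone, 2))
print("shoe size, with age:", round(b_shoe, 2))
```

Regressing literacy on shoe size alone gives a clearly positive slope, while the coefficient of shoe size should fall to roughly zero once age enters the regression.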
Correct approach
[Diagram: the true model and the statistical model now both include X1, X2, and Y]
• Make the true model and the statistical model
coincide.
• Regress Y on both X1 and X2.
• Include all determinants of crime in the regression.
What makes a good school?
[Diagram: the true model relates Funding, Teacher quality, and Student test score; the statistical model is Funding -> Student test score]
• Researchers had found that school funding is positively correlated (statistically significant and positive r and b) with student test scores.
• But once measures of teacher quality are included in the regression, the amount of money a school spends has no statistically significant impact on student test scores.
Chain Relationship
True model: X1 -> X2 -> Y
Statistical model: X1 -> Y
• X1 causes Y, but the effect of X1 on Y is entirely due to its effect on X2.
• When not including X2 in the regression, the coefficient of X1 is statistically significant.
• When including X2 in the regression, the coefficient of X1 is not statistically significant.
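A chain relationship can be simulated in the same spirit. The sketch below assumes a synthetic chain in which x1 affects y only through x2; the coefficients 2 and 3 are arbitrary choices for illustration, and `fit` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(2)   # fixed seed so the run is reproducible
n = 5000

# Synthetic chain x1 -> x2 -> y (coefficients 2 and 3 are arbitrary assumptions).
x1 = rng.normal(size=n)
x2 = 2.0 * x1 + rng.normal(size=n)
y = 3.0 * x2 + rng.normal(size=n)

def fit(y, *regressors):
    """Least-squares estimates [a, b1, b2, ...] with an intercept."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

b1_without_x2 = fit(y, x1)[1]         # picks up the total effect, about 2 * 3 = 6
_, b1_with_x2, b2 = fit(y, x1, x2)    # x1 has no effect left once x2 is held fixed

print("b1 without x2:", round(b1_without_x2, 2))
print("b1 with x2:   ", round(b1_with_x2, 2))
```

Note that the regression output here looks much like the spurious association case: the coefficient of X1 shrinks toward zero once X2 is included. Telling a chain from a common cause relies on reasoning about the causal ordering of the variables, not on the regression itself.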
Interaction
True model: X1 -> Y, with the strength of the effect depending on X2
Statistical model: X1 -> Y
• X2 affects how X1 causes Y.
• For instance, unemployment causes crime, but the impact is much lower in neighborhoods that have a higher income.
• When not accounting for X2, the coefficient of X1 measures the average impact of X1 on Y.
Accounting for the Interaction of X1 and X2
• Include both X2 and the product of X1 and X2 in the regression.
Model: y = a + b1 x1 + b2 x2 + b3 x1*x2 + e
• If b3 is positive, the impact of x1 on y is larger the
higher the value of x2.
• If b3 is negative, the impact of x1 on y is smaller the
higher the value of x2.
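In the interaction model, a one-unit increase in x1 changes the expected value of y by b1 + b3 x2, which is why the sign of b3 determines whether the impact grows or shrinks with x2. The sketch below illustrates this on synthetic data with an assumed negative b3 of -0.15; all values are made up, and `effect_of_x1` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(3)   # fixed seed so the run is reproducible
n = 5000

# Synthetic data (assumed values): the effect of x1 on y shrinks as x2 grows.
x1 = rng.uniform(0, 10, size=n)
x2 = rng.uniform(0, 10, size=n)
y = 5.0 + 2.0 * x1 + 1.0 * x2 - 0.15 * x1 * x2 + rng.normal(size=n)

# Regress y on x1, x2, and the product x1*x2.
X = np.column_stack([np.ones(n), x1, x2, x1 * x2])
(a, b1, b2, b3), *_ = np.linalg.lstsq(X, y, rcond=None)

def effect_of_x1(x2_value):
    """Estimated change in y per one-unit increase in x1, holding x2 at x2_value."""
    return b1 + b3 * x2_value

print("b3 =", round(b3, 3))
print("effect of x1 at x2 = 2:", round(effect_of_x1(2), 2))
print("effect of x1 at x2 = 8:", round(effect_of_x1(8), 2))
```

With a negative b3, the estimated effect of x1 at a high value of x2 comes out smaller than at a low value, matching the second bullet above.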
Accounting for the Interaction of unemployment and income
• Here, b3 is negative!
• T_HINC75: percentage in neighborhood with high income.
Wrap up
• Know the difference between the true model and the statistical model.
• Learn how to perform a multivariate regression in Stata.
• Order X and Y correctly.
• Account for multiple causes.
• Account for spurious correlations.
• Account for chain relationships.
• Account for interactions.
Coming up:
• Schedule for next week:
• Chapter on “Association and Causality”, and “Multivariate Regression”.
• Last online quiz sent last night, due Sunday 9am.
• Make sure you come to sessions and recitations.
Sunday: Recitation
Monday: Evening session, 7.30pm, West Administration 002
Tuesday: Usual class, 12.45pm, usual room
Wednesday: Evening session, 7.30pm, West Administration 001
Thursday: Usual class, 12.45pm, usual room