BASIC DATA ANALYSIS AND STATISTICS
R. SHAPIRO
American University in Cairo
June 3-6, 2012
• Motivation, Intuition, and Numerology (AUCShapiroPresent1.ppt)
• Exploring Theories: Bivariate Analysis
• Multivariate Models (Regression Approaches)
• Limited Dependent Variables (dichotomous variables) and Interactions
• Survey Research: Issues and Sources of Error
• Identifying Causal Mechanisms and Time Series Analysis
• Using “Instruments” to Identify Causal Effects
Exploring Theories: Bivariate analysis. “Correlation is
not causation!” But you have to start somewhere....
First Steps
• Centrality of causal theorizing. Dependent and
independent variable(s)? Unit of analysis? Generalizing
to what universe/population? Assumption of
unidirectional causation (revisited later)?
X -------- > Y
e.g.,
Democracy -----------> Income (of countries)
Education -----------> Income (of individuals)
• Plausibility of theory? Causal mechanism/story?
Next Steps in Quantitative Research
• Measurement of variables (ideally at the designated
unit of analysis). “Validity” and “reliability” of
measures?
• Hypothesis specification (for measures); expected
covariation/correlation?
• Statistical evidence of covariation/correlation?
• Rejecting the null hypothesis? Substantive versus
“statistical significance”?
• Next steps? Statistical controls, multivariate analysis,
to be continued…. Strengthening causal inferences.
Questions at the Statistical Analysis Stage?
• “Level of measurement” for the measures of the
dependent and independent variables: Categorical
or Continuous? (Further distinction of “nominal,”
“ordinal,” “interval” or “ratio” level variables.)
• The preferred statistical method depends on the
level of measurement of the variables!
• Motivation to put everything into a regression
analysis framework – for later multivariate analysis.
• The Bivariate Regression approach. Case of
Income -------> Test score of individuals
Bivariate Ordinary Least Squares Regression
(OLS)
• Case of: Income -------> Test scores of individuals
• The regression line takes the form of
Predicted Y = intercept + slope (X) or
Predicted Y = a + bX, where “a” and “b” take on the
unique numeric values that minimize the average
vertical distances (by minimizing the squared
distances) between all the points and the regression
line. To the extent Y and X are linearly related in this
way, the regression line falls much closer to all the
points than does the line through the mean of Y.
• Minimize, over all cases, the sum of $(Y - \text{Predicted } Y)^2$.
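The least-squares idea above can be sketched in a few lines of Python. This is only an illustration: the data are made up, and the helper names `ols_fit` and `sum_sq` are hypothetical, not part of the course materials (which use Stata). It computes "a" and "b" from the standard closed-form solution and compares the squared distances around the regression line with those around the mean of Y.

```python
from statistics import mean

def ols_fit(x, y):
    """Return (a, b) that minimize the sum of squared vertical distances."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # intercept
    return a, b

def sum_sq(y, predicted):
    """Sum of squared differences between observed and predicted values."""
    return sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))

# Made-up data: logged income (X) and test scores (Y)
income = [2.0, 2.5, 3.0, 3.5, 4.0]
score = [630.0, 645.0, 655.0, 670.0, 680.0]

a, b = ols_fit(income, score)
sse_line = sum_sq(score, [a + b * xi for xi in income])   # around the regression line
sse_mean = sum_sq(score, [mean(score)] * len(score))      # around the mean of Y
print(a, b, sse_line, sse_mean)
```

With these numbers the squared distances around the line are far smaller than those around the mean, which is the sense in which the line "falls much closer" to the points.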
Bivariate Scatterplot
Regression Line Versus the Mean: Idea of
“Explained Variance”
[Figure: California School District Test Score and Income. Scatterplot of test scores (vertical axis, 620 to 700) against (logged) average income (horizontal axis, 2.0 to 4.0), showing the regression line and the horizontal line at the mean of test scores.]
Linear regression lets us estimate the
slope of the population regression line
• Ultimately our aim is to estimate the
causal effect on Y of a unit change in X –
but for now, just think of the problem of
fitting a straight line to data on two
variables, Y and X.
• The slope of the population regression
line is the expected effect on Y of a unit
change in X.
The Population Linear Regression Model
Yi = β0 + β1Xi + ui, i = 1,…, n
• We have n observations, (Xi, Yi), i = 1,…, n.
• X is the independent variable or regressor
• Y is the dependent variable
• β0 = intercept
• β1 = slope
• ui = the regression error
The regression error consists of omitted factors. In general,
these omitted factors are other factors that influence Y, other
than the variable X. The regression error also includes error in
the measurement of Y.
The population regression model in a picture: Observations on Y and X (n = 7);
the population regression line; and the regression error (the “error term”):
The OLS estimator solves:
$$\min_{b_0, b_1} \sum_{i=1}^{n} \left[ Y_i - (b_0 + b_1 X_i) \right]^2$$
• The OLS estimator minimizes the average squared
difference between the actual values of Yi and the
prediction (“predicted value”) based on the
estimated line. That is, it minimizes the vertical
distances.
• This minimization problem can be solved using
calculus.
• The result is the OLS estimators of β0 and β1.
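As a sketch of the calculus step mentioned above (the standard textbook derivation, using the same symbols as the minimization problem; the hats and bars are notation added here for the estimates and the sample means): set the derivatives with respect to b0 and b1 to zero and solve.

```latex
\begin{aligned}
\frac{\partial}{\partial b_0} \sum_{i=1}^{n} \left[ Y_i - (b_0 + b_1 X_i) \right]^2
  &= -2 \sum_{i=1}^{n} \left[ Y_i - (b_0 + b_1 X_i) \right] = 0 \\
\frac{\partial}{\partial b_1} \sum_{i=1}^{n} \left[ Y_i - (b_0 + b_1 X_i) \right]^2
  &= -2 \sum_{i=1}^{n} \left[ Y_i - (b_0 + b_1 X_i) \right] X_i = 0 \\[4pt]
\hat{\beta}_1 &= \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2},
\qquad
\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}
\end{aligned}
```

The first condition says the residuals sum to zero, so the fitted line passes through the point of means.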
Application to the California Test
Score – Class Size data
• Estimated slope = –2.28
• Estimated intercept = 698.9
• Estimated regression line: TestScore = 698.9 – 2.28×STR
OLS regression: STATA output
regress testscr str, robust

Regression with robust standard errors          Number of obs =     420
                                                F(  1,   418) =   19.26
                                                Prob > F      =  0.0000
                                                R-squared     =  0.0512
                                                Root MSE      =  18.581

-------------------------------------------------------------------------
        |               Robust
testscr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------+----------------------------------------------------------------
    str |  -2.279808   .5194892    -4.39   0.000    -3.300945   -1.258671
  _cons |    698.933   10.36436    67.44   0.000     678.5602    719.3057
-------------------------------------------------------------------------
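To connect the columns of this output, here is a small Python sketch that recomputes the t statistic and an approximate 95% confidence interval for the `str` coefficient from the reported coefficient and robust standard error. One assumption to flag: it uses the large-sample normal critical value (about 1.96), whereas Stata uses the t distribution with 418 degrees of freedom, so the interval here is very slightly narrower than the one Stata reports.

```python
from statistics import NormalDist

coef = -2.279808      # coefficient on str, copied from the Stata output
std_err = 0.5194892   # robust standard error, copied from the Stata output

# t statistic: how many standard errors the estimate lies from zero
t_stat = coef / std_err

# Assumption: normal approximation for the critical value (about 1.96)
z = NormalDist().inv_cdf(0.975)
ci_low = coef - z * std_err
ci_high = coef + z * std_err

print(round(t_stat, 2), round(ci_low, 3), round(ci_high, 3))
```

The interval excludes zero, which is the same conclusion as the reported P>|t| of 0.000.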
Example of the $R^2$ and the SER
TestScore = 698.9 – 2.28×STR, $R^2$ = .05, SER = 18.6
STR explains only a small fraction of the variation in test scores. Does this
make sense? Does this mean that STR is unimportant in a policy sense?
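For readers who want to see how the two summary numbers are built, the following Python sketch computes R-squared and the standard error of the regression (SER) for a bivariate fit. The data are made up and the function name `fit_summary` is hypothetical; the SER uses the common n − 2 degrees-of-freedom convention, which is an assumption about the definition intended on the slide.

```python
from math import sqrt
from statistics import mean

def fit_summary(x, y):
    """Return (r_squared, ser) for the bivariate OLS regression of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
    a = y_bar - b * x_bar
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    ssr = sum(e ** 2 for e in residuals)           # unexplained variation
    tss = sum((yi - y_bar) ** 2 for yi in y)       # total variation around the mean
    r_squared = 1 - ssr / tss
    ser = sqrt(ssr / (len(y) - 2))                 # typical size of a residual
    return r_squared, ser

# Made-up data: student-teacher ratio (X) and test score (Y)
ratio = [14, 16, 18, 20, 22, 24]
score = [690, 640, 670, 650, 680, 630]

r_squared, ser = fit_summary(ratio, score)
print(round(r_squared, 3), round(ser, 1))
```

A low R-squared with a sizeable slope is possible, as in the slide: the two numbers answer different questions (how much variation is explained versus how much Y changes per unit of X).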
A real-data example from labor economics: average hourly
earnings vs. years of education (data source: Current Population
Survey):
Slope of the Regression Line, Variability Around It, and
the Correlation Coefficient
• Predicted Y = a + bX, where b is the slope.
• Correlation Coefficient, Pearson’s “r”, ranges
from -1 through 0 to +1, and is larger in size to the
extent that the observed data fall very close to
the regression line. The $r^2$ indicates, proportionately,
how much closer (vertically) the regression line falls
to the observed values of the dependent variable
than does the horizontal line through the mean of
the dependent variable. Why are both useful?
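One way to see why both the slope and the correlation are useful is a small Python sketch with two made-up data sets that share exactly the same slope but differ in how tightly the points hug the line. The function name `slope_and_r` and all numbers are illustrative assumptions.

```python
from math import sqrt
from statistics import mean

def slope_and_r(x, y):
    """Return (b, r): the OLS slope and Pearson's correlation coefficient."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sxx, sxy / sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
# Both series are 2*x plus noise that is unrelated to x; only the noise size differs.
tight = [2.1, 3.8, 6.2, 7.8, 10.1]   # small scatter around the line
noisy = [3.5, 1.0, 9.0, 5.0, 11.5]   # large scatter around the line

b_tight, r_tight = slope_and_r(x, tight)
b_noisy, r_noisy = slope_and_r(x, noisy)
print(round(b_tight, 2), round(r_tight, 3))
print(round(b_noisy, 2), round(r_noisy, 3))
```

The slope answers "how much does Y change per unit of X," while r answers "how closely do the points follow that line"; the second data set has the same slope but a much lower correlation.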
Correlation = 1
r = Correlation = .95
Same Slope (b) but Correlation = 0.75.
Implications? More variability? Why?
Correlation = -.50
No correlation
OLS can be sensitive to an outlier (also look
for non-linearity? discuss later?):
• Is the lone point an outlier in X or Y?
• In practice, outliers are often data glitches (coding or recording
problems). Sometimes (or more often?) they are observations
that really shouldn’t be in your data set. Plot your data!
The larger the variance of X, the smaller the
variance of the slope b
The number of black and blue dots is the same. Using which would
you get a more accurate regression line?
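The point about the variance of X can be checked with a small simulation. This Python sketch is an illustration under stated assumptions: the true intercept (1.0), true slope (2.0), noise level, and the two sets of X values are all made up. It repeatedly draws new Y values around the same true line and compares how much the estimated slope varies when X is tightly clustered versus widely spread.

```python
import random
from statistics import mean, stdev

def ols_slope(x, y):
    """OLS slope of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    return (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
            / sum((xi - x_bar) ** 2 for xi in x))

def simulate_slopes(x, reps=500, true_a=1.0, true_b=2.0, noise_sd=1.0, seed=0):
    """Re-estimate the slope on many samples drawn around the same true line."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(reps):
        y = [true_a + true_b * xi + rng.gauss(0, noise_sd) for xi in x]
        slopes.append(ols_slope(x, y))
    return slopes

narrow_x = [4.0 + 0.1 * i for i in range(21)]   # X clustered between 4 and 6
wide_x = [0.5 * i for i in range(21)]           # X spread between 0 and 10

slopes_narrow = simulate_slopes(narrow_x)
slopes_wide = simulate_slopes(wide_x)
sd_narrow = stdev(slopes_narrow)
sd_wide = stdev(slopes_wide)
print(round(sd_narrow, 3), round(sd_wide, 3))
```

Both sets have the same number of points, but the widely spread X values pin down the slope much more precisely.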
Analyzing Categorical Measures
• For categorical independent and dependent
variables: Cross Tabulation
• For a categorical independent variable and a
continuous dependent variable or a
categorical dependent variable that can be
treated as continuous: Compare Means on
the dependent variable.
• For a dichotomous dependent variable coded
0-1, the mean is the proportion of cases in the
1 category, so means on it can be compared!
Go to Stata example of standard
bivariate analysis, non-regression
• Crucial: Preparing Data -- Recoding; Dealing
with “Missing Values,” if any; etc.
• Go to PDF file, W4910x11 Bivariate Crosstabs
and Means Analysis. Examples from U.S.
Survey Data.
• On to a Regression Analysis framework next…
Moving to a regression framework for
categorical variables:
• Treating categorical variables as
continuous, if categories are “ordered”
(“ordinal” vs. “nominal” level variables).
• Special case of dichotomous variables.
(The mean of a 0-1 variable is the proportion
of cases in the “1” category; e.g., the average of 0, 0, 1, 1, 1 is .6.)
• Crucial bridge: “dummy variable
regression.” (And now for some comic relief,
normally done at a blackboard with chalk.)
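The "crucial bridge" can be made concrete with a short Python sketch. Everything here is made up for illustration (the group coding, the outcome values, and the helper name `ols_fit`). It shows that the mean of a 0-1 variable is a proportion, and that regressing Y on a 0-1 dummy reproduces a comparison of means: the intercept is the mean of the "0" group and the slope is the difference between the two group means.

```python
from statistics import mean

def ols_fit(x, y):
    """Return (a, b) for the bivariate OLS regression of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - b * x_bar, b

# Mean of a 0-1 variable = proportion of cases in the "1" category
proportion = mean([0, 0, 1, 1, 1])

# Made-up data: a dummy (0 = group A, 1 = group B) and a continuous outcome
group = [0, 0, 0, 1, 1, 1, 1]
outcome = [20, 30, 25, 40, 50, 45, 45]

mean_0 = mean(y for g, y in zip(group, outcome) if g == 0)
mean_1 = mean(y for g, y in zip(group, outcome) if g == 1)
a, b = ols_fit(group, outcome)
print(proportion, mean_0, mean_1, a, b)
```

Because the regression and the comparison of means give the same answer, the regression framework can absorb the categorical case and then be extended with control variables.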
Example Using U.S. Survey Data
and Stata Software
• Assumptions in treating ordinal variables
as continuous variables.
• Statistical versus Substantive
Significance? Variability. “Sampling
error”/confidence intervals. The
“standard error.”
• PDF file W4910x11 Bivariate Regression,
Dummy Variables.
Statistical Control: Understanding Multivariate
Models (Multiple Regression Analysis)
• Predicted Y = a + b1X1 + b2X2, where the b’s are
the coefficients for which the squared differences
between the observed Y’s and predicted Y’s
are at a minimum. In this case we have more b’s to
estimate to minimize the sum of $(Y - \text{Predicted } Y)^2$.
• It now also has the interpretations shown
below, beginning with the comparisons of
different possible scenarios for “conditional”
regressions—that hold one variable constant.
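A minimal Python sketch of what "holding one variable constant" does to a coefficient, using made-up data in which Y is built exactly as 1 + 2·X1 + 3·X2 and X1 and X2 are correlated. The helper names are hypothetical, and the two-regressor solution below is the standard normal-equations formula written out for this special case, not a general regression routine.

```python
from statistics import mean

def demean(v):
    m = mean(v)
    return [vi - m for vi in v]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def ols_two_regressors(x1, x2, y):
    """Return (a, b1, b2) minimizing the sum of squared residuals."""
    d1, d2, dy = demean(x1), demean(x2), demean(y)
    s11, s22, s12 = dot(d1, d1), dot(d2, d2), dot(d1, d2)
    s1y, s2y = dot(d1, dy), dot(d2, dy)
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    a = mean(y) - b1 * mean(x1) - b2 * mean(x2)
    return a, b1, b2

# Made-up data: X1 continuous, X2 a 0-1 variable that tends to be 1 when X1 is high
x1 = [1, 2, 3, 4, 5, 6]
x2 = [0, 0, 1, 0, 1, 1]
y = [1 + 2 * u + 3 * v for u, v in zip(x1, x2)]

a, b1, b2 = ols_two_regressors(x1, x2, y)

# Bivariate regression of Y on X1 alone, ignoring X2
d1, dy = demean(x1), demean(y)
b_simple = dot(d1, dy) / dot(d1, d1)
print(round(b_simple, 2), round(b1, 2), round(b2, 2))
```

Leaving X2 out inflates the apparent "effect" of X1 (2.6 instead of 2) because X1 picks up part of the influence of the correlated omitted variable.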
“Effect” of Region and Democracy on
Economic Growth (made up data)
• Predicted EG = a + b1(Democracy) +b2(Region),
where we think both democracy and region
have possible causal effects.
• Case of only two regions (1 and 2; Region is
coded 0-1), to illustrate a simple case of
Statistical Control/holding one var. constant.
• Linear equation assumes no “interaction”;
that is, “effect” of Democracy is the same in
Region 1 and 2 (and same for Region; but is
it?). There are different possibilities: (and comic relief)
(b) Interactions between
continuous and binary variables
Yi = β0 + β1Di + β2Xi + ui
• Di is binary—a dummy variable coded 0-1; X is continuous
• As specified above, the effect on Y of X (holding constant D) =
β2, which does not depend on D; that is, it is the same for D=0
and for D=1. But what if that is not the case???
• To allow the effect of X to depend on D, include the
“interaction term” Di×Xi as a regressor:
Yi = β0 + β1Di + β2Xi + β3(Di×Xi) + ui
Binary-continuous interactions, ctd.
Binary-continuous interactions: the
two regression lines
Yi = β0 + β1Di + β2Xi + β3(Di×Xi) + ui
Observations with Di = 0 (the “D = 0” group):
Yi = β0 + β2Xi + ui (the D = 0 regression line)
Observations with Di = 1 (the “D = 1” group):
Yi = β0 + β1 + β2Xi + β3Xi + ui
= (β0+β1) + (β2+β3)Xi + ui (the D = 1 regression line)
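The algebra above can be turned into a small Python helper that reads off the two group-specific lines from one set of coefficients. The coefficient values are hypothetical, chosen only to make the arithmetic easy to follow, and the function names are not from the course materials.

```python
def group_lines(b0, b1, b2, b3):
    """Map each value of D to the (intercept, slope) of its regression line,
    implied by Y = b0 + b1*D + b2*X + b3*(D*X)."""
    return {0: (b0, b2), 1: (b0 + b1, b2 + b3)}

def predict(b0, b1, b2, b3, d, x):
    """Predicted Y from the full interaction model."""
    return b0 + b1 * d + b2 * x + b3 * d * x

# Hypothetical coefficients: β0 = 10, β1 = 5, β2 = 2, β3 = -0.5
coefs = (10.0, 5.0, 2.0, -0.5)
lines = group_lines(*coefs)

intercept_1, slope_1 = lines[1]
# The D = 1 line and the full model give the same prediction at X = 4
print(lines, predict(*coefs, 1, 4.0), intercept_1 + slope_1 * 4.0)
```

Here β1 shifts the intercept for the D = 1 group and β3 changes its slope; if β3 were zero the two lines would be parallel.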
(c) Interactions between two
continuous variables
Yi = β0 + β1X1i + β2X2i + ui
• X1, X2 are continuous
• As specified, the effect of X1 doesn’t depend on X2
• As specified, the effect of X2 doesn’t depend on X1
• To allow the effect of X1 to depend on X2, include the
“interaction term” X1i×X2i as a regressor:
Yi = β0 + β1X1i + β2X2i + β3(X1i×X2i) + ui
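A short worked step, using the same symbols as the interaction model above, shows what the interaction term buys: holding X2 fixed and changing X1 by an amount ΔX1 changes the predicted Y by an amount that depends on the level of X2 (and symmetrically for X2).

```latex
\begin{aligned}
\Delta Y &= \left[ \beta_0 + \beta_1 (X_1 + \Delta X_1) + \beta_2 X_2 + \beta_3 (X_1 + \Delta X_1) X_2 \right]
          - \left[ \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 \right] \\
         &= (\beta_1 + \beta_3 X_2)\, \Delta X_1 \\[4pt]
\frac{\Delta Y}{\Delta X_1} &= \beta_1 + \beta_3 X_2,
\qquad
\frac{\Delta Y}{\Delta X_2} = \beta_2 + \beta_3 X_1
\end{aligned}
```

With β3 = 0 this collapses to the no-interaction model, where the effect of X1 is β1 at every value of X2.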
Next: An Instructional Example of a Simple
Three-Variable Model
• From U.S. Survey Data. Using Stata software.
• Ordinal variables are treated again as
continuous variables, collapsing the number
of categories in the independent variables. We
would normally not collapse variables; that
loses information. We do so here for purposes
of seeing how the assumption of “no
interaction” plays out in a simple, illustrative
way.
• Go to PDF file W4910x11 Control Variables
Example: the California test score data
Regression of TestScore against STR:
TestScore = 698.9 – 2.28×STR
Now include percent English Learners in the district (PctEL):
TestScore = 686.0 – 1.10×STR – 0.65×PctEL
• What happens to the coefficient on STR?
• Why? (Note: corr(STR, PctEL) = 0.19)
Multiple regression in STATA
reg testscr str pctel, robust;

Regression with robust standard errors          Number of obs =     420
                                                F(  2,   417) =  223.82
                                                Prob > F      =  0.0000
                                                R-squared     =  0.4264
                                                Root MSE      =  14.464

------------------------------------------------------------------------------
             |               Robust
     testscr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         str |  -1.101296   .4328472    -2.54   0.011     -1.95213   -.2504616
       pctel |  -.6497768   .0310318   -20.94   0.000     -.710775   -.5887786
       _cons |   686.0322   8.728224    78.60   0.000     668.8754     703.189
------------------------------------------------------------------------------
TestScore = 686.0 – 1.10×STR – 0.65PctEL
Another Example of a Multivariate Model
Estimated with Stata
• More like real research than the previous
example. Multiple control variables. (No
collapsing of categories, losing information.)
• Scatterplots to explore non-linearity.
• Inclusion of multiplicative terms to
explore statistical interactions.
• Go to PDF file W4911y12 Regressions…..
Linearity vs. Non-Linearity
• Non-linear relationships. Easy cases are
models which are still linear in the
coefficients; can be estimated with OLS.
• Case of dichotomous dependent variable
(coded 0-1), for which theory is nonlinear and not linear in the coefficients.
An “S” shaped curve: logit or probit
model. What kind of theory? Versus a
Linear Probability Model, LPM with OLS.
But the TestScore – Income relation
looks nonlinear...
Example: the TestScore – Income relation
$\text{Income}_i$ = average district income in the ith district
(thousands of dollars per capita)
Quadratic specification:
$\text{TestScore}_i = \beta_0 + \beta_1 \text{Income}_i + \beta_2 (\text{Income}_i)^2 + u_i$
Cubic specification:
$\text{TestScore}_i = \beta_0 + \beta_1 \text{Income}_i + \beta_2 (\text{Income}_i)^2 + \beta_3 (\text{Income}_i)^3 + u_i$
Interpreting the estimated regression
function:
(a) Plot the predicted values
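$\widehat{\text{TestScore}} = 607.3 + 3.85\,\text{Income}_i - 0.0423\,(\text{Income}_i)^2$
(standard errors: 2.9 for the intercept, 0.27 for Income, 0.0048 for Income squared)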
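Besides plotting, a quadratic fit can be read by comparing predicted changes at different income levels. The Python sketch below plugs the estimated coefficients from the equation above into a function (the name `predicted_score` and the choice of income levels 5, 6, 25, and 26 are illustrative assumptions) and computes the predicted change in test score for a one-unit (thousand-dollar) increase in income at a low and at a high starting point.

```python
def predicted_score(income):
    """Predicted test score from the estimated quadratic regression
    (income in thousands of dollars per capita)."""
    return 607.3 + 3.85 * income - 0.0423 * income ** 2

# Predicted effect of one more unit of income, at two different starting levels
change_low = predicted_score(6) - predicted_score(5)
change_high = predicted_score(26) - predicted_score(25)
print(round(change_low, 2), round(change_high, 2))
```

The predicted gain from an extra unit of income is roughly twice as large at the low income level as at the high one, which is what the negative coefficient on the squared term implies.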
Example: Linear Probability Model (LPM). HMDA data
Mortgage denial v. ratio of debt payments to income
(P/I ratio) in a subset of the HMDA data set (n = 127)
Probit
Logit versus Probit Models
The predicted probabilities from the probit and logit models are very close in
these HMDA regressions:
Logit Models (or Logistic Regression)
Cannot be estimated with OLS. Requires Maximum Likelihood
Estimation (MLE).
Logit (continued)
• Where P is the Predicted Y for a dichotomous
(0-1) dependent variable, that is, predicting
the probability that Y=1. The same goal as the
Linear Probability Model (regression) but with
a non-linear (S-curve) relationship.
• e = the natural log base (2.718…), and bX
refers to the linear combination of indep. vars.
• It implicitly involves interactions of the independent vars.:
the effect of one variable on P depends on the values of the others.
• Go to W4911y12 Logit, Probit, LPM example
in Stata….
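The definitions above (P, e, and bX) match the standard logistic function, P = 1 / (1 + e^(−bX)); that this is the exact form shown on the original slide is an assumption, since the formula itself is not in the text. The Python sketch below evaluates it at a few values of the linear combination bX to show the S-curve properties: predictions stay between 0 and 1, and the same one-unit change in bX moves P by different amounts depending on the starting point.

```python
from math import exp

def logit_probability(bx):
    """Predicted probability that Y = 1, given the linear combination bX."""
    return 1.0 / (1.0 + exp(-bx))

# Predicted probabilities stay strictly between 0 and 1
p_low, p_mid, p_high = (logit_probability(v) for v in (-10.0, 0.0, 10.0))

# The same one-unit change in bX has a different effect on P
# depending on where it starts (steep in the middle, flat in the tails)
effect_middle = logit_probability(1.0) - logit_probability(0.0)
effect_tail = logit_probability(4.0) - logit_probability(3.0)
print(p_mid, round(effect_middle, 3), round(effect_tail, 3))
```

In a Linear Probability Model the same change in X always shifts the predicted probability by the same amount, and predictions can fall below 0 or above 1; the logit curve avoids both.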
Survey Data Analysis
• Anderson and Guillory (1997) examine
satisfaction with democracy.
• Huber, Kernell, and Leoni (2005) examine
partisan attachment.
• Interactions and Limited Dependent
Variables.
Survey Research: Issues and Sources of
Error
• Issues in Survey Research
• Go to PDF File Sources of Errors in Surveys.
Identifying Causal Mechanisms and
Time Series Analysis
• Getting insight and leverage from
observing variations over time and short
term changes from longitudinal data.
• Comparing directly changes over time.
• Looking for sequences or time lags in the
data over time.
• Examples (next slide).
Data Analysis Examples
• Unit of analysis is the time period (e.g.,
year, month, etc.) for a single unit (e.g.,
one country or other kind of case); e.g.,
one country’s years.
• Stata example of simple time series with
exogenous variables only; example of a
lagged endogenous variable. Go to PDF
File W4911y12 Paper5Part1.
Time Series (continued)
• Continue to PDF File W4911y12
Paper5Part2.
• “Panel” or “pooled time series” data.
For multiple units over time. For
example, separate time series for many
countries—the unit of analysis is
“country-years.” Data show variation
both over time and across units; need to
watch this.
Time Series (continued)
• Stata example of pooled time series; Go
to PDF File W4911y12 Paper5
SupplementPooledTimeSeries.
• Panel Data. Logic of “fixed effects”.
• Examples from readings.
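The logic of "fixed effects" for pooled time series can be sketched in Python as a "within" transformation: subtract each unit's own mean from its X and Y values, then run the regression on the demeaned data. The two "countries" and all numbers below are made up so that each has a slope of exactly 1 over time but a very different level; the helper names are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def ols_slope(x, y):
    """OLS slope of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    return (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
            / sum((xi - x_bar) ** 2 for xi in x))

def within_transform(units, values):
    """Subtract each unit's own mean from its observations."""
    groups = defaultdict(list)
    for u, v in zip(units, values):
        groups[u].append(v)
    unit_means = {u: mean(vs) for u, vs in groups.items()}
    return [v - unit_means[u] for u, v in zip(units, values)]

# Made-up country-years: within each country Y rises one-for-one with X,
# but country B has high X and a much lower level of Y.
country = ["A", "A", "A", "B", "B", "B"]
x = [1, 2, 3, 6, 7, 8]
y = [11, 12, 13, 1, 2, 3]

pooled_slope = ols_slope(x, y)   # mixes over-time and cross-country variation
fe_slope = ols_slope(within_transform(country, x), within_transform(country, y))
print(round(pooled_slope, 2), round(fe_slope, 2))
```

The pooled slope is negative because it is driven by the difference between the two countries; the fixed-effects slope uses only variation over time within each country and recovers the within-country relationship.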
Using “Instruments” to Identify
Causal Effects
• Issue of “reciprocal causation”/
“endogeneity”/“simultaneity bias”.
• Need to find an “exogenous” variable as
an instrument.
• Assumptions about exogeneity and lack
of direct causal effect on an endogenous
variable.
• Example from Acemoglu et al.
Instrumental Variables (continued)
• Logic of Indirect Least Squares
• Two-Stage Least Squares (TSLS or 2SLS)
• Stata example using U.S. Survey Data, Go
to PDF File W4911y12Paper4.
• Other research examples; see next table.
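A simulated illustration of the two-stage logic, in Python. Everything in it is an assumption made for the sketch: the true causal effect (2.0), the sample size, and the way endogeneity is generated (an unobserved factor `u` that drives both X and Y, which is a simpler stand-in for the reciprocal-causation story). The instrument `z` affects X but has no direct effect on Y. With one instrument and one endogenous regressor, two-stage least squares gives the same number as the ratio of covariances, which is the indirect least squares estimate.

```python
import random
from statistics import mean

def cov(u, v):
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)

def ols(x, y):
    """Return (intercept, slope) of the bivariate OLS regression of y on x."""
    b = cov(x, y) / cov(x, x)
    return mean(y) - b * mean(x), b

rng = random.Random(42)
n = 5000
z = [rng.gauss(0, 1) for _ in range(n)]            # instrument (exogenous)
u = [rng.gauss(0, 1) for _ in range(n)]            # unobserved factor
x = [zi + ui for zi, ui in zip(z, u)]              # X depends on z and on u
y = [2.0 * xi + ui + rng.gauss(0, 1) for xi, ui in zip(x, u)]  # true effect = 2.0

_, b_ols = ols(x, y)                               # biased: X is correlated with u

# Stage 1: regress X on the instrument and keep the predicted values
a1, g1 = ols(z, x)
x_hat = [a1 + g1 * zi for zi in z]
# Stage 2: regress Y on the predicted X
_, b_2sls = ols(x_hat, y)

# Indirect least squares: ratio of the two reduced-form relationships
b_ratio = cov(z, y) / cov(z, x)
print(round(b_ols, 2), round(b_2sls, 2), round(b_ratio, 2))
```

OLS overstates the effect because X carries the unobserved factor with it, while the instrumented estimate uses only the part of X that moves with `z`. This sketch does not produce correct second-stage standard errors; dedicated routines handle that.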
Factor Analysis and Scale
Construction
• Example from Verba and Nie,
Participation in America (1972)
• Stata example from U.S. survey data.
Factor Analysis and Scale
Construction (continued)
• Stata Example from U.S. Survey Data
• Go to W4911y Paper6Factor Analysis
example.
Other Topics?
• Questions?