BASIC DATA ANALYSIS AND STATISTICS
R. SHAPIRO
American University in Cairo
June 3-6, 2012

• Motivation, Intuition, and Numerology (AUCShapiroPresent1.ppt)
• Exploring Theories: Bivariate Analysis
• Multivariate Models (Regression Approaches)
• Limited Dependent Variables (dichotomous variables) and Interactions
• Survey Research: Issues and Sources of Error
• Identifying Causal Mechanisms and Time Series Analysis
• Using “Instruments” to Identify Causal Effects

Exploring Theories: Bivariate Analysis
“Correlation is not causation!” But you have to start somewhere....

First Steps
• Centrality of causal theorizing. Dependent and independent variable(s). Unit of analysis? Generalizing to what universe/population? Assumption of unidirectional causation (revisited later)?
X --------> Y
e.g., Democracy -----------> Income (of countries)
Education -----------> Income (of individuals)
• Plausibility of theory? Causal mechanism/story?

Next Steps in Quantitative Research
• Measurement of variables (ideally at the designated unit of analysis). “Validity” and “reliability” of measures?
• Hypothesis specification (for measures); expected covariation/correlation?
• Statistical evidence of covariation/correlation?
• Rejecting the null hypothesis? Substantive versus “statistical” significance?
• Next steps? Statistical controls, multivariate analysis, to be continued.... Strengthening causal inferences.

Questions at the Statistical Analysis Stage
• “Level of measurement” for the measures of the dependent and independent variables: Categorical or Continuous? (Further distinction of “nominal,” “ordinal,” “interval,” or “ratio” level variables.)
• The preferred statistical method depends on the level of measurement of the variables!
• Motivation to put everything into a regression analysis framework, for later multivariate analysis.
• The Bivariate Regression approach.
Bivariate Ordinary Least Squares Regression (OLS)
• Case of: Income -------> Test scores of individuals
• The regression line takes the form of Predicted Y = intercept + slope(X), or Predicted Y = a + bX, where “a” and “b” take on the unique numeric values that minimize the vertical distances between all the points and the regression line (by minimizing the squared distances). To the extent Y and X are linearly related in this way, the regression line falls much closer to all the points than does the horizontal line through the mean of Y.
• Minimize, over all cases, the sum of (Y - Predicted Y)^2

Bivariate Scatterplot: Regression Line Versus the Mean, and the Idea of “Explained Variance”
[Figure: California school district test scores (y-axis, 620-700) plotted against (logged) average income (x-axis, 2.0-4.0), with the regression line and the horizontal line through the mean of test scores.]

Linear regression lets us estimate the slope of the population regression line
• Ultimately our aim is to estimate the causal effect on Y of a unit change in X; but for now, just think of the problem of fitting a straight line to data on two variables, Y and X.
• The slope of the population regression line is the expected effect on Y of a unit change in X.

The Population Linear Regression Model
Yi = β0 + β1Xi + ui, i = 1, ..., n
• We have n observations, (Xi, Yi), i = 1, ..., n
• X is the independent variable or regressor
• Y is the dependent variable
• β0 = intercept
• β1 = slope
• ui = the regression error
The regression error consists of omitted factors: in general, other factors that influence Y besides the variable X. The regression error also includes error in the measurement of Y.
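The least-squares fit described above can be sketched numerically. This is a minimal illustration in Python/NumPy rather than Stata, and the data points are made up for the example (they are not the California data): the slope is the covariation of X and Y divided by the variation of X, and the intercept puts the line through the means.

```python
import numpy as np

# Hypothetical data: X = (logged) average income, Y = test score.
# These five points are illustrative only, not the actual data set.
X = np.array([2.0, 2.5, 3.0, 3.5, 4.0])
Y = np.array([630.0, 645.0, 655.0, 670.0, 680.0])

# Closed-form OLS estimates for Predicted Y = b0 + b1*X:
# b1 = sum of (X dev)*(Y dev) / sum of (X dev)^2; b0 = mean(Y) - b1*mean(X)
b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()

predicted = b0 + b1 * X
residuals = Y - predicted  # vertical distances; OLS residuals sum to zero

print(round(b1, 2), round(b0, 2), round(residuals.sum(), 10))  # → 25.0 581.0 0.0
```

No other pair of values (a, b) gives a smaller sum of squared residuals on these points; that is what "ordinary least squares" means.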
The population regression model in a picture: observations on Y and X (n = 7); the population regression line; and the regression error (the “error term”).

The OLS estimator solves:
min over (b0, b1) of the sum from i = 1 to n of [Yi - (b0 + b1Xi)]^2
• The OLS estimator minimizes the average squared difference between the actual values of Yi and the prediction (“predicted value”) based on the estimated line. That is, it minimizes the vertical distances.
• This minimization problem can be solved using calculus.
• The result is the OLS estimators of β0 and β1.

Application to the California Test Score – Class Size data
• Estimated slope = -2.28
• Estimated intercept = 698.9
• Estimated regression line: TestScore = 698.9 - 2.28×STR

OLS regression: STATA output

regress testscr str, robust

Regression with robust standard errors
Number of obs = 420
F(1, 418)     = 19.26
Prob > F      = 0.0000
R-squared     = 0.0512
Root MSE      = 18.581

------------------------------------------------------------------------
        |             Robust
testscr |      Coef.  Std. Err.      t    P>|t|    [95% Conf. Interval]
--------+---------------------------------------------------------------
    str |  -2.279808   .5194892   -4.39   0.000    -3.300945  -1.258671
  _cons |    698.933   10.36436   67.44   0.000     678.5602   719.3057
------------------------------------------------------------------------

Example of the R2 and the SER
TestScore = 698.9 - 2.28×STR, R2 = .05, SER = 18.6
STR explains only a small fraction of the variation in test scores. Does this make sense? Does this mean the STR is unimportant in a policy sense?

A real-data example from labor economics: average hourly earnings vs. years of education (data source: Current Population Survey).

Slope of the Regression Line, Variability Around It, and the Correlation Coefficient
• Predicted Y = a + bX, where b is the slope.
• The correlation coefficient, Pearson’s “r”, ranges from -1 to 0 to +1, and is larger in size to the extent that the observed data fall very close to the regression line.
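The connection between Pearson’s r and the regression line can be sketched numerically. The data below are made up for illustration (not the CPS earnings data), and the sketch uses Python/NumPy rather than Stata: r squared equals one minus the ratio of squared vertical distances around the fitted line to those around the horizontal line through the mean of Y.

```python
import numpy as np

# Illustrative, made-up data: X = years of education, Y = hourly earnings.
X = np.array([10.0, 12.0, 12.0, 14.0, 16.0, 18.0])
Y = np.array([11.0, 13.0, 12.0, 16.0, 20.0, 21.0])

# Pearson's r from the correlation matrix
r = np.corrcoef(X, Y)[0, 1]

# Fit the OLS line, then compare squared vertical distances around it
# with those around the horizontal line through the mean of Y.
b1, b0 = np.polyfit(X, Y, 1)                  # slope, intercept
ss_resid = np.sum((Y - (b0 + b1 * X)) ** 2)   # around the regression line
ss_total = np.sum((Y - Y.mean()) ** 2)        # around the mean of Y
r_squared = 1 - ss_resid / ss_total

# For OLS with an intercept, r**2 and this "explained variance"
# computation agree (up to floating-point error).
print(abs(r ** 2 - r_squared) < 1e-9)  # → True
```

This is why r measures direction and closeness of fit, while r2 has the proportional “explained variance” interpretation discussed next.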
The r2 indicates how much closer, proportionately, the regression line falls (vertically) to the observed values of the dependent variable than the horizontal line through the mean of the dependent variable. Why are both useful?

[Figures: scatterplots illustrating correlation = 1; correlation = .95; the same slope (b) but correlation = .75 (implications? more variability? why?); correlation = -.50; and no correlation.]

OLS can be sensitive to an outlier (also look for non-linearity? discuss later?):
• Is the lone point an outlier in X or Y?
• In practice, outliers are often data glitches (coding or recording problems). Sometimes (or more often?) they are observations that really shouldn’t be in your data set. Plot your data!

The larger the variance of X, the smaller the variance of the slope b
[Figure: two scatters with the same number of black and blue dots but different spreads of X.] The number of black and blue dots is the same. Using which would you get a more accurate regression line?

Analyzing Categorical Measures
• For categorical independent and dependent variables: cross tabulation.
• For a categorical independent variable and a continuous dependent variable (or a categorical dependent variable that can be treated as continuous): compare means on the dependent variable.
• For a dichotomous dependent variable coded 0-1, the mean is the proportion of cases in the 1 category, so means on it can be compared!

Go to Stata example of standard bivariate analysis, non-regression
• Crucial: preparing data -- recoding; dealing with “missing values,” if any; etc.
• Go to PDF file, W4910x11 Bivariate Crosstabs and Means Analysis. Examples from U.S. survey data.
• On to a regression analysis framework next...

Moving to a regression framework for categorical variables:
• Treating categorical variables as continuous, if categories are “ordered” (“ordinal” vs. “nominal” level variables).
• Special case of dichotomous variables. (The mean of a 0-1 variable is the proportion of cases in the “1” category: the average of 0, 0, 1, 1, 1 = .6.)
• Crucial bridge: “dummy variable regression.” (And now for some comic relief, normally done at a blackboard with chalk.)

Example Using U.S. Survey Data and Stata Software
• Assumptions in treating ordinal variables as continuous variables.
• Statistical versus substantive significance? Variability. “Sampling error”/confidence intervals. The “standard error.”
• PDF file W4910x11 Bivariate Regression, Dummy Variables.

Statistical Control: Understanding Multivariate Models (Multiple Regression Analysis)
• Predicted Y = a + b1X1 + b2X2, where the b’s are the coefficients for which the differences between the observed Y’s and predicted Y’s are minimized. In this case we have more b’s to estimate to minimize the sum of (Y - Predicted Y)^2.
• It now also has the interpretations shown below, beginning with comparisons of different possible scenarios for “conditional” regressions that hold one variable constant.

“Effect” of Region and Democracy on Economic Growth (made-up data)
• Predicted EG = a + b1(Democracy) + b2(Region), where we think both democracy and region have possible causal effects.
• Case of only two regions (1 and 2; Region is coded 0-1), to illustrate a simple case of statistical control/holding one variable constant.
• The linear equation assumes no “interaction”; that is, the “effect” of Democracy is the same in Regions 1 and 2 (and the same for Region; but is it?). There are different possibilities: (and comic relief)

(b) Interactions between continuous and binary variables
Yi = β0 + β1Di + β2Xi + ui
• Di is binary (a dummy variable coded 0-1); X is continuous.
• As specified above, the effect on Y of X (holding constant D) = β2, which does not depend on D; that is, it is the same for D = 0 and for D = 1. But what if that is not the case?
• To allow the effect of X to depend on D, include the “interaction term” Di×Xi as a regressor:
Yi = β0 + β1Di + β2Xi + β3(Di×Xi) + ui

Binary-continuous interactions, ctd.
Binary-continuous interactions: the two regression lines
Yi = β0 + β1Di + β2Xi + β3(Di×Xi) + ui
Observations with Di = 0 (the “D = 0” group):
Yi = β0 + β2Xi + ui (the D = 0 regression line)
Observations with Di = 1 (the “D = 1” group):
Yi = β0 + β1 + β2Xi + β3Xi + ui = (β0 + β1) + (β2 + β3)Xi + ui (the D = 1 regression line)

(c) Interactions between two continuous variables
Yi = β0 + β1X1i + β2X2i + ui
• X1, X2 are continuous
• As specified, the effect of X1 doesn’t depend on X2
• As specified, the effect of X2 doesn’t depend on X1
• To allow the effect of X1 to depend on X2, include the “interaction term” X1i×X2i as a regressor:
Yi = β0 + β1X1i + β2X2i + β3(X1i×X2i) + ui

Next: An Instructional Example of a Simple Three-Variable Model
• From U.S. survey data. Using Stata software.
• Ordinal variables are treated again as continuous variables, collapsing the number of categories in the independent variables. We would normally not collapse variables; that loses information. We do so here for purposes of seeing how the assumption of “no interaction” plays out in a simple, illustrative way.
• Go to PDF file W4910x11 Control Variables

Example: the California test score data
Regression of TestScore against STR:
TestScore = 698.9 - 2.28×STR
Now include percent English Learners in the district (PctEL):
TestScore = 686.0 - 1.10×STR - 0.65×PctEL
• What happens to the coefficient on STR?
• Why? (corr(STR, PctEL) = 0.19)

Multiple regression in STATA

reg testscr str pctel, robust

Regression with robust standard errors
Number of obs = 420
F(2, 417)     = 223.82
Prob > F      = 0.0000
R-squared     = 0.4264
Root MSE      = 14.464

------------------------------------------------------------------------------
        |             Robust
testscr |      Coef.  Std. Err.      t    P>|t|    [95% Conf. Interval]
--------+---------------------------------------------------------------------
    str |  -1.101296   .4328472   -2.54   0.011     -1.95213  -.2504616
  pctel |  -.6497768   .0310318  -20.94   0.000     -.710775  -.5887786
  _cons |   686.0322   8.728224   78.60   0.000     668.8754    703.189
------------------------------------------------------------------------------

TestScore = 686.0 - 1.10×STR - 0.65×PctEL

Another Example of a Multivariate Model Estimated with Stata
• More like real research than the previous example. Multiple control variables. (No collapsing of categories, losing information.)
• Scatterplots to explore non-linearity.
• Inclusion of multiplicative terms to explore statistical interactions.
• Go to PDF file W4911y12 Regressions.....

Linearity vs. Non-Linearity
• Non-linear relationships. Easy cases are models which are still linear in the coefficients; these can be estimated with OLS.
• Case of a dichotomous dependent variable (coded 0-1), for which theory is nonlinear and not linear in the coefficients. An “S”-shaped curve: logit or probit model. What kind of theory? Versus a Linear Probability Model (LPM) with OLS.

But the TestScore – Income relation looks nonlinear...

Example: the TestScore – Income relation
Incomei = average district income in the ith district (thousands of dollars per capita)
Quadratic specification: TestScorei = β0 + β1Incomei + β2(Incomei)^2 + ui
Cubic specification: TestScorei = β0 + β1Incomei + β2(Incomei)^2 + β3(Incomei)^3 + ui

Interpreting the estimated regression function: (a) Plot the predicted values
TestScore = 607.3 + 3.85×Incomei - 0.0423×(Incomei)^2
            (2.9)   (0.27)         (0.0048)

Example: Linear Probability Model (LPM), HMDA data
Mortgage denial v. ratio of debt payments to income (P/I ratio) in a subset of the HMDA data set (n = 127).

Probit

Logit versus Probit Models
The predicted probabilities from the probit and logit models are very close in these HMDA regressions.

Logit Models (or Logistic Regression)
Cannot be estimated with OLS.
Requires Maximum Likelihood Estimation (MLE).

Logit (continued)
• Where P is the predicted Y for a dichotomous (0-1) dependent variable; that is, predicting the probability that Y = 1. The same goal as the Linear Probability Model (regression), but with a non-linear (S-curve) relationship.
• e = the natural log base (2.718...), and bX refers to the linear combination of independent variables.
• It involves interactions of independent variables.
• Go to W4911y12 Logit, Probit, LPM example in Stata....

Survey Data Analysis
• Anderson and Guillory (1997) examine satisfaction with democracy.
• Huber, Kernell, and Leoni (2005) examine partisan attachment.
• Interactions and Limited Dependent Variables.

Survey Research: Issues and Sources of Error
• Issues in survey research.
• Go to PDF file Sources of Errors in Surveys.

Identifying Causal Mechanisms and Time Series Analysis
• Getting insight and leverage from observing variations over time and short-term changes from longitudinal data.
• Comparing changes over time directly.
• Looking for sequences or time lags in the data over time.
• Examples (next slide).

Data Analysis Examples
• Unit of analysis is the time period (e.g., year, month, etc.) for a single unit (e.g., one country or other kind of case); e.g., one country’s years.
• Stata example of simple time series with exogenous variables only; example of a lagged endogenous variable. Go to PDF file W4911y12 Paper5Part1.

Time Series (continued)
• Continue to PDF file W4911y12 Paper5Part2.
• “Panel” or “pooled time series” data: for multiple units over time. For example, separate time series for many countries; the unit of analysis is “country-years.” Data show variation both over time and across units; need to watch this.

Time Series (continued)
• Stata example of pooled time series; go to PDF file W4911y12 Paper5 SupplementPooledTimeSeries.
• Panel data. Logic of “fixed effects.”
• Examples from readings.
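The logic of “fixed effects” can be sketched with made-up country-year data. This is an illustrative Python/NumPy version of the “within” transformation, not the Stata examples referenced above: demeaning X and Y within each country absorbs anything constant for that country (its fixed effect), so OLS on the demeaned data recovers the slope.

```python
import numpy as np

# Hypothetical panel: 3 countries x 4 years (unit of analysis: country-years).
countries = np.repeat([0, 1, 2], 4)
X = np.array([1.0, 2.0, 3.0, 4.0,
              2.0, 3.0, 4.0, 5.0,
              0.0, 1.0, 2.0, 3.0])
# Y is built with a common slope of 2 plus a country-specific intercept
# (the fixed effect): 5, -3, and 10 for the three countries.
effects = np.array([5.0, -3.0, 10.0])
Y = 2.0 * X + effects[countries]

# "Within" transformation: subtract each country's own mean from X and Y.
Xw, Yw = X.copy(), Y.copy()
for c in np.unique(countries):
    mask = countries == c
    Xw[mask] -= X[mask].mean()
    Yw[mask] -= Y[mask].mean()

# Bivariate OLS on the demeaned data; the fixed effects have dropped out.
beta = np.sum(Xw * Yw) / np.sum(Xw ** 2)
print(round(beta, 6))  # → 2.0
```

A pooled regression of the raw Y on X here would mix within-country and across-country variation; demeaning isolates the variation over time within units, which is exactly the “need to watch this” point above.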
Using “Instruments” to Identify Causal Effects
• Issue of “reciprocal causation”/“endogeneity”/“simultaneity bias.”
• Need to find an “exogenous” variable as an instrument.
• Assumptions about exogeneity and the lack of a direct causal effect of the instrument on the dependent variable.
• Example from Acemoglu et al.

Instrumental Variables (continued)
• Logic of Indirect Least Squares.
• Two-Stage Least Squares (TSLS or 2SLS).
• Stata example using U.S. survey data; go to PDF file W4911y12Paper4.
• Other research examples; see next table.

Factor Analysis and Scale Construction
• Example from Verba and Nie, Participation in America (1972).
• Stata example from U.S. survey data.

Factor Analysis and Scale Construction (continued)
• Stata example from U.S. survey data.
• Go to W4911y Paper6Factor Analysis example.

Other Topics?
• Questions?