Ordinary Least Squares Regression
PO7001: Quantitative Methods I
Kenneth Benoit
24 November 2010

Independent and dependent variables

- A dependent variable represents the quantity whose variation we wish to explain: the thing we are trying to explain.
- Typical examples of a dependent variable in political science:
  - votes received by a governing party
  - support for a referendum result like Lisbon
  - party support for European integration
- An independent variable represents a quantity whose variation will be used to explain variation in the dependent variable.
- Typical examples of independent variables in political science:
  - demographic: gender, national background, age
  - economic: socioeconomic status, income, national wealth
  - political: party affiliation of one's parents
  - institutional: district magnitude, electoral system, presidential v. parliamentary
  - behavioural: campaign spending levels
- Using this language implies causality: X → Y

The importance of variation

- Variation in outcomes (of the dependent variable) is what we seek to explain in social and political research.
- We seek to explain these outcomes using (independent) variables.
- The very language, here the term "variable", suggests that the quantity so named has to vary.
- Conversely, a quantity that does not vary is impossible to study in this way. This also applies to samples that do not vary: these will not help us in research.
- Typically when we collect data, we wish to have as much variation in our sample as possible.
- Example: in the returns-from-schooling example, it would be best to have as much variation in schooling as possible, to maximize the leverage of our research.

Different functional relationships

Linear: A linear relationship, also known as a straight-line relationship, exists if a line drawn through the central tendency of the points is a straight line.
Curvilinear: Exists if the relationship between the variables is not a straight-line function, but is instead curved. Example: television viewing (Y) as a function of age (X). (More on this in Quant 2.)

Correlations

- The central idea behind correlation is that two variables have a systematic relationship between their values.
- This relationship can be positive or negative.
- This relationship varies in strength.
- Bivariate associations are usually depicted graphically by use of a scatterplot.
- We can also summarize (as per an earlier week) the correlation using Pearson's r to show the direction and strength of the bivariate relationship.

Pearson's r revisited

$$r = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_i (X_i - \bar{X})^2 \sum_i (Y_i - \bar{Y})^2}} = \frac{\text{Sum of Products}}{\sqrt{\text{Sum of Squares}_x \,\text{Sum of Squares}_y}} = \frac{SP}{\sqrt{SS_x SS_y}}$$

Testing the significance of Pearson's r

- Pearson's r measures correlation in the sample, but the association we are interested in exists in the population.
- Question: what is the probability that any correlation we measure in a sample really exists in the population, and is not merely due to sampling error?
- $H_0: \rho_{x,y} = 0$ (no correlation exists in the population)
- $H_A: \rho_{x,y} \neq 0$
- Significance can be tested by the t-ratio:
$$t = \frac{r\sqrt{N-2}}{\sqrt{1-r^2}}$$

Pearson's r significance testing: Example

- Let's assume we have data such that r = .24, N = 8.
- The computation of t is then:
$$t = \frac{.24\sqrt{8-2}}{\sqrt{1-(.24)^2}} = \frac{.24(2.45)}{\sqrt{1-.0576}} = \frac{.59}{\sqrt{.9424}} = \frac{.59}{.97} = .61$$
- From R (using qt()), we know that the critical value for t with df = 6, α = .05 is 2.447, so we do not reject $H_0$.
- In R there is a built-in test for this: cor.test(x, y).
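A minimal follow-up sketch (not in the original slides, using the rounded r = .24 from above): the two-sided p-value implied by this t-ratio can be obtained directly with pt(), and should land close to the value that cor.test() reports on the next slide.

# hedged sketch: p-value implied by the hand-computed t-ratio (r = .24, N = 8)
t.calc <- (.24 * sqrt(8 - 2)) / sqrt(1 - .24^2)
2 * pt(abs(t.calc), df = 8 - 2, lower.tail = FALSE)   # two-sided p-value, close to cor.test()'s 0.5636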
Pearson's r significance testing using R

> x <- c(12,10,6,16,8,9,12,11)
> y <- c(12,8,12,11,10,8,16,15)
> cor(x,y)
[1] 0.2420615
> # compute the empirical t-value
> (t.calc <- (.24*sqrt(8-2) / sqrt(1-.24^2)))
[1] 0.6055768
> # compute the critical t-value
> (t.crit <- qt(1-.05/2, 6))
[1] 2.446912
> # compare
> t.calc > t.crit
[1] FALSE
> # using R's built-in correlation significance test
> cor.test(x,y)

        Pearson's product-moment correlation

data:  x and y
t = 0.6111, df = 6, p-value = 0.5636
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.5577490  0.8087778
sample estimates:
      cor
0.2420615

Regression analysis

- Recall the basic linear model: $y_i = a + bX_i$
- Here the relationship is determined by two parameters:
  - a, the intercept: the value of Y when X is zero
  - b, the slope: the rate of change in Y for a one-unit change in X, also known as the regression coefficient
- Note, however, that this implies a straight-line, perfect relationship. Because this is never the case in real research, we also have an error term or residual $\varepsilon_i$:
$$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$
  and we use the β and ε terminology instead of the high-school math quantities a and b.

Example

par(mar=c(4,4,1,1))
x <- c(0,3,1,0,6,5,3,4,10,8)
y <- c(12,13,15,19,26,27,29,31,40,48)
plot(x, y, xlab="Number of prior convictions (X)",
     ylab="Sentence length (Y)", pch=19)
abline(h=c(10,20,30,40), col="grey70")

[Figure: scatterplot of sentence length (Y) against number of prior convictions (X)]

Least squares formulas

For the three parameters (simple regression):

- the regression coefficient:
$$\hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$$
- the intercept:
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
- and the residual variance $\sigma^2$:
$$\hat{\sigma}^2 = \frac{1}{n-2} \sum [y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)]^2$$

Least squares formulas continued

Things to note:
- the prediction line is $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$
- the value $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ is the predicted value for $x_i$
- the residual is $e_i = y_i - \hat{y}_i$
- the residual sum of squares (RSS) is $\sum_i e_i^2$
- the estimate for $\sigma^2$ is the same as $\hat{\sigma}^2 = \text{RSS}/(n-2)$

Example to show formulas in R

> x <- c(0,3,1,0,6,5,3,4,10,8)
> y <- c(12,13,15,19,26,27,29,31,40,48)
> (data <- data.frame(x, y, xdev=(x-mean(x)), ydev=(y-mean(y)),
+                     xdevydev=((x-mean(x))*(y-mean(y))),
+                     xdev2=(x-mean(x))^2,
+                     ydev2=(y-mean(y))^2))
    x  y xdev ydev xdevydev xdev2 ydev2
1   0 12   -4  -14       56    16   196
2   3 13   -1  -13       13     1   169
3   1 15   -3  -11       33     9   121
4   0 19   -4   -7       28    16    49
5   6 26    2    0        0     4     0
6   5 27    1    1        1     1     1
7   3 29   -1    3       -3     1     9
8   4 31    0    5        0     0    25
9  10 40    6   14       84    36   196
10  8 48    4   22       88    16   484
> (SP <- sum(data$xdevydev))
[1] 300
> (SSx <- sum(data$xdev2))
[1] 100
> (SSy <- sum(data$ydev2))
[1] 1250
> (b1 <- SP / SSx)
[1] 3

From observed to "predicted" relationship

- In the above example, $\hat{\beta}_0 = 14$, $\hat{\beta}_1 = 3$
- This linear equation forms the regression line
- The regression line always passes through two points:
  - the point $(x = 0,\ y = \hat{\beta}_0)$
  - the point $(\bar{x}, \bar{y})$ (the average X predicts the average Y)
- The residual sum of squares (RSS) is $\sum_i e_i^2$
- The regression line is the line that minimizes the RSS

Plot of regression example

[Figure: scatterplot of sentence length (Y) against number of prior convictions (X) with the fitted regression line Yhat = 14 + 3X and the Y-intercept marked]
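To connect the least squares formulas above with the numbers just quoted, here is a minimal sketch (base R only, not from the original slides) that computes the intercept, slope, and residual variance by hand for the sentencing example and checks them against lm():

# least squares "by hand" for the sentencing example
x <- c(0,3,1,0,6,5,3,4,10,8)
y <- c(12,13,15,19,26,27,29,31,40,48)
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)   # slope: should be 3
b0 <- mean(y) - b1 * mean(x)                                      # intercept: should be 14
ehat <- y - (b0 + b1 * x)                                         # residuals
sigma2 <- sum(ehat^2) / (length(y) - 2)                           # residual variance, RSS/(n-2)
c(b0 = b0, b1 = b1, sigma2 = sigma2)
coef(lm(y ~ x))                 # should agree with b0 and b1
summary(lm(y ~ x))$sigma^2      # should agree with sigma2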
Requirements for regression

1. Both variables should be measured at the interval level.
2. The relationship (in the population) between X and Y is linear.
   - A transformation may fix non-linearity in some cases.
   - Outliers or influential points may need special treatment.
3. The sample is randomly chosen (necessary for inference).
4. Both variables should be normally distributed, unless we have a very large sample.

Regression "terminology"

- y is the dependent variable
  - referred to also (by Greene) as the regressand
- X are the independent variables
  - also known as explanatory variables
  - also known as regressors
- y is regressed on X
- The error term is sometimes called a disturbance

Notation

- Independent variables: X
- Dependent variable: Y
- $Y_i$ is a random variable (not directly observed)
- $y_i$ is a realised value of $Y_i$ (observed)
- So "dependent variable" sometimes refers to a set of numbers in your dataset (y) and sometimes to a random variable at each i ($Y_i$).

Interpreting the regression results

- The Y-intercept corresponds to the expected value of Y when X = 0 (which may or may not be meaningful)
- The regression coefficient $\beta_1$ refers to the expected change in Y resulting from a one-unit change in X
- Typically we are more interested in $\beta_1$ than $\beta_0$, but $\beta_0$ forms a vital part of any regression estimation
- We can make a prediction $\hat{y}_i$ for any $x_i$ (although we should be careful to choose reasonable values of $x_i$)
- Example: for $x_i = 13$, $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1(13) = 14 + 3(13) = 14 + 39 = 53$

Sums of squares (ANOVA)

- TSS, total sum of squares: $\sum (y_i - \bar{y})^2$
- ESS, estimation (or regression) sum of squares: $\sum (\hat{y}_i - \bar{y})^2$
- RSS, residual sum of squares: $\sum e_i^2 = \sum (\hat{y}_i - y_i)^2$

The key to remember is that TSS = ESS + RSS.

R²

How much of the variance did we explain?

$$R^2 = 1 - \frac{\text{RSS}}{\text{TSS}} = 1 - \frac{\sum_{i=1}^N (y_i - \hat{y}_i)^2}{\sum_{i=1}^N (y_i - \bar{y})^2} = \frac{\sum_{i=1}^N (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^N (y_i - \bar{y})^2}$$

This can be interpreted as the proportion of total variance explained by the model.

R² and Pearson's r

For simple linear regression (i.e. one independent variable), $R^2$ is the same as the correlation coefficient, Pearson's r, squared.

R²

- Note that computing the regression line uses the same sums of squares (SP, $SS_x$, $SS_y$) used in computing the correlation r
- The squared correlation $R^2$ is an important quantity in regression (sometimes called the coefficient of determination)
- $R^2$ is the proportion of variance in Y determined by variation in X
- $0 \le R^2 \le 1.0$
- But remember that $R^2$ is not a regression parameter
- Computation: $r = \dfrac{SP}{\sqrt{SS_x SS_y}}$

R² computation example

$$r = \frac{SP}{\sqrt{SS_x SS_y}} = \frac{300}{\sqrt{(100)(1250)}} = \frac{300}{\sqrt{125000}} = \frac{300}{353.55} = .85$$

$r^2$ is then $(.85)^2 = .72$. This means that 72% of the variation in sentence length is explained by the number of prior convictions.

R²

- A much over-used statistic: it may not be what we are interested in at all
- Interpretation: the proportion of the variation in y that is explained linearly by the independent variables
- Defined in terms of sums of squares:
$$R^2 = \frac{\text{ESS}}{\text{TSS}} = 1 - \frac{\text{RSS}}{\text{TSS}} = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}$$
- Alternatively, $R^2$ is the squared correlation coefficient between y and $\hat{y}$
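A quick check of this last point (a hedged sketch, continuing with the sentencing-example x and y defined above): for simple regression, the squared Pearson correlation, the squared correlation between y and the fitted values, and the R² reported by lm() should all coincide.

# R^2 three ways for the regression of sentence length on priors
m <- lm(y ~ x)
cor(x, y)^2              # squared Pearson's r (.72 here)
cor(y, fitted(m))^2      # squared correlation between y and yhat
summary(m)$r.squared     # R^2 from lm()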
R² continued

- When a model has no intercept, it is possible for $R^2$ to lie outside the interval (0, 1)
- $R^2$ rises with the addition of more explanatory variables. For this reason we often report the "adjusted $R^2$":
$$1 - (1 - R^2)\frac{n-1}{n-k-1}$$
  where k is the total number of regressors in the linear model (excluding the constant)
- Whether $R^2$ is high or not depends a lot on the overall variance in Y
- So $R^2$ values from different Y samples cannot be compared

R² continued

[Figure: the solid arrow shows the variation in y when X is unknown (TSS, total sum of squares, $\sum (y_i - \bar{y})^2$); the dashed arrow shows the variation in y when X is known (ESS, estimation sum of squares, $\sum (\hat{y}_i - \bar{y})^2$)]

R² decomposed

$$\begin{aligned}
y &= \hat{y} + e \\
\text{Var}(y) &= \text{Var}(\hat{y}) + \text{Var}(e) + 2\text{Cov}(\hat{y}, e) \\
\text{Var}(y) &= \text{Var}(\hat{y}) + \text{Var}(e) + 0 \\
\textstyle\sum (y_i - \bar{y})^2 / N &= \textstyle\sum (\hat{y}_i - \bar{\hat{y}})^2 / N + \sum (e_i - \bar{e})^2 / N \\
\textstyle\sum (y_i - \bar{y})^2 &= \textstyle\sum (\hat{y}_i - \bar{\hat{y}})^2 + \sum (e_i - \bar{e})^2 \\
\textstyle\sum (y_i - \bar{y})^2 &= \textstyle\sum (\hat{y}_i - \bar{\hat{y}})^2 + \sum e_i^2 \\
\text{TSS} &= \text{ESS} + \text{RSS} \\
\text{TSS}/\text{TSS} &= \text{ESS}/\text{TSS} + \text{RSS}/\text{TSS} \\
1 &= R^2 + \text{unexplained variance}
\end{aligned}$$

Example from height-weight data

There is (of course) a direct way to compute regression statistics in R:

> # regression models in R
> regmodel <- lm(y ~ x)
> summary(regmodel)

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
-5.2121 -2.5682 -0.6515  1.5303  8.2424

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  49.0909    21.6202   2.271   0.0636 .
x             0.7576     0.3992   1.898   0.1065
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.587 on 6 degrees of freedom
Multiple R-squared: 0.375,  Adjusted R-squared: 0.2709
F-statistic: 3.601 on 1 and 6 DF,  p-value: 0.1065

Illustration using the Anscombe dataset

[show R analysis here]

## Illustration using Anscombe dataset
data(anscombe)
attach(anscombe)
round(coef(lm(y1~x1)), 2)   # same b0, b1
round(coef(lm(y2~x2)), 2)
round(coef(lm(y3~x3)), 2)
round(coef(lm(y4~x4)), 2)
round(summary(lm(y1~x1))$r.squared, 2)   # same R^2
round(summary(lm(y2~x2))$r.squared, 2)
round(summary(lm(y3~x3))$r.squared, 2)
round(summary(lm(y4~x4))$r.squared, 2)
# plot the four x-y pairs
par(mfrow=c(2,2), mar=c(4, 4, 1, 1)+0.1)   # 4 plots in one graphic window
plot(x1,y1)
abline(lm(y1~x1), col="red", lty="dashed")
plot(x2,y2)
abline(lm(y2~x2), col="red", lty="dashed")
plot(x3,y3)
abline(lm(y3~x3), col="red", lty="dashed")
plot(x4,y4)
abline(lm(y4~x4), col="red", lty="dashed")
detach(anscombe)

Anscombe plots

[Figure: the four Anscombe scatterplots (y1 vs x1, y2 vs x2, y3 vs x3, y4 vs x4), each with its fitted regression line]

Extensions of the regression model

- The regression model can be generalized to multiple regression, which involves "regressing Y" on several independent variables $X_1, X_2$, etc.
- Regression allows us to isolate the linear contribution of each $X_k$ to Y, "holding everything else constant"
- This is the most common and most powerful basic technique in social science statistics, and something that is used in virtually every analysis that attempts to establish any sort of cause and effect
- The additional X variables are typically considered to be control variables
- Some variables may also be categorical, especially when they are "dummy variables" (0 or 1)

Linear model

The linear model can be written as:
$$Y_i = X_i\beta + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2)$$
Or, alternatively, as:
$$Y_i \sim N(\mu_i, \sigma^2), \qquad \mu_i = X_i\beta$$

Linear model

Two components of the model:
- Stochastic: $Y_i \sim N(\mu_i, \sigma^2)$
- Systematic: $\mu_i = X_i\beta$

Generalised version:
- Stochastic: $Y_i \sim f(\theta_i, \alpha)$
- Systematic: $\theta_i = g(X_i, \beta)$

Model

- Stochastic component, $Y_i \sim f(\theta_i, \alpha)$: varies over repeated (hypothetical) observations on the same unit.
- Systematic component, $\theta_i = g(X_i, \beta)$: varies across units, but is constant given X.
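To make the stochastic/systematic distinction concrete, here is a small simulation sketch (the parameter values 2, 3, and 1.5 are illustrative assumptions, not from the lecture data): the systematic component fixes each µᵢ, the stochastic component draws Yᵢ around it, and lm() recovers β up to sampling error.

# simulate from Y_i ~ N(mu_i, sigma^2), with mu_i = beta0 + beta1 * x_i
set.seed(1)
n <- 200
xsim <- rnorm(n)
beta0 <- 2; beta1 <- 3; sigma <- 1.5
mu <- beta0 + beta1 * xsim                  # systematic component
ysim <- rnorm(n, mean = mu, sd = sigma)     # stochastic component
coef(lm(ysim ~ xsim))                       # estimates should be close to 2 and 3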
Model

$$Y_i \sim f(\theta_i, \alpha) \ \text{(stochastic)}, \qquad \theta_i = g(X_i, \beta) \ \text{(systematic)}$$

Two types of uncertainty:
- Estimation uncertainty: lack of knowledge about α and β; can be reduced by increasing N.
- Fundamental uncertainty: represented by the stochastic component; exists independent of the researcher.

Inference from regression

In linear regression, the sampling distribution of the coefficient estimates is normal, and is approximated by a t distribution because we approximate σ by s. Thus we can calculate a confidence interval for each estimated coefficient, or perform a hypothesis test along the lines of:
$$H_0: \beta_1 = 0, \qquad H_1: \beta_1 \neq 0$$

Inference from regression

To calculate the confidence interval, we need the standard error of the coefficient. Rule of thumb for the 95% confidence interval:
$$\beta - 2SE < \beta < \beta + 2SE$$
Thus if β is positive, we are 95% certain it is different from zero when $\beta - 2SE > 0$ (or when the t value is greater than 2 or less than −2). In R, we get the standard errors by using the summary() command on the model output.

Simple regression model: example

Does infant mortality get lower as GDP per capita (measured in 1976) increases?

> m <- lm(INFMORT ~ LEVEL, data=aclp, subset=(YEAR == 1976 & INFMORT >= 0))
> summary(m)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) 33.3028647  2.7222356  12.234 9.44e-13 ***
LEVEL       -0.0018998  0.0003038  -6.253 9.29e-07 ***

F-test

In simple linear regression, we can do an F-test:
$$H_0: \beta_1 = 0, \qquad H_1: \beta_1 \neq 0$$
$$F = \frac{\text{ESS}/1}{\text{RSS}/(n-2)} = \frac{\text{ESS}}{\hat{\sigma}^2}$$
with 1 and n − 2 degrees of freedom.

Example

> require(foreign)
> dail <- read.dta("dailcorrected.dta")
> summary(lm(votes1st ~ spend_total + incumb + electorate + minister, data=dail))

Call:
lm(formula = votes1st ~ spend_total + incumb + electorate + minister, data = dail)

Residuals:
    Min      1Q  Median      3Q     Max
-4934.1 -1038.8  -347.6  1054.0  6900.3

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)      7.966e+02  4.172e+02   1.909   0.0569 .
spend_total      1.737e-01  1.095e-02  15.862   <2e-16 ***
incumbIncumbent  2.522e+03  2.207e+02  11.424   <2e-16 ***
electorate      -4.827e-04  5.404e-03  -0.089   0.9289
minister        -1.303e+02  3.965e+02  -0.329   0.7425
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1847 on 458 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared: 0.6478,  Adjusted R-squared: 0.6447
F-statistic: 210.6 on 4 and 458 DF,  p-value: < 2.2e-16
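The F-test formula above can also be verified by hand. A hedged sketch (reusing the small sentencing example rather than the Dáil data) that computes ESS and RSS from an lm fit and compares the result with the F-statistic line of summary():

# simple-regression F-test by hand, sentencing example
x <- c(0,3,1,0,6,5,3,4,10,8)
y <- c(12,13,15,19,26,27,29,31,40,48)
m <- lm(y ~ x)
ess <- sum((fitted(m) - mean(y))^2)
rss <- sum(resid(m)^2)
Fstat <- (ess / 1) / (rss / (length(y) - 2))
Fstat                                             # should match summary(m)'s F-statistic
pf(Fstat, 1, length(y) - 2, lower.tail = FALSE)   # its p-value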
CLRM: Basic assumptions

1. Specification:
   - The relationship between X and Y in the population is linear: $E(Y) = X\beta$
   - No extraneous variables in X
   - No omitted independent variables
   - The parameters (β) are constant
2. $E(\varepsilon) = 0$
3. Error terms:
   - $\text{Var}(\varepsilon) = \sigma^2$, or homoskedastic errors
   - $E(\varepsilon_i \varepsilon_j) = 0$ for $i \neq j$, or no autocorrelation

CLRM: Basic assumptions (cont.)

4. X is non-stochastic, meaning observations on the independent variables are fixed in repeated samples
   - implies no measurement error in X
   - implies no serial correlation where a lagged value of Y would be used as an independent variable
   - no simultaneity or endogenous X variables
5. N > k, or the number of observations is greater than the number of independent variables (in matrix terms: rank(X) = k), and no exact linear relationships exist in X
6. Normally distributed errors: $\varepsilon | X \sim N(0, \sigma^2)$. Technically, however, this is a convenience rather than a strict assumption.

Ordinary Least Squares (OLS)

- Objective: minimize $\sum e_i^2 = \sum (Y_i - \hat{Y}_i)^2$, where $\hat{Y}_i = b_0 + b_1 X_i$ and the error is $e_i = Y_i - \hat{Y}_i$
- The slope is
$$b_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} = \frac{\sum x_i y_i}{\sum x_i^2}$$
  (the second form uses lower-case letters for deviations from means)
- The intercept is $b_0 = \bar{Y} - b_1\bar{X}$

OLS rationale

- The formulas are very simple
- Closely related to ANOVA (sums of squares decomposition)
- The predicted Y is the sample mean when Pr(Y|X) = Pr(Y)
  - In the special case where Y has no relation to X, $b_1 = 0$, and the OLS fit is simply $\hat{Y} = b_0$
  - Why? Because $b_0 = \bar{Y} - b_1\bar{X}$, so $\hat{Y} = \bar{Y}$
  - The prediction is then the sample mean when X is unrelated to Y
- Since OLS is then an extension of the sample mean, it has the same attractive properties (efficiency and lack of bias)
- Alternatives exist, but OLS generally has the best properties when its assumptions are met

OLS in matrix notation

- Formula for the coefficient β:
$$\begin{aligned}
Y &= X\beta + \varepsilon \\
X'Y &= X'X\beta + X'\varepsilon \\
X'Y &= X'X\beta + 0 \\
(X'X)^{-1}X'Y &= \beta + 0 \\
\beta &= (X'X)^{-1}X'Y
\end{aligned}$$
- Formula for the variance-covariance matrix: $\sigma^2(X'X)^{-1}$
  - In the simple case where $y = \beta_0 + \beta_1 x$, this gives $\sigma^2 / \sum (x_i - \bar{x})^2$ for the variance of $\beta_1$
  - Note how increasing the variation in X will reduce the variance of $\beta_1$

The "hat" matrix

- The hat matrix H is defined by:
$$\begin{aligned}
\hat{\beta} &= (X'X)^{-1}X'y \\
X\hat{\beta} &= X(X'X)^{-1}X'y \\
\hat{y} &= Hy
\end{aligned}$$
- $H = X(X'X)^{-1}X'$ is called the hat matrix
- Other important quantities, such as $\hat{y}$ and $\sum e_i^2$ (RSS), can be expressed as functions of H
- Corrections for heteroskedastic errors ("robust" standard errors) involve manipulating H

Some important OLS properties to understand

Applies to $y = \alpha + \beta x + \varepsilon$:
- If $\beta = 0$ and the only regressor is the intercept, then this is the same as regressing y on a column of ones, and hence $\alpha = \bar{y}$, the mean of the observations
- If $\alpha = 0$, so that there is no intercept and one explanatory variable x, then $\beta = \sum xy / \sum x^2$
- If there is an intercept and one explanatory variable, then
$$\beta = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{\sum_i (x_i - \bar{x})y_i}{\sum (x_i - \bar{x})^2}$$

Some important OLS properties (cont.)

- If the observations are expressed as deviations from their means, $y^* = y - \bar{y}$ and $x^* = x - \bar{x}$, then $\beta = \sum x^* y^* / \sum x^{*2}$
- The intercept can be estimated as $\bar{y} - \beta\bar{x}$. This implies that the intercept is estimated by the value that causes the sum of the OLS residuals to equal zero.
- The mean of the $\hat{y}$ values equals the mean of the y values; together with the previous properties, this implies that the OLS regression line passes through the overall mean of the data points

Normally distributed errors

OLS in R

> dail <- read.dta("dail2002.dta")
> mdl <- lm(votes1st ~ spend_total*incumb + minister, data=dail)
> summary(mdl)

Call:
lm(formula = votes1st ~ spend_total * incumb + minister, data = dail)

Residuals:
    Min      1Q  Median      3Q     Max
-5555.8  -979.2  -262.4   877.2  6816.5

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)         469.37438  161.54635   2.906  0.00384 **
spend_total           0.20336    0.01148  17.713  < 2e-16 ***
incumb             5150.75818  536.36856   9.603  < 2e-16 ***
minister           1260.00137  474.96610   2.653  0.00826 **
spend_total:incumb   -0.14904    0.02746  -5.428 9.28e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1796 on 457 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared: 0.6672,  Adjusted R-squared: 0.6643
F-statistic: 229 on 4 and 457 DF,  p-value: < 2.2e-16
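To tie the matrix formulas above to R, here is a hedged sketch (again using the small sentencing example rather than the Dáil data) that computes (X'X)⁻¹X'y and the hat matrix directly and checks them against lm():

# OLS via matrix algebra, sentencing example
x <- c(0,3,1,0,6,5,3,4,10,8)
y <- c(12,13,15,19,26,27,29,31,40,48)
m <- lm(y ~ x)
X <- model.matrix(m)                           # design matrix: intercept column plus x
betahat <- solve(t(X) %*% X) %*% t(X) %*% y    # (X'X)^{-1} X'y
betahat                                        # should match coef(m): 14 and 3
H <- X %*% solve(t(X) %*% X) %*% t(X)          # the hat matrix
all.equal(as.vector(H %*% y), as.vector(fitted(m)))   # yhat = Hy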
OLS in Stata

. use dail2002
(Ireland 2002 Dail Election - Candidate Spending Data)

. gen spendXinc = spend_total * incumb
(2 missing values generated)

. reg votes1st spend_total incumb minister spendXinc

      Source |       SS       df       MS              Number of obs =     462
-------------+------------------------------           F(  4,   457) =  229.05
       Model |  2.9549e+09     4  738728297            Prob > F      =  0.0000
    Residual |  1.4739e+09   457 3225201.58            R-squared     =  0.6672
-------------+------------------------------           Adj R-squared =  0.6643
       Total |  4.4288e+09   461 9607007.17            Root MSE      =  1795.9

------------------------------------------------------------------------------
    votes1st |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 spend_total |   .2033637   .0114807    17.71   0.000     .1808021    .2259252
      incumb |   5150.758   536.3686     9.60   0.000     4096.704    6204.813
    minister |   1260.001   474.9661     2.65   0.008      326.613     2193.39
   spendXinc |  -.1490399   .0274584    -5.43   0.000    -.2030003   -.0950794
       _cons |   469.3744   161.5464     2.91   0.004     151.9086    786.8402
------------------------------------------------------------------------------

Sums of squares (ANOVA)

- TSS, total sum of squares: $\sum (y_i - \bar{y})^2$
- ESS, estimation (or regression) sum of squares: $\sum (\hat{y}_i - \bar{y})^2$
- RSS, residual sum of squares: $\sum e_i^2 = \sum (\hat{y}_i - y_i)^2$

The key to remember is that TSS = ESS + RSS.

Examining the sums of squares

> yhat <- mdl$fitted.values   # uses the lm object mdl from previous
> ybar <- mean(mdl$model[,1])
> y <- mdl$model[,1]          # can't use dail$votes1st since diff N
> TSS <- sum((y-ybar)^2)
> ESS <- sum((yhat-ybar)^2)
> RSS <- sum((yhat-y)^2)
> RSS
[1] 1473917120
> sum(mdl$residuals^2)
[1] 1473917120
> (r2 <- ESS/TSS)
[1] 0.6671995
> (adjr2 <- (1 - (1-r2)*(462-1)/(462-4-1)))
[1] 0.6642865
> summary(mdl)$r.squared      # note the call to summary()
[1] 0.6671995
> RSS/457
[1] 3225202
> sqrt(RSS/457)
[1] 1795.885
> summary(mdl)$sigma
[1] 1795.885

Regression model return values

Here we will talk about the quantities returned with the lm() command and lm class objects (see the short sketch after the outline below).

Next (last) week

- ANOVA: analysis of variance
- Regression with categorical independent variables
- Regression with interaction terms (Brambor, Clark, and Golder)
- Preview of Quant 2 with non-linear/normal models
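Looking back to the "Regression model return values" slide, a brief hedged sketch of the kinds of quantities stored in an lm object such as mdl fitted above (output omitted here; the stored components can be listed with names()):

# inspecting an lm object
names(mdl)            # components stored in the fit: coefficients, residuals, fitted.values, ...
coef(mdl)             # estimated coefficients
head(fitted(mdl))     # fitted values yhat
head(resid(mdl))      # residuals
vcov(mdl)             # estimated variance-covariance matrix of the coefficients
confint(mdl)          # confidence intervals for the coefficients
names(summary(mdl))   # summary() adds sigma, r.squared, fstatistic, and more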