Communication Studies 783 Notes
Stuart Soroka, University of Michigan
Working Draft, December 2015

These pages were written originally as my own lecture notes, in part for Poli 618 at McGill University, and now for Comm 783 at the University of Michigan. Versions are freely available online, at snsoroka.com. The notes draw on a good number of statistics texts, including Kennedy's Econometrics, Greene's Econometric Analysis, a number of volumes in Sage's quantitative methods series, and several online textbooks. That said, please do keep in mind that they are just lecture notes: there are errors and omissions, and for no single topic is there enough information in this file to actually learn statistics from the notes alone. (There are of course many textbooks that are better equipped for that purpose.) The notes may nonetheless be a useful background guide to some of the themes in Comm 783 and perhaps, more generally, to some of the basic statistics most common in communication studies. If you find errors (and you will), please do let me know.

Thanks,
Stuart Soroka
ssoroka@umich.edu

Table of Contents

Variance, Covariance and Correlation ..... 3
Introducing Bivariate Ordinary Least Squares Regression ..... 5
Multivariate Ordinary Least Squares Regression ..... 11
Error, and Model Fit ..... 12
Assumptions of OLS regression ..... 17
Nonlinearities ..... 18
Collinearity and Multicollinearity ..... 20
Heteroskedasticity ..... 22
Outliers ..... 24
Models for dichotomous data ..... 25
   Linear Probability Models
   Nonlinear Probability Model: Logistic Regression
   An Alternative Description: The Latent Variable Model
   Nonlinear Probability Model: Probit Regression
Maximum Likelihood Estimation ..... 31
Models for Categorical Data ..... 32
   Ordinal Outcomes
   Nominal Outcomes
Appendix A: Significance Tests ..... 36
   Distribution Functions
   The chi-square test
   The t test
   The F Test
Appendix B: Factor Analysis ..... 43
   Background: Correlations and Factor Analysis
   An Algebraic Description
   Factor Analysis Results
   Rotated Factor Analyses
Appendix C: Taking Time Seriously ..... 51
   Univariate Statistics
   Bivariate Statistics

Variance, Covariance and Correlation

Let's begin with Yi, a continuous variable measuring some value for each individual (i) in a representative sample of the population.
Yi can be income, or age, or a thermometer score expressing degrees of approval for a presidential candidate. Variance in our variable Yi is calculated as follows: (1) (2) SY2 = SY2 = (Yi N N( Ȳ )2 1 , or Yi2 ) N (N ( Yi )2 1) , where both versions are equivalent, and the latter is referred to as the computational formula (because it is, in principle, easier to calculate by hand). Note that the equation is pretty simple: we are interested in variance in Yi, and Equation 1 is basically taking the average of each individual Yi’s variance around the mean ( Y ). There are a few tricky parts. First, the differences between each individual Yi and Y (that is, Yi − Y ) are squared in Equation 1, so that negative values do not € € cancel out positive values (since squaring will lead to only positive values). Second, we use N-1 as the denominator rather than N (where N is the number of € This produces a more conservative (slightly inflated) result, in light of the cases). fact that we’re working with a sample rather than the population – that is, the values of Yi in our (hopefully) representative sample, and the values of Yi that we believe may exist in the total real-world population. For a small-N samples, where we might suspect that we under-estimate the variance in the population, using N-1 effectively adjusts the estimated variance upwards. With a large-N sample, the difference between N-1 and N is increasingly marginal. That the adjustment matters more for small sample than for big samples reflects our increasing confidence in the representative-ness of our sample as it increases. (Note that some texts distinguish between SY2 and σ Y2 , where the Roman S is the sample variance and the Greek σ is the population variance. Indeed, some texts will distinguish between sample values and population values using Roman and € an estimated slope coefficient, for Greek versions across the board€ – B for € actual slope in the population. I am not this systematic instance, and β for an below.) The standard deviation is a simple function of variance: € Dec 2015 (3) Comm783 Notes, Stuart Soroka, University of Michigan SY = ⇥ SY2 = ⇤ (Yi N pg 4 ! Ȳ )2 , 1 So standard deviations are also indications of the extent to which a given variable varies around its mean. SY is important for understanding distributions and significance tests, as we shall see below. So far, we’ve looked only at univariate statistics – statistics describing a single variable. Most of the time, though, what we want to do is describe relationships between two (or more) variables. Covariance – a measure of common variance between two variables, or how much two variables change together – is calculated as follows: (4) SXY = (5) SXY = (Xi N Ȳ ) X̄)(Yi N 1 Xi Yi N (N Xi 1) , or Yi , the latter of which is the computational formula. Again, we use N-1 as the denominator, for the same reasons as above. Pearson’s correlation coefficient is also based on a ratio of covariances and standard deviations, as follows: (6) (7) r= SXY SX SY , or r= ⇥ (Xi (Xi X̄)(Yi X̄)2 (Yi Ȳ ) Ȳ )2 . where SXY is the sample covariance between Xi and Yi, and SX and SY are the sample standard deviations of Xi and Yi respectively. (Note the relationship between this Equation 7, and the preceding equations for standard deviations and covariances, Equation 3 and Equation 4.) Dec 2015 Comm783 Notes, Stuart Soroka, University of Michigan pg 5 ! Introducing Bivariate Ordinary Least Squares Regression Take a simple data series, X 2 4 6 8 Y 2 3 4 5 and plot it… ! 
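Before going further, it may help to verify Equations 1 through 7 numerically on this small series. The following is a minimal sketch in Python (it assumes only that numpy is available), computing the sample variance, standard deviation, covariance and Pearson's r by hand and then checking the results against numpy's built-in functions.

```python
import numpy as np

# The small series introduced above.
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([2.0, 3.0, 4.0, 5.0])
n = len(x)

# Equation 1: sample variance, with N - 1 in the denominator.
var_x = np.sum((x - x.mean()) ** 2) / (n - 1)
var_y = np.sum((y - y.mean()) ** 2) / (n - 1)

# Equation 3: the standard deviation is the square root of the variance.
sd_x, sd_y = np.sqrt(var_x), np.sqrt(var_y)

# Equation 4: sample covariance between X and Y.
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# Equation 6: Pearson's r as covariance over the product of standard deviations.
r = cov_xy / (sd_x * sd_y)

print(var_y, sd_y, cov_xy, r)
# numpy's built-in versions (which also use N - 1) should agree:
print(np.var(y, ddof=1), np.cov(x, y)[0, 1], np.corrcoef(x, y)[0, 1])
```

For this perfectly linear series the correlation is exactly 1, which is just what the plot suggests.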
What we want to do is describe the relationship between X and Y. Essentially, we want to draw a line between the dots, and describe that line. Given that the data here are relatively simple, we can just do this by hand, and describe it using two basic properties, α and β : € € ! Dec 2015 Comm783 Notes, Stuart Soroka, University of Michigan where α , the constant, is in this case equal to 1, and pg 6 ! , the slope, is 1 (the increase in Y) divided by 2 (the increase in X) = .5. So we can produce an equation for this line allowing us to predict values of Y based on values of X. The € general model is, (8) Yi = + ⇥Xi And the particular model in this case is Y = 1 + .5X. Note that the constant is simply a function of the means of both X and Y, along with the slope. That is: (9) = Ȳ ⇥ X̄ X 2 4 6 8 5 mean So, following Equation 9, = Ȳ Y 2 3 4 5 3.5 ⇥ X̄ = 3.5 – (.5)*5 = 3.5 – 2.5 = 1 . This is pretty simple. The difficulty is that data aren’t like this –they don’t fall along a perfect line. They’re likely more like this: X 2 4 6 8 ! Y 3 2 5 5 Dec 2015 Comm783 Notes, Stuart Soroka, University of Michigan pg 7 ! Now, note that we can draw any number of lines that will satisfy Equation 8. All that matters is that the line goes through the means of X and Y. So the means are: mean X 2 4 6 8 5 Y 3 2 5 5 3.75 And let’s make up an equation where Y=3.75 when X=5… Y = α + βX 3.75 = α + ( β )*5 3.75 = 4 + ( β )*5 € € 3.75 = 4 + (-.05)*5 € € 3.75 = 4 + (-.25) € So here it is: Y = 4 + (-.05)X . Plotted, it looks like this: ! Note that this new model has to be expressed in a slightly different manner, including an error term: (10) Yi = + ⇥Xi + ⇤i , or, alternatively: (11) Yi = Ŷi + i , Dec 2015 Comm783 Notes, Stuart Soroka, University of Michigan where pg 8 ! are the estimated values of the actual Yi , and where the error can be expressed in the following ways: (12) i Ŷ . or ⇤i = Yi = Yi ( +€⇥Xi ) . So we’ve now accounted for the fact that we work with messy data, and that there will consequently be a certain degree of error in the model. This is inevitable, of course, since we’re trying to draw a straight line through points that are unlikely to be perfectly distributed along a straight line. Of course, the line above won’t do – it quite clearly does not describe the relationship between X and Y. What we need is a method of deriving a model that better describes the effect that X has on Y – essentially, a method that draws a line that comes as close to all the dots as possible. Or, more precisely, a model that minimizes the total amount of error( εi). € ! We first need a measure of the total amount of error – the degree to which our predictions ‘miss’ the actual values of Yi . We can’t simply take the sum of all errors, ! i, because positive and negative errors can cancel each other out. We could take the sum of the absolute values, ! | i |, which in fact is used in € some estimations. 2 The norm is to use the sum of squared errors, the SSE or ! i . This sum is most greatly affected by large errors – by squaring residuals, large residuals take on 2 very large magnitudes. An estimation of Equation 10 that tries to minimize ! i accordingly tries especially hard to avoid large errors. (By implication, outlying cases will have a particularly strong effect on the overall estimation. We return to this in the section on outliers below.) Dec 2015 Comm783 Notes, Stuart Soroka, University of Michigan pg 9 ! 
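To make these three error summaries concrete, here is a small sketch in Python (assuming numpy) that compares two candidate lines drawn through the means of the messy data above: the made-up line Y = 4 + (-.05)X, and an arbitrary alternative with a slope of .5.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([3.0, 2.0, 5.0, 5.0])

def error_summaries(alpha, beta):
    """Return the sum, sum of absolute values, and sum of squares of the residuals."""
    e = y - (alpha + beta * x)
    return e.sum(), np.abs(e).sum(), (e ** 2).sum()

# The made-up line from the text (slope -.05) and an arbitrary alternative (slope .5),
# each forced through the means, so that alpha = ybar - beta * xbar.
for beta in (-0.05, 0.5):
    alpha = y.mean() - beta * x.mean()
    print(beta, error_summaries(alpha, beta))
```

Both sets of residuals sum to zero (up to rounding), which is exactly why the raw sum is uninformative; the sum of squared errors, roughly 7.7 versus 2.75, separates the two candidate lines clearly.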
This is what we are trying to do in ordinary least squares (OLS) regression: minimize the SSE, and have an estimate of β (on which our estimate of α relies) that comes as close to all the dots as is possible. Least-squares coefficients for simple bivariate regression € are estimated as € follows: (13) = (14) = (Xi N X̄)(Yi Ȳ ) , or (Xi X̄)2 Yi Xi N Xi2 Yi Xi . ( Xi )2 The latter is referred to as the computational formula, as it’s supposed to be easier to compute by hand. (I actually prefer the former, which I find easier to compute, and has the added advantage of nicely illustrating the important features of OLS regression.) We can use Equation 13 to calculate the Least Squares estimate for the above data: The data… € Calculated values (used in Equation 13)… Xi Yi Xi − X Yi − Y (X i − X )(Yi − Y ) (X i − X ) 2 2 4 6 €8 3 2 5 5 -3 -1 1 €3 -0.75 -1.75 1.25 €1.25 2.25 1.75 1.25 3.75€ ∑= 9 9 1 1 9 ∑ =20 X i =5 € Yi =3.75 So solving Equation 13 with the values above looks like this: € € ! = (Xi € € X̄)(Yi Ȳ ) 9 = = .45 2 20 (Xi X̄) And we can use these results in Equation 9 to find the constant: ! = Ȳ ⇥ X̄ = 3.75 (.45) ⇥ 5 = 3.75 So the final model looks like this: Yi = 1.5 + (.45) ⇥ Xi 2.25 = 1.5 Dec 2015 Comm783 Notes, Stuart Soroka, University of Michigan pg 1 !0 Using this model, we can easily see what the individual predicted values ( Yˆi ) are, as well as the associated errors ( εi ): Xi Yi Yˆi 2 4 6 € 8 3 2 5 € 5 € 2.4 X i =5 Yi =3.75 € € εi = Yˆi − Yi 3.3 € 4.2 5.1 € 0.6 -1.3 0.8 -0.1 One € further note about Equation 13, and our means of estimating OLS slope coefficients: Recall the equations for variance (Equation 1) and covariance (Equation 4). If we take the ratio of covariance and variance, as follows, SXY (15) = Sx2 P (Xi X̄)(Yi Ȳ ) N 1 P (Xi X̂)2 N 1 , we can adjust somewhat to produce the following, (16) SXY = Sx2 P (Xi X̄)(Yi Ȳ ) P (Xi X̂)2 , where Equation 16 simply drops the N-1 denominators, which cancel each other out. More importantly, Equation 16 looks suspiciously – indeed, exactly – like the formula for β (Equation 13). β is thus essentially a ratio between the covariance between X and Y, and the variance of X, as follows: € € Dec 2015 Comm783 Notes, Stuart Soroka, University of Michigan (17) = YX pg 1 !1 SY X 2 SX This should make sense when we consider the standard interpretation of β : for a one-unit shift in X, how much does Y change? € Multivariate Ordinary Least Squares Regression Things are more complicated for multiple, or multivariate, regression, where there is more than one independent variable. The standard OLS multivariate model is nevertheless a relatively simple extension of bivariate regression – imagine, for instance, plotting a line through dots plotted along two X axes, in what amounts to three-dimensional space: ! This is all we’re doing in multivariate regression – drawing a line through these dots, where values of Y are driven by a combination of X1 and X2, and where the model itself would be as follows: (18) Yi = + ⇥1 X1 i + ⇥2 X2 i + ⇤i . That said, when we have more than two regressors, we start plotting lines through four- and five-dimensional space, and that gets hard to draw. 
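Before turning to the two-regressor formulas, here is a quick check of the bivariate worked example above: a minimal sketch in Python (assuming numpy) that computes the slope via Equation 13, the constant via Equation 9, and confirms the covariance-to-variance ratio in Equation 17.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([3.0, 2.0, 5.0, 5.0])

# Equation 13: the least-squares slope.
beta = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

# Equation 9: the constant, from the means and the slope.
alpha = y.mean() - beta * x.mean()

# Equation 17: the slope is also the ratio of Cov(X, Y) to Var(X).
beta_ratio = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

print(beta, alpha, beta_ratio)    # 0.45, 1.5, 0.45
print(np.polyfit(x, y, deg=1))    # numpy's own fit: slope 0.45, intercept 1.5
```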
Least squares coefficients for multiple regression with two regressors, as in Equation 18, are calculated as follows: Dec 2015 (19) β1 = Comm783 Notes, Stuart Soroka, University of Michigan pg 1 !2 (∑ (X1i − X1 )(Yi − Y )∑ (X 2i − X 2 )) − (∑ (X 2i − X 2 )(Y i−Y )∑ (X1i − X1 )(X 2i − X 2 )) (∑ (X1i − X1 ) 2 ∑ (X 2i − X 2 ) 2 ) − (∑ (X1i − X1 )(X 2i − X 2 )) 2 , and € (20) β 2 = (∑ (X 2i − X 2 )(Yi − Y )∑ (X1i − X1 )) − (∑ (X1i − X1 )(Y i−Y )∑ (X1i − X1 )(X 2i − X 2 )) (∑ (X1i − X1 ) 2 ∑ (X 2i − X 2 ) 2 ) − (∑ (X1i − X1 )(X 2i − X 2 )) 2 , and the constant is now estimated as follows: € = Ȳ (21) ⇥1 X̄1 ⇥2 X̄2 . Error, and Model Fit The standard deviation of the residuals, or the standard error of the slope, is as follows, (22) SE = ⇥ 2 i N 2, Or, more generally, (23) SE = ⇥ 2 i N K 2, Equation 22 is the same as Equation 23, except that the former is a simple version that applies to bivariate regression only, and the latter is a more general version that applies to multivariate regression with any number of independent variables. N in these equations refers to the total number of cases, while K is the total number of independent variables in the model. The SE β is a useful measure of the fit of a regression slope – it gives you the average error of the prediction. It’s also used to test the significance of the slope coefficient. For instance, if we are going to be 95% confident that our estimate is € significantly different from zero, zero should not fall within the interval β ± 2(SE β ) . Alternatively, if we are using t-statistics to examine coefficients’ significance, then the ratio of β to SE β should be roughly 2. € Assuming you remember the basic sampling and distributional material in your basic statistics course, this reasoning should sound familiar. Here’s a quick € € refresher: Testing model fit is based on some standard beliefs about Dec 2015 Comm783 Notes, Stuart Soroka, University of Michigan pg 1 !3 distributions. Normal distributions are unimodel, symmetric, and are described by the following probability distribution: (24) p(Y ) = e (Y 2 µY )2 /2⇥Y 2 ⇥Y2 where p(Y) refers to the probability of a given value of Y, and where the shape of the curve is determined by only two values: the population mean, variance, , and its . (Also see our discussion of distribution functions, below.) Assuming two distributions with the same mean (of zero, for instance), the effect of changing variances is something like this: ! We know that many natural phenomena follow a normal distribution. So we assume that many political phenomena do as well. Indeed, where the current case is concerned, we believe that our estimated slope coefficient, β , is one of a distribution of possible β s we might find in repeated samples. These β s are normally distributed, with a standard deviation that we try to estimate from our € data. € € We also know that in any normal distribution, roughly 68% of all cases fall within plus or minus one standard deviation from the mean, and 95% of all cases fall within plus or minus two standard deviations from the mean. It follows that our slope should not be within two standard errors of zero. If it is, we cannot be 95% confidence that our coefficient is significantly different from zero – that is, we cannot reject the null hypothesis that there is no significant effect. Dec 2015 Comm783 Notes, Stuart Soroka, University of Michigan pg 1 !4 Going through this process step-by-step is useful. 
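Before that, here is a quick empirical reminder of the one- and two-standard-deviation benchmarks just mentioned: a small simulation sketch in Python (assuming numpy), drawing a large standard-normal sample and checking the share of cases that fall within one and two standard deviations of the mean.

```python
import numpy as np

rng = np.random.default_rng(783)
draws = rng.normal(loc=0.0, scale=1.0, size=1_000_000)

print(np.mean(np.abs(draws) <= 1))   # roughly .68 of cases within one standard deviation
print(np.mean(np.abs(draws) <= 2))   # roughly .95 of cases within two standard deviations
```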
Let’s begin with our estimated bivariate model from page 8, where the model is Yi = 1.5 + (.45)*Xi, and the data are, Xi Yi 2 4 6€ 8 3 2 5€ 5 € X i =5 Yˆi εi = Yˆi − Yi εi2 2.4 3.3 €4.2 5.1 0.6 -1.3 0.8 -0.1 0.36 1.69 0.64 0.01 € 2.7 Yi =3.75 Based on Equation 22, we calculate the standard error of the slope as follows: € SE = ! ⇤€ 2 i 2 N = ⇥ 2.7 = 4 2 ⇥ 2.7 ⇥ = 1.35 = 1.16 2 So, we can be 95% confident that the slope estimate in the population is .45 ± (2 ×1.16) , or .45 ± 2.32 . Zero is certainly within this interval, so our results are not statistically significant. This is mainly due to our very small sample size. Imagine the same slope and SE β , but based on a sample of 200 cases: € € SE = ! ⇤ €N 2 i 2 = ⇥ 2.7 = 200 2 ⇥ ⇥ 2.7 = .014 = .118 198 Now we can be 95% confident that the slope estimate in the population is .45 ± (2 × .118) , or .45 ± .236. Zero is not within this interval, so our results in this case would be statistically significant. Just to recap, our decision about the statistical significance of the slope is based € on a combination of the magnitude of the slope ( β ), the total amount of error in € the estimate (using the SE β ), and the sample size (N, used in our calculation of the SE β ). Any one of these things can contribute to significant findings: a € greater slope, less error, and/or a larger sample size. (Here, we saw the effect € that sample size can have.) € Another means of examining the overall model fit – that is, including all independent variables in a multivariate context – is by looking at proportion of the total variation in Yi explained by the model. First, total variation can be decomposed into ‘explained’ and ‘unexplained’ components as follows: Dec 2015 Comm783 Notes, Stuart Soroka, University of Michigan pg 1 !5 TSS is the Total Sum of Squares RSS is the Regression Sum of Squares (note that some texts call this RegSS) ESS is the Error Sum of Squares (some texts call this the residual sum of squares, RSS) So, TSS = RSS + ESS, where (25) T SS = (Yi Ȳ )2 , (26) RSS = (Ŷi Ȳ )2 , and = (Yi Ŷ )2 (27) ESS We’re basically dividing up the total variance in Yi around its mean (TSS) into two parts: the variance accounted for in the regression model (RSS), and the variance not accounted for by the regression model (ESS). Indeed, we can illustrate on a case-by-case basis the variance from the mean that is accounted for by the model, and the remaining, unaccounted for, variance: ! All the explained variance (squared) is summed to form RSS; all the unexplained variance (squared) is summed to form ESS. Using these terms, the coefficient of determination, more commonly, the R2, is calculated as follows: (28) R2 = RSS , or T SS R2 = 1 ESS , or T SS R2 = T SS ESS . T SS Dec 2015 Comm783 Notes, Stuart Soroka, University of Michigan pg 1 !6 Or, alternatively, following from Equation 25-Equation 27: (29) R2 = (Yˆ1 (Yi RSS = T SS Ȳ )2 = Ȳ )2 (Yi Ȳ )2 (Yi (Yi Ȳ )2 Ŷi )2 And we can estimate all of this as follows: Xi Yi 2 4 6€ 8 3 2 5€ 5 X i =5 Yi =3.75 € € Yˆi (Yi − Y ) 2 (Yˆi − Y ) 2 (Yi − Yˆi ) 2 2.4 3.3 €4.2 5.1 0.56 3.06 1.56 € 1.56 1.82 0.2 0.2 € 1.82 0.36 1.69 0.64 0.01 TSS=6.74 RSS=4.04 ESS=2.7 R2 = € The coefficient of determination is thus ! RSS 4.04 = = .599 T SS 6.74 . The coefficient of determination is calculated the same way for multivariate regression. The R2 has one problem, though – it can only ever increase or stay equal as variables are added to the equation. 
More to the point, including extra variables can never lower the R2, and the measure accordingly does not reward for model parsimony. If you want a measure that does so, you need to use a ‘correction’ for degrees of freedom (sometimes called an adjusted R-squared): (30) R˜2 = 1 RSS N K 1 T SS N 1 Note that this should only make a difference when the sample size is relatively small, or the number of independent variables is relatively large. But you can see in Equation 30 that if the sample size is small, increasing the number of variables will reduce the numerator, and thus reduce the adjusted R2. One further note about the coefficient of determination: note that the R2 is equivalent to the square of Pearson’s r (Equation 6). That is, (31) r= SXY = SX SY 2 RXY , There is, then, a clear relationship between the correlation coefficient and the coefficient of determination. There is also a relationship between a bivariate Dec 2015 Comm783 Notes, Stuart Soroka, University of Michigan pg 1 !7 correlation coefficient and the regression coefficient. Let’s begin with an equation for the regression coefficient, as in Equation 17 above: (32) XY = SXY 2 , SX and rearrange these terms to isolate the covariance: (33) SXY = XY 2 , SX Now, let’s substitute this for in the equation for correlation (Equation 6): (34) rXY 2 SXY XY SX = = . SX SY SX SY So the correlation coefficient and bivariate regression coefficient are products of each other. More clearly: (35) (36) rXY = XY XY = rXY SX SY , and SY . SX The relationship between the two in multivariate regression is of course much more complicated. But the point is that all these measures - measures capturing various aspects of the relationship between two (or more) variables - are related to each other, each a function of a given set of variances and covariances. Assumptions of OLS regression The preceding OLS linear regression models are unbiased and efficient (that is, they provide the Best Linear Unbiased Estimator, or BLUE) provided five assumptions are not violated. If any of these assumptions are violated, the regular linear OLS model ceases to be unbiased and/or efficient. The assumptions themselves, as well as problems resulting from violating each one, are listed below (drawn from Kennedy, Econometrics). Of course, many data or models violate one or more of these assumptions, so much of what we have to cover now is how to deal with these problems. Dec 2015 Comm783 Notes, Stuart Soroka, University of Michigan pg 1 !8 1. Y can be calculated as a linear function of X, plus a disturbance term. Problems: wrong regressors, nonlinearity, changing parameters 2. Expected value of e is zero; the mean of e is zero. Problems: biased intercept 3. Disturbance terms have the same variance and are not correlated with one another Problems: heteroskedasticity, autocorrelated errors 4. Observations of Y are fixed in repeated samples; it is possible to repeat the sample with the same independent values Problems: errors in variables, autoregression, simultaneity 5. Number of observations is greater than the number of independent variables, and there are no exact linear relationships between the independent variables. Problems: multicollinearity Nonlinearities So far, we’ve assumed that the relationship between Yi and Xi is linear. In many cases, this will not be true. We could imagine any number of non-linear relationships. Here are two just common possibilities: ! ! 
We can of course estimate a linear relationship in both cases – it doesn’t capture the actual relationship very well, though. In order to better capture the relationship between Y and X, we may want to adjust our variables to represent this non-linearity. Dec 2015 Comm783 Notes, Stuart Soroka, University of Michigan pg 1 !9 Let’s begin with the basic multivariate model, (37) Yi = + ⇥1 X1i + ⇥2 X2i + ⇤i . Where a single X is believed to have a nonlinear relationship with Y, the simplest approach is to manipulate the X – to use X2 in place of X, for instance: (38) Yi = 2 + ⇥1 X1i + ⇥2 X2i + ⇤i , This may capture the exponential increase depicted in the first figure above. To capture the ceiling effect in the second figure, we could use both the linear (X) and quadratic (X2), with the expectation that the coefficient for the former ( β1 ) would be positive and large, and the coefficient for the latter ( β 2 ) would be negative and small: (39) Yi = + ⇥1 X1i + 2 ⇥2 X1i + ⇥3 X2i + ⇤i , € € This coefficient on the quadratic will gradually, and increasingly, reduce the positive effect of X1. Indeed, if the effect of the quadratic is great enough, it can in combination with the linear version of X1 produce a line that increases, peaks, and then begins to decrease. Of course, these are just two of the simplest (and most common) nonlinearities. You can imagine any number of different non-linear relationships; most can be captured by some kind of mathematical adjustment to regressors. Sometimes we believe there is a nonlinear relationship between all the Xs and Y – that is, all Xs combined have a nonlinear effect on Y, for instance: (40) Yi = ( + ⇥1 X1i + ⇥3 X2i )2 + ⇤i . The easiest way to estimate this is not Equation 40, though, but rather an adjustment as follows: (41) Yi = + ⇥1 X1i + ⇥3 X2i + ⇤i . Here, we simply transform the dependent variable. I’ve replaced the squared version of the right hand side (RHS) variables with the square root of the left hand side (LHS) because it’s a simple example of a nonlinear transformation. It’s not the most common, however. The most common is taking the log of Y, as follows: (42) ln(Yi ) = + ⇥1 X1i + ⇥3 X2i + ⇤i . Dec 2015 Comm783 Notes, Stuart Soroka, University of Michigan pg 2 !0 Doing so serves two purposes. First, we might believe that the shape of the effect of our RHS variables on Yi is actually nonlinear – and specifically, logistic in shape (a S-curve). This transformation may quite nicely capture this nonlinearity. Second, taking the log of Yi can solve a distributional problem with that variable. OLS estimations will work more efficiently with variables that are normally distributed. If Yi has a great many small values, and a long right-hand tail (as many of our variables will; for instance, income), then taking the log of Yi often does a nice job of generating a more normal distribution. This example highlights a second reason for transforming a variable, on the LHS or RHS. Sometimes, a transformation is based on a particular shape of an effect, based on theory. Other times, a transformation is used to ‘fix’ a non-normally distributed variable. The first transformation is based on theoretical expectations; the second is based on a statistical problems. (In practice, separating the two is not always easy.) Collinearity and Multicollinearity When there is a linear relationship among the regressors, the OLS coefficients are not uniquely identified. 
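A small simulation can make this concrete. The sketch below (Python, assuming numpy; the data and parameter values are made up purely for illustration) generates two nearly identical regressors and re-estimates the model on repeated samples: the individual slope estimates swing wildly from sample to sample, even though their sum, and hence the overall fit, barely moves.

```python
import numpy as np

rng = np.random.default_rng(42)
slopes = []

for _ in range(200):
    n = 100
    x1 = rng.normal(size=n)
    x2 = x1 + rng.normal(scale=0.01, size=n)           # nearly a copy of x1
    y = 1.0 + 0.5 * x1 + 0.5 * x2 + rng.normal(size=n)

    X = np.column_stack([np.ones(n), x1, x2])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)           # OLS by least squares
    slopes.append(b[1:])                                 # keep the two slope estimates

slopes = np.array(slopes)
print(slopes.std(axis=0))         # each slope varies enormously across samples
print(slopes.sum(axis=1).std())   # their sum, which drives the predictions, is stable
```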
This is not a problem if your goal is only to predict Y – multicollinearity will not affect the overall prediction of the regression model. If your goal is to understand how the individual RHS variables impact Y, however, multicollinearity is a big problem. One problem is that the individual p-values can be misleading – confidence intervals on the regression coefficients will be very wide. Essentially, what we are concerned about is the correlation amongst regressors, for instance, X1 and X2: (43) r12 = ⇥ (X1 (Xi X̄2 )(X2 X̄1 )2 (X2 X̄2 ) X̄2 , )2 This is of course just a simple adjustment to the Pearson’s r equation (Equation 7). Equation 43 deals just with the relationship between two variables, however, and we are often worried about a more complicated situation – one in which a given regressor is correlated with a combination of several, or even all, the other regressors in a model. (Note that this multicollinearity can exist even if there are no striking bivariate relationships between regressors.) Multicollinearity is perhaps most easily depicted as a regression model in which one X is regressed on all others. That is, for the regression model, Dec 2015 (44) Comm783 Notes, Stuart Soroka, University of Michigan Yi = pg 2 !1 + ⇥1 X1i + ⇥2 X2i + ⇥3 X3i + ⇥4 X4i + ⇤i we might be concerned that the following regression produces strong results: (45) X1i = + ⇥2 X2i + ⇥3 X3i + ⇥4 X4i + ⇤i If X1 is well predicted by X2 through X4, it will be very difficult to identify the slope (and error) for X1 from the set of other slopes (and errors). (The slopes and errors for the other slopes may be affected as well.) Variance inflation factors are one measure that can be used to detect multicollinearity. Essentially, VIFs are a scaled version of the multiple correlation coefficient between variable j and the rest of the independent variables. Specifically, (46) V IFj = 1 1 Rj2 where R2j would be based on results from a model as in Equation 45. If R2j equals zero (i.e., no correlation between Xj and the remaining independent variables), then VIFj equals 1. This is the minimum value. As R2j increases, however, the denominator of Equation 46 decreases, and the estimated VIF rises as a consequence. A value greater than 10 represents a pretty big multicollinearity problem. VIFs tell us how much the variance of the estimated regression coefficient is 'inflated' by the existence of correlation among the predictor variables in the model. The square root of the VIF actually tells us how much the standard error is inflated. This table, drawn from the Sage volume by Fox, shows the relationship between a given R2j, the VIF, and the estimated amount by which the standard error of Xj is inflated by multicollinearity. Coefficient Variance Inflation as a Function of Inter-Regressor Multiple Correlation VIF R 2j € 0 0.2 0.4 0.6 0.8 0.9 0.99 (impact on 1 1.04 1.19 1.56 2.78 5.26 50.3 € € 1 1.02 1.09 1.25 1.67 2.29 7.09 SE β j ) Dec 2015 Comm783 Notes, Stuart Soroka, University of Michigan pg 2 !2 Ways of dealing with multicollinearity include (a) dropping variables, (b) combining multiple collinear variables into a single measure, and/or (c) if collinearity is only moderate, and all variables are of substantive importance to the model, simply interpreting coefficients and standard errors taking into account the effects of multicollinearity. Heteroskedasticity Heteroskedasticity refers to unequal variance in the regression errors. 
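To make that definition concrete, here is a brief simulation sketch (Python, assuming numpy; the data and parameter values are invented for illustration) in which the standard deviation of the error grows with X, so that the residuals from an ordinary OLS fit fan out as X increases.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500
x = rng.uniform(0, 10, size=n)

# The error standard deviation grows with x: unequal error variance by construction.
e = rng.normal(scale=0.5 + 0.4 * x)
y = 1.0 + 0.5 * x + e

# Fit the usual OLS line and compare the residual spread at low versus high x.
X = np.column_stack([np.ones(n), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b

print(resid[x < 5].std(), resid[x >= 5].std())   # much more spread at larger values of x
```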
Note that there can be heteroskedasticity relating to the effect of individual independent variables, and also heteroskedasticity related to the combined effect of all independent variables. (In addition, there can be heteroskedasticity in terms of unequal variance over time.) The following figure portrays the standard case of heteroskedasticity, where the variance in Y (and thus the regression error as well) is systematically related to values of X. ! The difficulty here is that the error of the slope will be poorly estimated – it will over-estimate the error at small values of X, and under-estimate the error at large values of X. Diagnosing heteroskedasticity is often easiest by looking at a plot of errors ( εi ) by values of the dependent variable ( Yi ). Basically, we begin with the standard bivariate model of Yi , (47) Yi = α + βX i + εi , € € and then€plot the resulting values of εi by Yi . If we did so for the data in the € preceding figure, then the resulting residuals plot would look as follows: € € Dec 2015 Comm783 Notes, Stuart Soroka, University of Michigan pg 2 !3 ! As Yi increases here, so too does the variance in εi . There are of course other possible (heteroskedastic) relationships between Yi and εi, for instance, € € € € ! where variance in much greater in the middle. Any version of heteroskedasticity presents problems for OLS models. When the sample size is relatively small, these diagnostic graphs are probably the best means of identifying heteroskedasticity. When the sample size is large, there are too many dots on the graph to distinguish what’s going on. There are several tests for heteroskedasticity, however. The Breusch-Pagan test tests for a relationship between the error and the independent variables. It starts with a standard multivariate regression model, (48) Yi = + ⇥1 X1i + ⇥2 X2i + ... + ⇥k Xki + ⇤i , and then substitutes the estimated errors, squared, for the dependent variable, (49) ⇤2i = + ⇥1 x1i + ⇥2 x2i + ... + +⇥k xki + ⌅i . We then use a standard F-test to test the joint significance of coefficients in Equation 49. If they are significant, there is some kind of systematic relationship between the independent variables and the error. Dec 2015 Comm783 Notes, Stuart Soroka, University of Michigan pg 2 !4 Outliers Recall that OLS regression pays particularly close attention to avoiding large errors. It follows that outliers – cases that are unusual – can have a particularly large effect on an estimated regression slope. Consider the following two possibilities, where a single outlier has a huge effect on the estimated slope: ! Hat values (hi) are the common measure of leverage in a regression. It is possible to express the fitted values of (50) in terms of the observed values Yˆj = h1j Y1 + h2 Y2 + ... + hnj Yn = : n Hij Yi . i=1 The coefficient, or weight, hij captures the contribution of each observation the fitted value to . Outlying cases can usually not be discovered by looking at residuals – OLS estimation tries, after all, to minimize the error for high-leverage cases. In fact, the variance in residuals is in part a function of leverage, (51) V (Ei ) = 2 (1 hi ) . The greater the hat value in Equation 51, the lower the variance. How can we identify high-leverage cases? Sometimes, simply plotting data can be very helpful. Also, we can look closely at residuals. 
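Computing the hat values themselves is also straightforward: the weights in Equation 50 are the entries of the hat matrix H = X(X'X)^-1 X', and the hat values are its diagonal. Here is a minimal sketch (Python, assuming numpy) with one deliberately extreme case.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20
x = rng.normal(size=n)
x[0] = 8.0                              # one case far out on x: high leverage

X = np.column_stack([np.ones(n), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T    # the hat matrix: yhat = H y
hat = np.diag(H)

print(hat[0], hat[1:].mean())   # the extreme case has a far larger hat value than the rest
print(hat.sum())                # hat values sum to the number of estimated parameters (2 here)
```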
Start with the model for standardized residuals, as follows, (52) Ei = E ⇥ i , SE 1 hi Dec 2015 Comm783 Notes, Stuart Soroka, University of Michigan pg 2 !5 which simply expresses each residual as a number (or increment) of standard deviations in Ei. The problem with Equation 52 is that case i is included in the estimation of the variance; what we really want is a sense for how i looks in relation to the variance in all other cases. This is a studentized residual, (53) Ei⇥ = SE( Ei ⇥ 1) 1 hi . and it provides a good indication of just how far ‘out’ a given case is in relation to all other cases. (To test significance, the statistic follows a t-distribution with N-K-2 degrees of freedom.) Note that you can estimate studentized residuals in a quite different way (though with the same results). Start by defining a variable D, equal to 1 for case i and equal to 0 for all other cases. Now, for a multivariate regression model as follows: (54) Yi = + ⇥1 X1 + ⇥2 X2 + ... + ⇥k Xk + ⇤i . add variable D and estimate, (55) Yi = + ⇥1 X1 + ⇥2 X2 + ... + ⇥k Xk + ⇤Di + ⌅i . This is referred to as a mean-shift outlier model, and the t-statistic for γ provides a test equivalent to the studentized residual. What do we do if we have outliers? That depends. If there are reasons to believe € the case is abnormal, then sometimes it’s best just to drop it from the dataset. If you believe the case is ‘correct’, or justifiable, however, in spite of the fact that it’s an outlier, then you may choose to keep it in the model. At a minimum, you will want to test your model with and without this outlier, to explore the extent to which you results are driven by a single case (or, in case of several outliers, a small number of cases). Models for dichotomous data Linear Probability Models Let’s begin with a simple definition of our binary dependent variable. We have variable, Yi, which only takes on the values 0 or 1. We want to predict when Yi is equal to 0, or 1; put differently, we want to know for each individual case i the probability that Yi is equal to 1, given Xi. More formally, Dec 2015 Comm783 Notes, Stuart Soroka, University of Michigan (56) E(Yi ) = P r(Yi = 1|Xi ) , pg 2 !6 which states that the expected value of Yi is equal to the probability that Yi is equal to one, given Xi. Now, a linear probability model simply estimates Pr(Yi = 1) in same way as we would estimate an interval-level Yi: (57) P r(Yi = 1) = + ⇥Xi . € There are two difficulties with this kind of model. First, while the estimated slope coefficients are good, the standard errors are incorrect due to heteroskedasticity (errors increase in the middle range, first negative, then positive). Graphing the data with a regular linear regression line, for instance, would look something like this: ! The second problem with the linear probability model is that it will generate predictions that are greater than 1 and/or less than 0 (as shown in the preceding figure) even though these are nonsensical where probabilities are concerned. As a consequence, it is desirable to try and transform either the LHS or RHS of the model so predictions are both realistic and efficient. Nonlinear Probability Model: Logistic Regression One option is to transform Yi, to develop a nonlinear probability model. 
To extend the range beyond 0 to 1, we first transform the probability into the odds… (58) P r(Yi = 1|Xi ) P r(Yi = 1|Xi ) = , P r(Yi = 0|Xi ) 1 P r(Yi = 1|Xi ) Dec 2015 Comm783 Notes, Stuart Soroka, University of Michigan pg 2 !7 which indicate how often something happens relative to how often it does not, and range from 0 to infinity as Xi approaches 1. We then take the log of this to get, (59) ln( P r(Yi = 1|Xi ) ), 1 P r(Yi = 1|Xi ) or more simply, (60) ln( pi 1 pi ), where, (61) pi = Yi = 1|Xi . Modeling what we’ve seen in equation 60 then captures the log odds that something will happen. By taking the log, we’ve effectively stretched out the ends of the 0 to 1 range, and consequently have a comparatively unconstrained dependent variable that can be used without difficulty in an OLS regression, where (62) ln( pi 1 pi ) = Xi . Just to make clear the effects of our transformation, here’s what taking the log odds of a simple probability looks like: Probability Odds 0.01 0.05 0.1 0.3 1/99=.0101 0.5 0.7 0.9 0.95 5/5=1 0.99 5/95=.0526 1/9=.1111 3/7=.4286 Logit -4.6 -2.94 -2.2 -0.85 95/5=19 0 0.85 2.2 2.94 99/1=99 4.6 7/3=2.3333 9/1=9 Note that there is another way of representing a logit model, essentially the inverse (un-logging of both sides) of Equation 72: Dec 2015 (63) Comm783 Notes, Stuart Soroka, University of Michigan P r(Yi = 1|Xi ) = pg 2 !8 exp X1 . 1 + exp Xi Just to be clear, we can work our way backwards from equation Equation 73 to Equation 72 as follows: (64) exp X1 P r(Yi = 1|Xi ) = , and 1 + exp Xi P r(Yi = 0|Xi ) = 1 1 + exp exp Xi 1 exp Xi . X or So, (65) p1 p1 = = p0 1 p1 exp Xi 1+exp Xi 1 1+exp Xi = exp xi , 1 and, (66) pi 1 pi = exp Xi , which when logging both sides becomes, (67) ln( pi 1 pi ) = Xi . The notation in Equation 72 is perhaps the most useful in connecting logistic with probit and other non-linear estimations for binary data. The logit transformation is just one possible transformation that effectively maps the linear prediction into the 0 to 1 interval – allowing us to retain the fundamentally linear structure of the model while at the same time avoiding the contradiction of probabilities below 0 or above 1. Many cumulative density functions (CDFs) will meet this requirement. (Note that CDFs define the probability mass to the left of a given value of X; they are of course related – in that they are slight adjustment of – PDFs, which are dealt with in more detail in the section on significance tests.) Equation 73 is in contrast useful for thinking about the logit model as just one example of transformations in which Pr(Yi=1) is a function of a non-linear transformation of the RHS variables, based on any number of CDFs. A more general version of Equation 73 is, then, Dec 2015 Comm783 Notes, Stuart Soroka, University of Michigan pg 2 !9 (68) P r(Yi = 1|Xi ) = F ( Xi ). where F is the logistic CDF for the logit model, as follows, (69) P r(Yi = 1|Xi ) = F ( Xi ), where F = 1 1 + exp (x µ)/s , but could just as easily be the normal CDF for the probit model, or a variety of other CDFs. How do we know which CDF to use? The CDF we choose should reflect our beliefs about the distribution of Yi, or, alternatively (and equivalently) the distribution of error in Yi. We discuss this more below. 
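First, though, a short sketch can tie the pieces above together. The code below (Python, assuming numpy and scipy) reproduces the probability, odds and logit table above, confirms that the inverse transformation in Equation 63 takes the log odds back to the original probabilities, and shows how close the logistic curve is to the (rescaled) normal CDF used by probit.

```python
import numpy as np
from scipy.stats import norm

p = np.array([0.01, 0.05, 0.1, 0.3, 0.5, 0.7, 0.9, 0.95, 0.99])

odds = p / (1 - p)      # Equation 58: how often it happens relative to how often it does not
logit = np.log(odds)    # Equations 59-60: the log odds, unbounded in both directions

# The inverse map (Equation 63) takes the unconstrained log odds back into the 0-1 range.
p_back = np.exp(logit) / (1 + np.exp(logit))

for row in zip(p, odds.round(4), logit.round(2), p_back.round(2)):
    print(row)

# Probit swaps in the normal CDF; rescaling the logit by roughly 1.7 shows
# how similar the two curves are over this range.
print(np.abs(norm.cdf(logit / 1.7) - p).max())
```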
An Alternative Description: The Latent Variable Model Another way to draw the link between logistic and regular regression is through the latent variable model, which posits that there is an unobserved, latent variable Yi*, where (70) Yi = Xi + ⇥i , €and the link between the observed binary Yi and the latent Yi* is as follows: (71) Yi = 1 if Yi > 0 , and (72) Yi = 0 if Yi 0 . € € Using this example, the relationship between the observed binary Yi and the latent Yi can be graphed as follows: ! Dec 2015 Comm783 Notes, Stuart Soroka, University of Michigan pg 3 !0 So, at any given value of Xi there is a given probability that Yi is greater than zero. This figure also shows how our beliefs about the distribution of error ( εi ) are fundamental – there is a distribution of possible outcomes in Yi* when, in this figure, Xi=4. For a probit model, we assume that Var(εi ) = 1 ; for a logit model, € we assume that Var(εi ) = π 2 /3 . Other CDFs make other assumptions. € The distribution of error ( εi ) at any given € value of Xi is related to a non-linear increase € in the probability that Yi=1. Indeed, we can show this non-linear shift first by plotting a distribution of εi at each value of Xi, € € ! and then by looking at how the movement of this distribution across the zero line shifts the probability that Yi=1: ! As the thick part of the distribution moves across the zero line, the probability increases dramatically. Dec 2015 Comm783 Notes, Stuart Soroka, University of Michigan pg 3 !1 Nonlinear Probability Model: Probit Regression As noted above, probit models are based on the same logic as logistic models. Again, they can be thought of as a non-linear transformation of the LHS or RHS variables. The only difference for probit models is that rather than assume a logistic distribution, we assume a normal one. In equation 68, then, F would now be the cumulative density function for a normal distribution. Why assume a normal distribution? The critical question is why assume a logistic one? We typically assume a logistic distribution because it is very close to normal, and estimating a logistic model is computationally much easier than estimating probit model. We now have faster computers, so there is now less reason to rely on logit rather than probit models. That said, logit has some advantages where teaching is concerned. Compared to probit, it’s very simple. Maximum Likelihood Estimation Models for categorical variables are not estimated using OLS, but using maximum likelihood. ML estimates are the values of the parameters that have the greatest likelihood (that is, the maximum likelihood) of generating the observed sample of data if the assumptions of the model are true. For a simple model like Yi = α + βX i , an ML estimation looks at many different possible values of and , and finds the combination which is most likely to generating the observed values of Yi. € ! Take, for instance, the above graph, which shows the observed values of Yi on the bottom axis. There are two different probability distributions, one produced by one set of parameters, A, and one produced by another set of parameters, B. Dec 2015 Comm783 Notes, Stuart Soroka, University of Michigan pg 3 !2 MLE asks which distribution seems more likely to have produced the observed data. Here, it looks like the B parameters have an estimated distribution more likely to produce the observed data. Alternatively, consider the following. 
If we are interested in the probability that Yi=1, given a certain set of parameters (p), then an ML estimation is interested in the likelihood of p given the observed data (73) L(p|Yi ) . This is a likelihood function. Finding the best set of parameters is an iterative process, which starts somewhere and starts searching; different ‘optimization algorithms’ may start in slightly different places, and conduct the search differently; all base their decision about searching for parameters on the rate of improvement in the model. (The way in which model fit is judged is addressed below.) Note that our being vague about ‘parameters’ here is purposeful. As analysts, the parameters we are thinking about are the coefficients for the various independent variables ( βX ). The parameters critical to the ML estimation, however, are those that define the shape of the distribution; for a normal distribution, for instance, these are the mean ( µ ) and variance ( σ ) (see Equation € 24). Every set of parameters, βX , however, produces a given estimated normal distribution of Yi with mean µ and variance σ ; the ML estimation tries to find € € the βX producing the distribution most likely to have generated our observed € data. € € € Not also that while we speak about ML estimations maximizing the likelihood equation, in practice programs maximize the log of the likelihood, which simplifies computations considerably (and gets the same results). Because the likelihood is always between 0 and 1, the log likelihood is always negative… Models for Categorical Data Ordinal Outcomes For models where the dependent variable is categorical, but ordered, ordered logit is the most appropriate modelling strategy. A typical description begins with a latent variable Yi*which is a function of * (74) Yi = βX i + εi , € € Dec 2015 Comm783 Notes, Stuart Soroka, University of Michigan pg 3 !3 and a link between an observed binary Yi and a latent Yi* as follows: (75) Yi = 1 if Yi* ≤ δ1 , and € , and Yi = 2 if δ1 ≤ Yi* ≤ δ2 € € € €Y = 3 if Y * ≥ δ i i 2 € , € δ1 and δ2 are unknown parameters to be estimated along with the β in where equation X. We can restate the model, then, as follows: € (76) Pr(Yi , and = 1 x) = Pr(βX i + εi ≤ δ1 ) = Pr(εi ≤ δ1 − βX i ) Pr(Yi = 2 X i ) = Pr(δ1 ≤ βx + εi ≤ δ2 ) = Pr(δ1 − βX i < εi ≤ δ2 − βX i ) , and € € € . Pr(Yi = 3 X i ) = Pr(βX i + εi ≥ δ2 ) = Pr(εi ≥ δ2 − βX i ) The last statement of each line here makes clear the importance that the distribution of error plays in the estimation: the probability of a given outcome can be expressed as the probability that the error is – in the first line, for instance – smaller than the difference between theta and the estimated value. This set of statements can also be expressed as follows, adding ‘hats’ to denote estimated values, substituting predicted Yˆ for βX , and inserting a given cumulative distribution function, F, from which we derive our probability estimates: (77) € pˆ i1 = Pr(εi ≤ δˆ1 − Yˆi ) =€F(δˆ1 € − Yˆi ) , and pˆ i2 = Pr(δˆ1 − Yˆi < εi ≤ δˆ2 − Yˆi ) = F(δˆ2 − Yˆi ) − F(δˆ1 − Yˆ1 ) , and pˆ i3 = Pr(εi ≥ δˆ2 − Yˆi ) = 1− F(δˆ2 − Yˆi ) , € Where F can again be the logistic CDF (for ordered logit), but also the normal € CDF (for ordered probit), and so on. Again, using the logistic version as the example is far easier, and we can express the whole system in another way, as follows: p1 p + p2 p + p2 + ...+ pk ) = βX , ln( 1 ) = βX , ln( 1 ) = βX , 1− p1 1− p1 − p2 1− p1 − p2 − ...− pk (78) ln( where € . 
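To see Equation 77 in action, here is a minimal sketch (Python, assuming numpy; the two cutpoints and the slope are made-up values, purely for illustration) that computes the three category probabilities across a range of values of the linear predictor and confirms that they sum to one.

```python
import numpy as np

def logistic_cdf(z):
    """F(z) for the standard logistic distribution."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical estimates: two cutpoints and a single slope.
delta1, delta2 = -1.0, 1.5
beta = 0.8

x = np.linspace(-4, 4, 9)
xb = beta * x

# Equation 77: category probabilities as differences of the CDF evaluated at the cutpoints.
p1 = logistic_cdf(delta1 - xb)
p2 = logistic_cdf(delta2 - xb) - logistic_cdf(delta1 - xb)
p3 = 1.0 - logistic_cdf(delta2 - xb)

print(np.column_stack([x, p1, p2, p3]).round(3))
print(np.allclose(p1 + p2 + p3, 1.0))   # the three probabilities sum to one at every x
```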
€ € Note that these models rest on the parallel slopes assumption: the slope coefficients do not vary between different categories of the dependent variable Dec 2015 Comm783 Notes, Stuart Soroka, University of Michigan pg 3 !4 (i.e., from the first to second category, the second to third category, and so on). If this assumption is unreasonable, a multinomial model is more appropriate. (In fact, this assumption can be tested by fitting a multinomial model and examining differences and similarities in coefficients across categories.) And now, when we talk about odds ratios, we are talking about a shift in the odds of falling into a given category (m), (79) OR(m) = Pr(Yi ≤ m) . Pr(Yi < m) Nominal Outcomes € Multinomial logit is essentially a series of logit regressions examining the probability that Yi = m rather than Yi = k, where k is a reference category. This means that one category of the dependent variable is set aside as the reference category, and all models show the probability of Yi being one outcome rather than outcome k. Say, for instance, there are four outcomes k, m, n, and q, where k is the reference category. The models estimated are: (80) ln( € These models explore the variables that distinguish each of m, n, and q from k. Any category can be the base category, of course. It may be that it is € Pr(Yi = k) Pr(Yi = m) Pr(Yi = n) ) = β k X , ln( ) = β m xX , ln( ) = βn X Pr(Yi = q) Pr(Yi = q) Pr(Yi = q) Results for multinomial logit models aren’t expressed as odds ratios, since odds ratios refer to the probability of an outcome divided by 1. Rather, multinomial € Pr(Yi = m) ) = βm X , Pr(Yi = k) the risk ratio is, € € results are expressed as a risk-ratio, or relative risk, which is easily calculated by taking the exponential of the log risk-ratio. Where, the log risk-ratio is (82) ln( € € additionally interesting to see how q is distinguished from the other categories, in which case the following models can be estimated: (81) ln( € Pr(Yi = m) Pr(Yi = n) Pr(Yi = q) ) = β m X , ln( ) = β n X , ln( ) = βq X Pr(Yi = k) Pr(Yi = k) Pr(Yi = k) (83) Pr(Yi = m) = exp(β m X) . Pr(Yi = k) Dec 2015 Comm783 Notes, Stuart Soroka, University of Michigan pg 3 !5 The estimation of a multinomial logit model requires that the covariance between the error terms (relating each alternative) is zero. As a result, a critical assumption of the multinomial logit model is IIA, or the independence of irrelevant alternatives, which is as follows: if a chooser is comparing two alternatives according to a preference relationship, the ordinal ranking of these alternatives should not be affected by the addition or subtraction of other alternatives from the choice set. This is not always a reasonable assumption, however; when IIA is not a reasonable assumption, other multinomial models should be used, such as multinomial probit. Dec 2015 Comm783 Notes, Stuart Soroka, University of Michigan pg 3 !6 Appendix A: Significance Tests Having already described regression analysis for continuous, binary, ordinal and categorical variables, this appendix provides some background: some basic information on distributions, probability theory, some of the standard tests of statistical significance. Distribution Functions We have thus far assumed a familiarity with distributions, probability functions, and so on. Before we talk about significance tests, however, it may be worth reviewing some of that material. 
Distributional characteristics can be important to understanding social phenomena; understanding distributions is also important to understanding hypothesis testing. So, let’s begin with a general probability mass function, (84) € f (x) = p(X = x) , which assigns a probability for the random variable X equaling the specific numerical outcome x. This is a general statement, which can then take on different forms based on, for instance, levels of measurement (e.g., binary, multinomial, or discrete). PMFs are useful when are dealing with discrete variables. If we are dealing with continuous variables, however, defining a given x becomes much less attractive (indeed – impossible if you consider an infinite number of decimal places). We accordingly replace the PMF with a PDF – a probability density function. There is a wide range of possible PDFs, of course. The exponential PDF, for instance, is as follows: (85) € f (x | β ) = 1 exp[−x / β ] , β 0 ≤ x < ∞ , 0 < α, β . Note that the exact form the function takes will vary based on a single € Because €the spread of the distribution is affected by parameter, β the mean. values of β , it is referred to as the scale parameter. € € Dec 2015 Comm783 Notes, Stuart Soroka, University of Michigan pg 3 !7 ! Other distributions are more flexible. The gamma PDF includes an additional shape parameter, α , which affects the ‘peakedness’ of the distribution. (86) € 1 f (x | α, β ) = x α −1 exp[−x / β ] , α Γ(α )β € € 0 ≤ x < ∞ , 0 < α, β , € Dec 2015 Comm783 Notes, Stuart Soroka, University of Michigan pg 3 !8 where the mean is now αβ . 1 An important special case of the gamma PDF is the chi-square ( χ 2 ) distribution – a gamma where α = df /2 and β = 2, and df is a positive integer value called the degrees of freedom. € The Normal, or Gaussian, PDF is as follows, € (87) € f (x | µ,σ 2 ) = 1 2πσ 2 e − 1 2σ 2 (x− µ ) € 2 , −∞ < x , µ < ∞ , 0 < σ 2 . Here, the critical values are µ and σ 2 , where the former defines the mean and € € € the latter is the variance, and defines the dispersion. The normal distribution is just one of many location-scale distributions – referred to as such because one € € € only the location, and another parameter, σ 2, changes parameter, µ , moves only the scale. When µ = 0 and σ 2 = 1, we have what is called a standard € normal distribution, € which plays an important role in statistical theory and practice. The PDF for this distribution is a much-simplified version of Equation X, € € (88) € f (x) = 1 2πσ 2 e − 1 2σ 2 x2 , −∞ < x < ∞ . Note that simple mathematical transformations can covert any normal € distribution into its standard form; probabilities can always be calculated, then, using this standard normal distribution. There are many different PDFs, and finding the one that characterizes the distribution of a given variable can tell us much about the data-generating process. We can look at the degree to which budgetary data reflects incremental change, for instance (see work by Baumgartner and Jones). We also make assumptions about the distribution of our variables in the population when we select an estimation method. OLS assumes our variables are normally distributed; logit assumes the distribution of error in our latent, non-linear variable follows the logistic PDF; probit assumes the distribution of error matches the normal distribution. Distributions – or, at least, assumptions about distributions – thus play a critical role in selecting an estimator, as well as in tests of statistical significance. 
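As a small illustration, here is a sketch (Python, assuming numpy and scipy) that evaluates the exponential and normal PDFs in Equations 85 and 87 directly, checks the results against scipy's implementations, and confirms that any normal density can be recovered from the standard normal by rescaling.

```python
import numpy as np
from scipy.stats import expon, norm

x = np.linspace(0.5, 5.0, 10)

# Equation 85: the exponential PDF with scale parameter beta.
beta = 2.0
f_exp = (1.0 / beta) * np.exp(-x / beta)
print(np.allclose(f_exp, expon.pdf(x, scale=beta)))

# Equation 87: the normal PDF with mean mu and variance sigma squared.
mu, sigma = 1.0, 2.0
f_norm = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
print(np.allclose(f_norm, norm.pdf(x, loc=mu, scale=sigma)))

# Location-scale at work: standardize x, evaluate the standard normal, rescale the density.
print(np.allclose(f_norm, norm.pdf((x - mu) / sigma) / sigma))
```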
here is an extension of the factorial function, so that it is defined for more than just non-negative integers. The factorial function of an integer n is written n!, and is equal to . 1 Dec 2015 Comm783 Notes, Stuart Soroka, University of Michigan pg 3 !9 The chi-square test Distributions are also critical to hypothesis testing. The standard chi-square ( χ 2 ) test is based on a chi-square distribution. To construct a χ 2 variable, begin with a normally distributed population of observations, having mean µY and standard € deviation σ Y2 . Then take a random sample of k cases, transform every € observation into standardized (Z-score) form, and square it. That is, do this, € 2 € Zi (89) = (Yi − µY ) σ Y2 2 , and then this, k (90) Q = € ∑Z 2 i , i=1 Q is distributed as chi-square with k degrees of freedom: € (91) Q ~ χ k2 . The chi-square distribution is, in short, the distribution of a sum of squared € normally distributed cases. The PDF for a chi-square distribution is defined exclusively by a single degrees of freedom value (above, k). It looks like this, (92) f (x | k) = (1/2) k / 2 k / 2−1 −x / β x e . Γ(k /2) When the number of cases (k) is very small, a chi square distribution is very € skewed. Most of the cases in a normal distribution lie between -1 and +1, after all, so we shouldn’t expect a chi-square value of much more than 1, and 0 (the mean Z-score in a normal distribution) is much more likely. As we add additional cases, however, the mean for a chi-square distribution increases – indeed, the mean will be equivalent to the number of cases (k), or rather, the degrees of freedom (df). And the variance of the chi-square is simple too – it’s 2 × df . When we use a chi-square statistic to look at the relationship in a crosstabulation of two categorical variables, you’ll recall, the df is (R-1)(C-1), where R € is the number of rows and C is the number of columns. And recall that when we are trying to exceed a given chi-squared value (to refute the null hypothesis of no relationship), what we are trying to find is a value that is far enough along the tail of a (chi-square) distribution to be clearly different from the mean (which we now know to be equal to the df). That we want the statistic to exceed a given Dec 2015 Comm783 Notes, Stuart Soroka, University of Michigan pg 4 !0 value, based on a desired level of statistical significance, is of course not exclusive to chi-square tests; the same idea is discussed further below. The t test When we talk about a t-test, we are interested in a test of the null hypothesis that a given coefficient is not different than zero, (93) H 0 :βj = 0 , against an alternative, that is, € (94) H1 : β j ≠0 . The rejection rule for H0 is as follows: € (95) t βˆ > c , j where c is a chosen critical value, and where € (96) t βˆ j = βˆ j ⌢ , se(β j ) which is of course just the ratio of the coefficient to its standard error. Now, c € requires some description. For a standard, two-tailed (where we allow for the possibility that β is either positive or negative) t-test, c is chosen to make the area in each tail of the distribution equal to 2.5% - that is, it is chosen to find a middle range that equals 95% of the distribution. The value of c is then based € t distribution, described above as a special case of the gamma on the distribution. The PDF for a t distribution is as follows, (97) f (x | v) = Γ(v + 1) /2 (1+ x 2 /v)−(v +1)/ 2 , vπ Γ(v /2) where v=n-1. € The shape of a t distribution varies with sample size and sample standard deviations. 
The shape of a t distribution varies with sample size and sample standard deviation. Like normal distributions, all t distributions are symmetrical, bell-shaped, and have a mean of zero. While normal distributions are based on a population variance, however, t distributions are based on a sample variance – useful, since in most cases we do not know the population variance. t distributions thus differ from normal distributions in at least one important way: a t distribution has a larger variance than a normal distribution, since we believe the sample variance underestimates the population variance. This difference between the two distributions narrows as N increases, of course. And the bar for achieving statistical significance is thus a little higher under a t distribution.

The F Test

The F test was designed to make inferences about the equality of the variances of two populations. It is based on an F distribution, which requires that random, independent samples be drawn from two normal populations that have the same variance (i.e., σ²Y1 = σ²Y2). An F ratio is then formed as the ratio of two chi-squares, each divided by its degrees of freedom:

(98) F = \frac{\chi_{v_1}^2 / v_1}{\chi_{v_2}^2 / v_2} .

This ratio, it turns out, is distributed as an F random variable with two different degrees of freedom. That is:

(99) F \sim F_{v_1, v_2} ,

where the F distribution itself is defined by two parameters – the two separate degrees of freedom. It is non-symmetric and ranges across the nonnegative numbers. Its shape depends on the degrees of freedom associated with both the numerator and the denominator.

In regression analysis, the F test is a frequently used joint hypothesis test. It appears in the top right of all regression results in STATA, for instance, and is used there to test the null hypothesis that all coefficients in the model are not different than zero. That is, in the following regression,

(100) Y_i = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \varepsilon ,

an F test is used to test the possibility that

(101) H_0 : \beta_1 = 0,\ \beta_2 = 0,\ \beta_3 = 0 .

Of course, the test could also deal with a subset of these coefficients. We use this example here simply because it speaks directly to the F test results in STATA. Testing this joint hypothesis is different than looking at the t-statistics for each individual coefficient, since any particular t statistic tests a hypothesis that puts no restrictions on the other parameters. Here, we can produce a single test of these joint restrictions.

More general versions of Equations 100 and 101 are as follows. Begin with an unrestricted model,

(102) Y_i = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \dots + \beta_k X_k + \varepsilon ,

where the number of parameters in the model is k+1 (because we include the intercept). Suppose then that we have q exclusion restrictions to test: that is, the null hypothesis states that q of the variables have zero coefficients. If we assume that it is the last q variables that we are interested in (order doesn't matter for estimation, but makes the following statement easier), then the general hypothesis is as follows,

(103) H_0 : \beta_{k-q+1} = 0, \dots, \beta_k = 0 .

And Equation 102, with these restrictions imposed, now looks like this:

(104) Y_i = \alpha + \beta_1 X_1 + \dots + \beta_{k-q} X_{k-q} + \varepsilon .

In short, it is the model with the last q coefficients excluded – effectively, then, restricting those last (now non-existent) coefficients to zero. What we want to know now is whether there is a significant difference between the restricted and unrestricted equations. If there is a difference, then the last q coefficients matter; if there is no difference, then the null hypothesis that all these q coefficients are not different than zero is supported. To test this null hypothesis, we use the following F test:

(105) F \equiv \frac{(ESS_r - ESS_{ur})/q}{ESS_{ur}/(n-k-1)} ,

where ESSr is the sum of squared residuals from the restricted model and ESSur is the sum of squared residuals from the unrestricted model. Both (ESSr − ESSur) and ESSur are assumed to be distributed as chi-square, and the degrees of freedom are q and n−k−1 respectively. Again, in order to reject H0, we must find a value for F that, based on the PDF for an F distribution defined by those two degrees of freedom, is far enough from the distribution's mean – which, as it happens, will be v2/(v2−2), or, in this case, (n−k−1)/(n−k−3).
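As an illustration only (not from the notes), the following sketch, again in Python with numpy and scipy, simulates a small made-up dataset, fits the restricted and unrestricted models by OLS, and forms the F statistic in Equation 105.

    # The joint F test of Equation 105, on simulated (made-up) data.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n, k, q = 200, 3, 2                      # n cases, k predictors, q exclusion restrictions
    X = rng.normal(size=(n, k))
    y = 1.0 + 0.5 * X[:, 0] + rng.normal(size=n)    # only the first predictor truly matters

    def ess(y, X):
        """Sum of squared residuals from an OLS fit (with an intercept)."""
        Xc = np.column_stack([np.ones(len(y)), X])
        resid = y - Xc @ np.linalg.lstsq(Xc, y, rcond=None)[0]
        return np.sum(resid**2)

    ess_ur = ess(y, X)                       # unrestricted: all k predictors
    ess_r = ess(y, X[:, :k - q])             # restricted: last q predictors dropped

    F = ((ess_r - ess_ur) / q) / (ess_ur / (n - k - 1))     # Equation 105
    p = stats.f.sf(F, q, n - k - 1)
    print(round(F, 2), round(p, 3))          # fail to reject H0 when p exceeds .05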
Appendix B: Factor Analysis

Background: Correlations and Factor Analysis

Factor analysis is a data reduction technique, used to explain variability among observed random variables in terms of fewer unobserved random variables called factors. A factor analysis usually begins with a correlation matrix. Take, for instance, the following 5 x 5 correlation matrix, R, for variables a through e (in rows j and columns k):

            a      b      c      d      e
      a   1.00    .72    .63    .54    .45
      b    .72   1.00    .56    .48    .40
      c    .63    .56   1.00    .42    .35
      d    .54    .48    .42   1.00    .30
      e    .45    .40    .35    .30   1.00

This matrix is consistent with there being a single common factor, g, whose correlations with the 5 observed variables are respectively .9, .8, .7, .6, and .5. That is, if there were a variable g, correlated with a at .9, b at .8, and so on, we'd get a correlation matrix exactly as above. How can we tell? First, note that the correlation between a and b, where both are correlated with g, is going to be the product of the correlation between a and g and the correlation between b and g. More precisely,

(106) r_{ab} = r_{ag} \times r_{bg} .

For the cell in row 1, column 2 of matrix R, then, we have the correlation between a and g (.9) and the correlation between b and g (.8), and the resulting correlation between a and b: .9 × .8 = .72. This works across the board for the above matrix – each off-diagonal cell is the product of the correlation between g and the variable in row j, and the correlation between g and the variable in column k.

This is the kind of pattern we're interested in finding in a factor analysis – a latent variable that captures the variance (and covariance) amongst a set of measured variables. Just as in a regression, where we partition the total variance (TSS) into the explained (RSS) and error (ESS) variances, we can think of a factor analysis as decomposing a correlation matrix R into a common portion C and an unexplained portion U, where

(107) R = C + U ,  or  U = R - C ,

and so on. In fact, there can be several common portions – one for each of several latent factors which might account for common variance. So a more general model could be

(108) R = \sum C_q + U ,

where there are q common factors. In matrix R above, there is clearly just one common component.
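Equation 106 is easy to check numerically. The following is a quick sketch, not part of the notes, assuming Python with numpy: it builds the correlation matrix implied by loadings of .9, .8, .7, .6, and .5 on a single factor and compares it with R.

    # Off-diagonal cells of R are products of the correlations with g (Equation 106).
    import numpy as np

    g = np.array([0.9, 0.8, 0.7, 0.6, 0.5])    # correlations of a-e with the factor g
    implied = np.outer(g, g)                    # cell (j, k) = r_jg * r_kg
    np.fill_diagonal(implied, 1.0)              # each variable correlates 1.0 with itself

    R = np.array([[1.00, .72, .63, .54, .45],
                  [ .72, 1.00, .56, .48, .40],
                  [ .63, .56, 1.00, .42, .35],
                  [ .54, .48, .42, 1.00, .30],
                  [ .45, .40, .35, .30, 1.00]])

    print(np.allclose(implied, R))              # True: a single factor reproduces R exactly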
The common portion of the variance is thus captured in a single matrix, C1, as follows,

            a      b      c      d      e
      a    .81    .72    .63    .54    .45
      b    .72    .64    .56    .48    .40
      c    .63    .56    .49    .42    .35
      d    .54    .48    .42    .36    .30
      e    .45    .40    .35    .30    .25

where each cell for row j and column k is simply the product of (a) the correlation between g and the variable j, and (b) the correlation between g and the variable k. The off-diagonal entries in matrix C1 show the common variance between two variables j and k that is captured by the latent factor g. The diagonal entries show the amount of variance in a single variable explained by the latent factor g. So, latent factor g, correlated with variable a at .9, explains .81 of the variance in a. (This should make sense: recall that R² = r².) These diagonal values are sometimes referred to as communality estimates – the proportion of variance in a given y variable that is accounted for by a common factor.

The unexplained variance is then captured in matrix U, which is simply R − C1,

            a      b      c      d      e
      a    .19    .00    .00    .00    .00
      b    .00    .36    .00    .00    .00
      c    .00    .00    .51    .00    .00
      d    .00    .00    .00    .64    .00
      e    .00    .00    .00    .00    .75

Diagonal entries here show the proportion of variance that is unique to each variable – that is, the variance that is not accounted for by the latent factor g. Off-diagonal entries show the covariance between variables j and k that is not accounted for by factor g. In this case, g perfectly captures the covariance between all variables, so the off-diagonal entries are all zero. In most cases, however, a factor will not perfectly capture these covariances.

An Algebraic Description

The matrix stuff is a useful background to factor analysis, but it isn't the easiest way to understand where exactly factor analysis results come from. The standard algebraic description of factor analysis is a little more helpful there. Imagine that we have a number of measured outcome variables, y1, y2, … yn, and we believe these are systematically related to some mysterious set of latent, that is, unmeasured, factors. More systematically, our y variables are related to a set of functions,

(109) y_1 = \omega_{11} F_1 + \omega_{12} F_2 + \dots + \omega_{1m} F_m
      y_2 = \omega_{21} F_1 + \omega_{22} F_2 + \dots + \omega_{2m} F_m
      …
      y_n = \omega_{n1} F_1 + \omega_{n2} F_2 + \dots + \omega_{nm} F_m

where the F are not variables but functions of unmeasured variables, and the ω are the weights attached to each function (and where n is the number of y variables and m is the number of functions). Note that only the y variables are known – the entire RHS of each model has to be estimated. In short, the basic factor model assumes that the variables (y) are additive composites of a number of weighted (ω) factors (F). The factor loadings emerging from a factor analysis are the weights ω (sometimes referred to as the constants); the factors are the F functions. The size of each loading for each factor measures how much that specific function is related to y. And factor scores – which can be generated for every value of the y variables – are the predictions based on the results of Equation 109.²

² Note that there is no assumption here that the y variables are linearly related to each other – only that each is linearly related to the factors.

What factor analysis does is discover the unknown F functions and ω weights. In order to do so, we need to impose some assumptions – otherwise, there are an infinite number of solutions for Equation 109.
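To see how a factor model of this sort generates the kind of correlation matrix we began with, here is a brief simulation sketch, not part of the notes, assuming Python with numpy: five y variables are built as weighted composites of a single standardized factor plus a unique component for each variable, using the loadings .9 through .5 from the Background section above.

    # Simulating a one-factor version of Equation 109, with a unique term per variable.
    import numpy as np

    rng = np.random.default_rng(2)
    n = 100_000
    w = np.array([0.9, 0.8, 0.7, 0.6, 0.5])       # loadings (omega) on the common factor

    F = rng.normal(size=n)                         # one standardized common factor
    U = rng.normal(size=(5, n))                    # a unique factor for each variable
    y = w[:, None] * F + np.sqrt(1 - w[:, None]**2) * U   # y_j = w_j * F + unique part

    print(np.corrcoef(y).round(2))                 # approximately the matrix R above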
Different factor analytic procedures vary mainly in the assumptions they impose in order to estimate the various ω and F.³

³ In standard models the two critical assumptions are: (1) variables can be calculated from the factors by multiplying each factor by the appropriate weight and summing across all factors, and (2) all factors have a mean of zero and a standard deviation of one. Other assumptions can include, for instance, whether or not, or to what degree, the estimated F factors can be correlated.

To restate things: in estimating a factor analysis the y variables are defined as linear functions of the weighted F factors. Indeed, all of the y variables' statistics (mean, variance, correlations) are defined as functions of the weights and factors. So, for instance, the mean of y1 can be expressed as follows,

(110) \bar{y}_1 = \sum_{k=1}^{m} \omega_{1k} \bar{F}_k ,

where the mean of y1 is equal to the sum of each individual factor (Fk), where there are m factors, multiplied by its corresponding weight (ω1k). By defining all variables' statistics in this way, it becomes possible to estimate a set of weights and factors that account for the variances and covariances amongst the y variables.

Note that Equation 110 sets out the model for what is referred to as a full component factor model – a model in which the y variables are perfectly (that is, completely) a function of a set of latent factors. Usually, we use a common factor model – one in which there is a set of common factors, but also a degree of unique variance, or uniqueness, for each variable. (In fact, this uniqueness can be viewed as a separate unique factor for each y variable.) This common factor model can be expressed with a simple addition to Equation 109,

(111) y_n = \omega_{n1} F_1 + \omega_{n2} F_2 + \dots + \omega_{nm} F_m + \omega_{nu} U_n ,

where Un captures the unique variance in yn. Note that the main difference between the full component and the common factor approach is that the former assumes that the diagonal elements of matrix C are equal to one – that all the variance in each variable is accounted for by the F factors. The common factor model, which allows for unique variance in each variable, unaccounted for by the factors, makes no such assumption.

You will also see 'principal components' factor analysis used in the literature. This method extracts the principal factors (those 'best' capturing covariance amongst variables) from a component model. So factors are estimated until they account for all the variance of each variable – this can of course mean many factors. But only those factors which are sufficiently common are reported.

Factor Analysis Results

The results of a factor analysis generally look like this (from R.J. Rummel):

[Table from R.J. Rummel: ten national-characteristic variables in rows, four unrotated factors in columns, factor loadings in each cell, an h2 (communality) column, and percent of total variance and eigenvalues reported at the foot of the factor columns.]

where there are a number of F factors across the top, measured y variables in each row, and factor loadings (ω) in each cell. The number of factors (columns) is the number of substantively meaningful independent (uncorrelated) patterns of relationship among the variables. The loadings, ω, measure which variables are involved in which factor pattern and to what degree. The square of the loading multiplied by 100 equals the percent of variation that a variable has in common with a given latent variable.
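As a small illustration, not part of the notes (plain Python), this calculation can be done directly from a variable's row of loadings – here, the loadings reported for the 'power' variable in Rummel's table, which are discussed just below.

    # Squared loadings, communality (h2), and uniqueness for one variable.
    loadings = [0.58, -0.42, -0.42, 0.43]               # loadings on the four factors

    pct_common = [round(100 * w**2, 1) for w in loadings]   # % variance shared with each factor
    h2 = sum(w**2 for w in loadings)                        # communality
    uniqueness = 1 - h2                                     # variance unique to the variable

    print(pct_common)            # [33.6, 17.6, 17.6, 18.5]
    print(round(h2, 2))          # 0.87
    print(round(uniqueness, 2))  # 0.13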
The first factor pattern delineates the largest pattern of relationships in the data; the second delineates the next largest pattern that is independent of (uncorrelated with) the first; the third delineates the third largest pattern that is independent of the first and second; and so on. Thus the amount of variation in the data described by each pattern decreases successively with each factor: the first pattern defines the greatest amount of variation, the last pattern the least. Note that these initial, unrotated factor patterns are uncorrelated with each other.

The column headed h2 displays the communality of each variable. This is the proportion of a variable's total variation that is involved in the patterns. The coefficient (communality) shown in this column, multiplied by 100, gives the percent of variation of a variable in common with the patterns. The h2 value for a variable is calculated by summing the squares of the variable's loadings. Thus for power in the above table we have (.58)² + (−.42)² + (−.42)² + (.43)² = .87.

This communality may also be looked at as a measure of uniqueness. By subtracting the percent of variation in common with the patterns from 100, the uniqueness of a variable is determined. This indicates to what degree a variable is unrelated to the others – to what degree the data on a variable cannot be derived from (predicted from) the data on the other variables.

The ratio of the sum of the values in the h2 column to the number of variables, multiplied by 100, equals the percent of total variation in the data that is patterned. It thus measures the order, uniformity, or regularity in the data. As can be seen in the above table, for the ten national characteristics the four patterns involve 80.1 percent of the variation in the data. That is, we could reproduce 80.1 percent of the relative variation among the fourteen nations on these ten characteristics by knowing the nation scores on the four patterns.

At the foot of the factor columns in the table, the percent of total variance figures show the percent of total variation among the variables that is related to a factor pattern. This figure thus measures the relative variation among the fourteen nations in the original data matrix that can be reproduced by a pattern: it measures a pattern's comprehensiveness and strength. The percent of total variance figure for a factor is determined by summing the column of squared loadings for that factor, dividing by the number of variables, and multiplying by 100.

The eigenvalues equal the sum of the column of squared loadings for each factor. They measure the amount of variation accounted for by a pattern. Dividing the eigenvalues either by the number of variables or by the sum of h2 values, and multiplying by 100, determines the percent of either total or common variance, respectively.

Rotated Factor Analyses

The unrotated factors successively define the most general patterns of relationship in the data. Not so with the rotated factors: they delineate the distinct clusters of relationships, if such exist. The best way to understand 'rotation' is to think about a geometric representation of a factor model. Take as an example the following hypothetical results:

    Variable   Factor 1   Factor 2   Uniqueness
       1          0.7        0.7        0.98
       2         -0.3        0.2        0.13
       3          1          0          1
       4         -0.6        0.8        0.1

We can represent these results geometrically, in the following way,

[Figure: the four variables plotted against two perpendicular factor axes.]
where the two factors are axes, and the variables are located at the appropriate place (their loading) on each axis. 'Rotating' a factor analysis, then, is a process of taking the structure discovered in the unrotated analysis – the structure of axes in the diagram above, for instance – and rotating that structure, essentially via a re-estimation of factor loadings. The aim is to produce a structure that better distinguishes between the factors – more precisely, that better distinguishes variables loading on one factor from variables loading on another. Here's an example with around 20 variables plotted across two factors:

[Figure: roughly 20 variables plotted across two factor axes.]

There are two different kinds of rotations – orthogonal and oblique. The former require that the factors remain orthogonal – that is, uncorrelated – while the latter have no such restriction. Orthogonal rotation is more typical; it's the example above (where the axes remain perpendicular to each other); it's also the default in STATA. That said, for many researchers the purpose of rotating is precisely to allow for correlation amongst factors. There are also different rotation criteria. For instance: varimax rotations maximize the variance of the squared loadings within factors – that is, within the columns of the loading matrix (one column per factor F); quartimax rotations maximize the variance of the squared loadings within variables – that is, within the rows of the loading matrix (one row per variable).

Appendix B: Taking Time Seriously

Time series data usually violates one of the critical assumptions of OLS regression – that the disturbances are uncorrelated. That is, it is normally assumed that errors will be distributed randomly over time:

[Figure: residuals εt plotted over time, scattered randomly around zero.]

Note the subscript t, which denotes a given time period. (Our cases are now defined by time units, say, months, rather than individuals.) Positively autocorrelated variables usually lead to positively correlated residuals, which might look something like this:

[Figure: positively correlated residuals plotted over time – a typical first-order autoregressive process.]

This figure shows a typical first-order autoregressive process, where

(112) \varepsilon_t = \rho \varepsilon_{t-1} + v_t ,

where εt is the error term, ρ is a coefficient, and vt is the remaining, random error. A consequence of this process tends to be that variances and standard errors are underestimated; we are thus more likely to find significant results. It is important, then, that we understand fully the extent and nature of autocorrelation in time series data, and try to account for it in our models.

Take a standard nonautoregressive model,

(113) Y_t = \alpha + \beta X_t + \varepsilon_t .

If it is true that Xt is highly correlated with Xt-1, as is the case with many time series, then it will also tend to be true that εt is correlated with εt-1. This presents serious estimation difficulties, and most time series methods are about solving this problem. Essentially, these methods (like the Durbin two-stage method, the Cochrane-Orcutt transformation, or Prais-Winsten estimation) are all about pulling the autocorrelation out of the error term and into the estimated coefficients. (Indeed, it's a little like missing variable bias – there is some missing variable, the absence of which leads to a particular kind of correlated error; so long as it's included, though, the model will be fine.) There are ways to do this that are much simpler (and sometimes more effective) than complicated transformations.
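To make the autoregressive error process concrete, here is a brief simulation sketch, not from the notes, in Python with numpy and made-up parameter values: it generates errors following Equation 112, builds a series as in Equation 113, and then recovers ρ from the lag-1 correlation of the OLS residuals.

    # Simulating the AR(1) error process of Equation 112 (all values made up).
    import numpy as np

    rng = np.random.default_rng(1)
    T, rho, beta = 300, 0.7, 0.5

    e = np.zeros(T)
    for t in range(1, T):
        e[t] = rho * e[t - 1] + rng.normal()       # Equation 112

    x = rng.normal(size=T)
    y = 1.0 + beta * x + e                         # Equation 113

    # OLS residuals, then the lag-1 autocorrelation of those residuals
    X = np.column_stack([np.ones(T), x])
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    rho_hat = np.corrcoef(resid[1:], resid[:-1])[0, 1]
    print(round(rho_hat, 2))                       # should land near 0.7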
For a terrific description of modeling strategies for time series, see Pickup's Introduction to Time Series Analysis (Sage). Below, we look briefly at univariate statistics.

Univariate Statistics

To explore the extent of autocorrelation in a single time series, we use autocorrelation functions (ACFs) and partial autocorrelation functions (PACFs). ACFs plot the average correlation between xt and xt-1, xt and xt-2, and so on. PACFs provide a measure of correlation between observations k units apart after the correlation at intermediate lags has been controlled for, or 'partialed out'. So, where a typical correlation coefficient (between two variables, x and y) looks like this:

(114) r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}} ,

an ACF of error terms j units apart looks like this:

(115) r_j = \frac{\sum_{t=j+1}^{N} (\varepsilon_t - \bar{\varepsilon})(\varepsilon_{t-j} - \bar{\varepsilon})}{\sum_{t=1}^{N} (\varepsilon_t - \bar{\varepsilon})^2} .

Often, we examine autocorrelation using a correlogram, which for a standard first-order process looks something like this, for positive autocorrelation:

[Correlogram: first-order process, positive autocorrelation.]

and this for negative autocorrelation:

[Correlogram: first-order process, negative autocorrelation.]

A combination of ACF and PACF plots usually gives us a good sense of the magnitude and structure of autocorrelation in a single time series. These are important to our understanding of how an individual time series works (i.e., just how much does one value depend on previous values?). They also point towards how many, or which, lags will be required in multivariate models.

Bivariate Statistics

A first test of the relationship between two time series usually takes the form of a cross-correlation function (CCF), or cross-correlogram, which displays the average correlations between xt and yt, xt and yt-1, and so on (and, conversely, between yt and xt-1, yt and xt-2, and so on)…