Communication Studies 783 Notes

Stuart Soroka, University of Michigan
Working Draft, December 2015
These pages were written originally as my own lecture notes, in part for Poli 618 at McGill University, and now for Comm 783 at the University of Michigan. Versions are freely available online, at snsoroka.com. The notes draw on a good number of statistics texts, including Kennedy's Econometrics, Greene's Econometric Analysis, a number of volumes in Sage's quantitative methods series, and several online textbooks. That said, please do keep in mind that they are just lecture notes – there are errors and omissions, and for no single topic is there enough information in this file to learn statistics from the notes alone. (There are of course many textbooks better equipped for that purpose.) The notes may nonetheless be a useful background guide to some of the themes in Comm 783 and, more generally, to some of the basic statistics most common in communication studies.
If you find errors (and you will), please do let me know. Thanks,
Stuart Soroka
ssoroka@umich.edu
Table of Contents

Variance, Covariance and Correlation
Introducing Bivariate Ordinary Least Squares Regression
Multivariate Ordinary Least Squares Regression
Error, and Model Fit
Assumptions of OLS regression
Nonlinearities
Collinearity and Multicollinearity
Heteroskedasticity
Outliers
Models for dichotomous data
    Linear Probability Models
    Nonlinear Probability Model: Logistic Regression
    An Alternative Description: The Latent Variable Model
    Nonlinear Probability Model: Probit Regression
Maximum Likelihood Estimation
Models for Categorical Data
    Ordinal Outcomes
    Nominal Outcomes
Appendix A: Significance Tests
    Distribution Functions
    The chi-square test
    The t test
    The F Test
Appendix B: Factor Analysis
    Background: Correlations and Factor Analysis
    An Algebraic Description
    Factor Analysis Results
    Rotated Factor Analyses
Appendix C: Taking Time Seriously
    Univariate Statistics
    Bivariate Statistics
Variance, Covariance and Correlation
Let’s begin with Yi, a continuous variable measuring some value for each
individual (i) in a representative sample of the population. Yi can be income, or
age, or a thermometer score expressing degrees of approval for a presidential
candidate. Variance in our variable Yi is calculated as follows:
(1)   $S_Y^2 = \frac{\sum (Y_i - \bar{Y})^2}{N - 1}$ , or

(2)   $S_Y^2 = \frac{N \sum Y_i^2 - (\sum Y_i)^2}{N(N - 1)}$ ,

where both versions are equivalent, and the latter is referred to as the computational formula (because it is, in principle, easier to calculate by hand). Note that the equation is pretty simple: we are interested in variance in Yi, and Equation 1 is basically taking the average of each individual Yi's variance around the mean (Ȳ).
There are a few tricky parts. First, the differences between each individual Yi and Ȳ (that is, Yi − Ȳ) are squared in Equation 1, so that negative values do not cancel out positive values (since squaring produces only positive values). Second, we use N−1 as the denominator rather than N (where N is the number of cases). This produces a more conservative (slightly inflated) result, in light of the fact that we are working with a sample rather than the population – that is, we distinguish between the values of Yi in our (hopefully) representative sample and the values of Yi that we believe exist in the total real-world population. For a small-N sample, where we might suspect that we under-estimate the variance in the population, using N−1 effectively adjusts the estimated variance upwards. With a large-N sample, the difference between N−1 and N is increasingly marginal. That the adjustment matters more for small samples than for big samples reflects our increasing confidence in the representativeness of our sample as N increases.

(Note that some texts distinguish between $S_Y^2$ and $\sigma_Y^2$, where the Roman S is the sample variance and the Greek σ is the population variance. Indeed, some texts will distinguish between sample values and population values using Roman and Greek versions across the board – B for an estimated slope coefficient, for instance, and β for the actual slope in the population. I am not this systematic below.)
The standard deviation is a simple function of variance:

(3)   $S_Y = \sqrt{S_Y^2} = \sqrt{\frac{\sum (Y_i - \bar{Y})^2}{N - 1}}$ ,

So standard deviations are also indications of the extent to which a given variable varies around its mean. SY is important for understanding distributions and significance tests, as we shall see below.
So far, we’ve looked only at univariate statistics – statistics describing a single
variable. Most of the time, though, what we want to do is describe relationships
between two (or more) variables. Covariance – a measure of common variance
between two variables, or how much two variables change together – is
calculated as follows:
(4)   $S_{XY} = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{N - 1}$ , or

(5)   $S_{XY} = \frac{N \sum X_i Y_i - \sum X_i \sum Y_i}{N(N - 1)}$ ,

the latter of which is the computational formula. Again, we use N−1 as the denominator, for the same reasons as above.
Pearson's correlation coefficient is a ratio of the covariance to the standard deviations, as follows:

(6)   $r = \frac{S_{XY}}{S_X S_Y}$ , or

(7)   $r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}}$ .
where SXY is the sample covariance between Xi and Yi, and SX and SY are the sample standard deviations of Xi and Yi respectively. (Note the relationship between Equation 7 and the preceding equations for standard deviations and covariances, Equation 3 and Equation 4.)
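For readers who want to check these formulas numerically, here is a minimal Python sketch (mine, not part of the original notes) computing Equations 1, 3, 4 and 6 for a small made-up data series; the variable names x and y are illustrative only.

```python
# A minimal sketch (not from the notes) checking Equations 1-7 with numpy.
import numpy as np

y = np.array([2.0, 3.0, 4.0, 5.0])
x = np.array([2.0, 4.0, 6.0, 8.0])
n = len(y)

var_y = np.sum((y - y.mean()) ** 2) / (n - 1)                    # Equation 1
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)       # Equation 4
sd_x = np.sqrt(np.sum((x - x.mean()) ** 2) / (n - 1))            # Equation 3
sd_y = np.sqrt(var_y)
r = cov_xy / (sd_x * sd_y)                                       # Equation 6

print(var_y, cov_xy, r)
# np.var(y, ddof=1), np.cov(x, y)[0, 1] and np.corrcoef(x, y)[0, 1]
# should return the same values.
```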
Introducing Bivariate Ordinary Least Squares Regression
Take a simple data series,

    X    Y
    2    2
    4    3
    6    4
    8    5

and plot it…

[figure: scatterplot of X and Y]
What we want to do is describe the relationship between X and Y. Essentially, we want to draw a line between the dots, and describe that line. Given that the data here are relatively simple, we can just do this by hand, and describe it using two basic properties, α and β:

[figure: the fitted line through the four points]
where α, the constant, is in this case equal to 1, and β, the slope, is 1 (the increase in Y) divided by 2 (the increase in X) = .5. So we can produce an equation for this line allowing us to predict values of Y based on values of X. The general model is,

(8)   $Y_i = \alpha + \beta X_i$

And the particular model in this case is Y = 1 + .5X.
Note that the constant is simply a function of the means of both X and Y, along with the slope. That is:

(9)   $\alpha = \bar{Y} - \beta \bar{X}$

            X    Y
            2    2
            4    3
            6    4
            8    5
    mean    5    3.5

So, following Equation 9, $\alpha = \bar{Y} - \beta \bar{X}$ = 3.5 – (.5)*5 = 3.5 – 2.5 = 1.
This is pretty simple. The difficulty is that data aren't like this – they don't fall along a perfect line. They're likely more like this:

    X    Y
    2    3
    4    2
    6    5
    8    5

[figure: scatterplot of the new X and Y]
Now, note that we can draw any number of lines that will satisfy Equation 8. All that matters is that the line goes through the means of X and Y. So the means are:

            X    Y
            2    3
            4    2
            6    5
            8    5
    mean    5    3.75

And let's make up an equation where Y = 3.75 when X = 5…

    Y = α + βX
    3.75 = α + (β)*5
    3.75 = 4 + (β)*5
    3.75 = 4 + (−.05)*5
    3.75 = 4 + (−.25)

So here it is: Y = 4 + (−.05)X. Plotted, it looks like this:

[figure: the line Y = 4 − .05X plotted against the data]
Note that this new model has to be expressed in a slightly different manner, including an error term:

(10)   $Y_i = \alpha + \beta X_i + \varepsilon_i$ ,

or, alternatively:

(11)   $Y_i = \hat{Y}_i + \varepsilon_i$ ,

where $\hat{Y}_i$ are the estimated values of the actual Yi, and where the error can be expressed in the following ways:

(12)   $\varepsilon_i = Y_i - \hat{Y}_i$ , or $\varepsilon_i = Y_i - (\alpha + \beta X_i)$ .
So we've now accounted for the fact that we work with messy data, and that there will consequently be a certain degree of error in the model. This is inevitable, of course, since we're trying to draw a straight line through points that are unlikely to be perfectly distributed along a straight line.

Of course, the line above won't do – it quite clearly does not describe the relationship between X and Y. What we need is a method of deriving a model that better describes the effect that X has on Y – essentially, a method that draws a line that comes as close to all the dots as possible. Or, more precisely, a model that minimizes the total amount of error (εi).

[figure: the data with a better-fitting line and the associated errors]
We first need a measure of the total amount of error – the degree to which our predictions 'miss' the actual values of Yi. We can't simply take the sum of all errors, $\sum \varepsilon_i$, because positive and negative errors can cancel each other out. We could take the sum of the absolute values, $\sum |\varepsilon_i|$, which in fact is used in some estimations.

The norm is to use the sum of squared errors, the SSE or $\sum \varepsilon_i^2$. This sum is most greatly affected by large errors – by squaring residuals, large residuals take on very large magnitudes. An estimation of Equation 10 that tries to minimize $\sum \varepsilon_i^2$ accordingly tries especially hard to avoid large errors. (By implication, outlying cases will have a particularly strong effect on the overall estimation. We return to this in the section on outliers below.)
This is what we are trying to do in ordinary least squares (OLS) regression: minimize the SSE, and have an estimate of β (on which our estimate of α relies) that comes as close to all the dots as is possible.

Least-squares coefficients for simple bivariate regression are estimated as follows:

(13)   $\beta = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}$ , or

(14)   $\beta = \frac{N \sum X_i Y_i - \sum X_i \sum Y_i}{N \sum X_i^2 - (\sum X_i)^2}$ .

The latter is referred to as the computational formula, as it's supposed to be easier to compute by hand. (I actually prefer the former, which I find easier to compute, and which has the added advantage of nicely illustrating the important features of OLS regression.)
We can use Equation 13 to calculate the least squares estimate for the above data. The data, and the calculated values used in Equation 13:

    Xi    Yi    Xi−X̄    Yi−Ȳ     (Xi−X̄)(Yi−Ȳ)    (Xi−X̄)²
    2     3     −3      −0.75     2.25            9
    4     2     −1      −1.75     1.75            1
    6     5      1       1.25     1.25            1
    8     5      3       1.25     3.75            9
  X̄=5   Ȳ=3.75                   Σ = 9           Σ = 20
So solving Equation 13 with the values above looks like this:

$\beta = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} = \frac{9}{20} = .45$

And we can use these results in Equation 9 to find the constant:

$\alpha = \bar{Y} - \beta \bar{X} = 3.75 - (.45)(5) = 3.75 - 2.25 = 1.5$

So the final model looks like this:

$Y_i = 1.5 + (.45) X_i$
Using this model, we can easily see what the individual predicted values (Ŷi) are, as well as the associated errors (εi):

    Xi    Yi     Ŷi     εi = Yi − Ŷi
    2     3      2.4     0.6
    4     2      3.3    −1.3
    6     5      4.2     0.8
    8     5      5.1    −0.1
  X̄=5   Ȳ=3.75
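A minimal Python sketch (mine, not from the notes) reproduces this worked example using Equations 13, 9 and 12; the numbers printed should match the tables above.

```python
# A minimal sketch (not from the notes) reproducing the worked example.
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([3.0, 2.0, 5.0, 5.0])

beta = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)  # Eq. 13
alpha = y.mean() - beta * x.mean()                                            # Eq. 9
y_hat = alpha + beta * x
resid = y - y_hat                                                             # Eq. 12

print(alpha, beta)   # 1.5, 0.45
print(y_hat)         # [2.4 3.3 4.2 5.1]
print(resid)         # [ 0.6 -1.3  0.8 -0.1]
```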
One further note about Equation 13, and our means of estimating OLS slope coefficients: recall the equations for variance (Equation 1) and covariance (Equation 4). If we take the ratio of covariance and variance, as follows,

(15)   $\frac{S_{XY}}{S_X^2} = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y}) / (N - 1)}{\sum (X_i - \bar{X})^2 / (N - 1)}$ ,

we can adjust somewhat to produce the following,

(16)   $\frac{S_{XY}}{S_X^2} = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}$ ,

where Equation 16 simply drops the N−1 denominators, which cancel each other out. More importantly, Equation 16 looks suspiciously – indeed, exactly – like the formula for β (Equation 13). β is thus essentially the ratio of the covariance between X and Y to the variance of X, as follows:
(17)   $\beta_{YX} = \frac{S_{YX}}{S_X^2}$

This should make sense when we consider the standard interpretation of β: for a one-unit shift in X, how much does Y change?
Multivariate Ordinary Least Squares Regression
Things are more complicated for multiple, or multivariate, regression, where there is more than one independent variable. The standard OLS multivariate model is nevertheless a relatively simple extension of bivariate regression – imagine, for instance, plotting a line through dots plotted along two X axes, in what amounts to three-dimensional space:

[figure: a regression line through points in three-dimensional space]

This is all we're doing in multivariate regression – drawing a line through these dots, where values of Y are driven by a combination of X1 and X2, and where the model itself would be as follows:

(18)   $Y_i = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i$ .

That said, when we have more than two regressors, we start plotting lines through four- and five-dimensional space, and that gets hard to draw.

Least squares coefficients for multiple regression with two regressors, as in Equation 18, are calculated as follows:
(19)   $\beta_1 = \frac{\left(\sum (X_{1i} - \bar{X}_1)(Y_i - \bar{Y})\right)\left(\sum (X_{2i} - \bar{X}_2)^2\right) - \left(\sum (X_{2i} - \bar{X}_2)(Y_i - \bar{Y})\right)\left(\sum (X_{1i} - \bar{X}_1)(X_{2i} - \bar{X}_2)\right)}{\left(\sum (X_{1i} - \bar{X}_1)^2\right)\left(\sum (X_{2i} - \bar{X}_2)^2\right) - \left(\sum (X_{1i} - \bar{X}_1)(X_{2i} - \bar{X}_2)\right)^2}$ ,

and

(20)   $\beta_2 = \frac{\left(\sum (X_{2i} - \bar{X}_2)(Y_i - \bar{Y})\right)\left(\sum (X_{1i} - \bar{X}_1)^2\right) - \left(\sum (X_{1i} - \bar{X}_1)(Y_i - \bar{Y})\right)\left(\sum (X_{1i} - \bar{X}_1)(X_{2i} - \bar{X}_2)\right)}{\left(\sum (X_{1i} - \bar{X}_1)^2\right)\left(\sum (X_{2i} - \bar{X}_2)^2\right) - \left(\sum (X_{1i} - \bar{X}_1)(X_{2i} - \bar{X}_2)\right)^2}$ ,

and the constant is now estimated as follows:

(21)   $\alpha = \bar{Y} - \beta_1 \bar{X}_1 - \beta_2 \bar{X}_2$ .
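As a quick check on Equations 19–21, here is a hedged Python sketch (mine, with simulated data) comparing the hand formulas to numpy's least-squares solver; the data and coefficient values are made up for illustration.

```python
# A minimal sketch (not from the notes): Equations 19-21 versus numpy's solver.
import numpy as np

rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=100), rng.normal(size=100)
y = 1 + 0.5 * x1 - 0.3 * x2 + rng.normal(scale=0.5, size=100)

d1, d2, dy = x1 - x1.mean(), x2 - x2.mean(), y - y.mean()
den = np.sum(d1**2) * np.sum(d2**2) - np.sum(d1 * d2)**2
b1 = (np.sum(d1 * dy) * np.sum(d2**2) - np.sum(d2 * dy) * np.sum(d1 * d2)) / den  # Eq. 19
b2 = (np.sum(d2 * dy) * np.sum(d1**2) - np.sum(d1 * dy) * np.sum(d1 * d2)) / den  # Eq. 20
a = y.mean() - b1 * x1.mean() - b2 * x2.mean()                                     # Eq. 21

X = np.column_stack([np.ones_like(x1), x1, x2])
print(np.linalg.lstsq(X, y, rcond=None)[0])   # should match (a, b1, b2)
print(a, b1, b2)
```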
Error, and Model Fit
The standard deviation of the residuals, or the standard error of the slope, is as follows,

(22)   $SE_\beta = \sqrt{\frac{\sum \varepsilon_i^2}{N - 2}}$ ,

or, more generally,

(23)   $SE_\beta = \sqrt{\frac{\sum \varepsilon_i^2}{N - K - 1}}$ ,

Equation 22 is the same as Equation 23, except that the former is a simple version that applies to bivariate regression only, and the latter is a more general version that applies to multivariate regression with any number of independent variables. N in these equations refers to the total number of cases, while K is the total number of independent variables in the model.
The SE_β is a useful measure of the fit of a regression slope – it gives you the average error of the prediction. It's also used to test the significance of the slope coefficient. For instance, if we are going to be 95% confident that our estimate is significantly different from zero, zero should not fall within the interval β ± 2(SE_β). Alternatively, if we are using t-statistics to examine coefficients' significance, then the ratio of β to SE_β should be roughly 2 or greater.

Assuming you remember the basic sampling and distributional material in your basic statistics course, this reasoning should sound familiar. Here's a quick refresher: testing model fit is based on some standard beliefs about
distributions. Normal distributions are unimodal and symmetric, and are described by the following probability distribution:

(24)   $p(Y) = \frac{1}{\sqrt{2\pi\sigma_Y^2}} \, e^{-(Y - \mu_Y)^2 / 2\sigma_Y^2}$

where p(Y) refers to the probability of a given value of Y, and where the shape of the curve is determined by only two values: the population mean, $\mu_Y$, and its variance, $\sigma_Y^2$. (Also see our discussion of distribution functions, below.)

Assuming two distributions with the same mean (of zero, for instance), the effect of changing variances is something like this:

[figure: two normal distributions with the same mean and different variances]
We know that many natural phenomena follow a normal distribution. So we assume that many political phenomena do as well. Indeed, where the current case is concerned, we believe that our estimated slope coefficient, β, is one of a distribution of possible βs we might find in repeated samples. These βs are normally distributed, with a standard deviation that we try to estimate from our data.

We also know that in any normal distribution, roughly 68% of all cases fall within plus or minus one standard deviation from the mean, and 95% of all cases fall within plus or minus two standard deviations from the mean. It follows that our slope should not be within two standard errors of zero. If it is, we cannot be 95% confident that our coefficient is significantly different from zero – that is, we cannot reject the null hypothesis that there is no effect.
Going through this process step-by-step is useful. Let's begin with our estimated bivariate model from above, where the model is Yi = 1.5 + (.45)Xi, and the data are,

    Xi    Yi     Ŷi     εi = Yi − Ŷi    εi²
    2     3      2.4     0.6            0.36
    4     2      3.3    −1.3            1.69
    6     5      4.2     0.8            0.64
    8     5      5.1    −0.1            0.01
  X̄=5   Ȳ=3.75                        Σ = 2.7
Based on Equation 22, we calculate the standard error of the slope as follows:

$SE_\beta = \sqrt{\frac{\sum \varepsilon_i^2}{N - 2}} = \sqrt{\frac{2.7}{4 - 2}} = \sqrt{1.35} = 1.16$

So, we can be 95% confident that the slope estimate in the population is .45 ± (2 × 1.16), or .45 ± 2.32. Zero is certainly within this interval, so our results are not statistically significant. This is mainly due to our very small sample size. Imagine the same slope and SE_β, but based on a sample of 200 cases:

$SE_\beta = \sqrt{\frac{\sum \varepsilon_i^2}{N - 2}} = \sqrt{\frac{2.7}{200 - 2}} = \sqrt{.014} = .118$

Now we can be 95% confident that the slope estimate in the population is .45 ± (2 × .118), or .45 ± .236. Zero is not within this interval, so our results in this case would be statistically significant.
Just to recap, our decision about the statistical significance of the slope is based on a combination of the magnitude of the slope (β), the total amount of error in the estimate (captured by the SE_β), and the sample size (N, used in our calculation of the SE_β). Any one of these things can contribute to significant findings: a greater slope, less error, and/or a larger sample size. (Here, we saw the effect that sample size can have.)
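Here is a small Python sketch (mine, not from the notes) of the confidence-interval logic just described, reusing the slope (.45) and sum of squared errors (2.7) from the example; the helper name slope_ci is my own.

```python
# A minimal sketch (not from the notes) of the interval calculation above.
import numpy as np

def slope_ci(beta, sse, n, width=2):
    se = np.sqrt(sse / (n - 2))          # Equation 22 (bivariate case)
    return beta - width * se, beta + width * se

print(slope_ci(0.45, 2.7, 4))    # roughly (-1.87, 2.77): zero inside, not significant
print(slope_ci(0.45, 2.7, 200))  # roughly (0.22, 0.68): zero outside, significant
```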
Another means of examining the overall model fit – that is, including all independent variables in a multivariate context – is by looking at the proportion of the total variation in Yi explained by the model. First, total variation can be decomposed into 'explained' and 'unexplained' components as follows:
TSS is the Total Sum of Squares;
RSS is the Regression Sum of Squares (note that some texts call this RegSS);
ESS is the Error Sum of Squares (some texts call this the residual sum of squares, RSS).

So, TSS = RSS + ESS, where

(25)   $TSS = \sum (Y_i - \bar{Y})^2$ ,

(26)   $RSS = \sum (\hat{Y}_i - \bar{Y})^2$ , and

(27)   $ESS = \sum (Y_i - \hat{Y}_i)^2$
We’re basically dividing up the total variance in Yi around its mean (TSS) into two
parts: the variance accounted for in the regression model (RSS), and the variance
not accounted for by the regression model (ESS). Indeed, we can illustrate on a
case-by-case basis the variance from the mean that is accounted for by the
model, and the remaining, unaccounted for, variance:
[figure: case-by-case deviations from the mean, divided into explained and unexplained components]
All the explained variance (squared) is summed to form RSS; all the unexplained
variance (squared) is summed to form ESS.
Using these terms, the coefficient of determination – more commonly, the R² – is calculated as follows:

(28)   $R^2 = \frac{RSS}{TSS}$ , or  $R^2 = 1 - \frac{ESS}{TSS}$ , or  $R^2 = \frac{TSS - ESS}{TSS}$ .

Or, alternatively, following from Equations 25–27:

(29)   $R^2 = \frac{RSS}{TSS} = \frac{\sum (\hat{Y}_i - \bar{Y})^2}{\sum (Y_i - \bar{Y})^2} = \frac{\sum (Y_i - \bar{Y})^2 - \sum (Y_i - \hat{Y}_i)^2}{\sum (Y_i - \bar{Y})^2}$
And we can estimate all of this as follows:
    Xi    Yi     Ŷi     (Yi − Ȳ)²    (Ŷi − Ȳ)²    (Yi − Ŷi)²
    2     3      2.4     0.56         1.82          0.36
    4     2      3.3     3.06         0.20          1.69
    6     5      4.2     1.56         0.20          0.64
    8     5      5.1     1.56         1.82          0.01
  X̄=5   Ȳ=3.75         TSS=6.74     RSS=4.04      ESS=2.7

The coefficient of determination is thus $R^2 = \frac{RSS}{TSS} = \frac{4.04}{6.74} = .599$ .
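A minimal Python sketch (mine, not from the notes) of the decomposition in Equations 25–29 for the same worked example; note that the unrounded sums are 6.75 and 4.05, slightly different from the rounded 6.74 and 4.04 in the table.

```python
# A minimal sketch (not from the notes) of the TSS/RSS/ESS decomposition.
import numpy as np

y = np.array([3.0, 2.0, 5.0, 5.0])
y_hat = np.array([2.4, 3.3, 4.2, 5.1])   # predictions from the model above

tss = np.sum((y - y.mean()) ** 2)        # Equation 25
rss = np.sum((y_hat - y.mean()) ** 2)    # Equation 26
ess = np.sum((y - y_hat) ** 2)           # Equation 27
print(tss, rss, ess, rss / tss)          # ~6.75, ~4.05, 2.7, R-squared ~ .60
```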
The coefficient of determination is calculated the same way for multivariate regression. The R² has one problem, though – it can only ever increase or stay the same as variables are added to the equation. More to the point, including extra variables can never lower the R², and the measure accordingly does not reward model parsimony. If you want a measure that does so, you need to use a 'correction' for degrees of freedom (sometimes called an adjusted R-squared):

(30)   $\tilde{R}^2 = 1 - \frac{ESS / (N - K - 1)}{TSS / (N - 1)}$

Note that this should only make a difference when the sample size is relatively small, or the number of independent variables is relatively large. You can see in Equation 30 that if the sample size is small, adding variables shrinks N − K − 1, inflating the error term and thus reducing the adjusted R².
One further note about the coefficient of determination: in bivariate regression, the R² is equivalent to the square of Pearson's r (Equation 6). That is,

(31)   $r = \frac{S_{XY}}{S_X S_Y} = \sqrt{R^2_{XY}}$ ,

There is, then, a clear relationship between the correlation coefficient and the coefficient of determination. There is also a relationship between a bivariate
correlation coefficient and the regression coefficient. Let's begin with an equation for the regression coefficient, as in Equation 17 above:

(32)   $\beta_{XY} = \frac{S_{XY}}{S_X^2}$ ,

and rearrange these terms to isolate the covariance:

(33)   $S_{XY} = \beta_{XY} S_X^2$ ,

Now, let's substitute this expression for $S_{XY}$ in the equation for correlation (Equation 6):

(34)   $r_{XY} = \frac{S_{XY}}{S_X S_Y} = \frac{\beta_{XY} S_X^2}{S_X S_Y}$ .

So the correlation coefficient and the bivariate regression coefficient are simple functions of one another. More clearly:

(35)   $r_{XY} = \beta_{XY} \frac{S_X}{S_Y}$ , and

(36)   $\beta_{XY} = r_{XY} \frac{S_Y}{S_X}$ .

The relationship between the two in multivariate regression is of course much more complicated. But the point is that all these measures – measures capturing various aspects of the relationship between two (or more) variables – are related to each other, each a function of a given set of variances and covariances.
Assumptions of OLS regression
The preceding OLS linear regression models are unbiased and efficient (that is,
they provide the Best Linear Unbiased Estimator, or BLUE) provided five
assumptions are not violated. If any of these assumptions are violated, the
regular linear OLS model ceases to be unbiased and/or efficient. The
assumptions themselves, as well as problems resulting from violating each one,
are listed below (drawn from Kennedy, Econometrics). Of course, many data or
models violate one or more of these assumptions, so much of what we have to
cover now is how to deal with these problems.
1. Y can be calculated as a linear function of X, plus a disturbance term.
Problems: wrong regressors, nonlinearity, changing parameters

2. The expected value of the disturbance is zero; the mean of e is zero.
Problems: biased intercept

3. Disturbance terms have the same variance and are not correlated with one another.
Problems: heteroskedasticity, autocorrelated errors

4. Observations of the independent variables are fixed in repeated samples; it is possible to repeat the sample with the same independent values.
Problems: errors in variables, autoregression, simultaneity

5. The number of observations is greater than the number of independent variables, and there are no exact linear relationships between the independent variables.
Problems: multicollinearity
Nonlinearities
So far, we've assumed that the relationship between Yi and Xi is linear. In many cases, this will not be true. We could imagine any number of non-linear relationships. Here are just two common possibilities:

[figures: an exponential increase, and a ceiling effect]

We can of course estimate a linear relationship in both cases – it doesn't capture the actual relationship very well, though. In order to better capture the relationship between Y and X, we may want to adjust our variables to represent this non-linearity.
Let's begin with the basic multivariate model,

(37)   $Y_i = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i$ .

Where a single X is believed to have a nonlinear relationship with Y, the simplest approach is to manipulate the X – to use X² in place of X, for instance:

(38)   $Y_i = \alpha + \beta_1 X_{1i}^2 + \beta_2 X_{2i} + \varepsilon_i$ ,

This may capture the exponential increase depicted in the first figure above. To capture the ceiling effect in the second figure, we could use both the linear (X) and quadratic (X²) terms, with the expectation that the coefficient for the former (β1) would be positive and large, and the coefficient for the latter (β2) would be negative and small:

(39)   $Y_i = \alpha + \beta_1 X_{1i} + \beta_2 X_{1i}^2 + \beta_3 X_{2i} + \varepsilon_i$ ,

The coefficient on the quadratic will gradually, and increasingly, reduce the positive effect of X1. Indeed, if the effect of the quadratic is great enough, it can in combination with the linear version of X1 produce a line that increases, peaks, and then begins to decrease.
Of course, these are just two of the simplest (and most common) nonlinearities. You can imagine any number of different non-linear relationships; most can be captured by some kind of mathematical adjustment to the regressors.

Sometimes we believe there is a nonlinear relationship between all the Xs and Y – that is, all Xs combined have a nonlinear effect on Y, for instance:

(40)   $Y_i = (\alpha + \beta_1 X_{1i} + \beta_2 X_{2i})^2 + \varepsilon_i$ .

The easiest way to estimate this is not Equation 40, though, but rather an adjustment as follows:

(41)   $\sqrt{Y_i} = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i$ .

Here, we simply transform the dependent variable. I've replaced the squared version of the right hand side (RHS) variables with the square root of the left hand side (LHS) because it's a simple example of a nonlinear transformation. It's not the most common, however. The most common is taking the log of Y, as follows:

(42)   $\ln(Y_i) = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i$ .
Doing so serves two purposes. First, we might believe that the shape of the effect of our RHS variables on Yi is actually nonlinear – and specifically, logistic in shape (an S-curve). This transformation may quite nicely capture that nonlinearity. Second, taking the log of Yi can solve a distributional problem with that variable. OLS estimations work more efficiently with variables that are normally distributed. If Yi has a great many small values and a long right-hand tail (as many of our variables will; income, for instance), then taking the log of Yi often does a nice job of generating a more normal distribution. This example highlights a second reason for transforming a variable, whether on the LHS or RHS. Sometimes, a transformation is based on a particular, theoretically expected shape of an effect. Other times, a transformation is used to 'fix' a non-normally distributed variable. The first transformation is based on theoretical expectations; the second is based on a statistical problem. (In practice, separating the two is not always easy.)
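As an illustration of these two kinds of adjustment, here is a hedged Python sketch (mine, not from the notes) that adds a quadratic term as in Equation 39 and logs the dependent variable as in Equation 42; the simulated data and coefficient values are made up.

```python
# A minimal sketch (not from the notes) of quadratic and log specifications.
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.uniform(0, 10, size=200)
x2 = rng.normal(size=200)
y = np.exp(0.2 * x1 + 0.1 * x2 + rng.normal(scale=0.2, size=200))  # skewed outcome

# Quadratic specification: regress y on x1, x1^2 and x2 (cf. Equation 39)
Xq = np.column_stack([np.ones_like(x1), x1, x1**2, x2])
print(np.linalg.lstsq(Xq, y, rcond=None)[0])

# Log specification: regress ln(y) on x1 and x2 (cf. Equation 42)
Xl = np.column_stack([np.ones_like(x1), x1, x2])
print(np.linalg.lstsq(Xl, np.log(y), rcond=None)[0])  # slopes near 0.2 and 0.1
```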
Collinearity and Multicollinearity
When there is a linear relationship among the regressors, the OLS coefficients are not uniquely identified. This is not a problem if your goal is only to predict Y – multicollinearity will not affect the overall prediction of the regression model. If your goal is to understand how the individual RHS variables impact Y, however, multicollinearity is a big problem. One problem is that the individual p-values can be misleading – confidence intervals on the regression coefficients will be very wide.

Essentially, what we are concerned about is the correlation amongst regressors, for instance, X1 and X2:

(43)   $r_{12} = \frac{\sum (X_{1i} - \bar{X}_1)(X_{2i} - \bar{X}_2)}{\sqrt{\sum (X_{1i} - \bar{X}_1)^2 \sum (X_{2i} - \bar{X}_2)^2}}$ ,

This is of course just a simple adjustment to the Pearson's r equation (Equation 7). Equation 43 deals just with the relationship between two variables, however, and we are often worried about a more complicated situation – one in which a given regressor is correlated with a combination of several, or even all, of the other regressors in a model. (Note that this multicollinearity can exist even if there are no striking bivariate relationships between regressors.) Multicollinearity is perhaps most easily depicted as a regression model in which one X is regressed on all the others. That is, for the regression model,
(44)   $Y_i = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \beta_4 X_{4i} + \varepsilon_i$

we might be concerned that the following regression produces strong results:

(45)   $X_{1i} = \alpha + \beta_2 X_{2i} + \beta_3 X_{3i} + \beta_4 X_{4i} + \varepsilon_i$

If X1 is well predicted by X2 through X4, it will be very difficult to identify the slope (and error) for X1 separately from the set of other slopes (and errors). (The slopes and errors for the other regressors may be affected as well.)
Variance inflation factors (VIFs) are one measure that can be used to detect multicollinearity. Essentially, VIFs are a scaled version of the multiple correlation coefficient between variable j and the rest of the independent variables. Specifically,

(46)   $VIF_j = \frac{1}{1 - R_j^2}$

where $R_j^2$ would be based on results from a model as in Equation 45. If $R_j^2$ equals zero (i.e., no correlation between Xj and the remaining independent variables), then VIFj equals 1. This is the minimum value. As $R_j^2$ increases, however, the denominator of Equation 46 decreases, and the estimated VIF rises as a consequence. A value greater than 10 represents a pretty big multicollinearity problem.
VIFs tell us how much the variance of the estimated regression coefficient is 'inflated' by the existence of correlation among the predictor variables in the model. The square root of the VIF tells us how much the standard error is inflated. This table, drawn from the Sage volume by Fox, shows the relationship between a given inter-regressor multiple correlation (Rj), the VIF, and the estimated amount by which the standard error of βj is inflated by multicollinearity.

Coefficient Variance Inflation as a Function of Inter-Regressor Multiple Correlation

    Rj      VIF     impact on SE(βj)
    0       1       1
    0.2     1.04    1.02
    0.4     1.19    1.09
    0.6     1.56    1.25
    0.8     2.78    1.67
    0.9     5.26    2.29
    0.99    50.3    7.09
Ways of dealing with multicollinearity include (a) dropping variables, (b)
combining multiple collinear variables into a single measure, and/or (c) if
collinearity is only moderate, and all variables are of substantive importance to
the model, simply interpreting coefficients and standard errors taking into
account the effects of multicollinearity.
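A minimal Python sketch (mine, not from the notes) of Equation 46: each VIF is computed by regressing one regressor on the others and taking 1/(1 − R²j); the function name vif and the simulated data are illustrative only.

```python
# A minimal sketch (not from the notes) of variance inflation factors.
import numpy as np

def vif(X):
    """X: n-by-k matrix of regressors (no constant column)."""
    n, k = X.shape
    out = []
    for j in range(k):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        fitted = others @ np.linalg.lstsq(others, X[:, j], rcond=None)[0]
        r2 = 1 - np.sum((X[:, j] - fitted)**2) / np.sum((X[:, j] - X[:, j].mean())**2)
        out.append(1 / (1 - r2))                     # Equation 46
    return np.array(out)

rng = np.random.default_rng(2)
x1 = rng.normal(size=500)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=500)           # nearly collinear with x1
x3 = rng.normal(size=500)
print(vif(np.column_stack([x1, x2, x3])))             # large VIFs for x1 and x2
```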
Heteroskedasticity
Heteroskedasticity refers to unequal variance in the regression errors. Note that
there can be heteroskedasticity relating to the effect of individual independent
variables, and also heteroskedasticity related to the combined effect of all
independent variables. (In addition, there can be heteroskedasticity in terms of
unequal variance over time.)
The following figure portrays the standard case of heteroskedasticity, where the
variance in Y (and thus the regression error as well) is systematically related to
values of X.
[figure: a scatterplot in which the spread of Y increases with X]
The difficulty here is that the error of the slope will be poorly estimated – it will
over-estimate the error at small values of X, and under-estimate the error at large
values of X.
Diagnosing heteroskedasticity is often easiest by looking at a plot of errors (εi) by values of the dependent variable (Yi). Basically, we begin with the standard bivariate model of Yi,

(47)   $Y_i = \alpha + \beta X_i + \varepsilon_i$ ,

and then plot the resulting values of εi by Yi. If we did so for the data in the preceding figure, then the resulting residuals plot would look as follows:
[figure: residuals fanning out as Yi increases]
As Yi increases here, so too does the variance in εi. There are of course other possible (heteroskedastic) relationships between Yi and εi – for instance,

[figure: residuals with much greater variance in the middle range of Yi]

where variance is much greater in the middle. Any version of heteroskedasticity presents problems for OLS models.
When the sample size is relatively small, these diagnostic graphs are probably the best means of identifying heteroskedasticity. When the sample size is large, there are too many dots on the graph to distinguish what's going on. There are several tests for heteroskedasticity, however. The Breusch-Pagan test looks for a relationship between the error and the independent variables. It starts with a standard multivariate regression model,

(48)   $Y_i = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + ... + \beta_k X_{ki} + \varepsilon_i$ ,

and then substitutes the estimated errors, squared, for the dependent variable,

(49)   $\hat{\varepsilon}_i^2 = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + ... + \beta_k X_{ki} + \upsilon_i$ .

We then use a standard F-test to test the joint significance of the coefficients in Equation 49. If they are significant, there is some kind of systematic relationship between the independent variables and the error.
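Here is a hedged Python sketch (mine, not from the notes) of this logic, written out by hand rather than with a packaged routine; the function name breusch_pagan and the simulated data are my own.

```python
# A minimal sketch (not from the notes) of the Breusch-Pagan logic (Eqs. 48-49).
import numpy as np
from scipy import stats

def breusch_pagan(X, y):
    """X: n-by-k regressors (no constant); returns (F statistic, p value)."""
    n, k = X.shape
    Xc = np.column_stack([np.ones(n), X])
    e = y - Xc @ np.linalg.lstsq(Xc, y, rcond=None)[0]      # residuals from Eq. 48
    e2 = e**2
    e2_hat = Xc @ np.linalg.lstsq(Xc, e2, rcond=None)[0]    # auxiliary model, Eq. 49
    rss = np.sum((e2_hat - e2.mean())**2)                   # regression sum of squares
    ess = np.sum((e2 - e2_hat)**2)                          # error sum of squares
    F = (rss / k) / (ess / (n - k - 1))
    return F, stats.f.sf(F, k, n - k - 1)

rng = np.random.default_rng(3)
x = rng.uniform(1, 5, size=300)
y = 1 + 0.5 * x + rng.normal(scale=x)        # error variance grows with x
print(breusch_pagan(x[:, None], y))          # small p value: heteroskedasticity
```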
Outliers
Recall that OLS regression pays particularly close attention to avoiding large
errors. It follows that outliers – cases that are unusual – can have a particularly
large effect on an estimated regression slope. Consider the following two
possibilities, where a single outlier has a huge effect on the estimated slope:
[figure: two scatterplots in which a single outlying case shifts the estimated slope]
Hat values (hi) are the common measure of leverage in a regression. It is possible to express each fitted value $\hat{Y}_j$ in terms of the observed values $Y_i$:

(50)   $\hat{Y}_j = h_{1j} Y_1 + h_{2j} Y_2 + ... + h_{nj} Y_n = \sum_{i=1}^{n} h_{ij} Y_i$ .

The coefficient, or weight, $h_{ij}$ captures the contribution of each observation $Y_i$ to the fitted value $\hat{Y}_j$.
Outlying cases can usually not be discovered by looking at residuals – OLS estimation tries, after all, to minimize the error for high-leverage cases. In fact, the variance in residuals is in part a function of leverage,

(51)   $V(E_i) = \sigma_\varepsilon^2 (1 - h_i)$ .

The greater the hat value in Equation 51, the lower the variance. How can we identify high-leverage cases? Sometimes, simply plotting data can be very helpful. Also, we can look closely at residuals. Start with the model for standardized residuals, as follows,

(52)   $E_i' = \frac{E_i}{S_E \sqrt{1 - h_i}}$ ,
Dec 2015
Comm783 Notes, Stuart Soroka, University of Michigan
pg 2
!5
which simply expresses each residual as a number (or increment) of standard deviations in Ei. The problem with Equation 52 is that case i is included in the estimation of the variance; what we really want is a sense of how i looks in relation to the variance in all other cases. This is the studentized residual,

(53)   $E_i^{*} = \frac{E_i}{S_{E(-i)} \sqrt{1 - h_i}}$ ,

where $S_{E(-i)}$ is estimated with case i set aside. It provides a good indication of just how far 'out' a given case is in relation to all other cases. (To test significance, the statistic follows a t-distribution with N−K−2 degrees of freedom.)
Note that you can estimate studentized residuals in a quite different way (though with the same results). Start by defining a variable D, equal to 1 for case i and equal to 0 for all other cases. Then, for a multivariate regression model as follows:

(54)   $Y_i = \alpha + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_k X_k + \varepsilon_i$ ,

add variable D and estimate,

(55)   $Y_i = \alpha + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_k X_k + \gamma D_i + \varepsilon_i$ .

This is referred to as a mean-shift outlier model, and the t-statistic for γ provides a test equivalent to the studentized residual.
What do we do if we have outliers? That depends. If there are reasons to believe the case is abnormal, then sometimes it's best just to drop it from the dataset. If you believe the case is 'correct', or justifiable, in spite of the fact that it's an outlier, then you may choose to keep it in the model. At a minimum, you will want to test your model with and without the outlier, to explore the extent to which your results are driven by a single case (or, in the case of several outliers, a small number of cases).
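A minimal Python sketch (mine, not from the notes) of the mean-shift outlier model in Equation 55: add a dummy for a suspect case and inspect the t-statistic on γ; the data are simulated and the outlier is planted deliberately.

```python
# A minimal sketch (not from the notes) of the mean-shift outlier model.
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=50)
y = 1 + 0.5 * x + rng.normal(scale=0.3, size=50)
y[0] += 5                                   # make case 0 an outlier

d = np.zeros(50); d[0] = 1                  # dummy marking case 0 (Equation 55)
X = np.column_stack([np.ones(50), x, d])
b, res_ss = np.linalg.lstsq(X, y, rcond=None)[:2]
sigma2 = res_ss[0] / (50 - X.shape[1])
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
print(b[2] / se[2])       # t for gamma; very large here, flagging case 0 as an outlier
```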
Models for dichotomous data
Linear Probability Models
Let's begin with a simple definition of our binary dependent variable. We have a variable, Yi, which only takes on the values 0 or 1. We want to predict when Yi is equal to 0, or 1; put differently, we want to know for each individual case i the probability that Yi is equal to 1, given Xi. More formally,

(56)   $E(Y_i) = \Pr(Y_i = 1 | X_i)$ ,

which states that the expected value of Yi is equal to the probability that Yi is equal to one, given Xi.

Now, a linear probability model simply estimates Pr(Yi = 1) in the same way as we would estimate an interval-level Yi:

(57)   $\Pr(Y_i = 1) = \alpha + \beta X_i$ .
There are two difficulties with this kind of model. First, while the estimated slope coefficients are good, the standard errors are incorrect due to heteroskedasticity (errors increase in the middle range, first negative, then positive). Graphing the data with a regular linear regression line, for instance, would look something like this:

[figure: a linear regression line fit to 0/1 outcomes]
The second problem with the linear probability model is that it will generate
predictions that are greater than 1 and/or less than 0 (as shown in the preceding
figure) even though these are nonsensical where probabilities are concerned.
As a consequence, it is desirable to try and transform either the LHS or RHS of
the model so predictions are both realistic and efficient.
Nonlinear Probability Model: Logistic Regression
One option is to transform Yi, to develop a nonlinear probability model. To extend the range beyond 0 to 1, we first transform the probability into the odds…

(58)   $\frac{\Pr(Y_i = 1 | X_i)}{\Pr(Y_i = 0 | X_i)} = \frac{\Pr(Y_i = 1 | X_i)}{1 - \Pr(Y_i = 1 | X_i)}$ ,

which indicates how often something happens relative to how often it does not, and ranges from 0 to infinity as Pr(Yi = 1|Xi) approaches 1. We then take the log of this to get,

(59)   $\ln\left(\frac{\Pr(Y_i = 1 | X_i)}{1 - \Pr(Y_i = 1 | X_i)}\right)$ ,

or more simply,

(60)   $\ln\left(\frac{p_i}{1 - p_i}\right)$ ,

where,

(61)   $p_i = \Pr(Y_i = 1 | X_i)$ .

Modeling what we've seen in Equation 60 then captures the log odds that something will happen. By taking the log, we've effectively stretched out the ends of the 0 to 1 range, and consequently have a comparatively unconstrained dependent variable that can be modeled as a linear function of the regressors, where

(62)   $\ln\left(\frac{p_i}{1 - p_i}\right) = \beta X_i$ .
Just to make clear the effects of our transformation, here’s what taking the log
odds of a simple probability looks like:
    Probability    Odds             Logit
    0.01           1/99 = .0101     −4.6
    0.05           5/95 = .0526     −2.94
    0.1            1/9 = .1111      −2.2
    0.3            3/7 = .4286      −0.85
    0.5            5/5 = 1           0
    0.7            7/3 = 2.3333      0.85
    0.9            9/1 = 9           2.2
    0.95           95/5 = 19         2.94
    0.99           99/1 = 99         4.6
Note that there is another way of representing a logit model, essentially the inverse (un-logging of both sides) of Equation 62:

(63)   $\Pr(Y_i = 1 | X_i) = \frac{\exp(\beta X_i)}{1 + \exp(\beta X_i)}$ .

Just to be clear, we can work our way backwards from Equation 63 to Equation 62 as follows:

(64)   $\Pr(Y_i = 1 | X_i) = \frac{\exp(\beta X_i)}{1 + \exp(\beta X_i)}$ , and  $\Pr(Y_i = 0 | X_i) = \frac{1}{1 + \exp(\beta X_i)} = 1 - \frac{\exp(\beta X_i)}{1 + \exp(\beta X_i)}$ .

So,

(65)   $\frac{p_1}{1 - p_1} = \frac{p_1}{p_0} = \frac{\exp(\beta X_i) / (1 + \exp(\beta X_i))}{1 / (1 + \exp(\beta X_i))} = \frac{\exp(\beta X_i)}{1}$ ,

and,

(66)   $\frac{p_i}{1 - p_i} = \exp(\beta X_i)$ ,

which when logging both sides becomes,

(67)   $\ln\left(\frac{p_i}{1 - p_i}\right) = \beta X_i$ .
The notation in Equation 62 is perhaps the most useful in connecting logistic regression with probit and other non-linear estimations for binary data. The logit transformation is just one possible transformation that effectively maps the linear prediction into the 0 to 1 interval – allowing us to retain the fundamentally linear structure of the model while at the same time avoiding the contradiction of probabilities below 0 or above 1. Many cumulative distribution functions (CDFs) will meet this requirement. (Note that CDFs define the probability mass to the left of a given value of X; they are of course closely related to – indeed, they are just the accumulation of – PDFs, which are dealt with in more detail in the section on significance tests.)

Equation 63 is in contrast useful for thinking about the logit model as just one example of transformations in which Pr(Yi = 1) is a function of a non-linear transformation of the RHS variables, based on any number of CDFs. A more general version of Equation 63 is, then,

(68)   $\Pr(Y_i = 1 | X_i) = F(\beta X_i)$ ,

where F is the logistic CDF for the logit model, as follows,

(69)   $\Pr(Y_i = 1 | X_i) = F(\beta X_i)$ , where $F(x) = \frac{1}{1 + \exp(-(x - \mu)/s)}$ ,

but F could just as easily be the normal CDF for the probit model, or a variety of other CDFs. How do we know which CDF to use? The CDF we choose should reflect our beliefs about the distribution of Yi or, alternatively (and equivalently), the distribution of error in Yi. We discuss this more below.
An Alternative Description: The Latent Variable Model
Another way to draw the link between logistic and regular regression is through the latent variable model, which posits that there is an unobserved, latent variable Yi*, where

(70)   $Y_i^* = \beta X_i + \varepsilon_i$ ,

and the link between the observed binary Yi and the latent Yi* is as follows:

(71)   $Y_i = 1$ if $Y_i^* > 0$ , and

(72)   $Y_i = 0$ if $Y_i^* \le 0$ .

Using this example, the relationship between the observed binary Yi and the latent Yi* can be graphed as follows:

[figure: the latent Yi* plotted against Xi, with the Yi = 0/1 threshold at zero]
So, at any given value of Xi there is a given probability that Yi* is greater than zero. This figure also shows how our beliefs about the distribution of error (εi) are fundamental – there is a distribution of possible outcomes in Yi* when, in this figure, Xi = 4. For a probit model, we assume that $Var(\varepsilon_i) = 1$; for a logit model, we assume that $Var(\varepsilon_i) = \pi^2/3$. Other CDFs make other assumptions.

The distribution of error (εi) at any given value of Xi is related to a non-linear increase in the probability that Yi = 1. Indeed, we can show this non-linear shift first by plotting a distribution of εi at each value of Xi,

[figure: error distributions around the latent regression line at each value of Xi]

and then by looking at how the movement of this distribution across the zero line shifts the probability that Yi = 1:

[figure: the implied probability that Yi = 1, an S-curve in Xi]

As the thick part of the distribution moves across the zero line, the probability increases dramatically.
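A small Python simulation (mine, not from the notes) of the latent-variable story: draw Y* = βX + ε with logistic errors, observe Y = 1 whenever Y* crosses zero, and compare the observed proportions to the logistic CDF; β and the X values are made up.

```python
# A minimal sketch (not from the notes) of the latent-variable model (Eqs. 70-72).
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
beta = 1.0
for x in [-2.0, -1.0, 0.0, 1.0, 2.0, 4.0]:
    e = rng.logistic(size=100_000)               # errors with Var = pi^2 / 3
    y_star = beta * x + e                        # Equation 70
    y = (y_star > 0).astype(int)                 # Equations 71-72
    print(x, y.mean().round(3), stats.logistic.cdf(beta * x).round(3))
```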
Nonlinear Probability Model: Probit Regression
As noted above, probit models are based on the same logic as logistic models. Again, they can be thought of as a non-linear transformation of the LHS or RHS variables. The only difference for probit models is that rather than assume a logistic distribution, we assume a normal one. In Equation 68, then, F would now be the cumulative distribution function for a normal distribution.

Why assume a normal distribution? The critical question is really: why assume a logistic one? We typically assume a logistic distribution because it is very close to normal, and estimating a logistic model is computationally much easier than estimating a probit model. We now have faster computers, so there is less reason to rely on logit rather than probit models. That said, logit has some advantages where teaching is concerned. Compared to probit, it's very simple.
Maximum Likelihood Estimation
Models for categorical variables are not estimated using OLS, but using maximum likelihood. ML estimates are the values of the parameters that have the greatest likelihood (that is, the maximum likelihood) of generating the observed sample of data if the assumptions of the model are true. For a simple model like $Y_i = \alpha + \beta X_i$, an ML estimation looks at many different possible values of α and β, and finds the combination that is most likely to have generated the observed values of Yi.

[figure: observed values of Yi, with two candidate probability distributions, A and B]

Take, for instance, the above graph, which shows the observed values of Yi on the bottom axis. There are two different probability distributions, one produced by one set of parameters, A, and one produced by another set of parameters, B. MLE asks which distribution seems more likely to have produced the observed data. Here, it looks like the B parameters have an estimated distribution more likely to produce the observed data.
Alternatively, consider the following. If we are interested in the probability that Yi = 1, given a certain set of parameters (p), then an ML estimation is interested in the likelihood of p given the observed data:

(73)   $L(p | Y_i)$ .

This is a likelihood function. Finding the best set of parameters is an iterative process, which starts somewhere and begins searching; different 'optimization algorithms' may start in slightly different places, and conduct the search differently; all base their decisions about where to search next on the rate of improvement in the model. (The way in which model fit is judged is addressed below.)

Note that our being vague about 'parameters' here is purposeful. As analysts, the parameters we are thinking about are the coefficients for the various independent variables (βX). The parameters critical to the ML estimation, however, are those that define the shape of the distribution; for a normal distribution, for instance, these are the mean (μ) and variance (σ²) (see Equation 24). Every set of parameters, βX, produces a given estimated normal distribution of Yi with mean μ and variance σ²; the ML estimation tries to find the βX producing the distribution most likely to have generated our observed data.

Note also that while we speak about ML estimations maximizing the likelihood equation, in practice programs maximize the log of the likelihood, which simplifies computations considerably (and gets the same results). Because the likelihood is always between 0 and 1, the log likelihood is always negative…
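To make the ML logic concrete, here is a hedged Python sketch (mine, not from the notes) that maximizes a logit log-likelihood directly with a general-purpose optimizer; the data, coefficient values, and function name are illustrative assumptions, not the notes' own example.

```python
# A minimal sketch (not from the notes): maximizing a logit log-likelihood directly.
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(6)
x = rng.normal(size=500)
y = (rng.uniform(size=500) < stats.logistic.cdf(-0.5 + 1.2 * x)).astype(int)
X = np.column_stack([np.ones_like(x), x])

def neg_log_likelihood(b):
    p = np.clip(stats.logistic.cdf(X @ b), 1e-10, 1 - 1e-10)   # Pr(Y=1 | X, b)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

result = optimize.minimize(neg_log_likelihood, x0=np.zeros(2))
print(result.x)          # estimates near (-0.5, 1.2)
print(-result.fun)       # maximized log likelihood (a negative number)
```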
Models for Categorical Data
Ordinal Outcomes
For models where the dependent variable is categorical but ordered, ordered logit is the most appropriate modelling strategy. A typical description begins with a latent variable Yi*, which is a function of the regressors,

(74)   $Y_i^* = \beta X_i + \varepsilon_i$ ,
and a link between the observed Yi and the latent Yi* as follows:

(75)   $Y_i = 1$ if $Y_i^* \le \delta_1$ , and
       $Y_i = 2$ if $\delta_1 < Y_i^* \le \delta_2$ , and
       $Y_i = 3$ if $Y_i^* > \delta_2$ ,

where δ1 and δ2 are unknown parameters (cutpoints) to be estimated along with the β in Equation 74. We can restate the model, then, as follows:

(76)   $\Pr(Y_i = 1 | X_i) = \Pr(\beta X_i + \varepsilon_i \le \delta_1) = \Pr(\varepsilon_i \le \delta_1 - \beta X_i)$ , and
       $\Pr(Y_i = 2 | X_i) = \Pr(\delta_1 < \beta X_i + \varepsilon_i \le \delta_2) = \Pr(\delta_1 - \beta X_i < \varepsilon_i \le \delta_2 - \beta X_i)$ , and
       $\Pr(Y_i = 3 | X_i) = \Pr(\beta X_i + \varepsilon_i > \delta_2) = \Pr(\varepsilon_i > \delta_2 - \beta X_i)$ .
The last statement on each line here makes clear the importance of the distribution of error in the estimation: the probability of a given outcome can be expressed as the probability that the error is – in the first line, for instance – smaller than the difference between the cutpoint (δ1) and the estimated value (βXi). This set of statements can also be expressed as follows, adding 'hats' to denote estimated values, substituting the predicted $\hat{Y}$ for βX, and inserting a given cumulative distribution function, F, from which we derive our probability estimates:

(77)   $\hat{p}_{i1} = \Pr(\varepsilon_i \le \hat{\delta}_1 - \hat{Y}_i) = F(\hat{\delta}_1 - \hat{Y}_i)$ , and
       $\hat{p}_{i2} = \Pr(\hat{\delta}_1 - \hat{Y}_i < \varepsilon_i \le \hat{\delta}_2 - \hat{Y}_i) = F(\hat{\delta}_2 - \hat{Y}_i) - F(\hat{\delta}_1 - \hat{Y}_i)$ , and
       $\hat{p}_{i3} = \Pr(\varepsilon_i > \hat{\delta}_2 - \hat{Y}_i) = 1 - F(\hat{\delta}_2 - \hat{Y}_i)$ ,
where F can again be the logistic CDF (for ordered logit), but also the normal CDF (for ordered probit), and so on. Again, using the logistic version as the example is far easier, and we can express the whole system in another way, as follows:

(78)   $\ln\left(\frac{p_1}{1 - p_1}\right) = \beta X$ ,  $\ln\left(\frac{p_1 + p_2}{1 - p_1 - p_2}\right) = \beta X$ ,  $\ln\left(\frac{p_1 + p_2 + ... + p_k}{1 - p_1 - p_2 - ... - p_k}\right) = \beta X$ .
Note that these models rest on the parallel slopes assumption: the slope
coefficients do not vary between different categories of the dependent variable
(i.e., from the first to second category, the second to third category, and so on).
If this assumption is unreasonable, a multinomial model is more appropriate. (In
fact, this assumption can be tested by fitting a multinomial model and
examining differences and similarities in coefficients across categories.)
And now, when we talk about odds ratios, we are talking about a shift in the odds of falling at or below a given category (m),

(79)   $OR(m) = \frac{\Pr(Y_i \le m)}{\Pr(Y_i > m)}$ .
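A minimal Python sketch (mine, not from the notes) of Equation 77: category probabilities built from a linear prediction and two cutpoints using the logistic CDF; the coefficient and cutpoint values are hypothetical.

```python
# A minimal sketch (not from the notes) of ordered-logit category probabilities.
import numpy as np
from scipy import stats

beta, cut1, cut2 = 0.8, -1.0, 1.5            # hypothetical estimates
x = np.array([-2.0, 0.0, 2.0])
xb = beta * x

p1 = stats.logistic.cdf(cut1 - xb)                                   # Pr(Y = 1)
p2 = stats.logistic.cdf(cut2 - xb) - stats.logistic.cdf(cut1 - xb)   # Pr(Y = 2)
p3 = 1 - stats.logistic.cdf(cut2 - xb)                               # Pr(Y = 3)
print(np.column_stack([p1, p2, p3]))         # each row sums to 1
```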
Nominal Outcomes
Multinomial logit is essentially a series of logit regressions examining the probability that Yi = m rather than Yi = k, where k is a reference category. This means that one category of the dependent variable is set aside as the reference category, and all models show the probability of Yi being one outcome rather than outcome k. Say, for instance, there are four outcomes k, m, n, and q, where k is the reference category. The models estimated are:

(80)   $\ln\left(\frac{\Pr(Y_i = m)}{\Pr(Y_i = k)}\right) = \beta_m X$ ,  $\ln\left(\frac{\Pr(Y_i = n)}{\Pr(Y_i = k)}\right) = \beta_n X$ ,  $\ln\left(\frac{\Pr(Y_i = q)}{\Pr(Y_i = k)}\right) = \beta_q X$ .

These models explore the variables that distinguish each of m, n, and q from k. Any category can be the base category, of course. It may be that it is additionally interesting to see how q is distinguished from the other categories, in which case the following models can be estimated:

(81)   $\ln\left(\frac{\Pr(Y_i = k)}{\Pr(Y_i = q)}\right) = \beta_k X$ ,  $\ln\left(\frac{\Pr(Y_i = m)}{\Pr(Y_i = q)}\right) = \beta_m X$ ,  $\ln\left(\frac{\Pr(Y_i = n)}{\Pr(Y_i = q)}\right) = \beta_n X$ .

Results for multinomial logit models aren't expressed as odds ratios, since odds ratios refer to the probability of an outcome divided by one minus that probability. Rather, multinomial results are expressed as a risk ratio, or relative risk, which is easily calculated by taking the exponential of the log risk-ratio. Where the log risk-ratio is

(82)   $\ln\left(\frac{\Pr(Y_i = m)}{\Pr(Y_i = k)}\right) = \beta_m X$ ,

the risk ratio is

(83)   $\frac{\Pr(Y_i = m)}{\Pr(Y_i = k)} = \exp(\beta_m X)$ .
The estimation of a multinomial logit model requires that the covariance between the error terms (relating each alternative) is zero. As a result, a critical assumption of the multinomial logit model is IIA, or the independence of irrelevant alternatives, which is as follows: if a chooser is comparing two alternatives according to a preference relationship, the ordinal ranking of these alternatives should not be affected by the addition or subtraction of other alternatives from the choice set. This is not always a reasonable assumption, however; when it is not, other multinomial models should be used, such as multinomial probit.
Appendix A: Significance Tests
Having already described regression analysis for continuous, binary, ordinal and categorical variables, this appendix provides some background: basic information on distributions, probability theory, and some of the standard tests of statistical significance.
Distribution Functions
We have thus far assumed a familiarity with distributions, probability functions,
and so on. Before we talk about significance tests, however, it may be worth
reviewing some of that material. Distributional characteristics can be important
to understanding social phenomena; understanding distributions is also
important to understanding hypothesis testing. So, let’s begin with a general
probability mass function,
(84)   $f(x) = p(X = x)$ ,
which assigns a probability for the random variable X equaling the specific
numerical outcome x. This is a general statement, which can then take on
different forms based on, for instance, levels of measurement (e.g., binary,
multinomial, or discrete).
PMFs are useful when we are dealing with discrete variables. If we are dealing with continuous variables, however, defining a given x becomes much less attractive (indeed, impossible if you consider an infinite number of decimal places). We accordingly replace the PMF with a PDF – a probability density function. There is a wide range of possible PDFs, of course. The exponential PDF, for instance, is as follows:

(85)   $f(x | \beta) = \frac{1}{\beta} \exp[-x/\beta]$ ,   $0 \le x < \infty$ , $0 < \beta$ .

Note that the exact form the function takes will vary based on a single parameter, β, the mean. Because the spread of the distribution is affected by values of β, it is referred to as the scale parameter.
[figure: exponential PDFs for different values of β]
Other distributions are more flexible. The gamma PDF includes an additional shape parameter, α, which affects the 'peakedness' of the distribution.

(86)   $f(x | \alpha, \beta) = \frac{1}{\Gamma(\alpha)\beta^{\alpha}} x^{\alpha - 1} \exp[-x/\beta]$ ,   $0 \le x < \infty$ , $0 < \alpha, \beta$ ,
where the mean is now αβ.[1] An important special case of the gamma PDF is the chi-square (χ²) distribution – a gamma where α = df/2 and β = 2, where df is a positive integer value called the degrees of freedom.
The Normal, or Gaussian, PDF is as follows,

(87)   $f(x | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{1}{2\sigma^2}(x - \mu)^2}$ ,   $-\infty < x, \mu < \infty$ , $0 < \sigma^2$ .

Here, the critical values are μ and σ², where the former defines the mean and the latter is the variance, which defines the dispersion. The normal distribution is just one of many location-scale distributions – referred to as such because one parameter, μ, moves only the location, and another parameter, σ², changes only the scale.

When μ = 0 and σ² = 1, we have what is called a standard normal distribution, which plays an important role in statistical theory and practice. The PDF for this distribution is a much-simplified version of Equation 87,

(88)   $f(x) = \frac{1}{\sqrt{2\pi}} \, e^{-\frac{1}{2}x^2}$ ,   $-\infty < x < \infty$ .

Note that simple mathematical transformations can convert any normal distribution into its standard form; probabilities can always be calculated, then, using this standard normal distribution.
There are many different PDFs, and finding the one that characterizes the distribution of a given variable can tell us much about the data-generating process. We can look at the degree to which budgetary data reflect incremental change, for instance (see work by Baumgartner and Jones). We also make assumptions about the distribution of our variables in the population when we select an estimation method. OLS assumes our variables are normally distributed; logit assumes the distribution of error in our latent variable follows the logistic PDF; probit assumes the distribution of error matches the normal distribution. Distributions – or, at least, assumptions about distributions – thus play a critical role in selecting an estimator, as well as in tests of statistical significance.
[1] Γ(α) here is an extension of the factorial function, so that it is defined for more than just non-negative integers. The factorial function of an integer n is written n!, and is equal to n × (n−1) × … × 1.
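For reference, all of these PDFs are available in scipy; here is a minimal sketch (mine, not from the notes) evaluating Equations 85–88 at a few points, with the parameter values chosen arbitrarily.

```python
# A minimal sketch (not from the notes): evaluating the PDFs above with scipy.
import numpy as np
from scipy import stats

x = np.array([0.5, 1.0, 2.0, 5.0])
print(stats.expon.pdf(x, scale=2.0))          # Equation 85 with beta = 2
print(stats.gamma.pdf(x, a=3.0, scale=2.0))   # Equation 86 with alpha = 3, beta = 2
print(stats.chi2.pdf(x, df=6))                # same as the gamma above (alpha = df/2, beta = 2)
print(stats.norm.pdf(x, loc=0, scale=1))      # Equations 87-88
```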
The chi-square test
Distributions are also critical to hypothesis testing. The standard chi-square (χ²) test is based on a chi-square distribution. To construct a χ² variable, begin with a normally distributed population of observations, having mean μY and variance σY². Then take a random sample of k cases, transform every observation into standardized (Z-score) form, and square it. That is, do this,

(89)   $Z_i^2 = \frac{(Y_i - \mu_Y)^2}{\sigma_Y^2}$ , and then this,

(90)   $Q = \sum_{i=1}^{k} Z_i^2$ ,

Q is distributed as chi-square with k degrees of freedom:

(91)   $Q \sim \chi_k^2$ .

The chi-square distribution is, in short, the distribution of a sum of squared normally distributed cases.
The PDF for a chi-square distribution is defined exclusively by a single degrees of freedom value (above, k). It looks like this,

(92)   $f(x | k) = \frac{(1/2)^{k/2}}{\Gamma(k/2)} x^{k/2 - 1} e^{-x/2}$ .
When the number of cases (k) is very small, a chi-square distribution is very skewed. Most of the cases in a normal distribution lie between −1 and +1, after all, so we shouldn't expect a chi-square value of much more than 1, and 0 (the mean Z-score in a normal distribution) is much more likely. As we add additional cases, however, the mean of a chi-square distribution increases – indeed, the mean will be equivalent to the number of cases (k), or rather, the degrees of freedom (df). And the variance of the chi-square is simple too – it's 2 × df.

When we use a chi-square statistic to look at the relationship in a crosstabulation of two categorical variables, you'll recall, the df is (R−1)(C−1), where R is the number of rows and C is the number of columns. And recall that when we are trying to exceed a given chi-square value (to refute the null hypothesis of no relationship), what we are trying to find is a value that is far enough along the tail of a (chi-square) distribution to be clearly different from the mean (which we now know to be equal to the df). That we want the statistic to exceed a given
value, based on a desired level of statistical significance, is of course not
exclusive to chi-square tests; the same idea is discussed further below.
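For the cross-tabulation case, a minimal sketch (the table below is invented; scipy’s chi2_contingency is one convenient implementation):

    import numpy as np
    from scipy.stats import chi2_contingency

    # hypothetical 2 x 3 cross-tabulation of two categorical variables
    table = np.array([[30, 45, 25],
                      [20, 35, 45]])

    chi2_stat, p_value, dof, expected = chi2_contingency(table)
    print(dof)                 # (R-1)(C-1) = (2-1)(3-1) = 2
    print(chi2_stat, p_value)  # reject the null of no relationship if p is small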
The t test
When we talk about a t-test, we are interested in a test of the null hypothesis that a given coefficient is not different from zero,

(93) H_0 : \beta_j = 0 ,

against an alternative, that is,

(94) H_1 : \beta_j \neq 0 .

The rejection rule for H_0 is as follows:

(95) |t_{\hat{\beta}_j}| > c ,

where c is a chosen critical value, and where

(96) t_{\hat{\beta}_j} = \frac{\hat{\beta}_j}{se(\hat{\beta}_j)} ,
which is of course just the ratio of the coefficient to its standard error. Now, c requires some description. For a standard, two-tailed (where we allow for the possibility that β is either positive or negative) t-test, c is chosen to make the area in each tail of the distribution equal to 2.5% – that is, it is chosen to find a middle range that equals 95% of the distribution. The value of c is then based on the t distribution, described above as a special case of the gamma distribution. The PDF for a t distribution is as follows,

(97) f(x \mid v) = \frac{\Gamma((v+1)/2)}{\sqrt{v\pi}\,\Gamma(v/2)} (1 + x^2/v)^{-(v+1)/2} ,

where v = n − 1.
€
The shape of a t distribution varies with sample size and sample standard
deviations. Like normal distributions, all t distributions are symmetrical, bell-shaped and have a mean of zero. While normal distributions are based on a
population variance, however, t distributions are based on a sample variance –
useful, since in most cases we do not know the population variance. T
distributions thus differ from normal distributions in at least one important way: a
t distribution has a larger variance than a normal distribution, since we believe
the sample variance underestimates the population variance. This difference
between the two distributions narrows as N increases, of course. And the bar for
achieving statistical significance is thus a little greater under a t distribution.
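A short sketch of this point (not from the original notes; the degrees of freedom are chosen arbitrarily), comparing two-tailed 95% critical values from the t and standard normal distributions:

    from scipy.stats import norm, t

    # two-tailed test at the 5% level: 2.5% in each tail
    c_normal = norm.ppf(0.975)        # about 1.96
    for df in (5, 30, 120):
        c_t = t.ppf(0.975, df)        # larger than 1.96, shrinking toward it as df grows
        print(df, round(c_t, 3), round(c_normal, 3))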
The F Test
The F test was designed to make inferences about the equality of the variances of two populations. It is based on an F distribution, which requires that random, independent samples be drawn from two normal populations that have the same variance (i.e., \sigma_{Y_1}^2 = \sigma_{Y_2}^2). An F ratio is then formed as the ratio of two chi-squares, each divided by its degrees of freedom:

(98) F = \frac{\chi_{v_1}^2 / v_1}{\chi_{v_2}^2 / v_2} .

This ratio, it turns out, is distributed as an F random variable with two different degrees of freedom. That is:

(99) F \sim F_{v_1, v_2} ,
where the F distribution itself is defined by two parameters – the two separate
degrees of freedom. It is non-symmetric and ranges across the nonnegative
numbers. Its shape depends on the degrees of freedom associated with both
the numerator and the denominator.
In regression analysis, the F test is a frequently used joint hypothesis test. It appears in the top right of all regression results in STATA, for instance, and is used there to test the null hypothesis that all coefficients in the model are not different from zero. That is, in the following regression,

(100) Y_i = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \varepsilon ,

an F test is used to test the possibility that

(101) H_0 : \beta_1 = 0, \beta_2 = 0, \beta_3 = 0 .
Of course, the test could also deal with a subset of these coefficients. We use
this example here simply because it speaks directly to the F test results in
STATA.
Testing this joint hypothesis is different from looking at the t-statistics for
each individual coefficient, since any particular t statistic tests a hypothesis that
puts no restrictions on the other parameters. Here, we can produce a single test
of these joint restrictions.
More general versions of Equations 100 and 101 are as follows. Begin with an unrestricted model,

(102) Y_i = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + ... + \beta_k X_k + \varepsilon ,

where the number of parameters in the model is k+1 (because we include the intercept). Suppose then that we have q exclusion restrictions to test: that is, the null hypothesis states that q of the variables have zero coefficients. If we assume that it is the last q variables that we are interested in (order doesn’t matter for estimation, but makes the following statement easier), then the general hypothesis is as follows,

(103) H_0 : \beta_{k-q+1} = 0, ..., \beta_k = 0 .

And Equation 102, with these restrictions imposed, now looks like this:

(104) Y_i = \alpha + \beta_1 X_1 + ... + \beta_{k-q} X_{k-q} + \varepsilon .

In short, it is the model with the last q coefficients excluded – effectively, then, restricting those last (now non-existent) coefficients to zero.
What we want to know now is whether there is a significant difference between
the restricted and unrestricted equations. If there is a difference, then the last q
coefficients matter; if there is no difference, then the null hypothesis that all
these q coefficients are not different than zero is supported.
To test this null hypothesis, we use the following F test:

(105) F \equiv \frac{(ESS_r - ESS_{ur}) / q}{ESS_{ur} / (n - k - 1)} ,

where ESS_r is the sum of squared residuals from the restricted model and ESS_{ur} is the sum of squared residuals from the unrestricted model. Both (ESS_r − ESS_{ur}) and ESS_{ur}, scaled by the error variance, are distributed as chi-square, and the degrees of freedom are q and n−k−1.
Again, in order to reject H0, we must find a value for F that, based on the PDF for an F distribution defined by the two degrees of freedom (here, q and n−k−1), suggests our value is significantly different from the mean – which, as it happens, will be v2/(v2−2); or, in this case, (n−k−1)/(n−k−3).
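Equation 105 is easy to compute directly. The sketch below (not from the notes; the data are simulated and the helper function is invented for illustration) fits the unrestricted and restricted models by least squares and compares them with an F test:

    import numpy as np
    from scipy.stats import f

    rng = np.random.default_rng(1)
    n, k, q = 200, 3, 2                          # k regressors; test the last q of them
    X = rng.normal(size=(n, k))
    y = 1 + 0.5 * X[:, 0] + rng.normal(size=n)   # in truth, only the first regressor matters

    def ess(y, X):
        # sum of squared residuals from an OLS fit (with an intercept)
        Xc = np.column_stack([np.ones(len(y)), X])
        beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
        resid = y - Xc @ beta
        return resid @ resid

    ess_ur = ess(y, X)                # unrestricted model: all k regressors
    ess_r = ess(y, X[:, :k - q])      # restricted model: last q regressors dropped

    F = ((ess_r - ess_ur) / q) / (ess_ur / (n - k - 1))
    print(F, f.sf(F, q, n - k - 1))   # F statistic and its p-value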
Appendix B: Factor Analysis
Background: Correlations and Factor Analysis
Factor analysis is a data reduction technique, used to explain variability among
observed random variables in terms of fewer unobserved random variables
called factors.
A factor analysis usually begins with a correlation matrix. Take, for instance, the
following 5 x 5 correlation matrix, R, for variables a through e (in rows j and
columns k):
R =
      a     b     c     d     e
a  1.00   .72   .63   .54   .45
b   .72  1.00   .56   .48   .40
c   .63   .56  1.00   .42   .35
d   .54   .48   .42  1.00   .30
e   .45   .40   .35   .30  1.00
This matrix is consistent with there being a single common factor, g, whose
correlations with the 5 observed variables are respectively .9, .8, .7, .6, and .5.
That is, if there were a variable g, correlated with a at .9, b at .8, and so on, we’d
get a correlation matrix exactly as above.
How can we tell? First, note that the correlation between a and b, where both are correlated with g, is going to be the product of the correlation between a and g, and the correlation between b and g. More precisely,

(106) r_{ab} = r_{ag} \times r_{bg} .

For the cell in row 1, column 2 of matrix R, then, we have the correlation between a and g (.9) and the correlation between b and g (.8), and the resulting correlation between a and b: .9 × .8 = .72. This works across the board for the above matrix – each off-diagonal cell is the product of the correlation between g and the variable in row j, and the correlation between g and the variable in column k.
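A quick check of this product rule (a sketch, not part of the original notes):

    import numpy as np

    # correlations of variables a-e with the single common factor g
    g = np.array([0.9, 0.8, 0.7, 0.6, 0.5])

    # r_jk = r_jg * r_kg for every pair of variables
    C = np.outer(g, g)
    print(np.round(C, 2))
    # the off-diagonal entries match matrix R (.72, .63, .54, ...);
    # the diagonal (.81, .64, ...) is the variance g explains in each variable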
This is the kind of pattern we’re interested in finding in a factor analysis – a
latent variable that captures the variance (and covariance) amongst a set of
measured variables.
Just like in a regression, when we partition the total variance (TSS) into the
explained (RSS) and error (ESS) variances, we can think of a factor analysis as
decomposing a correlation matrix R into a common portion C, and an
unexplained portion U, where,
(107) R = C + U , or U = R − C , and so on.

In fact, there can be several common portions – one for each of several latent factors which might account for common variance. So a more general model could be

(108) R = \sum_{q} C_q + U ,
where there are q common factors. In matrix R above, there is clearly just one common component. The common portion of the variance is thus captured in a single matrix, C1, as follows,

C1 =
      a     b     c     d     e
a   .81   .72   .63   .54   .45
b   .72   .64   .56   .48   .40
c   .63   .56   .49   .42   .35
d   .54   .48   .42   .36   .30
e   .45   .40   .35   .30   .25
where each cell for row j and column k is simply the product of (a) the correlation
between g and the variable j, and (b) the correlation between g and the variable
k. The off-diagonal entries in matrix C1 show the common variance between
two variables j and k that is captured by the latent factor g. The diagonal entries
show the amount of variance in a single variable explained by the latent factor g.
So, latent factor g, correlated with variable a at .9, explains .81 of the variance in
a. (This should make sense: recall that R² = r².) These diagonal values are
sometimes referred to as communality estimates – the proportion of variance in
a given y variable that is accounted for by a common factor.
The unexplained variance is then captured in matrix U, which is simply R − C1,

U =
      a     b     c     d     e
a   .19   .00   .00   .00   .00
b   .00   .36   .00   .00   .00
c   .00   .00   .51   .00   .00
d   .00   .00   .00   .64   .00
e   .00   .00   .00   .00   .75
Diagonal entries here show the proportion of variance that is unique in each
variable – that is, the variance that is not accounted for by the latent factor g.
Off-diagonal entries show the covariance between variables j and k that is not
accounted for by factor g. In this case, g perfectly captures the covariance
between all variables, so the off-diagonal entries are all zero.
In most cases, however, a factor will not perfectly capture these covariances.
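The decomposition itself is just a couple of lines (again a sketch, using the single-factor example above):

    import numpy as np

    g = np.array([0.9, 0.8, 0.7, 0.6, 0.5])   # loadings on the single factor g

    R = np.outer(g, g)
    np.fill_diagonal(R, 1.0)                  # observed correlation matrix (unit diagonal)

    C = np.outer(g, g)                        # common portion due to g
    U = R - C                                 # unexplained portion

    print(np.round(np.diag(U), 2))            # uniqueness: .19, .36, .51, .64, .75
    print(np.allclose(U - np.diag(np.diag(U)), 0))   # off-diagonals of U are all zero here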
An Algebraic Description
The matrix stuff is a useful background to factor analysis, but it isn’t the easiest
way to understand where exactly factor analysis results come from.
The
standard algebraic description of factor analysis is a little more helpful there.
Imagine that we have a number of measured outcome variables, y1, y2, … yn,
and we believe these are systematically related to some mysterious set of latent,
that is, unmeasured, factors. More systematically, our y variables are related to a
set of functions,
(109) y_1 = \omega_{11} F_1 + \omega_{12} F_2 + ... + \omega_{1m} F_m ,
y_2 = \omega_{21} F_1 + \omega_{22} F_2 + ... + \omega_{2m} F_m ,
…
y_n = \omega_{n1} F_1 + \omega_{n2} F_2 + ... + \omega_{nm} F_m ,

where F are not variables but functions of unmeasured variables, and ω are the weights attached to each function (and where n is the number of y variables and m is the number of functions). Note that only the y variables are known – the entire RHS of each model has to be estimated.
In short, the basic factor model assumes that the variables (y) are additive composites of a number of weighted (ω) factors (F). The factor loadings emerging from a factor analysis are the weights ω (sometimes referred to as the constants); the factors are the F functions. The size of each loading for each factor measures how much that specific function is related to y. And factor scores – which can be generated for every value of the y variables – are the predictions based on the results of equation 109.2
2 Note that there is no assumption here that the variables y are linearly related – only that each is linearly related to the factors.

What factor analysis does is discover the unknown F functions and ω weights. In order to do so, we need to impose some assumptions – otherwise, there are an infinite number of solutions for equation 109. And different factor analytic procedures vary mainly in the assumptions they impose in order to estimate the various ω and F.3
To restate things: in estimating a factor analysis the y variables are defined as linear functions of the weighted F factors. Indeed, all y variables’ statistics (mean, variance, correlations) are defined as functions of the weights and factors. So, for instance, the mean of y1 can be expressed as follows,

(110) \bar{y}_1 = \sum_{m=1}^{f} \omega_m \bar{F}_m ,

where the mean of y1 is equal to the sum, across all f factors, of each individual factor mean (F̄_m) multiplied by its corresponding weight (ω_m). By defining all variables’ statistics in this way, it becomes possible to estimate a set of weights and factors that account for the variances and covariances amongst the y variables.
Note that equation 109 sets out the model for what is referred to as a full component factor model – a model in which the y variables are perfectly (that is, completely) a function of a set of latent factors. Usually, we use a common factor model – one in which there is a set of common factors, but also a degree of unique variance, or uniqueness, for each variable. (In fact, this uniqueness can be viewed as a separate unique factor for each y variable.) This common factor model can be expressed with a simple addition to equation 109,

(111) y_n = \omega_{n1} F_1 + \omega_{n2} F_2 + ... + \omega_{nm} F_m + \omega_{nu} U_n ,

where U captures the unique variance in y_n. Note that the main difference between the full component and the common factor approach is that the former assumes that the diagonal elements of matrix C are equal to one – that all the variance in each variable is accounted for by the F factors. The common factor model, which allows for unique variance in each variable, unaccounted for by the factors, makes no such assumption.
You also see ‘principal components’ factor analysis used in the literature. This method extracts the principal factors (those ‘best’ capturing covariance amongst variables) from a component model. So factors are estimated until they account for all the variance of each variable – this can of course mean many factors. But only those factors which are sufficiently common are reported.

3 In standard models the two critical assumptions are: (1) variables can be calculated from the factors by multiplying each factor by the appropriate weight and summing across all factors, and (2) all factors have a mean of zero and a standard deviation of one. Other assumptions can include, for instance, whether or not, or to what degree, the estimated F factors can be correlated.
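As a rough illustration of component-style extraction (a sketch only, not the notes’ own procedure, using the correlation matrix R from above), loadings can be formed from an eigendecomposition as eigenvector × √eigenvalue; the first column then approximates, though does not exactly reproduce, the single-factor pattern, since R’s unit diagonal also contains unique variance:

    import numpy as np

    g = np.array([0.9, 0.8, 0.7, 0.6, 0.5])
    R = np.outer(g, g)
    np.fill_diagonal(R, 1.0)                   # the 5 x 5 correlation matrix R from above

    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
    loadings = eigvecs[:, order] * np.sqrt(eigvals[order])

    first = loadings[:, 0]
    first = first * np.sign(first.sum())       # the sign of a loading column is arbitrary
    # descends in the same order as the true correlations (.9, .8, ..., .5),
    # though somewhat inflated by the unit diagonal
    print(np.round(first, 2))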
Factor Analysis Results
The results of a factor analysis generally look like this (from R.J. Rummel):
[Table omitted: factor loadings for ten national characteristics across four factors, with an h2 (communality) column and a percent-of-total-variance row; from R.J. Rummel.]
where there are a number of F factors across the top, measured y variables in each row, and factor loadings (ω) in each cell.

The number of factors (columns) is the number of substantively meaningful independent (uncorrelated) patterns of relationship among the variables.

The loadings, ω, measure which variables are involved in which factor pattern and to what degree. The square of the loading multiplied by 100 equals the percent variation that a variable has in common with a given latent variable.
The first factor pattern delineates the largest pattern of relationships in the data;
the second delineates the next largest pattern that is independent of
(uncorrelated with) the first; the third pattern delineates the third largest pattern
that is independent of the first and second; and so on. Thus the amount of
variation in the data described by each pattern decreases successively with each
factor; the first pattern defines the greatest amount of variation, the last pattern
the least. Note that these initial, unrotated factor patterns are uncorrelated with
each other.
The column headed h2 displays the communality of each variable. This is the
proportion of a variable's total variation that is involved in the patterns. The
coefficient (communality) shown in this column, multiplied by 100, gives the
percent of variation of a variable in common with each pattern. The h2 value for
a variable is calculated by summing the squares of the variable's loadings. Thus
for power in the above table we have (.58)² + (−.42)² + (−.42)² + (.43)² = .87.
This communality may also be looked at as a measure of uniqueness. By
subtracting the percent of variation in common with the patterns from 100, the
uniqueness of a variable is determined. This indicates to what degree a variable
is unrelated to the others – to what degree the data on a variable cannot be
derived from (predicted from) the data on the other variables.
The ratio of the sum of the values in the h2 column to the number of variables,
multiplied by 100, equals the percent of total variation in the data that is
patterned. Thus it measures the order, uniformity, or regularity in the data. As
can be seen in the above table, for the ten national characteristics the four
patterns involve 80.1 percent of the variation in the data. That is, we could
reproduce 80.1 percent of the relative variation among the fourteen nations on
these ten characteristics by knowing the nation scores on the four patterns.
At the foot of the factor columns in the table, the percent of total variance
figures show the percent of total variation among the variables that is related to
a factor pattern. This figure thus measures the relative variation among the
fourteen nations in the original data matrix that can be reproduced by a pattern:
it measures a pattern's comprehensiveness and strength. The percent of total
variance figure for a factor is determined by summing the column of squared
loadings for a factor, dividing by the number of variables, and multiplying by
100.
The eigenvalues equal the sum of the column of squared loadings for each
factor. They measure the amount of variation accounted for by a pattern.
Dividing the eigenvalues either by the number of variables or by the sum of h2
values and multiplying by 100 determines the percent of either total or common
variance, respectively.
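These quantities are simple to compute from a loading matrix. A minimal sketch (the loadings below are invented, not Rummel’s):

    import numpy as np

    # hypothetical loading matrix: 5 variables (rows) by 2 factors (columns)
    L = np.array([[ 0.70,  0.20],
                  [ 0.65, -0.30],
                  [ 0.60,  0.10],
                  [ 0.15,  0.80],
                  [ 0.10,  0.75]])

    h2 = (L ** 2).sum(axis=1)                    # communality: row sums of squared loadings
    eigenvalues = (L ** 2).sum(axis=0)           # column sums of squared loadings
    pct_total = 100 * eigenvalues / L.shape[0]   # percent of total variance per factor
    pct_common = 100 * eigenvalues / h2.sum()    # percent of common variance per factor

    print(np.round(h2, 2))
    print(np.round(eigenvalues, 2))
    print(np.round(pct_total, 1), np.round(pct_common, 1))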
Rotated Factor Analyses
The unrotated factors successively define the most general patterns of
relationship in the data. Not so with the rotated factors. They delineate the
distinct clusters of relationships, if such exist.
The best way to understand ‘rotation’ is to think about a geometric
representation of a factor model. Take as an example the following hypothetical
results:

Variable   Factor 1   Factor 2   Uniqueness
1             0.7        0.7        0.98
2            -0.3        0.2        0.13
3             1.0        0.0        1.00
4            -0.6        0.8        0.10
We can represent these results geometrically, in the following way,
[Figure omitted: the two factors drawn as perpendicular axes, with each variable plotted at its loadings.]
where the two factors are axes, and the variables are located at the appropriate
place (their loading) on each axis.
‘Rotating’ a factor analysis, then, is a process of taking the structure discovered
in the unrotated analysis – the structure of axes in the diagram above, for
instance – and rotating – essentially via a re-estimation of factor loadings – that
structure. The aim is to produce a structure that better distinguishes between
the factors – more precisely, that better distinguishes variables loading on one
factor from variables loading on another.
Here’s an example with around 20 variables plotted across two factors:
[Figure omitted: roughly 20 variables plotted across two factor axes, before and after rotation.]
There are two different kinds of rotations – orthogonal and oblique. The former
require that the factors are orthogonal – that is, uncorrelated – while the latter
has no such restriction. Orthogonal is more typical; it’s the example above
(where the axes remain perpendicular to each other); it’s also the default in
STATA. That said, for many the purpose of rotating is to allow for correlation
amongst factors.
There are also different rotation criteria. For instance: varimax rotations maximize the variance of the squared loadings within factors – the columns of the loading matrix; quartimax rotations maximize the variance of the squared loadings within variables – the rows of the loading matrix.
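In practice one would use the rotation options built into STATA or a statistics library; the sketch below writes out the standard (Kaiser) varimax algorithm from scratch, purely for illustration, and applies it to the hypothetical loadings from the table above.

    import numpy as np

    def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
        # orthogonal rotation of a loading matrix (Kaiser's varimax when gamma = 1)
        p, k = loadings.shape
        rotation = np.eye(k)
        d = 0.0
        for _ in range(max_iter):
            rotated = loadings @ rotation
            grad = loadings.T @ (rotated ** 3 -
                                 (gamma / p) * rotated @ np.diag((rotated ** 2).sum(axis=0)))
            u, s, vt = np.linalg.svd(grad)
            rotation = u @ vt                    # nearest orthogonal rotation
            d_new = s.sum()
            if d_new < d * (1 + tol):            # stop once the criterion stops improving
                break
            d = d_new
        return loadings @ rotation

    L = np.array([[0.7, 0.7], [-0.3, 0.2], [1.0, 0.0], [-0.6, 0.8]])
    # rotated loadings: varimax pushes each variable toward loading mainly on one factor
    print(np.round(varimax(L), 2))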
Appendix B: Taking Time Seriously
Time series data usually violates one of the critical assumptions of OLS
regression – that the disturbances are uncorrelated. That is, it is normally
assumed that errors will be distributed randomly over time:
[Figure omitted: residuals εt scattered randomly over time t.]
Note the subscript t, which denotes a given time period. (Our cases are now
defined by time units, say, months, rather than individuals.)
Positively autocorrelated variables usually lead to positively correlated residuals,
which might look something like this:
[Figure omitted: positively autocorrelated residuals over time.]
This figure shows a typical first-order autoregressive process, where

(112) \varepsilon_t = \rho \varepsilon_{t-1} + v ,

where εt is the error term, ρ is a coefficient, and v is the remaining, random error. A consequence of this process tends to be that variances and standard errors are underestimated; we are thus more likely to find significant results. It is important, then, that we understand fully the extent and nature of autocorrelation in time series data, and try to account for this in our models.
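Simulating equation 112 takes only a few lines (a sketch; ρ and the series length are arbitrary):

    import numpy as np

    rng = np.random.default_rng(7)
    T, rho = 200, 0.8

    # first-order autoregressive errors: e_t = rho * e_{t-1} + v_t
    v = rng.normal(size=T)
    e = np.zeros(T)
    for t in range(1, T):
        e[t] = rho * e[t - 1] + v[t]

    # the lag-1 autocorrelation of the simulated errors sits near rho
    print(round(float(np.corrcoef(e[1:], e[:-1])[0, 1]), 2))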
Take a standard nonautoregressive model,

(113) Y_t = \alpha + \beta X_t + \varepsilon_t .

If it is true that Xt is highly correlated with Xt−1, as is the case with many time series, then it will also tend to be true that εt is correlated with εt−1. This
presents serious estimation difficulties, and most time series methods are all
about solving this problem. Essentially, these methods (like the Durbin 2-stage
method, the Cochrane-Orcutt transformation, or Prais-Winsten estimations) are
all about pulling the autocorrelation out of the error term – that is, pulling it out
of the error term and into the estimated coefficients. (Indeed, it’s a little like
missing variable bias – there is some missing variable, the absence of which
leads to a particular type of heteroskedasticity; so long as it’s included, though,
the model will be fine.)
There are ways to do this that are much simpler (and sometimes more effective)
than complicated transformations. For a terrific description of modeling
strategies for time series, see Pickup’s Introduction to Time Series Analysis
(Sage). Below, we look briefly at univariate statistics.
Univariate Statistics
To explore the extent of autocorrelation in a single time series, we use
autocorrelation functions (ACFs) and partial autocorrelation functions (PACFs).
ACFs plot the average correlation between xt and xt-1, xt and xt-2, and so on.
PACFs provide a measure of correlation between observations k units apart after
the correlation at intermediate lags has been controlled for, or ‘partialed out’.
So, where a typical correlation coefficient (between two variables, x and y) looks like this:

(114) r_{xy} = \frac{\sum_t (x_t - \bar{x})(y_t - \bar{y})}{\sqrt{\sum_t (x_t - \bar{x})^2 \sum_t (y_t - \bar{y})^2}} ,

an ACF of error terms j units apart looks like this:

(115) r_j = \frac{\sum_{t=j+1}^{T} (e_t - \bar{e})(e_{t-j} - \bar{e})}{\sum_{t=1}^{T} (e_t - \bar{e})^2} .
Often, we examine autocorrelation using a correlogram, which for a standard
first-order process looks something like this, for positive correlation:
[Figure omitted: correlogram for a positive first-order autoregressive process.]
and this for negative autocorrelation:
[Figure omitted: correlogram for a negative first-order autoregressive process.]
A combination of ACF and PACF plots usually gives us a good sense for the
magnitude and structure of autocorrelation in a single time series. These are
important to our understanding of how an individual time series works (i.e., just
how much does one value depend on previous values?). They also point towards how many, or which, lags will be required in multivariate models.
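With statsmodels installed, ACFs and PACFs are one line each (a sketch, reusing a simulated AR(1) series like the one above):

    import numpy as np
    from statsmodels.tsa.stattools import acf, pacf

    rng = np.random.default_rng(7)
    T, rho = 300, 0.8
    x = np.zeros(T)
    v = rng.normal(size=T)
    for t in range(1, T):
        x[t] = rho * x[t - 1] + v[t]

    # lag-0 terms are 1 by construction
    print(np.round(acf(x, nlags=5), 2))    # decays geometrically: roughly rho, rho^2, ...
    print(np.round(pacf(x, nlags=5), 2))   # large at lag 1, near zero afterwards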
Bivariate Statistics
A first test of the relationship between two time series usually takes the form of a
cross-correlation function (CCF), or cross-correlogram, which displays the
average correlations between xt and yt, xt and yt−1, and so on (and, conversely, between yt and xt−1, and so on)…
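A cross-correlogram can be built up in the same spirit (a sketch; the helper below simply correlates one series with lags of the other):

    import numpy as np

    def cross_corr(x, y, lag):
        # correlation between x_t and y_{t-lag}, for lag >= 0
        if lag == 0:
            return np.corrcoef(x, y)[0, 1]
        return np.corrcoef(x[lag:], y[:-lag])[0, 1]

    # hypothetical series: y responds to x with a one-period delay
    rng = np.random.default_rng(3)
    x = rng.normal(size=300)
    y = np.roll(x, 1) * 0.8 + rng.normal(size=300) * 0.5
    print([round(cross_corr(y, x, lag), 2) for lag in range(4)])   # peaks at lag 1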