Introduction to Regression and Data Analysis
Statlab Workshop
February 28, 2003
Tom Pepinsky and Jennifer Tobin
I. Regression: An Introduction
A. What is regression?
The idea behind regression in the social sciences is that the researcher would like to find
the relationship between two or more variables. Regression is a statistical technique that
allows the scientist to examine the existence and extent of this relationship. The logic is that, given
either the entire population or a random sample of sufficient size, the researcher can
mathematically recover the parameters that describe the relationships between the variables.
Once the researcher has established such a relationship, she can then use these parameters to
predict values of the dependent variable for new values of the independent variable. Regression does not make
any specifications about the way that the independent variables are distributed or
measured (discrete, continuous, binary, etc.), but in order for regression to be the
appropriate technique, the Gauss-Markov assumptions must be fulfilled.
In its simplest (bivariate) form, regression shows the relationship between one
independent variable (X) and a dependent variable (Y). The magnitude and direction of
that relation are given by a parameter (β1), and an intercept term (β0) captures the status
of the dependent variable when the independent variable is absent. A final error term (u)
captures the amount of variation that is not predicted by the slope and intercept terms.
The coefficient of determination (R2) shows how well the model fits the data. More
sophisticated forms of regression allow for more independent variables, interactions
between the independent variables, and other complexities in the way that one variable
affects another.
Regression thus shows us how variation in one variable co-occurs with variation in
another. What regression cannot show is causation; causation is only demonstrated
analytically, through substantive theory. For example, a regression with shoe size as an
independent variable and foot size as a dependent variable would show a very high
regression coefficient and highly significant parameter estimates, but we should not
conclude that higher shoe size causes higher foot size. All that the mathematics can tell
us is whether or not they are correlated, and if so, by how much.
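To make the idea of recovering parameters concrete, here is a minimal Python sketch (the parameter values and variable names are hypothetical, chosen only for illustration): we generate data from a known bivariate relationship and then recover the slope and intercept from the sample.

    import numpy as np

    rng = np.random.default_rng(0)
    beta0, beta1 = 2.0, 0.5            # hypothetical "true" population parameters
    x = rng.normal(size=1000)          # independent variable
    u = rng.normal(size=1000)          # error term
    y = beta0 + beta1 * x + u          # the process generating the dependent variable

    # Recover the parameters from the sample (the OLS formulas discussed later)
    b1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)
    b0 = y.mean() - b1 * x.mean()
    print(b0, b1)                      # estimates close to 2.0 and 0.5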
B. Difference between correlation and regression
It is important to recognize that regression analysis is fundamentally different from
ascertaining the correlations among different variables. Correlation can tell you how the
values of your variables co-vary, but regression analysis is aimed at making a stronger
claim: modeling how one variable, your independent variable, affects another
variable, your dependent variable. Correlation measures the strength of the relationship
between variables, while regression attempts to describe the form of that relationship. Of
course, regression may also pick up what is called "spurious correlation," where the
co-variation of two variables suggests a causal relationship that does not exist. For
example, we might find that there is a significant relationship
between being a basketball player and being tall. Of course, being a basketball player
does not cause one to become taller; the relationship is almost certainly the opposite. It is
important to recognize that regression analysis cannot itself establish causation, only
describe correlation. Causation is established through theory.
SPSS syntax: Analyze: correlate: bivariate:
Correlations
                           X           Y
X   Pearson Correlation    1           -.954(**)
    Sig. (2-tailed)        .           .001
    N                      7           7
Y   Pearson Correlation    -.954(**)   1
    Sig. (2-tailed)        .001        .
    N                      7           7
** Correlation is significant at the 0.01 level (2-tailed).
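The same Pearson correlation can be computed outside of SPSS; here is a minimal Python sketch (assuming numpy is available) using the house-price data introduced later in the handout:

    import numpy as np

    price = np.array([160, 180, 200, 220, 240, 260, 280], dtype=float)
    sales = np.array([126, 103, 82, 75, 82, 40, 20], dtype=float)

    r = np.corrcoef(price, sales)[0, 1]
    print(round(r, 3))    # about -0.954, matching the SPSS output above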
II. Before We Get Started: The Basics
A. Your variables may take several forms, and it will be important later that you are
aware of, and understand, the nature of your variables. The following variables are those
which you are most likely to encounter in your research.

• Categorical variables
  Such variables include any sort of measure that is "qualitative" or otherwise not
  amenable to actual quantification. There are a few subclasses of such variables.
  - Dummy variables take only two possible values, 0 and 1. They signify conceptual
    opposites: war vs. peace, fixed exchange rate vs. floating exchange rate, etc.
  - Nominal variables can range over any number of non-negative integers. They signify
    conceptual categories that have no inherent relationship to one another: red vs. green
    vs. black, Christian vs. Jewish vs. Muslim, etc.
  - Ordinal variables are like nominal variables, only there is an ordered relationship
    among them: no vs. maybe vs. yes, etc.

• Numerical variables
  Such variables describe data that can be readily quantified. Like categorical variables,
  there are a few relevant subclasses of numerical variables.
  - Continuous variables can appear as fractions; in principle, they can take an infinite
    number of values. Examples include temperature, GDP, etc.
  - Discrete variables can only take the form of whole numbers. Most often, these appear
    as count variables, signifying the number of times that something occurred: the number
    of firms invested in a country, the number of hate crimes committed in a county, etc.
When you begin a statistical analysis of your data, a useful starting point is to get a
handle on your variables. Are they qualitative or quantitative? If they are the latter, are
they discrete or continuous? Another useful practice is to ascertain how your data are
distributed. Do your variables all cluster around the same value, or do you have a large
amount of variation in your variables? Are they normally distributed, or not?
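If your data are in a spreadsheet-like format, a quick way to answer these questions is to load them into a pandas DataFrame and inspect the types and distributions. This is only a sketch with made-up column names and values:

    import pandas as pd

    # Hypothetical data: one dummy variable and one continuous variable
    df = pd.DataFrame({
        "war": [0, 1, 0, 0, 1],            # dummy variable
        "gdp": [1.2, 0.8, 2.4, 3.1, 0.5],  # continuous variable
    })

    print(df.dtypes)                 # qualitative vs. quantitative?
    print(df.describe())             # do values cluster, or vary widely?
    print(df["war"].value_counts())  # how are the categories distributed?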
B. We are only going to deal with the linear regression model
The simple (or bivariate) LRM is designed to study the relationship between a pair
of variables that appear in a data set. The multiple LRM is designed to study the
relationship between one variable and several other variables.
In both cases, the sample is considered a random sample from some population. The two
variables, X and Y, are two measured outcomes for each observation in the data set. For
example, let's say that we had data on the prices of homes for sale and the actual number
of sales of new homes:
Price (thousands of $)   Sales of new homes
X                        y
160                      126
180                      103
200                      82
220                      75
240                      82
260                      40
280                      20
And we want to know the relationship between X and Y. Well, what does our data look like?
SPSS syntax: graph: scatter: simple: enter x and y
[Scatter plot of sales of new homes (y, from 20 to 126) against price (x, from 160 to 280).]
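Outside of SPSS, the same scatter plot can be drawn with a few lines of Python (a sketch, assuming matplotlib is installed):

    import numpy as np
    import matplotlib.pyplot as plt

    price = np.array([160, 180, 200, 220, 240, 260, 280])
    sales = np.array([126, 103, 82, 75, 82, 40, 20])

    plt.scatter(price, sales)
    plt.xlabel("Price (thousands of $)")
    plt.ylabel("Sales of new homes")
    plt.show()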
We need to specify the population regression function, the model we specify to study the
relationship between X and Y.
This is written in any number of ways, but we will specify it as:
Yi = β0 + β1Xi + ui
where
• Y is an observed random variable (also called the endogenous variable, or the left-hand-side variable).
• X is an observed non-random or conditioning variable (also called the exogenous or right-hand-side variable).
• β0 is an unknown population parameter, known as the constant or intercept term.
• β1 is an unknown population parameter, known as the coefficient or slope parameter.
• u is an unobserved random variable, known as the disturbance or error term.
Once we have specified our model, we can accomplish two things:
• Estimation: How do we get "good" estimates of β0 and β1? What assumptions about the PRF make a given estimator a good one?
• Inference: What can we infer about β0 and β1 from sample information? That is, how do we form confidence intervals for β0 and β1 and/or test hypotheses about them?
The answer to these questions depends upon the assumptions that the linear regression model
makes about the variables.
The Ordinary Least Squares (OLS) regression procedure will compute the values of the
parameters β0 and β1 (the intercept and slope) that best fit the observations.
We want to fit a straight line through the data, from our example above, that would look like this:
[Scatter plot of sales (y) against price (x) with the fitted regression line; the vertical distance between each point and the line is the error ui.]
Obviously, no straight line can exactly run through all of the points. The vertical distance
between each observation and the line that fits “best”—the regression line—is called the error.
The OLS procedure calculates our parameter values by minimizing the sum of the squared errors
for all observations.
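Written out, the quantity that OLS minimizes is the sum of squared residuals:

    \min_{b_0,\, b_1} \sum_{i=1}^{n} \hat{u}_i^{\,2} \;=\; \min_{b_0,\, b_1} \sum_{i=1}^{n} \bigl(y_i - b_0 - b_1 x_i\bigr)^2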
Why OLS? It is considered the most reliable method of estimating linear relationships between
economic variables. It can be used in a variety of settings, but only when the following
assumptions are met:
C. Assumptions of the linear regression model (The Gauss-Markov Theorem)
The Gauss-Markov Theorem is essentially a claim about the ability of regression to
assess the relationship between a dependent variable and one or more independent
variables. The Gauss-Markov Theorem, however, requires that for all Yi, Xi, the
following conditions are met:
1) The conditional expectation of Y is an unchanging linear function of known
independent variables. That is, Y is generated through the following process:
Yi = β0 + β1X1i + ... + βkXki + εi
In the simple regression model, the dependent variable is assumed to be a
function of one or more independent variables plus an error introduced to
account for all other factors. In the regression equation specified above, Yi is
the dependent variable, X1i, ..., Xki are the independent or explanatory
variables, and εi is the disturbance or error term. The goal of regression
analysis is to obtain estimates of the unknown parameters β1, ..., βk which
indicate how a change in one of the independent variables affects the values
taken by the dependent variable. Note that the model also assumes that the
relationships between the dependent variable and the independent variables
are linear.
Examples of violations: a non-linear relationship between the variables, including
the wrong variables in the model
2) All X’s are fixed in repeated samples
The Gauss-Markov Theorem also assumes that the independent variables are
non-random. In an experiment, the values of the independent variable would
be fixed by the experimenter and repeated samples could be drawn with the
independent variables fixed at the same values in each sample. As a
consequence of this assumption, the independent variables will in fact be
independent of the disturbance. For non-experimental work, this will need to
be assumed directly along with the assumption that the independent variables
have finite variances.
Examples of violations: endogeneity, measurement error, autoregression
3) The expected value of the disturbance term is zero.
E[εi] = Cov[Xi, εi] = 0
The disturbance terms in the linear model above must also satisfy some
special criteria. First, the Gauss-Markov Theorem assumes that the expected
value of the disturbance term is zero. This means that on average, the errors
balance out.
Examples of violations: expected value of disturbance term is not zero
4) Disturbances have uniform variance and are uncorrelated
V[εi] = E[εi²] = σ²
Cov[εi, εj] = E[εi εj] = 0 for all i ≠ j
The Gauss-Markov Theorem further assumes that the variance of the error
term is a constant for all observations and in all time periods. Formally, this
assumption implies that the error is homoskedastic. If the variance of the
error term is not constant, then the error terms are heteroskedastic. Finally,
the Gauss-Markov Theorem assumes that the error terms are uncorrelated. More
specifically, it assumes that the values of the error term for different observations,
or at different time periods, are independent of each other.
Examples of violations: heteroskedasticity, serial correlation of error terms
5) No exact linear relationship between independent variables, and more
observations than independent variables
|corr(Xi, Xj)| < 1 for all i ≠ j
Σ(Xt - X̄)² ≠ 0, summed over t = 1, ..., T
n ≥ k + 1
The independent variables must be linearly independent of one another. That
is, no independent variable can be expressed as a non-zero linear combination
of the remaining independent variables. There also must be more
observations than there are independent variables in order to ensure that there
are enough degrees of freedom for the model to be identified.
Examples of violations: multicollinearity, micronumerosity
If the five Gauss-Markov Assumptions listed above are met, then the Gauss-Markov
Theorem states that the Ordinary Least Squares regression estimator bi is the Best Linear
Unbiased Estimator of βi (OLS is BLUE). The OLS estimators are unbiased and have the
smallest variance in the class of linear unbiased estimators. The formula for
deriving the OLS estimator of βi is as follows.
Scalar form (with xi and yi measured as deviations from their means): b = Σ xiyi / Σ xi², summed over i = 1, ..., n
Matrix form: b = (X'X)⁻¹X'Y
The point of the regression equation is to find the best fitting line relating the variables to
one another. In this enterprise, we wish to minimize the sum of the squared deviations
(residuals) from this line. OLS will do this better than any other process as long as these
conditions are met.
III. Now, Regression Itself
A. So, now that we know the assumptions of the OLS model, how do we estimate β0 and β1?
The Ordinary Least Squares estimates of β0 and β1 are defined as the particular values of β0
and β1 that minimize the sum of squared errors for the sample data.
The best fit line associated with the n points (x1, y1), (x2, y2), . . . , (xn, yn) has the form
y = mx + b
where
slope = m = [n(Σxy) - (Σx)(Σy)] / [n(Σx²) - (Σx)²]
intercept = b = [Σy - m(Σx)] / n
So we can take our data from above and substitute in to find our parameters:
       Price (thousands of $)   Sales of new homes   xy        x²
       X                        y
       160                      126                  20,160    25,600
       180                      103                  18,540    32,400
       200                      82                   16,400    40,000
       220                      75                   16,500    48,400
       240                      82                   19,680    57,600
       260                      40                   10,400    67,600
       280                      20                   5,600     78,400
Sum    1,540                    528                  107,280   350,000
slope = m = [n(Σxy) - (Σx)(Σy)] / [n(Σx²) - (Σx)²] = [7(107,280) - (1,540)(528)] / [7(350,000) - (1,540)²] = -62,160 / 78,400 = -0.79286
intercept = b = [Σy - m(Σx)] / n = [528 - (-0.79286)(1,540)] / 7 = 249.8571
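The same hand calculation can be checked in a few lines of Python (a sketch, assuming numpy is available):

    import numpy as np

    x = np.array([160, 180, 200, 220, 240, 260, 280], dtype=float)   # price
    y = np.array([126, 103, 82, 75, 82, 40, 20], dtype=float)        # sales

    n = len(x)
    m = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x ** 2) - np.sum(x) ** 2)
    b = (np.sum(y) - m * np.sum(x)) / n
    print(m, b)    # approximately -0.79286 and 249.857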
And now in SPSS: Analyze: Regression: Linear: statistics: confidence interval: Dependent
variable:Y Independent Variables: Xs:
Model Summary
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .954(a)   .911       .893                11.75706
a Predictors: (Constant), X
Coefficients(a)
                     Unstandardized Coefficients   Standardized Coefficients                       95% Confidence Interval for B
Model                B          Std. Error         Beta                        t         Sig.      Lower Bound    Upper Bound
1   (Constant)       249.857    24.841                                         10.058    .000      186.000        313.714
    Price of house   -.793      .111               -.954                       -7.137    .001      -1.078         -.507
a Dependent Variable: # houses sold
Thus our least squares line is
y = 249.857 - 0.79286x
Interpreting data:
Let’s take a look at the regression diagnostics from our example above:
Explaining the coefficient: for every one-unit ($1,000) increase in the price of a house, 0.793 fewer
houses are sold. This doesn't make intuitive sense in this case, because you can't sell 0.793 of a
house, but we could imagine this to be true for a continuous variable.
But, how do we know that our coefficient estimate is meaningful? We can do a test of statistical
significance—usually we want to know if our coefficient estimate is statistically different from
zero: this is called a t-test.
Formally, we say:
H0: Bprice of house = 0. In other words, the price of a house has no effect on house sales.
What we hope to do is to reject this null hypothesis.
Explaining the t-statistic: Our t-statistic is simply our coefficient estimate divided by its standard
error. The next step is to compare this statistic to its critical value, which can be found in a table
of t-statistics in any statistics textbook; for a large enough sample, most researchers use the 95%
confidence level, with the associated critical value of 1.96. Thus, for any t-statistic whose absolute
value is below 1.96, we cannot reject the null hypothesis that our coefficient is equal to 0. Note:
we never say that we can accept a hypothesis; we can only reject, or fail to reject, hypotheses
about coefficient estimates.
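For the price coefficient in the output above, this is simply
t = -0.793 / 0.111 ≈ -7.14,
which matches the value of -7.137 reported by SPSS (the small difference comes from rounding of the displayed coefficient and standard error).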
p-value (in SPSS, "Sig."):
p = .001: the probability that we would have obtained an estimate of B this far from zero if its true value were 0.
Confidence interval:
95% confidence interval: -1.078 to -0.507, computed as B̂ ± c·se(B̂), where c is the critical t-value.
Under repeated sampling, intervals constructed in this way would contain the true B 95% of the time.
R-squared:
Finally, we want to look at the R-squared statistic from our model summary statistics above.
Formally,
R² = 1 - RSS/TSS = 0.911: the relative amount of variance in Y explained by our independent variable X.
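All of these diagnostics (coefficients, standard errors, t-statistics, p-values, confidence intervals, and R-squared) can also be reproduced outside of SPSS; here is a minimal sketch using Python's statsmodels package:

    import numpy as np
    import statsmodels.api as sm

    price = np.array([160, 180, 200, 220, 240, 260, 280], dtype=float)
    sales = np.array([126, 103, 82, 75, 82, 40, 20], dtype=float)

    X = sm.add_constant(price)     # adds the intercept term
    fit = sm.OLS(sales, X).fit()
    print(fit.summary())           # coefficients, t, Sig., 95% CIs, R-squared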
Now let's do it again for a multivariate regression, where we add in the number of red cars in the
neighborhood as a predictor of housing sales:
Model Summary
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .958(a)   .917       .876                12.65044
a Predictors: (Constant), PRICE, REDCARS
Coefficients(a)
                     Unstandardized Coefficients   Standardized Coefficients                       95% Confidence Interval for B
Model                B          Std. Error         Beta                        t         Sig.      Lower Bound    Upper Bound
1   (Constant)       223.157    54.323                                         4.108     .015      72.332         373.983
    Price of house   -.708      .191               -.853                       -3.703    .021      -1.240         -.177
    Red cars         .376       .666               .130                        .565      .603      -1.474         2.226
a Dependent Variable: # houses sold
Explaining coefficients:
This time, the estimates cannot be explained in exactly the same manner.
Here we would say: controlling for the effect of red cars, a one-unit increase in the price of
houses decreases housing sales by 0.708 units.
Let’s look at our t-statistic on red cars, what does that tell us?
So, should we take it out of our model?
Again, let’s think about p-values, p-value (in SPSS Sig.), Confidence intervals, and R-squared.
What happened to our R-squared in this case? It increased to .917. This is good, right? We
explained more of the variation in Y by adding this variable. NO: R-squared never decreases
when a variable is added, so the increase by itself tells us little; note that the adjusted R-squared
actually fell from .893 to .876.
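The penalty for added regressors is visible in the adjusted R-squared. A short Python sketch of the standard adjustment formula, using the values from the two model summaries above:

    def adjusted_r2(r2, n, k):
        # n = number of observations, k = number of independent variables (excluding the constant)
        return 1 - (1 - r2) * (n - 1) / (n - k - 1)

    print(adjusted_r2(0.911, 7, 1))   # about 0.893: the bivariate model
    print(adjusted_r2(0.917, 7, 2))   # about 0.876: the model with red cars added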
B. Interpreting log models
The log-log model:
Yi = B1 Xi^B2 e^(ui)
Rewrite: ln Yi = α + B2 ln Xi + ui, where α = ln B1
Is this linear in the parameters? How do we estimate this?
The slope coefficient B2 measures the elasticity of Y with respect to X, that is, the percentage change in Y for
a given percentage change in X.
The model assumes that the elasticity coefficient between Y and X, B2, remains constant throughout: the
change in lnY per unit change in lnX (the elasticity B2) remains the same no matter at which lnX we
measure the elasticity.
[Two plots: demand (1 to 10) against price (1 to 9), and lndemand (0 to 2.30259) against lnprice (0 to 2.19722).]
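To estimate a log-log model, we simply run OLS on the logged variables. As an illustration only (reusing the house-price data from earlier, and assuming statsmodels is available):

    import numpy as np
    import statsmodels.api as sm

    price = np.array([160, 180, 200, 220, 240, 260, 280], dtype=float)
    sales = np.array([126, 103, 82, 75, 82, 40, 20], dtype=float)

    X = sm.add_constant(np.log(price))
    loglog = sm.OLS(np.log(sales), X).fit()
    print(loglog.params)   # the slope is the estimated elasticity of sales with respect to price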
The log-linear model:
ln Yt = B1 + B2t + ut
B2 measures the constant proportional change or relative change in Y for a given absolute change in the
value of the regressor:
B2 = relative change in regressand / absolute change in regressor
If we multiply the relative change in Y by 100, we then get the percentage change in Y for an absolute
change in X, the regressor.
[Two plots against var4: gnp (191,857 to 420,819) and lngnp (12.1645 to 12.95), with var4 ranging from 17,889 to 28,798.]
The lin-log model:
Now we are interested in finding the absolute change in Y for a percentage change in X:
Yi = B1 + B2 ln Xi + ui
B2 = absolute change in Y / relative change in X
The absolute change in Y equals B2 times the relative change in X; if the relative change in X is expressed
in percentage terms, then B2 divided by 100 gives the absolute change in Y for a one-percent change in X.
IV. When Linear Regression Does Not Work
A. Violations of the Gauss-Markov assumptions
Some violations of the Gauss-Markov assumptions are more serious problems for linear
regression than others. Consider an instance where you have more independent variables
than observations. In such a case, in order to run linear regression, you must simply
gather more observations. Similarly, in a case where you have two variables that are very
highly correlated (say, GDP per capita and GDP), you may omit one of these variables
from your regression equation. If the expected value of your disturbance term is not zero,
then there is another independent variable that is systematically determining your
dependent variable. Finally, your theory may indicate that you need a number of independent
variables to which you do not have access. In this case, to run the linear regression, you must
either find alternate measures of your independent variables or find another way to investigate
your research question.
Other violations of the Gauss-Markov assumptions are addressed below. In all of these
cases, be aware that Ordinary Least Squares regression, as we have discussed it today,
gives biased estimates of parameters and/or standard errors.
• Non-linear relationship between X and Y: use Non-linear Least Squares or Maximum Likelihood Estimation
• Endogeneity: use Two-Stage Least Squares or a lagged dependent variable
• Autoregression: use a Time-Series Estimator (ARIMA)
• Serial Correlation: use Generalized Least Squares
• Heteroskedasticity: use Generalized Least Squares (a sketch follows this list)
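As an illustration of the last remedy, here is a sketch of weighted least squares (a special case of GLS) in Python's statsmodels. The data are made up, and the weights assume, purely for illustration, that the error variance is proportional to X squared:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    x = rng.uniform(1, 10, size=200)       # hypothetical regressor
    u = rng.normal(scale=x)                # heteroskedastic errors: spread grows with x
    y = 1.0 + 0.5 * x + u                  # hypothetical data-generating process

    X = sm.add_constant(x)
    gls_fit = sm.WLS(y, X, weights=1.0 / x ** 2).fit()   # GLS via weighting when Var(u) is proportional to x^2
    print(gls_fit.params)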
B. Characteristics of the Dependent Variable
Until now, we have implicitly assumed that the dependent variable in our equation is a
continuous non-censored variable. This is necessary for linear regression analysis.
However, it will often be the case that your dependent variable is a dummy variable (0
for peace, 1 for war), a count variable (the number of bills passed by the legislature in a
given year), strictly non-negative (the distance from your house to the post office), or
censored (yearly income, with no data collected for those making less than $1,000 per
year). In these cases, linear regression is not an appropriate technique for uncovering the
relationships between X and Y. I discuss the appropriate remedies for these problems
below.
• Dependent variable is binary: use Logistic Regression or Probit Regression (see the sketch after this list)
• Dependent variable is a count variable: use Poisson Regression
• Dependent variable is strictly non-negative or censored: use Tobit Regression
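For example, with a binary dependent variable, a logistic regression can be fit as follows (a sketch with hypothetical data, assuming statsmodels is available):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(2)
    x = rng.normal(size=500)                  # hypothetical independent variable
    p = 1 / (1 + np.exp(-(0.3 + 1.2 * x)))    # hypothetical true probabilities
    y = rng.binomial(1, p)                    # binary outcome, e.g. 0 = peace, 1 = war

    X = sm.add_constant(x)
    logit_fit = sm.Logit(y, X).fit()
    print(logit_fit.summary())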
C. Help
If you believe that the nature of your data will force you to use a more sophisticated
estimation technique than Ordinary Least Squares, you should first consult the resources
listed on the Statlab Web Page at http://statlab.stat.yale.edu/links/index.jsp. You may
also find that Google searches of these regression techniques often find simple tutorials
for the methods that you must use, complete with help on estimation, programming, and
interpretation. Finally, you should also be aware that the Statlab consultants are available
throughout regular Statlab hours to answer your questions (see the areas of expertise for
Statlab consultants at http://statlab.stat.yale.edu/people/showConsultants.jsp).