Elementary Statistics

Multiple Regression
In the previous section, we examined simple
regression, which has just one independent
variable on the right side of the equation.
In this section, we consider multiple regression,
in which there are two or more independent
variables on the right side of the equation.
Simple Regression
  True relation:      Yi = α + βXi + εi
  Estimated relation: Yi = a + bXi + ei
Multiple Regression
  True relation:      Yi = α + β1X1i + β2X2i + … + βkXki + εi
  Estimated relation: Yi = a + b1X1i + b2X2i + … + bkXki + ei
The number of X's (independent variables) will be denoted as k. We
are estimating k + 1 parameters: the k β's and the constant α.
We have similar assumptions to the ones we
used in simple regression. The assumptions are
• The Y values are independent of each other.
• The conditional distributions of Y given the
X’s are normal.
• The conditional standard deviations of Y
given the X’s are equal for all values of the
X’s.
We continue to use OLS (ordinary least squares).
It is much more difficult to do multiple regression
with a hand calculator than simple regression is,
but computer programs perform it very easily and
quickly.
As in simple regression, we have
SST = Σ(Yi − Ȳ)²,
SSR = Σ(Ŷi − Ȳ)²,
SSE = Σ(Yi − Ŷi)².
The standard error of the regression
or the standard error of the estimate is
SER = Se = √( SSE / (n − k − 1) ).
In simple regression, there was only one X, so k was 1 and
our denominator was (n – 2) . Here the denominator is
generalized to (n – k – 1).
The Regression ANOVA Table is now:

Source of    Sum of squares       Degrees of    Mean square
variation                         freedom
Regression   SSR = Σ(Ŷi − Ȳ)²    k             MSR = SSR/k
Error        SSE = Σ(Yi − Ŷi)²   n − k − 1     MSE = SSE/(n − k − 1)
Total        SST = Σ(Yi − Ȳ)²    n − 1         MST = SST/(n − 1)
The hypotheses for testing the overall
significance of the regression are:
H0: β1 = β2 = … = βk = 0 (all the slope coefficients are zero)
H1: at least one of the β's is not zero.
The statistic for the test is
F(k, n − k − 1) = MSR/MSE = (SSR/k) / (SSE/(n − k − 1)).
We can also test whether a particular coefficient
βj is zero (or equals any other specified value), using a
t-statistic:
t(n − k − 1) = (bj − βj) / sbj.
The calculation of sbj is very messy, but sbj is always
given on computer output.
We can do one-tailed and two-tailed tests.
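The t-test for a single coefficient can be sketched in a few lines. This is a minimal illustration: the coefficient, its standard error, and the degrees of freedom are assumed numbers, and the critical value is the two-tailed 5% value for 27 dof taken from a t table.

```python
# Sketch: t-test for a single coefficient (all numbers assumed for illustration).
# H0: beta_j = 0 versus the two-tailed alternative H1: beta_j != 0.
b_j = 2.5        # estimated coefficient (assumed)
s_bj = 1.0       # its standard error, as given on computer output (assumed)
beta_j0 = 0.0    # hypothesized value under H0

t_stat = (b_j - beta_j0) / s_bj

# Two-tailed 5% critical value for 27 dof, from a t table.
t_crit = 2.052
reject_H0 = abs(t_stat) > t_crit   # True here: 2.5 > 2.052
```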
Coefficient of determination or R²:
R² = SSR/SST = 1 − SSE/SST = Σ(Ŷi − Ȳ)² / Σ(Yi − Ȳ)².
R² adjusted or corrected for degrees of freedom:
Rc² = 1 − [SSE/(n − k − 1)] / [SST/(n − 1)],
or Rc² = 1 − (1 − R²)·(n − 1)/(n − k − 1).
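The two adjusted-R² formulas above are algebraically equivalent, which a quick computation confirms (the sums of squares and sample sizes below are assumed for illustration):

```python
# Sketch: R-squared and adjusted R-squared from the sums of squares.
# SST, SSE, n, and k are assumed numbers for illustration.
SST = 100.0
SSE = 25.0
SSR = SST - SSE        # SST = SSR + SSE
n, k = 30, 2

R2 = SSR / SST                                    # = 1 - SSE/SST
R2_adj = 1 - (SSE / (n - k - 1)) / (SST / (n - 1))
# Equivalent form: 1 - (1 - R2) * (n - 1) / (n - k - 1)
```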
Dummy Variables
Dummy variables enable us to explore the effects of
qualitative rather than quantitative factors.
Side note: Cross-sectional data provides us with information
on a number of households, individuals, firms, etc. at a
particular point in time. Time-series data gives us information
on a particular household, firm, etc. at various points in time.
Suppose, for example, we have cross-sectional data on
income. Dummy variables can give us an understanding of
how race, gender, and residence in an urban area can affect
income.
If we have time-series data on expenditures, dummy variables
can tell us about seasonal effects.
To capture the effects of a factor that has m categories, you
need m – 1 dummy variables. Here are some examples.
Gender: You are examining SAT scores. Since there are 2 gender
categories, you need 1 gender variable to capture the effect of
gender. If you include a variable that is 1 for male observations and
0 for females, the coefficient on that variable tells how male scores
compare to female scores. In this case, female is the reference
category.
Race: You are examining salaries and you have data for 4 races:
white, black, Asian, and Native American. You only need 3 dummy
variables. You might define a variable that is 1 for blacks and 0 for
non-blacks, a 2nd variable that is 1 for Asians and 0 for non-Asians,
and a 3rd variable that is 1 for Native Americans and 0 for non-Native Americans. Then white would be the reference category and
the coefficients of the 3 race variables would tell how salaries for
those groups compare to salaries for whites.
Coefficient interpretation example: You have estimated the regression
SALARY = 10.0 + 1.0(EDUC) + 2.0(EXP) − 5.0(FEMALE)
where SALARY is measured in thousands of dollars, EDUC and EXP
are education and experience, each measured in years, and FEMALE
is a dummy variable equal to 1 for females and 0 for males. The
coefficients of the variables would be interpreted as follows.
If there are two people with the same experience and gender, and one
has 1 more unit of education (in this case, a year), that person would
be expected to have a salary that is 1.0 units higher (in this case, 1.0
thousand dollars higher).
If there are two people with the same education and gender, and one
has 1 more year of experience, that person would be expected to have
a salary that is 2.0 thousand dollars higher.
If there are two people with the same education and experience, and
one is male and one is female, the female is expected to have a salary
that is 5.0 thousand dollars less.
SALARY = 10.0 + 1.0(EDUC) + 2.0(EXP) − 5.0(FEMALE)
Consider 4 people with the following characteristics.

education   experience   female   salary
10          5            0        10 + 10 + 10 − 0 = 30
11          5            0        10 + 11 + 10 − 0 = 31
11          6            0        10 + 11 + 12 − 0 = 33
11          6            1        10 + 11 + 12 − 5 = 28

If two people have the same experience and gender, the one that has
one more year of education would be expected to earn 1.0 thousand
dollars more.
If two people have the same education and gender, the one that has
one more year of experience would be expected to earn 2.0 thousand
dollars more.
If two people have the same education and experience, the female
would be expected to earn 5.0 thousand dollars less than the male.
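The predicted salaries in the table can be reproduced with a short script (a minimal sketch; the function name is ours, the coefficients are from the estimated equation above):

```python
# Sketch: predicted salaries from the estimated equation
# SALARY = 10.0 + 1.0*EDUC + 2.0*EXP - 5.0*FEMALE
def predicted_salary(educ, exp, female):
    return 10.0 + 1.0 * educ + 2.0 * exp - 5.0 * female

# The four people from the table: (education, experience, female)
people = [(10, 5, 0), (11, 5, 0), (11, 6, 0), (11, 6, 1)]
salaries = [predicted_salary(*p) for p in people]   # [30.0, 31.0, 33.0, 28.0]
```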
Suppose you have regression results based on quarterly data for a
particular household.
SPENDING = 10.0 + 0.70(INCOME) + 3.0(WINTER) + 2.0(SPRING) − 1.0(SUMMER)
SPENDING and INCOME are in thousands of dollars. WINTER
equals 1 if the quarter is winter and 0 if it is fall, spring or summer.
SPRING is 1 if the quarter is spring and 0 if it is fall, winter or
summer. SUMMER is 1 if the quarter is summer and 0 if it is fall,
spring or winter.
Suppose household income is 10 thousand dollars in each of the 4
quarters of a particular year.
In the fall, spending would be expected to be 17 thousand dollars.
In the winter, spending would be expected to be 3.0 thousand dollars
higher than in the fall, or 20 thousand dollars.
In the spring, spending would be expected to be 2.0 thousand dollars
higher than in the fall, or 19 thousand dollars.
In the summer, spending would be expected to be 1.0 thousand dollars
less than in the fall, or 16 thousand dollars.
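These seasonal predictions follow directly from the estimated equation; a small sketch makes the pattern explicit (the function name and keyword defaults are ours):

```python
# Sketch: predicted quarterly spending with seasonal dummies,
# SPENDING = 10.0 + 0.70*INCOME + 3.0*WINTER + 2.0*SPRING - 1.0*SUMMER
# Fall is the reference category (all three dummies equal 0).
def predicted_spending(income, winter=0, spring=0, summer=0):
    return 10.0 + 0.70 * income + 3.0 * winter + 2.0 * spring - 1.0 * summer

income = 10.0  # thousand dollars, the same in every quarter
fall   = predicted_spending(income)                # 17
winter = predicted_spending(income, winter=1)      # 20
spring = predicted_spending(income, spring=1)      # 19
summer = predicted_spending(income, summer=1)      # 16
```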
Example: You have run a regression with 30 observations. The
dependent variable, WGT, is weight measured in pounds. The
independent variables are HGT, height measured in inches and a
dummy variable, MALE, which is 1 if the person is male and 0 if
the person is female. The results are as shown below. Answer
the questions that follow.
variable    estimated coefficient   estimated std. error
CONSTANT    -160.129                50.285
HGT         4.378                   1.103
MALE        27.478                  9.520

source of variation   sum of squares   degrees of freedom   mean square
regression            25,414.01        2                    12,707.01
error                 8,573.80         27                   317.48
total                 33,987.81        29                   1,171.99
1. Interpret the HGT coefficient.
If there are 2 people of the same gender and one is an inch
taller than the other, the taller one is expected to weigh 4.378
pounds more.
2. Interpret the MALE coefficient.
If there are 2 people of the same height, and one is male and
one is female, the male is expected to weigh 27.478 pounds
more.
3. Calculate and interpret the coefficient of determination R2.
Also calculate the adjusted R2.
R² = SSR/SST = 25,414.01/33,987.81 = 0.7477.
About 75% of the variation in weight is explained by the
regression on height and gender.
Rc² = 1 − [SSE/(n − k − 1)] / [SST/(n − 1)]
    = 1 − (8,573.80/27) / (33,987.81/29) = 0.729.
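The arithmetic can be checked in a couple of lines, using the sums of squares from the ANOVA table:

```python
# Check of the R-squared computations, with figures from the ANOVA table.
SSR, SSE, SST = 25414.01, 8573.80, 33987.81
n, k = 30, 2   # 30 observations, 2 independent variables (HGT, MALE)

R2 = SSR / SST                                     # about 0.7477
R2_adj = 1 - (SSE / (n - k - 1)) / (SST / (n - 1)) # about 0.729
```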
4. Test at the 5% level whether the HGT coefficient is greater
than zero. (Note that this is the alternative hypothesis.)
t27 = (bj − βj)/sbj = (4.378 − 0)/1.103 = 3.97.
From our t table, we see that for 27 dof and a one-tailed 5%
critical region, our critical value is 1.703. Since the value
of our statistic is 3.97, we reject H0 and accept H1: the HGT
coefficient is greater than zero.
[Diagram: t27 density with the 5% critical region to the right of 1.703]
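The test statistic can be verified directly from the regression output:

```python
# Check of the one-tailed t-test for the HGT coefficient.
b_hgt, s_hgt = 4.378, 1.103   # estimate and standard error from the table
t_stat = (b_hgt - 0) / s_hgt  # t statistic with 27 dof, about 3.97
t_crit = 1.703                # one-tailed 5% critical value, 27 dof (t table)
reject_H0 = t_stat > t_crit   # True: the HGT coefficient is significantly > 0
```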
5. Test at the 1% level whether the MALE coefficient is different
from zero. (Note that this is the alternative hypothesis.)
t27 = (bj − βj)/sbj = (27.478 − 0)/9.520 = 2.89.
From our t table, we see that for 27 dof and a two-tailed 1%
critical region, our critical values are 2.771 and −2.771.
Since the value of our statistic is 2.89, we reject H0 and
accept H1: the MALE coefficient is different from zero.
[Diagram: t27 density with 0.5% critical regions beyond ±2.771]
6. Test the overall significance of the regression at the 1% level.
F(2, 27) = MSR/MSE = 12,707.01/317.48 = 40.02.
From our F table, we see that for 2 and 27 dof and a 1% critical
region, our critical value is 5.49. Since the value of our
statistic is 40.02, we reject H0 and accept H1: at least one of
the slope coefficients is not zero.
[Diagram: F(2, 27) density with the 1% critical region to the right of 5.49]
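The F statistic, too, follows directly from the ANOVA table:

```python
# Check of the overall F-test, with figures from the ANOVA table.
MSR, MSE = 12707.01, 317.48
F_stat = MSR / MSE            # F with (2, 27) dof, about 40.02
F_crit = 5.49                 # 1% critical value for F(2, 27) from an F table
reject_H0 = F_stat > F_crit   # True: the regression is significant overall
```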
Multicollinearity Problem
Multicollinearity arises when the independent variables
(the X's) are highly correlated.
Then it is not possible to separate the effects of these
variables on the dependent variable Y.
The slope coefficient estimates will tend to be
unreliable, and often are not significantly different
from zero.
The simplest solution is to delete one of the correlated
variables.
Example: You are exploring the factors influencing the
number of children that a couple has.
You have included as X’s the mother’s education and the
father’s education.
You find that neither appears to be statistically significantly
different from zero.
This may occur because the two education variables are
highly correlated.
One option is to include only the education of one parent.
Alternatively, you could replace the two education variables
with a single variable, such as the average or total education
of the parents.
Problem of Autocorrelation or Serial Correlation
This is a problem that may arise in time-series data, but generally
not in cross-sectional data.
It occurs when successive observations of the dependent variable
Y are not independent of each other.
For example, suppose you are examining the weight of a particular
person over time. If that weight is particularly high in one
period, it is likely to be high in the next period as well.
So, if the residual ei = Yi − Ŷi > 0 in period 7, for example, it is
likely that the residual will be greater than zero in period 8 as well.
Therefore, the residuals tend to be correlated among themselves
(autocorrelated) rather than independent.
You can test for autocorrelation using the Durbin-Watson statistic
d = [ Σ(i = 2 to n) (ei − ei−1)² ] / [ Σ(i = 1 to n) ei² ].
The Durbin-Watson statistic d is always between 0 and 4.
When there is extreme negative autocorrelation, d will be near 4.
When there is extreme positive autocorrelation, d will be near 0.
When there is no problem of autocorrelation, d will be near 2.
In many computer statistical packages you can request that the
Durbin-Watson statistic be provided as output.
You can look up critical values in a table that then allows you to
determine if you have an autocorrelation problem.
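The statistic itself is easy to compute from the residuals. A minimal sketch, with an assumed residual series whose runs of same-signed values suggest positive autocorrelation:

```python
# Sketch: the Durbin-Watson statistic from regression residuals.
# The residuals below are assumed numbers for illustration.
residuals = [1.2, 0.8, 0.9, -0.3, -0.7, -0.5, 0.4, 0.6]

# Numerator: sum of squared successive differences, i = 2..n.
num = sum((residuals[i] - residuals[i - 1]) ** 2
          for i in range(1, len(residuals)))
# Denominator: sum of squared residuals, i = 1..n.
den = sum(e ** 2 for e in residuals)

d = num / den   # always between 0 and 4; values near 0 suggest
                # positive autocorrelation, near 4 negative, near 2 none
```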
The Durbin-Watson table provides two numbers dL and
dU corresponding to the number n of observations and
the number k of explanatory variables (X’s).
Your textbook provides one-tailed values, so you can test for
“positive autocorrelation” or “negative autocorrelation” but
not “positive or negative autocorrelation” at the same time.
The null hypothesis is that there is no autocorrelation.
The diagram below indicates which regions are indicative of
positive autocorrelation, negative autocorrelation, no
autocorrelation, or are inconclusive.
0 to dL:          positive autocorrelation problem
dL to dU:         inconclusive
dU to 4 − dU:     no autocorrelation (this region contains 2)
4 − dU to 4 − dL: inconclusive
4 − dL to 4:      negative autocorrelation problem
Example: You have run a time-series regression with 25
observations and 4 independent variables. Your Durbin-Watson
statistic d = 0.70 . Test at the 1% level whether you have a
positive autocorrelation problem.
The Durbin-Watson table indicates that for 25 observations
and 4 independent variables, dL = 0.83 and dU = 1.52, so
4 − dU = 2.48 and 4 − dL = 3.17. This implies the following regions.

0 to 0.83:    positive autocorrelation problem
0.83 to 1.52: inconclusive
1.52 to 2.48: no autocorrelation
2.48 to 3.17: inconclusive
3.17 to 4:    negative autocorrelation problem

Since d = 0.70 falls below dL = 0.83, it lies in the
positive-autocorrelation region.
You reject H0: no autocorrelation and accept H1: there is a
positive autocorrelation problem.
There are techniques for handling autocorrelation problems,
but they are beyond the scope of this course.