November 17 -- Bivariate Regression

Bivariate Regression
Assumptions and Testing of the
Model
Economics 224, Notes for November 17, 2008
Assignments
• Assignment 6 is optional. It will be handed
out next week and due on December 5.
• If you are satisfied with your grades on
Assignment 1-5, then you need not do
Assignment 6.
• If you do Assignment 6, then we will base your
mark for the assignments on the best five
marks.
Corrections from last day
• Significance of t values from Excel is for two-tailed or two-directional tests.
• If the alternative hypothesis is one-directional, that is, less than or greater than, then cut the P-value in half.
• I used H1 as the name of the alternative
hypothesis. The text uses Ha, so I will use that
from now on.
Example: The Consumption Function
• A key part of the Keynesian aggregate expenditure model.
• Let C = aggregate consumption and Y = aggregate demand
– Key role of the marginal propensity to consume (MPC) out
of real GDP = ∆C/∆Y.
• Estimating C = β0 + β1Y + ε.
• Data set posted on UR Courses.
• Find estimates b1 of the slope β1 and b0 of intercept β0 to
produce an estimate of the consumption function:
Ĉ = b0 + b1Y
• In a revised model, you might use total income or disposable
income for Y and include other relevant variables.
Hypotheses
• H0: β1 = 0. Real GDP has no relation to
consumption or MPC = 0.
• Ha: β1 > 0. Real GDP has a positive
relationship with consumption or MPC > 0.
Quarter     Consumption (y)   Real GDP (x)
I 1995      472101            831286
II 1995     475537            830162
III 1995    480115            831707
IV 1995     480041            835395
I 1996      485805            836765
II 1996     486454            839457
III 1996    487917            847643
IV 1996     495580            856762
I 1997      503156            867608
II 1997     507780            877424
III 1997    513924            889104
IV 1997     517920            896800
I 1998      518156            908268
II 1998     524652            911136
III 1998    527792            920924
IV 1998     529156            935672
(quarters I 1999 – IV 2003 not shown on the slide)
I 2004      637392            1110920
II 2004     641304            1124820
III 2004    647212            1138488
IV 2004     653504            1147392
Regression Statistics
Multiple R          0.993811
R Square            0.98766
Adjusted R Square   0.987335
Standard Error      6181.672
Observations        40
[Figure: scatter plot of Consumption (vertical axis, 450,000 to 700,000) against Real GDP (horizontal axis, 800,000 to 1,150,000), with fitted line Ĉ = 35,358 + 0.532GDP.]
Statistics from Excel for regression of consumption on real GDP
ANOVA
              df    SS          MS          F          Significance F
Regression    1     1.16E+11    1.16E+11    3041.359   7.03E-38
Residual      38    1.45E+09    38213072
Total         39    1.18E+11

                Coefficients   Standard Error   t Stat     P-value
Intercept       35358.38       9501.626         3.721298   0.000639
X Variable 1    0.531977       0.009646         55.14852   7.03E-38
Analysis of consumption function results
• The t test for the regression coefficient gives a t value of 55.1, with an extremely small P-value (7.03 × 10 to the power of minus 38). The null hypothesis that real GDP has no relationship with consumption is rejected and the alternative hypothesis that consumption has a positive relationship with real GDP is accepted.
• The estimate of the slope, in this case the MPC, is 0.532. Over this period, increases in real GDP are associated with increases in consumption of just over one-half of the increase in GDP.
• There appears to be serial correlation in the model (see later
slides) so the assumptions are violated. This violation may not
affect the estimate of the MPC all that much.
• Time series regressions of this type often have a very good fit
to the data. In this case, R2 = 0.988.
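The least-squares calculation behind these results can be sketched in a few lines. This is a rough check only: it uses just the 20 quarters reproduced on the slide, while the full data set posted on UR Courses has 40 observations, so the estimates differ somewhat from the Excel results (b0 = 35,358, b1 = 0.532).

```python
# Least-squares estimates of the consumption function C = b0 + b1*GDP.
# Data: only the 20 quarters shown on the slide (1995-1998 and 2004).
consumption = [472101, 475537, 480115, 480041, 485805, 486454,
               487917, 495580, 503156, 507780, 513924, 517920,
               518156, 524652, 527792, 529156,
               637392, 641304, 647212, 653504]
gdp = [831286, 830162, 831707, 835395, 836765, 839457,
       847643, 856762, 867608, 877424, 889104, 896800,
       908268, 911136, 920924, 935672,
       1110920, 1124820, 1138488, 1147392]

n = len(gdp)
mean_x = sum(gdp) / n
mean_y = sum(consumption) / n
# b1 = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(gdp, consumption))
sxx = sum((x - mean_x) ** 2 for x in gdp)
b1 = sxy / sxx                 # slope: the estimated MPC
b0 = mean_y - b1 * mean_x      # intercept
print(f"C-hat = {b0:,.0f} + {b1:.3f} GDP")
```

The estimated MPC from this subset comes out close to the 0.532 reported for the full sample.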
Assumptions for regression model y = β0 + β1x + ε
• Linear relationship between x and y.
– Transform a curvilinear relation to a linear one.
• Interval or ratio level scales for x and y.
– Nominal scales – dummy variables and multiple regression.
– Ordinal scales – be very cautious with interpretation.
• x truly independent, exogenous, and error free.
– May correct for the latter with an errors-in-variables model.
• No relevant variables excluded from the model.
• Several assumptions about the error term ε.
– Random variable with mean of 0.
– Normally distributed random errors.
– Equal variances.
– Values of ε independent of each other.
Error term ε in y = β0 + β1x + ε
• Importance
– Source of information for statistical tests.
– Violation of assumptions may make the regression model, estimates, and statistical tests inaccurate.
• Source of error
– Random component – random sampling,
unpredictable individual behaviour.
– Measurement error.
– Variables not in equation.
• Examination of residuals provides possibility of
testing assumptions about ε (ASW, 12.8).
Assumptions about ε (ASW, 487-8)
• E(ε) = 0. ε is a random error with a mean or expected value of
zero so that E(y) = β0 + β1x is the “true” regression equation.
• Var(ε) = σ2 for each value of x. For different values of x, the
variance for the distribution of random errors is the same.
This characteristic is referred to as homoskedasticity and if
this assumption is not met, the model has heteroskedasticity.
• Values of ε are independent of each other. For any x, the
values of ε are unrelated to or independent of values of ε for
any other x. The violation of this assumption may be referred
to as serial correlation or autocorrelation.
• For each x, the distribution of values of ε is a normal
distribution.
Assumptions in practice
• These strong assumptions about the random error
term ε are often not met. Econometricians have
developed many procedures for handling data where
assumptions are not met.
• For testing the model, assume the assumptions are
met.
• If the assumptions are met, econometricians show
that the least-squares estimators are the best linear
unbiased estimators (BLUE) possible.
Assumptions in examples
• Regression of wages and salaries on years of schooling.
Microdata from a random sample means that the errors are
likely random with mean 0 and are likely independent of each
other. Distribution of wages and salaries may not be normal
and variance of wages and salaries at different years of
schooling may not be equal.
• Consumption function likely has correlated errors associated
with it and may not meet the equal variance and normal
distribution assumptions. But estimate of MPC may be
reasonably accurate.
• Alcohol example probably violates each assumption
somewhat. However, the estimate of the effect of income on
alcohol consumption may be a reasonable estimate.
Testing the model for statistical significance
• The key question is whether the slope is 0 or not, that is,
whether the regression model explains any of the variation in
the dependent variable y. The hypotheses are:
H0: β1 = 0.
Ha: β1 ≠ 0.
• If the true relationship is y = β0 + β1x + ε, different samples
yield different values for the estimators b0 and b1 of the
parameters β0 and β1, respectively. With repeated sampling,
these estimators thus have a variability or standard error. This
variability depends on the variability of the random error
term so estimating σ2 is the first step in testing the model.
• There are two tests, the t-test for the statistical significance of
the slope and the F-test for the significance of the equation.
For bivariate regression, these two tests give identical results,
but they are different tests in multivariate regression.
Estimating σ2, the variance of ε
• The values of the random error term ε are not observed but,
once a regression line has been estimated from a sample, the
residuals (ei) can be calculated and used to construct an
estimate of σ2. Recall that the error sum of squares, or
unexplained variation, was SSE.
SSE = Σ(yi − ŷi)² = Σ(yi − b0 − b1xi)²
• Dividing SSE by the degrees of freedom provides an estimate
of the variance. This is termed the mean square error (MSE)
and, for a bivariate regression line, equals
s² = MSE = SSE / (n − 2)
• There are n – 2 degrees of freedom since two parameters, β0
and β1, are estimated in a bivariate regression.
Standard error of estimate s or se
• Associated with each regression line is a standard error of
estimate. ASW use the symbol s. Some texts use the symbol
se to distinguish it from the standard deviation of a variable.
s = se = √MSE = √(SSE / (n − 2))
• Alcohol example. n = 10, SSE = 4.159933, MSE = SSE/8 = 0.519992.
s = se = √0.519992 = 0.721104
and note this is given in Excel Regression Statistics box.
• Schooling and earnings. s = 19,448. See next slides.
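The alcohol-example arithmetic can be checked directly, using the n, SSE, and MSE values given above:

```python
import math

# Alcohol example, values from the slide: n = 10, SSE = 4.159933.
n = 10
sse = 4.159933
mse = sse / (n - 2)   # mean square error = SSE / (n - 2) = SSE / 8
s = math.sqrt(mse)    # standard error of estimate, about 0.7211
```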
Standard error of estimate s or se
• Rough rule of thumb:
– Two-thirds of observed values are within 1 standard error of estimate
of the line.
– 95% plus of observed values are within 2 standard errors of the line.
ŷ = b0 + b1x
[Figure: plot of WGSAL42 against YRSCHL18 (total number of years of schooling completed), with fitted line ŷ = −13,493 + 4,181x and bands at one and two standard errors of estimate on either side of the line. 15/22 observations lie within 1 standard error and 21/22 within 2 standard errors.]
Distribution of b1
• The statistic b1 has a mean of β1, i.e. E(b1) = β1.
• The standard error of b1 is the standard error of estimate divided by the square root of the variation of x. The estimate of this standard error is
sb1 = s / √Σ(xi − x̄)²
• The distribution of b1 is described by a t-distribution with the
above mean and standard deviation and n-2 degrees of
freedom.
Schooling and earnings example –
standard error of the slope.
Regression Statistics
Multiple R          0.503045
R Square            0.253054
Adjusted R Square   0.215707
Standard Error      19447.73
Observations        22

sb1 = s / √Σ(xi − x̄)² = 19,447.73 / √146.5927 = 1,606.249
where Σ(xi − x̄)² = 146.5927 from the Nov. 12 handout.

                Coefficients   Standard Error   t Stat     P-value
Intercept       -13493         23211.26         -0.58131   0.567523
X Variable 1    4181.095       1606.249         2.603019   0.017015
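The standard error of the slope shown above can be reproduced from the two slide values:

```python
import math

# Schooling and earnings example, values from the slides.
s = 19447.73            # standard error of estimate (Excel output)
sum_sq_dev = 146.5927   # sum of (x_i - x_bar)^2, from the Nov. 12 handout
sb1 = s / math.sqrt(sum_sq_dev)   # standard error of the slope, about 1,606
```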
Test of statistical significance for b1
H0: β1 = 0.
Ha: β1 ≠ 0.
• b1 is the test statistic for the hypotheses and the t value, with
n-2 df, is
b1  1
t
sb1
Since the null hypothesis is usually that β1 = 0, this becomes
b1 divided by its standard deviation or standard error.
• Schooling and earnings example.
t = (b1 − β1) / sb1 = b1 / sb1 = 4181.095 / 1606.249 = 2.603
and, with a sample of n = 22 cases, there are 22 - 2 = 20 df.
The result is statistically significant at the α = 0.02 level of significance (P-value = 0.017). Reject H0 and accept Ha. Schooling is associated with earnings at the 0.02 level of significance.
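The t statistic for the schooling example is a one-line calculation from the slide values:

```python
# t statistic for the slope under H0: beta1 = 0 (schooling example).
b1 = 4181.095     # slope estimate from Excel
sb1 = 1606.249    # standard error of the slope
t = b1 / sb1      # about 2.603, well above the critical value near 2
```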
[Figure: two-tailed rejection regions for the t distribution, with α/2 = .025 in each tail and critical values t0.025 ≈ ±2.0. If the test t-value falls outside this range, reject H0; otherwise do not reject H0.]
Rule of thumb of 2
b1  1 b1

• Since the null hypothesis is usually H0: β1 = 0, t 
sb1
sb1
• The question is how large a t value is necessary to reject this
hypothesis.
• When the degrees of freedom are large, the t distribution approaches the normal distribution. At α = 0.05, for a two-tailed test, the critical values are t or Z of −1.96 and +1.96.
• Thus, for large samples or for data sets with many
observations (say 100 plus), if b1 is over double the value of
sb1, reject H0 and accept Ha. If b1 is less than twice the value
of sb1, do not reject H0.
• This is just a rough rule of thumb.
• Where df < 50, it is best to check the P-value associated with
the t value.
Test for the intercept
• A parallel test can be conducted for the intercept of
the line. Given that economic theory often is silent
on the issue of what the intercept might be, this is
usually of little interest.
• If there is reason to hypothesize a value for the
intercept, follow the same procedure. The Excel output for the regression coefficients provides the estimate of the intercept, its standard error, t-value, and P-value.
Confidence interval for b1
• From the distribution for b1, interval estimates of β1 are formed as follows:
b1  ta sb1
2
• For the schooling and earnings example, b1 = 4,181, the standard error of b1 = 1,606, and n = 22, so t for 20 df and 95% confidence is tα/2 = t0.025 = 2.086, giving the interval from 831 to 7,531 – a wide interval for the estimate of the effect of an extra year of schooling on annual wages and salaries.
b1  ta sb1  4,181  (2.086 1,606)  4,181  3,350
2
                Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept       -13493         23211.26         -0.58131   0.567523   -61910.9    34924.8
X Variable 1    4181.095       1606.249         2.603019   0.017015   830.5213    7531.67
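The interval endpoints can be checked against the Lower 95% and Upper 95% columns from Excel. The small discrepancy comes from using the rounded table value t = 2.086 rather than Excel's more precise critical value.

```python
# 95% confidence interval for the slope, schooling example.
b1 = 4181.095
sb1 = 1606.249
t_crit = 2.086    # t_0.025 with 20 df, from the t table
lower = b1 - t_crit * sb1   # about 831
upper = b1 + t_crit * sb1   # about 7,532
```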
F test for R2
H0: β1 = 0 or R2 = 0. No relationship.
Ha: β1 ≠ 0 or R2 ≠ 0. Relationship exists.
• Test is the ratio of the regression mean square to the error mean square, an F test: F = MSR / MSE.
• Reject H0 and accept Ha if F is large, i.e. the P-value associated with F is below the value of α selected (e.g. 0.05).
• Do not reject H0 if F is not large, i.e. the P-value associated with F is above the level of α selected (e.g. 0.05).
• For a bivariate regression, this test is exactly equivalent to the t test for the slope of the line.
• In multivariate regression, the F test provides a test for the existence of a relationship. The t test for each independent variable is a test for the possible influence of that variable.
Example – income and alcohol consumption
H0: β1 = 0 or R2 = 0. No relationship between income and
alcohol consumption.
Ha: β1 ≠ 0 or R2 ≠ 0. Income affects alcohol consumption.
• F = MSR/MSE = 6.920067 / 0.519992 = 13.308. P = 0.006513.
Reject H0 and accept Ha at α = 0.01.
• F table. At α = 0.01, with 1 and 8 df, F = 11.26. Estimated
F = 13.30803 > 11.26. Reject H0 and accept Ha at 0.01 level.
• At 0.01 significance, conclude that income affects alcohol
consumption.
ANOVA
              df   SS         MS         F          Significance F
Regression    1    6.920067   6.920067   13.30803   0.006513
Residual      8    4.159933   0.519992
Total         9    11.08
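The F statistic for the alcohol example is just the ratio of the two mean squares in the ANOVA table:

```python
# F statistic for the alcohol example, from the ANOVA table values.
msr = 6.920067   # regression mean square
mse = 0.519992   # error mean square
f = msr / mse    # about 13.31, above the 0.01 critical value of 11.26
```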
Example – schooling and earnings
H0: R2 = 0. No relationship between years of schooling and
wages and salaries.
Ha: R2 ≠ 0. Years of schooling related to wages and salaries.
R2 = 0.253 and the F value is 6.776 with 1 and 20 df.
At α = 0.05, F = 4.35 for 1 and 20 df.
Reject H0 and accept Ha at α = 0.05.
P value = 0.017 so reject H0 at 0.02 significance but not at 0.01.
ANOVA
              df   SS         MS         F          Significance F
Regression    1    2.56E+09   2.56E+09   6.775708   0.017015
Residual      20   7.56E+09   3.78E+08
Total         21   1.01E+10
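The equivalence of the F test and the t test in bivariate regression can be seen numerically: F is the square of the t statistic for the slope.

```python
# In bivariate regression, F equals the square of the slope's t statistic.
t = 2.603019   # t stat for the slope (schooling example)
f = 6.775708   # F stat from the ANOVA table above
print(t ** 2)  # essentially equal to f
```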
Estimation and prediction (ASW, 498-502)
• Point estimate provided by estimated regression line.
• In the example of the effect of years of schooling on wages
and salaries, predicted wages and salaries for those with 16
years of schooling are:
yˆ  13,493  4,181x  13,493  (4,18116)  53,403
• The confidence intervals associated with the predicted values:
– Depend on the confidence level (eg. 95%), the standard
error, the sample size, the variation of x, and the distance x
is from its mean. Formulae in ASW, pp. 499 and 501.
– Greater distance of x from the mean of x associated with a
wider interval.
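The point estimate for 16 years of schooling follows directly from the estimated line:

```python
# Predicted wages and salaries at 16 years of schooling,
# using the estimated line y-hat = -13,493 + 4,181x.
b0 = -13493
b1 = 4181
x = 16
y_hat = b0 + b1 * x   # = 53,403
```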
[ASW Figure 12.8: confidence intervals for the mean sales y at given values of student population x.]
[ASW Figure 12.9: confidence and prediction intervals for sales y at given values of student population x.]
Example – Schooling and wages and salaries. Inner band gives 95%
confidence intervals for prediction of mean values of wages and salaries for
each year of schooling. Outer band gives 95% prediction intervals for
individual wages and salaries.
[Figure: scatter plot of wages and salaries against total number of years of schooling completed by person (8 to 22 years), with fitted line ŷ = −13,493 + 4,181x and 95% confidence and prediction bands. se = 19,447, sb1 = 1,606, t = 2.603 for the slope with P-value = 0.017, R² = 0.2531.]
Confidence intervals for estimation and
prediction
• For estimation of predicted mean value of the dependent
variable, the inner bands illustrate the intervals.
• For estimation of predicted individual values of the
dependent variable, the outer bands illustrate the intervals.
These intervals can be very large. In the above example they
are so large that predicting individual wages and salaries from
years of schooling is almost completely unreliable. But it is
unrealistic to expect that a sample of size 22, with only one
independent variable (years of schooling) would allow a good
prediction of individual salaries.
• Interval estimates can be narrowed by expanding sample size
and constructing a model with improved fit and reduced
standard error.
Wednesday
• Reporting regression results.
• Examination of residuals, ASW, 12.8.
• Examples of transformations.
• Introduction to multiple regression.