Stat 301– Lecture 20 Sums of Squares

Sums of Squares
SS(C. Total) = 203748.80
SS(Year) = 187300.39
Year explains 91.9%
SS(Year²|Year) = 16264.87
Year² adds 8.0%
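The percentages above are each sum of squares divided by SS(C. Total). A minimal Python sketch of that arithmetic, using only the values on the slide (the variable and function names are illustrative, not from the lecture):

```python
# Sums of squares copied from the slide.
ss_total = 203748.80             # SS(C. Total)
ss_year = 187300.39              # SS(Year): Year entered first
ss_year2_given_year = 16264.87   # SS(Year^2 | Year): extra SS from adding Year^2


def percent_of_total(ss, total=ss_total):
    """Return a sum of squares as a percentage of the total sum of squares."""
    return 100.0 * ss / total


pct_year = percent_of_total(ss_year)
pct_year2_added = percent_of_total(ss_year2_given_year)

# The two sequential pieces add up to the model sum of squares.
ss_model = ss_year + ss_year2_given_year
r_squared = ss_model / ss_total

print(f"Year explains {pct_year:.1f}%")
print(f"Year^2 adds   {pct_year2_added:.1f}%")
print(f"R^2 = {r_squared:.4f}")
```

The sequential pieces sum to the model sum of squares, so their percentages sum to R² (up to rounding).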
Sums of Squares
SS(C. Total) = 203748.80
SS(Year) = 187300.39 (91.9%)
SS(Year²|Year) = 16264.87 (8.0%)
Sums of Squares
SS(C. Total) = 203748.80
SS(Year²) = 188977.13
Year² explains 92.8%
SS(Year|Year²) = 14588.13
Year adds 7.1%
Sums of Squares
SS(C. Total) = 203748.80
SS(Year|Year²) = 14588.13 (7.1%)
SS(Year²) = 188977.13 (92.8%)
Sums of Squares
SS(C. Total) = 203748.80
SS(shared) = 172712.26
84.8%
SS(Year|Year²) = 14588.13
7.1%
SS(Year²|Year) = 16264.87
8.0%
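One way to read this partition is that the shared piece is whatever Year explains alone minus what Year still adds after Year² is in the model. The sketch below works that out from the slide values; treating "shared" as this overlap is an interpretation of the diagram, and the variable names are illustrative. Note that subtracting both unique pieces from SS(C. Total) instead would also fold in the residual sum of squares.

```python
# Sums of squares copied from the slides.
ss_total = 203748.80
ss_year = 187300.39              # SS(Year)
ss_year2 = 188977.13             # SS(Year^2)
ss_year_given_year2 = 14588.13   # SS(Year | Year^2)
ss_year2_given_year = 16264.87   # SS(Year^2 | Year)

# Either order of entry gives the same model sum of squares.
ss_model_a = ss_year + ss_year2_given_year
ss_model_b = ss_year2 + ss_year_given_year2

# Overlap: what Year explains alone minus what is unique to Year.
ss_shared = ss_year - ss_year_given_year2
# The same overlap computed from the Year^2 side.
ss_shared_check = ss_year2 - ss_year2_given_year

# What neither term explains.
ss_error = ss_total - ss_model_a

print(f"SS(Model)  = {ss_model_a:.2f}")
print(f"SS(shared) = {ss_shared:.2f} ({100 * ss_shared / ss_total:.1f}% of total)")
print(f"SS(Error)  = {ss_error:.2f}")
```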
Sums of Squares
SS(C. Total) = 203748.80
SS(Year – 1900) = 187300.39
(Year – 1900) explains 91.9%
SS((Year – 1900)²|(Year – 1900)) = 16264.87
(Year – 1900)² adds 8.0%
Sums of Squares
SS(C. Total) = 203748.80
SS(Year – 1900) = 187300.39 (91.9%)
SS((Year – 1900)²|(Year – 1900)) = 16264.87 (8.0%)
Sums of Squares
SS(C. Total) = 203748.80
SS((Year – 1900)²) = 16264.87
(Year – 1900)² explains 8.0%
SS((Year – 1900)|(Year – 1900)²) = 187300.39
(Year – 1900) adds 91.9%
Sums of Squares
SS(C. Total) = 203748.80
SS((Year – 1900)|(Year – 1900)²) = 187300.39 (91.9%)
SS((Year – 1900)²) = 16264.87 (8.0%)
Sums of Squares
SS(C. Total) = 203748.80
SS((Year – 1900)|(Year – 1900)²) = 187300.39 (91.9%)
SS(shared) = 0.00 (0.0%)
SS((Year – 1900)²|(Year – 1900)) = 16264.87 (8.0%)
Effects of Centering
Year² shares about 85% of the
explained variation with Year.
(Year – 1900)² shares none of
the explained variation with
(Year – 1900).
Why does this happen?
The correlation between Year² and Year is statistically
significant, which indicates multicollinearity.
The correlation between (Year – 1900)² and (Year – 1900) is
zero, so there is no linear relationship between them.
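The effect of centering on the correlation can be checked numerically. The sketch below ASSUMES the data are the 23 decennial census years 1790 through 2010; the slides do not list the years, so this grid is an assumption chosen because it is symmetric about 1900. The `pearson` helper is written out by hand to keep the example self-contained.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# ASSUMPTION: decennial years 1790, 1800, ..., 2010 (not stated in the slides).
years = list(range(1790, 2011, 10))

# Raw Year versus Year^2: nearly perfectly correlated.
r_raw = pearson(years, [y ** 2 for y in years])

# Centered at 1900: the grid is symmetric about 0, so the linear
# and squared terms are uncorrelated.
centered = [y - 1900 for y in years]
r_centered = pearson(centered, [c ** 2 for c in centered])

print(f"corr(Year, Year^2)             = {r_raw:.5f}")
print(f"corr(Year-1900, (Year-1900)^2) = {r_centered:.5f}")
```

The centered correlation is zero here only because the assumed years are symmetric about the centering value; centering at a value other than the mean would reduce the correlation but not remove it.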
What about 1940 & 1950?



The predictions for 1940 and 1950
are much higher than the actual
population values.
Why?
Can we add a term to the model
that could account for this?
Dummy Variable
A dummy, or indicator, variable
can be used to identify
individual values or sets of values.
X = 1 if Year is 1940 or 1950
X = 0 otherwise
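The coding rule above can be sketched in a few lines of Python. The function name `dummy_1940_1950` and the sample years are illustrative, not part of the lecture.

```python
def dummy_1940_1950(year):
    """Return 1 if the year is 1940 or 1950, and 0 otherwise."""
    return 1 if year in (1940, 1950) else 0


# A few sample years to show the coding.
sample_years = [1930, 1940, 1950, 1960]
x = [dummy_1940_1950(year) for year in sample_years]

for year, value in zip(sample_years, x):
    print(year, value)
```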
Quadratic with Dummy

Predicted Population = 75.467 +
1.368*(Year – 1900) +
0.0066577*(Year – 1900)² –
8.947*X

Note that the other estimated
slope coefficients are very close
to those in the quadratic model.
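The fitted equation can be evaluated directly. This sketch uses the rounded coefficients shown on the slide, so its predictions may differ in the second decimal place from values computed with the unrounded estimates; the function name `predicted_population` is illustrative.

```python
def predicted_population(year, x):
    """Predicted population (millions) from the quadratic-with-dummy equation.

    year: calendar year
    x:    dummy variable, 1 if the year is 1940 or 1950, else 0
    """
    c = year - 1900  # centered year
    return 75.467 + 1.368 * c + 0.0066577 * c ** 2 - 8.947 * x


pred_1940 = predicted_population(1940, x=1)
pred_1950 = predicted_population(1950, x=1)

# Setting X = 1 shifts the curve down by the dummy coefficient.
shift = predicted_population(1940, x=0) - predicted_population(1940, x=1)

print(f"1940: {pred_1940:.2f} million")
print(f"1950: {pred_1950:.2f} million")
print(f"Shift from dummy: {shift:.3f} million")
```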
Quadratic with Dummy

For 1940 and 1950, the
prediction is lowered by 8.947
million.
Quadratic

1940
Actual = 132.165
Predicted = 139.426
Residual = –7.261

1950
Actual = 151.326
Predicted = 159.129
Residual = –7.803
Quadratic with Dummy

1940
Actual = 132.165
Predicted = 131.908
Residual = 0.257

1950
Actual = 151.326
Predicted = 151.583
Residual = –0.257
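Each residual is the actual value minus the predicted value. The sketch below recomputes the 1940 and 1950 residuals for both models from the actual and predicted populations (in millions) reported in these slides; the layout of the `cases` list is illustrative.

```python
# (model, year, actual, predicted), all populations in millions, from the slides.
cases = [
    ("Quadratic", 1940, 132.165, 139.426),
    ("Quadratic", 1950, 151.326, 159.129),
    ("Quadratic with Dummy", 1940, 132.165, 131.908),
    ("Quadratic with Dummy", 1950, 151.326, 151.583),
]

# Residual = actual - predicted; negative means the model over-predicts.
residuals = [round(actual - predicted, 3) for _, _, actual, predicted in cases]

for (model, year, _, _), residual in zip(cases, residuals):
    print(f"{model:22s} {year}: residual = {residual:+.3f}")
```

With the dummy in the model, the two residuals are equal in size and opposite in sign.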
Change in R²

Quadratic: R² = 0.9991
99.91% explained variation
Quadratic + Dummy: R² = 0.9998
99.98% explained variation
Only a small increase.
Significant Improvement?

The dummy variable, X, is added to
the quadratic model.
t = –7.25, P-value < 0.0001
Because the P-value is small, the
dummy variable, X, adds
significantly to the quadratic
model.
Change in RMSE

Quadratic: RMSE = 3.029
Quadratic + Dummy: RMSE = 1.602
RMSE reduced quite a bit.
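The drop in RMSE and the t statistic for the dummy variable are two views of the same comparison: for a single added term, the partial F statistic equals t². The sketch below ASSUMES n = 23 observations, which the slides do not state, so the error degrees of freedom (20 and 19) are assumptions as well. Under that assumption the result is close to the reported t of –7.25 in magnitude.

```python
import math

# ASSUMPTION: n = 23 observations (not stated in the slides).
n = 23

rmse_quad = 3.029    # quadratic model: intercept + 2 slopes
rmse_dummy = 1.602   # quadratic + dummy: intercept + 3 slopes

df_quad = n - 3      # error degrees of freedom, quadratic
df_dummy = n - 4     # error degrees of freedom, quadratic + dummy

# Recover the error sums of squares: SSE = RMSE^2 * error df.
sse_quad = rmse_quad ** 2 * df_quad
sse_dummy = rmse_dummy ** 2 * df_dummy

# Partial F for adding one term: drop in SSE over the MSE of the larger model.
f_stat = (sse_quad - sse_dummy) / (sse_dummy / df_dummy)

# For one added term, |t| = sqrt(F).
t_magnitude = math.sqrt(f_stat)

print(f"Partial F = {f_stat:.2f}")
print(f"|t|       = {t_magnitude:.2f}")
```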
[Figure: residual plot; vertical axis from –5 to 5, horizontal axis Year from 1800 to 2000]
Plot of Residuals
One might detect an up-down-up-down
wave pattern.
The worst predictions are still within
3 or 4 million of the actual
population.
Probably can’t do much better.
Residuals