Stat 301 – Lecture 20

Sums of Squares (Year entered first)
  SS(C. Total) = 203748.80
  SS(Year) = 187300.39; Year explains 91.9%
  SS(Year² | Year) = 16264.87; Year² adds 8.0%

Sums of Squares (Year² entered first)
  SS(C. Total) = 203748.80
  SS(Year²) = 188977.13; Year² explains 92.8%
  SS(Year | Year²) = 14588.13; Year adds 7.1%

Sums of Squares (shared variation)
  SS(C. Total) = 203748.80
  SS(shared) = 172895.80, 84.8%
  SS(Year | Year²) = 14588.13, 7.1%
  SS(Year² | Year) = 16264.87, 8.0%

Sums of Squares ((Year – 1900) entered first)
  SS(C. Total) = 203748.80
  SS(Year – 1900) = 187300.39; (Year – 1900) explains 91.9%
  SS((Year – 1900)² | (Year – 1900)) = 16264.87; (Year – 1900)² adds 8.0%

Sums of Squares ((Year – 1900)² entered first)
  SS(C. Total) = 203748.80
  SS((Year – 1900)²) = 16264.87; (Year – 1900)² explains 8.0%
  SS((Year – 1900) | (Year – 1900)²) = 187300.39; (Year – 1900) adds 91.9%

Sums of Squares (shared variation after centering)
  SS(C. Total) = 203748.80
  SS((Year – 1900) | (Year – 1900)²) = 187300.39, 91.9%
  SS(shared) = 0.00, 0.0%
  SS((Year – 1900)² | (Year – 1900)) = 16264.87, 8.0%

Effects of Centering
  Year² shares about 85% of the explained variation with Year.
  (Year – 1900)² shares none of the explained variation with (Year – 1900).

Why does this happen?
  The correlation between Year² and Year is very strong and statistically significant: this is multicollinearity.
  The correlation between (Year – 1900)² and (Year – 1900) is zero: there is no linear relationship between them.

What about 1940 & 1950?
  The predictions for 1940 and 1950 are much higher than the actual population values. Why?
  Can we add a term to the model that could account for this?
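As a check on the centering argument above, the short sketch below computes the correlation between Year and Year², and between (Year – 1900) and (Year – 1900)². The slides do not list the data, so the years used here (1790, 1800, ..., 2010) are an assumption, and the helper name pearson_r is illustrative only.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation from raw sums (exact numerator for integer inputs)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    num = n * sxy - sx * sy
    den = math.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
    return num / den


# Assumption: ten-year census dates from 1790 to 2010 (not stated in the slides).
years = list(range(1790, 2011, 10))
centered = [y - 1900 for y in years]

# Raw predictors: Year and Year^2 are almost perfectly linearly related.
r_raw = pearson_r(years, [y ** 2 for y in years])

# Centered predictors: the values are symmetric about 0, so the
# linear association between x and x^2 cancels out.
r_centered = pearson_r(centered, [c ** 2 for c in centered])

print("corr(Year, Year^2)               =", r_raw)
print("corr(Year - 1900, (Year-1900)^2) =", r_centered)
```

Under this assumption the mean year is exactly 1900, so the centered values are symmetric about zero; that symmetry is what makes the second correlation vanish. Centering at a value other than the mean would reduce the correlation but generally not remove it.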
Dummy Variable
  A dummy or indicator variable can be used to identify individual values or sets of values.
  X = 1 if Year is 1940 or 1950
  X = 0 otherwise

Quadratic with Dummy
  Predicted Population = 75.467 + 1.368*(Year – 1900) + 0.0066577*(Year – 1900)² – 8.947*X
  Note that the other estimated slope coefficients are very close to those in the quadratic model.
  For 1940 and 1950, the prediction is lowered by 8.947 million.

Quadratic
  1940: Actual = 132.165, Predicted = 139.426, Residual = –7.261
  1950: Actual = 151.326, Predicted = 159.129, Residual = –7.803

Quadratic with Dummy
  1940: Actual = 132.165, Predicted = 131.908, Residual = 0.257
  1950: Actual = 151.326, Predicted = 151.583, Residual = –0.257

Change in R²
  Quadratic: R² = 0.9991, 99.91% explained variation
  Quadratic + Dummy: R² = 0.9998, 99.98% explained variation
  Only a small increase.

Significant Improvement?
  Dummy variable X added to the quadratic model: t = –7.25, P-value < 0.0001
  Because the P-value is small, the dummy variable X adds significantly to the quadratic model.

Change in RMSE
  Quadratic: RMSE = 3.029
  Quadratic + Dummy: RMSE = 1.602
  RMSE is reduced quite a bit.

Plot of Residuals
  [Plot: residuals versus Year; residual axis from –5 to 5, Year axis from 1800 to 2000]
  One might detect an up-down-up-down wave.
  Worst predictions are still within 3 or 4 million of the actual population.
  Probably can't do much better.
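The quadratic-with-dummy equation given earlier can be evaluated directly. The sketch below uses the coefficients exactly as printed on the slide; because those are rounded, the results agree with the slide's predicted values only to about the first decimal place. The function names dummy and predict are illustrative, not part of the lecture.

```python
def dummy(year):
    """Indicator X: 1 if Year is 1940 or 1950, 0 otherwise."""
    return 1 if year in (1940, 1950) else 0


def predict(year):
    """Predicted population (millions) from the quadratic-with-dummy fit.

    Coefficients are the rounded values printed on the slide.
    """
    c = year - 1900  # centered year
    return 75.467 + 1.368 * c + 0.0066577 * c ** 2 - 8.947 * dummy(year)


# Actual values for 1940 and 1950 as given on the slides.
actual = {1940: 132.165, 1950: 151.326}

for year, pop in actual.items():
    pred = predict(year)
    print(year, "predicted:", round(pred, 3), "residual:", round(pop - pred, 3))
```

For every year other than 1940 and 1950 the dummy is 0, so the last term drops out and the prediction is the plain quadratic in (Year – 1900).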