Trinity College, Dublin
Diploma in Statistics
Introduction to Regression

Computer Laboratory 2; Feedback

Initial data analysis

Make dotplots of the four key variables.

[Figure: Dotplots of Meter Sales, GNP, RLP, RPC]

Meter Sales forms two homogeneous groups, one varying between 40 and 120, the other varying between 150 and 260. GNP varies between 500 and 1500, with a concentration at the low end of the scale and a homogeneous spread otherwise. RLP varies from 0.8 to 2.1, with a relatively dense subset between 0.8 and 1.2, a relatively sparse subset between 1.4 and 1.8, and two larger values. RPC varies mostly between 0.5 and 0.75, with four lower values between 0.3 and 0.4.

Make time series plots.

[Figure: Time Series Plot of Meter Sales, GNP, RLP, RPC (horizontal axis: Index)]

Meter Sales has a steady linear upward trend until 1977, apart from one or two slight deviations. From 1978 on, it appears to take a downward step, especially in 1979, the year of the industrial dispute. There may be signs of a recovery.

GNP has a series of gradually increasing upward trends, with some tailing off at the end.

RPC and RLP both have series of downward trends with upward steps in between. This is especially noticeable in the case of RPC. The obvious explanation is that nominal prices stayed constant for periods of years while inflation rose, so that real prices declined within those periods. Between periods, the nominal price increased substantially.

Note the big increase in RLP in 1962, followed by a general upward trend, and the even bigger increase in RLP in 1971, followed by a stronger general upward trend. Post Office personnel related these to known changes in government policy regarding postal pricing.
Make a scatterplot matrix.

[Figure: Matrix Plot of Meter Sales, GNP, RLP, RPC]

Meter Sales has a strong positive relationship with GNP and with RLP. Its relationship with RPC is unclear. There appears to be extra variation in Meter Sales at high GNP. GNP and RLP appear strongly positively related. The relationship of each with RPC is unclear.

Regression calculation and interpretation

Regression Analysis: Meter Sales versus GNP, RLP, RPC

The regression equation is
Meter Sales = - 118 + 0.149 GNP + 9.3 RLP + 203 RPC

Predictor      Coef     SE Coef      T       P
Constant    -118.24      28.42    -4.16   0.000
GNP            0.14938    0.03535  4.23   0.000
RLP            9.25      31.08     0.30   0.768
RPC          203.14      48.25     4.21   0.000

S = 22.4829

How do you interpret the t values?

Those for the GNP and RPC coefficients are positive, in line with intuition, and are highly statistically significant. That for the RLP coefficient is not statistically significant.

Note: Assuming the fitted regression was a good fit, this would suggest deleting RLP from the model and refitting to get a simpler model. That would be premature here, as we have not used diagnostics to check the fit of the model.

Use the prediction formula to make predictions for 1984 and 1985.

Forecasts of GNP and inflation were:

              1984      1985
GNP:         + 1.5%    + 1.5%
Inflation:   + 8.6%    + 5.5%

Note that the nominal letter price and nominal phone charge did not change.

To calculate GNP84, add 1.5% of GNP83 to GNP83, that is,

GNP84 = GNP83 + 0.015 × GNP83 = GNP83 × 1.015.

Thus,

GNP84 = GNP83 × 1.015 = 1462.6 × 1.015 = 1484.5

Similarly,

GNP85 = GNP84 × 1.015 = 1484.5 × 1.015 = 1506.8

To calculate RLP84, note that LP84 stays the same while everything else increases by 8.6%, so the real price is the previous real price divided by 1.086.
RLP84 = RLP83 / 1.086 = 1.993 / 1.086 = 1.835
RLP85 = RLP84 / 1.055 = 1.835 / 1.055 = 1.739
RPC84 = RPC83 / 1.086 = 0.651 / 1.086 = 0.599
RPC85 = RPC84 / 1.055 = 0.599 / 1.055 = 0.568

Predicted Sales84 = - 118 + 0.149 GNP84 + 9.3 RLP84 + 203 RPC84 ± 2s
                  = - 118 + 0.149 × 1484.5 + 9.3 × 1.835 + 203 × 0.599 ± 45

Predicted Sales85 = - 118 + 0.149 GNP85 + 9.3 RLP85 + 203 RPC85 ± 2s
                  = - 118 + 0.149 × 1506.8 + 9.3 × 1.739 + 203 × 0.568 ± 45

Year    Predicted Sales    Lower bound    Upper bound
1984        241.9             196.9          286.9
1985        238               193            283

Discuss the value of the prediction interval width in the context of (i) the current level of meter sales, (ii) annual changes in meter sales in recent years and (iii) the value of σ̂_Sales, the estimate of σ based on the Sales data alone.

(i) Prediction limits of ±45, that is, a prediction range of 90, are relatively wide when sales are around 260.

(ii) Over the last 10 years, excluding 1979, sales ranged from 200 to 260, a range of 60. In this context, a prediction range of 90 seems relatively big.

(iii) σ̂_Sales, the standard deviation of prediction with no explanatory variables, equals 65. The residual standard deviation from the regression is 22.5, roughly ⅓ of that. This represents a substantial improvement. However, when considered in context, as in (i) and (ii) above, it is not substantial enough.

N.B. Context is all important when interpreting statistical analysis. Mathematical statisticians tend to be blissfully unaware of this desideratum, and their research, teaching and statistical advice reflect this. Caveat emptor!

Diagnostic analysis of residuals

View the Residuals v Fits plot; click on it if it is visible, else select it from the Window menu.

[Figure: Residuals Versus the Fitted Values (response is Meter Sales); vertical axis: Deleted Residual, horizontal axis: Fitted Value]

Describe any patterns and exceptions that you see.

There is one large negative outlier, exceeding 3 in magnitude.
There are two residuals exceeding 2 but we should not be surprised by this with 35 cases. There is a suggestion that residual spread increases with fitted value.

How does the year 1979 show up? Are there other cases with exceptional residuals? What are the exceptional residual values?

1979 is the year with the large negative residual, value -3.5. The other residuals exceeding 2 in magnitude are 1977 (2.7) and 1981 (-2.2).

Characterise the apparent pattern in residual variation.

Residual variation (that is, variation in the vertical direction) appears to be lower for small fitted values and higher for large fitted values. The fitted values appear to form a series of homogeneous subsets.

How do you explain the subsets of residuals with similar Fits values?

[Figure: Time Series Plot of RPC (horizontal axis: Index), alongside Residuals Versus Fits (response is Meter Sales; vertical axis: Deleted Residual, horizontal axis: Fitted Value)]

They correspond to the groups of values of RPC that reflect constant nominal prices. It appears that meter sales, as estimated by the fitted values, stayed more or less constant within these periods of constant prices. This may be open to economic interpretation.

Select the Normal diagnostic plot.

[Figure: Normal Probability Plot of the Residuals (response is Meter Sales); vertical axis: Deleted Residual, horizontal axis: Score; N 35, AD 0.834, P-Value 0.028]

Describe any patterns and exceptions that you see. How does the year 1979 show up? What do you think of the largest positive residual?

The bulk of the points follow a linear pattern. There are four potential outliers, one of which (bottom left) appears exceptional. This is the 1979 case. The largest positive residual does not appear exceptional in this graph. (Conceivably, it could appear exceptional if 1979 is deleted.)
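The earlier remark that a couple of residuals beyond ±2 is unsurprising with 35 cases can be checked with a rough calculation. The sketch below assumes the deleted residuals behave approximately like independent standard Normal values, which is only an approximation (they are more nearly t-distributed and not strictly independent); the variable names are illustrative and not part of the laboratory.

```python
from statistics import NormalDist

n_cases = 35  # number of years in the data set

# Two-sided tail probability of a standard Normal value beyond +/- 2.
p_beyond_2 = 2.0 * (1.0 - NormalDist().cdf(2.0))

# Expected number of residuals beyond +/- 2 among n_cases (approximation).
expected_count = n_cases * p_beyond_2

# Binomial probability of seeing at least two such residuals by chance,
# treating the cases as independent (an assumption).
p_none = (1.0 - p_beyond_2) ** n_cases
p_one = n_cases * p_beyond_2 * (1.0 - p_beyond_2) ** (n_cases - 1)
p_at_least_two = 1.0 - p_none - p_one

print(f"P(|Z| > 2) = {p_beyond_2:.4f}")
print(f"Expected count beyond +/- 2: {expected_count:.2f}")
print(f"P(at least two beyond +/- 2): {p_at_least_two:.2f}")
```

Under these assumptions between one and two residuals beyond ±2 are expected, so values such as 2.7 and -2.2 are not remarkable on their own, whereas a value near -3.5 is.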
Iterate the analysis

Regression Analysis: Meter Sales versus GNP, RLP, RPC

The regression equation is
Meter Sales = - 101 + 0.201 GNP - 28.0 RLP + 173 RPC

34 cases used, 1 cases contain missing values

Predictor      Coef     SE Coef      T       P
Constant    -100.73      24.78    -4.06   0.000
GNP            0.20145    0.03360  6.00   0.000
RLP          -27.96      28.56    -0.98   0.335
RPC          173.23      42.08     4.12   0.000

S = 19.2045

Compare old and new. Discuss the change in s; what implications does this have for prediction?

s is reduced by a small amount; previous judgements are unchanged.

Discuss changes in the t-values.

The overall pattern is similar. RLP is still insignificant.

Describe and interpret any patterns and exceptions you see in the diagnostic plots. Which is the most exceptional case?

[Figure: Residuals Versus the Fitted Values (response is Meter Sales), alongside Normal Probability Plot of the Residuals (response is Meter Sales); N 34, AD 1.392, P-Value <0.005]

There is a new exceptional case, corresponding to 1981. Its residual, at approximately 3.5 in magnitude, is the largest. There are other residuals that are potentially exceptional.

Delete this case, as above, and repeat the iteration.

Regression Analysis: Meter Sales versus GNP, RLP, RPC

The regression equation is
Meter Sales = - 98.5 + 0.221 GNP - 34.3 RLP + 155 RPC

33 cases used, 2 cases contain missing values

Predictor      Coef     SE Coef      T       P
Constant     -98.46      20.99    -4.69   0.000
GNP            0.22108    0.02898  7.63   0.000
RLP          -34.35      24.25    -1.42   0.167
RPC          155.01      35.99     4.31   0.000

S = 16.2615

[Figure: Residuals Versus the Fitted Values (response is Meter Sales), alongside Normal Probability Plot of the Residuals (response is Meter Sales); N 33, AD 1.370, P-Value <0.005]

Describe the changes on deleting this case and the next step suggested.

The pattern of change is as before; another exceptional case to be deleted.
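To see why the reductions in s leave the earlier judgements unchanged, the prediction allowance of ±2s can be tabulated for the three fits reported so far. This is a minimal sketch using only the S values printed in the regression outputs; the labels and the helper name prediction_half_width are illustrative assumptions.

```python
# Residual standard deviations s reported in the three regression outputs.
s_by_fit = {
    "all 35 cases": 22.4829,
    "1979 deleted": 19.2045,
    "1979 and 1981 deleted": 16.2615,
}


def prediction_half_width(s: float) -> float:
    """Approximate prediction allowance, taken as plus or minus 2s."""
    return 2.0 * s


half_widths = {label: prediction_half_width(s) for label, s in s_by_fit.items()}

for label, hw in half_widths.items():
    # The full prediction range is twice the half-width.
    print(f"{label}: +/- {hw:.1f} (range {2 * hw:.1f})")
```

Even the smallest of these ranges, about 65, is no narrower than the 60-unit range of recent sales used as a yardstick earlier, which is consistent with the judgement that the conclusions about prediction are unchanged.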
Review the initial data analysis

[Figure: Time Series Plot of Meter Sales; Time Series Plot of RPC; Scatterplot of Meter Sales vs GNP; Scatterplot of Meter Sales vs RPC]

What do you deduce from the patterns revealed by the brushing?

The time series plot of RPC shows the successive sets of values corresponding to constant nominal phone charge values identified earlier. There are corresponding sets of points in the Meter Sales vs. RPC scatterplot showing

(a) growing sales within each set as real phone charge (RPC) decreases
(b) growing sales between sets as RPC increases.

The first of these is counterintuitive, but may be explained by the growth of sales as GNP increases. The second is as expected.

The most recent Meter Sales values are subject to substantial variation which is not explained by GNP (≈ constant) and which does not correspond to the pattern of the successive sets of constant phone charges seen in the earlier data points in the Meter Sales vs. RPC scatterplot.

The conclusion is that the behaviour of the Meter Sales process changed in recent years, becoming unstable and subject to substantial unexplained variation. Such a conclusion would need to be discussed with the client. Unless the excessive variation is explained, it appears that the system underlying sales has become unstable and so there is not much prospect of identifying a useful prediction formula.
Modelling earlier data

Regression Analysis: Meter Sales versus GNP, RLP, RPC

The regression equation is
Meter Sales = - 89.7 + 0.280 GNP - 79.8 RLP + 147 RPC

Predictor      Coef     SE Coef      T       P
Constant     -89.70      10.72    -8.37   0.000
GNP            0.28049    0.01619 17.32   0.000
RLP          -79.78      13.90    -5.74   0.000
RPC          147.40      17.14     8.60   0.000

S = 6.54275

[Figure: Residuals Versus the Fitted Values (response is Meter Sales), alongside Normal Probability Plot of the Residuals (response is Meter Sales)]

Comment on the diagnostics.

Both plots are satisfactory, apparently reflecting stable chance variation in the residuals.

Interpret the t-values, including signs (+ or -).

The GNP coefficient is very highly significant, and positive, as expected.
The RLP coefficient has now become highly significant, and negative, as expected.
The RPC coefficient continues to be significant, and positive, as expected.

Comment on the new s value.

s = 6.5 is less than half the best value previously recorded and represents a much more satisfactory value for prediction purposes.

Write down the prediction formula corresponding to this output; include an allowance for prediction error.

Meter Sales = - 89.7 + 0.280 GNP - 79.8 RLP + 147 RPC ± 13

Use this prediction formula to "predict" meter sales for 1983, and compare with the observed meter sales for 1983.

Year    GNP       RLP      RPC
1983    1462.6    1.993    0.651

Predicted Sales    Lower bound    Upper bound    Actual Sales
    257.5             244.4          270.6          259.7

Discuss the suggestion that the meter sales process was "back on track" by 1983, and the advisability of using the prediction formula based on the pre-1976 data for forecasting 1984 and beyond.

Whether this forecasting formula should be used after 1983 is a matter of huge speculation. The only justification for using it in these data is the one observation in 1983. Without substantial additional support, its use would not be recommended.
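The 1983 "prediction" quoted above can be reproduced from the printed output. The sketch below uses the full-precision coefficients from the Coef column and takes the allowance as ±2s; the dictionary layout and the function name predict are illustrative assumptions, not part of the Minitab session.

```python
# Coefficients and residual standard deviation from the regression output
# fitted to the earlier data.
coef = {"Constant": -89.70, "GNP": 0.28049, "RLP": -79.78, "RPC": 147.40}
s = 6.54275

# Explanatory variable values for 1983 and the observed sales.
x_1983 = {"GNP": 1462.6, "RLP": 1.993, "RPC": 0.651}
actual_1983 = 259.7


def predict(coef: dict, x: dict) -> float:
    """Evaluate the fitted linear prediction formula at the values in x."""
    return coef["Constant"] + sum(coef[name] * value for name, value in x.items())


predicted = predict(coef, x_1983)
lower = predicted - 2.0 * s  # lower prediction bound, using +/- 2s
upper = predicted + 2.0 * s  # upper prediction bound, using +/- 2s

print(f"Predicted: {predicted:.1f}, bounds: {lower:.1f} to {upper:.1f}")
print(f"Actual within bounds: {lower <= actual_1983 <= upper}")
```

Using the rounded coefficients from the displayed equation instead gives a value about one unit lower, so the quoted 257.5 appears to rest on the unrounded coefficients.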
Technical Addendum

The nonstandard pattern in the variation of RLP and the interpretation put on the relationship between Meter Sales and RPC, allowing for the effect of GNP, suggest an alternative model which, in theory, more closely reflects the actual relationships. It appears that Meter Sales jumped to a higher level whenever the Government sanctioned an increase in nominal Phone Charge. These jumps can be modelled by adding "indicator" variables (referred to, somewhat unfairly, as "dummy" variables by econometricians) defined to take the value 0 for all years prior to the jump and 1 for all years during and after the jump. Thus, the first jump occurred during 1952, so the corresponding indicator will be 0 from 1949 to 1952 and 1 from 1953 to 1983. Multiplying this explanatory variable by its regression coefficient simply adds 0 to predicted Meter Sales from 1949 to 1952 and adds the value of the coefficient from 1953 to 1983.

Doing this for the four jumps that took place up to 1975 yields

Regression Analysis: Meter Sales versus GNP, RLP, ...

The regression equation is
Meter Sales = 38.9 + 0.159 GNP - 73.5 RLP - 14.4 RPC + 13.4 Jump1953
              + 23.1 Jump1956 + 41.9 Jump1964 + 16.4 Jump1970

Predictor      Coef     SE Coef      T       P
Constant      38.87      51.62     0.75   0.461
GNP            0.15905    0.04881  3.26   0.004
RLP          -73.53      14.76    -4.98   0.000
RPC          -14.36      65.76    -0.22   0.830
Jump1953      13.385      9.688    1.38   0.184
Jump1956      23.110      8.140    2.84   0.011
Jump1964      41.93      15.17     2.76   0.013
Jump1970      16.39      10.19     1.61   0.125

S = 5.42373

Note that the t-value for RPC is negligible so that RPC may be omitted. The variation explained by RPC is captured by the four indicator variables. Also, the s value is lower than before, suggesting that the variation in Meter Sales is better explained by the indicators than by RPC alone.

This is a simple example of how improved models may be derived when more detailed knowledge of the relationships is available. Note that a similar analysis may apply to the relationship of Meter Sales to RLP.
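The construction of the indicator variables described above can be sketched as follows. This is an illustration only: the year range 1949 to 1983 and the four switch-on years are taken from the text and the variable names in the output, while the function names step_indicator and jump_contribution are hypothetical.

```python
# Years covered by the data, 1949 to 1983 inclusive.
years = list(range(1949, 1984))

# First year in which each indicator takes the value 1 (from the output names).
jump_years = [1953, 1956, 1964, 1970]

# Indicator coefficients as printed in the regression output.
jump_coef = {"Jump1953": 13.385, "Jump1956": 23.110,
             "Jump1964": 41.93, "Jump1970": 16.39}


def step_indicator(years: list, first_year_at_one: int) -> list:
    """Return 0 for years before first_year_at_one and 1 from then on."""
    return [1 if year >= first_year_at_one else 0 for year in years]


indicators = {f"Jump{y}": step_indicator(years, y) for y in jump_years}


def jump_contribution(year: int) -> float:
    """Total amount the indicators add to predicted Meter Sales in a year."""
    position = years.index(year)
    return sum(jump_coef[name] * column[position]
               for name, column in indicators.items())


print(indicators["Jump1953"][:6])          # values for 1949 to 1954
print(round(jump_contribution(1960), 3))   # Jump1953 and Jump1956 are active
```

Each indicator shifts the fitted level by its coefficient once it switches on, so the jumps accumulate: by 1960 the first two indicators together add the sum of their two coefficients to the prediction.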