Lab 1 97 - Trinity College Dublin

Trinity College, Dublin
Diploma in Statistics
Introduction to Regression
Computer Laboratory 2; Feedback
Initial data analysis
Make dotplots of the four key variables;
[Figure: Dotplots of Meter Sales, GNP, RLP, RPC]
Meter Sales forms two homogeneous groups, one varying between 40 and 120, the other
varying between 150 and 260.
GNP varies between 500 and 1500, with a concentration at the low end of the scale, and
homogeneously spread otherwise.
RLP varies from 0.8 to 2.1, with a relatively dense subset between 0.8 and 1.2, relatively sparse
between 1.4 and 1.8 and two larger values.
RPC varies mostly between 0.5 and 0.75, with four lower values between 0.3 and 0.4.
Make time series plots
[Figure: Time Series Plot of Meter Sales, GNP, RLP, RPC; four panels, each plotted against Index (1 to 35)]
Meter Sales has a steady linear upward trend until 1977, apart from one or two slight deviations.
From 1978 on, it appears to take a downward step, especially in 1979, the year of the industrial
dispute. There may be signs of a recovery.
GNP has a series of gradually increasing upward trends, with some tailing off at the end.
RPC and RLP both have series of downward trends with upward steps in between. This is
especially noticeable in the case of RPC. The obvious explanation is that nominal prices stayed
constant for periods of years, while inflation rose, so that real prices declined within those
periods. Between periods, the nominal price increased substantially. Note the big increase in
RLP in 1962 followed by a general upward trend and the even bigger increase in RLP in 1971
followed by a stronger general upward trend. Post Office personnel related these to known
changes in government policy regarding postal pricing.
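The sawtooth pattern described above can be reproduced with a small sketch. The nominal prices and the 5% inflation rate below are hypothetical values chosen only for illustration; they are not taken from the Post Office data.

```python
# Hypothetical illustration: a nominal price that is held constant for some
# years and then raised, deflated by a steadily rising price level.
nominal_prices = [1.0, 1.0, 1.0, 1.5, 1.5]   # assumed nominal prices, arbitrary units
inflation_rate = 0.05                         # assumed constant annual inflation

price_index = 1.0
real_prices = []
for nominal in nominal_prices:
    real_prices.append(nominal / price_index)  # real price = nominal / price level
    price_index *= 1 + inflation_rate          # price level rises every year

for year, real in enumerate(real_prices, start=1):
    print(f"year {year}: real price = {real:.3f}")
```

The real price falls within each run of constant nominal prices and steps up when the nominal price is raised, which is the pattern seen in RPC and RLP.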
Make a scatterplot matrix
[Figure: Matrix Plot of Meter Sales, GNP, RLP, RPC]
Meter Sales has a strong positive relationship with GNP and with RLP. Its relationship with
RPC is unclear. There appears to be extra variation in Meter Sales at high GNP.
GNP and RLP appear strongly positively related. The relationship of each with RPC is unclear.
Regression calculation and interpretation
Regression Analysis: Meter Sales versus GNP, RLP, RPC
The regression equation is
Meter Sales = - 118 + 0.149 GNP + 9.3 RLP + 203 RPC
Predictor       Coef   SE Coef       T       P
Constant     -118.24     28.42   -4.16   0.000
GNP          0.14938   0.03535    4.23   0.000
RLP             9.25     31.08    0.30   0.768
RPC           203.14     48.25    4.21   0.000

S = 22.4829
How do you interpret the t values?
Those for the GNP and RPC coefficients are positive, in line with intuition, and are highly
statistically significant. That for the RLP coefficient is not statistically significant.
Note: Assuming the fitted regression was a good fit, this would suggest deleting RLP from the
model and refitting to get a simpler model. That would be premature here, as we have
not used diagnostics to check the fit of the model.
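As a check on the output above, each t value is the coefficient divided by its standard error. The sketch below recomputes them from the printed Coef and SE Coef columns; the |t| > 2 cut-off is a common rule of thumb assumed here for illustration, not something the output itself specifies.

```python
# Coefficients and standard errors copied from the regression output.
coef = {"Constant": -118.24, "GNP": 0.14938, "RLP": 9.25, "RPC": 203.14}
se_coef = {"Constant": 28.42, "GNP": 0.03535, "RLP": 31.08, "RPC": 48.25}

# t value = coefficient / standard error of the coefficient.
t_values = {name: coef[name] / se_coef[name] for name in coef}

for name, t in t_values.items():
    # Assumed rule of thumb: |t| > 2 is treated as statistically significant.
    verdict = "significant" if abs(t) > 2 else "not significant"
    print(f"{name:8s} t = {t:6.2f}  ({verdict})")
```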
Use the prediction formula to make predictions for 1984 and 1985.
Note that the forecasts of GNP and inflation were:

              1984     1985
GNP:        + 1.5%   + 1.5%
Inflation:  + 8.6%   + 5.5%

and that the nominal letter price and nominal phone charge did not change.
To calculate GNP84, add 1.5% of GNP83 to GNP83, that is,
GNP84 = GNP83 + 0.015 × GNP83 = GNP83 × 1.015.
Thus,
GNP84 = GNP83 × 1.015 = 1462.6 × 1.015 = 1484.5
Similarly,
GNP85 = GNP84 × 1.015 = 1484.5 × 1.015 = 1506.8
To calculate RLP84, note that LP84 stays the same while everything else increases by 8.6%.
RLP84 = RLP83 / 1.086 = 1.993 / 1.086 = 1.835
RLP85 = RLP84 / 1.055 = 1.835 / 1.055 = 1.739
RPC84 = RPC83 / 1.086 = 0.651 / 1.086 = 0.599
RPC85 = RPC84 / 1.055 = 0.599 / 1.055 = 0.568
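The projections above can be sketched in a few lines. The 1983 starting values are those quoted in the handout (RPC83 = 0.651 is the value in the 1983 data row given later); intermediate results are rounded at each step, as in the hand calculation, so that the figures agree.

```python
# Last observed values (1983), as quoted in the handout.
gnp, rlp, rpc = 1462.6, 1.993, 0.651

# Forecast growth in GNP and forecast inflation, by year.
gnp_growth = {1984: 0.015, 1985: 0.015}
inflation = {1984: 0.086, 1985: 0.055}

projected = {}
for year in (1984, 1985):
    # GNP grows by the forecast percentage.
    gnp = round(gnp * (1 + gnp_growth[year]), 1)
    # Nominal prices are unchanged, so real prices shrink by the inflation factor.
    rlp = round(rlp / (1 + inflation[year]), 3)
    rpc = round(rpc / (1 + inflation[year]), 3)
    projected[year] = {"GNP": gnp, "RLP": rlp, "RPC": rpc}

for year, values in projected.items():
    print(year, values)
```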
Predicted Sales84 = - 118 + 0.149 GNP84 + 9.3 RLP84 + 203 RPC84 ± 2s
= - 118 + 0.149 × 1484.5 + 9.3 × 1.835 + 203 × 0.599 ± 45
Predicted Sales85 = - 118 + 0.149 GNP85 + 9.3 RLP85 + 203 RPC85 ± 2s
= - 118 + 0.149 × 1506.8 + 9.3 × 1.739 + 203 × 0.568 ± 45
Year   Predicted Sales   Lower bound   Upper bound
1984             241.9         196.9         286.9
1985             238           193           283
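The table can be reproduced with a short sketch that applies the rounded regression equation and the ± 45 allowance (approximately 2s, with s = 22.4829). The function name `predict_sales` is a label introduced here for illustration only.

```python
def predict_sales(gnp, rlp, rpc, margin=45.0):
    """Point prediction from the rounded regression equation, with +/- margin limits."""
    point = -118 + 0.149 * gnp + 9.3 * rlp + 203 * rpc
    return point, point - margin, point + margin

# Projected explanatory variables for 1984 and 1985, as calculated in the handout.
inputs = {
    1984: (1484.5, 1.835, 0.599),
    1985: (1506.8, 1.739, 0.568),
}

predictions = {year: predict_sales(*x) for year, x in inputs.items()}

for year, (point, lower, upper) in predictions.items():
    print(f"{year}: predicted {point:.1f}, limits {lower:.1f} to {upper:.1f}")
```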
Discuss the value of the prediction interval width in the context of
(i) the current level of meter sales,
(ii) annual changes in meter sales in recent years and
(iii) the value of σ̂_Sales, the estimate of σ based on the Sales data alone.
(i) Prediction limits of ± 45, that is, a prediction range of 90, are relatively wide when sales are
around 260.
(ii) Over the last 10 years, excluding 1979, sales ranged from 200 to 260, a range of 60. In
this context, a prediction range of 90 seems relatively big.
(iii) σ̂_Sales, the standard deviation of prediction with no explanatory variables, equals 65. The
residual standard deviation from the regression is 22.5, roughly ⅓ of that. This represents a
substantial improvement. However, when considered in context, as in (i) and (ii) above,
it is not substantial enough.
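The three comparisons reduce to simple ratios, sketched below using only figures quoted in this section (s = 22.4829, the Sales standard deviation of 65, current sales of about 260 and the recent range of 60).

```python
s_regression = 22.4829   # residual standard deviation from the regression
s_sales = 65.0           # standard deviation of Sales with no explanatory variables
current_sales = 260.0    # approximate current level of meter sales
recent_range = 60.0      # range of sales over the last 10 years, excluding 1979

prediction_range = 2 * (2 * s_regression)      # width of the +/- 2s interval
ratio_to_sales = prediction_range / current_sales
ratio_to_recent_range = prediction_range / recent_range
s_ratio = s_regression / s_sales

print(f"prediction range        = {prediction_range:.0f}")
print(f"range / current sales   = {ratio_to_sales:.2f}")
print(f"range / recent range    = {ratio_to_recent_range:.2f}")
print(f"s / sigma-hat(Sales)    = {s_ratio:.2f}")
```

The interval is about a third as wide as the current sales level and wider than the whole recent range of sales, even though s is only about a third of the raw standard deviation.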
N.B. Context is all-important when interpreting statistical analysis. Mathematical
statisticians tend to be blissfully unaware of this desideratum, and their research,
teaching and statistical advice reflect this. Caveat emptor!
Diagnostic analysis of residuals
View the Residuals v Fits plot; click on it if it is visible, else select it from the Window menu.
[Figure: Residuals Versus the Fitted Values (response is Meter Sales); Deleted Residual vs Fitted Value]
Describe any patterns and exceptions that you see.
There is one large negative outlier, exceeding 3 in magnitude. There are two residuals
exceeding 2 but we should not be surprised by this with 35 cases. There is a suggestion that
residual spread increases with fitted value.
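The remark about not being surprised by two residuals beyond 2 can be checked with a rough calculation. The sketch assumes the standardised residuals behave approximately like standard Normal values; deleted residuals strictly follow a t distribution, so this is only an approximation.

```python
from statistics import NormalDist

n_cases = 35
standard_normal = NormalDist()  # mean 0, standard deviation 1

# Two-sided tail probabilities under the Normal approximation (an assumption).
p_beyond_2 = 2 * (1 - standard_normal.cdf(2.0))
p_beyond_3 = 2 * (1 - standard_normal.cdf(3.0))

expected_beyond_2 = n_cases * p_beyond_2
expected_beyond_3 = n_cases * p_beyond_3

print(f"expected number with |residual| > 2: {expected_beyond_2:.2f}")
print(f"expected number with |residual| > 3: {expected_beyond_3:.2f}")
```

With 35 cases, between one and two residuals beyond 2 are expected by chance, whereas a residual beyond 3 is expected far less than once, which is why the value near -3.5 stands out.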
How does the year 1979 show up? Are there other cases with exceptional residuals?
What are the exceptional residual values?
1979 is the year with the large negative residual, value -3.5. The other residuals exceeding 2 in
magnitude are 1977 (2.7) and 1981 (-2.2).
Characterise the apparent pattern in residual variation.
Residual variation (that is, variation in the vertical direction) appears to be lower for small fitted
values and higher for large fitted values. The fitted values appear to form a series of
homogeneous subsets.
How do you explain the subsets of residuals with similar Fits values?
[Figure: two panels. Left: Time Series Plot of RPC (RPC vs Index). Right: Residuals Versus Fits (response is Meter Sales); Deleted Residual vs Fitted Value.]
They correspond to the groups of values of RPC that reflect constant nominal prices. It
appears that meter sales, as estimated by the fitted values, stayed more or less constant within
these periods of constant prices. This may be open to economic interpretation.
Select the Normal diagnostic plot.
[Figure: Normal Probability Plot of the Residuals (response is Meter Sales); Deleted Residual vs Score; N = 35, AD = 0.834, P-Value = 0.028]
Describe any patterns and exceptions that you see. How does the year 1979 show up?
What do you think of the largest positive residual?
The bulk of the points follow a linear pattern. There are four potential outliers, one of which
(bottom left) appears exceptional. This is the 1979 case.
The largest positive residual does not appear exceptional in this graph. (Conceivably, it could
appear exceptional if 1979 is deleted).
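For readers who want to see where the horizontal "Score" axis comes from, the sketch below computes Normal scores for 35 ordered residuals. The plotting positions (i - 0.375) / (n + 0.25) are one common convention and are an assumption here; the software used in the lab may use a different formula.

```python
from statistics import NormalDist

def normal_scores(n):
    """Approximate expected Normal order statistics for a sample of size n.

    Uses the plotting positions (i - 0.375) / (n + 0.25), an assumed convention.
    """
    standard_normal = NormalDist()
    return [standard_normal.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]

scores = normal_scores(35)
print(f"smallest score: {scores[0]:.2f}")
print(f"median score:   {scores[17]:.2f}")
print(f"largest score:  {scores[-1]:.2f}")
```

Plotting the sorted residuals against these scores gives a roughly straight line when the residuals are approximately Normal; points far off the line at either end, such as 1979, are candidates for exceptional cases.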
Iterate the analysis
Regression Analysis: Meter Sales versus GNP, RLP, RPC
The regression equation is
Meter Sales = - 101 + 0.201 GNP - 28.0 RLP + 173 RPC
34 cases used, 1 cases contain missing values
Predictor       Coef   SE Coef       T       P
Constant     -100.73     24.78   -4.06   0.000
GNP          0.20145   0.03360    6.00   0.000
RLP           -27.96     28.56   -0.98   0.335
RPC           173.23     42.08    4.12   0.000

S = 19.2045
Compare old and new.
Discuss the change in s; what implications does this have for prediction?
s is reduced by a small amount; previous judgements are unchanged.
Discuss changes in the t-values.
The overall pattern is similar. RLP is still insignificant.
Describe and interpret any patterns and exceptions you see in the diagnostic plots.
Which is the most exceptional case?
[Figure: two panels (response is Meter Sales). Left: Residuals Versus the Fitted Values; Deleted Residual vs Fitted Value. Right: Normal Probability Plot of the Residuals; Deleted Residual vs Score; N = 34, AD = 1.392, P-Value < 0.005.]
There is a new exceptional case, corresponding to 1981. Its residual value, at 3.5 approx., is
the biggest in magnitude. There are other residuals that are potentially exceptional.
Delete this case, as above, and repeat the iteration.
Regression Analysis: Meter Sales versus GNP, RLP, RPC
The regression equation is
Meter Sales = - 98.5 + 0.221 GNP - 34.3 RLP + 155 RPC
33 cases used, 2 cases contain missing values
Predictor       Coef   SE Coef       T       P
Constant      -98.46     20.99   -4.69   0.000
GNP          0.22108   0.02898    7.63   0.000
RLP           -34.35     24.25   -1.42   0.167
RPC           155.01     35.99    4.31   0.000

S = 16.2615
[Figure: two panels (response is Meter Sales). Left: Residuals Versus the Fitted Values; Deleted Residual vs Fitted Value. Right: Normal Probability Plot of the Residuals; Deleted Residual vs Score; N = 33, AD = 1.370, P-Value < 0.005.]
Describe the changes on deleting this case and the next step suggested.
The pattern of change is as before; another exceptional case to be deleted.
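To see why deleting cases one at a time changes the practical conclusions so little, the sketch below tabulates the ± 2s prediction margin for the three fits so far, using the S values from the outputs above. The labels are descriptive names introduced here.

```python
# Residual standard deviations (S) from the three regression outputs.
fits = [
    ("all 35 cases", 22.4829),
    ("1979 deleted", 19.2045),
    ("1979 and 1981 deleted", 16.2615),
]

recent_range = 60.0  # range of sales over the last 10 years, excluding 1979

margins = {}
for label, s in fits:
    margins[label] = 2 * s                      # prediction margin of +/- 2s
    width = 2 * margins[label]                  # full width of the prediction interval
    print(f"{label:24s} s = {s:7.4f}  margin = +/- {margins[label]:4.1f}  width = {width:5.1f}")
```

Even after two deletions, the full interval width remains at least as large as the recent range of sales of 60.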
Review the initial data analysis
[Figure: four panels. Time Series Plot of Meter Sales (vs Index); Time Series Plot of RPC (vs Index); Scatterplot of Meter Sales vs GNP; Scatterplot of Meter Sales vs RPC.]
What do you deduce from the patterns revealed by the brushing?
The time series plot of RPC shows the successive sets of values corresponding to constant
nominal phone charge values identified earlier. There are corresponding sets of points in the
Meter Sales vs. RPC scatterplot showing
(a) growing sales within each set as real phone charge (RPC) decreases,
(b) growing sales between sets as RPC increases.
The first of these is counterintuitive, but may be explained by the growth of sales as GNP
increases. The second is as expected.
The most recent Meter Sales values are subject to substantial variation which is not explained
by GNP ( ≈ constant) and which does not correspond to the pattern of the successive sets of
constant phone charges seen in the earlier data points in the Meter Sales vs RPC scatterplot.
The conclusion is that the behaviour of the Meter Sales process changed in recent years,
becoming unstable and subject to substantial unexplained variation.
Such a conclusion would need to be discussed with the client. Unless the excessive variation is
explained, it appears that the system underlying sales has become unstable and so there is not
much prospect of identifying a useful prediction formula.
Modelling earlier data
Regression Analysis: Meter Sales versus GNP, RLP, RPC
The regression equation is
Meter Sales = - 89.7 + 0.280 GNP - 79.8 RLP + 147 RPC
Predictor       Coef   SE Coef       T       P
Constant      -89.70     10.72   -8.37   0.000
GNP          0.28049   0.01619   17.32   0.000
RLP           -79.78     13.90   -5.74   0.000
RPC           147.40     17.14    8.60   0.000

S = 6.54275
[Figure: two panels (response is Meter Sales). Left: Residuals Versus the Fitted Values; Deleted Residual vs Fitted Value. Right: Normal Probability Plot of the Residuals; Deleted Residual vs Score.]
Comment on the diagnostics.
Both plots are satisfactory, apparently reflecting stable chance variation in the residuals.
Interpret the t-values, including signs (+ or –).
The GNP coefficient is very highly significant, and positive, as expected.
The RLP coefficient has now become highly significant, and negative, as expected.
The RPC coefficient continues to be significant, and positive, as expected.
Comment on the new s value.
s = 6.5 is less than half the best value previously recorded and represents a much more
satisfactory value for prediction purposes.
Write down the prediction formula corresponding to this output; include an allowance for
prediction error.
Meter Sales = - 89.7 + 0.280 GNP - 79.8 RLP + 147 RPC ± 13
Use this prediction formula to "predict" meter sales for 1983, compare to the observed
meter sales for 1983.
Year      GNP     RLP     RPC
1983   1462.6   1.993   0.651

Predicted Sales   Lower bound   Upper bound   Actual Sales
          257.5         244.4         270.6          259.7
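The 1983 figures can be reproduced as follows. The sketch uses the full-precision coefficients from the Coef column rather than the rounded equation, since the rounded equation gives a slightly different point prediction; the margin is taken as 2s with s = 6.54275.

```python
# Coefficients and S from the regression fitted to the earlier data.
coef = {"Constant": -89.70, "GNP": 0.28049, "RLP": -79.78, "RPC": 147.40}
s = 6.54275

# Observed 1983 values of the explanatory variables and of meter sales.
x_1983 = {"GNP": 1462.6, "RLP": 1.993, "RPC": 0.651}
actual_sales = 259.7

point = coef["Constant"] + sum(coef[name] * value for name, value in x_1983.items())
lower, upper = point - 2 * s, point + 2 * s

print(f"predicted {point:.1f}, limits {lower:.1f} to {upper:.1f}, actual {actual_sales}")
print("actual within limits:", lower <= actual_sales <= upper)
```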
Discuss the suggestion that the meter sales process was "back on track" by 1983, and
the advisability of using the prediction formula based on the pre-1976 data for
forecasting 1984 and beyond.
Whether this forecasting formula should be used after 1983 is a matter of huge speculation.
The only justification for using it in these data is the one observation in 1983. Without
substantial additional support, its use would not be recommended.
Technical Addendum
The nonstandard pattern in the variation of RLP and the interpretation put on the relationship
between Meter Sales and RPC, allowing for the effect of GNP, suggests an alternative model
which, in theory, more closely reflects the actual relationships.
It appears that Meter Sales jumped to a higher level whenever the Government sanctioned an
increase in nominal Phone Charge. These jumps can be modelled by adding "indicator"
variables (referred to, somewhat unfairly, as "dummy" variables by econometricians) defined to
take the value 0 for all years prior to the jump and 1 for all years during and after the jump.
Thus, the first jump occurred during 1952, so the corresponding indicator will be 0 from 1949 to
1952 and 1 from 1953 to 1983. Multiplying this explanatory variable by its regression coefficient β
simply adds 0 to predicted Meter Sales from 1949 to 1952 and adds β from 1953 to 1983.
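A minimal sketch of how such indicator variables could be constructed is shown below. The jump years are read from the variable names in the handout's regression output (Jump1953, Jump1956, Jump1964, Jump1970); the code is an illustration, not the procedure actually used in the lab.

```python
# Years covered by the data: 1949 to 1983, 35 cases.
years = list(range(1949, 1984))

# First year in which each indicator takes the value 1 (from the variable names).
jump_years = [1953, 1956, 1964, 1970]

# Each indicator is 0 before its jump year and 1 from that year onwards.
indicators = {
    f"Jump{jump_year}": [1 if year >= jump_year else 0 for year in years]
    for jump_year in jump_years
}

for name, values in indicators.items():
    print(name, "".join(str(v) for v in values))
```

Each indicator adds its coefficient to the predicted value in every year where it equals 1, so the four indicators together build up a staircase of level shifts.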
The result of doing this for the four jumps that took place up to 1975 yields
Regression Analysis: Meter Sales versus GNP, RLP, ...
The regression equation is
Meter Sales = 38.9 + 0.159 GNP - 73.5 RLP - 14.4 RPC + 13.4 Jump1953
+ 23.1 Jump1956 + 41.9 Jump1964 + 16.4 Jump1970
Predictor       Coef   SE Coef       T       P
Constant       38.87     51.62    0.75   0.461
GNP          0.15905   0.04881    3.26   0.004
RLP           -73.53     14.76   -4.98   0.000
RPC           -14.36     65.76   -0.22   0.830
Jump1953      13.385     9.688    1.38   0.184
Jump1956      23.110     8.140    2.84   0.011
Jump1964       41.93     15.17    2.76   0.013
Jump1970       16.39     10.19    1.61   0.125
S = 5.42373
Note that the t-value for RPC is negligible so that RPC may be omitted. The variation explained
by RPC is captured by the four indicator variables. Also, the s value is lower than before,
suggesting that the variation in Meter Sales is better explained by the indicators than by RPC
alone.
This is a simple example of how improved models may be derived when more detailed
knowledge of the relationships is available.
Note that a similar analysis may apply to the relationship of Meter Sales to RLP.