Trinity College, Dublin
Generic Skills Programme
Statistics for Research Students
Laboratory 7:
Simple Linear Regression
To complete the laboratory exercise, work your way through this handout, which is
self-contained and self-explanatory. Work in pairs (two per machine), and learn from each other.
Keep separate logs of your work. The tutor is available to help with technicalities and discuss
substantive issues.
Invitations to consider the results of Minitab analysis and their statistical and substantive
interpretations are printed in italics. Take some time for this; consult your neighbour or
tutor. Enter your responses in a Word document, as if draft contributions to a report on
the experiment and its analysis.
Topics:
1. A study of drug potency stability
2. Computer Maintenance Costs
3. Review exercise: a study of productivity in a US district post office
Learning Objectives:
Be able to
• make and interpret scatter plots, by hand and using Minitab
• enhance scatter plots with regression lines
• calculate a simple linear regression and interpret the standard Minitab output
• explain the interpretation of the linear regression parameters
• make inferences regarding the linear regression parameters
• produce confidence interval curves for average response and prediction interval
  curves for actual response
• produce values for confidence intervals for average response and prediction
  intervals for actual response using
  o graphs
  o calculation using the standard formulas
  o calculation using Minitab
• use prediction intervals to assess whether individual responses are exceptional
• explain the basis for standardised deleted residuals
• produce standard diagnostic graphs based on residuals and use residual analysis to
  assess whether individual responses are exceptional
• enhance the standard diagnostic plot with control limits, adjusted for degrees of
  freedom
• prepare a brief management report summarising the salient results of a simple linear
  regression
• make a scatter plot by hand and interpret the result
• use Minitab to make and interpret dotplots and scatterplots as part of an initial data
  analysis
• interpret Minitab output on "Unusual" observations
• study the effect of deleting exceptional cases on a simple linear regression by
  compiling a table of results produced by relevant Minitab regression calculations
• use regression prediction to estimate the effect of exceptional circumstances
1
A study of drug potency stability
We will use Minitab to implement the study of drug potency stability discussed as Example 3 in
the Course Manual, Chapter 5, pp. 34-41 and also in Mullins (2003), pp. 248-260. In the study,
samples of a drug product were stored at a temperature of 25°C and relative humidity 60%.
The purpose of the study was to assess the changes in drug potency over time elapsed since
original production, with a view to evaluating shelf life claims. Determinations of the potency of
the drug were made at seven time points, as shown below, and reported as "percentages of
label claim". The data follow.
Month       0      3      6      9     12     18     24
Potency  102.6  100.4   98.4   99.4   99.8   97.8   97.6
Open Minitab, enter the data in the first two columns, with the variable names in the
corresponding name cells.
Before proceeding with the analysis, change the Minitab default format for Normal probability
plots, as follows:
• from the Tools menu, select Options,
• expand Individual Graphs (click on the + sign),
• click on Probability Plots,
• set Y-Scale Type to "Score", Graph Orientation to "Show raw data on vertical scale",
• click OK.
1.1
Initial data analysis
The first action to take when presented with data such as these is to draw a scatter plot:
• from the Graph menu, select Scatterplot, then Simple, click OK,
• select Potency as the Y variable and Month as the X-variable, click OK.
Describe what you see in terms of pattern and exceptions.
Given the obvious trend, it may help to enhance the chart with a fitted line. Also, given the
spacing of the Months, it may help to edit the tick labels on the horizontal axis. To do these,
• press Ctrl+E (or select "Edit last dialog" from the Edit menu),
• select Data View,
• select the Regression tab, then "Linear" model,
• click OK, OK,
• double click the X axis,
• check the Position of Ticks option,
• enter values 0 3 6 etc.,
• click OK.
Review your previous comment.
1.2
Calculate the regression of Potency on Month
• from the Stat menu, select Regression, then Regression,
• select Potency as the Response and Month as the Predictor,
• click on Graphs ... , uncheck any residual plots that are checked (we will return to these
  later), click OK,
• click on Results, check the "Regression equation etc." option,
• click OK, OK.
The model underlying this regression may be written as
$P = \beta_0 + \beta_1 M + \varepsilon, \qquad \varepsilon \sim N(0, \sigma)$
The fitted line is
$\hat{P} = \hat{\beta}_0 + \hat{\beta}_1 M$
What is $\hat{\beta}_0$?
What is $\hat{\beta}_1$?
What is $\hat{\sigma}$?
Provide a conventional interpretation of $R^2$.
How many degrees of freedom are associated with s? Explain.
Write down a prediction equation, with approximate 95% prediction interval ($\pm 2s$).
Is the approximation valid in this case? Explain.
(Hint: See the "exact" prediction interval formula in Course Manual, Ch. 5, p. 39.)
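As a cross-check on the Minitab output, the same estimates can be computed directly from the standard least squares formulas. The following Python sketch is an illustration only, not part of the Minitab procedure; the function name fit_line and the variable names are assumptions made for the example.

```python
# Drug potency data from the table in this section.
months = [0, 3, 6, 9, 12, 18, 24]
potency = [102.6, 100.4, 98.4, 99.4, 99.8, 97.8, 97.6]


def fit_line(x, y):
    """Least squares fit of y on x; returns intercept, slope, s and R-squared."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    syy = sum((yi - y_bar) ** 2 for yi in y)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual sum of squares about the fitted line.
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    s = (sse / (n - 2)) ** 0.5  # residual standard deviation
    r_squared = 1 - sse / syy
    return intercept, slope, s, r_squared


b0, b1, s, r2 = fit_line(months, potency)
print(f"Fitted line: Potency = {b0:.3f} + ({b1:.4f}) x Month")
print(f"s = {s:.3f}, R-squared = {100 * r2:.1f}%")
```

Compare the printed values with the coefficient table and the summary line in the Minitab Session window.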
1.3
Make inferences from the regression results
Note that the number of degrees of freedom for error is small, so that, strictly speaking, critical
values for the t distribution are needed for significance tests and confidence intervals, rather
than the Normal critical value, 2, used above. To facilitate this, a table of critical values for the
t distribution is attached.
Alternatively, use Minitab to calculate a critical value, as follows:
• from the Minitab Calc menu, select Probability Distributions, then select t...,
• check Inverse Cumulative Distribution,
• enter 5 in the degrees of freedom box,
• check Input constant and enter .975 in the relevant box, (explain why .975 is relevant,
  draw a picture),
• click OK.
The result appears in the Session window.
Does the potency degrade with time? Test the hypothesis that the slope is 0, that is, the
potency does not degrade with time. Report the result of your test.
What is the estimated decrease in potency per month following manufacture? Provide a
confidence interval for the average decrease in potency per month.
Relate the confidence interval you have calculated to the previous test.
What is the estimated initial potency of this drug? Provide a confidence interval for
average initial potency.
Is the average initial potency significantly different from 100%? Provide a formal report.
(See Laboratory 3, page 9, for format).
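The slope test and the confidence intervals asked for above can be sketched in Python as follows. This is an illustrative calculation, not Minitab output. It assumes the two-sided 5% critical value for 5 degrees of freedom, 2.57, read from the attached table, and uses the standard formulas for the standard errors of the slope and intercept; all variable names are invented for the example.

```python
months = [0, 3, 6, 9, 12, 18, 24]
potency = [102.6, 100.4, 98.4, 99.4, 99.8, 97.8, 97.6]
T_CRIT = 2.57  # assumed: two-sided 5% critical value, 5 df, from the attached table

n = len(months)
m_bar = sum(months) / n
p_bar = sum(potency) / n
sxx = sum((m - m_bar) ** 2 for m in months)
sxy = sum((m - m_bar) * (p - p_bar) for m, p in zip(months, potency))
b1 = sxy / sxx
b0 = p_bar - b1 * m_bar
sse = sum((p - b0 - b1 * m) ** 2 for m, p in zip(months, potency))
s = (sse / (n - 2)) ** 0.5

# Standard errors of the slope and intercept estimates.
se_b1 = s / sxx ** 0.5
se_b0 = s * (1 / n + m_bar ** 2 / sxx) ** 0.5

# Test of slope = 0 and 95% confidence interval for the slope.
t_slope = b1 / se_b1
ci_slope = (b1 - T_CRIT * se_b1, b1 + T_CRIT * se_b1)

# Test of initial potency = 100% and 95% confidence interval for the intercept.
t_intercept = (b0 - 100) / se_b0
ci_intercept = (b0 - T_CRIT * se_b0, b0 + T_CRIT * se_b0)

print(f"slope: t = {t_slope:.2f}, 95% CI = ({ci_slope[0]:.3f}, {ci_slope[1]:.3f})")
print(f"intercept vs 100: t = {t_intercept:.2f}, "
      f"95% CI = ({ci_intercept[0]:.2f}, {ci_intercept[1]:.2f})")
```

Each t statistic is compared with the critical value; a confidence interval that excludes the hypothesised value corresponds to a statistically significant test result at the 5% level.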
1.4
Estimate average potency
The fitted regression line may be used to estimate the average potency of samples from this
batch at varying times since production. For times within the time frame of this study, this can
be done graphically:
• from the Stat menu, select Regression, then Fitted Line Plot,
• select Potency as the Response and Month as the Predictor,
• click on Graphs ... , uncheck any residual plots that are checked (we will return to these
  later),
• click the Options button, then check the Display Confidence Interval box,
• click OK.
Edit the tick marks on the X axis as in Section 1.1:
• double click the X axis,
• check the Position of Ticks option,
• enter values 0 3 6 etc.,
• click OK.
Gridlines facilitate reading values from the graph. If gridlines are not shown,
• from the Editor menu, select Add, then Gridlines ... ,
• check all four tick boxes,
• click OK.
Read the estimated potency at 24 months (2 years), then read the corresponding
confidence limits.
Calculate interval for values outside data range
Reading estimated potency is not possible outside the time frame of the graph, e.g., 3 years (36
months). For such cases, calculation is required. The relevant formula is complicated:
$\hat{P} = \hat{\beta}_0 + \hat{\beta}_1 M \pm t_{.05}\, s \sqrt{\frac{1}{n} + \frac{1}{n}\left(\frac{M - \bar{M}}{s_M}\right)^2}$, where $s_M = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (M_i - \bar{M})^2}$.
(Another version of this formula, yielding identical numerical results, is shown as formula 6.10,
p. 257, Mullins (2003).)
Fortunately, Minitab does the calculation. To calculate estimated potencies and confidence
intervals corresponding to 2 years (within the graph frame) and 3 years (outside the graph
frame),
• enter the two desired times in months, 24 and 36, in Column 3 of the data sheet,
• from the Stat menu, select Regression, then Regression,
• click on Options, enter c3 in the "Prediction intervals for new observations" box,
• click OK, OK.
The confidence intervals are shown in the Session window, headed "95% CI". The intervals
headed "95% PI" are prediction intervals, dealt with in Section 1.5 below.
Check (or correct) the values read from the graph for average potency at 2 years.
Comment on the width of the interval for 3 years, relative to that for 2 years.
Can you relate this comparison to the numerical formula above?
Can you relate this comparison to the reported values of SE Fit?
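The confidence interval formula for average potency can also be evaluated directly, which may help in relating the interval widths to the terms of the formula. The Python sketch below is illustrative only; it assumes the critical value 2.57 (5 degrees of freedom, from the attached table), and the function name mean_ci is hypothetical.

```python
months = [0, 3, 6, 9, 12, 18, 24]
potency = [102.6, 100.4, 98.4, 99.4, 99.8, 97.8, 97.6]
T_CRIT = 2.57  # assumed: two-sided 5% critical value, 5 df, from the attached table

n = len(months)
m_bar = sum(months) / n
p_bar = sum(potency) / n
sxx = sum((m - m_bar) ** 2 for m in months)
b1 = sum((m - m_bar) * (p - p_bar) for m, p in zip(months, potency)) / sxx
b0 = p_bar - b1 * m_bar
s = (sum((p - b0 - b1 * m) ** 2 for m, p in zip(months, potency)) / (n - 2)) ** 0.5
s_m = (sxx / n) ** 0.5  # s_M as defined in the handout's formula


def mean_ci(m_new):
    """Fitted average potency at m_new and the half-width of its 95% confidence interval."""
    fit = b0 + b1 * m_new
    se_fit = s * (1 / n + ((m_new - m_bar) / s_m) ** 2 / n) ** 0.5
    return fit, T_CRIT * se_fit


for m_new in (24, 36):
    fit, half_width = mean_ci(m_new)
    print(f"Month {m_new}: fit = {fit:.2f}, 95% CI = "
          f"({fit - half_width:.2f}, {fit + half_width:.2f})")
```

The quantity se_fit in the sketch plays the role of the SE Fit value reported by Minitab; it grows as the new month moves away from the mean of the observed months.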
1.5
Check whether an individual case is exceptional
The potency observed at 6 months, 98.4%, (in Row 3 of the data sheet), looks somewhat low
on the scatter plot, relative to the fitted line. To check this, we treat it as suspect, delete it from
the data set, calculate a prediction interval for 6 month potency from the remaining (non-suspect) data, and compare the suspect case to the prediction interval. The calculation is done
as follows:
• from the Data menu, select Subset Worksheet ... ,
• check Specify which rows to exclude,
• check Row numbers, enter 3 in the relevant box, click OK,
• from the Stat menu, select Regression, then Regression, then click the Options button,
• enter 6 (for 6 months) in the "Prediction intervals for new observations" box,
• click OK, OK.
The prediction interval is shown in the Session window, headed "95% PI".
Compare the actual potency at 6 months to the prediction interval.
Report your conclusion.
Compare the width of the prediction interval to that of the corresponding confidence
interval. Can you explain?
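The delete-and-predict check just described can be sketched in Python. This is an illustration under stated assumptions, not the Minitab calculation itself: it uses the standard prediction interval formula (which adds 1 under the square root of the confidence interval formula, to allow for the variation of an individual response) and the critical value 2.78 for 4 degrees of freedom from the attached table. Variable names are invented for the example.

```python
months = [0, 3, 6, 9, 12, 18, 24]
potency = [102.6, 100.4, 98.4, 99.4, 99.8, 97.8, 97.6]
T_CRIT_4DF = 2.78  # assumed: two-sided 5% critical value, 4 df, from the attached table
suspect = 2        # zero-based index of Row 3 (Month 6, Potency 98.4)

# Refit the line without the suspect case.
m_rest = [m for i, m in enumerate(months) if i != suspect]
p_rest = [p for i, p in enumerate(potency) if i != suspect]
n = len(m_rest)
m_bar = sum(m_rest) / n
p_bar = sum(p_rest) / n
sxx = sum((m - m_bar) ** 2 for m in m_rest)
b1 = sum((m - m_bar) * (p - p_bar) for m, p in zip(m_rest, p_rest)) / sxx
b0 = p_bar - b1 * m_bar
s = (sum((p - b0 - b1 * m) ** 2 for m, p in zip(m_rest, p_rest)) / (n - 2)) ** 0.5

# Prediction interval for an individual potency at the suspect month.
m_new = months[suspect]
pred = b0 + b1 * m_new
se_pred = s * (1 + 1 / n + (m_new - m_bar) ** 2 / sxx) ** 0.5
lower = pred - T_CRIT_4DF * se_pred
upper = pred + T_CRIT_4DF * se_pred

print(f"95% PI at month {m_new}: ({lower:.2f}, {upper:.2f})")
print("observed value inside interval:", lower <= potency[suspect] <= upper)
```

The extra 1 under the square root is also why the prediction interval is wider than the corresponding confidence interval.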
Alternative graphical analysis
The comparison can also be done graphically, using the Fitted Line command as follows:
• from the Stat menu, select Regression, then Fitted Line Plot,
• click on Graphs ... , uncheck any residual plots that are checked,
• click the Options button, uncheck Display Confidence Interval, then check Display
  Prediction Interval,
• click OK, OK.
The resulting plot is a form of control chart; the prediction limits correspond to control limits¹.
Points that appear outside the prediction / control limits are exceptional.
Because Case 3 has been deleted, it does not appear. This can be overcome by adding
reference lines as follows:
• from the Editor menu, select Add, then Reference Lines ... ,
• enter 98.4 in the box labelled Show reference lines at Y values,
• enter 6 in the box labelled Show reference lines at X values,
• adjust the tick positions to improve appearance,
• click OK.
Report on the comparison
Outline the correspondences between the graphical and numerical comparisons.
1.6
Residual analysis; check all cases
A more comprehensive approach to the analysis just completed for Case 3 is to use residual
analysis to check all cases at once. This may be done using the Graphs option in the
Regression command. Within that option, there is a choice between Regular, Standardized or
Deleted Residuals.
¹ Standard Shewhart charts use "3σ limits", with a 0.3% false alarm rate. Here, the prediction limits are
used, with a 5% false alarm rate. Recall footnote 4, Laboratory 3, page 8.
Deleted residuals
Using the Deleted Residuals option is exactly analogous to the procedure used above for case
3, where the suspect case was deleted before the relevant critical values for the suspect case
(the prediction limits based on the remaining data) were calculated. This is an application of
what is referred to as the deletion principle in diagnostic analysis.
The deleted residual for a suspect case is a standardised version of the ordinary residual,
calculated from the subset of the data with the suspect case deleted. Because the deleted
residuals are standardised, we expect few (5%) of them to exceed $\pm t_{.05}$ so that, if one does, we
regard it as exceptional.
The logic here is just the same as that associated with a statistical significance test; comparing
the standardised deleted residual to $\pm t_{.05}$ amounts to a statistical significance test². As such, it
bears the same relationship to the prediction interval test above as the Z test discussed on
Page 8 of Laboratory 3 bears to the X-bar control chart test.
Diagnostic analysis
To implement analysis of the residuals here,
• activate the original worksheet; click on it if it is visible or use the Window menu to find it
  if not,
• re-do the Regression command, this time
  o click on Graphs ... ,
  o choose Deleted Residual for the plots,
  o choose Individual plots, check Normal plot of residuals and Residuals versus fits,
  o click OK,
• click on Options and delete any entry from the "Prediction intervals for new
  observations" box,
• click OK, OK.
As a rough rule of thumb, cases with deleted residuals exceeding 2 are regarded as suspect.
Note possible suspect cases
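The deletion principle can be made concrete by computing each deleted residual "by hand": for each case in turn, refit the line without that case, predict the case from the remaining data, and divide the prediction error by its standard error. The Python sketch below is one way to do this, offered as an illustration rather than a description of Minitab's internal algorithm; the function name deleted_residual is hypothetical.

```python
months = [0, 3, 6, 9, 12, 18, 24]
potency = [102.6, 100.4, 98.4, 99.4, 99.8, 97.8, 97.6]


def deleted_residual(x, y, i):
    """Prediction error for case i from a fit without case i, divided by its standard error."""
    xs = [v for j, v in enumerate(x) if j != i]
    ys = [v for j, v in enumerate(y) if j != i]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((v - x_bar) ** 2 for v in xs)
    b1 = sum((u - x_bar) * (v - y_bar) for u, v in zip(xs, ys)) / sxx
    b0 = y_bar - b1 * x_bar
    s = (sum((v - b0 - b1 * u) ** 2 for u, v in zip(xs, ys)) / (n - 2)) ** 0.5
    # Standard error for predicting an individual response at x[i].
    se_pred = s * (1 + 1 / n + (x[i] - x_bar) ** 2 / sxx) ** 0.5
    return (y[i] - (b0 + b1 * x[i])) / se_pred


deleted = [deleted_residual(months, potency, i) for i in range(len(months))]
for row, (m, d) in enumerate(zip(months, deleted), start=1):
    print(f"Row {row} (Month {m:2d}): deleted residual = {d:6.2f}")
```

The values printed should be comparable with the vertical positions of the points in the Residuals versus Fits plot.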
Use the Brush command to identify suspect cases:
• from the Editor menu, select Brush,
• note the "brushing palette" that opens in top left,
• move the cursor into the Residual vs Fits plot; it changes to a pointer,
• point at the suspect cases, note that their row numbers appear in the brushing palette,
  (the corresponding rows are also marked in the data sheet),
• enhance the brushing palette by selecting Set ID Variables from the Editor menu, then
  select Month.
² Because of this relationship to a t-test, the deleted residuals are sometimes referred to as deleted t
residuals. Also, because of Student's association with the t-test (he invented it; see Stuart (2003) p. 365),
the deleted residuals are sometimes referred to as the studentised residuals or the studentised deleted
residuals.
The rough rule of thumb is itself suspect when sample size is small. To overcome this,
enhance the Residuals versus Fits plot by adding horizontal reference lines at ±2.78, the critical
t-value for 4 degrees of freedom (why 4?):
• from the Editor menu, select Select to switch from the Brush to the selection arrow,
• from the Editor menu, select Add, then Reference Lines,
• enter -2.78 2.78 in the box labelled Show reference lines at Y values,
• click OK.
Comment on suspect cases. How does your evaluation of case 3 here compare with
your previous analysis of case 3?
Refer to the Normal plot of residuals.
What conclusion do you draw from the Normal plot?
N.B. Seven is a very small number of observations on which to base conclusions regarding
underlying assumptions.
2
Computer Maintenance Costs
Recall Laboratory 5, Exercise 2, where a building society manager was concerned about the
expense of computer maintenance in his branch. Following a review of likely sources of
variation in computer maintenance costs, a study of the relationship between maintenance
costs and computer usage in several branches was proposed. Data on maintenance costs
were readily available within each branch. It was decided to use the cumulative maintenance
charges for the preceding six months. There were no records available on computer usage.
Accordingly, the participants in the course decided to conduct a data collection exercise which
involved installing logging software on the computers and work stations in each branch and
recording the total active time for a week-long period. The logging software was supplied and
installed by the maintenance contractor (who had been consulted at an early stage). The end
result was data on maintenance costs and usage from twelve branches of the building society,
as shown in the table below.
These data became the subject for discussion at a Project Team Meeting. The (very sensible)
first step taken was to make a scatter plot of these data.
Branch   Maintenance   Usage
Number   (£)           (hours)
 1       1,707         13
 2       2,218         10
 3       3,049         20
 4       3,730         28
 5       4,705         32
 6       7,928         40
 7       3,099         17
 8       3,297         24
 9       3,993         31
10       2,889          0
11       5,241         40
12       4,095         38
Make a scatter plot by hand:
establish the X (horizontal) axis and Y (vertical) axis;
determine the range of each variable;
determine "nice" scales for the axes accordingly;
draw the scales;
plot the pairs of points.
Comment on the result.
There was general agreement that there was a roughly linear relationship between the two
variables, apart from two branches. Scanning the data indicates that these are numbers 6 and
10. On investigation, it was found that a major problem had occurred in Branch 6 during the
previous six months, adding to the maintenance charge, while in Branch 10, there had been a
fault on the copy of the logging software installed there, so that no logging took place.
At this point, it was agreed that expert statistical advice was needed before proceeding further.
You have been asked to assist in this regard. Your task is to analyse the data appropriately,
using Minitab. The data are available in the Maintenance data set in the GenericSkillsData
folder; copy and paste into Minitab.
2.1
Initial data analysis
Use Minitab to make dot plots and a scatter plot:
• use the Graph menu to make separate dotplots for the two variables,
• use the Graph menu to make a scatterplot,
  o select Scatterplot, then Simple, click OK,
  o select Maintenance as the Y variable and Usage as the X-variable, click OK,
• use the Brush (Editor menu) to confirm the identity of exceptional cases.
Describe what you see in terms of pattern and exceptions.
What does the scatterplot reveal about pattern and exceptions that the dotplots do
not?
2.2
Simple linear regression
Apart from the exceptional cases, a simple linear regression model seems appropriate. Use
Minitab to calculate the simple linear regression as follows:
• from the Stat menu, select Regression, then Regression again,
• select Maintenance as the Response and Usage as the Predictor,
• click the Graphs button, uncheck selected plots,
• click OK, OK.
What is $\hat{\beta}_0$?
What is $\hat{\beta}_1$?
What is s?
How many degrees of freedom are associated with s?
Write down a prediction equation, with approximate 95% prediction interval.
2.3
Diagnostics
Note the "Unusual Observation" identified by Minitab in the Session window. Minitab bases its
determination on a comparison of standardised (not deleted) residuals with critical values ±2. It
is preferable to conduct a more comprehensive diagnostic analysis, as follows:
• from the Stat menu, select Regression, then Regression again,
• select Maintenance as the Response and Usage as the Predictor,
• click the Graphs button, check
  o Deleted Residuals for Plots,
  o Individual Plots,
  o Normal plot of residuals,
  o Residuals versus fits,
  o uncheck other plots,
• click OK, OK,
• from the Editor menu, select Add,
• select Reference Lines,
• enter the relevant t critical values in the box labelled "Show reference lines at Y values",
• click OK.
Interpret the diagnostic plots.
2.4
Exclude the "Unusual Observation"
To exclude the "unusual observation" as identified by Minitab,
• use the Brush command (Editor menu) to select the "unusual observation" in one of the
  diagnostic plots,
• use the Subset Worksheet command from the Data menu to exclude the brushed case,
• repeat the regression analysis on the subset.
What is $\hat{\beta}_0$?
What is $\hat{\beta}_1$?
What is s?
How many degrees of freedom are associated with s?
How do these values compare with previous?
How will the change in s affect prediction?
Note the "Unusual Observation" identified by Minitab in the Session window; identify the
corresponding points in the diagnostic plots.
Interpret the diagnostic plots.
2.5
Exclude the second "Unusual Observation"
Repeat the exercises in §2.4, having excluded the "Unusual Observation" identified there.
Make a scatter plot including the fitted line based on the reduced data.
Report your conclusions.
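The effect of successive exclusions can be summarised in a table of results. The Python sketch below compiles such a table outside Minitab. It assumes that the cases excluded are Branch 6 and then Branch 10, the two branches the investigation found to be exceptional; the observations Minitab actually flags as "Unusual" should be checked in the Session window. The function names fit_line and fit_without are hypothetical.

```python
# Data from the table of twelve branches (Branch 1 first).
usage = [13, 10, 20, 28, 32, 40, 17, 24, 31, 0, 40, 38]
maintenance = [1707, 2218, 3049, 3730, 4705, 7928, 3099, 3297, 3993, 2889, 5241, 4095]


def fit_line(x, y):
    """Least squares fit of y on x; returns intercept, slope, s and degrees of freedom."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((u - x_bar) ** 2 for u in x)
    b1 = sum((u - x_bar) * (v - y_bar) for u, v in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((v - b0 - b1 * u) ** 2 for u, v in zip(x, y))
    return b0, b1, (sse / (n - 2)) ** 0.5, n - 2


def fit_without(excluded_branches):
    """Fit the line after dropping the given branch numbers (1-based)."""
    keep = [i for i in range(len(usage)) if i + 1 not in excluded_branches]
    return fit_line([usage[i] for i in keep], [maintenance[i] for i in keep])


# Assumed sequence of exclusions: none, Branch 6, then Branches 6 and 10.
results = {
    "all 12 branches": fit_without(set()),
    "without Branch 6": fit_without({6}),
    "without Branches 6 and 10": fit_without({6, 10}),
}

print(f"{'Data used':<28}{'Intercept':>10}{'Slope':>9}{'s':>9}{'df':>4}")
for label, (b0, b1, s, df) in results.items():
    print(f"{label:<28}{b0:>10.1f}{b1:>9.2f}{s:>9.1f}{df:>4d}")
```

Setting the three fits side by side shows how much each estimate, and especially s, depends on the exceptional cases.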

3
Review Exercise: a study of productivity in a US district post office
The US Post Office is organised in administrative districts, each served by a major post office.
In one such district, a project was undertaken to set up a system for monitoring the cost of
handling the mail that passed through the post office. The main cost involved was labour,
which the project team decided to measure in terms of man hours required³. Experience
indicated that the labour requirement varied with volume of mail handled and that volume itself
varied from time to time. Thus, any cost monitoring system would have to allow for variation in
volume from time to time. It would be helpful if management knew by how much the labour
requirement should change for any given change in volume, for then they could say whether the
labour cost incurred in a particular instance, when a particular volume was processed,
conformed to that pattern, thus giving themselves a criterion for cost control.
There is a bonus in quantifying the relationship between cost of production and volume in this
way; it allows management to predict labour requirements as soon as they have an idea of
anticipated volume. Such a prediction capability could have a variety of applications including
strategic review, planning and budgeting, evaluating the effects of planned change and of
unplanned interventions. For example, management may be able to anticipate the expected
increase in volume in the weeks before Christmas and increase the labour force accordingly.
Or, in the event of the closure of a mail order company in the district, with a consequent
substantial drop in volume, management will be in a position to decide on the extent of
redeployment required.
To assist with this project, detailed historical records on man hours and volume for 26 four-week
accounting periods over two years were assembled in a table, shown below.
3.1
Initial data analysis
The data are available in Excel file US Mail.xls.
Make time series plots and a scatterplot.
Discuss the variation patterns in the data.
Identify any exceptional cases.
Propose an approach to analysis
Data on Mail Processing Hours and Volume for the ABCD Post Office (Fiscal Years 1962-63)

             Fiscal Year 1                   Fiscal Year 2
Four-week    Pieces of Mail   Manhours       Pieces of Mail   Manhours
Accounting   Handled          Used           Handled          Used
Period       (millions)       (thousands)    (millions)       (thousands)
 1           157               572           154               569
 2           161               570           157               564
 3           168               645           164               573
 4           186               645           188               667
 5           183               645           191               700
 6           184               671           180               765
 7           268              1053           270              1070
 8           180               675           180               637
 9           175               670           172               650
10           193               710           184               655
11           184               656           179               665
12           179               640           169               599
13           164               599           160               605

³ This case dates from 1962-63, when all the relevant workers were male.
3.2
Simple linear regression
Subset the data to exclude the exceptional cases. Make a new scatter diagram. Calculate a
simple linear regression.
Prepare a report on the statistical significance of the regression coefficient
estimates.
Calculate a confidence interval for marginal change in Manhours for a one unit (1
million) change in Volume for the mail handling process.
Calculate a confidence interval for the change in Manhours corresponding to a
10,000,000 increase in pieces of mail handled.
Write down a prediction equation, with approximate 95% prediction interval.
Produce a graph with fitted line and 95% prediction interval. Identify the position on
the graph of the point corresponding to Period 6, Year 2. Comment.
The isolated exceptional case was due to a breakdown in equipment during which staff were on
duty but not working. The other two exceptional cases were due to the Christmas rush, when
inexperienced sorters were taken on to alleviate the pressure of work.
Use the prediction formula to estimate the loss incurred through equipment
breakdown in Period 6, 1963, and to predict the extra manpower requirement during
the Christmas period, based on the experience of Period 7, 1962 and 1963.
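The use of the prediction formula to quantify exceptional circumstances can be sketched in Python. The sketch assumes that the exceptional cases are Period 6 of Year 2 (the equipment breakdown) and Period 7 of both years (the Christmas rush), as described above; it fits the line to the remaining periods and compares actual manhours with the fitted value. It is an illustration of the idea rather than a prescribed solution, and all names are invented for the example.

```python
# Pieces of mail (millions) and manhours (thousands), Periods 1 to 13 of each fiscal year.
pieces = {1: [157, 161, 168, 186, 183, 184, 268, 180, 175, 193, 184, 179, 164],
          2: [154, 157, 164, 188, 191, 180, 270, 180, 172, 184, 179, 169, 160]}
hours = {1: [572, 570, 645, 645, 645, 671, 1053, 675, 670, 710, 656, 640, 599],
         2: [569, 564, 573, 667, 700, 765, 1070, 637, 650, 655, 665, 599, 605]}

# Assumed exceptional cases, as (year, period) pairs.
exceptional = [(1, 7), (2, 6), (2, 7)]

# Collect the non-exceptional cases.
x, y = [], []
for year in (1, 2):
    for period in range(1, 14):
        if (year, period) not in exceptional:
            x.append(pieces[year][period - 1])
            y.append(hours[year][period - 1])

# Least squares fit of manhours on volume.
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((u - x_bar) ** 2 for u in x)
b1 = sum((u - x_bar) * (v - y_bar) for u, v in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar

# Actual minus predicted manhours (thousands) for each exceptional case.
excess = {}
for year, period in exceptional:
    volume = pieces[year][period - 1]
    excess[(year, period)] = hours[year][period - 1] - (b0 + b1 * volume)

print(f"Fitted line: Manhours = {b0:.1f} + {b1:.3f} x Volume")
for (year, period), extra in excess.items():
    print(f"Year {year}, Period {period}: excess manhours = {extra:.0f} thousand")
```

The excess for the breakdown period is an estimate of manhours paid for but lost; the excesses for the Christmas periods indicate the extra manpower to plan for, over and above what the fitted line predicts for that volume.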
Conclusion
This concludes Laboratory 7. The learning objectives listed at the outset are reproduced here.
Check them individually and ensure that you have achieved each one; seek help from the Tutor
if necessary.
Learning Objectives:
Be able to
• make and interpret scatter plots, by hand and using Minitab
• enhance scatter plots with regression lines
• calculate a simple linear regression and interpret the standard Minitab output
• explain the interpretation of the linear regression parameters
• make inferences regarding the linear regression parameters
• produce confidence interval curves for average response and prediction interval
  curves for actual response
• produce values for confidence intervals for average response and prediction
  intervals for actual response using
  o graphs
  o calculation using the standard formulas
  o calculation using Minitab
• use prediction intervals to assess whether individual responses are exceptional
• explain the basis for standardised deleted residuals
• produce standard diagnostic graphs based on residuals and use residual analysis to
  assess whether individual responses are exceptional
• enhance the standard diagnostic plot with control limits, adjusted for degrees of
  freedom
• prepare a brief management report summarising the salient results of a simple linear
  regression
• make a scatter plot by hand and interpret the result
• use Minitab to make and interpret dotplots and scatterplots as part of an initial data
  analysis
• interpret Minitab output on "Unusual" observations
• study the effect of deleting exceptional cases on a simple linear regression by
  compiling a table of results produced by relevant Minitab regression calculations
• use regression prediction to estimate the effect of exceptional circumstances
Appendix:
Selected critical values of the t-distribution

 df     .25     .10     .05     .02     .01    .002    .001
  1    2.41    6.31   12.71   31.82   63.66  318.32  636.61
  2    1.60    2.92    4.30    6.96    9.92   22.33   31.60
  3    1.42    2.35    3.18    4.54    5.84   10.22   12.92
  4    1.34    2.13    2.78    3.75    4.60    7.17    8.61
  5    1.30    2.02    2.57    3.36    4.03    5.89    6.87
  6    1.27    1.94    2.45    3.14    3.71    5.21    5.96
  7    1.25    1.89    2.36    3.00    3.50    4.79    5.41
  8    1.24    1.86    2.31    2.90    3.36    4.50    5.04
  9    1.23    1.83    2.26    2.82    3.25    4.30    4.78
 10    1.22    1.81    2.23    2.76    3.17    4.14    4.59
 12    1.21    1.78    2.18    2.68    3.05    3.93    4.32
 15    1.20    1.75    2.13    2.60    2.95    3.73    4.07
 20    1.18    1.72    2.09    2.53    2.85    3.55    3.85
 24    1.18    1.71    2.06    2.49    2.80    3.47    3.75
 30    1.17    1.70    2.04    2.46    2.75    3.39    3.65
 40    1.17    1.68    2.02    2.42    2.70    3.31    3.55
 60    1.16    1.67    2.00    2.39    2.66    3.23    3.46
120    1.16    1.66    1.98    2.36    2.62    3.16    3.37
  ∞    1.15    1.64    1.96    2.33    2.58    3.09    3.29