Laboratory 7 - Trinity College Dublin

Trinity College, Dublin
Generic Skills Programme
Statistics for Research Students
Introduction to Statistics: Computer Laboratory 7

Laboratory 7: Simple Linear Regression

To complete the laboratory exercise, work your way through this handout, which is self-contained and self-explanatory. Work in pairs (two per machine), and learn from each other. Keep separate logs of your work. The tutor is available to help with technicalities and discuss substantive issues.

Invitations to consider the results of Minitab analysis and their statistical and substantive interpretations are printed in italics. Take some time for this; consult your neighbour or tutor. Enter your responses in a Word document, as if they were draft contributions to a report on the experiment and its analysis.

Topics:
1. A study of drug potency stability
2. Computer Maintenance Costs
3. Review exercise: a study of productivity in a US district post office

Learning Objectives: Be able to
• make and interpret scatter plots, by hand and using Minitab
• enhance scatter plots with regression lines
• calculate a simple linear regression and interpret the standard Minitab output
• explain the interpretation of the linear regression parameters
• make inferences regarding the linear regression parameters
• produce confidence interval curves for average response and prediction interval curves for actual response
• produce values for confidence intervals for average response and prediction intervals for actual response using
  o graphs
  o calculation using the standard formulas
  o calculation using Minitab
• use prediction intervals to assess whether individual responses are exceptional
• explain the basis for standardised deleted residuals
• produce standard diagnostic graphs based on residuals and use residual analysis to assess whether individual responses are exceptional
• enhance the standard diagnostic plot with control limits, adjusted for degrees of freedom
• prepare a brief management report summarising the salient results of a simple linear regression
• make a scatter plot by hand and interpret the result
• use Minitab to make and interpret dotplots and scatterplots as part of an initial data analysis
• interpret Minitab output on "Unusual" observations
• study the effect of deleting exceptional cases on a simple linear regression by compiling a table of results produced by relevant Minitab regression calculations
• use regression prediction to estimate the effect of exceptional circumstances

1 A study of drug potency stability

We will use Minitab to implement the study of drug potency stability discussed as Example 3 in the Course Manual, Chapter 5, pp. 34-41 and also in Mullins (2003), pp. 248-260. In the study, samples of a drug product were stored at a temperature of 25°C and relative humidity 60%. The purpose of the study was to assess the changes in drug potency over time elapsed since original production, with a view to evaluating shelf life claims. Determinations of the potency of the drug were made at seven time points, as shown below, and reported as "percentages of label claim". The data follow.

Month      0      3      6      9      12     18     24
Potency    102.6  100.4  98.4   99.4   99.8   97.8   97.6

Open Minitab, enter the data in the first two columns, with the variable names in the corresponding name cells.

Before proceeding with the analysis, change the Minitab default format for Normal probability plots, as follows:
• from the Tools menu, select Options,
• expand Individual Graphs (click on the + sign),
• click on Probability Plots,
• set Y-Scale Type to "Score", Graph Orientation to "Show raw data on vertical scale",
• click OK.

1.1 Initial data analysis

The first action to take when presented with data such as these is to draw a scatter plot:
• from the Graph menu, select Scatterplot, then Simple, click OK,
• select Potency as the Y variable and Month as the X-variable, click OK.

Describe what you see in terms of pattern and exceptions.
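The trend seen in the scatter plot can be cross-checked outside Minitab. The following is a minimal Python sketch (Python is not part of this laboratory, and the helper name `fit_line` is illustrative only) that computes the least-squares line for the potency data from the standard textbook formulas. It is intended as a check on data entry and on the Minitab output, not as a replacement for the Minitab steps.

```python
# Minimal sketch: least-squares line for the potency data, from textbook formulas.
# The helper name fit_line is illustrative; it is not a Minitab or handout name.

def fit_line(x, y):
    """Return (intercept, slope, s, r_squared) for a simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    syy = sum((yi - y_bar) ** 2 for yi in y)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    s = (sse / (n - 2)) ** 0.5      # residual standard deviation on n - 2 df
    r_squared = 1 - sse / syy       # proportion of variation explained
    return intercept, slope, s, r_squared

month = [0, 3, 6, 9, 12, 18, 24]
potency = [102.6, 100.4, 98.4, 99.4, 99.8, 97.8, 97.6]

b0, b1, s, r2 = fit_line(month, potency)
print(f"fitted line: Potency = {b0:.2f} + ({b1:.4f}) * Month")
print(f"s = {s:.3f} on {len(month) - 2} df, R-squared = {100 * r2:.1f}%")
```

The printed values should agree, to rounding, with the coefficients, S and R-Sq that Minitab reports for the regression of Potency on Month.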
Given the obvious trend, it may help to enhance the chart with a fitted line. Also, given the spacing of the Months, it may help to edit the tick labels on the horizontal axis. To do these,
• press Ctrl+E (or select "Edit last dialog" from the Edit menu),
• select Data View,
• select the Regression tab, then "Linear" model,
• click OK, OK,
• double click the X axis,
• check the Position of Ticks option,
• enter values 0 3 6 etc.,
• click OK.

Review your previous comment.

1.2 Calculate the regression of Potency on Month

• from the Stat menu, select Regression, then Regression,
• select Potency as the Response and Month as the Predictor,
• click on Graphs ..., uncheck any residual plots that are checked (we will return to these later), click OK,
• click on Results, check the "Regression equation etc." option, click OK,
• click OK.

The model underlying this regression may be written as

$$P = \beta_0 + \beta_1 M + \varepsilon, \qquad \varepsilon \sim N(0, \sigma)$$

The fitted line is

$$\hat{P} = \hat{\beta}_0 + \hat{\beta}_1 M$$

What is $\hat{\beta}_0$? What is $\hat{\beta}_1$? What is $\hat{\sigma}$?

Provide a conventional interpretation of $R^2$.

How many degrees of freedom are associated with s? Explain.

Write down a prediction equation, with approximate 95% prediction interval (±2s).

Is the approximation valid in this case? Explain. (Hint: See the "exact" prediction interval formula in Course Manual, Ch. 5, p. 39.)

1.3 Make inferences from the regression results

Note that the number of degrees of freedom for error is small, so that, strictly speaking, critical values for the t distribution are needed for significance tests and confidence intervals, rather than the Normal critical value, 2, used above. To facilitate this, a table of critical values for the t distribution is attached.
Alternatively, use Minitab to calculate a critical value, as follows:
• from the Minitab Calc menu, select Probability Distributions, then select t...,
• check Inverse Cumulative Distribution,
• enter 5 in the degrees of freedom box,
• check Input constant and enter .975 in the relevant box (explain why .975 is relevant, draw a picture),
• click OK.

The result appears in the Session window.

Does the potency degrade with time? Test the hypothesis that the slope is 0, that is, the potency does not degrade with time. Report the result of your test.

What is the estimated decrease in potency per month following manufacture? Provide a confidence interval for the average decrease in potency per month. Relate the confidence interval you have calculated to the previous test.

What is the estimated initial potency of this drug? Provide a confidence interval for average initial potency.

Is the average initial potency significantly different from 100%? Provide a formal report. (See Laboratory 3, page 9, for format).

1.4 Estimate average potency

The fitted regression line may be used to estimate the average potency of samples from this batch at varying times since production. For times within the time frame of this study, this can be done graphically:
• from the Stat menu, select Regression, then Fitted Line Plot,
• select Potency as the Response and Month as the Predictor,
• click on Graphs ..., uncheck any residual plots that are checked (we will return to these later),
• click the Options button, then check the Display Confidence Interval box,
• click OK.

Edit the tick marks on the X axis as in Section 1.1:
• double click the X axis,
• check the Position of Ticks option,
• enter values 0 3 6 etc.,
• click OK.

Gridlines facilitate reading values from the graph. If gridlines are not shown,
• from the Editor menu, select Add, then Gridlines ...,
• check all four tick boxes,
• click OK.

Read the estimated potency at 24 months (2 years), then read the corresponding confidence limits.

Calculate interval for values outside data range

Reading estimated potency is not possible outside the time frame of the graph, e.g., 3 years (36 months). For such cases, calculation is required. The relevant formula is complicated;

$$\hat{P} = \hat{\beta}_0 + \hat{\beta}_1 M \pm t_{.05}\, s \sqrt{\frac{1}{n} + \frac{1}{n}\left(\frac{M - \bar{M}}{s_M}\right)^2} \qquad \text{where } s_M = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(M_i - \bar{M})^2}.$$

(Another version of this formula, yielding identical numerical results, is shown as formula 6.10, p. 257, Mullins (2003).)

Fortunately, Minitab does the calculation. To calculate estimated potencies and confidence intervals corresponding to 2 years (within the graph frame) and 3 years (outside the graph frame),
• enter the two desired times in months, 24 and 36, in Column 3 of the data sheet,
• from the Stat menu, select Regression, then Regression,
• click on Options, enter c3 in the "Prediction intervals for new observations" box,
• click OK, OK.

The confidence intervals are shown in the Session window, headed "95% CI". The intervals headed "95% PI" are prediction intervals, dealt with in Section 1.5 below.

Check (or correct) the values read from the graph for average potency at 2 years.

Comment on the width of the interval for 3 years, relative to that for 2 years.

Can you relate this comparison to the numerical formula above?

Can you relate this comparison to the reported values of SE Fit?

1.5 Check whether an individual case is exceptional

The potency observed at 6 months, 98.4%, (in Row 3 of the data sheet), looks somewhat low on the scatter plot, relative to the fitted line. To check this, we treat it as suspect, delete it from the data set, calculate a prediction interval for 6 month potency from the remaining (nonsuspect) data, and compare the suspect case to the prediction interval. The calculation is done as follows:
• from the Data menu, select Subset Worksheet ...,
• check Specify which rows to exclude,
• check Row numbers, enter 3 in the relevant box,
• click OK,
• from the Stat menu, select Regression, then Regression, then click the Options button,
• enter 6 (for 6 months) in the "Prediction intervals for new observations" box,
• click OK, OK.

The prediction interval is shown in the Session window, headed "95% PI".

Compare the actual potency at 6 months to the prediction interval. Report your conclusion.

Compare the width of the prediction interval to that of the corresponding confidence interval. Can you explain?

Alternative graphical analysis

The comparison can also be done graphically, using the Fitted Line command as follows:
• from the Stat menu, select Regression, then Fitted Line Plot,
• click on Graphs ..., uncheck any residual plots that are checked,
• click the Options button, uncheck Display Confidence Interval, then check Display Prediction Interval,
• click OK, OK.

The resulting plot is a form of control chart; the prediction limits correspond to control limits[1]. Points that appear outside the prediction / control limits are exceptional.

Because Case 3 has been deleted, it does not appear. This can be overcome by adding reference lines as follows:
• from the Editor menu, select Add, then Reference Lines ...,
• enter 98.4 in the box labelled Show reference lines at Y values,
• enter 6 in the box labelled Show reference lines at X values,
• adjust the tick positions to improve appearance,
• click OK.

Report on the comparison.

Outline the correspondences between the graphical and numerical comparisons.

1.6 Residual analysis; check all cases

A more comprehensive approach to the analysis just completed for Case 3 is to use residual analysis to check all cases at once. This may be done using the Graphs option in the Regression command.
Within that option, there is a choice between Regular, Standardized or Deleted Residuals.

[1] Standard Shewhart charts use "3σ limits", with a 0.3% false alarm rate. Here, the prediction limits are used, with a 5% false alarm rate. Recall footnote 4, Laboratory 3, page 8.

Deleted residuals

Using the Deleted Residuals option is exactly analogous to the procedure used above for case 3, where the suspect case was deleted before the relevant critical values for the suspect case (the prediction limits based on the remaining data) were calculated. This is an application of what is referred to as the deletion principle in diagnostic analysis. The deleted residual for a suspect case is a standardised version of the ordinary residual, calculated from the subset of the data with the suspect case deleted.

Because the deleted residuals are standardised, we expect few (5%) of them to exceed $\pm t_{.05}$ so that, if one does, we regard it as exceptional. The logic here is just the same as that associated with a statistical significance test; comparing the standardised deleted residual to $\pm t_{.05}$ amounts to a statistical significance test[2]. As such, it bears the same relationship to the prediction interval test above as the Z test discussed on Page 8 of Laboratory 3 bears to the X-bar control chart test.

Diagnostic analysis

To implement analysis of the residuals here,
• activate the original worksheet; click on it if it is visible or use the Window menu to find it if not,
• re-do the Regression command, this time
  o click on Graphs ...,
  o choose Deleted Residual for the plots,
  o choose Individual plots, check Normal plot of residuals and Residuals versus fits,
  o click OK,
• click on Options and delete any entry from the "Prediction intervals for new observations" box,
• click OK, OK.

As a rough rule of thumb, cases with deleted residuals exceeding 2 in magnitude are regarded as suspect.
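The deletion principle described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Minitab's own code: the helper names are invented for this sketch, and the calculation follows the Section 1.5 procedure literally. Each case in turn is deleted, the line is refitted to the remaining data, and the deleted case's prediction error is divided by its prediction standard error.

```python
# Sketch only: deleted (t) residuals computed by literally deleting each case.
# Helper names are illustrative; Minitab computes the same quantity internally.

def fit_line(x, y):
    """Return (intercept, slope, s, x_bar, sxx) for a simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    s = (sse / (n - 2)) ** 0.5
    return intercept, slope, s, x_bar, sxx

def deleted_t_residuals(x, y):
    """For each case: refit without it, then standardise its prediction error."""
    results = []
    for i in range(len(x)):
        x_rest = x[:i] + x[i + 1:]
        y_rest = y[:i] + y[i + 1:]
        b0, b1, s, x_bar, sxx = fit_line(x_rest, y_rest)
        m = len(x_rest)
        predicted = b0 + b1 * x[i]
        # standard error for predicting one new observation at x[i]
        se_pred = s * (1 + 1 / m + (x[i] - x_bar) ** 2 / sxx) ** 0.5
        results.append((y[i] - predicted) / se_pred)
    return results

month = [0, 3, 6, 9, 12, 18, 24]
potency = [102.6, 100.4, 98.4, 99.4, 99.8, 97.8, 97.6]

t_res = deleted_t_residuals(month, potency)
for case, (m, t) in enumerate(zip(month, t_res), start=1):
    note = "  <- suspect by the rough rule of thumb" if abs(t) > 2 else ""
    print(f"case {case} (month {m:2d}): deleted residual = {t:6.2f}{note}")
```

The values printed should match the deleted residuals plotted by Minitab, which makes it easier to see that the diagnostic plot and the prediction interval check of Section 1.5 are two views of the same calculation.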
Note possible suspect cases.

Use the Brush command to identify suspect cases:
• from the Editor menu, select Brush, note the "brushing palette" that opens in top left,
• move the cursor into the Residual vs Fits plot; it changes to a pointer,
• point at the suspect cases,
• note that their row numbers appear in the brushing palette (the corresponding rows are also marked in the data sheet),
• enhance the brushing palette by selecting Set ID Variables from the Editor menu, then select Month.

[2] Because of this relationship to a t-test, the deleted residuals are sometimes referred to as deleted t residuals. Also, because of Student's association with the t-test (he invented it; see Stuart (2003) p. 365), the deleted residuals are sometimes referred to as the studentised residuals or the studentised deleted residuals.

The rough rule of thumb is itself suspect when sample size is small. To overcome this, enhance the Residuals versus Fits plot by adding horizontal reference lines at ±2.78, the critical t-value for 4 degrees of freedom (why 4?);
• from the Editor menu, select Select to switch from the Brush to the selection arrow,
• from the Editor menu, select Add, then Reference Lines,
• enter -2.78 2.78 in the box labelled Show reference lines at Y values,
• click OK.

Comment on suspect cases.

How does your evaluation of case 3 here compare with your previous analysis of case 3?

Refer to the Normal plot of residuals. What conclusion do you draw from the Normal plot?

N.B. 7 is a very small number of observations on which to base conclusions regarding underlying assumptions.

2 Computer Maintenance Costs

Recall Laboratory 5, Exercise 2, where a building society manager was concerned about the expense of computer maintenance in his branch.
Following a review of likely sources of variation in computer maintenance costs, a study of the relationship between maintenance costs and computer usage in several branches was proposed. Data on maintenance costs were readily available within each branch. It was decided to use the cumulative maintenance charges for the preceding six months. There were no records available on computer usage. Accordingly, the participants in the course decided to conduct a data collection exercise which involved installing logging software on the computers and work stations in each branch and recording the total active time for a week-long period. The logging software was supplied and installed by the maintenance contractor (who had been consulted at an early stage).

The end result was data on maintenance costs and usage from twelve branches of the building society, as shown in the table below. These data became the subject for discussion at a Project Team Meeting. The (very sensible) first step taken was to make a scatter plot of these data.

Branch Number    Maintenance (£)    Usage (hours)
1                1,707              13
2                2,218              10
3                3,049              20
4                3,730              28
5                4,705              32
6                7,928              40
7                3,099              17
8                3,297              24
9                3,993              31
10               2,889              0
11               5,241              40
12               4,095              38

Make a scatter plot by hand: establish the X (horizontal) axis and Y (vertical) axis; determine the range of each variable; determine "nice" scales for the axes accordingly; draw the scales; plot the pairs of points. Comment on the result.

There was general agreement that there was a roughly linear relationship between the two variables, apart from two branches. Scanning the data indicates that these are numbers 6 and 10.
On investigation, it was found that a major problem had occurred in Branch 6 during the previous six months, adding to the maintenance charge, while in Branch 10, there had been a fault on the copy of the logging software installed there, so that no logging took place.

At this point, it was agreed that expert statistical advice was needed before proceeding further. You have been asked to assist in this regard. Your task is to analyse the data appropriately, using Minitab. The data are available in the Maintenance data set in the GenericSkillsData folder; copy and paste into Minitab.

2.1 Initial data analysis

Use Minitab to make dot plots and a scatter plot:
• use the Graph menu to make separate dotplots for the two variables,
• use the Graph menu to make a scatterplot,
  o select Scatterplot, then Simple, click OK,
  o select Maintenance as the Y variable and Usage as the X-variable, click OK,
• use the Brush (Editor menu) to confirm the identity of exceptional cases.

Describe what you see in terms of pattern and exceptions.

What does the scatterplot reveal about pattern and exceptions that the dotplots do not?

2.2 Simple linear regression

Apart from the exceptional cases, a simple linear regression model seems appropriate. Use Minitab to calculate the simple linear regression as follows:
• from the Stat menu, select Regression, then Regression again,
• select Maintenance as the Response and Usage as the Predictor,
• click the Graphs button, uncheck selected plots,
• click OK, OK.

What is $\hat{\beta}_0$? What is $\hat{\beta}_1$?

What is s? How many degrees of freedom are associated with s?

Write down a prediction equation, with approximate 95% prediction interval.

2.3 Diagnostics

Note the "Unusual Observation" identified by Minitab in the Session window. Minitab bases its determination on a comparison of standardised (not deleted) residuals with critical values ±2.
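The flagging rule just described can be sketched as follows. This is a hedged illustration in Python, with an invented helper name, of the usual definition of a standardised residual (the ordinary residual divided by s times the square root of one minus the leverage); it is not Minitab's own code, and it covers only the large-residual part of the check.

```python
# Sketch only: standardised (not deleted) residuals for the maintenance data,
# with the |value| > 2 flagging rule described above. Helper name is illustrative.

def standardised_residuals(x, y):
    """Ordinary residuals from a simple linear regression, each divided by its standard error."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s = (sum(e * e for e in resid) / (n - 2)) ** 0.5
    # leverage of each case in a simple linear regression
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    return [e / (s * (1 - h) ** 0.5) for e, h in zip(resid, leverage)]

usage = [13, 10, 20, 28, 32, 40, 17, 24, 31, 0, 40, 38]
maintenance = [1707, 2218, 3049, 3730, 4705, 7928, 3099, 3297, 3993, 2889, 5241, 4095]

std_res = standardised_residuals(usage, maintenance)
flagged = [branch for branch, r in enumerate(std_res, start=1) if abs(r) > 2]

for branch, r in enumerate(std_res, start=1):
    print(f"branch {branch:2d}: standardised residual = {r:6.2f}")
print("branches with |standardised residual| > 2:", flagged)
```

Comparing the flagged list with the branches that look exceptional on the scatter plot shows why a single pass with standardised residuals may not pick up every exceptional case.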
It is preferable to conduct a more comprehensive diagnostic analysis, as follows:
• from the Stat menu, select Regression, then Regression again,
• select Maintenance as the Response and Usage as the Predictor,
• click the Graphs button, check
  o Deleted Residuals for Plots,
  o Individual Plots,
  o Normal plot of residuals,
  o Residuals versus fits,
  o uncheck other plots,
• click OK, OK,
• from the Editor menu, select Add, select Reference Lines,
• enter the relevant t critical values in the box labelled "Show reference lines at Y values",
• click OK.

Interpret the diagnostic plots.

2.4 Exclude the "Unusual Observation"

To exclude the "unusual observation" as identified by Minitab,
• use the Brush command (Editor menu) to select the "unusual observation" in one of the diagnostic plots,
• use the Subset Worksheet command from the Data menu to exclude the brushed case,
• repeat the regression analysis on the subset.

What is $\hat{\beta}_0$? What is $\hat{\beta}_1$?

What is s? How many degrees of freedom are associated with s?

How do these values compare with previous? How will the change in s affect prediction?

Note the "Unusual Observation" identified by Minitab in the Session window; identify the corresponding points in the diagnostic plots.

Interpret the diagnostic plots.

2.5 Exclude the second "Unusual Observation"

Repeat the exercises in §2.4, having excluded the "Unusual Observation" identified there.

Make a scatter plot including the fitted line based on the reduced data.

Report your conclusions.

3 Review Exercise: a study of productivity in a US district post office

The US Post Office is organised in administrative districts, each served by a major post office. In one such district, a project was undertaken to set up a system for monitoring the cost of handling the mail that passed through the post office.
The main cost involved was labour, which the project team decided to measure in terms of man hours required[3]. Experience indicated that the labour requirement varied with volume of mail handled and that volume itself varied from time to time. Thus, any cost monitoring system would have to allow for variation in volume from time to time. It would be helpful if management knew by how much the labour requirement should change for any given change in volume, for then they could say whether the labour cost incurred in a particular instance, when a particular volume was processed, conformed to that pattern, thus giving themselves a criterion for cost control.

There is a bonus in quantifying the relationship between cost of production and volume in this way; it allows management to predict labour requirements as soon as they have an idea of anticipated volume. Such a prediction capability could have a variety of applications including strategic review, planning and budgeting, evaluating the effects of planned change and of unplanned interventions. For example, management may be able to anticipate the expected increase in volume in the weeks before Christmas and increase the labour force accordingly. Or, in the event of the closure of a mail order company in the district, with a consequent substantial drop in volume, management will be in a position to decide on the extent of redeployment required.

To assist with this project, detailed historical records on man hours and volume for 26 four-week accounting periods over two years were assembled in a table, shown below.

[3] This case dates from 1962-63, when all the relevant workers were male.

3.1 Initial data analysis

The data are available in Excel file US Mail.xls.

Make time series plots and a scatterplot.

Discuss the variation patterns in the data. Identify any exceptional cases.

Propose an approach to analysis.

Data on Mail Processing Hours and Volume for the ABCD Post Office (Fiscal Years 1962-63)

              Fiscal Year 1                      Fiscal Year 2
Four-week     Pieces of Mail   Manhours          Pieces of Mail   Manhours
Accounting    Handled          Used              Handled          Used
Period        (millions)       (thousands)       (millions)       (thousands)
1             157              572               154              569
2             161              570               157              564
3             168              645               164              573
4             186              645               188              667
5             183              645               191              700
6             184              671               180              765
7             268              1053              270              1070
8             180              675               180              637
9             175              670               172              650
10            193              710               184              655
11            184              656               179              665
12            179              640               169              599
13            164              599               160              605

3.2 Simple linear regression

Subset the data to exclude the exceptional cases.

Make a new scatter diagram.

Calculate a simple linear regression.

Prepare a report on the statistical significance of the regression coefficient estimates.

Calculate a confidence interval for marginal change in Manhours for a one unit (1 million) change in Volume for the mail handling process.

Calculate a confidence interval for the change in Manhours corresponding to a 10,000,000 increase in pieces of mail handled.

Write down a prediction equation, with approximate 95% prediction interval.

Produce a graph with fitted line and 95% prediction interval. Identify the position on the graph of the point corresponding to Period 6, Year 2. Comment.

The isolated exceptional case was due to a breakdown in equipment during which staff were on duty but not working. The other two exceptional cases were due to the Christmas rush, when inexperienced sorters were taken on to alleviate the pressure of work.

Use the prediction formula to estimate the loss incurred through equipment breakdown in Period 6, 1963, and to predict the extra manpower requirement during the Christmas period, based on the experience of Period 7, 1962 and 1963.
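The final step, using the fitted line to put a figure on exceptional circumstances, can be sketched in Python as follows. This is an illustration only: the variable names are invented, the three exceptional cases are taken to be Period 7 in both years and Period 6 in Year 2 as described above, and the "excess" is simply observed minus predicted manhours. Note that the Christmas volumes lie well outside the range of the retained data, so those two predictions are extrapolations and should be treated with caution.

```python
# Sketch only: fit the line without the three exceptional periods, then compare
# each exceptional period's observed manhours with the line's prediction.

records = [  # (fiscal year, period, mail volume in millions, manhours in thousands)
    (1, 1, 157, 572), (1, 2, 161, 570), (1, 3, 168, 645), (1, 4, 186, 645),
    (1, 5, 183, 645), (1, 6, 184, 671), (1, 7, 268, 1053), (1, 8, 180, 675),
    (1, 9, 175, 670), (1, 10, 193, 710), (1, 11, 184, 656), (1, 12, 179, 640),
    (1, 13, 164, 599),
    (2, 1, 154, 569), (2, 2, 157, 564), (2, 3, 164, 573), (2, 4, 188, 667),
    (2, 5, 191, 700), (2, 6, 180, 765), (2, 7, 270, 1070), (2, 8, 180, 637),
    (2, 9, 172, 650), (2, 10, 184, 655), (2, 11, 179, 665), (2, 12, 169, 599),
    (2, 13, 160, 605),
]
exceptional = {(1, 7), (2, 7), (2, 6)}   # Christmas rush (both years), breakdown

kept = [r for r in records if (r[0], r[1]) not in exceptional]
held_out = [r for r in records if (r[0], r[1]) in exceptional]

x = [r[2] for r in kept]
y = [r[3] for r in kept]
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
intercept = y_bar - slope * x_bar
sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
s = (sse / (n - 2)) ** 0.5

print(f"Manhours = {intercept:.1f} + {slope:.3f} * Volume, approx. 95% PI: +/- {2 * s:.1f}")

excess = {}
for year, period, volume, hours in held_out:
    predicted = intercept + slope * volume
    excess[(year, period)] = hours - predicted
    print(f"year {year}, period {period}: observed {hours}, "
          f"predicted {predicted:.0f}, excess {hours - predicted:.0f} thousand manhours")
```

The excess for Period 6 of Year 2 is one way to express the loss due to the breakdown; the Minitab prediction interval gives the corresponding margin of error.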
Conclusion

This concludes Laboratory 7. The learning objectives listed at the outset are reproduced here. Check them individually and ensure that you have achieved each one; seek help from the Tutor if necessary.

Learning Objectives: Be able to
• make and interpret scatter plots, by hand and using Minitab
• enhance scatter plots with regression lines
• calculate a simple linear regression and interpret the standard Minitab output
• explain the interpretation of the linear regression parameters
• make inferences regarding the linear regression parameters
• produce confidence interval curves for average response and prediction interval curves for actual response
• produce values for confidence intervals for average response and prediction intervals for actual response using
  o graphs
  o calculation using the standard formulas
  o calculation using Minitab
• use prediction intervals to assess whether individual responses are exceptional
• explain the basis for standardised deleted residuals
• produce standard diagnostic graphs based on residuals and use residual analysis to assess whether individual responses are exceptional
• enhance the standard diagnostic plot with control limits, adjusted for degrees of freedom
• prepare a brief management report summarising the salient results of a simple linear regression
• make a scatter plot by hand and interpret the result
• use Minitab to make and interpret dotplots and scatterplots as part of an initial data analysis
• interpret Minitab output on "Unusual" observations
• study the effect of deleting exceptional cases on a simple linear regression by compiling a table of results produced by relevant Minitab regression calculations
• use regression prediction to estimate the effect of exceptional circumstances

Appendix: Selected critical values of the t-distribution

ν \ α    .25     .10     .05     .02     .01     .002     .001
1        2.41    6.31    12.71   31.82   63.66   318.32   636.61
2        1.60    2.92    4.30    6.96    9.92    22.33    31.60
3        1.42    2.35    3.18    4.54    5.84    10.22    12.92
4        1.34    2.13    2.78    3.75    4.60    7.17     8.61
5        1.30    2.02    2.57    3.36    4.03    5.89     6.87
6        1.27    1.94    2.45    3.14    3.71    5.21     5.96
7        1.25    1.89    2.36    3.00    3.50    4.79     5.41
8        1.24    1.86    2.31    2.90    3.36    4.50     5.04
9        1.23    1.83    2.26    2.82    3.25    4.30     4.78
10       1.22    1.81    2.23    2.76    3.17    4.14     4.59
12       1.21    1.78    2.18    2.68    3.05    3.93     4.32
15       1.20    1.75    2.13    2.60    2.95    3.73     4.07
20       1.18    1.72    2.09    2.53    2.85    3.55     3.85
24       1.18    1.71    2.06    2.49    2.80    3.47     3.75
30       1.17    1.70    2.04    2.46    2.75    3.39     3.65
40       1.17    1.68    2.02    2.42    2.70    3.31     3.55
60       1.16    1.67    2.00    2.39    2.66    3.23     3.46
120      1.16    1.66    1.98    2.36    2.62    3.16     3.37
∞        1.15    1.64    1.96    2.33    2.58    3.09     3.29
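Entries in the table can be reproduced numerically. The sketch below is an illustration in plain Python, not Minitab's algorithm, and the function names are invented: it integrates the t density with Simpson's rule and inverts the cumulative distribution by bisection. The table's column headings are two-tailed probabilities, so the .05 column corresponds to a cumulative probability of 1 - .05/2 = .975, the value entered in Minitab's Inverse Cumulative Distribution dialog in Section 1.3.

```python
# Sketch only: two-tailed t critical values by numerical integration and bisection.
# Function names are illustrative. Slow for extreme tails with very few df.
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(t, df, steps_per_unit=400):
    """P(T <= t) for t >= 0, using symmetry about 0 and Simpson's rule on [0, t]."""
    if t == 0:
        return 0.5
    n = max(2, 2 * math.ceil(t * steps_per_unit / 2))   # even number of intervals
    h = t / n
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    return 0.5 + total * h / 3

def t_critical(alpha, df):
    """Value c with P(|T| > c) = alpha, i.e. cumulative probability 1 - alpha/2."""
    target = 1 - alpha / 2
    lo, hi = 0.0, 1.0
    while t_cdf(hi, df) < target:      # expand the bracket until it contains c
        lo, hi = hi, hi * 2
    for _ in range(40):                # bisection
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

for df in (4, 5):
    print(f"df = {df}: two-tailed 5% critical value = {t_critical(0.05, df):.2f}")
```

The two values printed correspond to the 5 degrees of freedom used for the full potency data and the 4 degrees of freedom that apply once a case has been deleted.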
