Trinity College, Dublin
Generic Skills Programme
Statistics for Research Students

Laboratory 2: The Normal model for statistical variation

To complete the laboratory exercise, work your way through this handout, which is self-contained and self-explanatory. Work in pairs (two per machine) and learn from each other. Keep separate logs of your work. The tutor is available to help with technicalities and to discuss substantive issues.

Invitations to consider the results of Minitab analysis and their statistical and substantive interpretations are printed in italics. Take some time for this; consult your neighbour or tutor. Enter your responses in a Word document, as if they were draft contributions to a report on the experiment and its analysis.

Topics:
1. Introducing the Normal model
2. Checking the Normal model, introducing simulation
3. Application of the Normal model, calculating nonconformance rates
4. Solving problems on combining Normal variables using simulation

Learning Objectives:
Be able to
- use Minitab to make histograms with fitted Normal curves
- describe and interpret differences between different fitted Normal curves and relate such differences to differences between corresponding means and standard deviations
- use Minitab to produce Normal probability plots
- describe the basis for interpreting Normal probability plots
- use Minitab to produce simulated reference plots for Normal probability plots
- use the Minitab Editor to improve plot readability
- visually assess the goodness of fit of the Normal model
- identify Normal diagnostic plot patterns corresponding to skew distributions
- describe the rationale for applying the log transformation to skew data
- apply the log transformation to skew data
- identify Normal diagnostic plot patterns corresponding to "granular" data
- use computer simulation and Normal diagnostic plots to study the effect of rounding applied to Normal data
- define specification limits and nonconformance rates
- use Minitab to calculate nonconformance rates via frequency distributions
- use Minitab to estimate nonconformance rates from numerical data summaries using the Normal model
- enable commands in the Minitab Session window and use the LET command to calculate cell entries from other cell entries
- define a what-if analysis in the context of reducing nonconformance rates
- use Minitab to implement a simple what-if analysis
- explain and interpret in context the results of the what-if analysis
- use Minitab to calculate frequencies of simple events defined in terms of variables that follow the Normal model
- use computer simulation to estimate frequencies of simple events defined in terms of variables that follow the Normal model
- explain the effect of changing the size of a simulation
- use computer simulation to study the effects of assuming a non-Normal model
- interpret the results of such a study.

1 Introducing the Normal model

In both cases studied in Laboratory 1, the tennis ball durability study and the tennis ball core diameter study, the data in each sample showed a tendency to bunch towards the middle and spread out towards the edges. This behaviour suggests that a Normal distribution may be appropriate as a model for the frequency distribution of these data. If so, the Normal model may be used as an approximation to facilitate calculations relating to frequency distributions.

For many applications, calculations of this kind are relatively simple and can be applied straightforwardly to assist with the study of the systems being modelled. In other situations, the calculations needed become complicated, if not impossible. In such situations, computer simulation of data following the Normal model may be used to study the systems being modelled, with the calculations then being applied to the simulated data rather than to the Normal model formula. Calculations of this kind will be the topic of later sections of this laboratory.
In addition, many of the standard methods of statistical analysis assume that the data available were sampled from a population or process which follows the Normal model. Such methods will be the subject of later laboratories.

If the Normal model is not appropriate, at least approximately, then calculation results or statistical analyses based on the Normal model will be misleading, so it is desirable to have a way of checking the adequacy of the Normal model. A way of doing this will be used in the following section. In this section, the basics of the Normal model and related calculations are introduced via Minitab.

Start Minitab

Follow the instruction in Laboratory 1, page 2, to log in to your PC and start Minitab.

Make histograms with fitted Normal curves

As noted above, the data in the samples studied in Laboratory 1 tended to bunch in the middle and spread towards the edges, suggesting a Normal model. This can be further illustrated by enhancing the relevant histograms with fitted Normal curves. Here, we illustrate using the tennis ball core diameters data.

First, load the data into Minitab:
- navigate to the course data files in the GenericSkillsData folder (see Laboratory 1, page 4),
- open the Diameter.xls file and copy the four data columns,
- paste into Minitab.

Next, make histograms as in Laboratory 1, pages 10-11, this time enhancing with fitted Normal curves and also applying the "maximise the data-to-ink ratio" rule by removing unnecessary axes. To do this,
- in the opening gallery, select With Fit,
- in the Histogram dialog window, click on Scale,
- uncheck all checked boxes in the Y Scale,
- uncheck High Axis in the X Scale,
- select the Gridlines tab, uncheck any checked boxes.

Use the layout tool as before, this time with four rows. Use the Editor to improve the shape as before, clear all unnecessary text as before, edit Y scales as before.
Change the X scales to get more sensible tick marks and extend the scale to accommodate the fitted Normal curve:
- double click the X scale in the Press 1 histogram,
- replace the tick positions by 50 60 70 80 90 100 110 120,
- copy the tick positions (for use with the other presses),
- click OK,
- repeat for the other presses, pasting (rather than typing) the tick positions.

Are the Normal curves appropriate? Discuss the goodness of the fit and departures therefrom.

Show overlaid Normal curves

While there is some evidence of departure from Normality in the Diameter data, it is not particularly strong. Here, we take the view that the Normal approximation is reasonably good. Since using the Normal model simplifies (as with any model), reducing the display to just the Normal curves may be expected to clarify the main features of the data; recall the "maximise data-to-ink ratio" rule referred to in Laboratory 1. This may be achieved using another option:
- from the Graph menu, select Histogram,
- select With Fit and Groups,
- select the four variables as Graph variables,
- check the "Graph variables form groups" option,
- click on the Scale button and deselect Major ticks and Major tick labels for the Y Scale (to reduce subsequent unnecessary clutter),
- click on the Multiple Graphs button and check to Show Graph Variables "Overlaid on the same graph".

The resulting graph has much clutter, with much emphasis on the labelling and summaries and little on the actual curves. This can be improved using the Editor menu:
- point at the graph title and click to highlight,
- from the Editor menu, select Edit Title ... , replace the Title text with "Normal curves for Presses 1 to 4", unbold the font, click OK,
- highlight the subtitle and delete,
- highlight the Y axis title and delete,
- use the Editor to change the X axis tick positions to those used earlier,
- change the X axis title to Diameter, and unbold the X axis title, the legend (top right) and the data summary (middle right),
- click on any of the curves (all are then selected),
- from the Editor menu, select Edit Distribution Fit,
- select Custom in the Lines box, make all lines solid and double the size of the lines,
- if gridlines are shown, point and double click to edit, select the Show tab, uncheck checked options.

Compare the positions of the curves on the X axis; refer to the corresponding mean values.
Compare the spreads of the curves (and corresponding heights); refer to the corresponding standard deviation values.
For each press, discuss the need for improvement in the context of specification limits of 70 (lower specification limit) and 100 (upper specification limit).

2 Checking the Normal model

Making Normal plots

If the Normal model is not appropriate, then calculations and statistical methods that assume the Normal model are not meaningful. It is important to be able to check the validity of the Normal model. A suitable tool for this purpose is the Normal probability plot (or Normal diagnostic plot).
To produce such a plot:
- from the Graph menu, choose Probability Plot, then Single,
- select Press 1 as the graph variable,
- click on the Distribution button, note the default Normal distribution,
- click on the Data Display tab, uncheck Show confidence interval, click OK,
- click on the Scale button,
- in the Axes and Ticks window, check Transpose Y and X (plot raw data on vertical scale) [footnote 1],
- click on the Y-Scale Type tab and select Score as the Y-Scale Type to put standard z-score values on the (now) horizontal axis,
- click on the Gridlines tab and uncheck all boxes,
- click OK, OK.

In this graph, the horizontal axis represents Normal Scores, idealised values expected from data that closely follow the Normal model, while the vertical axis represents the observed data. If the observed data follow the Normal model, then the resulting plotted points should approximate a straight line. On the other hand, if the points do not approximate a straight line, then the observed data do not follow the Normal model. Thus, the degree of Normality of the observed data is indicated by the degree of linearity of the plotted points.

Do you think that these data follow a Normal model? Explain.
Explain the several examples of horizontal strips of points.

Making Normal reference plots

Without an appropriate reference base, it is difficult to judge the degree of Normality of data such as these. Formal tests of Normality will be introduced later in the course; material related to one such test is included in the summary table on the right of the graph. Here, the important technique of model simulation based on random number generation is used to simulate many data sets that follow the Normal model, and probability plots based on these are used as a reference basis for the observed data to hand.
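For readers who want to see what lies behind the horizontal axis of the Normal probability plot described above, the following is a minimal Python sketch of how Normal scores can be computed and paired with the ordered data. It is not part of the Minitab procedure. The plotting positions (i - 0.375)/(n + 0.25) are one common convention and are an assumption here; Minitab's exact choice may differ. The `diameters` values are hypothetical and are not the Press 1 data.

```python
from statistics import NormalDist


def normal_scores(n):
    """Idealised z-values for the ordered observations in a Normal sample of size n.

    Assumes plotting positions (i - 0.375) / (n + 0.25); other conventions exist.
    """
    std_normal = NormalDist()  # standard Normal: mean 0, standard deviation 1
    return [std_normal.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]


def probability_plot_points(data):
    """Pair each Normal score (horizontal axis) with the matching ordered value (vertical axis)."""
    return list(zip(normal_scores(len(data)), sorted(data)))


# Hypothetical coded diameters, for illustration only.
diameters = [84, 91, 77, 88, 84, 95, 80, 84, 102, 73]

points = probability_plot_points(diameters)
for score, value in points:
    print(f"{score:6.2f}  {value}")
```

Plotting the second column against the first gives the kind of display that Minitab produces; the closer the points lie to a straight line, the better the Normal model fits.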
Use the Calc menu to generate the reference data and the Graph menu to produce the reference plots, as follows:
- from the Calc menu, choose Random Data, then Normal,
- generate 186 rows of data,
- store in columns c11-c29, click OK,
- note the new data in the data sheet,
- from the Graph menu, choose Probability Plot, then Single,
- enter c11-c29 as the Graph variables,
- click on the Distribution button, then the Data Display tab, ensure Show confidence interval is unchecked, click OK,
- click on the Multiple Graphs button, select On separate graphs,
- click OK, OK.

Footnote 1: Minitab assigns the data to the horizontal axis and the Normal scores to the vertical axis as default. This is not the usual convention. It is possible to change this Minitab default using the Options command in the Tools menu; click on the "+" beside Individual Graphs, click Probability Plots, under Graph Orientation select Show raw data on vertical scale.

The 19 plots cascade onto the monitor. The original plot is hidden at the top of the cascade; drag it to a free area of the monitor, so that it can be compared to the reference plots. To make the comparison, start with the plot of C29, click on the C28 plot, click on the C27 plot, etc. For each reference plot,
- note any deviations from a straight line,
- compare the pattern in the Press 1 plot to that in each reference plot.

Do you think that the Press 1 data follow a Normal model? Explain.
Produce plots for Presses 2 to 4 and interpret; use the same set of reference plots.

Not all distributions are Normal

In a study of factors related to obesity, the amounts of a particular steroid found in urine samples taken from 100 obese females were recorded. They may be found in an Excel file named Steroids.xls. To check Normality, copy the data to a Minitab worksheet, make a histogram with fitted Normal curve and make a Normal diagnostic plot.
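The idea of a simulated reference set can also be sketched outside Minitab. The Python sketch below generates 19 Normal samples of 186 values, mirroring columns c11-c29, and summarises how straight each probability plot is by the correlation between the ordered data and the Normal scores. That correlation is a numerical stand-in introduced here for illustration; the handout itself relies on visual comparison. The plotting positions, the fixed seed and the lognormal "skew" sample of 100 values are all assumptions, and the skew sample is not the steroid data.

```python
import random
from statistics import NormalDist


def normal_scores(n):
    """Idealised z-values, assuming plotting positions (i - 0.375) / (n + 0.25)."""
    std_normal = NormalDist()
    return [std_normal.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]


def plot_correlation(data):
    """Correlation between ordered data and Normal scores; near 1 means a near-straight plot."""
    xs = normal_scores(len(data))
    ys = sorted(data)
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(2024)  # fixed seed so that the sketch is repeatable

# 19 reference samples of 186 values each, as in columns c11-c29.
reference = [[rng.gauss(0, 1) for _ in range(186)] for _ in range(19)]
reference_r = sorted(plot_correlation(sample) for sample in reference)

# A hypothetical right-skew sample of 100 values, standing in for skew data.
skew_sample = [rng.lognormvariate(0, 1) for _ in range(100)]
skew_r = plot_correlation(skew_sample)

print(f"reference plots: r from {reference_r[0]:.4f} to {reference_r[-1]:.4f}")
print(f"skew sample:     r = {skew_r:.4f}")
```

The spread of the 19 reference values shows how much departure from a straight line arises by chance alone when the Normal model is true; an observed plot is suspect only if it looks worse than the reference plots.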
What pattern of variation do you see in the histogram? How is this reflected in the Normal plot?

A frequent response to non-Normal data is to apply a transformation to the data such that the transformed data follow the Normal model. In the case of skew data, the log transformation is often successful in achieving this; the log function tends to reduce the spread of the larger values and increase the spread of the smaller values, thus inducing greater symmetry. This effect is illustrated in Figure 1. Note how the wide interval between 4 and 5 on the x axis is transformed to a much narrower interval on the log(x) axis, while the narrow interval between 0.1 and 0.3 on the x axis is transformed to a much wider interval on the log(x) axis.

[Figure 1: The log transformation; log(x) plotted against x]

Use the Minitab Calculator to calculate log10(Steroid), check Normality as above.

What conclusions do you draw from these graphs? Compare the graphs of Steroid with those of logSteroid. Was the log transform successful?

Another example

In another study, this time of pregnant women, among several variables measured were their heights in cm. The data may be found in Excel file Heights.xls. There is considerable empirical evidence that human heights should follow the Normal model. To check the validity of the Normal model in this case, copy the data to Minitab, make a histogram with fitted Normal curve and make a Normal diagnostic plot.

What pattern of variation do you see in the histogram? What pattern of variation do you see in the Normal plot? Does the Normal model apply? Discuss if not.

Using computer simulation to study the rounding effect

Minitab can be used to generate data that closely follow the Normal model (and other models also).
Here, we simulate data similar to the height data studied above, verify the Normal model for the simulated data, round the data to whole numbers, as the height data were, and check the validity of the Normal model in that case.

To generate the data,
- from the Calc menu, choose Random Data, then Normal ... ,
- enter 1794 as the Number of rows of data to generate,
- store the results in the next available column,
- enter 162.4 as Mean and 6.284 as Standard deviation,
- click OK,
- name the new column Random Heights.

To round the generated data, in the Calculator,
- store the results in the next available column,
- select the Round function from the list of functions,
- select Random Heights as the "number" and set decimals to 0,
- click OK,
- name the new column Rounded Heights.

Make a histogram with fitted Normal curve and a Normal diagnostic plot for each of the new variables.

Discuss your results, compare with those for the actual heights.
Review your interpretation of the Normal diagnostic plots of the Diameter data.

3 Application of the Normal Model, Calculating Nonconformance Rates

In manufacturing and other areas, product specification frequently involves a pair of numbers between which all values of a key process measurement should fall. These numbers are the Lower and Upper Specification Limits (LSL, USL). Ideally, no measurements will go outside these limits. This is not always achievable, however, so that some proportion of manufacture may not conform to the specification limits and the aim shifts to minimising this proportion, or nonconformance rate.

From limits on tennis ball external diameters specified by the International Tennis Federation, it is possible to derive specification limits for tennis ball core diameters on the coded scale being used here. In this case, the limits 70 to 100 are used [footnote 2].
Here, we first see what the actual nonconformance rates were for the tennis ball cores produced by the four presses in the manufacturer's production line, then check these against what the Normal model would predict and finally use the Normal model to check what reduction in nonconformance rates could be achieved by suitable adjustments to the process.

Footnote 2: Note that a specification range of 100 - 70 = 30, which is at least 30% of the permitted values, is unrealistic. In this case, the original data were "coded"; the two leading digits in each value, which were the same for all the data, were deleted.

Calculate the observed nonconformance rates

Estimates of the process nonconformance rates (proportions of tennis ball cores with diameters outside the specification limits) can be calculated from the data to hand by making a frequency distribution using the Tables command in the Stat menu:
- from the Stat menu, select Tables, then Tally Individual Variables,
- select Press 1 to Press 4 as the Variables,
- check Counts and Cumulative counts,
- click OK.

The results appear in the Session window. Use the cumulative count output to calculate the number of nonconformances as CumCnt(69) + (186 - CumCnt(100)); divide by 186 to get the nonconformance rate (multiply by 100 to express as a rate per cent).

What are the observed nonconformance rates for the four presses? How do they compare? Are they satisfactory?

Use the Normal model to estimate nonconformance rates

Assuming that the Normal model provides an adequate representation of the variation in the data, the Normal curves provide an alternative approach to evaluating nonconformance rates. The big advantage of the Normal model, assuming it provides an adequate fit to the data, is that it allows the variation to be represented by just two numbers, the mean and standard deviation, representing centre and spread, respectively.
Thus, if we calculate values for mean and standard deviation from data, the frequency distribution of the data being modelled can be reproduced reasonably accurately by substituting these values into the appropriate mathematical formula for the Normal model. The relevant formula is the "Cumulative probability" that gives the area to the left of a given value. The area to the right is 1 minus the area to the left. (Draw a Normal curve to illustrate this.)

Here, we use Minitab to do the required calculation, as follows:
- enter the specification limits in the first two rows of Column 5 of the data sheet, name the column Specs,
- from the Calc menu, select Probability Distributions, then Normal Distribution,
- check Cumulative probability,
- enter the Press 1 Mean and Standard deviation as given in the Normal curves graph window,
- enter c5 as the Input column and c6 as Optional storage, click OK,
- label c6 as Rate 1,
- repeat for Presses 2, 3 and 4, using columns C7, C8 and C9 as "Optional storage", labelling the columns Rate 2, Rate 3, Rate 4.

Calculate the nonconformance rates directly or, optionally, use the LET command in the Session window as follows:
- click in the Session window to activate it,
- in the Editor menu, check Enable Commands,
- following the "MTB >" command prompt that appears in the next line of the Session window, type: let C6(3)=C6(1)+1-C6(2)
- press Return, note the result in cell 3 of C6,
- copy the command you just typed and paste after the next command prompt,
- in the copied command, edit C6 to C7, three times, press Return,
- repeat for Presses 3 and 4.

SideNote: The Minitab Calculator is confined to columns and does not apply to individual cells. The LET command in the Session window works for cells.

What are the estimated nonconformance rates for the four presses? How do they compare with each other? How do they compare with the observed rates?
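The arithmetic behind the LET command, area to the left of the lower limit plus one minus area to the left of the upper limit, can be sketched in Python as a cross-check on the Minitab results. This is a minimal illustration using the standard library. The mean of 88 and standard deviation of 9 are hypothetical values chosen for the example; they are not the summaries of any of the four presses, which should be read from the Normal curves graph window.

```python
from statistics import NormalDist

LSL, USL = 70, 100  # lower and upper specification limits on the coded scale


def nonconformance_rate(mean, sd, lsl=LSL, usl=USL):
    """Proportion outside [lsl, usl] under a Normal model with the given mean and sd."""
    model = NormalDist(mean, sd)
    below = model.cdf(lsl)      # cumulative probability at the lower limit
    above = 1 - model.cdf(usl)  # 1 minus cumulative probability at the upper limit
    return below + above        # same arithmetic as: let C6(3)=C6(1)+1-C6(2)


# Hypothetical process summary, for illustration only.
rate = nonconformance_rate(mean=88.0, sd=9.0)
print(f"estimated nonconformance rate: {100 * rate:.1f}%")
```

Substituting each press's own mean and standard deviation should reproduce the values stored in cell 3 of columns C6 to C9.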
Use the Normal model to implement a 'What-If' analysis

The flexibility of the Normal model makes it useful in answering questions such as
- how will the nonconformance rate change in response to a range of changes in centre and spread?
- how will the nonconformance rate change in response to possible changes in customers' specifications?
- what change in centre and reduction in spread are needed to achieve an acceptably low nonconformance rate?
- what change in centre and reduction in spread are needed if the customer demands tighter specifications?

Assuming we know the costs involved in making such improvements to the process, "what-if" analyses of this form are an essential component of rational decision making.

To illustrate, use Minitab to calculate the nonconformance rate expected if
- the process mean can be held at 85, the centre of the specification range, with a "typical" process standard deviation of 8;
- the process standard deviation can be halved to 4:

- press Ctrl+E to recall the Normal Distribution dialog box,
- enter 85 for the mean, 8 for the standard deviation, c10 for the optional storage,
- press Ctrl+E, enter 4 for the standard deviation, c11 for the optional storage,
- calculate the corresponding nonconformance rates.

Discuss the degree of improvement that can be expected by
- centring the process,
- halving the standard deviation.
Discuss the degree of difficulty involved in
- centring the process,
- halving the standard deviation.

More can be learned from more extensive what-if calculation. However, Minitab is not a suitable instrument for extensive calculations of this kind; a fully fledged spreadsheet is needed. Details of such calculations using Excel are given in Stuart (2003), §3.4.

4 Solving Problems on Combining Normal Variables

Exercise 1.5.1, Chapter 1, p. 27 assumes that the time to complete a Ph.D. is a combination of the times to complete four activities, these times following the Normal model with parameters as set out in the table below, in which time is measured in weeks.

Activity                   Mean   Standard Deviation
Literature review (L)        30     8
Problem formulation (P)      10     3
Data collection (D)         120    12
Write up (W)                 16     3

Exercise 1.5.1 required the calculation of the mean and standard deviation of the total time. These were 176 and 15, respectively. Exercise 1.5.1 then required the calculation of the chances that the doctorate will take
(i) less than 3 years (156 weeks),
(ii) more than 3½ years (182 weeks),
(iii) more than 4 years (208 weeks),
using the table of the Normal distribution. The solutions were given on pages 45-47.

Here, you will use Minitab to do the direct calculation and also to do the calculation indirectly using computer simulation. Before proceeding to the calculation, you should illustrate using a Normal curve.

Use Minitab Calculation

To use Minitab to do these calculations,
- enter the times (156, 182, 208) in Column 1 of a new Minitab data sheet, name the column Time,
- select Calc / Probability Distributions / Normal,
- select Cumulative probability,
- enter 176 for Mean, 15 for Standard deviation, C1 for input column, C2 for Optional storage,
- name C2 as P Less,
- use the Calculator (from the Calc menu) to calculate 1 - C2, store result in C3,
- name C3 as P More.

Compare your answers with those found using the standard Normal table.

Use Minitab Simulation

In a relatively simple problem such as this, the Normal calculation is easy and effective. In more complicated problems, where variables are combined in complex ways that may involve complex formulas, differential equations, etc., the calculation required may not be as simple or may not be readily derivable as it was here.
In such cases, the use of computer simulation to simulate the outcomes of very many individuals (Ph.D. candidates in this case) allows us to calculate the average outcome of individuals. This may be done here using Minitab as follows:
- in a new worksheet, enter variable names L, P, D and W (for Literature review, Problem formulation, Data collection and Write up, respectively) in the Name cells of Columns 1 to 4,
- from the Calc menu, select Random Data, then Normal ... ,
- enter 1000 in the Number of rows of data to generate box,
- tab to the Store in column(s) box and select the L column (C1),
- enter Mean 30 and Standard deviation 8, click OK,
- repeat for Columns P, D and W, entering appropriate means and standard deviations.

Describe the contents of the data sheet in terms of typical times of several students; use illustrative graphs and descriptive statistics to assist in formulating your description.

Check on the validity of the data generated:
- make Normal diagnostic plots of the four columns of data.

Comment on the plots with respect to the validity of the Normal model.
Comment on the summary data (means and standard deviations) with respect to the accuracy of the simulation.
How would the results change if you had simulated 10, 100, 1,000,000 individuals?

Complete the exercise by computing the total times and then calculating the proportions of times less than 3 years, more than 3½ years, more than 4 years, as follows:
- name C5 as T,
- use the Calculator to put L + P + D + W in C5; for future use, check the "Assign as a formula" box, click OK,
- from the Stat menu, select Tables, then Tally Individual Variables,
- select T as the variable and check to display Cumulative Percents,
- in the Session window, scroll down the output from the Tally command to find the cumulative percents less than 156, 182, 208, respectively,
- calculate the desired estimated probabilities.
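The simulation steps above, and the direct calculation they are meant to check, can be sketched together in Python. This is an illustrative sketch rather than a replacement for the Minitab exercise. It assumes, as the totals of 176 and 15 quoted from Exercise 1.5.1 imply, that the four activity times are independent, so that means add and variances add. The seed and the function names are arbitrary choices made for the example.

```python
import random
from math import sqrt
from statistics import NormalDist

# (mean, standard deviation) in weeks, from the table in Section 4.
ACTIVITIES = {"L": (30, 8), "P": (10, 3), "D": (120, 12), "W": (16, 3)}


def simulate_totals(n_students, rng):
    """Simulate total completion times T = L + P + D + W for n_students candidates."""
    return [
        sum(rng.gauss(mu, sd) for mu, sd in ACTIVITIES.values())
        for _ in range(n_students)
    ]


def proportions(totals):
    """Observed proportions for the three events of interest."""
    n = len(totals)
    return {
        "less than 156": sum(t < 156 for t in totals) / n,
        "more than 182": sum(t > 182 for t in totals) / n,
        "more than 208": sum(t > 208 for t in totals) / n,
    }


# Direct calculation: means add, and (assuming independence) variances add.
total_mean = sum(mu for mu, _ in ACTIVITIES.values())
total_sd = sqrt(sum(sd ** 2 for _, sd in ACTIVITIES.values()))
model = NormalDist(total_mean, total_sd)
calculated = {
    "less than 156": model.cdf(156),
    "more than 182": 1 - model.cdf(182),
    "more than 208": 1 - model.cdf(208),
}

rng = random.Random(12345)  # fixed seed so that the sketch is repeatable
estimated = proportions(simulate_totals(1000, rng))

print(f"total: mean {total_mean}, sd {total_sd:.2f}")
for event in calculated:
    print(f"{event}: calculated {calculated[event]:.4f}, simulated {estimated[event]:.4f}")
```

Changing the 1000 in the call to `simulate_totals` shows how the agreement between simulated and calculated values depends on the size of the simulation.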
Before interpreting your results, check on the validity of the simulation of T, as above.

Comment on the validity of the simulation of T.
How do your values of the completion time probabilities as estimated from the simulation compare with the calculated values found earlier?

Repeat this exercise, more than once if time permits. Note that generating random data in Columns 1 to 4 will automatically put the totals in Column 5, because you checked the "Assign as a formula" box when using the Calculator. Repeat with the number of rows set at 10,000.

Discuss the results of your repeated simulations.

Calculation for a non-Normal distribution

The Normal model is not always appropriate. For times to a designated event, as here, an alternative model called the Exponential distribution is often more appropriate. This model is often used in reliability analysis in engineering, in survival analysis in medical research and in duration analysis in economics and sociology. The Exponential distribution has the property that its standard deviation is equal to its mean, so that only the mean needs to be specified for most applications. A graph of the Exponential frequency distribution curve, for the case where the mean is 1, is shown in Figure 2. For different mean values, the shape of the curve is the same; only its scale (and hence its centre and spread) differs.

Minitab can be used both to calculate and to simulate the solution to the problem at hand:
- using the same means as earlier, repeat the calculation of the probabilities of various Ph.D. completion times, using the Exponential model instead of the Normal,
- repeat the use of simulation to estimate the probabilities of various Ph.D. completion times, using the Exponential model instead of the Normal, and including a diagnostic check of the validity of the Exponential model for the simulated data.

Discuss your results.
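As one possible reading of the simulation step above, the Python sketch below draws each activity time from an Exponential distribution with the same mean as before and sums the four times. The independence of the activities, the seed and the sample size of 10,000 are assumptions made for the illustration; the handout does not prescribe them.

```python
import random
from statistics import mean

# Activity means in weeks, as in the Normal version of the problem.
ACTIVITY_MEANS = {"L": 30, "P": 10, "D": 120, "W": 16}


def simulate_totals(n_students, rng):
    """Total times when each activity time is Exponential with the given mean."""
    # random.expovariate takes the rate parameter, which is 1 / mean.
    return [
        sum(rng.expovariate(1 / m) for m in ACTIVITY_MEANS.values())
        for _ in range(n_students)
    ]


rng = random.Random(7)  # fixed seed so that the sketch is repeatable
totals = simulate_totals(10_000, rng)
n = len(totals)

estimated = {
    "less than 156": sum(t < 156 for t in totals) / n,
    "more than 182": sum(t > 182 for t in totals) / n,
    "more than 208": sum(t > 208 for t in totals) / n,
}

print(f"mean of simulated totals: {mean(totals):.1f}")
for event, p in estimated.items():
    print(f"{event}: {p:.4f}")
```

The mean of the simulated totals should be close to 176, as in the Normal case, but the spread is much larger, which is what drives the difference in the estimated probabilities.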
How do the calculations and estimations compare?
How do the results using the Exponential model compare with those for the Normal?

[Figure 2: Exponential frequency curve with mean = 1; horizontal axis from 0 to 10]

Top Tip
Simulation aids in
- solving mathematically intractable problems,
- speedy solution of otherwise solvable problems,
- providing assurance with regard to mathematically derived solutions,
- enabling as problem solvers those with limited mathematical skills, and
- communicating solutions to clients.

Conclusion

This concludes Laboratory 2. The learning objectives listed at the outset are reproduced here. Check them individually and ensure that you have achieved each one; seek help from the Tutor if necessary.

Learning Objectives:
Be able to
- use Minitab to make histograms with fitted Normal curves
- describe and interpret differences between different fitted Normal curves and relate such differences to differences between corresponding means and standard deviations
- use Minitab to produce Normal probability plots
- describe the basis for interpreting Normal probability plots
- use Minitab to produce reference plots for Normal probability plots
- use the Minitab Editor to improve plot readability
- visually assess the goodness of fit of the Normal model
- identify Normal diagnostic plot patterns corresponding to skew distributions
- describe the rationale for applying the log transformation to skew data
- apply the log transformation to skew data
- identify Normal diagnostic plot patterns corresponding to granular data
- use computer simulation and Normal diagnostic plots to study the effect of rounding applied to Normal data
- define specification limits and nonconformance rates
- use Minitab to calculate nonconformance rates via frequency distributions
- use Minitab to estimate nonconformance rates from numerical data summaries using the Normal model
- enable commands in the Minitab Session window and use the LET command to calculate cell entries from other cell entries
- define a what-if analysis in the context of reducing nonconformance rates
- use Minitab to implement a simple what-if analysis
- explain and interpret in context the results of the what-if analysis
- use Minitab to calculate frequencies of simple events defined in terms of variables that follow the Normal model
- use computer simulation to estimate frequencies of simple events defined in terms of variables that follow the Normal model
- explain the effect of changing the size of a simulation
- use computer simulation to study the effects of assuming a non-Normal model
- interpret the results of such a study.