Trinity College, Dublin
Generic Skills Programme
Statistics for Research Students
Laboratory 1: Simple Data Analysis using Minitab

To complete the laboratory exercise, work your way through this handout, which is self-contained and self-explanatory. Work in pairs (two per machine), and learn from each other. Keep separate logs of your work. The tutor is available to help with technicalities and discuss substantive issues.

Invitations to consider the results of Minitab analysis and their statistical and substantive interpretations are printed in italics. Take some time for this; consult your neighbour or tutor. Enter your responses in a Word document, as if they were draft contributions to a report on the experiment and its analysis.

Topics:
  1. Basic features of Minitab
  2. Simple data analysis
  3. Simple analysis of a larger data set

Learning Objectives: Be able to
  start Minitab and become familiar with the Minitab menus
  use Minitab context-sensitive help and navigate the Help facility
  enter data in a Minitab data sheet, by hand and by copying from a file
  use Minitab to make dotplots, boxplots and histograms
  recognise the need to simplify graphs for communication purposes
  understand the data ink principle
  use the Minitab graph editor to apply the data ink principle
  provide informative interpretive comments on the results of the graphical analysis
  understand the roles of pattern and exception in interpreting dotplots, boxplots and histograms
  understand the roles of level and spread in comparing samples of measurements
  use the Brush tool to identify exceptional cases in dotplots and boxplots
  understand the relative merits of dotplots, boxplots and histograms for data display
  recognise the range of statistics available for calculation using Minitab
  use Minitab to calculate simple numerical summaries of data
  provide informative interpretive comments on the results of the numerical summaries
  identify and mark exceptional cases for deletion using the Minitab missing value code *
  understand the effects of exceptional cases on different summary statistics
  recognise the limitations exceptional cases place on interpreting summary statistics

Data

The data sets used in the following exercises are stored in Excel files and may be copied into Minitab (and most other statistical software programs). The data for the first example used below are stored in an Excel file named Durability.xls.

In a study of the effect on the strength of tennis balls of a modification to the edge seam, the modification was put into effect for a short period, after which the process was changed back to its original state. Data on strength were collected before, during and after the modification was in effect. The data are in the form of time to breakage under stress (durability), so bigger is better. For convenience, they are presented here in tabular form.

Durability of tennis balls Before, During and After application of a process change

  Before: 34 37 40 30 34 34 34 37 53 69 40 40 53 32 34 40 46 40 40 44
  During: 53 53 69 40 37 53 40 48 60 60 69 48 53 44 60 48 44 53 44 48
  After:  44 48 40 44 40 54 48 34 44 48 32 54 48 40 37 40 40 48 40 37

1. Basic features of Minitab

Starting Minitab

First, log in to your PC, using your usual username and the password supplied in class. Then click Start, Programs, Minitab 15 for Windows, Minitab.

Minitab windows

When you open Minitab, two windows appear: a Worksheet (sometimes referred to as a Data sheet or Data window) and the Session window. The Minitab menu bar and standard toolbar appear above the windows. (Other window types and toolbars are available, but may be ignored for the moment.)

The worksheet looks just like an Excel worksheet, except that it does not have all the bells and whistles of an Excel worksheet.
Note the default name of the worksheet, Worksheet 1, in the title bar above the empty data window. The three asterisks following the name indicate that this is the active worksheet. As with Excel, you can have many worksheets.

Note the icon to the left of the worksheet name. Right-click this to see a list of tasks including actions relevant to the worksheet, for example, renaming the worksheet. Right-click any cell in the worksheet to see a list of tasks relevant to the worksheet contents.

The Session window holds the commands generated by some menus, as well as output from some commands. Minitab started out as a command-driven program. It is now fully menu driven; the menus activate the relevant commands. Although we may not want to see the commands, we still need to see the Session window because of the output it shows.

Entering data in the Worksheet

Later, you will copy the Durability data shown in the table above from an Excel file and paste them into the worksheet. For now, enter the Before data in the worksheet by hand:
  click in the Name cell of the first column (the cell under the column label, C1),
  type "Duration" (without the quotes) as the column name,
  press Enter, to move to the first data cell,
  enter the data in order, 34, 37, 40, 30, etc., pressing Enter each time to move down to the next cell, as in an Excel worksheet column,
  rename the worksheet: right-click the worksheet icon (top left), select Rename Worksheet, type Durability, click OK.

These data will be used to illustrate some of the basic Minitab operations available in the menus.

Minitab Menus

Click on each menu button in turn to see the features available in each. Many will not be meaningful at first sight. Some will be explained in this laboratory, others later.

File

File commands deal with the outside world of the Microsoft Windows system, such as opening, closing and saving data files, importing and exporting data, and printing. The first few commands in the list refer to Projects.
Minitab has a facility for organising related data sets in different worksheets, as in Excel, and with them the associated graphs and other output. Minitab has a facility called Project Manager to handle these. Minitab Projects and the Project Manager are very useful for managing the data and analyses arising in a real research project, hence the names. However, in these laboratories, we will deal with one worksheet at a time, so we will not need the Projects feature.

Now, save your data:
  from the File menu, select Save Current Worksheet,
  navigate to a data folder in a suitable location, e.g., the Desktop or your memory stick,
  click Save.

Use Windows Explorer to check your data folder; note the new Durability file with its Minitab icon.

Edit

Edit commands are for editing data, or for general purpose Windows-style "copying and pasting". Try editing some cells:
  in the data window, select the first two data cells (containing values 34 and 37),
  from the Edit menu, select Clear Cells,
  from the Edit menu, select Delete Cells.

Compare the results of the two commands. Generally, if you are in doubt about which of two apparently similar commands to use, try both.

Now, try copying data from the Durability.xls file and pasting it into the active Minitab worksheet. To access the Excel file,
  click on the Start button in the bottom left hand corner and choose Run...,
  in the dialog box, type \\tholos\shared and click OK,
  in the window that opens, double click on the ST1001 folder,
  double click on the GET folder,
  double click on the GenericSkillsData folder.
The datasets for today's Laboratory are Durability and Diameter.

Access the Durability data:
  click on Durability.xls, then Open,
  copy the three data columns,
  in the Minitab active data window, click in the Name cell for Column 2 (C2),
  from the Minitab Edit menu, select Paste Cells.

Check the correspondence between the Duration data in Column 1 and the Before data in Column 2. Finally, delete the (unwanted) Duration column:
  in the data sheet, click on C1 to select the entire Duration column,
  from the Edit menu, select Delete Cells.

Note the result.

Data

Data commands are concerned with moving and organising data within and between Minitab worksheets. Try reformatting the durability data as a single column of data, with a second column identifying which original column the data come from (a typical format for advanced statistical analysis software):
  from the Data menu, select Stack, then Columns,
  in the resulting dialog box, in the left hand window, drag across Before, During and After to highlight them,
  click the Select button below, note the result in the right hand window,
  in the "Store stacked data in" dialog, select "Column of current worksheet:", enter c4 in the corresponding window,
  enter c5 in the "Store subscripts in:" window,
  uncheck "Use variable names in subscript column",
  click OK,
  name c4 "Duration", name c5 "Sample".

Note that Minitab refers to the sample identifiers (1, 2 and 3) in C5 as "subscripts". This comes from widely used mathematical notation for identifying values in samples. Using Y to denote the variable (in this case, Duration), Y with subscripts i and j, written Yij, denotes value j in sample i where, in this case, i can be 1, 2 or 3 and j can be 1, 2, ... , 20. Y is frequently used in statistical notation to represent a response variable.
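For readers who like to see the stacked layout spelled out, the following is a minimal Python sketch of what Stack produces. It is an illustration only, not Minitab's implementation. It assumes that the Durability table lists the 20 Before values first, then the 20 During values, then the 20 After values; the function name stack and its use_names argument are hypothetical.

```python
# Durability values, assuming the table reads Before, then During, then After.
before = [34, 37, 40, 30, 34, 34, 34, 37, 53, 69,
          40, 40, 53, 32, 34, 40, 46, 40, 40, 44]
during = [53, 53, 69, 40, 37, 53, 40, 48, 60, 60,
          69, 48, 53, 44, 60, 48, 44, 53, 44, 48]
after = [44, 48, 40, 44, 40, 54, 48, 34, 44, 48,
         32, 54, 48, 40, 37, 40, 40, 48, 40, 37]


def stack(columns, use_names=False):
    """Stack several columns into one list of values (like C4)
    plus a parallel list of sample identifiers (like C5)."""
    values, subscripts = [], []
    for i, (name, column) in enumerate(columns.items(), start=1):
        for y in column:  # y plays the role of Yij: value j in sample i
            values.append(y)
            subscripts.append(name if use_names else i)
    return values, subscripts


columns = {"Before": before, "During": during, "After": after}

# Numerical identifiers 1, 2, 3.
duration, sample = stack(columns)

# Sample names as identifiers.
duration_named, sample_named = stack(columns, use_names=True)

print(len(duration), sample[0], sample[20], sample[40])
print(sample_named[0], sample_named[20], sample_named[40])
```

The two lists are kept parallel, so that row k of the stacked data is the pair (duration[k], sample[k]), which is the shape most statistical software expects.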
Here, Y = Duration may be regarded as responding to changes in the manufacturing process, the changes being the modification to the edge seam and its subsequent removal.

Check the correspondence between the unstacked data in Columns 1-3 and the stacked data in Columns 4 and 5.

For greater transparency, use the original names for Columns 1-3 as identifiers in Column 5. This may be done by re-using the Stack command, via the Edit Last Dialog facility:
  from the Edit menu, select Edit Last Dialog (or press Ctrl+E),
  check "Use variable names in subscript column",
  click OK.

Comment. Which version do you prefer: unstacked, stacked with numerical identifiers, or stacked with sample names as identifiers?

You will find later that some commands require the data stacked, others require unstacked data, and some commands require numerical identifiers with stacked data. While the unstacked format may appear more intuitive at an elementary level, the stacked format is the one most widely used in advanced statistical analysis, with both Minitab and other statistical software.

Calc

The first Calc command is a calculator which allows you to calculate more or less complicated functions of your data, such as adding variables, calculating square roots, and many more.
  From the Calc menu, select Calculator and explore the resulting dialog box.
  Click the down-arrow beside the function type box (showing "All functions" by default) and view the functions available under various types.

Other commands implement a range of specialised calculations. Try calculating simple data summaries and storing the results:
  from the Calc menu, select Column Statistics,
  in the resulting dialog box, set Statistic to Mean,
  tab to the Input variable box below,
  highlight Before in the list of column variables on the left,
  click the Select button below,
  click OK.

Note the result that appears in the Session window.
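As a cross-check outside Minitab, the mean can be computed directly from its definition. The sketch below also computes the sample standard deviation (divisor n - 1), which is the next statistic tried in this section. It assumes the Before sample is the first 20 values of the Durability table; the variable names are illustrative.

```python
import math
import statistics

# Before sample, assuming it is the first 20 values of the Durability table.
before = [34, 37, 40, 30, 34, 34, 34, 37, 53, 69,
          40, 40, 53, 32, 34, 40, 46, 40, 40, 44]

n = len(before)

# Mean: the sum of the values divided by the number of values.
mean = sum(before) / n

# Sample standard deviation: square root of the sum of squared
# deviations from the mean, divided by n - 1.
sum_sq_dev = sum((y - mean) ** 2 for y in before)
st_dev = math.sqrt(sum_sq_dev / (n - 1))

print("N =", n)
print("Mean =", round(mean, 2))
print("StDev =", round(st_dev, 2))

# The standard library gives the same answers.
print(statistics.mean(before), round(statistics.stdev(before), 2))
```

Compare the printed values with those in the Session window; small differences in the last decimal place can arise from rounding in the display.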
Repeat using Ctrl+E (Edit Last Dialog), this time selecting Standard deviation as the statistic; note the result.

Other summary statistics for Before and corresponding statistics for During and After may be calculated in this way. However, the process is tedious and more effective solutions are available in the Stat menu.

Try making patterned data; in this case, recreate the numerical "subscripts" created by Stack:
  from the Calc menu, select Make Patterned Data, then Simple Set of Numbers,
  in the resulting dialog box, enter
    Store patterned data in: C6
    From first value: 1
    To last value: 3
    In steps of: 1
    List each value: 20 times
  click OK.

Compare the patterned data with the text "subscripts". Review the entries in the dialog box (press Ctrl+E to view it); what happens with different choices of last value, step, value repeats, sequence repeats? What would you enter to get 20 samples of 3? 6 samples of 10? 10 samples of 6?

Stat

Stat commands implement a range of statistical calculations. Try simple numerical summaries:
  from the Stat menu, select Basic Statistics, then Display Descriptive Statistics,
  highlight Before, During, After, and click Select,
  click OK,
  view the results in the Session window.

Comment on the values of the descriptive statistics appearing in the Session window, particularly with regard to between-sample comparisons.

Minitab provides a wide choice of summary statistics. To review the list,
  press Ctrl+E,
  click on the Statistics button,
  click on the Help button in the bottom left corner.

Examine the list of summary statistics. Define the ones you recognise. Use the links provided (underlined) to get definitions of those with which you are not familiar. Check the definition of trimmed mean. Why do you think it is defined in this way?

Top Tip
Context-sensitive help is provided via Help buttons in dialog boxes throughout Minitab. Get used to using it!
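The Make Patterned Data settings used under Calc above can be mimicked in a few lines of Python, which may help in predicting what different entries will produce. This is a sketch under stated assumptions: the function simple_set and its argument names are hypothetical, a positive step is assumed, and the whole argument stands for the "sequence repeats" setting mentioned in the text.

```python
def simple_set(first, last, step=1, each=1, whole=1):
    """Build a patterned list of numbers.

    first, last, step -- the basic sequence first, first+step, ..., up to last
    each              -- how many times each value is listed in succession
    whole             -- how many times the whole sequence is repeated
    Assumes step > 0.
    """
    sequence = []
    value = first
    while value <= last:
        sequence.extend([value] * each)
        value += step
    return sequence * whole


# The settings entered in the dialog: 1 to 3 in steps of 1, each listed 20 times.
c6 = simple_set(1, 3, step=1, each=20)
print(len(c6), c6[0], c6[20], c6[40])

# Repeating the whole sequence instead of each value gives a different pattern.
print(simple_set(1, 3, each=1, whole=2))
```

Trying other values of each and whole here before entering them in the dialog is a quick way to check your answers to the questions above.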
Graph

Graph commands make graphs and plots. Explore the commands in the Graph menu by selecting some of them and noting what they do (or discovering what they do using Help). Graphical exploration of data will be taken up below.

Editor

The list of Editor menu commands depends on what type of window is active. Later, we will find the Editor commands for graphs to be very useful.

Click in the Session window and view the Editor commands, then click in the data sheet and view again.

Tools

The Tools menu provides general purpose tools and links, and tools for setting up Minitab as you like it.

Window

Window commands allow for manipulating and arranging windows. The Window menu also lists all open windows, which can be useful for finding a window hidden by others.

Help

The Help commands help with Minitab and also provide extensive help on statistical analysis and interpretation.

2. Simple data analysis

In this section, you will analyse simple data sets using simple numerical and graphical methods, starting with the durability data, a relatively small data set, and proceeding to a second, somewhat bigger data set. It is recommended here that graphical analysis should always be used first, to get a feeling for what is going on in the data, with numerical summaries subsequently being applied to provide quantification. The simplest summary graphs for individual variables are dotplots, boxplots and histograms.

Make dotplots

Make dotplots for the three samples of Durability values discussed earlier:
  from the Graph menu, select Dotplot,
  in the resulting window (called the Gallery), select the Simple option for Multiple Y's (note that Y is widely used in statistical notation to represent a response variable),
  click OK,
  highlight Before, During and After, click Select,
  click OK.

Interpret the results; give a verbal description of any patterns that you see and any exceptions to those patterns.
Do the data follow the Normal model for statistical variation? Discuss.
Compare the samples with regard to the centres (magnitudes) of their values.
Compare the samples with regard to the spreads of their values.
Did the process change have an effect?
How does the spread (variation) within each sample affect your judgement of differences in centres between samples?

The default scaling of the horizontal axis is not well chosen, particularly if these plots are intended for inclusion in an informative report. This can be changed using the Editor menu:
  from the Editor menu, choose Select Item, then Edit X Scale (or point at the X axis and double click),
  select Position of Ticks, clear the text and enter 30 35 40 45 etc. up to 70,
  click OK.

You can also make dotplots from the stacked data by selecting the With Groups option from the initial Dotplot gallery:
  from the Graph menu, select Dotplot,
  select the With Groups option for One Y,
  create dotplots from the stacked Duration data,
  use Sample (C5) as the categorizing variable,
  drag the resulting graph by its Title bar to move it away from your first dotplot. (If the first dotplot is hidden, use the Window menu to show it again.)

Note that the categories are in alphabetical order.

Compare and contrast the two plots. Which do you prefer? Why? Which shows the Before - During - After sequence best? Which shows the effect of the process change best?

You can change the order of the samples in the second plot to, for example, time order, as in the first plot. To do this,
  click any cell in the Sample column (C5) in the worksheet,
  right-click anywhere in the worksheet,
  select Column, then Value Order,
  in the "Define an order" box, type Before, During, After, separated by returns,
  click OK.

Now, redraw the dotplots using Columns 4 and 5.

One case in the Before sample appears exceptional.
Minitab has an interactive graphics facility, called the brush, which helps identify such cases. To use it, proceed as follows:
  click in the "Dotplot of Before, During, After" graph window title bar, to activate it,
  from the Editor menu, select Brush (note the Identifier window that opens in the top left),
  point at the potential exceptional case (note the "pointing finger" cursor) and click,
  from the Editor menu, select Set ID Variables,
  highlight Before, click Select, click OK.

Note the data in the Identifier window; click in the data sheet and compare; check the correspondence between the highlighted point and the data in Row 10.

Note that points are also highlighted in the During and After plots. This is a consequence of the linking feature associated with brushing, which links values for the same case in different variables as well as in the data sheet. It is not sensible in this case; there is no substantive link between the 10th values of the three variables.

Minitab also links to graphs of other variables. As an illustration, activate the last dotplot you made, select Brush from the Editor menu and note the already highlighted point in the Before dotplot.

NB. In larger, more complicated data sets, brushing, linking and identification will be very helpful in exploratory data analysis.

Make boxplots

Use the Graph menu to create boxplots in much the same way as for the dotplots. Refresh your memory on the definition of boxplots by using the Help button in the initial dialog box (the Gallery) that appears when you select Boxplot from the Graph menu.

Interpret the results; give a verbal description of what you see.
Compare the samples with regard to their centres.
Compare the samples with regard to their spreads.
Did the process change have an effect?

Brush the boxplots as you did the dotplots.

Compare the results of brushing boxplots with brushing dotplots.
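To see how a boxplot can single out the same case that the brush identified, the following Python sketch computes the minimum, quartiles, median and maximum of the Before sample and flags values lying more than 1.5 interquartile ranges beyond the quartiles. Several things here are assumptions rather than statements about Minitab: the Before sample is taken to be the first 20 values of the Durability table, quartiles are located at position (n + 1)p with linear interpolation (the "exclusive" method of Python's statistics.quantiles), and the 1.5 multiplier is a common convention. Check Minitab Help for the definitions it actually uses. The function names are illustrative.

```python
import statistics

# Before sample, assuming it is the first 20 values of the Durability table.
before = [34, 37, 40, 30, 34, 34, 34, 37, 53, 69,
          40, 40, 53, 32, 34, 40, 46, 40, 40, 44]


def five_numbers(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="exclusive")
    return min(values), q1, median, q3, max(values)


def flag_exceptions(values, k=1.5):
    """Return (row, value) pairs lying more than k interquartile ranges
    below the lower quartile or above the upper quartile.
    Rows are numbered from 1, as in the worksheet."""
    _, q1, _, q3, _ = five_numbers(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [(row, y) for row, y in enumerate(values, start=1)
            if y < low or y > high]


print(five_numbers(before))
print(flag_exceptions(before))
```

Under these assumptions the only flagged case is the one in Row 10, in agreement with the brushed dotplot.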
Make histograms

Make histograms for the three samples:
  from the Graph menu, select Histogram, then Simple, click OK,
  select Before, During and After as graph variables,
  click on Multiple Graphs, select On separate graphs, select Same Y, Same X (to facilitate comparison),
  click OK, OK.

The three histograms appear in separate windows, making comparison difficult. This can be overcome by using the Layout Tool in the Editor menu:
  activate the Before histogram window by clicking anywhere in the window,
  from the Editor menu, select Layout Tool,
  use the up/down arrows in the Layout dialog box to change Rows to 3 and Columns to 1,
  use the Right arrow to move Histogram of During into the layout panel,
  repeat for Histogram of After,
  click Finish.

The shape of the histograms is not satisfactory; they should be taller and thinner. To achieve this, use the Editor menu again:
  with the Layout window active, from the Editor menu, choose Select Item, then Graph Region,
  note the highlighting around the edge of the graph window,
  from the Editor menu, select Edit Graph Region,
  click on the Graph Size tab,
  select Custom,
  change the value of Width to a value roughly half the value of Height,
  click OK,
  to improve the aesthetics, grab the bottom right corner of the Layout graph window and resize appropriately, to hide the gray background and make the graphs legible.

The resulting graph is better proportioned, but contains too much unnecessary text which may distract from the desired comparisons. This is easily fixed:
  in the Layout graph window, point at the "Histogram of Before" label and press the Delete key,
  repeat with the "frequency" label on the Y axis,
  repeat with the corresponding labels in the other two histograms.

Finally, now that the clutter has been cleared, note that the vertical axes are not the same, even though Same Y was selected above.
In fact, the labels are not necessary so that, once the axes have been made the same, further clutter can be cleared. To fix both, you need to select the individual axes before using the Editor menu again:
  point at the vertical axis for the Before histogram and double-click, to edit it,
  note the value of the Scale Range Maximum,
  click Cancel,
  repeat for the other two histograms,
  note the biggest Scale Range Maximum,
  edit all three histogram Y scales to have that Scale Range Maximum,
  in each case, click on the Show tab, uncheck any checked boxes, click OK,
  right-click the X-axis of the Before histogram, select Edit X-scale, select the Show tab, uncheck the High Axis line box, click OK,
  repeat for the other two histograms,
  click OK.

Top Tip
Removing clutter from graphs is always a good thing; it allows the viewer to focus on the essentials without the distraction of the clutter. This idea is encapsulated in a basic rule of data display: maximise the data-to-ink ratio. While not essential for work in progress, such as during this Laboratory, it is strongly advised for publication, such as for inclusion in reports intended to be read by others.

Compare dotplots, boxplots, histograms

Dotplots, boxplots and histograms graphically convey information concerning frequency distributions.

Which of the three conveys the most information? Which conveys the least?
Which do you prefer in the context of the Durability data? Why?

Calculate numerical summaries

Use the Stat menu to calculate numerical summaries:
  from the Stat menu, select Basic Statistics, then Display Descriptive Statistics,
  select Before, During, After as the Variables,
  click OK.

The results appear in the Session window. Some of the statistics produced may be unfamiliar.
To find out what they are,
  edit the last dialog (press Ctrl+E),
  click on the Statistics button (note the selected options),
  click on the Help button.

Check the definitions of SE Mean and N* ( = N missing).

SE Mean is not appropriate as a descriptive statistic (it arises in statistical inference later). N* is not necessary with these data. Change the selections to exclude these and include others:
  minimise the Help window,
  uncheck SE of mean and N missing,
  check Range and Interquartile range,
  click OK, OK.

With unexceptional data, interpretation of numerical summaries is straightforward and should correspond to the results of the graphical summaries. Problems may arise if there are exceptional cases in the data.

SideNote: The set of statistics consisting of minimum, lower (or 1st) quartile, median, upper (or 3rd) quartile and maximum is referred to as the five number summary. It forms the basis for constructing boxplots.

Compare means. Interpret the results of your comparisons. Refer back to graphical comparisons. Comment.
Repeat with medians. Comment.
Repeat with standard deviations, then Ranges, then Interquartile ranges.
Which of these statistics accords best with the graphical summaries? Explain.

As one of the Before values has already been identified as possibly exceptional, recalculate the numerical summaries with the exceptional value marked for deletion using the missing value code, *:
  click in the data sheet (or type Ctrl+D) to activate it,
  highlight Row 10 of Column 1,
  from the Edit menu, select Clear Cells (or press Backspace, or enter *),
  from the Stat menu, select Basic Statistics, then Display Descriptive Statistics,
  click on the Statistics button,
  check N missing,
  click OK, OK.

Discuss the changes in location and in spread.
Which would you prefer as a summary statistic for spread in the Before sample: standard deviation, range or interquartile range?
Why?

3. Simple analysis of a larger data set

For this exercise, either use the data on tennis ball core diameters introduced below, or use your own data, provided it has at least 100 cases.

The tennis ball manufacturer referred to above was having problems meeting new, more stringent specifications for tennis ball diameters that had been introduced by the International Lawn Tennis Federation. As part of the manufacturing process, presses are used to form pressurised tennis ball cores. There were four presses in the production line, each producing 186 cores in a single production run. In a special study of the problems involved, the focus of attention was the variation arising in the four presses. To study this, the diameters of the 744 cores produced by the four presses in one run were measured and recorded.

The data are stored in an Excel file named Diameter.xls [1]. Refer to the instructions for accessing the Excel files given under Edit in Section 1.

Carry out a full analysis of the chosen data, corresponding to the analysis of the Durability data described above. Provide a comprehensive account of your analysis and your interpretations of the results. Discuss which graph types work better for small data sets and which work better for larger data sets.

[1] These data are discussed extensively in Stuart (2003, §1.3, §1.4, §2.1, §2.2, Ch. 3).

Conclusion

This concludes Laboratory 1. The learning objectives listed at the outset are reproduced here. Check them individually and ensure that you have achieved each one; seek help from the Tutor if necessary.
Learning Objectives: Be able to
  start Minitab and become familiar with the Minitab menus
  enter data in a Minitab data sheet, by hand and by copying from a file
  use Minitab to make dotplots, boxplots and histograms
  recognise the need to simplify graphs for communication purposes
  understand the data ink principle
  use the Minitab graph editor to apply the data ink principle
  provide informative interpretive comments on the results of the graphical analysis
  understand the roles of pattern and exception in interpreting dotplots, boxplots and histograms
  understand the roles of level and spread in comparing samples of measurements
  use the Brush tool to identify exceptional cases in dotplots and boxplots
  understand the relative merits of dotplots, boxplots and histograms for data display
  recognise the range of statistics available for calculation using Minitab
  use Minitab to calculate simple numerical summaries of data
  provide informative interpretive comments on the results of the numerical summaries
  identify and mark exceptional cases for deletion using the Minitab missing value code *
  understand the effects of exceptional cases on different summary statistics
  recognise the limitations exceptional cases place on interpreting summary statistics