Laboratory 1 - Trinity College Dublin

Trinity College, Dublin
Generic Skills Programme
Statistics for Research Students
Laboratory 1:
Simple Data Analysis using Minitab
To complete the laboratory exercise, work your way through this handout, which is self-contained
and self-explanatory. Work in pairs (two per machine), and learn from each other.
Keep separate logs of your work. The tutor is available to help with technicalities and discuss
substantive issues.
Invitations to consider the results of Minitab analysis and their statistical and substantive
interpretations are printed in italics. Take some time for this; consult your neighbour or
tutor. Enter your responses in a Word document, as if draft contributions to a report on
the experiment and its analysis.
Topics:
1. Basic features of Minitab
2. Simple data analysis
3. Simple analysis of a larger data set
Learning Objectives:
Be able to
start Minitab and become familiar with the Minitab menus
use Minitab context sensitive help and navigate the Help facility
enter data in a Minitab data sheet, by hand and by copying from a file
use Minitab to make dotplots, boxplots and histograms
recognise the need to simplify graphs for communication purposes
understand the data ink principle
use the Minitab graph editor to apply the data ink principle
provide informative interpretive comments on the results of the graphical analysis
understand the roles of pattern and exception in interpreting dotplots, boxplots and
histograms
understand the roles of level and spread in comparing samples of measurements
use the Brush tool to identify exceptional cases in dotplots and boxplots
understand the relative merits of dotplots, boxplots and histograms for data display
recognise the range of statistics available for calculation using Minitab
use Minitab to calculate simple numerical summaries of data
provide informative interpretive comments on the results of the numerical summaries
identify and mark exceptional cases for deletion using the Minitab missing value
code *
understand the effects of exceptional cases on different summary statistics
recognise the limitations exceptional cases place on interpreting summary statistics
Data
The data sets used in the following exercises are stored in Excel files and may be copied into
Minitab (and most other statistical software programmes). The data for the first example used
below are stored in an Excel file named Durability.xls.
In a study of the effect on the strength of tennis balls of a modification to the edge seam, the
modification was put into effect for a short period after which the process was changed back to
its original state. Data on strength were collected before, during and after the modification was
in effect. The data are in the form of time to breakage under stress (durability), so bigger is
better. For convenience, they are presented here in tabular form.
Durability of tennis balls Before, During and After application of a process change

Before  During  After
34      53      44
37      53      48
40      69      40
30      40      44
34      37      40
34      53      54
34      40      48
37      48      34
53      60      44
69      60      48
40      69      32
40      48      54
53      53      48
32      44      40
34      60      37
40      48      40
46      44      40
40      53      48
40      44      40
44      48      37
1. Basic features of Minitab
Starting Minitab
First, log in to your PC, using your usual username and the password supplied in class.
Then

click Start, Programs, Minitab 15 for Windows, Minitab.
Minitab windows
When you open Minitab, two windows appear, a Worksheet (sometimes referred to as a Data
sheet or Data window) and the Session window.
The Minitab menu bar and standard toolbar appear above the windows. (Other window types
and toolbars are available, but may be ignored for the moment).
The worksheet looks just like an Excel worksheet, except it does not have all the bells and
whistles of an Excel worksheet. Note the default name of the worksheet, Worksheet 1, in the
title bar above the empty data window. The three asterisks following the name indicate that this
is the active worksheet. As with Excel, you can have many worksheets.
Note the icon to the left of the worksheet name. Right-click this to see a list of tasks including
actions relevant to the worksheet, for example, renaming the worksheet.
Right-click any cell in the worksheet to see a list of tasks relevant to the worksheet contents.
The Session window holds the commands generated by some menus, as well as some output
from some commands. Minitab started out as a command driven programme. It is now fully
menu driven; the menus activate relevant commands. Although we may not want to see the
commands, we still need to see the Session window because of the output it shows.
Entering data in the Worksheet
Later, you will copy the Durability data shown in the table above from an Excel file and paste
them into the worksheet. For now, enter the Before data in the worksheet by hand:





click in the Name cell of the first column (the cell under the column label, C1),
type "Duration" (without the quotes), as the column name,
press Enter, to move to the first data cell,
enter the data in order, 34, 37, 40, 30, etc., pressing Enter each time to move down to
the next cell, as in an Excel worksheet column,
rename the worksheet:
  - right-click the worksheet icon (top left),
  - select Rename Worksheet,
  - type Durability,
  - click OK.
These data will be used to illustrate some of the basic Minitab operations available in the
menus.
Minitab Menus
Click on each menu button in turn to see the features available in each. Many will not be
meaningful at first sight. Some will be explained in this laboratory, others later.
File
File commands deal with the outside world of the Microsoft Windows system
such as opening, closing and saving data files, importing and exporting data, and
printing.
The first few commands in the list refer to Projects. Minitab has a facility for
organising related data sets in different worksheets, as in Excel, and with them
the associated graphs and other output. Minitab has a facility called Project
Manager to handle these. Minitab Projects and the Project Manager are very
useful for managing the data and analyses arising in a real research project,
hence the names. However, in these laboratories, we will deal with one
worksheet at a time, so we will not need the Projects feature.
Now, save your data:
from the File menu, select Save Current Worksheet,
navigate to a data folder in a suitable location, e.g., the Desktop or your
memory stick,
click Save.
Use Windows Explorer to check your data folder; note the new Durability file with
its Minitab icon.
Edit
Edit commands are for editing data, or for general purpose Windows-style
"copying and pasting".
Try editing some cells:



in the data window, select the first two
data cells (containing values 34 and 37),
from the Edit menu, select Clear Cells,
from the Edit menu, select Delete Cells.
Compare the results of the two commands. Generally, if you are in doubt
which of two apparently similar commands to use, try both.
Now, try copying data from the Durability.xls file and pasting it into the active
Minitab worksheet. To access the Excel file,


click on the Start button in the bottom left hand corner and choose Run...,
in the dialog box, type \\tholos\shared and click OK,



in the window that opens double click on the ST1001 folder,
double click on the GET folder,
double click on the GenericSkillsData folder.
The datasets for today's Laboratory are Durability and Diameter.
Access the Durability data:

click on Durability.xls, then Open
copy the three data columns,
in the Minitab active data window, click in the Name cell for Column 2 (C2),
from the Minitab Edit menu select Paste Cells.
Check the correspondence between the Duration data in Column 1 and
the Before data in Column 2.
Finally, delete the (unwanted) Duration column:


in the data sheet, click on C1 to select the entire Duration column,
from the Edit menu, select Delete Cells.
Note the result.
Data
Data commands are concerned with moving and organising data within and
between Minitab worksheets.
Try reformatting the durability data as a single column of data, with a second
column identifying which original column the data come from, (a typical format
for advanced statistical analysis software):




from the Data menu, select Stack, then Columns,
in the resulting dialog box, in the left hand window, drag across Before,
During and After to highlight them,
click the Select button below, note the result in the right hand window,
in the "Store stacked data in" dialog,
select "Column of current worksheet:",
enter c4 in the corresponding window,
enter c5 in the "Store subscripts in:" window,
uncheck "Use variable names in subscript column",


click OK,
name c4 "Duration", name c5 "Sample".
Note that Minitab refers to the sample identifiers (1, 2 and 3) in C5 as
"subscripts". This comes from widely used mathematical notation for identifying
values in samples. Using Y to denote the variable (in this case, Duration), Y_ij
denotes value j in sample i where, in this case, i can be 1, 2 or 3 and j can be 1,
2, ... , 20. Y is frequently used in statistical notation to represent a response
variable. Here, Y = Duration may be regarded as responding to changes in the
manufacturing process, the changes being the modification to the edge seam
and its subsequent removal.
Check the correspondence between the unstacked data in Columns 1-3
and the stacked data in Columns 4 and 5.
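For readers who want to see the stacking idea outside Minitab, the following is a minimal Python sketch, not a description of how Minitab implements Stack. It assumes the durability table reads as 20 Before values, then 20 During values, then 20 After values. The function names stack and y are hypothetical, chosen only for this illustration.

```python
# Unstacked layout: one list per sample, like worksheet columns C1-C3.
# Assumption: the table lists 20 Before, then 20 During, then 20 After values.
unstacked = {
    "Before": [34, 37, 40, 30, 34, 34, 34, 37, 53, 69,
               40, 40, 53, 32, 34, 40, 46, 40, 40, 44],
    "During": [53, 53, 69, 40, 37, 53, 40, 48, 60, 60,
               69, 48, 53, 44, 60, 48, 44, 53, 44, 48],
    "After": [44, 48, 40, 44, 40, 54, 48, 34, 44, 48,
              32, 54, 48, 40, 37, 40, 40, 48, 40, 37],
}


def stack(columns, use_names=False):
    """Return (values, subscripts): one long column plus a sample identifier.

    With use_names=False the identifiers are 1, 2, 3, ...;
    with use_names=True they are the original column names.
    """
    values, subscripts = [], []
    for i, (name, column) in enumerate(columns.items(), start=1):
        for value in column:
            values.append(value)
            subscripts.append(name if use_names else i)
    return values, subscripts


def y(i, j, values, subscripts):
    """Return Y_ij: value j (counting from 1) within numeric sample i."""
    in_sample = [v for v, s in zip(values, subscripts) if s == i]
    return in_sample[j - 1]


duration, sample = stack(unstacked)                      # numeric subscripts
duration_named, sample_named = stack(unstacked, use_names=True)

print(len(duration), "stacked values")
print("Y_1,10 =", y(1, 10, duration, sample))            # 10th Before value
```

The stacked values are the same in both versions; only the identifier column changes, which is what checking "Use variable names in subscript column" does in the dialog.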
For greater transparency, use the original names for Columns 1 – 3 as identifiers
in Column 5. This may be done by re-using the Stack command, via the Edit
Last Dialog facility:


from the Edit menu, select Edit Last Dialog (or press Ctrl+E),
check "Use variable names in subscript column",
click OK.
Comment.
Which version do you prefer, unstacked, stacked with numerical
identifiers, stacked with sample names as identifiers?
You will find later that some commands require stacked data, others require
unstacked data, and some require numerical identifiers with stacked data.
While the unstacked format may appear more intuitive at an elementary level,
the stacked format is the one most widely used in advanced statistical analysis,
with both Minitab and other statistical software.
Calc
The first Calc command is a calculator which allows you to calculate functions of
your data, from simple to complicated, such as adding variables, calculating
square roots, and many more.
From the Calc menu, select Calculator and explore the resulting dialog
box. Click the down-arrow beside the function type box (showing "All
functions" by default), view the functions available under various types.
Other commands implement a range of specialised calculations.
Try calculating simple data summaries and storing the results:






from the Calc menu, select Column Statistics,
in the resulting dialog box, set Statistic to Mean,
tab to the Input variable box below,
highlight Before in the list of column variables on the left,
click the Select button below,
click OK.
Note the result that appears in the Session window.
Repeat using Ctrl+E (Edit Last Dialog), this time selecting Standard deviation as
the statistic; note the result.
Other summary statistics for Before and corresponding statistics for During and
After may be calculated in this way. However, the process is tedious and more
effective solutions are available in the Stat menu.
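As a cross-check on the Session window output, the same two column statistics can be computed outside Minitab. This is a small Python sketch using only the standard library. It assumes the Before values are the first 20 entries of the durability table, and that the standard deviation wanted is the usual sample version with divisor n - 1.

```python
from statistics import mean, stdev

# Assumption: the Before sample is the first 20 values of the durability table.
before = [34, 37, 40, 30, 34, 34, 34, 37, 53, 69,
          40, 40, 53, 32, 34, 40, 46, 40, 40, 44]

before_mean = mean(before)   # arithmetic mean
before_sd = stdev(before)    # sample standard deviation (divisor n - 1)

print(f"Mean of Before = {before_mean:.2f}")
print(f"Standard deviation of Before = {before_sd:.2f}")
```

If the printed values differ from those in the Session window, check the data entry first; a single mistyped value is the most likely cause.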
Try making patterned data; in this case, recreate the numerical "subscripts"
created by Stack:


from the Calc menu, select Make Patterned Data, then Simple Set of
Numbers,
in the resulting dialog box, enter
Store patterned data in: C6
From first value: 1
To last value: 3
In steps of: 1
List each value: 20 times
click OK.
Compare the patterned data with the text "subscripts".
Review the entries in the dialog box (press Ctrl+E to view it); what
happens with different choices of last value, step, value repeats,
sequence repeats? What would you enter to get 20 samples of 3? 6
samples of 10? 10 samples of 6?
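The logic behind a simple set of patterned numbers can be sketched in a few lines of Python. This is an illustration of the idea only, assuming whole-number values and a positive step. The function name patterned and its parameter names are hypothetical and do not correspond to any Minitab command.

```python
def patterned(first, last, step=1, each=1, whole=1):
    """Build a patterned column of integers.

    first, last, step : define the basic sequence first, first+step, ..., up to last
    each              : how many times each value is listed in a row
    whole             : how many times the whole sequence is then repeated
    Assumes integer inputs and step > 0.
    """
    sequence = range(first, last + 1, step)
    once = [value for value in sequence for _ in range(each)]
    return once * whole


# The settings used in the dialog box above: 1 to 3 in steps of 1,
# each value listed 20 times, giving the numeric sample identifiers.
c6 = patterned(1, 3, step=1, each=20)
print(len(c6), c6[:3], c6[-3:])

# A small example of the difference between the two kinds of repeat.
print(patterned(1, 3, each=2))    # each value repeated
print(patterned(1, 3, whole=2))   # whole sequence repeated
```

Trying a few small cases like the last two lines is a quick way to predict what the dialog box will produce before running it on a full column.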
Stat
Stat commands implement a range of statistical calculations.
Try simple numerical summaries:




from the Stat menu, select Basic Statistics, then Display Descriptive
Statistics,
highlight Before, During and After, and click Select,
click OK,
view the results in the Session window.
Comment on the values of the descriptive statistics appearing in the
Session window, particularly with regard to between-sample
comparisons.
Minitab provides a wide choice of summary statistics. To review the list, press
Ctrl+E, click on the Statistics button, click on the Help button in the bottom left
corner.
Examine the list of summary statistics. Define the ones you recognise.
Use the links provided (underlined) to get definitions of those with which
you are not familiar.
Check the definition of trimmed mean. Why do you think it is defined in
this way?
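To experiment with the idea of trimming, here is a minimal Python sketch. It assumes a trimmed mean that drops the smallest 5% and the largest 5% of the values (the number dropped at each end being rounded to the nearest whole number) and averages the rest; confirm the exact definition in Minitab Help, since this is stated here as an assumption. The function name trimmed_mean is hypothetical, and the Before values are assumed to be the first 20 entries of the durability table.

```python
def trimmed_mean(values, proportion=0.05):
    """Mean after dropping a proportion of values from each end.

    Assumption: the number dropped at each end is proportion * n,
    rounded to the nearest whole number (halves rounded up).
    """
    ordered = sorted(values)
    k = int(proportion * len(ordered) + 0.5)    # values to drop at each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Assumption: the Before sample is the first 20 values of the durability table.
before = [34, 37, 40, 30, 34, 34, 34, 37, 53, 69,
          40, 40, 53, 32, 34, 40, 46, 40, 40, 44]

plain_mean = sum(before) / len(before)
print(f"Mean         = {plain_mean:.2f}")
print(f"Trimmed mean = {trimmed_mean(before):.2f}")
```

With 20 values, 5% corresponds to one value at each end, so the smallest and largest values are set aside. Comparing the two printed results gives a feel for how much an extreme value can pull the ordinary mean.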
Top Tip
Context sensitive help is provided via Help buttons in dialog boxes
throughout Minitab. Get used to using it!
Graph
Graph commands make graphs and plots. Explore the commands in the Graph
menu by selecting some of them and noting what they do (or discovering what
they do using Help).
Graphical exploration of data will be taken up below.
Editor
The list of Editor menu commands depends on what type of window is active.
Later, we will find the Editor commands for graphs to be very useful.
Click in the Session window and view the Editor commands, then click in
the data sheet and view again.
Tools
The Tools menu provides general purpose tools and links, and tools for setting
up Minitab as you like it.
Window
Window commands allow for manipulating and arranging windows. The Window menu
also lists all open windows, which can be useful for finding a window hidden by others.
Help
The Help commands help with Minitab and also provide extensive help on
statistical analysis and interpretation.
2. Simple data analysis
In this section, you will analyse simple data sets using simple numerical and graphical methods,
starting with the durability data, a relatively small data set, and proceeding to a second
somewhat bigger data set.
It is recommended here that graphical analysis should always be used first, to get a feeling for
what is going on in the data, with numerical summaries subsequently being applied to provide
quantification. The simplest summary graphs for individual variables are dotplots, boxplots and
histograms.
Make dotplots
Make dotplots for the three samples of Durability values discussed earlier:




from the Graph menu, select Dotplot,
in the resulting window (called the Gallery), select the Simple option for Multiple Y's
(note that Y is widely used in statistical notation to represent a response variable),
click OK,
highlight Before, During and After, click Select
click OK.
Interpret the results; give a verbal description of any patterns that you see and any
exceptions to those patterns.
Do the data follow the Normal model for statistical variation? Discuss.
Compare the samples with regard to the centres (magnitudes) of their values.
Compare the samples with regard to the spreads of their values.
Did the process change have an effect? How does the spread (variation) within
each sample affect your judgement of differences in centres between samples?
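The construction of a dotplot is simple enough to sketch in plain text, which may help when thinking about what the Minitab graph shows. The Python below is a rough character-based illustration, not a reproduction of Minitab's graph. It assumes the durability table reads as 20 Before, then 20 During, then 20 After values, and the function name text_dotplot is hypothetical.

```python
from collections import Counter

# Assumption: the table lists 20 Before, then 20 During, then 20 After values.
before = [34, 37, 40, 30, 34, 34, 34, 37, 53, 69,
          40, 40, 53, 32, 34, 40, 46, 40, 40, 44]
during = [53, 53, 69, 40, 37, 53, 40, 48, 60, 60,
          69, 48, 53, 44, 60, 48, 44, 53, 44, 48]
after = [44, 48, 40, 44, 40, 54, 48, 34, 44, 48,
         32, 54, 48, 40, 37, 40, 40, 48, 40, 37]


def text_dotplot(values, low, high):
    """Return the rows of a character dotplot, top row first.

    One character column per integer from low to high; one dot per case.
    Values outside low..high are not shown.
    """
    counts = Counter(values)
    tallest = max(counts.values())
    rows = []
    for level in range(tallest, 0, -1):
        row = "".join("." if counts[x] >= level else " "
                      for x in range(low, high + 1))
        rows.append(row.rstrip())
    return rows


LOW, HIGH = 30, 70   # a common scale for all three samples
axis = "".join("|" if x % 5 == 0 else "-" for x in range(LOW, HIGH + 1))
labels = "".join(f"{x:<5d}" for x in range(LOW, HIGH + 1, 5)).rstrip()

for name, sample in [("Before", before), ("During", during), ("After", after)]:
    print(name)
    for row in text_dotplot(sample, LOW, HIGH):
        print("  " + row)
    print("  " + axis)
print("  " + labels)
```

Using the same scale for all three samples is a deliberate choice here: it makes differences in centre and spread visible at a glance, just as in the Multiple Y's dotplot.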
The default scaling of the horizontal axis is not well chosen, particularly if these plots are
intended for inclusion in an informative report. This can be changed using the Editor menu:
from the Editor menu, choose Select Item, then Edit X Scale (or point at the X axis and
double-click),
select Position of Ticks, clear the text and enter 30 35 40 45 etc. up to 70,
click OK.
You can make dotplots from the stacked data also by selecting the With Groups option from the
initial Dotplot gallery:




from the Graph menu, select Dotplot,
select the With Groups option for One Y,
create dotplots from the stacked Duration data, use Sample (C5) as the categorizing
variable,
drag the resulting graph by its Title bar to move it away from your first dotplot.
(If the first dotplot is hidden, use the Window menu to show it again.)
Note that the categories are in alphabetical order.
Compare and contrast the two plots. Which do you prefer? Why?
Which shows the Before – During – After sequence best?
Which shows the effect of the process change best?
You can change the order of the samples in the second plot to, for example, time order, as in
the first plot. To do this,





click any cell in the Sample column (C5) in the worksheet,
right-click anywhere in the worksheet,
select Column, then Value Order,
in the "Define an order" box, type Before, During, After, separated by returns,
click OK.
Now, redraw the dotplots using Columns 4 and 5.
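The reason the categories first appeared in alphabetical order, and what defining a value order changes, can be illustrated with a short Python sketch. This only mimics the idea of a user-defined ordering; the variable names are hypothetical and nothing here describes Minitab's internals.

```python
samples = ["Before", "During", "After"]

# Default behaviour for text categories: plain alphabetical sorting.
alphabetical = sorted(samples)
print(alphabetical)

# A user-defined value order: the position of each name in this list
# is used as its sort key, so the categories follow time order.
time_order = ["Before", "During", "After"]
rank = {name: position for position, name in enumerate(time_order)}

labels = ["After", "Before", "During", "Before"]   # a few stacked identifiers
ordered = sorted(labels, key=rank.__getitem__)
print(ordered)
```

Alphabetical order puts After first, which hides the Before, During, After sequence; supplying an explicit order restores it.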
One case in the Before sample appears exceptional. Minitab has an interactive graphics facility
which helps identify such cases, called the brush. To use it, proceed as follows:







click in the "Dotplot of Before, During, After" graph window title bar, to activate it,
from the Editor menu, select Brush, (note the Identifier window that opens in top left),
point at the potential exceptional case (note the "pointing finger" cursor) and click,
from the Editor menu, select Set ID Variables,
highlight Before,
click Select,
click OK.
Note the data in the Identifier window; click in the data sheet and compare, check
the correspondence between the highlighted point and the data in Row 10.
Note that points are also highlighted in the During and After plots. This is a consequence of the
linking feature associated with brushing which links values for the same case in different
variables as well as in the data sheet. It is not sensible in this case; there is no substantive link
between the 10th values of the three variables.
Minitab also links to graphs of other variables. As an illustration, activate the last dotplot you
made, select Brush from the Editor menu and note the already highlighted point in the Before
dotplot.
NB. In larger, more complicated data sets, brushing, linking and identification will be very
helpful in exploratory data analysis.
Make boxplots
Use the Graph menu to create boxplots in much the same way as for the dotplots. Refresh
your memory on the definition of boxplots by using the Help button in the initial dialog box
(the Gallery) that appears when you select Boxplot from the Graph menu.
Interpret the results; give a verbal description of what you see.
Compare the samples with regard to their centres.
Compare the samples with regard to their spreads.
Did the process change have an effect?
Brush the boxplots as you did the dotplots.
Compare the results of brushing boxplots with brushing dotplots.
Make histograms
Make histograms for the three samples:






from the Graph menu, select Histogram, then Simple, click OK,
select Before, During and After as graph variables,
click on Multiple Graphs,
select On separate graphs,
select Same Y, Same X, (to facilitate comparison)
click OK, OK.
The three histograms appear in separate windows, making comparison difficult. This can be
overcome by using the Layout Tool in the Editor menu:






activate the Before histogram window, click anywhere in the window,
from the Editor menu, select Layout Tool,
use the up/down arrows in the Layout dialog box to change Rows to 3 and Columns to
1,
use the Right arrow to move Histogram of During into the layout panel,
repeat for Histogram of After,
click Finish.
The shape of the histograms is not satisfactory; they should be taller and thinner. To do this,
use the Editor menu again:
with the Layout window active, from the Editor menu, choose Select Item, then Graph
Region, note the highlighting around the edge of the graph window,
from the Editor menu, select Edit Graph Region,
click on the Graph Size tab,
select Custom,
change the value of Width to a value roughly half the value of Height, click OK,
to improve the aesthetics, grab the bottom right corner of the Layout graph window and
resize appropriately, to hide the gray background and make the graphs legible.
The resulting graph is better proportioned, but contains too much unnecessary text which may
distract from the desired comparisons. This is easily fixed:



in the Layout graph window, point at the "Histogram of Before" label and press the
Delete key,
repeat with the "Frequency" label on the Y axis,
repeat with the corresponding labels in the other two histograms.
Finally, now that the clutter has been cleared, note that the vertical axes are not the same, even
though Same Y was selected above. In fact, the labels are not necessary so that, once the
axes have been made the same, further clutter can be cleared. To fix both, you need to select
the individual axes before using the Editor menu again:















point at the vertical axis for the Before histogram and double-click, to edit it,
note the value of the Scale Range Maximum,
click Cancel,
repeat for the other two histograms, note the biggest Scale Range Maximum,
edit all three histograms Y scales to have that Scale Range Maximum,
in each case, click on the Show tab,
uncheck any checked boxes,
click OK,
right-click the X-axis of the Before histogram,
select Edit X-scale,
select the Show tab,
uncheck the High Axis line box,
click OK,
repeat for the other two histograms,
click OK.
Top Tip
Removing clutter from graphs is always a good thing; it allows the viewer to
focus on the essentials without the distraction of the clutter. This idea is
encapsulated in a basic rule of data display:
maximise the data-to-ink ratio.
While not essential for work in progress, such as during this Laboratory, it is
strongly advised for publication, such as for inclusion in reports intended to be
read by others.
Compare dotplots, boxplots, histograms
Dot plots, box plots and histograms graphically convey information concerning
frequency distributions. Which of the three conveys the most information? Which
conveys the least?
Which do you prefer in the context of the Durability data?
Why?
Calculate numerical summaries
Use the Stat menu to calculate numerical summaries:



from the Stat menu select Basic Statistics, then Display Descriptive Statistics,
select Before, During, After as the Variables,
click OK.
The results appear in the Session window. Some of the statistics produced may be unfamiliar.
To find out what they are,



edit the last dialog (press Ctrl+E),
click on the Statistics button, (note the selected options),
click on the Help button.
Check the definitions of SE Mean and N* (= N missing).
SE Mean is not appropriate as a descriptive statistic (it arises in statistical inference later). N* is
not necessary with these data. Change the selections to exclude these and include others:




minimise the Help window,
uncheck SE of mean and N missing,
check Range and Interquartile range,
click OK, OK.
With unexceptional data, interpretation of numerical summaries is straightforward and
should correspond to the results of the graphical summaries. Problems may arise if
there are exceptional cases in the data.
SideNote:
The set of statistics consisting of minimum, lower (or 1st) quartile, median,
upper (or 3rd) quartile and maximum is referred to as the five number summary. It
forms the basis for constructing boxplots.
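As an illustration of the five number summary, the following Python sketch computes it for the Before sample using the standard library. Quartile definitions differ between packages; the "exclusive" method used here interpolates at position p(n + 1) in the ordered data, which is assumed (not confirmed here) to agree with the Minitab definition, so compare the output with the Session window. The function name five_number_summary is hypothetical, and the Before values are assumed to be the first 20 entries of the durability table.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum).

    Quartiles use the 'exclusive' method: linear interpolation at
    position p * (n + 1) in the ordered data.
    """
    ordered = sorted(values)
    q1, med, q3 = quantiles(ordered, n=4, method="exclusive")
    return ordered[0], q1, med, q3, ordered[-1]


# Assumption: the Before sample is the first 20 values of the durability table.
before = [34, 37, 40, 30, 34, 34, 34, 37, 53, 69,
          40, 40, 53, 32, 34, 40, 46, 40, 40, 44]

minimum, q1, med, q3, maximum = five_number_summary(before)
print("Min    =", minimum)
print("Q1     =", q1)
print("Median =", med)
print("Q3     =", q3)
print("Max    =", maximum)
print("Range  =", maximum - minimum, " IQR =", q3 - q1)
```

The box of a boxplot runs from Q1 to Q3 with a line at the median, so these five numbers are enough to sketch the basic plot by hand.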
Compare means. Interpret the results of your comparisons. Refer back to graphical
comparisons. Comment.
Repeat with medians. Comment.
Repeat with standard deviations, then Ranges, then Interquartile ranges. Which of
these statistics accords best with the graphical summaries? Explain.
As one of the Before values has already been identified as possibly exceptional, recalculate
the numerical summaries with the exceptional value marked for deletion using the missing
value code, *:
click in the data sheet (or press Ctrl+D) to activate it,
highlight Row 10 of Column 1,
from the Edit menu, select Clear Cells (or press Backspace, or enter *),
from the Stat menu, select Basic Statistics, then Display Descriptive Statistics,
click on the Statistics button,
check N missing,
click OK, OK.
Discuss the changes in location and in spread.
Which would you prefer as a summary statistic for spread in the Before sample:
standard deviation, range or interquartile range?
Why?
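The before-and-after comparison can also be sketched in Python, which makes it easy to see which statistics move most when the exceptional value is set aside. This is an illustration under stated assumptions: the Before values are taken to be the first 20 entries of the durability table, Python's None stands in for the Minitab missing value code *, and quartiles use the standard library's default "exclusive" method, which may differ in detail from Minitab. The function name summarise is hypothetical.

```python
from statistics import mean, median, quantiles, stdev

MISSING = None   # stands in for the Minitab missing value code *

# Assumption: the Before sample is the first 20 values of the durability table.
before = [34, 37, 40, 30, 34, 34, 34, 37, 53, 69,
          40, 40, 53, 32, 34, 40, 46, 40, 40, 44]


def summarise(column):
    """Summary statistics for a column, ignoring missing values."""
    present = sorted(v for v in column if v is not MISSING)
    q1, _, q3 = quantiles(present, n=4)     # default 'exclusive' method
    return {
        "N": len(present),
        "N*": len(column) - len(present),   # number of missing values
        "Mean": mean(present),
        "Median": median(present),
        "StDev": stdev(present),            # divisor n - 1
        "Range": present[-1] - present[0],
        "IQR": q3 - q1,
    }


# Mark the value in Row 10 as missing, leaving the other rows in place.
marked = list(before)
marked[9] = MISSING

with_case = summarise(before)
without_case = summarise(marked)

print(f"{'Statistic':<10}{'All cases':>12}{'Row 10 missing':>16}")
for name in with_case:
    print(f"{name:<10}{with_case[name]:>12.2f}{without_case[name]:>16.2f}")
```

Marking the value as missing, rather than deleting the row, keeps the remaining values in their original rows, which matters when rows are linked across columns.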
3. Simple analysis of a larger data set
For this exercise, either use the data on tennis ball core diameters introduced below, or use
your own data, provided it has at least 100 cases.
The tennis ball manufacturer referred to above was having problems meeting new more
stringent specifications for tennis ball diameters that had been introduced by the International
Lawn Tennis Federation. As part of the manufacturing process, presses are used to form
pressurised tennis ball cores. There were four presses in the production line, each producing
186 cores in a single production run. In a special study of the problems involved, the focus of
attention was the variation arising in the four presses. To study this, the diameters of the 744
cores produced by the four presses in one run were measured and recorded. The data are
stored in an Excel file named Diameter.xls (see footnote 1). Refer to the instructions for opening data files given under the Edit menu above.

Carry out a full analysis of the chosen data, corresponding to the analysis of the
Durability data described above.
Provide a comprehensive account of your analysis and your interpretations of the
results.
Discuss which graph types work better for small data sets and which work better for
larger data sets.
1 These data are discussed extensively in Stuart (2003, §1.3, §1.4, §2.1, §2.2, Ch. 3).
Conclusion
This concludes Laboratory 1. The learning objectives listed at the outset are reproduced here.
Check them individually and ensure that you have achieved each one; seek help from the Tutor
if necessary.
Learning Objectives:
Be able to

start Minitab and become familiar with the Minitab menus

enter data in a Minitab data sheet, by hand and by copying from a file

use Minitab to make dotplots, boxplots and histograms

recognise the need to simplify graphs for communication purposes

understand the data ink principle

use the Minitab graph editor to apply the data ink principle

provide informative interpretive comments on the results of the graphical analysis

understand the roles of pattern and exception in interpreting dotplots, boxplots and
histograms

understand the roles of level and spread in comparing samples of measurements

use the Brush tool to identify exceptional cases in dotplots and boxplots

understand the relative merits of dotplots, boxplots and histograms for data display

recognise the range of statistics available for calculation using Minitab

use Minitab to calculate simple numerical summaries of data

provide informative interpretive comments on the results of the numerical summaries

identify and mark exceptional cases for deletion using the Minitab missing value
code *

understand the effects of exceptional cases on different summary statistics

recognise the limitations exceptional cases place on interpreting summary statistics