MATH 2441
Probability and Statistics for Biological Sciences
Stemplots
For relatively small sets of data, involving perhaps up to a couple of hundred values at the most, the
stemplot (or stem-and-leaf plot) is a way to organize the information which has some of the features of a
histogram, but allows the viewer to reconstruct the original data if desired.
Example SalmonCa0
It will be easiest to explain how a stemplot is constructed by working through an actual example -- here, we'll
go back to the data set SalmonCa0, used to illustrate the procedure for constructing frequency tables and
histograms. The 40 data values are:
 75   90   76   72   94  120   63   29
107   52   59   63   69   96  107   54
 72   61   73   68   67   43   72  101
 61   52   68   78   68   54  101   56
 56   53  103   88   47   91   83  129
From previous work, we know that the smallest number in this set is 29 and the largest is 129, giving a span
of 100 units.
In setting up a stemplot, you need to first decide how you will split up each data value into a "leaf" part and a
"stem" part. The leaf part is one or more of the rightmost digits of each value, and every leaf must have
the same number of digits. The digits in each value which are not part of the leaf then form the stem.
This process must be done so that the entire set of data gives somewhere between approximately 5 and 15
unique stems.
In this example, the task is quite easy. If we choose to have our leaves consist of just the single rightmost
digit of each value, then the stems will be all but that one digit. Thus, for the value 29, the leaf is '9' and the
stem is '2'. For the value 78, the leaf is '8' and the stem is '7'. For the value 129, the leaf is '9' and the stem
is '12'. From this you see that we require just eleven stems here: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, and 12.
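The split into stems and leaves can be sketched in a few lines of Python. This is only an illustration, assuming non-negative whole-number data and one-digit leaves; the names `salmon_ca0` and `split_value` are ours and not part of the course material.

```python
# The 40 SalmonCa0 values (their order does not matter for this step).
salmon_ca0 = [
    75, 90, 76, 72, 94, 120, 63, 29,
    107, 52, 59, 63, 69, 96, 107, 54,
    72, 61, 73, 68, 67, 43, 72, 101,
    61, 52, 68, 78, 68, 54, 101, 56,
    56, 53, 103, 88, 47, 91, 83, 129,
]


def split_value(value):
    """Return (stem, leaf), where the leaf is the single rightmost digit."""
    return divmod(value, 10)


print(split_value(29))    # (2, 9)
print(split_value(78))    # (7, 8)
print(split_value(129))   # (12, 9)

# The stems run from the stem of the smallest value to the stem of the
# largest value, including stems that end up with no leaves at all.
lowest_stem = min(salmon_ca0) // 10
highest_stem = max(salmon_ca0) // 10
stems = list(range(lowest_stem, highest_stem + 1))
print(len(stems))         # 11
```

Using `divmod` keeps the stem and the leaf tied to the same division, so the two parts always recombine to the original value as 10 × stem + leaf.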
To construct the stemplot, first write the stems down in a vertical column surrounded by two heavy vertical
lines:
2
3
4
5
6
7
8
9
10
11
12
Now, go through the set of data, one item at a time, and write the leaf of each value in a row to the right of
this column in the row of its stem. As you write these "leaves" be careful to make sure that each leaf takes
up the same amount of space on the line.
Thus, the first value, 75, has a leaf of '5' and a stem of '7'. So, we write a '5' in the '7' row. The second
value, 90, has a leaf of '0' and a stem of '9', so we write a '0' in the '9' row. Proceeding this way through the
entire set of forty values results in the following diagram:
David W. Sabo (1999)
Stemplots
Page 1 of 5
|  2 | 9
|  3 |
|  4 | 3 7
|  5 | 2 9 4 2 4 6 6 3
|  6 | 3 3 9 1 8 7 1 8 8
|  7 | 5 6 2 2 3 2 8
|  8 | 8 3
|  9 | 0 4 6 1
| 10 | 7 7 1 1 3
| 11 |
| 12 | 0 9
Make sure you understand how each element of this diagram came to be. Once you understand the
procedure, take a look at the diagram itself. If you compare it with the histogram constructed earlier for this
data, you should notice that the visual image here is much like the histogram turned on its side. Each row of
leaves has a length proportional to the number of data values having that stem value. The advantage of
this diagram over a histogram is that you can easily put stems and leaves back together again to get the
original 40 data values. With the stemplot above complete, you can see why it is important that all of the
leaves take up exactly the same amount of space -- they should line up in straight vertical columns. (If you
are typing up a stemplot using a word processor, you need to change to a font in which all the digits have
the same width.)
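The hand procedure can be sketched in Python as follows. This is a minimal illustration under our own assumptions (whole-number data, one-digit leaves); the function names `stemplot_rows` and `values_from_rows` are hypothetical. It builds the rows in data order and then puts stems and leaves back together to confirm that no information is lost.

```python
from collections import defaultdict

# SalmonCa0 values in the order they are entered into the diagram
# (75 first, then 90, and so on).
salmon_ca0 = [
    75, 90, 76, 72, 94, 120, 63, 29,
    107, 52, 59, 63, 69, 96, 107, 54,
    72, 61, 73, 68, 67, 43, 72, 101,
    61, 52, 68, 78, 68, 54, 101, 56,
    56, 53, 103, 88, 47, 91, 83, 129,
]


def stemplot_rows(values):
    """Build one text row per stem, keeping the leaves in data order."""
    leaves_by_stem = defaultdict(list)
    for value in values:
        stem, leaf = divmod(value, 10)   # leaf = rightmost digit
        leaves_by_stem[stem].append(leaf)

    rows = []
    # Include every stem between the smallest and the largest, even empty ones.
    for stem in range(min(leaves_by_stem), max(leaves_by_stem) + 1):
        leaves = " ".join(str(leaf) for leaf in leaves_by_stem.get(stem, []))
        rows.append(f"| {stem:>2} | {leaves}".rstrip())
    return rows


def values_from_rows(rows):
    """Recombine stems and leaves to recover the data values."""
    values = []
    for row in rows:
        _, stem, leaves = row.split("|")
        values.extend(int(stem) * 10 + int(leaf) for leaf in leaves.split())
    return values


rows = stemplot_rows(salmon_ca0)
print("\n".join(rows))

# The same 40 values come back out (their original order is not recorded).
recovered = values_from_rows(rows)
print(sorted(recovered) == sorted(salmon_ca0))   # True
```

Every leaf is printed as one character followed by one space, which is the plain-text equivalent of making each leaf take up the same amount of room on the line.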

Quite often, computer programs that generate stemplots automatically will sort the leaves so that they are in
numerical order. For the diagram above, this would give:
|  2 | 9
|  3 |
|  4 | 3 7
|  5 | 2 2 3 4 4 6 6 9
|  6 | 1 1 3 3 7 8 8 8 9
|  7 | 2 2 2 3 5 6 8
|  8 | 3 8
|  9 | 0 1 4 6
| 10 | 1 1 3 7 7
| 11 |
| 12 | 0 9
The shape is just the same. It is sometimes useful to have the leaves in numerical order, but when you are
doing stemplots by hand in this course, it is not a requirement.
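A short Python sketch shows why sorting leaves the shape unchanged: the leaves are reordered within each row, but none moves to a different row. The variable names are ours, and the sketch assumes one-digit leaves as in the example.

```python
from collections import defaultdict

salmon_ca0 = [
    75, 90, 76, 72, 94, 120, 63, 29,
    107, 52, 59, 63, 69, 96, 107, 54,
    72, 61, 73, 68, 67, 43, 72, 101,
    61, 52, 68, 78, 68, 54, 101, 56,
    56, 53, 103, 88, 47, 91, 83, 129,
]

# Group the leaves by stem, in data order.
leaves_by_stem = defaultdict(list)
for value in salmon_ca0:
    stem, leaf = divmod(value, 10)
    leaves_by_stem[stem].append(leaf)

unsorted_lengths = {stem: len(leaves) for stem, leaves in leaves_by_stem.items()}

# Sort the leaves within each row; nothing moves between rows.
sorted_leaves = {stem: sorted(leaves) for stem, leaves in leaves_by_stem.items()}
sorted_lengths = {stem: len(leaves) for stem, leaves in sorted_leaves.items()}

print(leaves_by_stem[5])                     # [2, 9, 4, 2, 4, 6, 6, 3]
print(sorted_leaves[5])                      # [2, 2, 3, 4, 4, 6, 6, 9]
print(unsorted_lengths == sorted_lengths)    # True
```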
Back-to-Back Stemplots
We can combine two stemplots with common stem values into a single diagram to obtain a visual
comparison of the distribution of two sets of data by using both sides of the stem column for leaves. Again,
an example might make this easiest to understand.
Examples: SalmonCa20 and SalmonCa100
As you see from the 'Example Data Sets' document, the person studying the effect of ClO2 treatment on
salmon fillets repeated the experiment with 50 fillets treated with a 20 ppm solution of ClO2, getting the data:
55   61   86   76   70   66   50   78   71   78
69   70   78   72   75   69   74   76   72   67
54   66   50   59   70   67   67   61   82   64
58   71   49   63   52   49   59   37   58   70
60   62   68   73   59   62   68   74   74   79
SalmonCa20
and with 35 fillets treated with a 100 ppm solution of ClO2, getting the data:
 79  113   75   82   80   95  116
101   65   92  109  120   88   75
 88   53  105   93   45   79  108
 84  113  106   78   83   72   67
 92   86   74   79   89   61   83
SalmonCa100
The first set of values ranges from a minimum of 37 ppm to a maximum of 86 ppm, and the second set
ranges from a minimum of 45 ppm to a maximum of 120 ppm. Thus, to handle both sets of data
simultaneously, we need stems of 3, 4, 5, 6, …, 11, 12.
Now, we set the stems up as in the previous example. The leaves corresponding to the data set
SalmonCa20 will be entered to the left of the stems, spacing leftwards. The leaves corresponding to the
data set SalmonCa100 will be entered to the right of the stems, spacing rightwards.
                                    7 |  3 |
                                  9 9 |  4 | 5
                  9 9 9 8 8 5 4 2 0 0 |  5 | 3
      9 9 8 8 7 7 7 6 6 4 3 2 2 1 1 0 |  6 | 1 5 7
9 8 8 8 6 6 5 4 4 4 3 2 2 1 1 0 0 0 0 |  7 | 2 4 5 5 8 9 9 9
                                  6 2 |  8 | 0 2 3 3 4 6 8 8 9
                                      |  9 | 2 2 3 5
                                      | 10 | 1 5 6 8 9
                                      | 11 | 3 3 6
                                      | 12 | 0
                           SalmonCa20 |    | SalmonCa100
The immediate interpretation of this diagram is quite easy. We see that while there is some overlap
between data values in the two sets, the SalmonCa20 data is quite clearly concentrated and centered
around 60 ppm - 70 ppm, whereas the SalmonCa100 data is centered more in the 70 ppm to 90 ppm range.
Further, the distribution on the left is quite a bit narrower than the one on the right, indicating that Ca
concentrations in salmon fillets treated with a 20 ppm solution of ClO2 are more uniform than the Ca levels in
salmon fillets treated with 100 ppm ClO2.
The interesting conclusion that is suggested by this diagram and this data (if true, of course -- for more
information you might want to look up a paper by J. Kim, W. Du, et al, in Journal of Food Science, 63 (1998),
629) is that increasing the concentration of ClO2 in the sanitizing solution appears to be associated with
higher Ca concentration in the salmon fillets. A goal of this course and of the discipline of statistics in
general will be to develop ways of deciding whether this apparent effect is real (or in the jargon of the
subject, whether the difference in Ca concentrations is statistically significant), or whether it might just be
the result of coincidence in the random sampling of salmon fillets in this case.
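A back-to-back stemplot can be assembled in Python along the following lines. This is a sketch under our own assumptions rather than a standard routine: the function names `group_leaves` and `back_to_back` are hypothetical, leaves are one digit wide, and the leaves are sorted so that the smallest leaf on each side sits next to the stem column.

```python
from collections import defaultdict

salmon_ca20 = [
    55, 61, 86, 76, 70, 66, 50, 78, 71, 78,
    69, 70, 78, 72, 75, 69, 74, 76, 72, 67,
    54, 66, 50, 59, 70, 67, 67, 61, 82, 64,
    58, 71, 49, 63, 52, 49, 59, 37, 58, 70,
    60, 62, 68, 73, 59, 62, 68, 74, 74, 79,
]
salmon_ca100 = [
    79, 113, 75, 82, 80, 95, 116,
    101, 65, 92, 109, 120, 88, 75,
    88, 53, 105, 93, 45, 79, 108,
    84, 113, 106, 78, 83, 72, 67,
    92, 86, 74, 79, 89, 61, 83,
]


def group_leaves(values):
    """Map each stem to the list of its one-digit leaves."""
    groups = defaultdict(list)
    for value in values:
        stem, leaf = divmod(value, 10)
        groups[stem].append(leaf)
    return groups


def back_to_back(left_values, right_values):
    """Return text rows with left leaves running leftwards from the stems."""
    left = group_leaves(left_values)
    right = group_leaves(right_values)
    # The common stem column must cover both data sets.
    all_stems = set(left) | set(right)
    stems = range(min(all_stems), max(all_stems) + 1)

    # Descending order on the left puts the smallest leaf beside the stem.
    left_text = {
        s: " ".join(str(d) for d in sorted(left.get(s, []), reverse=True))
        for s in stems
    }
    width = max(len(text) for text in left_text.values())

    rows = []
    for s in stems:
        right_text = " ".join(str(d) for d in sorted(right.get(s, [])))
        rows.append(f"{left_text[s]:>{width}} | {s:>2} | {right_text}".rstrip())
    return rows


rows = back_to_back(salmon_ca20, salmon_ca100)
print("\n".join(rows))
```

Right-justifying the left-hand leaves to the width of the longest row is what makes both sets of leaves grow outwards from the shared stem column.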

You could use the same sort of back-to-back comparison diagram based on histograms as well, by drawing
the bars of the histograms in a horizontal direction, one set leftwards from a central axis, and the other set
rightwards from that central axis.
Because of the way stemplots are constructed by splitting the digits of the actual data values into common
stems and individual leaves, it would seem that there might be sets of data for which acceptable stemplots
could not be constructed because too few or too many stems would result. Things worked out well in the
examples above, because the data values spanned a range of about 100 units, and so using all but the
rightmost digit as stems resulted in about 10 stems, which fit well with the '5 to 15' rule. However, this would
not have worked so well if the data values had spanned a range of, say, just 20 units, or of 500 units. In the
first case, we'd have gotten just 2 stems (too few), and in the second case, we'd have gotten 50 stems (far
too many).
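The stem counts quoted above can be checked with a small Python sketch. The helper names `count_stems` and `within_guideline` are ours, and the ranges 40 to 59 and 100 to 599 are hypothetical data ranges chosen only to illustrate spans of 20 and 500 units.

```python
def count_stems(smallest, largest, leaf_digits=1):
    """Number of stems needed when the leaf is the last `leaf_digits` digits."""
    unit = 10 ** leaf_digits
    return largest // unit - smallest // unit + 1


def within_guideline(n_stems, low=5, high=15):
    """Check a stem count against the 'approximately 5 to 15' guideline."""
    return low <= n_stems <= high


print(count_stems(29, 129))     # 11 (SalmonCa0)
print(count_stems(40, 59))      # 2  (hypothetical span of 20 units)
print(count_stems(100, 599))    # 50 (hypothetical span of 500 units)

print(within_guideline(count_stems(29, 129)))    # True
print(within_guideline(count_stems(40, 59)))     # False
print(within_guideline(count_stems(100, 599)))   # False
```

The exact count for a given span depends on where the smallest value falls: a span of 20 units starting at 45, for instance, touches three stems rather than two.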
The problem is not as serious as it might first appear, however. Remember, the goal is:
- to have between approximately 5 and 15 distinct stems,
- to have every stem covering the same span of values,
- to have every leaf take up the same amount of space, and
- to be able to regenerate the original data values from the stemplot.
If the simple-minded approach illustrated in the examples above results in too few stems, we can split each
stem, and sort the leaves into an upper and lower half, as illustrated by the following example.
Example SalmonpH20:
As part of the salmon fillet sanitation study, the surface pH of the fillets was also recorded. For the 50 fillets
treated with 20 ppm ClO2, the pH values obtained were (not necessarily in the same order as the Ca
concentrations listed in data set SalmonCa20 above):
6.34   6.33   6.42   6.48   6.45   6.32   6.27   6.39   6.26   6.43
6.40   6.37   6.55   6.35   6.53   6.24   6.36   6.42   6.28   6.45
6.36   6.37   6.44   6.42   6.34   6.40   6.39   6.32   6.22   6.39
6.43   6.38   6.25   6.31   6.52   6.37   6.22   6.49   6.45   6.48
6.32   6.48   6.40   6.42   6.38   6.26   6.32   6.38   6.49   6.47
SalmonpH20
These values range from 6.22 to 6.55. If we use the rightmost decimal digits as the leaves, we will get just
four stems in our stemplot: 6.2, 6.3, 6.4, and 6.5. This gives the following stemplot:
| 6.2 | 2 2 4 5 6 6 7 8
| 6.3 | 1 2 2 2 2 3 4 4 5 6 6 7 7 7 8 8 8 9 9 9
| 6.4 | 0 0 0 2 2 2 2 3 3 4 5 5 5 7 8 8 8 9 9
| 6.5 | 2 3 5
We could double the number of stems in this diagram by partitioning the range spanned by each stem into
two equal parts: those with leaf values between 0 and 4 (call this the low range for that stem value), and
leaf values between 5 and 9 (call this the high range for that stem value). The resulting stemplot then looks
like:
| 6.2L | 2 2 4
| 6.2H | 5 6 6 7 8
| 6.3L | 1 2 2 2 2 3 4 4
| 6.3H | 5 6 6 7 7 7 8 8 8 9 9 9
| 6.4L | 0 0 0 2 2 2 2 3 3 4
| 6.4H | 5 5 5 7 8 8 8 9 9
| 6.5L | 2 3
| 6.5H | 5
The subscripts 'L' and 'H' on the stems are optional (they stand for 'Low' and 'High'), since it is obvious from
the values of the leaves that the first of each pair of identical stems corresponds to the low range values of
the leaves and the second to the high range values of the leaves. Notice that there are ten possible leaf values
(0 through 9) for the stem 6.2, and by dividing them into two sets (0 - 4, and 5 - 9), the two new stems, 6.2L
and 6.2H, each correspond to the same number of potential leaf values. This is required by the second of the
four goals listed above.
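Splitting each stem into a low and a high half can be sketched in Python as follows. This is an illustration only: the function name `split_stem_rows` is ours, and the sketch assumes values recorded to two decimal places, which it converts to whole hundredths so that the digit arithmetic is exact.

```python
salmon_ph20 = [
    6.34, 6.33, 6.42, 6.48, 6.45, 6.32, 6.27, 6.39, 6.26, 6.43,
    6.40, 6.37, 6.55, 6.35, 6.53, 6.24, 6.36, 6.42, 6.28, 6.45,
    6.36, 6.37, 6.44, 6.42, 6.34, 6.40, 6.39, 6.32, 6.22, 6.39,
    6.43, 6.38, 6.25, 6.31, 6.52, 6.37, 6.22, 6.49, 6.45, 6.48,
    6.32, 6.48, 6.40, 6.42, 6.38, 6.26, 6.32, 6.38, 6.49, 6.47,
]


def split_stem_rows(values):
    """Map labels such as '6.2L' and '6.2H' to their sorted leaves."""
    # Work in whole hundredths (6.34 -> 634) to avoid floating-point digits.
    hundredths = sorted(round(v * 100) for v in values)

    rows = {}
    # Create both halves of every stem first, so empty rows still appear.
    for stem in range(hundredths[0] // 10, hundredths[-1] // 10 + 1):
        rows[f"{stem / 10:.1f}L"] = []
        rows[f"{stem / 10:.1f}H"] = []

    for h in hundredths:
        stem, leaf = divmod(h, 10)
        half = "L" if leaf <= 4 else "H"   # leaves 0-4 are low, 5-9 are high
        rows[f"{stem / 10:.1f}{half}"].append(leaf)
    return rows


rows = split_stem_rows(salmon_ph20)
for label, leaves in rows.items():
    print(f"| {label} | " + " ".join(str(leaf) for leaf in leaves))
```

Each half covers exactly five possible leaf digits, so every row still spans the same range of values.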

The approach used in the above example can usually be applied to double the number of stems in a
stemplot. The methods illustrated so far are probably enough to handle just about every set of data you may
deal with. However, it is also possible to combine stems to reduce their numbers by some factor (such as
two or three, etc.), though the result is not quite as convenient as those obtained in the previous examples
here. At the risk of getting bogged down in details of situations that you are unlikely to encounter often in
actual applications, we'll look at one more example that shows one approach to reducing the number of
stems in a stemplot.
Example SalmonPhos100:
Back to the salmon fillets … Phosphorus is another mineral component of salmon with nutritional
significance. The phosphorus concentrations in the 35 salmon fillets treated with 100 ppm ClO2 were
determined (ppm):
2384   2147   1615   1944   2000   1420   2631
2044   2201   2129   2406   1492   1657   2382
2216   2679   3254   1592   1165   2077   1652
2248   2118   1731   1975   2864   3232   2225
2581   2033   2462   2411   1962   1396   2762
SalmonPhos100
These values range from a low of 1165 ppm to a high of 3254 ppm. We could use the first digit as the stem,
but this would result in only three stems: 1, 2, and 3. If we split up these stems, that gets us to six (1L, 1H,
2L, 2H, 3L, 3H), which is still a bit low, but might be worth a try. On the other hand, if we used the first two
digits as stems, we end up with 22 stems (11, 12, 13, 14, …, 29, 30, 31, and 32), which is too many.
However, if we double up these stem values, we'd end up with about 11 or 12 branches in our diagram,
which is just about what we want.
So, each stem will be identified with an interval spanning 200 ppm, and so it probably makes the picture
more readable if the stems are multiples of 200 ppm. Thus, select as stems the values 10, 12, 14, 16, 18,
20, 22, 24, 26, 28, 30, 32, twelve values standing, respectively, for the concentrations 1000, 1200, 1400, …,
3200 ppm. Then, the leaves will have to be three digit values. Thus, since
2384 = 2200 + 184,
the data value 2384 will be represented by a leaf of 184 in the row with stem 22. However, although 1420 =
1400 + 20, the leaf for the value 1420 must be written as 020, in the row with stem 14, to ensure that all
leaves take up the same space in our diagram. With this approach, we get the following stemplot for this
data:
| 1000 | 165
| 1200 | 196
| 1400 | 020 092 192
| 1600 | 015 052 057 131
| 1800 | 144 162 175
| 2000 | 000 033 044 077 118 129 147
| 2200 | 001 016 025 048 182 184
| 2400 | 006 011 062 181
| 2600 | 031 079 162
| 2800 | 064
| 3000 |
| 3200 | 032 054
Here, to facilitate the mental recombination of stems and leaves to regenerate the original data values, we've
appended the trailing zeros to the stems.
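Combining stems into 200 ppm intervals can be sketched in Python as follows. This is one possible approach under our own assumptions: the function name `wide_stem_rows` and its parameters `width` and `leaf_digits` are hypothetical, and the leaves are sorted within each row.

```python
salmon_phos100 = [
    2384, 2147, 1615, 1944, 2000, 1420, 2631,
    2044, 2201, 2129, 2406, 1492, 1657, 2382,
    2216, 2679, 3254, 1592, 1165, 2077, 1652,
    2248, 2118, 1731, 1975, 2864, 3232, 2225,
    2581, 2033, 2462, 2411, 1962, 1396, 2762,
]


def wide_stem_rows(values, width=200, leaf_digits=3):
    """Map each stem (a multiple of `width`) to its zero-padded leaves."""
    low = min(values) // width * width     # stem of the smallest value
    high = max(values) // width * width    # stem of the largest value

    # Create every stem first, so a stem with no leaves still gets a row.
    rows = {stem: [] for stem in range(low, high + width, width)}

    for value in sorted(values):
        index, leaf = divmod(value, width)   # e.g. 2384 -> (11, 184)
        # Pad with leading zeros so all leaves take up the same space.
        rows[index * width].append(f"{leaf:0{leaf_digits}d}")
    return rows


rows = wide_stem_rows(salmon_phos100)
for stem, leaves in rows.items():
    print(f"| {stem} | " + " ".join(leaves))

# Stem plus leaf gives back the data value, e.g. 2200 + 184 = 2384.
recovered = sorted(stem + int(leaf) for stem, leaves in rows.items() for leaf in leaves)
print(recovered == sorted(salmon_phos100))   # True
```

Because each stem is written out in full (1000, 1200, and so on), recovering a data value is a plain addition of the stem and its leaf.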
