MATH 2441 Probability and Statistics for Biological Sciences

Stemplots

For relatively small sets of data, up to perhaps a couple of hundred values, the stemplot (or stem-and-leaf plot) organizes the information in a way that has some of the features of a histogram but still allows the viewer to reconstruct the original data if desired.

Example SalmonCa0

It is easiest to explain how a stemplot is constructed by working through an actual example. Here we go back to the data set SalmonCa0, used earlier to illustrate the procedure for constructing frequency tables and histograms. The 40 data values are:

 75  107   72   61   56
 90   52   61   52   53
 76   59   73   68  103
 72   63   68   78   88
 94   69   67   68   47
120   96   43   54   91
 63  107   72  101   83
 29   54  101   56  129

(The values are taken in order reading down the columns, so the first value is 75, the second is 90, and so on.)

From previous work, we know that the smallest number in this set is 29 and the largest is 129, giving a span of 100 units.

In setting up a stemplot, you first need to decide how to split each data value into a "stem" part and a "leaf" part. The leaf is one or more of the rightmost digits of each value, and every leaf must have the same number of digits. The remaining digits of each value form the stem. The split must be chosen so that the entire set of data gives somewhere between approximately 5 and 15 distinct stems.

In this example, the task is quite easy. If we let the leaves consist of just the single rightmost digit of each value, then the stems are all but that one digit. Thus, for the value 29, the leaf is '9' and the stem is '2'. For the value 78, the leaf is '8' and the stem is '7'. For the value 129, the leaf is '9' and the stem is '12'. From this you see that we require just eleven stems here: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, and 12. (Stems 3 and 11 are included even though no data values fall in them, so that the column of stems has no gaps.)
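The stem/leaf split described above can be checked with a few lines of Python. This is only a sketch for illustration: the names split_value and stem_range are invented here, and the code assumes non-negative integer data with a fixed number of leaf digits.

```python
# Data set SalmonCa0 as listed in the text (40 values).
salmon_ca0 = [
    75, 107, 72, 61, 56, 90, 52, 61, 52, 53,
    76, 59, 73, 68, 103, 72, 63, 68, 78, 88,
    94, 69, 67, 68, 47, 120, 96, 43, 54, 91,
    63, 107, 72, 101, 83, 29, 54, 101, 56, 129,
]


def split_value(value, leaf_digits=1):
    """Return (stem, leaf) for a non-negative integer (hypothetical helper)."""
    # divmod by 10**leaf_digits peels off the rightmost leaf_digits digits.
    return divmod(value, 10 ** leaf_digits)


def stem_range(values, leaf_digits=1):
    """Return every stem from the smallest to the largest, empty ones included."""
    stems = [split_value(v, leaf_digits)[0] for v in values]
    return list(range(min(stems), max(stems) + 1))


print(split_value(29))    # (2, 9)
print(split_value(129))   # (12, 9)

stems = stem_range(salmon_ca0)
print(stems)                    # [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
print(5 <= len(stems) <= 15)    # True: eleven stems fits the 5-to-15 guideline
```

Counting the stems before drawing anything is a quick way to see whether a proposed split will give a usable diagram.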
To construct the stemplot, first write the stems down in a vertical column between two heavy vertical lines:

|  2 |
|  3 |
|  4 |
|  5 |
|  6 |
|  7 |
|  8 |
|  9 |
| 10 |
| 11 |
| 12 |

Now go through the set of data one item at a time, writing the leaf of each value to the right of this column in the row of its stem. As you write the leaves, make sure that each one takes up the same amount of space on the line. Thus, the first value, 75, has a leaf of '5' and a stem of '7', so we write a '5' in the '7' row. The second value, 90, has a leaf of '0' and a stem of '9', so we write a '0' in the '9' row. Proceeding this way through the entire set of forty values results in the following diagram:

|  2 | 9
|  3 |
|  4 | 3 7
|  5 | 2 9 4 2 4 6 6 3
|  6 | 3 3 9 1 8 7 1 8 8
|  7 | 5 6 2 2 3 2 8
|  8 | 8 3
|  9 | 0 4 6 1
| 10 | 7 7 1 1 3
| 11 |
| 12 | 0 9

David W. Sabo (1999)

Make sure you understand how each element of this diagram came to be. Once you understand the procedure, take a look at the diagram itself. If you compare it with the histogram constructed earlier for this data, you should notice that the visual image here is much like the histogram turned on its side. Each row of leaves has a length proportional to the number of data values having that stem value. The advantage of this diagram over a histogram is that you can easily put stems and leaves back together again to get the original 40 data values.

With the stemplot above complete, you can see why it is important that all of the leaves take up exactly the same amount of space: they should line up in straight vertical columns. (If you are typing up a stemplot using a word processor, you need to change to a font in which all the digits have the same width.)

Quite often, computer programs that generate stemplots automatically will sort the leaves so that they are in numerical order. For the diagram above, this would give:

|  2 | 9
|  3 |
|  4 | 3 7
|  5 | 2 2 3 4 4 6 6 9
|  6 | 1 1 3 3 7 8 8 8 9
|  7 | 2 2 2 3 5 6 8
|  8 | 3 8
|  9 | 0 1 4 6
| 10 | 1 1 3 7 7
| 11 |
| 12 | 0 9

The shape is just the same.
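For readers who want to check a hand-drawn stemplot, the procedure just described can be sketched in Python. This is an illustrative sketch only, not the method of any particular statistics package: the function name stemplot and its parameters leaf_digits and sort_leaves are invented for this example, and it assumes non-negative integer data.

```python
from collections import defaultdict

# Data set SalmonCa0 as listed in the text (40 values).
salmon_ca0 = [
    75, 107, 72, 61, 56, 90, 52, 61, 52, 53,
    76, 59, 73, 68, 103, 72, 63, 68, 78, 88,
    94, 69, 67, 68, 47, 120, 96, 43, 54, 91,
    63, 107, 72, 101, 83, 29, 54, 101, 56, 129,
]


def stemplot(values, leaf_digits=1, sort_leaves=False):
    """Return the rows of a stemplot as a list of strings (hypothetical helper)."""
    base = 10 ** leaf_digits
    rows = defaultdict(list)
    for v in values:
        stem, leaf = divmod(v, base)
        rows[stem].append(leaf)          # leaves are stored in data order
    width = len(str(max(rows)))          # widest stem sets the column width
    lines = []
    # Walk every stem from smallest to largest so that empty stems still appear.
    for stem in range(min(rows), max(rows) + 1):
        leaves = sorted(rows[stem]) if sort_leaves else rows[stem]
        # Zero-pad each leaf so every leaf occupies the same amount of space.
        text = " ".join(f"{leaf:0{leaf_digits}d}" for leaf in leaves)
        lines.append(f"| {stem:>{width}} | {text}".rstrip())
    return lines


print("\n".join(stemplot(salmon_ca0, sort_leaves=True)))
```

With sort_leaves=False the leaves appear in whatever order the data list supplies them, which mirrors the by-hand procedure; with sort_leaves=True the rows are put in numerical order, as many programs do.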
It is sometimes useful to have the leaves in numerical order, but when you are doing stemplots by hand in this course, it is not a requirement.

Back-to-Back Stemplots

We can combine two stemplots with common stem values into a single diagram by using both sides of the stem column for leaves. This gives a visual comparison of the distributions of two sets of data. Again, an example makes this easiest to understand.

Examples: SalmonCa20 and SalmonCa100

As you see from the "Example Data Sets" document, the person studying the effect of ClO2 treatment on salmon fillets repeated the experiment with 50 fillets treated with a 20 ppm solution of ClO2, getting the data (data set SalmonCa20):

55  61  86  76  70  66  50  78  71  78
69  70  78  72  75  69  74  76  72  67
54  66  50  59  70  67  67  61  82  64
58  71  49  63  52  49  59  37  58  70
60  62  68  73  59  62  68  74  74  79

and with 35 fillets treated with a 100 ppm solution of ClO2, getting the data (data set SalmonCa100):

 79  113   75   82   80   95  116
101   65   92  109  120   88   75
 88   53  105   93   45   79  108
 84  113  106   78   83   72   67
 92   86   74   79   89   61   83

The first set of values ranges from a minimum of 37 ppm to a maximum of 86 ppm, and the second set ranges from a minimum of 45 ppm to a maximum of 120 ppm. Thus, to handle both sets of data simultaneously, we need stems of 3, 4, 5, 6, …, 11, 12.

Now we set the stems up as in the previous example. The leaves corresponding to the data set SalmonCa20 are entered to the left of the stems, spacing leftwards. The leaves corresponding to the data set SalmonCa100 are entered to the right of the stems, spacing rightwards. (Here the leaves have been sorted so that the smallest are nearest the stem column.)

                           SalmonCa20 |    | SalmonCa100
                                    7 |  3 |
                                  9 9 |  4 | 5
                  9 9 9 8 8 5 4 2 0 0 |  5 | 3
      9 9 8 8 7 7 7 6 6 4 3 2 2 1 1 0 |  6 | 1 5 7
9 8 8 8 6 6 5 4 4 4 3 2 2 1 1 0 0 0 0 |  7 | 2 4 5 5 8 9 9 9
                                  6 2 |  8 | 0 2 3 3 4 6 8 8 9
                                      |  9 | 2 2 3 5
                                      | 10 | 1 5 6 8 9
                                      | 11 | 3 3 6
                                      | 12 | 0

The immediate interpretation of this diagram is quite easy.
We see that while there is some overlap between data values in the two sets, the SalmonCa20 data is quite clearly concentrated and centered around 60 ppm to 70 ppm, whereas the SalmonCa100 data is centered more in the 70 ppm to 90 ppm range. Further, the distribution on the left is quite a bit narrower than the one on the right, indicating that Ca concentrations in salmon fillets treated with a 20 ppm solution of ClO2 are more uniform than the Ca levels in salmon fillets treated with 100 ppm ClO2.

The interesting conclusion suggested by this diagram and this data is that increasing the concentration of ClO2 in the sanitizing solution appears to be associated with higher Ca concentration in the salmon fillets (if true, of course; for more information you might want to look up a paper by J. Kim, W. Du, et al, in Journal of Food Science, 63 (1998), 629). A goal of this course, and of the discipline of statistics in general, will be to develop ways of deciding whether this apparent effect is real (or in the jargon of the subject, whether the difference in Ca concentrations is statistically significant), or whether it might just be the result of coincidence in the random sampling of salmon fillets in this case.

You could build the same sort of back-to-back comparison diagram from histograms as well, by drawing the bars of the histograms in a horizontal direction, one set leftwards from a central axis and the other set rightwards from that central axis.

Because stemplots are constructed by splitting the digits of the actual data values into common stems and individual leaves, it would seem that there might be sets of data for which acceptable stemplots could not be constructed because too few or too many stems would result. Things worked out well in the examples above because the data values spanned a range of about 100 units, so using all but the rightmost digit as stems resulted in about 10 stems, which fits well with the '5 to 15' rule.
However, this would not have worked so well if the data values had spanned a range of, say, just 20 units, or of 500 units. In the first case, we'd have gotten just 2 stems (too few), and in the second case, we'd have gotten 50 stems (far too many).

The problem is not as serious as it might first appear, however. Remember, the goal is:

to have between approximately 5 and 15 distinct stems,
to have every stem covering the same span of values,
to have every leaf take up the same amount of space, and
to be able to regenerate the original data values from the stemplot.

If the simple-minded approach illustrated in the examples above results in too few stems, we can split each stem and sort the leaves into a lower and an upper half, as illustrated by the following example.

Example SalmonpH20:

As part of the salmon fillet sanitation study, the surface pH of the fillets was also recorded. For the 50 fillets treated with 20 ppm ClO2, the pH values obtained were as follows (data set SalmonpH20; not necessarily in the same order as the Ca concentrations listed in data set SalmonCa20 above):

6.34  6.33  6.42  6.48  6.45  6.32  6.27  6.39  6.26  6.43
6.40  6.37  6.55  6.35  6.53  6.24  6.36  6.42  6.28  6.45
6.36  6.37  6.44  6.42  6.34  6.40  6.39  6.32  6.22  6.39
6.43  6.38  6.25  6.31  6.52  6.37  6.22  6.49  6.45  6.48
6.32  6.48  6.40  6.42  6.38  6.26  6.32  6.38  6.49  6.47

These values range from 6.22 to 6.55. If we use the rightmost decimal digit as the leaf, we will get just four stems in our stemplot: 6.2, 6.3, 6.4, and 6.5. This gives the following stemplot:

| 6.2 | 2 2 4 5 6 6 7 8
| 6.3 | 1 2 2 2 2 3 4 4 5 6 6 7 7 7 8 8 8 9 9 9
| 6.4 | 0 0 0 2 2 2 2 3 3 4 5 5 5 7 8 8 8 9 9
| 6.5 | 2 3 5

We could double the number of stems in this diagram by partitioning the range spanned by each stem into two equal parts: leaf values between 0 and 4 (call this the low range for that stem value), and leaf values between 5 and 9 (call this the high range for that stem value).
The resulting stemplot then looks like:

| 6.2L | 2 2 4
| 6.2H | 5 6 6 7 8
| 6.3L | 1 2 2 2 2 3 4 4
| 6.3H | 5 6 6 7 7 7 8 8 8 9 9 9
| 6.4L | 0 0 0 2 2 2 2 3 3 4
| 6.4H | 5 5 5 7 8 8 8 9 9
| 6.5L | 2 3
| 6.5H | 5

The labels 'L' and 'H' on the stems (they stand for 'Low' and 'High') are optional, since it is obvious from the values of the leaves that the first of each pair of identical stems corresponds to the low range of leaf values and the second to the high range. Notice that there are ten possible leaf values (0 through 9) for the stem 6.2, and by dividing them into two sets (0 - 4 and 5 - 9), the two new stems 6.2L and 6.2H each correspond to the same number of potential leaf values. This is required by the second of the four goals listed above.

The approach used in the above example can usually be applied to double the number of stems in a stemplot. The methods illustrated so far are probably enough to handle just about every set of data you may deal with. However, it is also possible to combine stems to reduce their number by some factor (such as two or three), though the result is not quite as convenient as those obtained in the previous examples. At the risk of getting bogged down in details of situations that you are unlikely to encounter often in actual applications, we'll look at one more example that shows one approach to reducing the number of stems in a stemplot.

Example SalmonPhos100:

Back to the salmon fillets. Phosphorus is another mineral component of salmon with nutritional significance. The phosphorus concentrations (in ppm) in the 35 salmon fillets treated with 100 ppm ClO2 were determined to be (data set SalmonPhos100):

2384  2147  1615  1944  2000  1420  2631
2044  2201  2129  2406  1492  1657  2382
2216  2679  3254  1592  1165  2077  1652
2248  2118  1731  1975  2864  3232  2225
2581  2033  2462  2411  1962  1396  2762

These values range from a low of 1165 ppm to a high of 3254 ppm.
We could use the first digit as the stem, but this would result in only three stems: 1, 2, and 3. If we split up these stems, that gets us to six (1L, 1H, 2L, 2H, 3L, 3H), which is still a bit low, but might be worth a try. On the other hand, if we used the first two digits as stems, we would end up with 22 stems (11, 12, 13, 14, …, 29, 30, 31, and 32), which is too many. However, if we combine these stem values in pairs, we'd end up with about 11 or 12 rows in our diagram, which is just about what we want.

So, each stem will be identified with an interval spanning 200 ppm, and it probably makes the picture more readable if the stems are multiples of 200 ppm. Thus, select as stems the values 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, twelve values standing, respectively, for the concentrations 1000, 1200, 1400, …, 3200 ppm. Then the leaves will have to be three-digit values. Thus, since 2384 = 2200 + 184, the data value 2384 will be represented by a leaf of 184 in the row with stem 22. However, although 1420 = 1400 + 20, the leaf for the value 1420 must be written as 020, in the row with stem 14, to ensure that all leaves take up the same space in our diagram. With this approach, we get the following stemplot for this data:

| 1000 | 165
| 1200 | 196
| 1400 | 020 092 192
| 1600 | 015 052 057 131
| 1800 | 144 162 175
| 2000 | 000 033 044 077 118 129 147
| 2200 | 001 016 025 048 182 184
| 2400 | 006 011 062 181
| 2600 | 031 079 162
| 2800 | 064
| 3000 |
| 3200 | 032 054

Here, to facilitate the mental recombination of stems and leaves to regenerate the original data values, we've appended the trailing zeros to the stems.
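The combined-stem construction for SalmonPhos100 can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than a general-purpose routine: the function name wide_stemplot and the parameter stem_width are invented for this example, the data are assumed to be non-negative integers, and the leaves are sorted within each row.

```python
from collections import defaultdict

# Data set SalmonPhos100 as listed in the text (35 values, in ppm).
salmon_phos100 = [
    2384, 2147, 1615, 1944, 2000, 1420, 2631,
    2044, 2201, 2129, 2406, 1492, 1657, 2382,
    2216, 2679, 3254, 1592, 1165, 2077, 1652,
    2248, 2118, 1731, 1975, 2864, 3232, 2225,
    2581, 2033, 2462, 2411, 1962, 1396, 2762,
]


def wide_stemplot(values, stem_width=200):
    """Stemplot rows whose stems are multiples of stem_width (hypothetical helper)."""
    # The largest possible leaf is stem_width - 1 (199 here), so leaves need 3 digits.
    leaf_digits = len(str(stem_width - 1))
    rows = defaultdict(list)
    for v in values:
        stem = (v // stem_width) * stem_width   # e.g. 2384 -> 2200
        rows[stem].append(v - stem)             # e.g. 2384 -> leaf 184
    width = len(str(max(rows)))
    lines = []
    # Step through every stem, including any with no data (3000 in this data set).
    for stem in range(min(rows), max(rows) + stem_width, stem_width):
        # Zero-pad so that the leaf for 1420 is written 020, as in the text.
        text = " ".join(f"{leaf:0{leaf_digits}d}" for leaf in sorted(rows[stem]))
        lines.append(f"| {stem:>{width}} | {text}".rstrip())
    return lines


print("\n".join(wide_stemplot(salmon_phos100)))
```

Adding each leaf back to its stem (for example 2200 + 184 = 2384) regenerates the original value, which is the fourth of the goals listed earlier.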