Draft 03/06/16

Chapter 2
Organizing Data

Gratzer and Jantzen

In the last chapter you were introduced to the vocabulary of statistics and you developed a greater understanding and appreciation for data and the data collection process. We will now begin to work with data. The first step in understanding the information contained in data is the organization and presentation of the data. Lots of unorganized numbers can be very confusing and often do not convey much useful information. Frequency distributions, which are tables that show how often each particular score (or category) occurs in the data set, are one commonly used method for extracting the hidden information contained in a data set. We will use the following data set to learn more about how to organize data into frequency distributions.

AGES OF EMPLOYEES WHO WISH TO ENROLL IN COMPANY'S NEW DENTAL PLAN

31 47 44 34 41 27 39 42 42 45
42 44 43 40 47 57 43 33 44 51
34 47 40 49 53 59 46 44 49 54
44 51 48 43 47 57 37 69 54 49
42 58 44 35 31 51 49 55 35 52
47 50 54 36 61 38 43 44 49 54
42 48 40 56 49 55 31 44 49 34
33 40 34 37 33 61 50 35 55 44
35 47 37 55 50 30 31 51 27 39
39 55 50 27 46

Although the above list contains a lot of information, very little of it is informative to the reader. A frequency distribution for enrollee ages can be created as follows:

1. Divide the range of all scores (high score minus low score) into between 6 and 15 classes of equal width, and organize the classes in ascending order. You can use the minimum and maximum scores as guides in the creation of classes with pleasing endpoints. For example, the youngest person above is 27 and the oldest 69. You could create classes like 27-31, 32-36, 37-41, ..., 62-66 and 67-71 for the above ages, but it would be more pleasing to create classes like 25-29, 30-34, 35-39, ..., 60-64 and 65-69.
Commonly used class widths are 2, 5, 10 and multiples of 10. Also, be sure that the classes don't overlap: classes such as 30-35 and 35-39 wouldn't work, because it would be unclear where to put persons who are 35 years old. Find the absolute frequency of observations for each class by tallying how many observations fall into each class.

2. Find the relative frequency of each class. Relative frequencies show what fraction of the observations falls into each class and can be found by dividing the number of subjects in each class by the total number of subjects.

3. Find the cumulative relative frequency of each class. Cumulative relative frequencies show what fraction of the observations have values that are less than or equal to the highest value contained in each class. The cumulative relative frequency of any class can be found by adding up the relative frequencies of all classes up to and including that class.

The above steps outline how a frequency distribution can be created for a numerical (quantitative) variable like age. Frequency distributions for categorical variables, like gender (male/female), can be constructed in a similar manner, with a few modifications. Since the number of possible categories is usually small, it isn't necessary to group differing categories together. Absolute frequencies and relative frequencies can be computed for each possible category, e.g., how many men (women) are there and what percent of the sample is male (female)? But cumulative relative frequencies are not computed for categorical variables, because the categories themselves cannot be put in order from lowest to highest (men don't have more or less gender than women).

Using the above steps, we can create a frequency table for the age data above. There are 95 measurements ranging from a low of 27 to a high of 69, hence the data values have a range of 42 years. An interval width of 5 years would produce 9 classes of ages, an acceptable number.
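The steps above can be sketched in a few lines of Python. This is only an illustration, not part of the original procedure: the names `AGES` and `frequency_distribution` are hypothetical, and the sketch assumes whole-number ages and classes of equal width that begin at a chosen starting value (here 25, with a width of 5 and 9 classes).

```python
from itertools import accumulate

# Ages of the 95 dental plan enrollees, copied from the data set above.
AGES = [
    31, 47, 44, 34, 41, 27, 39, 42, 42, 45,
    42, 44, 43, 40, 47, 57, 43, 33, 44, 51,
    34, 47, 40, 49, 53, 59, 46, 44, 49, 54,
    44, 51, 48, 43, 47, 57, 37, 69, 54, 49,
    42, 58, 44, 35, 31, 51, 49, 55, 35, 52,
    47, 50, 54, 36, 61, 38, 43, 44, 49, 54,
    42, 48, 40, 56, 49, 55, 31, 44, 49, 34,
    33, 40, 34, 37, 33, 61, 50, 35, 55, 44,
    35, 47, 37, 55, 50, 30, 31, 51, 27, 39,
    39, 55, 50, 27, 46,
]


def frequency_distribution(values, start, width, n_classes):
    """Return one row per class: (low, high, absolute, relative, cumulative)."""
    counts = [0] * n_classes
    for v in values:
        index = (v - start) // width  # which class the value falls into
        if not 0 <= index < n_classes:
            raise ValueError(f"{v} lies outside the classes")
        counts[index] += 1  # tally the absolute frequency
    total = len(values)
    running = accumulate(counts)  # running totals of the absolute frequencies
    return [
        (start + i * width, start + (i + 1) * width - 1, c, c / total, r / total)
        for i, (c, r) in enumerate(zip(counts, running))
    ]


# Print class limits, absolute, relative and cumulative relative frequency.
for low, high, absolute, relative, cumulative in frequency_distribution(AGES, 25, 5, 9):
    print(f"{low}-{high}  {absolute:3d}  {relative:.3f}  {cumulative:.3f}")
```

The cumulative column is computed from running totals of the counts rather than by adding rounded relative frequencies, so the last class always comes out to exactly 1.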
The following frequency distribution describes the age distribution of the dental plan enrollees.

Ages of Employees Applying for Dental Plan

Age      Tally                          Absolute    Relative         Cumulative Relative
                                        Frequency   Frequency        Frequency
25-29    ///                                 3      .032 =  3.2%     .032 =  3.2%
30-34    ///// ///// //                     12      .126 = 12.6%     .158 = 15.8%
35-39    ///// ///// //                     12      .126 = 12.6%     .284 = 28.4%
40-44    ///// ///// ///// ///// ///        23      .242 = 24.2%     .526 = 52.6%
45-49    ///// ///// ///// ///              18      .189 = 18.9%     .716 = 71.6%
50-54    ///// ///// ////                   14      .147 = 14.7%     .863 = 86.3%
55-59    ///// /////                        10      .105 = 10.5%     .968 = 96.8%
60-64    //                                  2      .021 =  2.1%     .989 = 98.9%
65-69    /                                   1      .011 =  1.1%     1.00 = 100%
Total:                                      95      1.00 = 100%

Looking at the 40-44 age category, we can see that 23 persons were 40 to 44 years old. The corresponding relative frequency shows that 40-44 year olds comprised 24.2% (23 divided by 95) of the total number of dental plan seekers. The cumulative relative frequency for that class shows that 52.6% (3.2% + 12.6% + 12.6% + 24.2%) of all plan applicants were 44 years old or younger. We can now begin to make sense of the data. We see the most common and uncommon age intervals, we get an idea of where the data is centered and we begin to understand what it takes to be of exceptional age.

Charts

Charts are powerful methods for presenting data, and can be constructed using the information contained in frequency distributions. Charts, by creating a picture of the data, provide the reader with information about the values that the data takes on and how often they occur. A popular method for graphically displaying categorical data is the Bar Graph, while Histograms can be constructed for numerical variables.

Bar Graphs

Bar Graphs plot either the absolute or the relative frequencies of the differing outcomes of a categorical variable. Suppose the frequency distribution below shows the hair colors for a sample of 110 college students.
Hair Color    Absolute Frequency    Relative Frequency
Brown                 40            .364 = 36.4%
Blonde                17            .155 = 15.5%
Black                 29            .264 = 26.4%
Other                 24            .218 = 21.8%
Total:               110            1.00 = 100%

A Bar Graph presenting the distribution of students' hair color is shown below. The absolute frequencies for each color category are plotted on the vertical axis, while the hair color categories are listed on the horizontal axis. Note that the vertical axis is scaled by numbers that are the same distance from each other (5, 10, 15, 20, etc., not 5, 15, 20, 30, etc.). The highest value on this scale must also be at least as large as the absolute frequency of the category that occurs most often. The horizontal axis contains the list of categories equally spaced on the axis. Above each category label is a bar whose height corresponds to the count of that category. Ideally, each bar should be of equal width, to avoid drawing the reader's attention to a specific category. The bars also don't meet, emphasizing the fact that the data is discrete in nature.

[Figure: bar graph of hair color. Vertical axis: Number of Students (5 to 45 in steps of 5). Horizontal axis: Hair Color (Brown, Blonde, Black, Other).]

We could have plotted the relative frequencies instead of the absolute frequencies, which would have yielded a chart with the same shape but differing values on the vertical axis. The advantage of plotting relative frequencies is that the chart will show what share of the sample falls into each category.

Histograms

If a variable is numerical, Histograms can be used to:

1. look for patterns
2. describe the shape of the distribution (bell-shaped, flat, bimodal, symmetric, skewed)
3. approximate the center of a distribution
4. draw attention to outliers.

A histogram can be constructed using the following procedure.

1. Create a frequency distribution table if one is not available.

2. Create an axis system with either absolute or relative frequencies on the vertical axis and the data classes on the horizontal axis. Like the bar graph, it is essential that the numbered ticks on the vertical scale be equidistant from each other. The horizontal classes must also be of uniform width, and include all possible values in the data range, even if some classes have no observations.

3. Bars with height equal to the absolute frequency (or relative frequency) of each class covering the range of possible values are drawn in. Note that these bars will touch, emphasizing the fact that the data is continuous in nature.

It should be noted that if the class widths or minimum values used to create the frequency distribution are changed, the shape of the histogram will change, because the absolute (and relative) frequencies will differ. Since the choice of class width and starting values is somewhat subjective, there is no single best histogram for a given data set.

The frequency distribution for the dental plan data described earlier, and the corresponding histogram, are presented below.

Ages of Employees Applying for Dental Plan

Age      Tally                          Absolute    Relative         Cumulative Relative
                                        Frequency   Frequency        Frequency
25-29    ///                                 3      .032 =  3.2%     .032 =  3.2%
30-34    ///// ///// //                     12      .126 = 12.6%     .158 = 15.8%
35-39    ///// ///// //                     12      .126 = 12.6%     .284 = 28.4%
40-44    ///// ///// ///// ///// ///        23      .242 = 24.2%     .526 = 52.6%
45-49    ///// ///// ///// ///              18      .189 = 18.9%     .716 = 71.6%
50-54    ///// ///// ////                   14      .147 = 14.7%     .863 = 86.3%
55-59    ///// /////                        10      .105 = 10.5%     .968 = 96.8%
60-64    //                                  2      .021 =  2.1%     .989 = 98.9%
65-69    /                                   1      .011 =  1.1%     1.00 = 100%
Total:                                      95      1.00 = 100%

[Figure: Histogram of Ages of Employees Opting for New Dental Plan. Vertical axis: Number of Employees (0 to 24). Horizontal axis: Ages of Employees (20 to 75).]

Notice that a vertical breaker appears at age 35.
This line emphasizes that there are two distinct categories, namely 30 to 34 and 35 to 39, into which employees can be placed. Even if several consecutive categories have the same frequency, vertical breakers are necessary to indicate that the observations fall into different classes. The histogram shows that employee ages are centered around 40-44, meaning that this is the most common age category, and that the number of employees decreases as you move farther away from that center.

Histograms and Area

Careful examination of the histogram in the previous section reveals an interesting observation: the ratio of the area enclosed by the bar of a category to the total area under the histogram, (area of a bar)/(total area), is the relative frequency of that category. In this example the total area is 475 square units and the area of the bar enclosing the age group 50 to 54 is 14 × 5 = 70. This yields a ratio of 70/475 = .147, which is equal to the relative frequency of this age group as seen in the frequency distribution. A similar calculation for the age group 30 to 34 yields a ratio of 60/475 = .126, the relative frequency of this group.

That this will always be the case can be seen by the following argument. Let $f_i$ be the frequency of category $i$ and let $x$ be the common width of the intervals. Then the area of bar $i$ is

$$A_i = f_i x.$$

The total area under the histogram, summing over all categories $j$, is

$$TA = \sum_j f_j x.$$

The ratio of the area of bar $i$ to the total area is

$$\frac{A_i}{TA} = \frac{f_i x}{\sum_j f_j x} = \frac{f_i}{\sum_j f_j} = \text{relative frequency of category } i.$$

Occasionally one may encounter data that is presented in a frequency distribution containing unequal intervals. For example, in a study of subjects' reaction to violence on TV, the researchers' design may call for the groupings child (ages 5-12), teens (ages 13-18), college aged (ages 19-24), family building years (ages 25-39) and middle age (ages 40-65). There are often practical and valid non-statistical reasons for such groupings.
The groupings in the above example might well have been created to investigate the relationship between a person's position in a family and their perceptions of violent behavior.

The following example will illustrate some of the potential problems caused by the presence of unequal interval widths. A quality control expert examined hospital charts for the number of inconsistencies between summary reports and original notes. The results of the investigation are reported in the table below.

Number of Inconsistencies in Hospital Charts

Number of Inconsistencies    Frequency
0 ≤ x < 1                        6
1 ≤ x < 2                        3
2 ≤ x < 3                        4
3 ≤ x < 9                        7

If a Histogram is drawn using the numbers in the table above, the following is produced.

[Figure: Histogram of Chart Inconsistencies. Vertical axis: Frequency (0 to 8). Horizontal axis: Inconsistencies (0 to 10).]

Does this picture accurately depict what the quality control expert found? In a very real sense it suggests just the opposite. The bar above the group 3-9 overshadows the rest of the plot, suggesting that large numbers of inconsistencies are far more prevalent than small numbers of inconsistencies. Careful examination of the data in the table informs us that this is not the case. What went wrong? An examination of area will again provide insight. The total area under this histogram is calculated to be 55 square units.

Bar       Inconsistencies    Area           Ratio            True Relative Frequency
First     0 ≤ x < 1          6 × 1 = 6      6/55 = .109      6/20 = .30
Second    1 ≤ x < 2          3 × 1 = 3      3/55 = .055      3/20 = .15
Third     2 ≤ x < 3          4 × 1 = 4      4/55 = .073      4/20 = .20
Fourth    3 ≤ x < 9          7 × 6 = 42     42/55 = .764     7/20 = .35

The ratio of bar area to total area does not equal the relative frequency of the groups. Thus, the picture above cannot be counted on to provide reliable information. However, reliable histograms can be constructed for the purpose of picturing data with unequal intervals. The key step in such a construction is the new measure, frequency per unit.
The first step in the creation of this new measure is the choice of the unit. The frequencies may be counted per year, per five-year group, or per decade. We will adopt the convention that the adjustments will be made per single unit. Thus, in the current example we will measure frequency per inconsistency. The adjusted measure is then created by dividing the frequency of each category by the number of units in the category.

Number of          Frequency    Number of Inconsistencies    Chart Frequency
Inconsistencies                 in Interval                  per Inconsistency
0 ≤ x < 1              6                 1                   6/1 = 6
1 ≤ x < 2              3                 1                   3/1 = 3
2 ≤ x < 3              4                 1                   4/1 = 4
3 ≤ x < 9              7                 6                   7/6 = 1.167

Using frequency per inconsistency as the scale on the vertical axis, the total area under the histogram is calculated to be (6 × 1) + (3 × 1) + (4 × 1) + (7/6 × 6) = 20 square units. The table below illustrates that this change makes the ratio (area of bar)/(total area) equal to the relative frequency.

Bar       Inconsistencies    Area            Ratio    Relative Frequency
First     0 ≤ x < 1          6 × 1 = 6       6/20     6/20
Second    1 ≤ x < 2          3 × 1 = 3       3/20     3/20
Third     2 ≤ x < 3          4 × 1 = 4       4/20     4/20
Fourth    3 ≤ x < 9          7/6 × 6 = 7     7/20     7/20

The argument which follows shows that this adjustment procedure always works. Let $f_i$ be the frequency of category $i$ and $n_i$ the number of unit widths in the category. Let $n_i x$ be the width of interval $i$, where $x$ is the unit width. The adjusted frequency is $f_i / n_i$, and the area of the bar for category $i$ is

$$A_i = \frac{f_i}{n_i}\,(n_i x) = f_i x.$$

The total area under the histogram, summing over all categories $j$, is

$$TA = \sum_j \frac{f_j}{n_j}\,(n_j x) = \sum_j f_j x.$$

The ratio of the area of bar $i$ to the total area is

$$\frac{A_i}{TA} = \frac{f_i x}{\sum_j f_j x} = \frac{f_i x}{x \sum_j f_j} = \frac{f_i}{\sum_j f_j} = \text{relative frequency of category } i.$$

The adjusted histogram in the figure below is more representative of reality.

[Figure: Histogram of Charts per Inconsistency. Vertical axis: Charts per Inconsistency (0 to 7). Horizontal axis: # of Inconsistencies (0 to 10).]

A Final Example

Age is a measure that must be treated with care, or counts will be off. Age categories are often listed in the form 1-4 or 5-9.
The reader should notice that the category 5-9 includes the ages 5, 6, 7, 8, and 9. Thus, this category contains five ages and not the sometimes expected four. The following table summarizes the number of deaths by accident in Massachusetts in the year 1960. An adjusted histogram follows.

Deaths from Accident by Age, Massachusetts, 1960

Age        # of Deaths    Class Size    Deaths per Year of Age
Under 1         71             1              71
1-4             93             4              23.25
5-9             85             5              17
10-14           38             5               7.6
15-24          239            10              23.9
25-34          133            10              13.3
35-44          186            10              18.6
45-54          199            10              19.9
55-64          249            10              24.9
65-74          389            10              38.9
75-84          471            10              47.1
85-99          301            15              20.07

[Figure: Histogram of Deaths per Year of Age, Massachusetts, 1960. Vertical axis: Deaths per Year of Age (15 to 75). Horizontal axis: Age (10 to 100).]

The Stem-and-Leaf Plot: The Basics

The stem-and-leaf plot, popularized by John Tukey, is another useful way of displaying how the data for a numerical variable are distributed. The construction of such a plot will be illustrated using the following data gathered from 22 students. A statistics teacher asked his/her students to bring to the final exam the sample problems that they had attempted while studying for the final. While the students were taking the exam, the instructor counted the number of problems attempted by each student. The results of this survey are found below.

Number of Review Problems Attempted by 22 Statistics Students

51 94 103 114 100 106 100 122 75 84 95
98 70 81 101 110 85 93 112 90 97 86

In order to create a stem-plot, the numbers must be measured with the same degree of precision, e.g., to one decimal place, in whole units, in tens, etc. The first step is to separate each observation's number into two parts: a stem, containing all of the number's digits except the right-most digit, and a leaf consisting of the right-most digit. For example, the smallest number 51 would have a stem of 5 and a leaf of 1.
The largest number 122 would have a stem of 12 and a leaf of 2.

For the second step, make a vertical list of stem values that includes all integer values that lie between the smallest and largest stem values, arranged in ascending order, and place a vertical line (a splitter) to the right of the stem list. For this example a list of 8 stems, the numbers 5, 6, 7, 8, 9, 10, 11 and 12, will span the entire data set. Include all integer values between the minimum and maximum stem values, even if the data doesn't contain numbers that correspond to all of the stem values. We will observe the convention that if the list of stems contains between 6 and 15 numbers, the reader can go on to the next step. What to do when this is not the case will be discussed later in this section.

For the third step, place each leaf to the right of the splitter on the line containing the stem to which it was originally attached. For example, the data point 51 would be recorded as 5 | 1 and the data point 122 as 12 | 2. If a stem line already holds a leaf, the next leaf is simply placed to the right of the existing one, i.e., 103 and 100 would be recorded as 10 | 30.

For the final step, rearrange the leaves so that they are in ascending order moving away from (to the right of) the stems. The initial and rearranged stem-and-leaf plots of the entire class can be seen below. The stem width of each plot is also noted, and indicates the order of magnitude of the stem values, e.g., 5 | 1 is 5 tens plus 1 or 51, and 12 | 2 is 12 tens plus 2 or 122.

Initial plot              Rearranged plot

 5 | 1                     5 | 1
 6 |                       6 |
 7 | 50                    7 | 05
 8 | 4156                  8 | 1456
 9 | 458307                9 | 034578
10 | 30601                10 | 00136
11 | 402                  11 | 024
12 | 2                    12 | 2
stem width = 10           stem width = 10

Stem-and-leaf plots can provide the reader with the same information as histograms, on how the variable is distributed. Stem-and-leaf plots, unlike histograms, also preserve the original numbers, so that a reader can recreate all of the original data series.
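The four construction steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the data are non-negative whole numbers, the stem width is 10, and the names `PROBLEMS` and `stem_and_leaf` are hypothetical, chosen only for this example.

```python
# Number of review problems attempted by the 22 students, from the list above.
PROBLEMS = [51, 94, 103, 114, 100, 106, 100, 122, 75, 84, 95,
            98, 70, 81, 101, 110, 85, 93, 112, 90, 97, 86]


def stem_and_leaf(values):
    """Return 'stem | leaves' lines for non-negative whole numbers."""
    leaves = {}
    for v in values:
        # Step 1: the leaf is the right-most digit, the stem is everything else.
        stem, leaf = divmod(v, 10)
        leaves.setdefault(stem, []).append(leaf)
    lines = []
    # Step 2: list every stem from the smallest to the largest, even if empty.
    for stem in range(min(leaves), max(leaves) + 1):
        # Steps 3 and 4: attach the leaves to their stem, in ascending order.
        ordered = "".join(str(leaf) for leaf in sorted(leaves.get(stem, [])))
        lines.append(f"{stem:>2} | {ordered}")
    return lines


print("\n".join(stem_and_leaf(PROBLEMS)))
print("stem width = 10")
```

The sketch goes straight to the rearranged plot; skipping the call to `sorted` would give the initial plot, with leaves in the order in which the data were recorded.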
A stem-and-leaf plot also draws the reader's attention to gaps and sudden dips or peaks in the data stream. Data points that deviate greatly from the overall perceived pattern in the data are known as outliers. Special attention must be paid to outliers to determine if they are real or, as is often the case, the result of measurement or recording error. Finally, the stem-and-leaf plot can be an aid in approximating the center of a distribution. Thus, the plot above presents the reader with a distribution of attempted problems that is symmetric, centered around 97, with a small gap in the 60's; that is, no student attempted 60-something problems. Thus, attention is drawn to the 51. Was this a mistake, or did this student give up, or was this student so well prepared that no more review was needed?

Back-to-Back Stem-and-Leaf Plots

In order to compare two samples, two stem-and-leaf plots can be plotted back-to-back, creating a back-to-back stem-and-leaf plot. When using this method two splitter lines are used, one to the left and one to the right of the stem column. The leaves of one sample are recorded to the left of the left splitter while the leaves of the other sample are recorded to the right of the right splitter. Consider the hypothetical data below, reflecting the danger of extended exposure to the sun. This data contains the depth of a melanoma and the sex of the person on whom it was discovered. Each depth has been rounded to the nearest tenth of a mm for demonstration purposes.
Subject    Melanoma Depth    Gender        Subject    Melanoma Depth    Gender
   1            2.3          Female          14            1.7          Male
   2            2.6          Male            15            2.8          Female
   3            1.5          Male            16            1.2          Female
   4            4.7          Female          17            3.3          Male
   5            3.1          Male            18            0.2          Female
   6            3.7          Female          19            3.0          Male
   7            3.1          Female          20           10.0          Male
   8            0.3          Female          21            8.6          Male
   9            3.7          Female          22            6.3          Female
  10            3.3          Male            23            4.1          Male
  11            2.5          Female          24            3.5          Male
  12            6.9          Male            25            2.4          Male
  13            3.2          Female

The above data can be used to construct a back-to-back stem-and-leaf plot comparing the depth of the diagnosed melanomas of males and females. You will notice that the data ranges from 0.2 to 10.0 mm. Thus, stems ranging from 0 to 10, measured in whole units, will be combined with leaves measured in tenths. Female depths will be recorded to the left of the stem column and male depths to the right. The plot on the left contains the initial unarranged leaves, while the plot on the right has the leaves arranged in ascending order away from the stem.

Initial plot                      Rearranged plot

 Women |    | Men                  Women |    | Men
    23 |  0 |                         32 |  0 |
     2 |  1 | 57                       2 |  1 | 57
   853 |  2 | 64                     853 |  2 | 46
  2717 |  3 | 13305                 7721 |  3 | 01335
     7 |  4 | 1                        7 |  4 | 1
       |  5 |                            |  5 |
     3 |  6 | 9                        3 |  6 | 9
       |  7 |                            |  7 |
       |  8 | 6                          |  8 | 6
       |  9 |                            |  9 |
       | 10 | 0                          | 10 | 0
stem width = units                stem width = units

This back-to-back stem-and-leaf plot shows that the melanomas of women tend to have smaller depths than men's, perhaps suggesting that women tend to get treatment earlier than men.

The Stem-and-Leaf Plot: The Not-So Basics

Ideally the number of stems that a stem-and-leaf plot contains should be between 6 and 15. A smaller number hides how the variable is distributed, by lumping too many observations together. A larger number also obscures the variable's distribution, because few stems will have more than a couple of observations. Consider the following data gathered from a random sample of 38 U.S. hospitals. Each hospital reported what percentage of its inpatients had their care paid for by the Medicaid insurance program.
The Medicaid program is the principal insurance program for poor persons in the U.S.

Medicaid Shares (in %) of Hospitals

1, 2, 5, 6, 7, 7, 10, 24, 26, 2, 3, 3, 3, 4, 4, 5, 5, 5, 6, 7, 8, 8, 8, 8, 8, 10, 10, 10, 13, 15, 16, 18, 21, 22, 23, 24, 30, 33

Since the Medicaid percentages lie between 1 and 33 percent, we could construct a stem-and-leaf plot with stems of 0, 1, 2 and 3. But 4 stems is fewer than the recommended minimum of 6, and such a stem-and-leaf plot would obscure how the Medicaid shares are actually distributed. We can, however, use split stems to increase the number of stem values. To double the number of stem values (from 4 to 8 in this example), write each of the stem values twice, and then assign leaves with the values of 0 through 4 to the upper stem value and 5 through 9 to the lower. Below are two stem-and-leaf plots for the Medicaid share data, the second of which uses split stems.

0 | 1223334455556677788888
1 | 00003568
2 | 123446
3 | 03
stem width = units

0 | 12233344
0 | 55556677788888
1 | 00003
1 | 568
2 | 12344
2 | 6
3 | 03
stem width = units

Note that in the second plot, hospitals with percents ranging from 1 through 4 had their leaves placed on the upper 0 stem, while percents ranging from 5 through 9 had their leaves placed on the lower 0 stem. The split-stem plot provides more insight into how the Medicaid shares are distributed, showing that nearly twice as many hospitals had Medicaid shares between 5 and 9% as in the 0 to 4% range.

Sometimes a variable's data values will generate a stem-and-leaf plot that has too many stem values (more than 15).
Consider the following list, which shows the number of licensed beds that the aforementioned 38 hospitals have:

Number of Licensed Beds

19, 30, 36, 40, 43, 56, 56, 75, 78, 79, 88, 93, 106, 115, 135, 155, 157, 168, 178, 187, 193, 223, 233, 252, 252, 258, 270, 281, 287, 294, 295, 346, 350, 356, 442, 463, 531, 653

Given minimum and maximum bed sizes of 19 and 653, a stem-and-leaf plot of the above data would have 65 stems (ranging from 1 to 65). Such a plot wouldn't be very informative, because many of the stems would have no leaves and the rest would have only one or two. One way to reduce the number of stems is to truncate each number by discarding the right-most digit. In this example, discarding the right-most digit of each number yields the following series:

1, 3, 3, 4, 4, 5, 5, 7, 7, 7, 8, 9, 10, 11, 13, 15, 15, 16, 17, 18, 19, 22, 23, 25, 25, 25, 27, 28, 28, 29, 29, 34, 35, 35, 44, 46, 53, 65

The first hospital's size of 19 has been recoded as a 1 (19 discarding the 9), the second's size of 30 as 3 (30 discarding the 0), the third's size of 36 as 3 (36 discarding the 6), etc. Since the recoded values range from 1 to 65, the stem-and-leaf plot will have 7 stems, namely 0, 1, 2, 3, 4, 5 and 6. The corresponding plot is below:

0 | 133445577789
1 | 013556789
2 | 2355578899
3 | 455
4 | 46
5 | 3
6 | 5
stem width = 100

The plot shows that the hospitals are concentrated at the smaller sizes, with decreasing numbers in the larger size categories. Note the stem width, which indicates that the stem values are measured in 100s and the leaves are measured in tens. Hence the bottom row of 6 | 5 is for the hospital with 653 beds. Also, since the above plot was created by using truncated values, the original data's values can only be approximated from the plot. Finally, if truncating only one digit still yields more than 15 stem values, try truncating another digit.
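Both adjustments described in this section, split stems and truncation, can be sketched in Python. This is a hedged illustration rather than a standard routine: the function name `stem_and_leaf` and its parameters `drop_digits` and `split` are hypothetical, and the sketch assumes non-negative whole numbers.

```python
# Data copied from the two hospital examples above.
MEDICAID = [1, 2, 5, 6, 7, 7, 10, 24, 26, 2, 3, 3, 3, 4, 4, 5, 5, 5, 6,
            7, 8, 8, 8, 8, 8, 10, 10, 10, 13, 15, 16, 18, 21, 22, 23, 24,
            30, 33]
BEDS = [19, 30, 36, 40, 43, 56, 56, 75, 78, 79, 88, 93, 106, 115, 135,
        155, 157, 168, 178, 187, 193, 223, 233, 252, 252, 258, 270, 281,
        287, 294, 295, 346, 350, 356, 442, 463, 531, 653]


def stem_and_leaf(values, drop_digits=0, split=False):
    """Stem-and-leaf lines, optionally with truncated digits or split stems."""
    rows = {}
    for v in values:
        v //= 10 ** drop_digits  # truncation: discard right-most digits
        stem, leaf = divmod(v, 10)
        # With split stems, leaves 0-4 go on the upper row (half 0)
        # and leaves 5-9 on the lower row (half 1).
        half = leaf // 5 if split else 0
        rows.setdefault((stem, half), []).append(leaf)
    first, last = min(rows), max(rows)
    halves = (0, 1) if split else (0,)
    lines = []
    for stem in range(first[0], last[0] + 1):
        for half in halves:
            key = (stem, half)
            if first <= key <= last:  # keep empty rows only inside the data range
                leaves = "".join(str(leaf) for leaf in sorted(rows.get(key, [])))
                lines.append(f"{stem:>2} | {leaves}")
    return lines


print("\n".join(stem_and_leaf(MEDICAID, split=True)))  # stem width = units
print()
print("\n".join(stem_and_leaf(BEDS, drop_digits=1)))   # stem width = 100
```

Calling `stem_and_leaf(BEDS)` without truncation produces the 65 stems mentioned above, which makes the case for discarding a digit easy to see.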