MAT 155: Describing, Exploring, and Comparing Data 0201-NotesCh2-3.doc Page 1 of 8

Notes for
Chapter 2 Summarizing and Graphing Data
Chapter 3 Describing, Exploring, and Comparing Data

Frequency Distribution, Graphic Representation, Measures of Center, Variation, & Standing

In these chapters, we will study (1) visual representation of data, (2) measures of center and variation, and (3) relative standing and exploratory analysis. These three areas include (1) frequency distribution, relative frequency distribution, cumulative frequency distribution, histogram, frequency polygon, stem-and-leaf plots, and scatter plots; (2) arithmetic mean, median, mode, midrange, weighted mean, range, standard deviation, coefficient of variation, empirical rule, and Chebyshev’s Theorem; and (3) z-scores, quartiles, percentiles, outliers, and box plots (5-number summary).

Visual Representation of Data

A frequency distribution is one convenient way to represent a large amount of data in a small amount of space using two columns: (1) categories or classes and (2) frequency. There are some general guidelines that we should use when constructing a frequency distribution.

First, determine the number of classes, k, by using the “2 to the k rule.” Find the smallest integer k so that $2^k \ge n$, where n is the total number of observations or data values. For example, if n = 50 data values, we would find k = 6 classes. [$2^4 = 16$, $2^5 = 32$, and $2^6 = 64$] Of course, we are free to choose a number of classes different from this k-value when actually constructing the frequency distribution, for example to make the distribution more appealing.

NOTE: Classes should be mutually exclusive and collectively exhaustive. This ensures that each data value fits into only one class and that every value belongs to a class. Also, we should try to have at least 5 and not more than 15 classes. Thus, we will try to satisfy the inequality $5 \le k \le 15$.
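As a small illustration of the “2 to the k rule,” the Python sketch below searches for the smallest k. It assumes the comparison is $2^k \ge n$; the function name `classes_by_2k_rule` is made up for this example and is not part of the notes.

```python
def classes_by_2k_rule(n: int) -> int:
    """Return the smallest integer k such that 2**k >= n (assumed form of the rule)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    k = 0
    # Keep doubling until 2**k reaches or passes the number of observations.
    while 2 ** k < n:
        k += 1
    return k


k = classes_by_2k_rule(50)
print(k)  # 6, since 2**5 = 32 < 50 <= 64 = 2**6
```

The result is only a starting point; as the notes say, the final number of classes may be adjusted, ideally staying between 5 and 15.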
We should avoid, if possible, open-ended classes.

Second, determine the class interval or class width. Two guidelines that may be used to determine the class interval, i, are

(1) $i \ge \dfrac{\text{largest data value} - \text{smallest data value}}{\text{number of classes}}$

(2) $i \ge \dfrac{\text{largest data value} - \text{smallest data value}}{1 + 3.322(\log n)}$

where log is the common (base 10) logarithm. Suppose the smallest and largest values of the 50 values from above are 12 and 88, respectively.

By (1), $i \ge \dfrac{88 - 12}{6} \approx 12.667$, and by (2), $i \ge \dfrac{88 - 12}{1 + 3.322(\log 50)} \approx 11.439$.

Again, we have some freedom to choose the class width (interval) to be a whole number if we wish. Depending on our choice for i, we may have to change the number of classes from 6. NOTE: The class intervals should be equal.

We will set up our classes so that the lower limit (left value) of the class is included in that class, and the upper limit (right value) of the class is not included in that class.

Returning to the 50 data values ranging from 12 to 88, let us set up the classes. If we choose i = 12 and start the first class with a lower limit of 12, we would need 7 classes in order to include the largest value of 88. If we choose i = 15 and start with 10 as the lower limit of the first class, we would need only 6 classes to include the value of 88.

NOTE: Some people recommend that the lower limit of the first class be a whole number multiple of the class interval. However, this is not essential, and we will use that only when it is convenient.

Based on the information presented above, we may choose either of the class setups below.

Table A. Classes: k = 7, i = 12
12-24, 24-36, 36-48, 48-60, 60-72, 72-84, 84-96

Table B. Classes: k = 6, i = 15
10-25, 25-40, 40-55, 55-70, 70-85, 85-100

Once we set up the classes, we count and record the number of values in each class. In Table A, we record, in the frequency column, the number of values so that $12 \le \text{value} < 24$, $24 \le \text{value} < 36$, etc.

Table C. 50 Data Values
57  43  88  20  78  73  46  41  73  72
25  12  21  70  25  78  22  26  23  87
79  17  13  16  69  24  73  75  48  42
19  42  81  54  16  40  70  37  64  17
74  61  24  39  81  19  64  20  85  46

Using the guidelines, Table A, and the data in Table C above, we get the frequency distribution in Table D below.

Table D. Frequency Distribution
Classes              Frequency
12-24                13
24-36                 5
36-48                 9
48-60                 3
60-72                 6
72-84                11
84-96                 3
Sum of freq. = n =   50

The relative frequency distribution is constructed from the frequency distribution by dividing each frequency by the sum of the frequencies. For example, 13/50 = 0.26, 5/50 = 0.10, etc. Table E below is the relative frequency distribution constructed from Table D.

Table E. Relative Frequency Distribution
Classes    Frequency    Relative Frequency
12-24      13           0.26
24-36       5           0.10
36-48       9           0.18
48-60       3           0.06
60-72       6           0.12
72-84      11           0.22
84-96       3           0.06
Total =    50           1.00

From Table E, we see that 26% and 22% of the data values are in the intervals [12,24) and [72,84), respectively.

In addition to the relative frequency distribution, we will discuss the less than cumulative frequency distribution (LCF). The LCF (Table F) shows the accumulated frequency that is less than the upper limit value in the respective class.

Table F. Less Than Cumulative Frequency Distribution
Classes    Frequency    Less than Cumulative Frequency (<cf)
12-24      13           13
24-36       5           18
36-48       9           27
48-60       3           30
60-72       6           36
72-84      11           47
84-96       3           50
Total =    50           --

We see that 13 values are smaller than 24. The 13 in the first class plus 5 in the second class give 18 values less than 36. [The rest of the values in the <cf column are obtained as follows: 18 + 9 = 27, 27 + 3 = 30, 30 + 6 = 36, 36 + 11 = 47, and 47 + 3 = 50.]

The histogram is constructed by using the class limits on the horizontal axis and the frequencies on the vertical axis. The histogram below on the left was constructed using Statdisk; on the right by using Excel.
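The class counts behind Tables D, E, and F (which are also the bar heights of the histogram) can be rechecked with a short Python sketch. It uses the 50 values of Table C and the Table A setup (first lower limit 12, width 12, 7 classes); the helper name `frequency_table` is an assumption made for this example.

```python
# The 50 data values from Table C.
data = [57, 43, 88, 20, 78, 73, 46, 41, 73, 72,
        25, 12, 21, 70, 25, 78, 22, 26, 23, 87,
        79, 17, 13, 16, 69, 24, 73, 75, 48, 42,
        19, 42, 81, 54, 16, 40, 70, 37, 64, 17,
        74, 61, 24, 39, 81, 19, 64, 20, 85, 46]


def frequency_table(values, start, width, k):
    """Return rows (lower, upper, frequency, relative frequency, <cf).

    Each class includes its lower limit and excludes its upper limit.
    """
    n = len(values)
    rows = []
    cumulative = 0
    for j in range(k):
        lower = start + j * width
        upper = lower + width
        freq = sum(1 for v in values if lower <= v < upper)
        cumulative += freq  # running total gives the less-than cumulative frequency
        rows.append((lower, upper, freq, freq / n, cumulative))
    return rows


table = frequency_table(data, start=12, width=12, k=7)
for lower, upper, freq, rel, cum in table:
    print(f"{lower}-{upper}: f = {freq}, relative = {rel:.2f}, <cf = {cum}")
```

Changing `start`, `width`, and `k` to 10, 15, and 6 would tabulate the Table B setup instead.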
[Figure: Frequency Histogram. Horizontal axis: Classes (12-24, 24-36, 36-48, 48-60, 60-72, 72-84, 84-96); vertical axis: frequency (0 to 15).]

The stem-and-leaf plot is a good representation for raw data. All values are shown in a concise form, as shown by the Minitab output for the following data: 42, 45, 51, 61, 69, 76, 78, 78, 72, 62, 51, and 44.

Table H. Minitab output (Current worksheet: Cities.mtw)

Character Stem-and-Leaf Display
Stem-and-leaf of Atlanta    N = 12
Leaf Unit = 1.0

  3    4   245
  5    5   11
 (3)   6   129
  4    7   2688

The interval is 10: the stem increases by 10 (40, 50, 60, 70).

The stem-and-leaf indicates that there are 12 data values, and each leaf represents 1 unit. As we read the first row, we see there are 3 values in the 40’s. These are 42, 44, and 45. There are 2 (5-3=2) values in the 50’s: 51 and 51. There are (3) values in the 60’s: 61, 62, and 69. Finally, there are 4 values in the 70’s: 72, 76, 78, and 78. The first column accumulates from the top down until we reach (3). [Don’t be concerned about the meaning of this value.] Then the accumulation starts at the bottom and works upward. Adding the (3), the number above it, and the number below it gives us n = 12. [5 + 3 + 4 = 12]

Measures of Center and Variation

We will first discuss the population mean and the sample mean. When talking about the population and a sample, we refer to a parameter and a statistic, respectively. Notations for the population and sample means are $\mu$ (mu) and $\bar{X}$ (X-bar), respectively. Notice in the formulas that N (upper case) represents the total number of observations in the population, and n (lower case) represents the total number of observations in the sample.
Arithmetic Mean for Population and Sample

Type of Data    Population                              Sample
Raw             $\mu = \dfrac{\sum X}{N}$               $\bar{X} = \dfrac{\sum X}{n}$
Grouped         $\mu = \dfrac{\sum fX}{\sum f}$         $\bar{X} = \dfrac{\sum fX}{\sum f}$

The arithmetic mean (1) is calculated for interval-level and ratio-level data, (2) includes all data values, (3) is unique for a set of data, (4) is useful in comparing two or more groups of data, and (5) is affected by extremely large or extremely small values.

The median is a measure of center that requires little or no calculation for raw data. To find the median for raw data, we use the following procedure.
(1) Order the data from smallest value to largest value or vice-versa.
(2) If the number of data values is odd, choose the value in the middle, so that the same number of values are to the left as are to the right of the middle value.
(3) If the number of data values is even, choose the two values in the middle, so that the same number of values are to the left as are to the right of the two middle values.
(4) In the even case, calculate the average of those two values.

To find the median for grouped data, we use the following procedure.
(1) In the frequency distribution, form the less than cumulative frequency (<CF) column.
(2) Find one-half the sum of the frequencies, n/2.
(3) Find the largest number in the <CF column that is not larger than n/2.
(4) Circle the row (class, frequency, <CF) below the number in Step 3. This row contains the median.
(5) Subtract the number found in Step 3 (CF) from n/2, divide by the frequency (f) circled in Step 4, and multiply by the class interval (i).
(6) Add the answer from Step 5 to the lower class limit (L) of the row circled in Step 4. This represents the median for the grouped data.

The following formula summarizes the six-step procedure given above.

Median for grouped data: $\text{Median} = L + \dfrac{\frac{n}{2} - CF}{f}\,(i)$

The mode is a measure of center that identifies the data value that appears most frequently. There will be no mode if all data values appear the same number of times.
There will be more than one mode if two or more data values appear with the same frequency and more frequently than the other data value(s). To find the mode for raw data, simply find the value(s) that appear most frequently. To find the mode for grouped data, find the midpoint(s) of the class(es) that has (have) the largest frequency. The class containing the mode is called the modal class.

The midrange is midway between the largest value and the smallest value of the data.

$\text{midrange} = \dfrac{\text{largest} + \text{smallest}}{2}$

The weighted mean may be calculated by using the following three-step procedure: (1) multiply each value by a weight for that value, (2) sum those products, and (3) divide that sum by the sum of the weights. The following formula expresses the above procedure:

Weighted Mean: $\bar{X}_w = \dfrac{\sum wX}{\sum w} = \dfrac{w_1 X_1 + w_2 X_2 + \cdots + w_n X_n}{w_1 + w_2 + \cdots + w_n}$,

where w represents the weight and X represents the data value.

Skewness tells us something about the shape of a frequency distribution. A symmetric distribution is one whose graph is symmetric with respect to a vertical line that passes through the mean, median, and mode. If a distribution is skewed to the right, the graph is elongated (or stretched) to the right side. If a distribution is skewed to the left, the graph is elongated (or stretched) to the left side. Remember that extremely large values will pull the mean to the right, thus skewing the graph (distribution) to the right. Similarly, extremely small values will pull the mean to the left, thus skewing the graph (distribution) to the left. To calculate the coefficient of skewness by hand, we use Pearson’s index (coefficient) of skewness formula:

$I_{sk} = \dfrac{3(\text{mean} - \text{median})}{s}$

We will discuss variation (dispersion) for two reasons. First, variation (dispersion) can be used to indicate the presence or absence of reliability. Second, variation (dispersion) can be used to compare the spread of two or more distributions.
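As a rough sketch of the measures of center and skewness described above, the Python below applies the grouped mean, the grouped-median formula, the weighted mean, and Pearson's index of skewness. The function names are invented for this example. The grouped calculations reuse the class limits and frequencies of Table D and assume that X for grouped data is the class midpoint. The weighted-mean and skewness inputs are hypothetical values chosen only for illustration.

```python
import statistics


def grouped_mean(midpoints, freqs):
    """Sum of f*X divided by sum of f, with X assumed to be the class midpoint."""
    return sum(f * x for x, f in zip(midpoints, freqs)) / sum(freqs)


def grouped_median(lower_limits, freqs, width):
    """Median = L + ((n/2 - CF) / f) * i for equal-width classes."""
    half = sum(freqs) / 2
    cf = 0  # cumulative frequency of the classes before the current one
    for lower, f in zip(lower_limits, freqs):
        if cf + f > half:  # first class whose <cf exceeds n/2 holds the median
            return lower + (half - cf) / f * width
        cf += f
    raise ValueError("frequencies must contain at least one observation")


def weighted_mean(values, weights):
    """Sum of w*X divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def pearson_skewness(values):
    """3(mean - median) / s, using the sample standard deviation."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    return 3 * (mean - median) / statistics.stdev(values)


lower_limits = [12, 24, 36, 48, 60, 72, 84]   # Table D classes, width 12
freqs = [13, 5, 9, 3, 6, 11, 3]
midpoints = [low + 6 for low in lower_limits]

print(grouped_mean(midpoints, freqs))                      # 48.96
print(round(grouped_median(lower_limits, freqs, 12), 3))   # 45.333
print(weighted_mean([90, 80, 70], [2, 3, 5]))              # 77.0 (hypothetical scores and weights)
print(round(pearson_skewness([1, 2, 3, 4, 10]), 3))        # 0.849 (hypothetical data, skewed right)
```

For the grouped median, n/2 = 25, the largest <CF value not larger than 25 is 18, and the row below it is the class 36-48 with f = 9, so the median is 36 + (7/9)(12).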
One measure of variation (dispersion) is the range. The range is the difference between the largest and smallest data values. It is the simplest of the measures of variation (dispersion) to calculate. A disadvantage of using the range is that it involves only two of the data values.

Range: $\text{Range} = \text{Largest Value} - \text{Smallest Value}$ (D1)

We calculate the variance of data so that we can find the standard deviation. For population data, the variance is the arithmetic mean of the squared deviations from the mean. For sample data, divide the sum of the squared deviations by n-1. We may use the following procedure to calculate the variance for ungrouped data.
(1) Calculate the arithmetic mean.
(2) Find the difference between each data value and the mean.
(3) Square each of the differences found in Step 2.
(4) Sum the squares from Step 3.
(5) If the data are from a population, divide the sum in Step 4 by N, the total number of data values.
(6) If the data are from a sample, divide the sum in Step 4 by n-1, where n is the total number of data values.

The above steps are summarized in the two formulas below. In the sample calculation, the denominator of n-1 is used instead of n to help correct for the error created by the smaller number of data values in the sample compared to the population. The table below shows the Conceptual Formulas and Calculation Formulas (for raw or ungrouped data) used to find the variance of data.

The standard deviation can be used to compare the dispersion of two or more populations or samples. Also, if the data values are measured in the same units and the means are close together, a small standard deviation may be used to indicate that the mean is a reliable measure of central tendency. For population data, the standard deviation is the square root of the population variance. For sample data, the standard deviation is the square root of the sample variance.
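The six-step variance procedure, followed by the square root for the standard deviation, might be sketched in Python as below. The function names are assumptions for this example. The data are the 12 values used for the stem-and-leaf display earlier in these notes, and Python's standard `statistics` module is used only as a cross-check.

```python
import math
import statistics


def variance(values, sample=True):
    """Follow the six steps: mean, deviations, squares, sum, then divide."""
    n = len(values)
    if n < 2 and sample:
        raise ValueError("a sample variance needs at least two values")
    mean = sum(values) / n                           # Step 1
    deviations = [x - mean for x in values]          # Step 2
    squares = [d ** 2 for d in deviations]           # Step 3
    total = sum(squares)                             # Step 4
    return total / (n - 1) if sample else total / n  # Step 6 or Step 5


def standard_deviation(values, sample=True):
    """Square root of the corresponding variance."""
    return math.sqrt(variance(values, sample))


values = [42, 45, 51, 61, 69, 76, 78, 78, 72, 62, 51, 44]

print(variance(values, sample=False), variance(values, sample=True))
print(standard_deviation(values, sample=False), standard_deviation(values, sample=True))

# Cross-check against the standard library.
print(statistics.pvariance(values), statistics.variance(values))
```

Because the same sum of squared deviations is divided by n-1 rather than N, the sample variance always comes out slightly larger than the population variance for the same numbers.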
We may use the following procedure to calculate the standard deviation for ungrouped data. (1) Calculate the variance. (2) Find the square root of the variance from Step 1. The above steps are summarized in the formulas below.

Formulas to Calculate the Variance of Raw Data

Conceptual
Population: $\sigma^2 = \dfrac{\sum (X - \mu)^2}{N}$ (D3)
Sample: $s^2 = \dfrac{\sum (X - \bar{X})^2}{n - 1}$ (D4)

Calculation
Population: $\sigma^2 = \dfrac{N \sum X^2 - \left(\sum X\right)^2}{N^2}$ (D5)
Sample: $s^2 = \dfrac{n \sum X^2 - \left(\sum X\right)^2}{n(n - 1)}$ (D6)

Formulas to Calculate the Variance and Standard Deviation of Grouped Data

Variance
Population: $\sigma^2 = \dfrac{N \sum fX^2 - \left(\sum fX\right)^2}{N^2}$ (D7)
Sample: $s^2 = \dfrac{n \sum fX^2 - \left(\sum fX\right)^2}{n(n - 1)}$ (D8)

Standard Deviation
Population: $\sigma = \sqrt{\dfrac{N \sum fX^2 - \left(\sum fX\right)^2}{N^2}}$ (D9)
Sample: $s = \sqrt{\dfrac{n \sum fX^2 - \left(\sum fX\right)^2}{n(n - 1)}}$ (D10)

For grouped data, the range is the difference between the upper limit of the largest class (interval) and the lower limit of the smallest class (interval).

Range: $\text{Range} = \text{Upper Limit of Largest Interval} - \text{Lower Limit of Smallest Interval}$ (D1G)

Relative Dispersion. If the units of measure are different or the means are not close together, the standard deviation cannot be used to compare dispersions of data sets. Therefore, we use the coefficient of variation, which measures the dispersion relative to the mean by dividing the standard deviation by the mean and multiplying by 100 to form a percent. The coefficient of variation is calculated by using the following formula:

$CV = \dfrac{s}{\bar{X}}\,(100\%)$ (D12)

Empirical Rule. The Empirical Rule applies only to distributions that are symmetrical and bell-shaped. For such distributions, the Empirical Rule states that about 68% of the data values are within plus or minus one standard deviation of the mean; about 95% are within plus or minus two standard deviations of the mean; and about 99.7% are within plus or minus three standard deviations of the mean.

Chebyshev’s Theorem allows us to determine the minimum proportion of data values within a specific number (larger than one) of standard deviations of the mean for any set of data values.
This minimum proportion is calculated by using the formula

$1 - \dfrac{1}{k^2}$ (D11)

where k > 1 is the number of standard deviations on either side of the mean. Using this formula, we see that at least 75% of the data values are between two standard deviations below the mean and two standard deviations above the mean. Similarly, there would be at least 88.9% within three standard deviations of the mean. There would be at least 55.6% within 1.5 standard deviations of the mean:

k = 1.5 yields $1 - \dfrac{1}{1.5^2} = 1 - \dfrac{4}{9} = \dfrac{5}{9} \approx 0.556 = 55.6\%$

Z-score. The z-score, or standard score, is the number of standard deviations that x is from the mean.

z-score for sample: $z = \dfrac{x - \bar{x}}{s}$
z-score for population: $z = \dfrac{x - \mu}{\sigma}$

Quartiles, Deciles, Percentiles. Earlier we discussed measures of center. One of those measures was the median. We found the median to be the middle value of ungrouped data, and we used a formula to find the median for grouped data. Now we will calculate quartiles, deciles, and percentiles as measures of dispersion. For ungrouped data, the following formula may be used to find the location, L, of the kth percentile, $P_k$:

$L = \dfrac{k}{100}\,n$ (D13)

If L is a whole number, $P_k$ is midway between the Lth value and the (L+1)st value of the sorted data. If L is not a whole number, round L up to the next whole number; $P_k$ is the value at that position. To find the location of the first quartile, simply find the location of the 25th percentile; to find the location of the second decile, simply find the location of the 20th percentile.

Percentile of value x = $\dfrac{\text{number of values less than } x}{\text{total number of values}} \times 100$

Box Plots. A box plot is a graphical display of five values: the smallest and largest data values, the median, and the first and third quartiles.
To draw a box plot,
(1) identify the smallest and largest data values,
(2) calculate the first, second, and third quartiles,
(3) draw a rectangle with the first quartile at the left end, the third quartile at the right end, and the second quartile (median) as a vertical line segment in the rectangle,
(4) draw line segments from the left end of the rectangle to the smallest value and from the right end of the rectangle to the largest value.

As an example, consider the following: the smallest value is 50, the first quartile is 70, the second quartile (median) is 90, the third quartile is 115, and the largest value is 150. The box plot representing these values is shown below.

[Figure: box plot on a number line marked at 50, 70, 90, 115, and 150.]

(Copyrighted by Claude S. Moore 2004-2008)
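The relative-standing calculations in these notes (z-score, percentile location by formula D13, percentile of a value, and the five values behind a box plot) can be sketched in Python as follows. The function names are assumptions for this example. The data are the 12 values from the stem-and-leaf example, and k is restricted to whole numbers from 1 to 99 so that the (L+1)st value always exists.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd


def percentile(values, k):
    """kth percentile using L = (k/100) * n and the whole-number rule."""
    if not (isinstance(k, int) and 0 < k < 100):
        raise ValueError("k must be a whole number from 1 to 99")
    data = sorted(values)
    n = len(data)
    if (k * n) % 100 == 0:
        L = k * n // 100
        # L is whole: midway between the Lth and (L+1)st values (1-based).
        return (data[L - 1] + data[L]) / 2
    # L is not whole: round up to the next position.
    L = k * n // 100 + 1
    return data[L - 1]


def percentile_of_value(values, x):
    """(number of values less than x / total number of values) * 100."""
    return sum(1 for v in values if v < x) / len(values) * 100


def five_number_summary(values):
    """Smallest value, Q1, median, Q3, largest value."""
    return (min(values), percentile(values, 25), percentile(values, 50),
            percentile(values, 75), max(values))


values = [42, 45, 51, 61, 69, 76, 78, 78, 72, 62, 51, 44]

print(five_number_summary(values))                # (42, 48.0, 61.5, 74.0, 78)
print(percentile(values, 20))                     # 45, the second decile
print(round(percentile_of_value(values, 61), 2))  # 41.67
print(z_score(10, 4, 3))                          # 2.0 (hypothetical x, mean, and s)
```

For these 12 values, the 25th percentile has L = 3, a whole number, so Q1 is midway between the 3rd and 4th sorted values (45 and 51). Other textbooks and software use slightly different quartile conventions, so results from Excel or Minitab may differ.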