Chapter 2: Methods for Describing Sets of Data (Pages 19-98)

Homework: 14ab, 36, 43, 45, 51, 56, 64abc, 71, 79, 85, 89, 96

Section 2.1: Numerical Measures of Central Tendency (Center)

• Why are we interested in the central tendency of a set of measurements?

The central tendency of a set of measurements is the tendency of the data to cluster (or center) about certain numerical values. Because central tendency is important to both descriptive and inferential statistics, several numerical measures, such as the mean, the median, and the mode, are available to estimate it. No single measure is best for every data set, because data sets have very different characteristics.

• Which one is the best numerical measure for the central tendency of a set of data?

The most popular measure of central tendency is the mean (or arithmetic mean). We use the Greek letter µ to stand for the population mean and $\bar{x}$ to stand for the sample mean. The mode is a useful measure of central tendency if one wants to know the measurement that occurs most frequently in the data set. The median is a good measure of central tendency if the data contain several extremely large (or extremely small) measurements.

• Example 2.1 (Basic): The following data give the weekly expenditures (in dollars) on nonalcoholic beverages for 45 households randomly selected from the 1996 Diary Survey.

   6.5  10.9  12.3   9.0  10.4   8.2   9.2   5.4   4.7
   5.6   8.0  16.5  15.1   0.9   9.8   0.7   3.3   4.9
   7.2  12.7   1.3   4.6   5.4   2.5   9.0   0.9  13.5
  10.1  10.3   3.1   2.2   1.6  12.7   2.2  10.6  10.5
   2.4   7.1   1.4  10.1  15.9   7.1   1.3   4.6   2.7

Use the SAS output in the next three tables to find the sample size, mean, median, and mode for weekly expenditures.
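As a cross-check on the SAS output for Example 2.1, the same measures can be computed directly from the 45 observations. The sketch below uses Python's standard `statistics` module, which is an assumption of this illustration (the course itself works from SAS output). Nine values tie for most frequent in this data set, and the SAS output lists the smallest of them, so the sketch reports the minimum of the tied modes.

```python
from statistics import mean, median, multimode

# Weekly expenditures (dollars) from Example 2.1
expense = [
    6.5, 10.9, 12.3, 9.0, 10.4, 8.2, 9.2, 5.4, 4.7,
    5.6, 8.0, 16.5, 15.1, 0.9, 9.8, 0.7, 3.3, 4.9,
    7.2, 12.7, 1.3, 4.6, 5.4, 2.5, 9.0, 0.9, 13.5,
    10.1, 10.3, 3.1, 2.2, 1.6, 12.7, 2.2, 10.6, 10.5,
    2.4, 7.1, 1.4, 10.1, 15.9, 7.1, 1.3, 4.6, 2.7,
]

n = len(expense)                      # sample size
sample_mean = mean(expense)           # arithmetic mean
sample_median = median(expense)       # middle value of the sorted data
modes = sorted(multimode(expense))    # every value tied for the highest frequency
smallest_mode = modes[0]              # the tied mode that the SAS output reports

print(n, round(sample_mean, 6), sample_median, smallest_mode)
```

The mean (about 6.99) and the median (7.1) are close here, which is consistent with the small skewness reported by SAS.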
Results for Example 2.1

Variable=EXPENSE

Moments
N          45          Sum Wgts   45
Mean       6.986667    Sum        314.4
Std Dev    4.468811    Variance   19.97027
Skewness   0.31744     Kurtosis   -0.88551
USS        3075.3      CSS        878.692
CV         63.96199    Std Mean   0.666171
T:Mean=0   10.4878     Pr>|T|     0.0001
Num ^= 0   45          Num > 0    45
M(Sign)    22.5        Pr>=|M|    0.0001
Sgn Rank   517.5       Pr>=|S|    0.0001

Quantiles (Def=5)
100% Max   16.5        99%        16.5
75% Q3     10.3        95%        15.1
50% Med    7.1         90%        12.7
25% Q1     2.7         10%        1.3
Range      15.8
Q3-Q1      7.6
Mode       0.9

Extremes
Lowest     Obs         Highest    Obs
0.7(       27)         12.7(      45)
0.9(       34)         13.5(      22)
0.9(       14)         15.1(      26)
1.3(       39)         15.9(      24)
1.3(       20)         16.5(      41)

Example 2.2 (Intermediate): Between 1879 and 1882, Michelson conducted experiments to determine the velocity of light. Table 2.1 presents Michelson's determinations minus 299,000, in km/sec.

Table 2.1 Velocity of Light

   870   890   850  1000   960   830   880   880   890   910
   870   840   740   980   940   790   880   910   810   920
   810   780   900   930   960   810   880   850   810   890
   740   810  1070   650   940   880   860   870   820   860
   810   760   930   760   880   880   720   840   800   880
   940   810   850   810   800   830   720   840   770   720
   950   790   950  1000   850   800   620   850   760   840
   800   810   980  1000   860   790   860   840   740   850
   810   820   980   960   900   760   970   840   750   850
   870   850   880   960   840   800   950   840   760   780

Results for Example 2.2

Variable=SPEED

Moments
N          100
Mean       852.2       Sum        85220
Std Dev    78.96528    Variance   6235.515
Skewness   -0.01125    Kurtosis   0.347244
USS        73241800    CSS        617316
CV         9.26605     Std Mean   7.896528
T:Mean=0   107.9209    Pr>|T|     0.0001
Num ^= 0   100         Num > 0    100
M(Sign)    50          Pr>=|M|    0.0001
Sgn Rank   2525        Pr>=|S|    0.0001

Quantiles (Def=5)
100% Max   1070        99%        1035
75% Q3     895         95%        980
50% Med    850         90%        960
25% Q1     805         10%        760
0% Min     620         5%         730
                       1%         635
Range      450
Q3-Q1      90
Mode       810

Extremes
Lowest     Obs         Highest    Obs
620(       67)         980(       83)
650(       34)         1000(      4)
720(       60)         1000(      64)
720(       57)         1000(      74)
720(       47)         1070(      33)

• The data set is skewed to the right if there are several extremely large measurements (see Figure 2.2).
In this case the mean is greater than the median, because the extremely large values have a stronger impact on the mean than on the median.
• The data set is skewed to the left if there are several extremely small measurements (see Figure 2.3). In this case the mean is smaller than the median, because the extremely small values likewise have a stronger impact on the mean.
• The data sets are well behaved if they are symmetric (see Figure 2.1). Symmetric data sets possess several good properties that will be discussed in later chapters.

Figure 2.1 Symmetric Distribution (mean, median, and mode overlap)

Figure 2.2 Skewed to the Right (mean > median)

Figure 2.3 Skewed to the Left (mean < median)

Section 2.2: Numerical Measures of Variability

• Why are we interested in numerical measures of the variability of a set of measurements?

The variability of a set of measurements is the "spread" of the data. A measure of variability is as important as a measure of central tendency, because significantly different data sets can have the same mean, median, and mode. We introduce three numerical measures to estimate variability: the range, the variance, and the standard deviation.

• Why is the range sometimes not a good numerical measure of the variability of a set of data?

The variability of two sets of data can be very different even if they have a similar range. The range depends only on the largest and smallest measurements, so one extremely large (or extremely small) measurement can alter it significantly.

We use the symbols $s$ and $s^2$ to stand for the sample standard deviation and the sample variance, respectively, and the Greek symbols $\sigma$ and $\sigma^2$ to stand for the population standard deviation and the population variance, respectively.
Both the standard deviation and the variance are good measures of the variability of a set of measurements.

• Is there any set of measurements that can be completely explained by the sample mean and the sample standard deviation?

Yes. A set of measurements can be explained completely by its sample mean and sample standard deviation if its relative frequency distribution is similar to Figure 2.1.

Example 2.3 (Basic): Find the variance, the standard deviation, and the range from the SAS output in Example 2.1.

Example 2.4 (Intermediate):
a) Find the variance, the standard deviation, and the range from the SAS output in Example 2.2.
b) Find the variance, the standard deviation, and the range after deleting the three extreme values.
c) Which measure is most affected by the deletion of the extreme values?
d) Compare the mean, the median, and the mode before and after the deletion of the outliers.

Results for Example 2.4 (Without Extreme Values)

Variable=SPEED

Moments
N          97
Mean       854.433     Sum        82880
Std Dev    70.31135    Variance   4943.686
Skewness   0.206141    Kurtosis   -0.57312
USS        71290000    CSS        474593.8
CV         8.229007    Std Mean   7.139036
T:Mean=0   119.6847    Pr>|T|     0.0001
Num ^= 0   97          Num > 0    97
M(Sign)    48.5        Pr>=|M|    0.0001
Sgn Rank   2376.5      Pr>=|S|    0.0001

Quantiles (Def=5)
100% Max   1000        99%        1000
75% Q3     890         95%        980
50% Med    850         90%        960
25% Q1     810         10%        760
0% Min     720         5%         740
                       1%         720
Range      280
Q3-Q1      80
Mode       810

Section 2.3: Interpreting the Standard Deviation

The standard deviation measures the variability of a sample: the sample with the larger standard deviation has the higher variability. The standard deviation also helps answer questions such as "How many measurements are within 2 standard deviations of the mean?" for any specific data set. We need to understand the following two rules in order to answer such questions.
Chebyshev's Rule:

For any set of measurements, at least $1 - 1/k^2$ of the measurements will fall within k standard deviations of the mean, for any number k greater than 1.
(a) At least 3/4 of the measurements will fall within the interval $(\bar{x} - 2s, \bar{x} + 2s)$ for a sample and $(\mu - 2\sigma, \mu + 2\sigma)$ for a population.
(b) At least 8/9 of the measurements will fall within the interval $(\bar{x} - 3s, \bar{x} + 3s)$ for a sample and $(\mu - 3\sigma, \mu + 3\sigma)$ for a population.

The Empirical Rule:

The empirical rule is a rule of thumb that applies only to samples or populations whose frequency distributions are mound-shaped, i.e., similar to a bell.
(a) Approximately 68% of the measurements will fall within the interval $(\bar{x} - s, \bar{x} + s)$ for a sample and $(\mu - \sigma, \mu + \sigma)$ for a population.
(b) Approximately 95% of the measurements will fall within the interval $(\bar{x} - 2s, \bar{x} + 2s)$ for a sample and $(\mu - 2\sigma, \mu + 2\sigma)$ for a population.
(c) Approximately 99.7% of the measurements will fall within the interval $(\bar{x} - 3s, \bar{x} + 3s)$ for a sample and $(\mu - 3\sigma, \mu + 3\sigma)$ for a population.

Example 2.5 (Basic): For any set of data, what can be said about the percentage of measurements contained in each of the following intervals?
(a) $\bar{x} - 2s$ to $\bar{x} + 2s$
(b) $\bar{x} - 3s$ to $\bar{x} + 3s$
(c) $\bar{x} - 4s$ to $\bar{x} + 4s$

Example 2.6 (Intermediate): The mean and standard deviation of the heights of a group of one hundred NBA players are 70.25 inches and 3.25 inches, respectively.
(a) Based on the Empirical Rule, how many players in this group are taller than 76.75 inches?
(b) Can we answer part (a) based on Chebyshev's Rule?
(c) What assumption is required in order to apply the Empirical Rule?

Section 2.4: Numerical Measures of Relative Standing

• Can you say that you did poorly on an exam if you got 70 points?

You might have done poorly, or you might have done fairly well. You could even have the top score if all other students scored below 60 points on an extremely difficult exam. Your performance should be judged by its relative standing rather than by the numerical score.
Descriptive measures of the relationship of a measurement to the rest of the data are called measures of relative standing.

Example 2.7 (Basic): Based on the SAS output for Example 2.1, find the following percentiles:
(a) 10th percentile
(b) 25th percentile
(c) 50th percentile
(d) 55th percentile
(e) 90th percentile

Note:
1. The median is the 50th percentile of a quantitative data set.
2. The upper quartile is the 75th percentile and the lower quartile is the 25th percentile of a quantitative data set.

• Quantile: Let q be any number between 0 and 1. The qth quantile, denoted by Q(q), is a number such that a fraction q of the measurements fall below it and a fraction (1 - q) of the measurements fall above it.

• Sample Z Score: Suppose x is a measurement from a sample with mean $\bar{x}$ and standard deviation $s$. The sample Z score of x is $Z = (x - \bar{x})/s$.

• Population Z Score: Suppose x is a measurement from a population with mean $\mu$ and standard deviation $\sigma$. The population Z score of x is $Z = (x - \mu)/\sigma$.

Example 2.8: The following data give the yearly contributions (in dollars) to a local church by 35 households randomly selected from the 1996 Interview Survey.

    30    50    27    25   100   300   100
    75   200    76    25    15    60   240
   100   130    15   200    18    10    25
    50   125   200   400   500   300    34
    87    24    25   140   275   250   150

(a) Find the mean and median of this set of data.
(b) Find the standard deviation and range.
(c) Compute the Z score for 200.
(d) How many measurements fall within two standard deviations of the mean?
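The four parts of Example 2.8 can be worked directly from the 35 observations. The sketch below uses Python's standard `statistics` module as an assumption of this illustration (the course reads these values from SAS output). `stdev` divides by n - 1, so it is the sample standard deviation $s$.

```python
from statistics import mean, median, stdev

# Yearly contributions (dollars) from Example 2.8
dollars = [
    30, 50, 27, 25, 100, 300, 100,
    75, 200, 76, 25, 15, 60, 240,
    100, 130, 15, 200, 18, 10, 25,
    50, 125, 200, 400, 500, 300, 34,
    87, 24, 25, 140, 275, 250, 150,
]

x_bar = mean(dollars)                       # (a) sample mean
med = median(dollars)                       # (a) sample median
s = stdev(dollars)                          # (b) sample standard deviation
data_range = max(dollars) - min(dollars)    # (b) range
z_200 = (200 - x_bar) / s                   # (c) sample Z score of x = 200

# (d) count the measurements inside (x_bar - 2s, x_bar + 2s)
lower, upper = x_bar - 2 * s, x_bar + 2 * s
within_two_sd = sum(lower <= x <= upper for x in dollars)

print(round(x_bar, 4), med, round(s, 4), data_range)
print(round(z_200, 4), within_two_sd)
```

Only 400 and 500 lie outside the two-standard-deviation interval, so 33 of the 35 measurements (about 94%) fall inside it. That is consistent with Chebyshev's Rule, which guarantees at least 3/4 for any data set.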
Univariate Procedure

Variable=DOLLARS

Moments
N          35          Sum Wgts   35
Mean       125.1714    Sum        4381
Std Dev    120.8157    Variance   14596.44
Skewness   1.374005    Kurtosis   1.620988
USS        1044655     CSS        496279
CV         96.52021    Std Mean   20.42159
T:Mean=0   6.129369    Pr>|T|     0.0001
Num ^= 0   35          Num > 0    35
M(Sign)    17.5        Pr>=|M|    0.0001
Sgn Rank   315         Pr>=|S|    0.0001

Quantiles (Def=5)
100% Max   500         99%        500
75% Q3     200         95%        400
50% Med    87          90%        300
25% Q1     25          10%        18
0% Min     10          5%         15
                       1%         10
Range      490
Q3-Q1      175
Mode       25

Extremes
Lowest     Obs         Highest    Obs
10(        20)         275(       33)
15(        17)         300(       6)
15(        12)         300(       27)
18(        19)         400(       25)
24(        30)         500(       26)

Section 2.5: Graphic Methods for Describing Data (Bar Chart, Pie Chart, and Histogram)

• Why do we need graphic methods to describe data?

The mean and standard deviation alone cannot characterize the wide variety of distributions that data can have. It is easy to find significantly different data sets that have the same mean and standard deviation.

• Can we find several different data sets with the same mean and standard deviation?

The three data sets in Figure 2.4 all have the same mean, median, standard deviation, and variance. However, they are very different.

Figure 2.4 Dot plots of three data sets, A, B, and C, on a common horizontal axis (ticks from 82 to 122 in steps of 5)

We will not cover bar charts, pie charts, or histograms this semester. Firstly, bar charts and pie charts pose several perception problems, as indicated by the famous book "The Elements of Graphing Data" (William S. Cleveland, 1995). Secondly, we focus on quantitative data this semester, but both pie charts and bar charts are graphical tools for qualitative data. Thirdly, a well-designed stem-and-leaf display encodes more information than a histogram.

• Box plots and stem-and-leaf displays are the graphical methods discussed in this course.
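The point made earlier in this section, that different data sets can share the same mean and standard deviation, is easy to check numerically. The two small data sets below are hypothetical and are not the sets plotted in Figure 2.4. They were chosen for this sketch so that the deviations from the mean have the same sum of squares.

```python
from collections import Counter
from statistics import mean, median, stdev

# Two hypothetical data sets with 8 observations each
a = [9, 9, 9, 9, 11, 11, 11, 11]      # two clusters, nothing at the center
b = [8, 10, 10, 10, 10, 10, 10, 12]   # piled at the center, two far points

for name, data in (("A", a), ("B", b)):
    counts = sorted(Counter(data).items())   # (value, frequency) pairs
    print(name, mean(data), median(data), round(stdev(data), 4), counts)
```

Both sets have mean 10, median 10, and the same sample standard deviation, yet their frequency counts show completely different shapes. Only a graph or a frequency listing reveals the difference.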
Section 2.6: Stem-and-Leaf Display

Figure 2.5 shows a stem-and-leaf display of the ozone data (Tukey 1977). It is a hybrid between a data table and a histogram: it shows numerical values as numerals, but its profile is very much like a histogram (see Figure 2.6). Use the following steps to construct a stem-and-leaf display by hand.

1. Define the stem and leaf to be used.
2. Write the stems in a column, arranged from the smallest stem at the top (bottom) to the largest stem at the bottom (top).
3. If the leaves consist of more than one digit, drop the digits after the first digit.
4. Record the leaf for each measurement in the row corresponding to its stem.
5. Find the median and highlight the leaf corresponding to the median.
6. Count the number of leaves in the row with the median and put the count in the depth column.
7. Count the number of leaves for each row from the top row to the median row and put the cumulative counts in the depth column.
8. Count the number of leaves for each row from the bottom row to the median row and put the cumulative counts in the depth column.
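The steps above can be sketched in code. The function below is a minimal illustration, not the procedure SAS uses: the name `stem_and_leaf` and the `leaf_unit` parameter are hypothetical, the data are assumed to be nonnegative, and step 5 (highlighting the median leaf) is left out. It is applied to the 48 weights of Example 2.9 later in this chapter, with stems written as 12 to 23 rather than 120 to 230.

```python
from collections import defaultdict

def stem_and_leaf(data, leaf_unit=1):
    """Return (depth, stem, leaves) rows, smallest stem first.

    Assumes nonnegative data; digits below `leaf_unit` are dropped (step 3).
    """
    values = sorted(int(x // leaf_unit) for x in data)
    leaves_by_stem = defaultdict(list)
    for v in values:
        leaves_by_stem[v // 10].append(v % 10)          # steps 1 and 4
    stems = range(min(leaves_by_stem), max(leaves_by_stem) + 1)  # step 2

    n = len(values)
    # Stems of the middle observation(s); equal when one row holds the median.
    low_mid = values[(n - 1) // 2] // 10
    high_mid = values[n // 2] // 10

    rows, seen = [], 0
    for stem in stems:
        leaves = leaves_by_stem[stem]
        seen += len(leaves)
        if stem == low_mid == high_mid:
            depth = f"({len(leaves)})"                  # step 6: median row count
        elif stem <= low_mid:
            depth = str(seen)                           # step 7: count from the top
        else:
            depth = str(n - seen + len(leaves))         # step 8: count from the bottom
        rows.append((depth, stem, "".join(map(str, leaves))))
    return rows

# Weights of 48 male students from Example 2.9
weights = [
    123, 128, 130, 135, 140, 142, 145, 151, 155, 155, 155, 156,
    156, 156, 160, 160, 163, 165, 165, 170, 170, 170, 170, 173,
    174, 175, 175, 180, 182, 185, 185, 185, 185, 186, 190, 190,
    191, 195, 195, 198, 200, 205, 206, 208, 215, 220, 220, 230,
]

display = stem_and_leaf(weights)
for depth, stem, leaves in display:
    print(f"{depth:>5} {stem:>3} | {leaves}")
```

The depth column counts inward from both ends, so the parenthesized entry marks the row that contains the median, and the two depths next to it plus that count add up to the sample size (19 + 8 + 21 = 48).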
Figure 2.5 Stem-and-Leaf

Depth   Stem   Leaf
    3     17   034
    5     16   99
    8     15   025
   12     14   1236
   16     13   1346
   23     12   2244455
   30     11   1334899
   36     10   013338
   43      9   1244899
   59      8   0000002235667779
 (11)      7   11111122355
   55      6   0114444668889
   42      5   1222259
   35      4   023677779
   26      3   11223788888888
   12      2   3444467888
    2      1   44

Figure 2.6 Stem-and-Leaf Display with 90 Degree Rotation (the display of Figure 2.5 turned on its side)

Univariate Procedure

Variable=OZONE

Moments
N          125         Sum Wgts   125
Mean       79.288      Sum        9911
Std Dev    39.90954    Variance   1592.771
Skewness   0.510449    Kurtosis   -0.49653
USS        983327      CSS        197503.6
CV         50.3349     Std Mean   3.569618
T:Mean=0   22.2119     Pr>|T|     0.0001
Num ^= 0   125         Num > 0    125
M(Sign)    62.5        Pr>=|M|    0.0001
Sgn Rank   3937.5      Pr>=|S|    0.0001

Quantiles (Def=5)
100% Max   174         99%        173
75% Q3     103         95%        152
50% Med    72          90%        136
25% Q1     47          10%        31
0% Min     14          5%         24
                       1%         14
Range      160
Q3-Q1      56
Mode       38

Advantages of stem-and-leaf display:
• Both the numerical values and the graphical shape can be seen on a stem-and-leaf display.
• It is very easy to locate an individual measurement on a stem-and-leaf display.
• You can sort a relatively small data set by hand using a stem-and-leaf display.
• You can read the median, mode, range, maximum, minimum, upper quartile, lower quartile, and inner quartile range from a stem-and-leaf display.
• We can assess the symmetry of a set of measurements from the stem-and-leaf display. A set of measurements is symmetric if its relative frequency distribution looks similar to Figure 2.1. The relative frequency distribution of the ozone data can be seen from the rotated stem-and-leaf display (Figure 2.6). The ozone data are skewed to the right: most observations have small values, and a smaller number of large values stretch the distribution to the right.
Example 2.9: The following table contains 48 measurements of the weight of a group of male students in STA 3023 last year.

Table 2.2

   123   128   130   135   140   142   145   151   155   155   155   156
   156   156   160   160   163   165   165   170   170   170   170   173
   174   175   175   180   182   185   185   185   185   186   190   190
   191   195   195   198   200   205   206   208   215   220   220   230

a) Construct a stem-and-leaf display for the data in Table 2.2.
b) Is the data symmetric?
c) Find the mean, the median, the range, the standard deviation, the lower quartile, and the upper quartile from the SAS output.

Depth   Stem   Leaves
    2    120   3,8
    4    130   0,5
    7    140   0,2,5
   14    150   1,5,5,5,6,6,6
   19    160   0,0,3,5,5
  (8)    170   0,0,0,0,3,4,5,5
   21    180   0,2,5,5,5,5,6
   14    190   0,0,1,5,5,8
    8    200   0,5,6,8
    4    210   5
    3    220   0,0
    1    230   0

Figure 2.7 Stem-and-Leaf Display with 90 Degree Rotation (the display above turned on its side)

SAS Output for Example 2.9

Variable=WEIGHT

Moments
N          48          Sum Wgts   48
Mean       174.3333    Sum        8368
Std Dev    25.41932    Variance   646.1418
Skewness   0.070001    Kurtosis   -0.43366
USS        1489190     CSS        30368.67
CV         14.58087    Std Mean   3.668963
T:Mean=0   47.5157     Pr>|T|     0.0001
Num ^= 0   48          Num > 0    48
M(Sign)    24          Pr>=|M|    0.0001
Sgn Rank   588         Pr>=|S|    0.0001

Quantiles (Def=5)
100% Max   230         99%        230
75% Q3     190.5       95%        220
50% Med    173.5       90%        208
25% Q1     156         10%        140
0% Min     123         5%         130
                       1%         123
Range      107
Q3-Q1      34.5
Mode       170

Section 2.7: Box Plots

• Inner Quartile Range (IQR), also called the interquartile range: The upper quartile minus the lower quartile.
• Step: 1.5*IQR
• Upper Inner Fence: Upper quartile plus one step.
• Lower Inner Fence: Lower quartile minus one step.
• Upper Outer Fence: Upper quartile plus two steps.
• Lower Outer Fence: Lower quartile minus two steps.
• Outside Value: Any measurement that is greater than the upper inner fence or less than the lower inner fence.
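The definitions above translate directly into arithmetic. The sketch below is a minimal illustration: the function name `box_plot_fences` is hypothetical, and the quartiles are taken from the SAS output for Example 2.8 (Q1 = 25, Q3 = 200) rather than recomputed, because different quantile definitions can give slightly different quartiles.

```python
def box_plot_fences(q1, q3):
    """Return the IQR, the step, and the inner and outer fences."""
    iqr = q3 - q1              # upper quartile minus lower quartile
    step = 1.5 * iqr
    return {
        "iqr": iqr,
        "step": step,
        "lower_inner": q1 - step,        # lower quartile minus one step
        "upper_inner": q3 + step,        # upper quartile plus one step
        "lower_outer": q1 - 2 * step,    # lower quartile minus two steps
        "upper_outer": q3 + 2 * step,    # upper quartile plus two steps
    }

# Yearly contributions (dollars) from Example 2.8
dollars = [
    30, 50, 27, 25, 100, 300, 100,
    75, 200, 76, 25, 15, 60, 240,
    100, 130, 15, 200, 18, 10, 25,
    50, 125, 200, 400, 500, 300, 34,
    87, 24, 25, 140, 275, 250, 150,
]

fences = box_plot_fences(q1=25, q3=200)   # quartiles as reported by SAS

# Outside values lie beyond either inner fence.
outside = sorted(
    x for x in dollars
    if x < fences["lower_inner"] or x > fences["upper_inner"]
)

print(fences)
print(outside)
```

For these data the IQR is 175 and the step is 262.5, so the inner fences are -237.5 and 462.5. The single contribution of 500 is the only outside value, and it still lies inside the upper outer fence of 725.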
Elements of a Box Plot:
• A rectangle is drawn with its ends at the lower and upper quartiles (the hinges). The median of the data is shown in the box, usually by a line through the box.
• The points at a distance of 1.5*IQR from each hinge mark the inner fences of the data set. Horizontal lines (whiskers) are drawn from each hinge to the most extreme measurement inside the inner fence.
• A second pair of fences, the outer fences, lies at a distance of 3*IQR from the hinges. One symbol (usually "*" in SAS) is used to represent measurements falling between the inner and outer fences. Another symbol (usually "0" in SAS) is used to represent measurements beyond the outer fences.

Interpretation of Box Plots
• The median shows the central tendency of the data.
• The length of the box (IQR) provides a measure of the variability of the middle 50% of the data.
• The individual outside values give the viewer an opportunity to consider the presence of outliers, that is, observations that seem unusually, or even implausibly, large or small. Outside values are not necessarily outliers, but any outlier will almost certainly appear as an outside value.
• The box plot allows a partial assessment of symmetry. The box plot is symmetric about its median if the data is symmetric. If one whisker is clearly longer, the data is probably skewed in the direction of the longer whisker.

Example 2.10: Based on the box plot for the data in Example 2.1, answer the following:
(a) Is the data symmetric?
(b) Is there any outside value?
(c) Find the upper quartile, the median, the lower quartile, the minimum value, and the maximum value.

Figure 2.8 Box Plot for Data in Example 2.1 (axis: Weekly Expenditure, in dollars; ticks at 5, 10, 15)

Example 2.11: Based on the box plot for the data in Example 2.2, answer the following:
a. Is the data symmetric?
b. Is there any outside value?
c. Find the upper quartile, the median, the lower quartile, the minimum value, and the maximum value.
d. Compute the inner quartile range and the step.
Figure 2.9 Velocity of Light (box plot; axis: Speed of Light; ticks at 700, 800, 900, 1000)

Example 2.12: Based on the box plot for the data in Example 2.8, answer the following:
(a) Is the data symmetric?
(b) Is there any outside value?
(c) Find the upper quartile, the median, the lower quartile, the minimum value, and the maximum value.
(d) Compute the inner quartile range and the step.

Figure 2.10 Box Plot for Data in Example 2.8 (axis: Yearly Contributions; ticks at 0, 100, 200, 300, 400, 500)

Quick Review:
• Mean, Median, and Mode
• Range, Standard Deviation, and Variance
• Upper Quartile, Lower Quartile, and IQR
• Chebyshev's Rule and Empirical Rule
• Z-Score
• Symmetry and Skewness
• Mound-Shaped Distribution
• Box Plot and Stem-and-Leaf Display