Chapter 3 – Numerically Summarizing Data

After we have become somewhat familiar with the data by representing it graphically and observing the characteristics of its distribution, we want to describe those characteristics with numerical values called descriptive statistics.

Recall from Chapter 1:

Defn: A parameter is a numerical characteristic of a population.

Defn: A statistic is a numerical characteristic of a sample. (Remember, a sample is a subset of a population.)

We want to use the value of a statistic found from the sample data to gain knowledge about the value of the corresponding parameter, which we could compute directly if we had access to the entire population.

Measures of Central Tendency

Measures of central tendency give us information about the location of the center (in some sense) of the distribution of (numeric) data values. We will discuss three measures of central tendency: the mean, the median, and the mode.

Defn: If we have a set of n sample data values, $x_1, x_2, \ldots, x_n$, the mean of these data values is their arithmetic average:

$$\bar{x} = \frac{1}{n}\left(x_1 + x_2 + \cdots + x_n\right) = \frac{1}{n}\sum_{i=1}^{n} x_i .$$

If we have a set of N population data values, the mean of these values is:

$$\mu = \frac{1}{N}\left(x_1 + x_2 + \cdots + x_N\right) = \frac{1}{N}\sum_{i=1}^{N} x_i .$$

Note: $\bar{x}$ is a statistic; $\mu$ is a parameter.

Example: p. 123, Example 6
1) Go to STAT, 1:Edit.
2) Enter the data, with a suitable variable name, such as BP.
3) Choose STAT, CALC, 1:1-Var Stats.
4) Enter the variable name, and press ENTER.
5) You will see a list of numerical values for the data, including $\sum_{i=1}^{50} x_i = 374.4$ and $\bar{x} = 7.488$.

The average, or mean, birthweight for the babies was found to be 7.488 pounds. The total weight for the babies is 374.4 pounds.

Properties of the Mean:
1) One computes the mean by using all of the values of the data.
2) The mean varies less than the other two measures of central tendency when samples are taken from the same population and all three measures are computed for these samples.
3) The mean is used in computing other statistics, such as the variance.
4) The mean for the data set is unique, and is not necessarily one of the data values.
5) The mean is affected by extremely high or low values and may not be the appropriate measure to use in these situations.

Example: Suppose that we had made a mistake in entering the data for the first baby, entering 0.8 rather than 5.8. The computed value of the mean would be $\bar{x} = 7.388$ pounds, somewhat lower than the value computed from the correct data.

Sometimes the correct raw data have extreme values. In these situations, the mean may not be the best measure of central tendency to use. In such cases, we might prefer to use the median.

Defn: The median is the midpoint of the data set; it is a value $\tilde{x}$ such that at least 50% of the data values lie at or below $\tilde{x}$ and at least 50% of the data values lie at or above $\tilde{x}$.

Example: p. 123, Example 6
1) Go to STAT, 1:Edit.
2) Enter the data, with a suitable variable name, such as BP.
3) Choose STAT, CALC, 1:1-Var Stats.
4) Enter the variable name, and press ENTER.
5) You will see a list of numerical values for the data, including Med = 7.35, which is the median $\tilde{x}$.

The median value of the birthweights for the sample of babies was 7.35 pounds.

Properties of the Median:
1) The median is used when one must find the center or middle value of a data set.
2) The median is used when one must determine whether the data values fall into the upper half or the lower half of the distribution.
3) The median is used to find the average of an open-ended distribution.
4) The median is affected less than the mean by extremely high or extremely low values.

Example: Let’s return to the above example, and again assume that the value 5.8 was incorrectly entered as 0.8. The median is still found to be 7.35 pounds, since a single incorrect extreme value has little effect on the calculation of the median.

Sometimes, the median is a more appropriate measure of central tendency than the mean for a data set.

Example: The U.S.
Department of Commerce Bureau of Labor Statistics gives information about the distribution of personal incomes in the U.S. This distribution, of course, has extreme values. Hence the Bureau uses the median income, rather than the mean, as the appropriate measure of central tendency.

In some situations, the most appropriate measure of central tendency is the mode of the distribution.

Defn: The data value that occurs most often in a data set is called the mode.

Note: Some data sets do not have a mode. For example, the data set consisting of the values 1, 1, 2, 2, 3, 3, 4, 4 does not have a single most frequently occurring value, and hence does not have a mode. For these data, the mean or median would be the most appropriate measure of central tendency.

Examples: p. 124, Examples 7 and 8. The calculator will provide only a little help here, in sorting the data.
1) Rearrange the data so that the values are listed in increasing order.
2) Find the value that occurs most frequently, if such a value exists.

For this data set, the mode is 0.

Properties of the Mode:
1) The mode is used when the most typical case is desired.
2) The mode is the easiest average to compute.
3) The mode can be used when the data are categorical, such as religious preference, gender, or political affiliation.
4) The mode is not always unique. A data set can have more than one most frequently occurring value, in which case we say that it does not have a mode.

Distribution Shapes: (See p. 122)
1) In a positively skewed distribution, the following relationship holds among the measures of central tendency: $\text{Mode} < \tilde{x} < \bar{x}$.
2) In a negatively skewed distribution, the following relationship holds among the measures of central tendency: $\bar{x} < \tilde{x} < \text{Mode}$.
3) In a symmetrical distribution, the data values are evenly distributed on both sides of the mean.
In this situation, the following relationship holds among the measures of central tendency: $\tilde{x} = \bar{x} = \text{Mode}$.

Measures of Variability:

In addition to locating the center (in some sense) of a data distribution, we also want to know how spread out the data values are. We will talk about three measures of variability: Range, Variance, and Standard Deviation.

Defn: The range of a data set is the difference between the largest and smallest data values: Range = Xmax – Xmin.

Example: p. 123, Example 6 (baby birthweights)
Range = 9.4 – 5.8 = 3.6 pounds

The range is not the most useful measure of the variability of the data, however, since it uses only the two extreme values and ignores the rest of the information about variability. The following two data distributions have the same range, but we would not say that they have the same variability:

Data set 1: 10, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 90
Data set 2: 10, 10, 10, 10, 50, 50, 50, 50, 90, 90, 90, 90

If we construct histograms for each of these data sets, we see that the first set of data values is more concentrated at the center. We need another measure of variability that will allow us to distinguish between these two situations. This measure of variability should include information about the location of each item in the data set relative to the center of the data distribution.

Defn: For an observation $x_i$, define the corresponding deviation from the mean to be $e_i = x_i - \bar{x}$.

Can we use the sum of all of the deviation scores for the data as our measure of variability? No. Why not? For any data set $x_1, x_2, \ldots, x_n$, we have

$$\sum_{i=1}^{n} e_i = 0 .$$

Why is this so? (Since $\sum_{i=1}^{n} x_i = n\bar{x}$, we get $\sum_{i=1}^{n} (x_i - \bar{x}) = n\bar{x} - n\bar{x} = 0$: the positive and negative deviations cancel exactly.)

Defn: For a population of N data values, $x_1, x_2, \ldots, x_N$, having population mean

$$\mu = \frac{1}{N}\left(x_1 + x_2 + \cdots + x_N\right) = \frac{1}{N}\sum_{i=1}^{N} x_i ,$$

the variance of the population data set is

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} \left(x_i - \mu\right)^2 .$$

The standard deviation of the population data set, $\sigma$, is the square root of the variance.

Defn: For a sample of n data values, $x_1, x_2, \ldots, x_n$, having sample mean

$$\bar{x} = \frac{1}{n}\left(x_1 + x_2 + \cdots + x_n\right) = \frac{1}{n}\sum_{i=1}^{n} x_i ,$$

the variance of the sample data set is

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2 .$$

The standard deviation of the sample data set, $s$, is the square root of the variance.

Why do we need to define two different additional measures of variability for a data set? (Hint: units of measurement.)

Why do we divide by n – 1, rather than by n, when computing the sample variance?

Defn: An unbiased estimator of a parameter is a statistic such that the average of the values of the statistic for repeated random samples of the same size tends toward the true value of the parameter.

When we divide by n – 1, rather than n, to compute the sample variance, we are creating an unbiased statistic for estimating the population variance.

Example: p. 123, Example 6. From the 1-Var Stats function of the calculator, we find that the mean is $\bar{x} = 7.488$ pounds. The standard deviation of the data set is $s = 0.8030$ pounds. The variance of the data set is then $s^2 = 0.6448$ squared pounds.

Now assume that we have committed two data entry errors, replacing 5.8 with 0.8 and replacing 9.4 with 19.4. What are the values of the variability measures now? We find $s = 2.0762$ pounds and $s^2 = 4.3105$ squared pounds. There is much more variability in the data with these two data entry errors.

Example: Given the following data set: 5, 5, 5, 5, 5, 5, 5, 5, what is the standard deviation? (Hint: You don’t need to use the calculator to answer this question.)

Uses of Variance and Standard Deviation:
1) As previously stated, variances and standard deviations can be used to determine the spread of the data. If the variance or standard deviation is large, the data are more dispersed. This information is useful in comparing two or more data sets to determine which is more (most) variable.
2) The measures of variance and standard deviation are used to determine the consistency of a variable. For example, in the manufacture of fittings, like nuts and bolts, the variation in the diameters must be small or parts will not fit together.
3) The variance and standard deviation are used to determine the number of data values that fall within a specified interval in a distribution.
4) Finally, the variance and standard deviation are used quite often in inferential statistics. These uses will be shown later in the course.

The Empirical Rule: If a data distribution is bell-shaped (or normal), then the following statements are true:
1) Approximately 68% of the data values lie within one standard deviation on either side of the mean.
2) Approximately 95% of the data values lie within two standard deviations on either side of the mean.
3) Approximately 99.7% of the data values lie within three standard deviations on either side of the mean.

Example: p. 128, Exercise 29
Suppose that we know that the distribution of M&M weights is approximately bell-shaped. From the Empirical Rule, we can then say that approximately 68% of the M&M’s in the sample have weights between

$$\bar{x} - s = 0.8746\ \text{g} - 0.0356\ \text{g} = 0.8390\ \text{g} \quad\text{and}\quad \bar{x} + s = 0.8746\ \text{g} + 0.0356\ \text{g} = 0.9102\ \text{g} .$$

We can also say that approximately 95% of the M&M’s in the sample have weights between

$$\bar{x} - 2s = 0.8746\ \text{g} - (2 \times 0.0356\ \text{g}) = 0.8034\ \text{g} \quad\text{and}\quad \bar{x} + 2s = 0.8746\ \text{g} + (2 \times 0.0356\ \text{g}) = 0.9458\ \text{g} .$$

Finally, we can say that approximately 99.7% of the M&M’s in the sample have weights between

$$\bar{x} - 3s = 0.8746\ \text{g} - (3 \times 0.0356\ \text{g}) = 0.7678\ \text{g} \quad\text{and}\quad \bar{x} + 3s = 0.8746\ \text{g} + (3 \times 0.0356\ \text{g}) = 0.9814\ \text{g} .$$

Example: p. 96, Exercise 42

Measures of Position

In addition to summary statistics describing the entire data set, we are often interested in locating particular members of the sample, or particular data values, within the context of the data set as a whole.

Defn: A standard score, or z-score, for a data value is obtained by subtracting the mean from the data value and then dividing by the standard deviation. For an observation $x_i$ in a sample data set, the z-score is

$$z_i = \frac{x_i - \bar{x}}{s} .$$

For a member of a population data set, the z-score is

$$z_i = \frac{x_i - \mu}{\sigma} .$$

With z-scores, we can compare relative locations of scores from two different data distributions.
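The Empirical Rule intervals and the z-score formula above can be checked with a short Python sketch. It uses the M&M summary values quoted in the exercise (mean 0.8746 g, standard deviation 0.0356 g). The function names and the weight of 0.9102 g passed to the z-score function are illustrative assumptions, not part of the course materials, and the course itself uses the TI-83 rather than Python.

```python
def z_score(x, mean, sd):
    """Standard score: how many standard deviations x lies from the mean."""
    return (x - mean) / sd


def empirical_rule_intervals(mean, sd):
    """Return {k: (mean - k*sd, mean + k*sd)} for k = 1, 2, 3."""
    return {k: (mean - k * sd, mean + k * sd) for k in (1, 2, 3)}


# Summary statistics quoted in the M&M example (grams).
MEAN_G, SD_G = 0.8746, 0.0356

intervals = empirical_rule_intervals(MEAN_G, SD_G)
for k, (low, high) in intervals.items():
    print(f"within {k} standard deviation(s): {low:.4f} g to {high:.4f} g")

# Hypothetical weight chosen for illustration: the upper end of the 68% interval.
z = z_score(0.9102, MEAN_G, SD_G)
print(f"z-score of a 0.9102 g candy: {z:.2f}")
```

A weight at the upper end of the 68% interval has a z-score of about 1, which is one way to see that the Empirical Rule is a statement about z-scores between –1 and 1, –2 and 2, and –3 and 3.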
Example: Which of the following exam grades has a better relative position?
a) A grade of 43 on a test with a mean of 40 and standard deviation of 3.
b) A grade of 75 on a test with a mean of 72 and standard deviation of 5.

The z-score corresponding to the data value from the first data set is

$$z_1 = \frac{43 - 40}{3} = 1 .$$

The z-score corresponding to the data value from the second data set is

$$z_2 = \frac{75 - 72}{5} = 0.6 .$$

The first score is higher, relative to its score distribution, than the second score.

Percentiles

To locate the position of an individual score relative to its own data set, we often use percentiles.

Defn: The kth percentile of a data set is the value for which at most k% of the data values are less than that value and at most (100 – k)% of the data values are more than that value.

Finding a Data Value Corresponding to the kth Percentile
1) Arrange the data in order from lowest to highest.
2) Substitute into the formula $i = \frac{kn}{100}$, where n is the size of the data set, and k is the particular percent.
3) a) If i is not an integer, round up to the next higher integer. Starting at the smallest data value in the list, count up to the position corresponding to the rounded value of i. The data value at that position is the kth percentile of the data set.
   b) If i is an integer, use the value halfway between the ith and (i+1)th data values, counting from the smallest data value.

Example: What value in the following data set corresponds to the 60th percentile?
1) 12, 28, 35, 42, 47, 49, 50
2) n = 7, and $i = \frac{(7)(60)}{100} = 4.2$
3) 4.2 is not an integer, so we round up to 5. The 60th percentile is then the 5th data value, counting from the lowest, or 47.

Example: The A. C. Nielsen Company publishes data on the TV-viewing habits of Americans in the Nielsen Report on Television. A sample of 20 people yielded the following data on weekly viewing times (in hours): 25, 41, 27, 32, 43, 66, 35, 31, 15, 5, 34, 26, 32, 38, 16, 30, 38, 30, 20, 21. What is the 25th percentile of the data?
I.e., what is the value such that at most 25% of the data values are less than that value and at most 75% of the data values are greater than that value?
1) Rearrange the data in ascending order: 5, 15, 16, 20, 21, 25, 26, 27, 30, 30, 31, 32, 32, 34, 35, 38, 38, 41, 43, 66
2) $i = \frac{kn}{100} = \frac{(20)(25)}{100} = 5$
3) Since i is an integer, the 25th percentile is halfway between the 5th and the 6th data observations, namely, $\frac{21 + 25}{2} = 23$.

Hence, 25% of the weekly viewing times are less than 23 hours, and 75% of the weekly viewing times are greater than 23 hours.

Defn: The first quartile of a data set is Q1, the 25th percentile. The second quartile of the data set is $Q_2 = \tilde{x}$, the median, or 50th percentile. The third quartile of the data set is Q3, the 75th percentile.

Defn: An outlier is an extremely high or extremely low data value, when compared to the rest of the data values.

Defn: The 5-number summary of a data set consists of the lowest value of the data, Xmin, the three quartiles, Q1, $\tilde{x}$, and Q3, and the highest value of the data, Xmax.

Example: For the Nielsen data
1) Enter the data as the variable VIEW.
2) Go to STAT, CALC, 1-Var Stats, and enter the variable name VIEW.
3) Scroll down and read off the 5-number summary of the data set: Xmin = 5, Q1 = 23, Med = 30.5, Q3 = 36.5, Xmax = 66.

Defn: The interquartile range of a data set is IQR = Q3 – Q1.

We will consider a data value to be an outlier if its value is either greater than Q3 + 1.5 IQR or less than Q1 – 1.5 IQR.

Example: The Nielsen data. We suspect that the largest value, 66, could be an outlying observation. We calculate Q3 + 1.5 IQR = 36.5 + (1.5)(36.5 – 23) = 56.75. Since 66 > 56.75, the largest data value is indeed an outlier, and should be investigated individually.

Defn: A boxplot is a graphical representation of a numeric data set, using the 5-number summary. The data values between Q1 and Q3 are represented by a box, with a vertical line at the median value.
The data values between Xmin and Q1 are represented by a line segment attached to the left end of the box. The data values between Q3 and Xmax are represented by a line segment attached to the right end of the box.

Note: The TI-83 will do boxplots.

Information Obtained from a Boxplot
a) If the median is near the center of the box and the two lines are of approximately equal length, the distribution is approximately symmetric.
b) If the median falls to the left of the center of the box and the right line is longer than the left line, the distribution is positively skewed.
c) If the median falls to the right of the center of the box and the left line is longer than the right line, the distribution is negatively skewed.

Example: The Nielsen data
1) Enter the data in the calculator.
2) Set the WINDOW appropriately.
3) Clear all other plots and drawings.
4) Choose STAT PLOT, and turn Plot 1 on.
5) Choose the 5th Type of plot, the Boxplot.
6) For Xlist, choose the name of the variable, in this case VIEW.
7) To display the boxplot, press the GRAPH key.
8) Use the TRACE key to find Xmin, Q1, the median, Q3, and Xmax.
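For readers who want to check the percentile, quartile, and outlier calculations without the calculator, here is a minimal Python sketch of the position rule i = kn/100 described in these notes, applied to the Nielsen viewing times. The function name `percentile` and the variable names are illustrative assumptions. Other software may use different percentile conventions, so results from library routines need not match this hand rule on every data set.

```python
def percentile(data, k):
    """k-th percentile by the position rule i = k*n/100 (k an integer, 0 < k < 100)."""
    if not 0 < k < 100:
        raise ValueError("k must be strictly between 0 and 100")
    values = sorted(data)
    n = len(values)
    # whole = integer part of i, remainder = 0 exactly when i is an integer.
    whole, remainder = divmod(k * n, 100)
    if remainder != 0:
        # i is not an integer: round up to position whole + 1 (1-based),
        # which is index `whole` in a 0-based list.
        return values[whole]
    # i is an integer: average the i-th and (i+1)-th ordered values.
    return (values[whole - 1] + values[whole]) / 2


# Weekly viewing times (hours) from the Nielsen example.
viewing = [25, 41, 27, 32, 43, 66, 35, 31, 15, 5,
           34, 26, 32, 38, 16, 30, 38, 30, 20, 21]

q1, med, q3 = (percentile(viewing, k) for k in (25, 50, 75))
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in viewing if x < lower_fence or x > upper_fence]

print("5-number summary:", min(viewing), q1, med, q3, max(viewing))
print("IQR:", iqr, "fences:", lower_fence, upper_fence)
print("outliers:", outliers)
```

For this data set the sketch reproduces the values read from the calculator (Q1 = 23, Med = 30.5, Q3 = 36.5) and flags 66 as the only value beyond the fences.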