CHAPTER 3: MEASURES OF CENTRAL TENDENCY

We can get a better understanding of a dataset if we can locate its middle or centre, and also get an indication of its spread or dispersion. Knowing one of these without the other is often of little use. Three statistics are commonly used to measure the centre of a dataset: the mode, the mean and the median.

3.1 Measures of central tendency

If a dataset relates to more than one variable, one may construct a frequency distribution for each of the variables. After producing a frequency distribution for a particular variable, the next step in summarizing the values recorded for that variable is to indicate the point where the “centre of the distribution” (in some sense) lies, i.e. where the values in the dataset tend to cluster. This is the purpose of a measure of central tendency, which is also loosely called an average.

Since the term “centre of a distribution” may be interpreted in several different ways, there are several different measures of central tendency, including the arithmetic mean (usually just called the “mean” or “average”), the geometric mean, the harmonic mean, the median and the mode.

A measure of central tendency may be calculated for a dataset relating to a population of N units or for a dataset relating to a sample of n units. The formulas in the following sections use capital letters and relate to populations; similar formulas using lower-case letters relate to samples.

3.2 Definition of the arithmetic mean

The arithmetic mean of a dataset is the value of the variable that each unit would have if every unit to which the dataset relates had the same value and the total of their values was the same as the actual total for the dataset, i.e. it is the common value that every unit would have if the total of the dataset were re-allocated equally amongst all the units to which the dataset relates.
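The re-allocation idea in this definition can be checked with a few lines of Python. This is only an illustrative sketch: the function name `arithmetic_mean` and the four-value dataset are hypothetical and do not come from the text.

```python
def arithmetic_mean(values):
    """Return the arithmetic mean: the total shared equally among all units."""
    if not values:
        raise ValueError("the dataset must contain at least one observation")
    return sum(values) / len(values)


# Hypothetical dataset (not from the text): values for N = 4 units.
values = [2, 4, 6, 8]
mean = arithmetic_mean(values)

# Give every unit the mean value; the re-allocated total equals the actual total.
reallocated = [mean] * len(values)

print(mean)                             # 5.0
print(sum(reallocated) == sum(values))  # True
```

Every unit receives the same value (the mean), and the total of the dataset is unchanged, which is what the definition requires.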
3.3 Calculating the mean of raw data

To calculate the (arithmetic) mean for a particular variable, add the values of all the observations for that variable and divide by the number of observations.

Formula: $\bar{X} = \sum X / N$, where X denotes the value of an observation.

3.4 Calculating the mean of an ungrouped frequency distribution

The procedure for calculating the mean in this case is as follows.

Step 1: Multiply each possible value (X) by its frequency (F), in order to calculate the class total, i.e. the total of all the observations in the class.
Step 2: Add all these products, in order to obtain the grand total of all the observations in the dataset.
Step 3: Divide the grand total by the total frequency (N), i.e. the total number of observations.

Formulas: $N = \sum F$ and $\bar{X} = \sum FX / N$, where X denotes the value of the variable for each class and F denotes the class frequency.

Example 3.4.1:

X          F      FX
6          1       6
7          4      28
8          3      24
9          2      18
34         1      34
Total     11     110

N = 11
$\bar{X} = \sum FX / N$ = 110/11 = 10

Additional Examples

Example 3.4.2: The number of faulty products returned to an electrical goods store on each day of a 21-day period, arranged in ascending order, is:

1 1 3 3 3 3 4 4 4 5 5 6 6 7 7 8 8 8 9 9 9

For this dataset, find the:
a. mean
b. median
c. mode
d. 60th percentile

Solutions
a. Mean: $\bar{x} = \sum x / n$ = 113/21 = 5.38 faulty products.
b. Median: as n = 21, (n + 1)/2 = 11; therefore, from the ordered set, the median is the 11th score, which is 5.
c. Mode: 3, which occurs most often (four times).
d. 60th percentile: (60/100) × 21 = 12.6, so P60 is the 13th score, which is 6.

Example 3.4.3: If 6 people have a mean mass of 53.7 kg, find their total mass.

Solution: mean = (sum of masses)/6 = 53.7 kg, so sum of masses = 53.7 × 6 = 322.2 kg.

3.5 Calculating the mean of a grouped frequency distribution

In a grouped frequency distribution, there is a loss of information: the precise value of each observation is not known, only the class within which it falls, i.e. the class limits.
In order to do any calculations relating to the distribution, it is necessary to use the class midpoint to represent, or estimate, every observation in the class. In formulas, we use X to refer to the class midpoint.

The procedure for estimating the mean in this case is as follows.

Step 1: Calculate the midpoint of each class.
Step 2: Multiply each class midpoint by its frequency, in order to estimate the class total.
Step 3: Add all these products, in order to estimate the grand total of all the observations in the dataset.
Step 4: Divide the estimated grand total by the total frequency.

Formulas: $N = \sum F$ and $\bar{X} = \sum FX / N$, where X denotes the midpoint of each class and F denotes the class frequency.

Example 3.5.1: Estimate the mean for the data in Example 1.11.2 on roller bearings.

Diameters of a set of roller bearings

Diameter (nearest mm)   Number of bearings (F)   Class midpoint (X)         FX
10 - 19                            7                   14.5              101.5
20 - 29                           16                   24.5              392.0
30 - 39                           30                   34.5            1,035.0
40 - 49                           14                   44.5              623.0
50 - 59                            8                   54.5              436.0
60 - 69                            3                   64.5              193.5
70 - 79                            2                   74.5              149.0
Total                             80                                   2,930.0

$\bar{X} = \sum FX / N$ = 2,930.0/80 = 36.6 mm

3.6 Properties of the arithmetic mean

(a) It is affected by outliers, or extreme observations, i.e. values that are very different from all the other values in the dataset.

Example 3.6.1: In Example 3.4.1, if the value 34 (which is an outlier) is removed from the distribution, the mean becomes 76/10 = 7.6, which is much less than the original mean of 10.

(b) It uses all the values in the dataset.

(c) It has a convenient mathematical formula.

(d) It presents a problem if the distribution is open-ended, i.e. if it has a class such as “Less than 20” or “Over 500”. One cannot determine the midpoint of such a class, so one has to make an arbitrary judgement about what value to use as an estimate of the class midpoint.
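The frequency-weighted mean and the outlier effect described above can be reproduced with a short Python sketch. The helper name `frequency_mean` is an assumption made for illustration; the figures are those of the roller bearings table in Example 3.5.1 and of the ungrouped distribution containing the outlier 34 that Example 3.6.1 discusses.

```python
def frequency_mean(values, freqs):
    """Mean of a frequency distribution: sum(F * X) / sum(F)."""
    n = sum(freqs)
    if n <= 0:
        raise ValueError("total frequency must be positive")
    return sum(f * x for x, f in zip(values, freqs)) / n


# Example 3.5.1: stated class limits and frequencies for the roller bearings.
classes = [(10, 19), (20, 29), (30, 39), (40, 49), (50, 59), (60, 69), (70, 79)]
bearing_freqs = [7, 16, 30, 14, 8, 3, 2]

# Step 1: the class midpoint stands in for every observation in the class.
midpoints = [(low + high) / 2 for low, high in classes]

# Steps 2 to 4: weight each midpoint by its frequency and divide by N.
grouped_mean = frequency_mean(midpoints, bearing_freqs)
print(round(grouped_mean, 1))  # 36.6

# Example 3.6.1: the ungrouped distribution with and without the outlier 34.
x = [6, 7, 8, 9, 34]
f = [1, 4, 3, 2, 1]
mean_with_outlier = frequency_mean(x, f)
mean_without_outlier = frequency_mean(x[:-1], f[:-1])
print(mean_with_outlier)     # 10.0
print(mean_without_outlier)  # 7.6
```

The same function serves both the ungrouped and the grouped case; only the meaning of X changes, from the exact value in each class to the class midpoint.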
3.7 Definition of the median

The median of a dataset is the value of the variable that divides the observations into two groups of equal size: the half whose values are less than the median and the half whose values are greater than the median. It is, of course, equal to the second quartile and the 50th percentile.

For a frequency distribution, the median class is the class that contains the median.

3.8 Calculating the median

For raw data, the procedure for calculating the median is as follows.

Step 1: Arrange the observations in order, from lowest to highest.
Step 2: If N is odd, find the middle observation; if N is even, find the middle two observations and average them.

For frequency distributions, it is necessary to determine which class is the median class. From there, one can determine or estimate the median.

For an ungrouped frequency distribution, the procedure for calculating the median is as follows.

Step 1: Construct the cumulative frequency distribution.
Step 2: Determine the median class, which is the class containing the observation whose rank is N/2 (rounded up to a whole number if necessary).
Step 3: Determine the median, which is the common value of all the observations in the median class.

For a grouped frequency distribution, the procedure for estimating the median is as follows.

Step 1: Construct the cumulative frequency distribution.
Step 2: Determine the median class, which is the class containing the observation whose rank is N/2 (rounded up to a whole number if necessary).
Step 3: Estimate the median by using the following formula.

Formula: Me = LL + W(N/2 - CB)/F
where
LL = lower real limit of median class
W = width of median class
N = total number of observations
CB = cumulated frequency below median class
F = frequency of median class

Example 3.8.1: Estimate the median for the data in Example 1.11.2 on roller bearings.
Diameters of a set of roller bearings

Diameter (nearest mm)   Number of bearings (F)   Cumulative frequency (<)
10 - 19                            7                        7
20 - 29                           16                       23
30 - 39                           30                       53
40 - 49                           14                       67
50 - 59                            8                       75
60 - 69                            3                       78
70 - 79                            2                       80
Total                             80

Median class is 30 - 39 (because it contains the 24th to 53rd observations and thus includes the 40th and 41st).

LL = 29.5
W = 10
N/2 = 40
CB = 23
F = 30

Me = 29.5 + 10(40 - 23)/30 = 35.2 mm

3.9 Properties of the median

(a) It is not affected by the sizes of the observations, especially extreme values. For example, Section 3.6 shows that the mean is affected by an extreme value, but the median for that dataset is 8 (the sixth value) regardless of whether the highest value in the dataset is 34, 10 or 500.

(b) It is readily used with frequency distributions having open-ended classes. Again, the calculation of the median is not affected at all by the sizes of the values in the first or last classes.

(c) Its formula is not very convenient mathematically.

(d) It can be used for data measured on an ordinal type of scale (where the mean cannot be used). For example, the answers to an opinion-type question in a survey might be:

Answer               No. of respondents   Cumulative frequency
Strongly agree               14                    14
Agree                        38                    52
Undecided                     6                    58
Disagree                     18                    76
Strongly disagree            11                    87
Total                        87

The median value is the 44th, which is “Agree”, but one cannot calculate an arithmetic mean for this data, of course.

3.10 Definition of the mode

The mode of a dataset is the observed value that occurs most frequently, i.e. it is the value of the variable that has the highest frequency.

For a frequency distribution, the modal class is the class that has the highest frequency.

It is sometimes observed that a distribution has two modal classes, i.e. two classes share the highest frequency. Such a distribution is called bi-modal.

3.11 Calculating the mode

If given raw data, one must first construct an ungrouped frequency distribution.
Thereafter, the procedure for determining the mode is as follows.

Step 1: Determine the modal class.
Step 2: Determine the mode, which is the common value of all the observations in the modal class.

For a grouped frequency distribution, the procedure for estimating the mode is as follows.

Step 1: Determine the modal class.
Step 2: Estimate the mode by using the following formula.

Formula: Mo = LL + Wd1/(d1 + d2)
where
LL = lower real limit of modal class
W = width of modal class
d1 = absolute difference between modal class frequency and frequency of the immediately preceding class
d2 = absolute difference between modal class frequency and frequency of the immediately following class

Example 3.11.1: Estimate the mode for the data in Example 1.11.2 on roller bearings.

Diameters of a set of roller bearings

Diameter (nearest mm)   Number of bearings (F)
10 - 19                            7
20 - 29                           16
30 - 39                           30
40 - 49                           14
50 - 59                            8
60 - 69                            3
70 - 79                            2
Total                             80

Modal class is 30 - 39.

LL = 29.5
W = 10
d1 = 30 - 16 = 14
d2 = 30 - 14 = 16

Mo = 29.5 + 10*14/(14 + 16) = 34.2 mm

3.12 Properties of the mode

(a) It is not affected by the sizes of the observations, especially extreme values.

(b) It is readily used with frequency distributions that have open-ended classes.

(c) It is not always “central”, e.g. for a skewed distribution or a bi-modal distribution.

(d) It may be affected by the choice of class width when the distribution was constructed.

(e) Its formula is not very convenient mathematically.

(f) It can be used for data measured on a nominal type of scale (where the mean and median cannot be used). For example, the modal value of the following distribution, which relates to respondents’ colour preferences, is “green”:

Colour    Number of respondents
Blue               37
Green              44
Purple             14
Red                25
Yellow             11
Total             131

3.13 Comparison of measures of central tendency

If a distribution is symmetric and uni-modal, the three measures we have considered all have the same value.
If the distribution is skewed, they are generally not equal. In a positively skewed distribution, they usually occur in the order mode, median, mean; in a negatively skewed distribution, they usually occur in the reverse order.

Tutorial exercises

1. The following data show the purchase price, in kina (X), of each vehicle in a company’s fleet:

1,700; 2,000; 3,000; 3,000; 8,100; 1,500; 2,000; 2,800; 3,000; 3,700; 1,700; 2,000; 6,500; 2,900; 3,000; 4,200.

(a) Define a variable Y = X/100 and calculate the mean, median and mode of Y.
(b) Use the answers in (a) to calculate the mean, median and mode of X.

2. The following table relates to readings of atmospheric pressure taken on 50 days by a weather office.

Pressure (millibars)   Number of days
986 to 990                    3
991 to 995                    5
996 to 1,000                 10
1,001 to 1,005               14
1,006 to 1,010                9
1,011 to 1,015                6
1,016 to 1,020                3

(a) If X denotes the pressure readings, estimate the mean, median and mode of X.
(b) Define another variable, Y, by subtracting 1,000 from every X-value, i.e. Y = X - 1,000 for every observation. Draw up the frequency distribution for Y.
(c) Estimate the mean, median and mode of Y.
(d) Compare the measures calculated in (c) with the corresponding measures calculated (for X) in (a).
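For checking hand calculations of grouped-data estimates, the median and mode formulas of Sections 3.8 and 3.11 can be sketched in Python as below. The function names are hypothetical. The sketch assumes that all classes have the same width and that a missing neighbour of the modal class counts as a frequency of 0. Only the lower real limit 29.5 is stated in the text; the other lower real limits are inferred on the same convention. The data are the roller bearings frequencies used in Examples 3.8.1 and 3.11.1.

```python
def grouped_median(lower_limits, width, freqs):
    """Estimate the median of a grouped distribution: Me = LL + W(N/2 - CB)/F."""
    n = sum(freqs)
    if n <= 0:
        raise ValueError("total frequency must be positive")
    cum_below = 0  # CB: cumulated frequency below the current class
    for ll, f in zip(lower_limits, freqs):
        if cum_below + f >= n / 2:  # this class contains the observation of rank N/2
            return ll + width * (n / 2 - cum_below) / f
        cum_below += f


def grouped_mode(lower_limits, width, freqs):
    """Estimate the mode of a grouped distribution: Mo = LL + W*d1/(d1 + d2)."""
    i = freqs.index(max(freqs))  # first class with the highest frequency
    # Assumption: a missing neighbouring class is treated as frequency 0.
    before = freqs[i - 1] if i > 0 else 0
    after = freqs[i + 1] if i + 1 < len(freqs) else 0
    d1 = abs(freqs[i] - before)
    d2 = abs(freqs[i] - after)
    if d1 + d2 == 0:
        raise ValueError("modal class has the same frequency as both neighbours")
    return lower_limits[i] + width * d1 / (d1 + d2)


# Roller bearings: lower real limits (inferred from LL = 29.5) and frequencies.
lower_real_limits = [9.5, 19.5, 29.5, 39.5, 49.5, 59.5, 69.5]
freqs = [7, 16, 30, 14, 8, 3, 2]

median_estimate = grouped_median(lower_real_limits, 10, freqs)
mode_estimate = grouped_mode(lower_real_limits, 10, freqs)
print(round(median_estimate, 1))  # 35.2
print(round(mode_estimate, 1))    # 34.2
```

Both results agree with the worked examples (35.2 mm and 34.2 mm). For a bi-modal distribution this sketch reports only the first modal class, so such cases still need to be inspected by hand.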