Unit 3 Measures for Describing Data

3.0 Introduction

Statistics is an area of science concerned with the extraction of information from numerical data. Individual values, when taken together in their entirety, form a distribution or a population. Summary statistics are ways of characterising that distribution: saying whether the values are very similar, whether there are some exceptionally large or small values, what a typical value is like, and so on. In this unit, we discuss various statistics that are used to describe the distributions from which data are obtained.

3.1 Objectives

By the end of this unit, you should be able to:
- calculate the mean, median and mode for discrete and continuous data
- discuss the advantages and disadvantages of each of the measures of central tendency
- estimate quartiles and percentiles for given data sets
- use a box plot to summarise a given data set
- calculate the range, inter-quartile range, variance and standard deviation for given samples of discrete and continuous data
- calculate the coefficient of variation for given data sets

3.2 Measures of Central Tendency

Measures of central tendency are also called measures of central location, averages or averages of the first order. In this module, we use the terms "measure of central tendency" and "average" interchangeably to mean the representative value around which the values of the variable cluster or concentrate. The three averages are the arithmetic mean (commonly known as the mean), the median and the mode.

3.2.1 The arithmetic mean

In everyday speech the arithmetic mean is called "the average", as if it were the only average, probably because it is the most commonly used one.

The mean of a small set of discrete data

The mean of a set of $n$ measurements $x_1, x_2, x_3, \ldots, x_n$ is equal to the sum of the measurements divided by $n$. Denoting the mean by $\bar{x}$, we have:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \qquad [3.1]$$

where $\sum$ is the summation sign.
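As a quick check of formula 3.1, the short Python sketch below computes the mean of a few made-up measurements and compares the result with the standard library's `statistics.mean`. The values and the function name `arithmetic_mean` are illustrative assumptions, not data from this unit.

```python
from statistics import mean


def arithmetic_mean(values):
    """Return the sum of the values divided by their count (formula 3.1)."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)


# Hypothetical measurements, chosen only for illustration.
measurements = [2, 4, 9]

x_bar = arithmetic_mean(measurements)
print(x_bar)  # 5.0

# The hand-written function agrees with the standard library.
assert x_bar == mean(measurements)
```

The same function can be applied to any list of raw observations, so it can be used to check hand calculations in the examples and activities.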
Example 3.1

The following data set is a record of the amount of money (in $) spent by a sample of 8 customers on groceries in a shop on a particular Saturday. The figures were rounded off to whole numbers.

13, 8, 21, 4, 23, 16, 11, 15

Calculate the mean amount spent by customers on groceries.

Solution 3.1

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{111}{8} = 13.875$$

Based on the sample, a customer spent on average about $14 on groceries in the shop on that particular Saturday.

The mean of discrete frequency data

When a data set is large, with some observations appearing several times each, the mean is found by multiplying each observed value by its frequency, adding up the products and then dividing the sum by the total number of observations. The mean for discrete frequency data is obtained using formula 3.2:

$$\bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{n} \qquad [3.2]$$

where $n = \sum_{i=1}^{k} f_i$. Note that $n$ is the total frequency, $k$ is the number of categories of observations and $f_i$ is the frequency of category $i$.

Example 3.2

In Example 3.1, suppose there are: 5 customers who spent $13 each; 3 customers who spent $8 each; 1 customer who spent $21; 6 customers who spent $4 each; 2 customers who spent $23 each; 3 customers who spent $16 each; 4 customers who spent $11 each; and 7 customers who spent $15 each. Find the mean amount spent in the shop.

Solution 3.2

This is an example of discrete frequency data. We usually record such data in the form of a frequency table, as shown below.

| x | f |
|---|---|
| 13 | 5 |
| 8 | 3 |
| 21 | 1 |
| 4 | 6 |
| 23 | 2 |
| 16 | 3 |
| 11 | 4 |
| 15 | 7 |

There are a total of 31 individual observations. We multiply each observed value by its corresponding frequency, add up the products and then divide by 31.

$$\bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{n} = \frac{377}{31} = 12.1613$$

Thus, on average, the customers spent $12.16 each.

Activity 3.1

1. A sample of ten university students had the following weekly expenditure, in dollars.

23, 15, 18, 35, 24, 45, 35, 28, 40, 32

Calculate the mean weekly expenditure for a student.

2. There were 5 categories of cash prizes in a road-show promotion of a product.
The following frequency distribution table shows the number of people who won the various categories of prizes.

| Prize ($) | No. of winners |
|---|---|
| 25 | 20 |
| 40 | 12 |
| 60 | 5 |
| 75 | 3 |
| 120 | 1 |

Calculate the mean cash prize won at the road-show.

The mean of grouped continuous data

Continuous variables such as mass, height and distance can take any value within a range, so individual values are not clear-cut. Large volumes of such data are usually presented in the form of grouped frequency tables. Once data are presented in this form, some information is lost because it is no longer possible to retrieve the original raw data, so the mean can only be estimated. Suppose that data are grouped into $k$ classes with frequencies $f_1, f_2, \ldots, f_k$. Let the classes have the midpoints $x_1, x_2, \ldots, x_k$ respectively. Then the mean is estimated by:

$$\bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i} \qquad [3.3]$$

where $x_i$ is the class midpoint or class mark. The class midpoint represents all the values falling in the particular class. It is obtained by adding the lower and upper class boundaries and dividing the result by 2.

Example 3.3

The following data show monthly salaries (in dollars) of 50 employees of a non-governmental organisation.

| Salary ($00) | No. of employees |
|---|---|
| 0 to less than 10 | 10 |
| 10 to less than 20 | 23 |
| 20 to less than 30 | 12 |
| 30 to less than 40 | 3 |
| 40 to less than 50 | 2 |

Calculate the mean monthly salary of the employees.

Solution 3.3

| Salary ($00) | Number of employees, $f_i$ | Midpoint, $x_i$ | $f_i x_i$ |
|---|---|---|---|
| 0 - 10 | 10 | 5 | 50 |
| 10 - 20 | 23 | 15 | 345 |
| 20 - 30 | 12 | 25 | 300 |
| 30 - 40 | 3 | 35 | 105 |
| 40 - 50 | 2 | 45 | 90 |
| Total | $\sum f_i = 50$ | | $\sum f_i x_i = 890$ |

$$\bar{x} = \frac{\sum f_i x_i}{\sum f_i} = \frac{890}{50} = 17.8$$

The mean monthly salary is $1 780.00.

Example 3.4

An organisation recorded monthly medical expenses incurred by families of 30 randomly selected employees.

| Amount ($00) | Number of employees |
|---|---|
| 1 - 10 | 3 |
| 11 - 20 | 7 |
| 21 - 30 | 11 |
| 31 - 40 | 5 |
| 41 - 50 | 4 |

Calculate the mean monthly expenditure per family.

Solution 3.4

Note that there are some gaps in between the classes.
Amounts spent may assume values between, say, 10 and 11, so a continuity correction must be applied to the class limits to obtain the real limits (class boundaries). The gaps are one unit each, so we obtain the real limits by adding 0.5 to the upper class limits and subtracting 0.5 from the lower class limits.

| Amount ($00) | Frequency, $f_i$ | Midpoint, $x_i$ | $f_i x_i$ |
|---|---|---|---|
| 0.5 - 10.5 | 3 | 5.5 | 16.5 |
| 10.5 - 20.5 | 7 | 15.5 | 108.5 |
| 20.5 - 30.5 | 11 | 25.5 | 280.5 |
| 30.5 - 40.5 | 5 | 35.5 | 177.5 |
| 40.5 - 50.5 | 4 | 45.5 | 182 |
| Total | $\sum f_i = 30$ | | $\sum f_i x_i = 765$ |

$$\bar{x} = \frac{\sum f_i x_i}{\sum f_i} = \frac{765}{30} = 25.5$$

The mean monthly medical expense per family was $2 550.00.

Activity 3.2

1. The data show the mass, in kg, of a sample of people who had applied to train as horse-riders.

| Mass (kg) | Number of applicants |
|---|---|
| 15 - 20 | 9 |
| 21 - 25 | 5 |
| 26 - 30 | 3 |
| 31 - 35 | 4 |
| 36 - 40 | 2 |

Calculate the sample mean.

2. The heights of the applicants were recorded as shown in the table below.

| Height (cm) | Number of applicants |
|---|---|
| 130 - 135 | 7 |
| 136 - 140 | 9 |
| 141 - 145 | 4 |
| 146 - 160 | 1 |

Calculate the mean height of the applicants.

3.2.2 The median

The median is a positional average. It is the value such that half the observations in the data set are larger than it and half are smaller than it. The median is the central value after the observations are ranked according to size.

Median for small set of discrete data

Let the ordered values of a data set be $y_1, y_2, y_3, \ldots, y_n$, where $n$ is the number of observations. The median is given by:

$$\text{Median} = y_{\frac{n+1}{2}} \quad \text{if } n \text{ is odd} \qquad [3.4]$$

$$\text{Median} = \frac{1}{2}\left(y_{\frac{n}{2}} + y_{\frac{n+2}{2}}\right) \quad \text{if } n \text{ is even} \qquad [3.5]$$

Example 3.5

The data set shows scores for university students who wrote a management course.

36, 67, 41, 52, 73, 61, 58, 76, 33, 48, 68

Find the median score.

Solution 3.5

Rearranging in order of size:

y: 33 36 41 48 52 58 61 67 68 73 76

Since n = 11 is odd, the median corresponds to

$$y_{\frac{n+1}{2}} = y_{\frac{11+1}{2}} = y_{\frac{12}{2}} = y_6 = 58$$

Example 3.6

Suppose in Example 3.5, the student who obtained a score of 68 was disqualified for some reason. Find the median of the remaining scores.
Solution 3.6

Since n = 10 is even, the median is given by

$$\frac{1}{2}\left(y_{\frac{n}{2}} + y_{\frac{n+2}{2}}\right) = \frac{1}{2}(y_5 + y_6) = \frac{1}{2}(52 + 58) = 55$$

Activity 3.3

1. The PXP bank issued loans to farmers towards the rainy season. The loans, in thousand dollars, are listed below.

7.3, 5.4, 14.2, 9.1, 15.0, 8.6, 24.5, 3.7, 6.3, 16.4, 12.5, 18.2

Find the median of the data set.

2. An omnibus operator expects the bus crew to cash in $100 every day. For various reasons, the crew may sometimes fail to meet the target. On 11 randomly selected days, the following amounts were cashed in:

84, 93, 75, 88, 55, 69, 96, 100, 74, 80, 58

Find the median for the cash remittances.

Median for discrete frequency data

We proceed by giving an example to demonstrate how to locate the median of discrete frequency data. This approach avoids the cumbersome task of listing a large number of values in the data set.

Example 3.7

In a survey to assess attitude to a new product, a random sample of 41 potential customers was obtained. A 60-point rating scale was used to measure the potential to become loyal to the product. The frequency table shows the distribution of scores that were obtained in the survey.

| x | Frequency, f |
|---|---|
| 18 | 4 |
| 24 | 7 |
| 27 | 5 |
| 35 | 12 |
| 42 | 7 |
| 53 | 6 |

Find the median score.

Solution 3.7

$$\text{Rank for median} = \frac{n+1}{2} = \frac{41+1}{2} = 21$$

The median has a rank of 21, that is, it is the 21st in the set of ascending values. We then construct a cumulative frequency table to help us identify the median value.

| x | Cumulative frequency |
|---|---|
| 18 | 4 |
| 24 | 11 |
| 27 | 16 |
| 35 | 28 |
| 42 | 35 |
| 53 | 41 |

We note from the table that there are 16 values that are below or equal to 27, that is, the 16th value is 27. Similarly, the 28th value is 35. The values from the 17th to the 28th are therefore all 35, and this run includes the 21st value. Thus the 21st value is 35, the median.
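The cumulative-frequency search used in Solution 3.7 can be sketched in Python as follows. This is a minimal illustration, assuming the values are supplied in ascending order; the function name `median_from_frequencies` is hypothetical, and the data are those of Example 3.7.

```python
from itertools import accumulate


def median_from_frequencies(values, freqs):
    """Median of discrete frequency data; values must be in ascending order."""
    cumulative = list(accumulate(freqs))  # running totals of the frequencies
    n = cumulative[-1]                    # total frequency

    def value_at(rank):
        # Return the first value whose cumulative frequency reaches the rank.
        for value, cum in zip(values, cumulative):
            if cum >= rank:
                return value
        raise ValueError("rank exceeds the total frequency")

    if n % 2 == 1:
        # Odd n: the median is the observation of rank (n + 1) / 2.
        return value_at((n + 1) // 2)
    # Even n: average the two middle observations (formula 3.5).
    return (value_at(n // 2) + value_at(n // 2 + 1)) / 2


scores = [18, 24, 27, 35, 42, 53]
frequencies = [4, 7, 5, 12, 7, 6]
print(median_from_frequencies(scores, frequencies))  # 35
```

Because the search works on the frequency table directly, the 41 individual scores never have to be listed.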
Activity 3.4

During the first quarter of the year, EST Department Stores conducted promotions for various products in which prizes worth various amounts of dollars were won. The frequency table shows the distribution of prizes that were won.

| Prize worth ($) | No. of prizes |
|---|---|
| 20 | 83 |
| 35 | 26 |
| 60 | 48 |
| 75 | 52 |
| 120 | 15 |

Find the median value of the prizes.

Median for grouped continuous data

To find the median, you start by identifying the median class. The median class is the class that contains the $\left(\frac{n}{2}\right)$th observation, where $n$ is the total frequency. This class can easily be identified using the "less than" cumulative frequencies of the data. The median is given by:

$$\text{Median} = L_m + \frac{C_m\left(\frac{n}{2} - F_{m-1}\right)}{f_m} \qquad [3.6]$$

where $L_m$ = lower class boundary of the median class, $f_m$ = frequency of the median class, $F_{m-1}$ = cumulative frequency up to (but excluding) the median class, $C_m$ = width of the median class, $n$ = total frequency and $m$ = subscript used to denote the median class.

Example 3.8

Calculate the median of the data in Example 3.4.

Solution 3.8

| Class interval | Class boundaries | Frequency, $f_i$ | Cumulative frequency, F |
|---|---|---|---|
| 1 - 10 | 0.5 - 10.5 | 3 | 3 |
| 11 - 20 | 10.5 - 20.5 | 7 | 10 |
| 21 - 30 | 20.5 - 30.5 | 11 | 21 |
| 31 - 40 | 30.5 - 40.5 | 5 | 26 |
| 41 - 50 | 40.5 - 50.5 | 4 | 30 |

The median class contains the $\left(\frac{30}{2}\right)$th observation, that is, the 15th observation. This is contained in the class 20.5 - 30.5. Therefore, $L_m = 20.5$, $C_m = 10$, $f_m = 11$ and $F_{m-1} = 10$. You now substitute these values into the formula.

$$\text{Median} = L_m + \frac{C_m\left(\frac{n}{2} - F_{m-1}\right)}{f_m} = 20.5 + \frac{10(15 - 10)}{11} = 20.5 + \frac{50}{11} = 25.04545455$$

The median is approximately $2 504.55.

Example 3.9

The table below shows the distribution of the sales ($) made by vendors at Mbare Musika one particular morning.

| Sales, x | Frequency, f |
|---|---|
| 40 - 60 | 8 |
| 61 - 80 | 5 |
| 81 - 100 | 15 |
| 101 - 150 | 9 |
| 151 - 200 | 13 |
| 201 - 250 | 6 |

Estimate the median of the sales.

Solution 3.9

Note that there are some gaps in between the classes. For instance, the lowest class ends at 60 and the next class starts from 61.
Sales may assume values between 60 and 61, so a continuity correction must be applied to the class limits to obtain the real limits.

| Sales ($) | Frequency, f | Cumulative frequency, F |
|---|---|---|
| 39.5 - 60.5 | 8 | 8 |
| 60.5 - 80.5 | 5 | 13 |
| 80.5 - 100.5 | 15 | 28 |
| 100.5 - 150.5 | 9 | 37 |
| 150.5 - 200.5 | 13 | 50 |
| 200.5 - 250.5 | 6 | 56 |

The median has a rank of $\frac{56}{2} = 28$, that is, it is the 28th value of the ordered data set. This is in the class 80.5 - 100.5, which is therefore the median class. The median is then interpolated using the formula:

$$\text{Median} = L_m + \frac{C_m\left(\frac{n}{2} - F_{m-1}\right)}{f_m} = 80.5 + \frac{20(28 - 13)}{15} = 80.5 + \frac{300}{15} = 100.5$$

The median is $100.50.

Activity 3.5

1. The frequency table shows the distribution of bank deposits, in thousand dollars, made by companies over the month of February.

| Deposits | No. of banks |
|---|---|
| 60 - 80 | 5 |
| 80 - 100 | 12 |
| 100 - 200 | 5 |
| 200 - 250 | 15 |
| 250 - 400 | 8 |

Estimate the median for the company deposits.

2. The distribution of investors by value of shares (thousand dollars) in Earthly Limited company is shown in the frequency table.

| Value | Investors |
|---|---|
| 2.5 - 4.5 | 12 |
| 5.0 - 7.5 | 9 |
| 8.0 - 12.5 | 24 |
| 13.0 - 18.5 | 8 |
| 19.0 - 25.5 | 5 |

Estimate the median share value.

3.2.3 The mode

The mode of a data set is the observation that appears most often. The mode represents what is most popular (the fashion) and is often used in business.

Example 3.10

A cross-border trader is deciding how many shoes to order for resale. She will be guided by shoe sales recorded by a colleague in the same business in order to determine what proportion of each size to order. The sales (shoe sizes) recorded by the colleague on her last visit were:

4, 7, 8, 8, 9, 3, 8, 8, 7, 9, 5, 6, 7, 5, 8

Determine the modal shoe size.

Solution 3.10

The shoe size which appears most often is 8. This is the shoe size with the highest demand, and the cross-border trader should order more size 8 shoes.

Mode for discrete frequency data

Example 3.11

Suppose in Example 3.10 the sales for the last 30 visits were as follows:

| Size | Frequency |
|---|---|
| 3 | 11 |
| 4 | 7 |
| 5 | 16 |
| 6 | 10 |
| 7 | 23 |
| 8 | 47 |
| 9 | 7 |
| 10 | 1 |

What was the modal shoe size?
Solution 3.11

Size 8 has the highest frequency of 47, hence it is the mode. This shoe size must constitute the biggest proportion of the new order.

Activity 3.6

Mukwe Lodge recorded the following number of bookings per week for accommodation in the first quarter of the year.

15, 23, 9, 15, 25, 17, 18, 12, 15, 26, 13, 21

Find the modal number of bookings.

The mode for grouped continuous data

The mode for discrete data can be found easily by inspection. However, when raw data are put into classes, it is difficult to tell exactly how many times each value occurs, but you can tell how many observations fall in each class. The class with the highest frequency is the modal class. The actual mode lies in the modal class and can be estimated by calculation or graphically using a histogram. The mode is calculated using the formula:

$$\text{Mode} = L_m + \frac{C_m (f_m - f_{m-1})}{2 f_m - f_{m-1} - f_{m+1}} \qquad [3.7]$$

where $L_m$ = lower class boundary of the modal class, $C_m$ = class width of the modal class, $f_m$ = frequency of the modal class, $f_{m-1}$ = frequency of the class one step below the modal class, $f_{m+1}$ = frequency of the class one step above the modal class and $m$ = subscript used to denote the modal class.

Example 3.12

Calculate the mode of the data in Example 3.4.

Solution 3.12

The modal class is 20.5 - 30.5.

$$\text{Mode} = L_m + \frac{C_m (f_m - f_{m-1})}{2 f_m - f_{m-1} - f_{m+1}} = 20.5 + \frac{10(11 - 7)}{2(11) - 7 - 5} = 20.5 + \frac{40}{10} = 24.5$$

The modal expense was $2 450.00.

Example 3.13

The frequency table shows the distribution of loans (in thousand dollars) that were issued to small businesses by PXP bank.

| Amount | Frequency |
|---|---|
| 2 - 8 | 15 |
| 9 - 15 | 8 |
| 16 - 20 | 23 |
| 21 - 35 | 14 |
| 36 - 40 | 5 |

Find the mode.

Solution 3.13

We need to present the table using real limits as shown below. It is important to note that successive classes have small gaps of 1 unit between them. To close these gaps we subtract half the distance (0.5) from the lower class limits, and add the same to the upper class limits.
| Amount | Frequency |
|---|---|
| 1.5 - 8.5 | 15 |
| 8.5 - 15.5 | 8 |
| 15.5 - 20.5 | 23 |
| 20.5 - 35.5 | 14 |
| 35.5 - 40.5 | 5 |

The modal class is 15.5 - 20.5 since it has the highest frequency.

$$\text{Mode} = L_m + \frac{C_m (f_m - f_{m-1})}{2 f_m - f_{m-1} - f_{m+1}} = 15.5 + \frac{5(23 - 8)}{2(23) - 8 - 14} = 15.5 + \frac{75}{24} = 18.625$$

The mode is $18 625.00.

Activity 3.7

The table shows monthly tobacco sales, in tons, made over the last 32 months at BW Tobacco Auction Floors.

| Sales in tons | No. of months |
|---|---|
| 5.5 - 9.5 | 6 |
| 10 - 15.5 | 3 |
| 16 - 20.5 | 12 |
| 21 - 24.5 | 8 |
| 25 - 28.5 | 3 |

Estimate the mode for the monthly sales of tobacco by the company.

3.2.4 Estimating the mode using a histogram

The use of a histogram to estimate the mode requires that the bars be of uniform width. The method is illustrated in Figure 3.1 using the data of Example 3.14.

Example 3.14

The monthly salaries earned by a sample of 20 salespersons employed in the motor insurance industry are:

| Salary ($) | Number of employees |
|---|---|
| 200 to less than 300 | 2 |
| 300 to less than 400 | 4 |
| 400 to less than 500 | 8 |
| 500 to less than 600 | 5 |
| 600 to less than 700 | 1 |

Estimate the mode using a histogram.

Solution 3.14

You start by identifying the modal class. The modal class is the one with the tallest bar. You then estimate the position of the mode within the modal class by drawing diagonals as shown in Figure 3.1. From the point where the diagonals intersect, you draw a straight vertical line downwards to meet the horizontal axis.

[Figure 3.1 Estimation of Mode: histogram of number of employees (vertical axis, 0 to 8) against salary in $ (horizontal axis, 200 to 700)]

The arrow indicates the position of the mode, which can be read off from the horizontal scale.

3.2.5 Choosing the appropriate average

The suitability of the mode, median or mean as an average for a given situation largely depends on the advantages and disadvantages of the particular measure.
Advantages of the median

Using the median in describing a distribution has advantages in that it:
- is easy to calculate;
- eliminates the effect of extreme values;
- is capable of further algebraic use in analysing other measures;
- can be estimated graphically using an ogive.

Disadvantages of the median

Using the median has disadvantages in that it:
- may not be representative of all the items as it ignores the extreme values;
- cannot be determined precisely when it falls between two middle values;
- has no use when items are weighted according to size;
- requires ranking of items, which may be laborious.

Advantages of the arithmetic mean

The mean has the following advantages:
- it is easy to calculate;
- it is based on all the observations;
- it has further algebraic use in calculating other measures;
- it is easily understood.

Disadvantages of the arithmetic mean

The mean has the following disadvantages:
- it is affected by extreme values (outliers), if any, in a data set;
- it does not give information on the composition of the data;
- it does not depict the entire picture of the data;
- it does not always represent the characteristics of individual items;
- it is usually not one of the observed values.

Advantages of the mode

The advantages of using the mode are that:
- it is easy to find;
- it is easy to understand;
- it is usually one of the observed values of the data set.

Disadvantages of the mode

The disadvantages of the mode are that:
- the mode may not exist;
- it may not be unique;
- its use in further statistical analysis is limited;
- it takes no account of any values other than the most frequent.

Activity 3.8

The planning department of a Building Society would like to estimate the average household size of workers at a particular company for which they are to develop a housing project. The Society gathered the following data pertaining to household sizes from a random sample of 20 workers at the company.
2, 1, 3, 6, 2, 5, 3, 5, 1, 7, 1, 2, 5, 3, 3, 4, 5, 5, 7, 15

(a) Find the mean, median and mode of the data.
(b) Which average is most suitable to estimate household size? Justify your answer by saying why the other two are not suitable.

3.3 Measures of Position

These measures provide the position of a value in an ordered set of data. The median, for instance, is a measure of position which divides the distribution into halves. Other commonly used measures of position are the lower quartile (Q1), the upper quartile (Q3), and the percentiles. The quartiles (Q1, Q2 and Q3) are positions which divide the entire distribution into four portions of equal frequency. The lower quartile (Q1) is the value below which lies 25% of the distribution. The median (Q2) has 50% of the distribution lying below it, while the upper quartile (Q3) is the value below which lies 75% of the distribution.

Rank for the quartiles

In finding or estimating the quartiles, the data are first arranged in ascending order. We then need to know the ranks of the quartiles, since these are used in making the estimates. Table 3.1 summarises the ranks of the quartiles for discrete and continuous data in which the number of observations is $n$.

Table 3.1 Rank for the Quartiles

| Quartile | Rank in discrete data | Rank in continuous data |
|---|---|---|
| Q1 | $\frac{n+1}{4}$ | $\frac{n}{4}$ |
| Q2 | $\frac{n+1}{2}$ | $\frac{n}{2}$ |
| Q3 | $\frac{3(n+1)}{4}$ | $\frac{3n}{4}$ |

3.3.1 Quartiles for discrete data

We demonstrate how the quartiles are estimated for given sets of discrete data, using the ranks.

Example 3.15

A phone-shop operator recorded the daily revenue she received, in dollars, over 14 days as shown below.

12, 18, 23, 27, 14, 17, 25, 43, 16, 37, 22, 28, 10, 36

(a) Estimate Q1 and Q3 from the data.
(b) Based on the calculated values, what is the probability that on a given day her revenue exceeds Q3?

Solution 3.15

(a) There are 14 observations, hence n = 14. We first arrange these values in ascending order to obtain the ordered data set below.
10, 12, 14, 16, 17, 18, 22, 23, 25, 27, 28, 36, 37, 43

The rank for Q1 is $\frac{14+1}{4} = 3.75$.

Thus the rank is 3.75, that is, 3 + 0.75. We therefore consider Q1 to be the third value plus 0.75 of the distance between this and the fourth value. Put mathematically, this is

$$Q_1 = 14 + 0.75(16 - 14) = 15.5$$

The rank for Q3 is $\frac{3(14+1)}{4} = 11.25$.

The upper quartile is, therefore, the 11th value plus 0.25 of the difference between this and the 12th value.

$$Q_3 = 28 + 0.25(36 - 28) = 30$$

(b) The upper quartile, Q3, has 0.25 of the distribution lying above it. The probability that her revenue exceeds $30 is 0.25.

Activity 3.9

Find the lower and upper quartiles for the following data sets.
(a) 15, 23, 9, 15, 25, 17, 18, 12, 15, 26, 13, 21
(b) 7.3, 5.4, 14.2, 9.1, 15.0, 8.6, 24.5, 3.7, 6.3, 16.4, 12.5, 18.2

Percentiles are found in a very similar way to quartiles. The 25th percentile and the 75th percentile are in fact Q1 and Q3 respectively.

3.3.2 Quartiles for continuous data

Quartiles can be obtained from grouped data in a similar way to the median. You begin by identifying the appropriate quartile class. The lower quartile class is the class that contains the $\left(\frac{n}{4}\right)$th observation, while the upper quartile class contains the $\left(\frac{3n}{4}\right)$th observation. The following computational formulae are then used to estimate the quartiles:

$$\text{Lower quartile, } Q_1 = L_q + \frac{C_q\left(\frac{n}{4} - F_{q-1}\right)}{f_q} \qquad [3.8]$$

$$\text{Upper quartile, } Q_3 = L_q + \frac{C_q\left(\frac{3n}{4} - F_{q-1}\right)}{f_q} \qquad [3.9]$$

where $L_q$ = lower class boundary of the quartile class, $C_q$ = class width of the quartile class, $f_q$ = frequency of the quartile class and $F_{q-1}$ = cumulative frequency up to (but excluding) the quartile class.

Example 3.16

Calculate the lower quartile, Q1, and the upper quartile, Q3, for the data of Example 3.4.

Solution 3.16

The lower quartile class contains the $\left(\frac{30}{4}\right)$th observation, that is, the 7.5th observation. The lower quartile class is therefore 10.5 - 20.5.
$$Q_1 = L_q + \frac{C_q\left(\frac{n}{4} - F_{q-1}\right)}{f_q} = 10.5 + \frac{10(7.5 - 3)}{7} = 10.5 + 6.428571429 = 16.9286$$

The upper quartile class contains the $\left(\frac{3n}{4}\right)$th observation, that is, the 22.5th observation. This class is 30.5 - 40.5.

$$Q_3 = L_q + \frac{C_q\left(\frac{3n}{4} - F_{q-1}\right)}{f_q} = 30.5 + \frac{10(22.5 - 21)}{5} = 30.5 + 3 = 33.5$$

Activity 3.10

Using the data of Example 3.13, calculate the lower and upper quartiles of monthly salaries of the employees.

3.4 Measures of Dispersion

Measures of dispersion give an indication of how widely scattered the observations are around their mean. When values in a sample or population are close to the mean, they exhibit less dispersion. The measures of dispersion we are going to look at are the range, inter-quartile range, semi inter-quartile range, variance and standard deviation.

3.4.1 Range

The range gives a simple indicator of the variability of a set of observations. The range of a set of observations is the difference between the largest observation and the smallest observation.

$$\text{Range} = \text{highest observed value} - \text{lowest observed value} \qquad [3.10]$$

Example 3.17

Find the range of the following data.

13 12 16 19 26 20 14 21 15 18 22 36

Solution 3.17

Range = highest observed value - lowest observed value
Range = 36 - 12 = 24

The range for grouped data is found by subtracting the real lower limit of the lowest class interval from the real upper limit of the highest class interval. Although it is very easy to use and understand, the range is not a reliable way of measuring the spread of data because it is based on only two observations, the highest and the lowest values. If one of these two values is an outlier, then the spread of the data is exaggerated. Moreover, it is not applicable where class intervals are open-ended.

Inter-quartile range

The inter-quartile range (IQR) is the range between quartiles.
More specifically, it is the difference between the upper quartile and the lower quartile, that is:

$$\text{IQR} = Q_3 - Q_1 \qquad [3.11]$$

In turn, the semi inter-quartile range (SIQR) is half the inter-quartile range and is obtained from the formula:

$$\text{SIQR} = \frac{Q_3 - Q_1}{2} \qquad [3.12]$$

The SIQR is limited in that, just like the range, it is based on selected observations in a distribution, so it cannot always detect dispersion in data. However, it is more resistant to extreme observations than the range.

Activity 3.11

Find the range, inter-quartile range and the semi inter-quartile range for the following data.

12 19 19 26 20 14 21 15 17 22 36 12 18 33 15 21 18 19 11

3.4.2 Variance and standard deviation of ungrouped data

The variance and standard deviation allow us to avoid the shortcomings of the range and inter-quartile range as measures of dispersion because they take into account all the observations in the data set as opposed to just a selected few. The variance of a set of data is the average squared deviation of the data points from their mean. Computationally, the variance of a sample of $n$ observations $x_1, x_2, \ldots, x_n$ is obtained by the formula:

$$s^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right] \qquad [3.13]$$

The formula for the population variance $\sigma^2$ is given by:

$$\sigma^2 = \frac{1}{N}\left[\sum_{i=1}^{N} x_i^2 - \frac{1}{N}\left(\sum_{i=1}^{N} x_i\right)^2\right] \qquad [3.14]$$

The standard deviation of a set of observations is the positive square root of the variance of the set. The variance is a squared quantity and its units, which are (units)², often have no practical meaning. For example, the variance of sales data in dollars is in (dollars)², which is practically meaningless. By taking the square root of the variance, we unsquare the units and get the standard deviation, which has the same units as the quantity being measured and is thus easier to interpret than the variance. When calculating the variance or standard deviation, you should verify whether the data relate to a population or a sample.
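The difference between the sample and population formulas is only the divisor, which the following Python sketch makes explicit. It is a minimal illustration under stated assumptions: the observations are made up, the function names are hypothetical, and the results are cross-checked against `statistics.variance` and `statistics.pvariance` from the standard library.

```python
from math import isclose, sqrt
from statistics import pvariance, variance


def corrected_sum_of_squares(data):
    """Sum of x_i^2 minus (sum of x_i)^2 / n, the bracketed term in 3.13 and 3.14."""
    n = len(data)
    return sum(x * x for x in data) - sum(data) ** 2 / n


def sample_variance(data):
    # Sample variance: divide by n - 1 (formula 3.13).
    return corrected_sum_of_squares(data) / (len(data) - 1)


def population_variance(data):
    # Population variance: divide by N (formula 3.14).
    return corrected_sum_of_squares(data) / len(data)


# Hypothetical observations, chosen only for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

s2 = sample_variance(data)
sigma2 = population_variance(data)
print(s2, sqrt(s2))          # sample variance and standard deviation
print(sigma2, sqrt(sigma2))  # population variance and standard deviation

# Both results agree with the standard library implementations.
assert isclose(s2, variance(data))
assert isclose(sigma2, pvariance(data))
```

For the same data the sample variance is always slightly larger than the population variance, because the same sum is divided by n - 1 rather than n.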
Example 3.18

The numbers of vehicles stopping to refuel at a service station on 20 randomly selected days are:

32 37 29 40 35 26 45 37 34 29 30 34 56 74 40 48 45 43 32 35

Find the variance and standard deviation of the data.

Solution 3.18

$$s^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right] = \frac{1}{19}\left[32821 - \frac{(781)^2}{20}\right] = 122.2605263$$

The standard deviation is then obtained by finding the positive square root of the variance.

$$s = \sqrt{122.2605263} = 11.0571$$

The variance is 122.2605 and the standard deviation is 11.0571.

Activity 3.12

The commissions (in dollars) earned by a sample of 15 ice cream vendors in one month were:

78 50 65 79 97 80 102 45 54 75 98 86 92 69 72 75 80

Find the variance and standard deviation of the data.

3.4.3 Variance and standard deviation of grouped data

Suppose that data were put into $k$ classes. Let $x_1, x_2, \ldots, x_k$ be the midpoints of the class intervals and $f_1, f_2, \ldots, f_k$ be the respective class frequencies. Then the population variance is given by:

$$\sigma^2 = \frac{1}{N}\left[\sum f_i x_i^2 - \frac{\left(\sum f_i x_i\right)^2}{N}\right] \qquad [3.15]$$

where $N = \sum_{i=1}^{k} f_i$ is the population size.

The sample variance is given by:

$$s^2 = \frac{1}{n-1}\left[\sum f_i x_i^2 - \frac{\left(\sum f_i x_i\right)^2}{n}\right] \qquad [3.16]$$

where $n = \sum_{i=1}^{k} f_i$ is the sample size.

The standard deviation is found by taking the square root of the variance.

Example 3.19

Calculate the variance and standard deviation of the following data.

| Class interval | Frequency |
|---|---|
| 2 - 9 | 2 |
| 10 - 17 | 6 |
| 18 - 25 | 12 |
| 26 - 33 | 5 |
| 34 - 41 | 3 |
| 42 - 49 | 2 |

Solution 3.19

| Class boundaries | Frequency, $f_i$ | Class midpoint, $x_i$ | $f_i x_i$ | $f_i x_i^2$ |
|---|---|---|---|---|
| 1.5 - 9.5 | 2 | 5.5 | 11 | 60.5 |
| 9.5 - 17.5 | 6 | 13.5 | 81 | 1 093.5 |
| 17.5 - 25.5 | 12 | 21.5 | 258 | 5 547 |
| 25.5 - 33.5 | 5 | 29.5 | 147.5 | 4 351.25 |
| 33.5 - 41.5 | 3 | 37.5 | 112.5 | 4 218.75 |
| 41.5 - 49.5 | 2 | 45.5 | 91 | 4 140.5 |
| Total | $\sum f_i = 30$ | | $\sum f_i x_i = 701$ | $\sum f_i x_i^2 = 19411.5$ |

$$s^2 = \frac{1}{n-1}\left[\sum f_i x_i^2 - \frac{\left(\sum f_i x_i\right)^2}{n}\right] = \frac{1}{29}\left[19411.5 - \frac{(701)^2}{30}\right] = \frac{1}{29}(19411.5 - 16380.03333) = 104.5333334 \approx 104.5333$$

$$s = \sqrt{104.5333334} = 10.22415441 \approx 10.2242$$

Example 3.20

Calculate the variance and standard deviation of the data in Example 3.4.
Solution 3.20

| Amount ($00) | Frequency, $f_i$ | Midpoint, $x_i$ | $f_i x_i$ | $f_i x_i^2$ |
|---|---|---|---|---|
| 0.5 - 10.5 | 3 | 5.5 | 16.5 | 90.75 |
| 10.5 - 20.5 | 7 | 15.5 | 108.5 | 1681.75 |
| 20.5 - 30.5 | 11 | 25.5 | 280.5 | 7152.75 |
| 30.5 - 40.5 | 5 | 35.5 | 177.5 | 6301.25 |
| 40.5 - 50.5 | 4 | 45.5 | 182 | 8281 |
| Total | $\sum f_i = 30$ | | $\sum f_i x_i = 765$ | $\sum f_i x_i^2 = 23507.5$ |

$$s^2 = \frac{1}{n-1}\left[\sum f_i x_i^2 - \frac{\left(\sum f_i x_i\right)^2}{n}\right] = \frac{1}{29}\left[23507.5 - \frac{(765)^2}{30}\right] = \frac{1}{29}(23507.5 - 19507.5) = \frac{1}{29}(4000) = 137.9310345$$

$$s = \sqrt{137.9310345} = 11.74440439$$

The standard deviation was $1 174.44.

Activity 3.13

The annual profits made by a random sample of 40 companies in the textiles industry are shown in the table below.

| Profit ($00) | Number of companies |
|---|---|
| 10 but less than 20 | 3 |
| 20 but less than 30 | 7 |
| 30 but less than 40 | 12 |
| 40 but less than 50 | 10 |
| 50 but less than 60 | 5 |
| 60 but less than 100 | 3 |

Calculate the:
i. Mean
ii. Median
iii. Mode
iv. Semi inter-quartile range
v. Variance
vi. Standard deviation

3.4.4 Coefficient of variation

The coefficient of variation is the standard deviation given as a percentage of the mean. It is calculated using the following formula:

$$\text{Coefficient of variation (CV)} = \frac{s}{\bar{x}} \times 100 \qquad [3.17]$$

The coefficient of variation is a relative measure and it is used to compare the variability of two or more distributions, especially where the units of measurement differ.

Example 3.21

A German-based firm would like to purchase stock in one of two companies (A and B) listed on the Zimbabwe Stock Exchange. The firm considered the monthly returns of the two companies over the last 10 months.

A: 34 42 36 38 45 40 32 34 39 41
B: 21 24 32 64 50 35 28 30 42 55

Compare the variability in returns between the two companies. In which company should the firm invest?

Solution 3.21

A: mean = 38.1, variance = 16.7667, standard deviation = $\sqrt{16.7667}$ = 4.0947

$$CV = \frac{4.0947}{38.1} \times 100 = 10.75\%$$

B: mean = 38.1, variance = 202.1, standard deviation = $\sqrt{202.1}$ = 14.2162

$$CV = \frac{14.2162}{38.1} \times 100 = 37.31\%$$

The returns of company B are more variable, and therefore riskier, than those of company A. The German-based firm should invest in company A.
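The figures in Solution 3.21 can be re-checked directly from the raw returns. The sketch below is a minimal illustration, assuming the sample standard deviation (divisor n - 1, as in formula 3.13) is the one intended; the function name `coefficient_of_variation` is an illustrative assumption, while `mean` and `stdev` come from Python's standard `statistics` module.

```python
from statistics import mean, stdev


def coefficient_of_variation(data):
    """Sample standard deviation expressed as a percentage of the mean (formula 3.17)."""
    return stdev(data) / mean(data) * 100


# Monthly returns from Example 3.21.
returns_a = [34, 42, 36, 38, 45, 40, 32, 34, 39, 41]
returns_b = [21, 24, 32, 64, 50, 35, 28, 30, 42, 55]

# Print the mean, sample standard deviation and CV (%) for each company.
for name, returns in (("A", returns_a), ("B", returns_b)):
    print(
        name,
        mean(returns),
        round(stdev(returns), 4),
        round(coefficient_of_variation(returns), 2),
    )
```

Since both companies have the same mean return, the company with the larger standard deviation also has the larger coefficient of variation; the CV becomes essential when the means or the units differ.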
Activity 3.14

Sekai and Sam stay in the same suburb and are employed by the same company in town. Sekai travels to work by bus and Sam cycles. The times (in minutes) taken by each to get to work on a sample of 10 days were:

Sekai: 35 26 41 38 36 48 37 30 35 24
Sam: 24 28 24 21 27 26 24 28 22 23

Calculate the coefficient of variation for each set of times. Whose travel time is more consistent? Justify your answer.

3.5 Coefficient of Skewness

Pearson's coefficient of skewness, denoted $Sk_p$, is a measure of the degree of departure from symmetry which is based on the difference between the mean and the median. It is calculated using the formula:

$$Sk_p = \frac{3(\text{mean} - \text{median})}{\text{standard deviation}} \qquad [3.18]$$

A symmetrical distribution has a coefficient of skewness which is equal to zero. A coefficient of skewness which is close to zero indicates only slight skewness. A positive coefficient of skewness shows that the data are positively skewed, whilst a negative coefficient means the data are negatively skewed.

Example 3.22

Calculate the coefficient of skewness for the data in Example 3.18.

Solution 3.22

You are now capable of finding the mean, median and standard deviation of ungrouped data. Show that mean = 39.05, median = 36 and standard deviation = 11.0571.

$$\text{Coefficient of skewness} = \frac{3(39.05 - 36)}{11.0571} = 0.8275$$

Since the coefficient of skewness is positive, the data are positively skewed.

3.6 The Box-and-Whisker Plot

A box-and-whisker plot is useful in comparing distributions. It highlights five summary measures of a distribution: the median, the lower quartile, the upper quartile, the smallest observation and the largest observation. The middle half of the values in a distribution is represented by a box which has the lower quartile at one end and the upper quartile at the other. The median is shown by a line inside the box.
Observations in the top and bottom quarters are represented by straight lines called whiskers which extend from each end of the box, one from the lower quartile to the smallest observation and the other from the upper quartile to the largest observation. Because of these features, a box plot makes it easier to determine the skewness, spread, central tendency and possible outliers of a distribution.

Example 3.23

Draw a box-and-whisker plot of the following sales data.

10 5 14 11 16 24 21 12 16 20 22 15 24 18 10 14 19 8 12 20

Solution 3.23

The smallest observation is 5 while the largest observation is 24. By now you should be able to show that the lower quartile is 11.25, the median is 15.5 and the upper quartile is 20.

[Figure 3.2 Box-and-Whisker Plot of Sales Data: vertical axis shows sales, from 5 to 20]

Note: The length of the box (showing the inter-quartile range) and that of the whiskers (showing the range) give an indication of the spread of the data.

3.7 Summary

We looked at three broad categories of measures for describing data, namely measures of central tendency, measures of position and measures of dispersion. The measures of central tendency locate the centre of the data; these are the mean, mode and median. Quartiles and percentiles, which are classified as measures of position, provide the position of a value in an ordered set of data. Measures of dispersion give an indication of how widely scattered the observations are around their mean. When values in a sample or population are close to the mean, they exhibit less dispersion. The measures of dispersion we looked at are the range, inter-quartile range, semi inter-quartile range, variance and standard deviation. We also considered the advantages and disadvantages of these measures for describing data.

Further Reading

Aczel, A.D. and Sounderpandian, J. (2005). Complete Business Statistics. India: Tata McGraw-Hill.
Buglear, J. (2005). Quantitative Methods for Business. London: Elsevier Butterworth Heinemann.
Kazmier, L.J. (2003).
Schaum’s Easy Outline: Business Statistics. Blacklick: McGraw-Hill Trade.
Kemp, S.M. and Kemp, S. (2004). Business Statistics Demystified. Blacklick: McGraw-Hill Professional Publishing.