Last Update: 16th March 2011

SESSION 19 & 20
Measures of Dispersion
Measures of Variability - Grouped Data -

Lecturer: Florian Boehlandt
University: University of Stellenbosch Business School
Domain: http://www.hedge-fundanalysis.net/pages/vega.php

Learning Objectives

All measures for grouped data:
1. Measures of relative standing: Median, Quartiles, Deciles and Percentiles
2. Measures of dispersion: Range
3. Measures of variability: Variance and Standard Deviation
4. Empirical Rule and Chebysheff's Theorem
5. Coefficient of Variation

Percentiles

We can determine any percentile for grouped data using the following formula:

$$\text{Pth percentile} = O_{LP} + C \cdot \frac{(n + 1)\,\frac{P}{100} - f(<)}{f_{LP}}$$

For quartiles, the formula 'simplifies' to:

$$Q_m = O_{LP} + C \cdot \frac{(n + 1)\,\frac{m}{4} - f(<)}{f_{LP}}$$

where m = 1, 2, 3 or 4 for the first, second, third and fourth quartile.

Calculation of Percentile

1. Calculate the less-than cumulative frequencies f(<) from the observed frequencies f.
2. Use the following formula to determine the location of the Pth percentile: Lp = (n + 1) * (P / 100)
3. Locate the interval Lp falls into.
4. Determine the following parameters:
   P      The percentile (e.g. 25 for the first quartile)
   n      Sample size
   OLP    The lower limit of the interval Lp falls into
   C      Class width
   f(<)   The cumulative frequency of the interval preceding the one Lp falls into
   fLP    The observed frequency of the interval Lp falls into
5. Apply the formula for the Pth percentile.

Percentile: An example

Let us assume the following grouped data is to be assessed:

Interval      f    f(<)
40 to 49      6      6
50 to 59     14     20
60 to 69     11     31
70 to 79      6     37
80 to 89      3     40

C = Upper + 1 – Lower = 49 + 1 – 40 = 10
n = 40

This example comes from your student manual. If the data is interval-scaled (student marks approximately are), inequalities in the intervals may be more appropriate.
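The percentile steps above can be sketched in Python. This is a minimal sketch, assuming the interpolation form lower limit + C * (Lp - f(<)) / fLP with Lp = (n + 1) * (P / 100) as in step 2. The function name `grouped_percentile` and its argument layout are illustrative and do not come from the student manual.

```python
from itertools import accumulate


def grouped_percentile(lower_limits, freqs, width, p):
    """Interpolated p-th percentile for grouped data (illustrative sketch)."""
    n = sum(freqs)
    # Step 1: less-than cumulative frequencies f(<)
    cum = list(accumulate(freqs))
    # Step 2: location of the p-th percentile, Lp = (n + 1) * (P / 100)
    loc = (n + 1) * p / 100
    if not 0 < loc <= n:
        raise ValueError("location falls outside the frequency table")
    # Step 3: first interval whose cumulative frequency reaches Lp
    i = next(k for k, c in enumerate(cum) if loc <= c)
    # Step 4: cumulative frequency of the preceding interval (0 for the first)
    prev_cum = cum[i - 1] if i > 0 else 0
    # Step 5: interpolate inside the interval
    return lower_limits[i] + width * (loc - prev_cum) / freqs[i]


# Marks table from the example: lower limits, observed frequencies, C = 10
lower_limits = [40, 50, 60, 70, 80]
freqs = [6, 14, 11, 6, 3]
q1 = grouped_percentile(lower_limits, freqs, width=10, p=25)
print(round(q1, 2))  # 53.04
```

Passing p = 50 or p = 75 gives the second and third quartile from the same table.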
The intervals below, written with inequalities, may be somewhat more intuitive:

Interval       f    f(<)
40 to < 50     6      6
50 to < 60    14     20
60 to < 70    11     31
70 to < 80     6     37
80 to < 90     3     40

C = Upper – Lower = 50 – 40 = 10
n = 40

Solution – Step 1

Interval       f    f(<)
40 to < 50     6      6
50 to < 60    14     20
60 to < 70    11     31
70 to < 80     6     37
80 to < 90     3     40

C = 10
n = 40
P = 25
Lp = (40 + 1) * (25 / 100) = 10.25

Use the location formula to determine which interval the 25th percentile (the first quartile) falls into. Since 6 < 10.25 < 20, the percentile interval is 50 to < 60. Beware that the interval is to be looked up in the cumulative frequency column, not the interval column!

Solution – Step 2

Read off the parameters required for the percentile formula for grouped data:

C      10
n      40
P      25
Lp     10.25
OLP    50
fLP    14
f(<)   6

The formula:

$$\text{Pth percentile} = O_{LP} + C \cdot \frac{(n + 1)\,\frac{P}{100} - f(<)}{f_{LP}}$$

Now yields:

$$\text{25th percentile} = 50 + 10 \cdot \frac{10.25 - 6}{14} \approx 53.04$$

It is left as an exercise to confirm that the formula for Q yields the same result.

Variance

Using the midpoints allows us to calculate the variance of grouped data as well. In the case of interval data, as with the mean, the original data is to be preferred to the grouped data.

For ordinal or nominal data the variance has no probabilistic meaning! Measures of relative standing (i.e. percentiles) may be used for ordinal data. There are no measures of variability for nominal data (Example: 1 = married, 2 = single, 3 = divorced, 4 = widowed).

Calculation of Variance

1. Determine the interval midpoints x.
2. Multiply the observed frequencies f by the interval midpoints (fx).
3. Sum the results from 2. and divide by n.
   (Steps 1 to 3 are identical to calculating the mean for grouped data.)
4. Square x and multiply by f, yielding fx².
5. Sum the results from 4.
6. Use the following formula to determine the variance for grouped data (sample):

$$s^2 = \frac{\sum f x^2 - n \bar{x}^2}{n - 1}$$

And for the population:

$$\sigma^2 = \frac{\sum f x^2 - N \mu^2}{N}$$

Note that x denotes the midpoints here and not the actual observations.
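The variance steps can be sketched in Python as follows. This is a minimal sketch, assuming the computational form (sum of fx² minus n times the squared mean, divided by n - 1 for a sample or by n for a population). The function name `grouped_variance` and the `sample` flag are illustrative. The midpoints and frequencies are those of the worked example that follows, so the output can be compared with the manual calculation.

```python
import math


def grouped_variance(midpoints, freqs, sample=True):
    """Variance of grouped data from interval midpoints (illustrative sketch)."""
    n = sum(freqs)
    # Steps 1 to 3: mean of the grouped data from the midpoints
    sum_fx = sum(f * x for f, x in zip(freqs, midpoints))
    mean = sum_fx / n
    # Steps 4 and 5: sum of f * x^2
    sum_fx2 = sum(f * x ** 2 for f, x in zip(freqs, midpoints))
    # Step 6: divide by n - 1 (sample) or n (population)
    denominator = n - 1 if sample else n
    return (sum_fx2 - n * mean ** 2) / denominator


midpoints = [44.5, 54.5, 64.5, 74.5, 84.5]
freqs = [6, 14, 11, 6, 3]
s2 = grouped_variance(midpoints, freqs)
print(round(s2, 2), round(math.sqrt(s2), 2))  # 131.03 11.45
```

Calling the function with `sample=False` applies the population formula instead.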
Variance: An example

Let us assume the following grouped data is to be assessed:

Interval       f
40 to < 49     6
50 to < 59    14
60 to < 69    11
70 to < 79     6
80 to < 89     3

C = 9
n = 40

Solution – Step 1

Interval       f      x        fx        x²          fx²
40 to < 49     6    44.5     267.0    1980.25    11881.50
50 to < 59    14    54.5     763.0    2970.25    41583.50
60 to < 69    11    64.5     709.5    4160.25    45762.75
70 to < 79     6    74.5     447.0    5550.25    33301.50
80 to < 89     3    84.5     253.5    7140.25    21420.75
Total         40            2440.0              153950.00
Average                       61.0

Mean = 61
n = 40
Sum of fx² = 153950

Solution – Step 2

Using the formula yields:

$$s^2 = \frac{153950 - 40 \cdot 61^2}{40 - 1} = \frac{5110}{39} \approx 131.03$$

As before, the square root yields the standard deviation: $s = \sqrt{131.03} \approx 11.45$.

Empirical Rule

[Figure: bell-shaped curve centred on the mean, with approx. 68.26% of observations between mean – 1s and mean + 1s and approx. 95.44% between mean – 2s and mean + 2s]

In normal, bell-shaped frequency distribution polygons, we find the following:
1. Approx. 68.2% of all observations fall within one standard deviation of the mean
2. Approx. 95.4% of all observations fall within two standard deviations of the mean
3. Approx. 99.7% of all observations fall within three standard deviations of the mean

Chebysheff's Theorem

Chebysheff's Theorem is a more general alternative to the empirical rule, which applies to all shapes of histograms. The proportion of observations that lie within k standard deviations of the mean is at least:

$$1 - \frac{1}{k^2} \quad \text{for } k > 1$$

where k denotes the number of standard deviations away from the mean.

Chebysheff's Theorem - Example

k        Formula        Chebysheff    Empirical
k = 1    not defined    n.a.          68.2%
k = 2    1 – 1/4        75%           95.4%
k = 3    1 – 1/9        88.9%         99.7%
k = 4    1 – 1/16       93.75%        n.a.

The Empirical Rule provides approximate proportions under the assumption of a bell-shaped normal distribution, whereas Chebysheff's Theorem provides lower bounds on these proportions for any type of distribution. Consequently, Chebysheff's bounds are more conservative: the same proportion of observations is only guaranteed over a wider range around the mean.

Chebysheff is not relevant to your examination!
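The comparison between the two rules can be reproduced with a short Python sketch. It assumes the bound 1 - 1/k² stated above. For the bell-shaped case it uses the standard identity that the proportion of a normal distribution within k standard deviations of the mean equals erf(k / sqrt(2)). The function names are illustrative.

```python
import math


def chebysheff_bound(k):
    """Lower bound on the proportion within k standard deviations (any distribution)."""
    if k <= 1:
        raise ValueError("the bound is only stated for k > 1")
    return 1 - 1 / k ** 2


def normal_proportion(k):
    """Proportion within k standard deviations for a normal distribution."""
    return math.erf(k / math.sqrt(2))


# Print both proportions in percent for k = 2, 3, 4
for k in (2, 3, 4):
    bound = 100 * chebysheff_bound(k)
    normal = 100 * normal_proportion(k)
    print(f"k = {k}: at least {bound:.2f}% (Chebysheff), approx. {normal:.2f}% (normal)")
```

For every k the normal proportion lies above the Chebysheff bound, as expected from a lower bound that holds for any distribution.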
Coefficient of Variation

The coefficient of variation of a set of observations is the standard deviation divided by their mean:

Sample:

$$cv = \frac{s}{\bar{x}}$$

Population:

$$CV = \frac{\sigma}{\mu}$$

By relating the standard deviation to the mean, one can make a statement about the relative variability of the data. Compare a standard deviation of 10 with a mean of 100 and with a mean of 1,000,000: the coefficient of variation is 10% in the first case and 0.001% in the second.
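The comparison can be checked with a few lines of Python. This is a minimal sketch; the function name `coefficient_of_variation` is illustrative.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation expressed as a fraction of the mean."""
    if mean == 0:
        raise ValueError("the coefficient of variation is undefined for a mean of 0")
    return std_dev / mean


# Same standard deviation, two very different means
for mean in (100, 1_000_000):
    cv = coefficient_of_variation(10, mean)
    print(f"mean = {mean}: CV = {cv:.3%}")
```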