The Sample Variance © Chistine Crisp Edited by Dr Mike Hughes The Sample Variance Can you find the medians and means for the following 3 data sets? Median Mean, Set A Set B Set C 1 1 1 2 1 5 3 1 5 4 4 5 5 5 5 6 6 5 7 9 5 8 9 5 9 9 9 5 5 5 x 5 5 5 Although the medians and means are the same, the data sets are not really alike. The spread or variability of the numbers is quite different. How can we measure the spread within the data sets? ANS: The range and inter-quartile range both measure spread but neither uses all the data items. Interquartile range we will do later with Cumulative Frequency The Sample Variance Median Mean, Set A Set B Set C 1 1 1 2 1 5 3 1 5 4 4 5 5 5 5 6 6 5 7 9 5 8 9 5 9 9 9 5 5 5 x 5 5 5 If you had to invent a method of measuring spread that used all the data items, what could you do? One thing we could do is find out how far each item is from the mean and add up these differences. e.g. Set A: x x x 1 2 3 4 4 3 2 1 ( x x) 5 0 6 1 7 2 8 3 9 4 x5 4 3 . . . + 3 + 4 = 0 Data sets B and C give the same result. The negative and positive values have cancelled each other out. The Sample Variance To avoid the effect of the negative values we can either • • ignore the negative signs, or square each difference ( since the squares will all be positive ). Squaring is more convenient for developing theory, so, e.g. Set A: x xx ( x x)2 1 2 3 4 4 3 2 1 16 9 4 1 5 0 0 6 1 1 7 2 4 2 ( x x ) 60 Let’s do this calculation for all 3 data sets: 8 3 9 9 4 16 The Sample Variance Mean, x Set A: x Set B: x Set C: x Set A: 1 1 1 2 ( x x ) 60 2 1 5 3 1 5 4 4 5 Set B: 5 5 5 6 6 5 7 9 5 2 ( x x ) 98 8 9 5 9 9 9 5 5 5 Set C: 2 ( x x ) 32 The larger value for set B shows greater variability. Set C has least variability. Can you see a snag with this measurement? ANS: The calculated value increases if we have more data, so comparing data sets with different numbers of items would not be possible. To allow for this, we need to take n, the number of items, into account. The Sample Variance There are 2 formulae that can be used, msd or s2 2 ( x x ) n 2 ( x x ) n1 the mean square deviation. the sample variance. Our data is nearly always a sample from a large unknown set of data ( the population ) and we take samples to find out about the population. The 1st formula does not give the best estimate of the variance of the population so is not used. The Sample Variance So, there are 2 quantities and their square roots that we need to be clear about 2 ( x x ) msd n and rmsd Also s2 and s 2 ( x x ) n 2 ( x x ) n1 2 ( x x ) n1 the mean square deviation, POPULATION VARIANCE the root mean square deviation. POPULATION STANDARD DEVIATION the sample variance, WE nearly ALWAYS use THESE TWO formula the sample standard deviation. The Sample Variance e.g. Find the rmsd and msd of the following data: x Mean, (i) 7 x x n 9 14 30 x 10 3 2 ( x x ) (7 10) 2 (9 10) 2 (14 10) 2 2 n 3 9 1 16 8 67 ( 3 s. f . ) 3 2 x 2 (ii) x 2 49 81 196 102 326 100 n 3 3 8 67 ( 3 s. f . ) The 2nd form is exactly the same as the first form but quicker to use !! The Sample Variance e.g. Find the sample SD and Variance of the following data: x Mean, (i) (ii) 7 x x n 9 14 30 x 10 3 2 2 2 2 ( x x ) ( 7 10 ) ( 9 10 ) ( 14 10 ) s2 n1 2 9 1 16 13.0 ( 3 s. f . ) 2 2 2 x nx 49 81 196 3 102 326 2 s 150 n1 2 2 13.0 ( 3 s. f . ) The 2nd form is in general quicker to use. The Sample Variance This all seems very complicated but help is at hand. Both the quantities, rmsd and s are given by your calculator. e.g. Find the root mean square deviation, rmsd, and the sample standard deviation, s, for the following data: x 7 9 14 Use the Statistics function on your calculator and enter the data. Select the list of calculations. You will be able to find the following: 2 94392 . . . and 3 60555 . . . The rmsd is smaller than s ( because we are dividing by a larger number ). Correct to 3 s.f. we have rmsd 2 ( x x ) n 2 94 , s 2 ( x x ) n1 3 61 The Sample Variance So, for the data x 7 9 14 we have rmsd 2 ( x x ) n 2 94 , 2 ( x x ) s n1 3 61 Squaring these gives msd ( x x) n 2 8 67, ( mean square deviation ) The part of the formula, s2 2 ( x x ) n1 13 ( sample variance ) 2 ( x x ) , is in your formulae sheet, labelled Sxx. (said as Sum of squares X X) An expanded form of the expression is also given. All you have to do is divide by the correct quantity. SUMMARY The Sample Variance The mean square deviation, msd, and sample variance, both measure the spread or variability in the data. If we have raw data we use the statistical functions on the calculator to find the rmsd or sample standard deviation. The sample standard deviation is the larger than the rmsd because we divide by (n-1) To find the msd or sample variance, we square the relevant quantity given by the calculator: sample variance s2 msd = (rmsd)2 Your formulae sheet will gives the formula or 2 equivalent: S xx ( x x ) x 2 2 x n Then, we divide by n for the msd or (n – 1) for s2. The Sample Variance Frequency Data The formula for the variance can be easily adapted to find the variance of frequency data. x 2 S xx ( x x ) x 2 2 Becomes for FREQUENCY DATA n xf 2 S xx ( xf x ) x f 2 2 f We usually only use the formulae if we are given summary data. With raw data we enter the data into the calculator and use the statistical functions to get the answers directly. The Sample Variance Frequency Data But note that ............... S xx ( x x ) x 2 becomes S xx ( xf x ) x 2 x 2 2 x nx 2 n xf f 2 2 f 2 x 2 f nx 2 The Sample Variance SO becomes MSD= SXX/n and VARIANCE = SXX/(n-1) 2 2 2 x n x x 2 msd x n n 2 2 2 2 x n x x n x Variance n1 n1 Frequency Data x msd 2 f nx n 2 x 2 2 x f n x Variance n1 n 2 f x 2 2 2 x f n x n1 The Sample Variance e.g.1 Find the mean and sample standard deviation of the following data: x Frequency, f 1 3 2 5 5 8 10 4 Solution: Using the calculator functions, the mean, m = 4 65 sample standard deviation, s 3 17 ( 3 s. f . ) Although we don’t need the formula for this question, let’s check we have the correct value by using the formula: The Sample Variance e.g.1 Find the mean and sample standard deviation of the following data: x Frequency, f 1 3 2 5 Solution: 5 8 10 4 xf 2 S xx x f 2 f 2 ( 1 3 . . . 10 4 ) So, S xx 1 2 3 . . . 10 2 4 20 190 55 190 55 S xx 2 10 029 s 3 17 ( 3 s. f . ) s 19 n1 The Sample Variance e.g.2 Find the sample standard deviation of the following lengths: Length (cm) 1-9 Frequency, f 2 10-14 15-19 20-29 7 12 9 The Sample Variance e.g.2 Find the sample standard deviation of the following lengths: Solution: We need the class mid-values Length (cm) 1-9 10-14 15-19 20-29 x 5 12 17 24·5 Frequency, f 2 7 12 9 xf 10 84 204 220.5 x2 25 144 289 600.25 x2f 50 1008 3468 5402.25 2 2 x f n x Sample Variance n1 n f 30 x xf f 17.283 9928.25 30(17.2832 ) 33.351 29 Standard deviation, s = 5 77 ( 3 s. f . ) The Sample Variance e.g.3 Find the mean and sample variance of 20 values of x given the following: x 82 and 2 x 370 Solution: Since we only have summary data, we must use the formulae sample mean, x x n S xx x 2 nx 2 82 x 41 20 S xx 370 20(4.12 ) 33 8 S xx sample variance, s 1 78 ( 3 s. f . ) n1 2 The Sample Variance SUMMARY Raw data msd x nx n 2 2 2 2 x n x Sample Variance n1 Frequency data 2 2 x f n x msd n 2 2 x f n x Sample var iance n1 MSD is called POPULATION VARIANCE Take square root for rmsd and sample standard deviation RMSD is called POPULATION STANDARD DEVIATION The Sample Variance Exercise Find the mean, sample standard deviation and sample variance for each of the following samples, using calculator functions where appropriate. 1. 2. x f 1 7 Time ( mins ) f 2 9 3 14 1-5 7 3. 10 observations where 4 12 6-10 9 5 8 11-15 16-20 21-25 14 12 8 x 432 and x 18912 2 The Sample Variance 1. x f 1 7 Answer: 2 9 mean, 3 14 4 12 5 8 x 31 standard deviation, s = 1 28 ( 3 s. f . ) 2 variance, s 64calculator value N.B. To find s we need to use the 1full for s, not the answer to 3 s.f. 2. Time ( mins ) 1-5 6-10 11-15 16-20 21-25 2 x 3 8 13 18 23 f 7 9 14 12 8 Answer: mean, x 13 5 standard deviation, s = 6 41 ( 3 s.f. ) 2 variance, s 41 1 ( 3 s. f . ) The Sample Variance 3. 10 observations where x 432 and Solution: x x mean, x x 43 2 n 2 S xx x 2 variance, n S xx s n1 2 x 18912 2 Standard deviation, s S xx (432) 2 18912 10 249 6 s 2 27 7 (3 s.f. ) 27 7 5 27 ( 3 s.f. ) The Sample Variance There are 2 formulae that can be used to measure spread: msd or s2 2 ( x x ) n 2 ( x x ) n1 the mean square deviation. the sample variance, In many books you will find the word variance used for the 1st of these formulae and you may have used it at GCSE. However, our data is nearly always a sample from a large unknown set of data ( the population ) and we take the sample to find out about the population. The 1st formula does not give the best estimate of the variance of the population so is not used. The Sample Variance So, there are 2 quantities and their square roots that we need to be clear about msd and rmsd Also s2 and s 2 ( x x ) n 2 ( x x ) n 2 ( x x ) n1 2 ( x x ) n1 the mean square deviation the root mean square deviation. the sample variance, the sample standard deviation. The Sample Variance e.g. Find the root mean square deviation, rmsd, and the sample standard deviation, s, for the following data: x 7 9 14 Use the Statistics function on your calculator and enter the data. Select the list of calculations. You will be able to find the following: 2 94392 . . . 3 60555 . . . Ignore the calculator notation. The rmsd is smaller than s ( because we are dividing by a larger number ). Correct to 3 s.f. we have rmsd 2 ( x x ) n 2 94 , s 2 ( x x ) n1 3 61 The Sample Variance Squaring these gives msd 2 ( x x ) n 8 67, ( mean square deviation ) s2 2 ( x x ) n1 13 ( variance ) Using the formulae: If summary data are given, you will need to use the formulae instead of the calculator functions. The part of the formula, 2 ( x x ) , is in your formulae booklet ( see correlation and regression ), labelled Sxx. An expanded form of the expression is also given. All you have to do is divide by the correct quantity, n or n 1. SUMMARY The Sample Variance The mean square deviation, msd, and sample variance, both measure the spread or variability in the data. If we have raw data we use the stats functions on the calculator to find the rmsd or sample standard deviation. The sample standard deviation is the larger of these quantities. To find the msd or sample variance, we square the relevant quantity given by the calculator: sample variance s2 msd = (rmsd)2 For summary data, we use the formulae book, choosing 2 the appropriate form: S xx ( x x ) x 2 2 x n Then, we divide by n for the msd or (n – 1) for s2. The Sample Variance e.g.1 For the following sample data, find (a) the root mean square deviation, rmsd, (b) the mean square deviation, msd, (c) the sample standard deviation, s, and (d) the sample variance s2. x 12 15 14 9 Answer: Using the calculator functions, (a) (c) rmsd 2 29 ( 3 s. f . ) (b) msd 5 25 s 2 65 ( 3 s. f . ) (d) s2 7 The Sample Variance e.g.2 Given the following summary of data for a sample of size 5, find (a) the mean square deviation, msd, (b) the root mean square deviation, rmsd , (c) the sample variance s2 (d) the sample standard deviation, s , and, n 5, 2 ( x x ) 24 Solution: Using the formulae book, S xx (a) (b) (c) (d) S xx 24 msd = 48 5 n rmsd = 4 8 2 19 ( 3 s . f . ) S xx 24 2 2 s s 6 n1 4 s 6 2 45 ( 3 s . f . ) 2 ( x x ) Frequency Data The Sample Variance The formula for the variance can be easily adapted to find the variance of frequency data. x 2 S xx ( x x ) x 2 2 n becomes xf 2 S xx ( xf x ) x f 2 2 f As before, we only use the formulae if we are given summary data. The Sample Variance e.g.1 Find the mean and sample standard deviation of the following data: x Frequency, f 1 3 2 5 5 8 Solution: 10 4 xf 2 S xx ( xf x ) x f 2 2 f 1 3 . . . 10 4 So, S xx 1 3 . . . 10 4 20 190 55 S xx 190 55 2 s 10 039 s 3 17 ( 3 s. f . ) n1 19 2 2 The Sample Variance e.g.2 Find the sample standard deviation of the following lengths: Length (cm) 1-9 Frequency, f 2 10-14 15-19 20-29 7 12 9 Solution: We need the class mid-values x 5 12 17 24·5 Frequency, f 2 7 12 9 We can now enter the values of x and f on our calculators. Standard deviation, s = 5 77 ( 3 s. f . ) The Sample Variance SUMMARY To find the root mean square deviation, rmsd, or the sample standard deviation, s, using the calculator functions, • • the values of x ( and f ) are entered and checked, the table of calculations gives both values, • the larger value is the sample standard deviation, s, and this is the value that is most often used by statisticians, • the variance is the square of the standard deviation. The Sample Variance Outliers We’ve already seen that an outlier is a data item that lies well away from the other data. It may be a genuine observation or an error in the data. e.g. 1 Consider the following data: 10 12 14 17 19 21 81 With this data set, we would immediately suspect an error. The value 81 was likely to have been 18. If so, there would be a large effect on the mean and standard deviation although the median would not be affected and there would be little effect on the IQR. The presence of possible outliers is an argument in favour of using median and IQR as measures of data. The Sample Variance In an earlier section, we met a method of identifying outliers using a measure of 1·5 IQR above or below the median. A 2nd method used to identify outliers is to find points that are further than 2 standard deviations from the mean. e.g. 2. Consider the following sample: 10 12 14 17 18 19 21 22 24 33 The sample mean and sample standard deviation are : mean, x 19 standard deviation, s = 6 62 ( 3 s. f . ) So, 2 s 13 2 and x 13 2 32 2 The point 33 is more than 2 standard deviations above the mean so, using this measure, it is an outlier.