Measures of Variability
Liang Zhang (UofU), Applied Statistics I, June 10, 2008

Consider three samples, each with sample mean 50:
Sample I:   30, 35, 40, 45, 50, 55, 60, 65, 70
Sample II:  30, 41, 48, 49, 50, 51, 52, 59, 70
Sample III: 41, 45, 48, 49, 50, 51, 52, 55, 59

Sample Range: the difference between the largest and the smallest sample values.
e.g. for Sample I: 30, 35, 40, 45, 50, 55, 60, 65, 70, the sample range is 40 (= 70 − 30).

Deviation from the Sample Mean: the difference between an individual sample value and the sample mean.
e.g. for Sample I the sample mean is 50, so the deviations from the sample mean are −20, −15, −10, −5, 0, 5, 10, 15, 20.

Sample Variance: roughly, the average of the squared deviations from the sample mean. If our sample size is n and we use x̄ to denote the sample mean, then the sample variance s² is given by

    s² = Σ(xi − x̄)² / (n − 1) = Sxx / (n − 1),

where the sum runs over i = 1, …, n.

Sample Standard Deviation: the square root of the sample variance,

    s = √(s²).

e.g. for Sample I: 30, 35, 40, 45, 50, 55, 60, 65, 70, the mean is 50 and we have

    xi          30    35    40    45    50    55    60    65    70
    xi − x̄     −20   −15   −10    −5     0     5    10    15    20
    (xi − x̄)²  400   225   100    25     0    25   100   225   400

Therefore the sample variance is (400 + 225 + 100 + 25 + 0 + 25 + 100 + 225 + 400)/(9 − 1) = 187.5 and the sample standard deviation is √187.5 ≈ 13.7.

e.g. for Sample II: 30, 41, 48, 49, 50, 51, 52, 59, 70, the mean is also 50 and we have

    xi          30    41    48    49    50    51    52    59    70
    xi − x̄     −20    −9    −2    −1     0     1     2     9    20
    (xi − x̄)²  400    81     4     1     0     1     4    81   400

Therefore the sample variance is (400 + 81 + 4 + 1 + 0 + 1 + 4 + 81 + 400)/(9 − 1) = 121.5 and the sample standard deviation is √121.5 ≈ 11.0.

e.g. for Sample III: 41, 45, 48, 49, 50, 51, 52, 55, 59, the mean is also 50 and we have

    xi          41    45    48    49    50    51    52    55    59
    xi − x̄      −9    −5    −2    −1     0     1     2     5     9
    (xi − x̄)²   81    25     4     1     0     1     4    25    81

Therefore the sample variance is (81 + 25 + 4 + 1 + 0 + 1 + 4 + 25 + 81)/(9 − 1) = 27.75 and the sample standard deviation is √27.75 ≈ 5.3.

In summary, the sample variance is 187.5 for Sample I, 121.5 for Sample II, and 27.75 for Sample III, matching the decreasing spread of the three samples about their common mean.

Remark 1. Why use the sum of squares of the deviations? Why not simply sum the deviations? Because the sum of the deviations from the sample mean is ALWAYS equal to 0:

    Σ(xi − x̄) = Σxi − nx̄ = Σxi − n·((1/n)Σxi) = 0,

where every sum runs over i = 1, …, n.

Remark 2. Why do we use the divisor n − 1 in the sample variance, while the population variance uses the divisor N? The variance measures deviation from the "center", but the "center" of a sample and of a population are different, namely the sample mean x̄ and the population mean µ. If we could use µ instead of x̄ in the definition of s², we would have s² = Σ(xi − µ)²/n. Generally, however, the population mean is unavailable to us, so our only choice is the sample mean. In that case the observations xi tend to be closer to their own average x̄ than to the population average µ, so the squared deviations about x̄ come out systematically a little too small. To compensate, we use the divisor n − 1.
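The calculations above are easy to reproduce by machine. The following is a minimal Python sketch (not part of the original slides; the function name is my own) that computes the sample variance and standard deviation of Samples I–III directly from the definition and also confirms the claim in Remark 1 that the deviations from the sample mean sum to 0.

```python
from math import sqrt

def sample_variance(xs):
    """Sample variance with divisor n - 1, exactly as defined above."""
    n = len(xs)
    xbar = sum(xs) / n
    sxx = sum((x - xbar) ** 2 for x in xs)  # Sxx: sum of squared deviations
    return sxx / (n - 1)

samples = {
    "Sample I":   [30, 35, 40, 45, 50, 55, 60, 65, 70],
    "Sample II":  [30, 41, 48, 49, 50, 51, 52, 59, 70],
    "Sample III": [41, 45, 48, 49, 50, 51, 52, 55, 59],
}

for name, xs in samples.items():
    xbar = sum(xs) / len(xs)
    s2 = sample_variance(xs)
    print(name,
          "sum of deviations =", sum(x - xbar for x in xs),  # always 0 (Remark 1)
          "s^2 =", s2,
          "s =", round(sqrt(s2), 2))
# Expected: s^2 = 187.5, 121.5, 27.75 and s ≈ 13.69, 11.02, 5.27.
```

The standard library's statistics.variance and statistics.stdev use the same n − 1 divisor, while statistics.pvariance divides by n, matching the population variance.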
Remark 3. It is customary to refer to s² as being based on n − 1 degrees of freedom (df). s² is built from the n quantities (x1 − x̄)², (x2 − x̄)², …, (xn − x̄)². However, the deviations x1 − x̄, x2 − x̄, …, xn − x̄ sum to 0, so if we know any n − 1 of them, we know all of them.
e.g. suppose x1 = 4, x2 = 7, x3 = 1 and x4 = 10. Then the mean is x̄ = 5.5, and x1 − x̄ = −1.5, x2 − x̄ = 1.5, x3 − x̄ = −4.5. From these we know directly that x4 − x̄ = 4.5, since the four deviations must sum to 0.

Some mathematical results for s²:
s² = Sxx/(n − 1), where Sxx = Σ(xi − x̄)² = Σxi² − (Σxi)²/n.
If y1 = x1 + c, y2 = x2 + c, …, yn = xn + c, then sy² = sx².
If y1 = cx1, y2 = cx2, …, yn = cxn, then sy² = c²sx², i.e. sy = |c|sx.
Here sx² is the sample variance of the x's, sy² is the sample variance of the y's, and c is any nonzero constant.

e.g. in the previous example, Sample III is 41, 45, 48, 49, 50, 51, 52, 55, 59, and we can calculate the sample variance using the shortcut formula as follows:

    xi     41    45    48    49    50    51    52    55    59
    xi²  1681  2025  2304  2401  2500  2601  2704  3025  3481

    Σxi = 450,  Σxi² = 22722.

Therefore the sample variance is (22722 − 450²/9)/(9 − 1) = (22722 − 22500)/8 = 27.75, the same value obtained from the definition.

Boxplots

e.g. A recent article ("Indoor Radon and Childhood Cancer") presented data on radon concentration (Bq/m³) in two different samples of houses. The first sample consisted of houses in which a child diagnosed with cancer had been residing; houses in the second sample had no recorded cases of childhood cancer. The data were summarized in a comparative stem-and-leaf display (stem: tens digit, leaf: ones digit), with the leaves of sample 1 (cancer) on one side of each stem and the leaves of sample 2 (no cancer) on the other, and boxplots were then constructed for the 1st data set, the 2nd data set, and both data sets side by side (figures not reproduced here).

Some terminology:
Lower fourth: the median of the smaller half of the observations.
Upper fourth: the median of the larger half of the observations.
Fourth spread: the difference between the upper fourth and the lower fourth,

    fs = upper fourth − lower fourth.

Outlier: any observation farther than 1.5 fs from the closest fourth. An outlier is extreme if it is more than 3 fs from the nearest fourth, and it is mild otherwise.
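To see the terminology above in action, here is a small Python sketch (not part of the original slides; the helper functions and the data are hypothetical). It assumes the common convention that, when n is odd, the sample median is counted in both halves when the fourths are computed, a detail the slide does not spell out.

```python
def median(xs):
    xs = sorted(xs)
    n = len(xs)
    mid = n // 2
    return xs[mid] if n % 2 else (xs[mid - 1] + xs[mid]) / 2

def fourths(xs):
    """Lower/upper fourth: medians of the smaller and larger halves.
    Assumption: for odd n, the sample median belongs to both halves."""
    xs = sorted(xs)
    n = len(xs)
    half = (n + 1) // 2                  # size of each half under this convention
    return median(xs[:half]), median(xs[n - half:])

def classify_outliers(xs):
    lf, uf = fourths(xs)
    fs = uf - lf                         # fourth spread
    mild, extreme = [], []
    for x in xs:
        dist = max(lf - x, x - uf, 0)    # distance from the closest fourth
        if dist > 3 * fs:
            extreme.append(x)
        elif dist > 1.5 * fs:
            mild.append(x)
    return lf, uf, fs, mild, extreme

# Hypothetical data (not the radon sample) with one unusually large value:
data = [10, 12, 13, 15, 16, 18, 20, 21, 24, 85]
print(classify_outliers(data))   # lf = 13, uf = 21, fs = 8, mild = [], extreme = [85]
```

The fourth spread plays the role of the interquartile range, and the 1.5·fs and 3·fs cutoffs are the usual rules for flagging mild and extreme outliers on a boxplot.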
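Finally, returning to the mathematical results for s² stated earlier, the short sketch below (again not part of the original slides) checks the shortcut formula Sxx = Σxi² − (Σxi)²/n on Sample III and illustrates the effect of shifting and rescaling the data on the sample variance.

```python
def sample_variance(xs):
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

def sample_variance_shortcut(xs):
    """Sxx = sum(xi^2) - (sum(xi))^2 / n, then divide by n - 1."""
    n = len(xs)
    sxx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    return sxx / (n - 1)

sample_iii = [41, 45, 48, 49, 50, 51, 52, 55, 59]
print(sample_variance(sample_iii), sample_variance_shortcut(sample_iii))  # both 27.75

c = 3
shifted = [x + c for x in sample_iii]   # yi = xi + c: variance unchanged
scaled  = [c * x for x in sample_iii]   # yi = c*xi: variance multiplied by c^2
print(sample_variance(shifted))          # 27.75
print(sample_variance(scaled))           # 9 * 27.75 = 249.75
```

The shift result reflects the fact that adding a constant moves every observation and the sample mean by the same amount, so the deviations, and hence s², are unchanged.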