Measures of Variability

Consider three samples, each of size 9 and each with sample mean 50:

Sample I:   30, 35, 40, 45, 50, 55, 60, 65, 70
Sample II:  30, 41, 48, 49, 50, 51, 52, 59, 70
Sample III: 41, 45, 48, 49, 50, 51, 52, 55, 59

- Sample Range: the difference between the largest and the smallest sample values.
  e.g. for Sample I (30, 35, 40, 45, 50, 55, 60, 65, 70) the sample range is 40 (= 70 − 30).

- Deviation from the Sample Mean: the difference between an individual sample value and the sample mean.
  e.g. for Sample I the sample mean is 50, so the deviations from the sample mean are −20, −15, −10, −5, 0, 5, 10, 15, 20.

- Sample Variance: roughly, the average of the squared deviations from the sample mean (with divisor n − 1 rather than n; see Remark 2 below). If the sample size is n and x̄ denotes the sample mean, then the sample variance s² is given by

  \[ s^2 = \frac{S_{xx}}{n-1} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}. \]

- Sample Standard Deviation: the square root of the sample variance, s = √s².

e.g. for Sample I (30, 35, 40, 45, 50, 55, 60, 65, 70), the mean is 50 and we have

  xi        : 30   35   40   45   50   55   60   65   70
  xi − x̄    : −20  −15  −10  −5   0    5    10   15   20
  (xi − x̄)² : 400  225  100  25   0    25   100  225  400

Therefore the sample variance is (400 + 225 + 100 + 25 + 0 + 25 + 100 + 225 + 400)/(9 − 1) = 187.5 and the standard deviation is √187.5 ≈ 13.7.

e.g. for Sample II (30, 41, 48, 49, 50, 51, 52, 59, 70), the mean is also 50 and we have

  xi        : 30   41   48   49   50   51   52   59   70
  xi − x̄    : −20  −9   −2   −1   0    1    2    9    20
  (xi − x̄)² : 400  81   4    1    0    1    4    81   400

Therefore the sample variance is (400 + 81 + 4 + 1 + 0 + 1 + 4 + 81 + 400)/(9 − 1) = 121.5 and the standard deviation is √121.5 ≈ 11.0.

e.g. for Sample III (41, 45, 48, 49, 50, 51, 52, 55, 59), the mean is also 50 and we have

  xi        : 41   45   48   49   50   51   52   55   59
  xi − x̄    : −9   −5   −2   −1   0    1    2    5    9
  (xi − x̄)² : 81   25   4    1    0    1    4    25   81

Therefore the sample variance is (81 + 25 + 4 + 1 + 0 + 1 + 4 + 25 + 81)/(9 − 1) = 27.75 and the standard deviation is √27.75 ≈ 5.3.

In summary, the sample variance is 187.5 for Sample I, 121.5 for Sample II, and 27.75 for Sample III: although all three samples share the same mean, they are increasingly concentrated around it.
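As a quick check of these worked examples, here is a minimal Python sketch (the variable and function names are my own, not from the slides) that computes the range, the sample variance from the definition, and the standard deviation for all three samples.

```python
import math

samples = {
    "Sample I":   [30, 35, 40, 45, 50, 55, 60, 65, 70],
    "Sample II":  [30, 41, 48, 49, 50, 51, 52, 59, 70],
    "Sample III": [41, 45, 48, 49, 50, 51, 52, 55, 59],
}

def sample_variance(xs):
    """Sample variance: s^2 = sum((x - xbar)^2) / (n - 1)."""
    n = len(xs)
    xbar = sum(xs) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    return sxx / (n - 1)

for name, xs in samples.items():
    s2 = sample_variance(xs)
    print(f"{name}: range = {max(xs) - min(xs)}, "
          f"s^2 = {s2:g}, s = {math.sqrt(s2):.1f}")
# Expected: s^2 = 187.5, 121.5, 27.75 and s = 13.7, 11.0, 5.3.
```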
Remark 1: Why use the sum of squares of the deviations? Why not simply sum the deviations? Because the sum of the deviations from the sample mean is always equal to 0:

\[ \sum_{i=1}^{n} (x_i - \bar{x}) = \sum_{i=1}^{n} x_i - \sum_{i=1}^{n} \bar{x} = \sum_{i=1}^{n} x_i - n\bar{x} = \sum_{i=1}^{n} x_i - n\left(\frac{1}{n}\sum_{i=1}^{n} x_i\right) = 0. \]

Remark 2: Why do we use divisor n − 1 in the calculation of the sample variance, while we use divisor N in the calculation of the population variance? The variance measures deviation from the "center", but the center is different for a sample and a population: the sample mean x̄ versus the population mean µ. If we could use µ instead of x̄ in the definition of s², we would take s² = Σ(xi − µ)²/n. But the population mean is generally unavailable to us, so our only choice is the sample mean. In that case the observations xi tend to be closer to their own average x̄ than to the population mean µ, which makes the squared deviations too small on average; to compensate, we use the divisor n − 1.

Remark 3: It is customary to refer to s² as being based on n − 1 degrees of freedom (df). s² is built from the n quantities (x1 − x̄)², (x2 − x̄)², …, (xn − x̄)². However, the sum of x1 − x̄, x2 − x̄, …, xn − x̄ is 0, so if we know any n − 1 of the deviations, we know all of them: only n − 1 of them are free to vary.

e.g. suppose x1 = 4, x2 = 7, x3 = 1 and x4 = 10. Then the mean is x̄ = 5.5, and x1 − x̄ = −1.5, x2 − x̄ = 1.5, x3 − x̄ = −4.5. From these alone we know that x4 − x̄ = 4.5, since the four deviations must sum to 0.
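The following short sketch, using the four values from the example above, verifies both points numerically: the deviations sum to zero (Remark 1), and the last deviation can be recovered from the other three (Remark 3).

```python
xs = [4, 7, 1, 10]
xbar = sum(xs) / len(xs)                 # 5.5

deviations = [x - xbar for x in xs]      # [-1.5, 1.5, -4.5, 4.5]
print(sum(deviations))                   # 0.0 (Remark 1, up to floating-point rounding)

# Remark 3: any n - 1 deviations determine the remaining one.
recovered = -sum(deviations[:-1])        # forces the total to be 0
print(recovered, deviations[-1])         # 4.5 4.5
```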
Some mathematical results for s²:

- s² = Sxx/(n − 1), where

  \[ S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}; \]

- If y1 = x1 + c, y2 = x2 + c, …, yn = xn + c, then sy² = sx²;

- If y1 = cx1, y2 = cx2, …, yn = cxn, then sy = |c| sx.

Here sx² is the sample variance of the x's, sy² is the sample variance of the y's, and c is any nonzero constant.

e.g. for Sample III, {41, 45, 48, 49, 50, 51, 52, 55, 59}, we can calculate the sample variance with the shortcut formula as follows:

  xi  : 41    45    48    49    50    51    52    55    59     Σxi  = 450
  xi² : 1681  2025  2304  2401  2500  2601  2704  3025  3481   Σxi² = 22722

Therefore the sample variance is (22722 − 450²/9)/(9 − 1) = 27.75, the same value obtained earlier from the definition.

Boxplots

e.g. A recent article ("Indoor Radon and Childhood Cancer") presented data on radon concentration (Bq/m³) in two different samples of houses. The first sample consisted of houses in which a child diagnosed with cancer had been residing. Houses in the second sample had no recorded cases of childhood cancer. The data were summarized in a back-to-back stem-and-leaf display (stems 0–8; stem: tens digit, leaf: ones digit), with the cancer-house leaves on one side and the no-cancer leaves on the other.

[Stem-and-leaf display of the two radon samples.]

[Boxplot of the first (cancer) data set.]

[Boxplot of the second (no cancer) data set.]

[Side-by-side boxplots of both data sets.]

Some terminology:

- Lower Fourth: the median of the smallest half of the data.

- Upper Fourth: the median of the largest half of the data.

- Fourth Spread: the difference between the upper fourth and the lower fourth,
  fs = upper fourth − lower fourth.

- Outlier: any observation farther than 1.5 fs from the closest fourth. An outlier is extreme if it is more than 3 fs from the nearest fourth, and mild otherwise.
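To tie these pieces together, here is a minimal Python sketch. The function names (shortcut_variance, fourths) are my own, and the handling of an odd-sized sample, where the middle value is counted in both halves, is one common convention rather than something stated on these slides. It checks the shortcut formula against the value computed above for Sample III and then computes the fourths, the fourth spread, and the 1.5·fs outlier cutoffs for Sample I.

```python
def shortcut_variance(xs):
    """Sample variance via the shortcut Sxx = sum(x^2) - (sum(x))^2 / n."""
    n = len(xs)
    sxx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    return sxx / (n - 1)

def fourths(xs):
    """Lower and upper fourths: medians of the smallest and largest halves.
    Assumed convention: for odd n, the middle value belongs to both halves."""
    xs = sorted(xs)
    half = (len(xs) + 1) // 2
    def median(ys):
        m = len(ys)
        return ys[m // 2] if m % 2 else (ys[m // 2 - 1] + ys[m // 2]) / 2
    return median(xs[:half]), median(xs[-half:])

sample1 = [30, 35, 40, 45, 50, 55, 60, 65, 70]
sample3 = [41, 45, 48, 49, 50, 51, 52, 55, 59]

# Shortcut formula reproduces the value obtained from the definition.
print(shortcut_variance(sample3))            # 27.75

# Fourths, fourth spread, and outlier cutoffs for Sample I.
lf, uf = fourths(sample1)                    # 40, 60
fs = uf - lf                                 # 20
print(lf - 1.5 * fs, uf + 1.5 * fs)          # 10.0 90.0 -> no outliers in Sample I
```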