Data Analysis and Statistical Methods Statistics 651 http://www.stat.tamu.edu/~suhasini/teaching.html Lecture 4 (MWF) Boxplots and standard deviations Suhasini Subba Rao Lecture 4 (MWF) Review of previous lecture • In the previous lecture we defined the population mean and median and the sample mean and median. • Both the mean and median are measures of centrality in a population or sample. • It is also useful to have measures of spread. Measures of spread give us information about how spread out the population is and it will also give information about how accurate the sample mean will be. The more spread out the population the less accurate the sample mean is likely to be (see variance explanation lecture4.pdf, pages 1-2). • The most simple measure of spread is to use the range of the data (smallest and largest number in the data set). But as illustrated by the 1 Lecture 4 (MWF) extreme examples (given in lecture 3) Sample 1: −1, 0, 0, 0, 0, 0 Sample 2: −1, 0, 0, 0, 0, 2 Sample 3: −1, 0, 0, 0, 0, 20 using the range does not necessarily represent the ‘typical’ spread. • An alternative measure of spread is the Interquartile Range. 2 Lecture 4 (MWF) Boxplots, medians and means largest value 75th percentile Order data Median 25th percentile lowest value 3 Lecture 4 (MWF) Example The number of people volunteering to give blood at a center was recordered for each of 20 successive Fridays. The data is shown below 320 274 370 308 386 315 334 368 325 332 315 260 334 295 301 356 270 333 310 250 Make a boxplot of the observations. 4 Lecture 4 (MWF) Solution in JMP 5 Lecture 4 (MWF) Comparing multiple samples using boxplots • Boxplots are excellent for comparing two samples. • In JMP you can compare multiple samples through boxplots. • Instructions are given in my JMP notes on how to to do this. • Please do this: Make boxplots for the sample (since we only have the data from class) distribution of the total number of M&Ms for each type of M&M. 6 Lecture 4 (MWF) Measures of spread - The Variance • The variance is a commonly used measure of spread. Given the population x1, . . . , xN we define the population variance as N X 1 σ2 = (xi − µ)2, N µ = population mean. k=1 • In words: this is the sum of all the squared differences between all possible outcomes and the population mean. • In reality it will never be observed (since the population is unknown) and it has to be estimated instead. 7 Lecture 4 (MWF) • Given the sample X1, . . . , Xn we define the sample variance as n X 1 (Xi − X̄)2, s2 = (n − 1) i=1 given the sample we not know the mean µ, so we estimate it with using the sample mean X̄ (swop µ with X̄). The reason we divide by (n − 1) and not n when obtaining the sample variance is a quirk of statistics. We do it to reduce the bias. I don’t much care whether you divide by n or (n − 1). What you need to remember is that the difference between the sample variance and the population is that the population variance is calculated using the entire population, whereas the sample variance is calculated using the sample. 8 Lecture 4 (MWF) Example - the variance (by hand) We are given the sample 5, 4, 3, 0, 3. The sample mean x̄ is 3. Calculating the sample variance: xi 5 4 3 0 3 frequency xi − x̄ 5−3=2 4−3=1 3−3=0 0 − 3 = −3 3−3=0 (xi − x̄)2 (5 − 3)2 = 4 (4 − 3)2 = 1 (3 − 3)2 = 0 (0 − 3)2 = 9 (3 − 3)2 = 0 sum = 14 The sample variance is n X 1 14 14 s2 = = . (xi − x̄)2 = (n − 1) i=1 5−1 4 9 Lecture 4 (MWF) The variance does not change with shifts • The variance is invariant to shifts in the data. Suppose we shift the data by 20 (this could be because of a change in units of observations), so we observe 25, 24, 23, 20, 23. The mean is shifted by 20, but spread stays the same. To see why look at the calculation: xi 25 24 23 20 23 frequency xi − x̄ 25 − 23 = 2 24 − 23 = 1 23 − 23 = 0 20 − 23 = −3 23 − 23 = 0 (xi − x̄)2 (5 − 3)2 = 4 (4 − 3)2 = 1 (3 − 3)2 = 0 (0 − 3)2 = 9 (3 − 3)2 = 0 sum = 14 10 Lecture 4 (MWF) • The above example show that the variance is simply measuring squared the distance from the mean (which stays the same for shift transformations). • See the explanation of the variance given in https://www.stat.tamu. edu/~suhasini/teaching651/variance_explanation_lecture4.pdf pages 3-5. 11 Lecture 4 (MWF) The population, sample and variance 3,4,6,9,10,15,44,16 Set of all 3,16 Sample measurements 1 (population) 6,9,3 Sample 2 • Population variance In this case it is σ 2 = 81 [(3 − 13.375)2 + (4 − 13.375)2 + . . . + (16 − 13.375)2] = 153.4844. 12 Lecture 4 (MWF) Observations • Like the sample mean the sample variance is random and varies from sample to sample. • Sample variance For Sample 2: it is s22 = For Sample 1: s21 = 1 3−1 [(6 1 2−1 [(3 − 6)2 + (9 − 6)2 + (3 − 6)2] = 9. − 9.5)2 + (16 − 9.5)2] = 84.5. • Similar to the sample mean, the sample variance is also random (changes according to the sample). • We see that the sample variance tends to be smaller than the population variance. 13 Lecture 4 (MWF) • When we start learning statistical methods in this class, we will work under the assumption that the variance of the population known. In many situations this is an unrealistic assumption, and we need to estimate the variance from the data. This changes our inference and is the reason that we will be using the t-distribution (this comes much later - circa Lecture 14). 14 Lecture 4 (MWF) The variance does not measure distance! • Example Consider the observations −4, −3, −2, −1, 1, 2, 3, 4. The mean 1 is zero and the sample variance is 8−1 (42 +32 +22 +12 +12 +22 +32 +42) = 8.57 • The range of the data is −4, 4, which has length 8. We see that the sample variance = 8.57, it is even bigger than the range itself! • This is because the variance squares the distances. To overcome this we square root the variance, to give the standard deviation. 15 Lecture 4 (MWF) The standard deviation • The standard deviation is the square root of the variance. That is v u N u1 X (xi − µ)2 σ=t N N = size of population k=1 • The sample standard deviation is v u u s=t n X 1 (Xi − X̄)2 (n − 1) i=1 n = size of sample • The standard deviation has the advantage that it has the same set of units as the data. 16 Lecture 4 (MWF) • The variance ‘squares’ the amount of spread. The standard deviation gives a more accurate measure of the amount of spread. • The variance and standard deviation are closely related. The variance is just the square of the standard deviation. Often students treat them as different measures of spread. They are not. The variance is not a measure of spread, the square root of it (the standard deviation) is. 17 Lecture 4 (MWF) Comparing standard deviations and IQR • For the majority of the analysis we do in this class, we be using the standard deviation, but it is useful to understand how outliers may influence the standard deviation: • Observe that the majority of the points are about zero but one value is 5. This pulls the mean and the standard deviation to the right. However, there is not effect on the median or IQR. 18 Lecture 4 (MWF) Standard deviation: Graphical illustration Mean is 29 and standard deviation is 6.6. The green line [29 − 6.6, 29 + 6.6] = [22.4, 35.6] is one standard deviation from the mean. Observe that 7/11 = 63.3% of the points are within one standard deviation of the mean. 19 Lecture 4 (MWF) Standard deviation and IQR • Here we illustrate the differences between the two measures of spread; IQR and standard deviations. – We observe if all the data takes the same value, the IQR and standard deviation are both zero. The standard deviation is only zero if all the data is the same. – However, we see on the second plot that the IQR can still be zero if most of the values (but not all) are the same. 20 Lecture 4 (MWF) The empirical rule (this just a rule of thumb) For observations from several different disciplines (nature, engineering etc) we often observe that • Approximately 68% of the observations lie inside the interval [x̄−s, x̄+s]. • Approximately 95% of the observations lie inside the interval [x̄ − 2s, x̄ + 2s]. • Approximately 99.7% of the observations lie inside the interval [x̄ − 3s, x̄ + 3s]. This is only a rule of thumb and it only applies to data, which is ‘normally’ distributed. Don’t take it seriously. What it does say is that the majority of the data will be within three standard deviations of the mean. 21 Lecture 4 (MWF) Interpretation of the empirical rule This is only a rule of thumb. It should not be taken too seriously. Some explanations: • What we mean by the interval [a, b]. These are two points of the ‘time line’. Starting with a and ending at b. Example, what does the interval [3, 4.5] mean? • What we mean by ‘ Approximately 68% of the observations lie inside the interval [x̄ − s, x̄ + s]’. Is that we calculate s and X̄ from the data. For example, it could be X̄ = 3 and s = 1. For this example, the interval [x̄ − s, x̄ + s] is [3 − 1, 3 + 1], and we count the number of observations in this interval. The empirical rule states that for many data sets 68% of the observations lie in this interval. Similarly [x̄−2s, x̄+2s] = [3−2, 3+2] and [x̄−3s, x̄+3s] = [3−3, 3+3]. 22 Lecture 4 (MWF) Where you may have previously encountered the standard deviation • When you go to the doctors office, the technicians take several measurements, such as weight, height, blood pressure and blood samples. • How do they determine whether it is ‘normal’. Often you will get your reading with the percentile (we come to this later) and the number of standard deviations your reading is from the average (mean). Consider the following example: – Suppose that you have a blood sample taken. The mean for the reading is 20 and you have 14, is that normal? – The difference of 14 from 20 does not tell you anything unless you know how spread out the readings can be. 23 Lecture 4 (MWF) – Yhe standard deviation is 8, so there is a fair amount of variation or spread. Then you are within (14 − 20)/8 = −6/8 standard deviations of the mean. This means you are not too far from the average (what we mean by far and not too far is really determined by the percentile and the distribution of the measurements, but we come to this later). – In addition if blood samples followed the empirical rule, since your reading is within one standard deviation of the mean then your reading is ‘within’ 68% of the mean • Another example: – A friend has their blood sample taken, he has the reading 45. Is that normal? – His reading is 25 from the mean and (45 − 20)/8 = 3.125 standard deviations from the mean. – This seems to be in the extreme (recall by the empirical rule, for many data sets 99.7% of observations are within 3 standard deviations of 24 Lecture 4 (MWF) the mean). – This suggests that (i) either he is one of the very few people who are fine with such a large reading or (ii) he is not fine. The data suggests that (ii) could be true. 25