A Look at Means, Variances, Standard Deviations, and z-Scores Dr. Margie Mason College of William and Mary mmmaso@wm.edu Based on the Technical Assistance Document 2009 Algebra I Standard of Learning A.9 http://www.doe.virginia.gov/instruction/ high_school/mathematics/technical_assis tance_algebra1_a9.pdf Algebra Standard of Learning A.9 The student, given a set of data, will interpret variation in real-world contexts and calculate and interpret mean absolute deviation, standard deviation, and z-scores. Mathematics SOL 5.16 5.16 The student will a) describe mean, median, and mode as measures of center; b) describe mean as fair share; c) find the mean, median, mode, and range of a set of data; and d) describe the range of a set of data as a measure of variation. Algebra SOL A.10 The student will compare and contrast multiple univariate data sets, using box-and-whisker plots. Algebra SOL A.10 Let’s use unifix stick heights of 5 13 6 15 8 8 10 17 18 20 Determining a Box-and-Whisker Plot 1. Find: Median – the middle value when arranged from smallest to largest. 11.5 Lower extreme (LE) – the smallest value 5 Upper extreme (UE) – the largest value 20 Lower quartile (LQ) – the value halfway between the lower extreme and the median. 8 Upper quartile (UQ) – the value halfway between the upper extreme and the median. 17 Interquartile range (IQR) – The difference between the upper quartile and the lower quartile. 9 Determining a Box-and-Whisker Plot 2. Determine and draw the scale. Subtract the smallest value from the largest value to determine the range. Choose a reasonable size for the intervals based on the range to be covered, e.g., 1, 2, 5, or 10. Draw the scale much like a number line at the bottom of your plot. Determining a Box-and-Whisker Plot 3. Draw the box: a. Length of the box extends for LQ to UQ. It is drawn above the scale. b. Mark the median. c. Width of box can be anything. Determining a Box-and-Whisker Plot 4. Draw the whiskers: a. Draw from the box you just drew to LE and UE. Determining a Box-and-Whisker Plot Alternate method for drawing the whiskers: 4. Determine the outliers: Multiply the IQR by 1.5 and add this number to the UQ and subtract it from the LQ. Any values outside these limits are outliers. Determining a Box-and-Whisker Plot Alternate Method: 5. Draw whiskers: a. Draw from box to LE and UE excluding outliers. b. Place asterisks on any outliers. Determining a Box-and-Whisker Plot How many years of experience as a teacher do you have? Mathematics SOL 6.15 The student will a) describe mean as balance point; and b) decide which measure of center is appropriate for a given purpose. Mathematics SOL 6.15 The mean is the numerical average of the data set and is found by adding the numbers in the data set together and dividing the sum by the number of data pieces in the set. In grade 5 mathematics, mean is defined as fair-share. Mathematics SOL 6.15 Mean can be defined as the point on a number line where the data distribution is balanced. This means that the sum of the distances from the mean of all the points above the mean is equal to the sum of the distances of all the data points below the mean. This is the concept of mean as the balance point. Defining mean as balance point is a prerequisite for understanding standard deviation. Mathematics SOL 6.15 Balance Point: The sum of the distances on a number line from the mean of all the points above the mean is equal to the sum of the distances of all the data points below the mean. 7+6+4+4+2=1+3+5+6+8 Sample vs. Population Data A statistical population includes all elements in the set. A sample is a subset of the population. A data set, whether a sample or population, is comprised of individual data points referred to as elements of the data set. Start with small defined population data sets of approximately 30 items or less. *Elements* An element of a data set will be represented as xi. Where i represents the ith term of the data set. Sample vs. Population Data The arithmetic mean of a population is represented by the Greek letter m (mu), while the calculated arithmetic mean of a sample is represented by , read “x bar.” Mean Absolute Deviation vs. Variance and Standard Deviation Measuring dispersion or spread of a data set about the mean One measure of spread is to find the sum of the deviations between each element and the mean; however, this sum is always zero. Mean Absolute Deviation vs. Variance and Standard Deviation Two methods: take the absolute value of the deviations before finding the average, (Mean Absolute Deviation) or square the deviations before find the average (Variance and Standard Deviation) Mean Absolute Deviation vs. Variance and Standard Deviation Summation Notation Mean Absolute Deviation The arithmetic mean of the absolute values of the deviations of elements from the mean of a data set. 5 6 8 8 10 13 15 17 18 20 |5-12| + |6-12| + |8-12| + |8-12| +|10-12| + |13-12| + |15-12| +|17-12| + |18-12| + |20-12| 10 =7+6+4+4+2+1+3+5+6+8 10 = 46 10 = 4.6 *Mean Absolute Deviation* Mean absolute deviation = where m represents the mean of the data set, n represents the number of elements in the data set, and xi represents the ith element of the data set. Mean Absolute Deviation Mean absolute deviation is less affected by outlier data than the variance and standard deviation. Outliers are elements that fall at least 1.5 times the interquartile range (IQR) below the first quartile (Q1) or above the third quartile (Q3). Variance The average of the squared deviations from the mean is known as the variance and is another measure of the spread of the elements in the set. (5-12)2 + (6-12)2 + (8-12)2 + (8-12)2 +(10-12)2 + (13-12)2 + (15-12)2 +(17-12)2 + (18-12)2 + (20-12)2 10 = (-7)2 + (-6) 2 + (-4) 2 + (-4) 2 + (-2) 2 + 12 + 32 + 52 + 62 + 82 10 = 49 + 36 + 16 + 16 + 4 + 1 + 9 + 25 + 36 + 64 = 256 = 25.6 10 10 *Variance* Variance (s 2)= where m represents the mean of the data set, n represents number of elements in the data set, and xi represents the ith element of the data set. Variance The differences are squares so that they don’t cancel each other out when finding the sum. When squaring the differences, the units of measure are squared and the larger differences are “weighted” more heavily than smaller differences. In order to provide a measure of variation in terms of the original units of the date, the square root of the variance is taken, yielding the standard deviation. Standard Deviation The positive square root of the variance of the data set. The greater the value, the more spread out the data are about the mean. The lesser (closer to 0) the value, the closer the data are clustered about the mean. *Standard Deviation* Standard deviation (s)= where m represents the mean of the data set, n represents the number of elements in the data set, and xi represents the ith element of the data set. Standard Deviation “ s ”, written and read “sigma”, represents the standard deviation of a population and “s” represents the sample standard deviation. s = the square root of 25.6 = 5.06 Standard Deviation The population standard deviation can be estimated by calculating the sample standard deviation. The formulas for sample and population look similar except that the sample standard deviation formula uses n – 1 instead of n in the denominator. This is to account for the possibility of greater variability of data in the population than what was seen in the sample. Standard Deviation When n-1 is used in the denominator, the result is a larger number. So the calculated value of the sample standard deviation will be larger than the population standard deviation. As sample sizes get larger, the difference gets smaller. The use of n-1 is known as Bessel’s correction. SOL A.9 used the population standard deviation with n in the denominator. Interpreting Standard Deviation • • Standard deviation is a measure of the typical amount an entry deviates from the mean. The more the entries are spread out, the greater the standard deviation. Interpreting Standard Deviation Empirical Rule (68 -95-99.7 Rule) For data with a (symmetric) bell-shaped distribution, the standard deviation has the following characteristics: •About 68% of the data lie within one standard deviation of the mean. •About 95% of the data lie within two standard deviations of the mean. •About 99.7% of the data lie within three standard deviations of the mean. Empirical Rule (68 – 95 – 99.7 Rule) 99.7% within 3 standard deviations 95% within 2 standard deviations 68% within 1 standard deviation 34% 2.35% x 3s 34% 13.5% x 2s 13.5% x s x x s 2.35% x 2s x 3s Example: Using the Empirical Rule In a survey conducted by the National Center for Health Statistics, the sample mean height of women in the United States (ages 20-29) was 64.3 inches, with a sample standard deviation of 2.62 inches. Estimate the percent of the women whose heights are between 59.06 inches and 64.3 inches. Example: Using the Empirical Rule • Because the distribution is bell-shaped, you can use the Empirical Rule. 34% + 13.5% = 47.5% of women are between 59.06 and 64.3 inches tall. Chebychev’s Theorem • The portion of any data set lying within k standard 1 deviations (k > 1) of the mean is at least: 1 k • 2 k = 2: In any data set, at least 1 12 3 or 75% 2 4 of the data lie within 2 standard deviations of the mean. 1 8 • k = 3: In any data set, at least 1 3 9 or 88.9% of the data lie within 3 standard deviations of the mean. 2 Using Chebychev’s Theorem The age distribution for Florida is shown in the histogram. Apply Chebychev’s Theorem to the data using k = 2. What can you conclude? Using Chebychev’s Theorem k = 2: μ – 2σ = 39.2 – 2(24.8) = – 10.4 (use 0 since age can’t be negative) μ + 2σ = 39.2 + 2(24.8) = 88.8 At least 75% of the population of Florida is between 0 and 88.8 years old. z-Scores A z-score, also called a standard score, is a measure of the position derived from the mean and standard deviation of the data set. In Algebra I, the z-score will be used to determine how many standard deviations an element is above of below the mean of the data set. It can also be used to determine the value of the element, given the z-score of an unknown element and the mean and standard deviation of a data set. The Standard Score (z-Score) Represents the number of standard deviations a given value x falls from the mean μ. value - mean x-m z= = standard deviation s z-Scores The z-score will be positive if the element lies above the mean and negative if it lies below the mean. A z-score is calculated by subtracting the mean of the data set from the element and dividing the result by the standard deviation of the data set. *z-Scores* z-score (z) = where x represents an element of the data set, m represents the mean of the data set, and s represents the standard deviation of the data set. z-Scores z-score of 5 = (5 - 12)/5.06 = -1.38 z-score of 6 = (6 - 12)/5.06 = -1.19 z-score of 8 = (8 - 12)/5.06 = -.79 z-score of 10 = (10 - 12)/5.06 = -.40 z-score of 13 = (13 - 12)/5.06 = .20 z-score of 15 = (15 - 12)/5.06 = .59 z-score of 17 = (17 - 12)/5.06 = .99 z-score of 18 = (18 - 12)/5.06 = 1.19 z-score of 20 = (20 - 12)/5.06 = 1.58 z-Scores Suppose you had the misfortune to have an Algebra test and a history test on the same day. Why you got your tests back, here is the information given to you regarding your performance and the performance of the class on these exams. Which test did you do better on? Algebra History Your score 82 93 Mean 71.06 85.43 Stan. Dev 10.32 18.91 z-Scores Suppose the mean and standard deviation on an algebra test were given as 72 and 12, respectively. Susan’s z-score on the test was 2.34. Was Susan’s test score above or below the mean? How do you know? David’s z-score on the same test was -1.25. Was David’s test score above or below the mean? How do you know? z-Scores Suppose the mean and standard deviation on an algebra test were given as 72 and 12, respectively. Dakota had a z-score of 0.08 on the test. What does this z-score tell you about Dakota’s test score relative to the mean? z-Scores Suppose the mean and standard deviation on an algebra test were given as 72 and 12, respectively. If the z-score for Susan was 2.34, for David was -1.25, and for Dakota was 0.08, find their actual test scores. z-Scores Suppose the mean and standard deviation on an algebra test were given as 72 and 12, respectively. Rebecca made an 80 on the biology test. Find her z-score. Interpretation of Descriptive Statistics Data set 1 Number of Basketball Players Recorded Once Each Day from April 1-14 Data set 2 Number of Basketball Players Recorded Once Each Day from April 15-28 7 7 6 6 5 5 4 4 3 3 2 2 1 1 0 0 1-10 11-20 21-30 31-40 41-50 51-60 61-70 Number of Players Mean = 45.0 Variance = 106.1 Standard Deviation = 10.3 Mean Absolute Deviation = 9.1 1-10 11-20 21-30 31-40 41-50 51-60 61-70 Number of Players Mean = 45.0 Variance = 420.3 Standard Deviation = 20.5 Mean Absolute Deviation = 16 Interpretation of Descriptive Statistics Maya represented the heights of boys in Mrs. Constantine’s and Mr. Kluge’s classes on a line plot and calculated the mean and standard deviation. Heights of Boys in Mrs. Constantine’s and Mr. Kluge’s Classes (in inches) x 64 x 65 x x x 66 x x 67 x x x x x 68 x x x 69 x x 70 x 71 x x 72 Mean = 68.4 Standard Deviation = 2.3 Note: In this problem, a small, defined population of the boys in Mrs. Constantine’s and Mr. Kluge’s classes is assumed. x 73