Math 11 – Statistics Chapter 3 – Measures of Central Tendency for ungrouped data 3 different measures Mean (what we normally think of as average) x x : population sample x N n n 1 Median – the middle term (This gives the position of middle term) 2 Mode: The term that occurs the most often. Unimodal – a single value occurs most times Bimodal – two terms occur the maximum number of times Multimodal – many values occur the maximum number of times Shapes of distributions and the affect on position of mean, median and mode Median Mean Median Mode Mode Median Mean Mode Measures of Dispersion for ungrouped data The Range Largest Value Smallest Value Mean Standard Deviation – the positive square root of the variance Variance - x or population: x N Sample: s2 2 2 x x 2 n 1 x x is called the deviation of the x value from the mean. If we add all the deviations from the mean we get zero. x 0 x x 0 Therefore, we standardize by squaring and then taking the square root. 2 s s2 There are shortcut formulas for the variance – you can use whichever is most convenient for the problem. x N 2 s2 x x n 1 2 2 x 2 x N The variance and the standard deviation are both positive. N x 2 x n 1 n 2 2 2 and s 2 because they are squared and s because they are the positive square roots The numerical measure ( mean, median, mode, variance, or standard deviation ) of a population is called a population parameter or just a parameter. ( , 2 , ) The numerical measure of a sample is called a sample statistic or just a statistic ( x , s 2 , s ) 3.3 Measures of Central Tendency for Grouped Data Mean, Variance and Standard Deviation of Grouped Data When data is already grouped, we don’t know the individual values of the variables. ex The following table gives the grouped data on the weights of all 100 babies born at a hospital in 2005. (Notice, this is the total population) Weight (pounds) 3 to less than 5 5 to less than 7 7 to less than 9 9 to less than 11 11 to less than 13 # of babies 5 30 40 20 5 Notice that this is continuous data, therefore the class limits are the same as the class boundaries. To calculate the mean we will have to assume that observations in each class are evenly distributed throughout the class. Then the average value of the class would be the midpoint. So, find the midpoint of each class and multiply it by the frequency of the class (i.e. the number of observations in that class) This gives us the approximate total of that class. Next – add the totals together to get the sum of all values in all classes. Weight (pounds) 3 to less than 5 5 to less than 7 7 to less than 9 9 to less than 11 11 to less than 13 The mean now becomes mf Frequency (# of babies) f 5 30 40 20 5 N 100 . mf N Midpoint of class m 4 6 8 10 12 for a population and x mf n 780 7.8 pounds N 100 The mean weight of babies born in 2005 was about 7.8 pounds. Variance and Standard Deviation for Grouped Data s 2 2 f (m ) 2 N f m x n 1 or shortcut 2 2 or shortcut s 2 m m 2 mf f N N 2 mf f n 1 n 2 2 Approx total of values in class ( mf ) 20 180 320 200 60 mf 780 . Back to babies We need to find mf weight 3 to less than 5 5 to less than 7 7 to less than 9 9 to less than 11 11 to less than 13 m 2 and f f 5 30 40 20 5 f 100 m 4 6 8 10 12 mf 20 180 320 200 60 mf 780 m2f 80 1080 2560 2000 720 2 m f 6440 Substitute into the formula 2 m 2 mf f N N 2 6440 780 100 6440 6084 356 3.56 This is the variance. 100 100 100 3.56 1.88679 1.89 pounds When using single valued classes, the value of the class will be used as m. ex This table gives the number of errors committed by a college baseball team in all 45 games played in the 2005-2006 season. (Note – this is a population) Find the mean, variance and standard deviation # of errors 0 1 2 3 4 5 Mean: Variance: # of games (f) 11 14 9 7 3 1 f 45 mf N 2 Standard Deviation: mf 0 14 18 21 12 5 mf 70 70 1.56 errors per game. 45 m2 f N mf N 2 186 70 45 77.11 1.71 45 45 1.71 1.309 m2f 0 14 36 63 48 25 2 m f 186 3.4 How do we use Standard Deviation? When we have a symmetric bell-shaped curve (i.e. unimodal) then we can use The Empirical Rule. 1. 68% of the observations lie within one standard deviation of the mean. 2. 95% of the observations lie within two standard deviations of the mean. 3. 99.7% of the observations lie within three standard deviations of the mean 95% 68% 3 2 99.7% 2 3 3.5 Measures of Position Position means where in the data set a specific value falls. Quartiles are summary measures that divide the data set into 4 equal parts. (The data must be ranked first.) 25% 25% 25% 25% 1st Quartile 2nd Quartile 3rd Quartile Q1 Q2 Q3 Q2 = the median IQR – Interquartile Range Q3 Q1 Let’s look at the kitten data again. The data must be ranked first 37 38 40 41 44 45 49 51 52 52 54 56 57 57 58 60 63 63 65 69 72 76 77 84 91 25 1 13 2 entry is 57 Q2 = the median which is So the 13th To find Q1 – find the median of the lower half: Q1 45 49 94 47 2 2 Q3 65 69 134 67 2 2 12 1 6.5 So halfway between the 6th and 7th items 2 The interquartile range is Q3 Q1 67 47 20 5 number summary: min, Q1 , Q2 , Q3 , max We can see that 50% of the kittens weighed between 47 and 67 grams 25% weighed less than 47 grams, and 25% weighed more than 67 grams. Where is 60 grams located? Between 50% and 75% or in the 3rd quartile. Box and Whisker Plots Definition: A plot that shows the center, the spread, and the skewness of a data set. The following 5 steps are used to construct a box-and-whisker plot. Back to kitten weights 1. Find the median = 57 Q1 47 IQR 20 Q3 67 2. Find the points that are 1.5 IQR below Q1 , and 1.5 IQR above Q3 . These are called the lower inner fence and the upper inner fence respectively. For the kittens 1.5 20 30 Lower Inner Fence ( f L ) Q1 30 47 30 17 Upper Inner Fence ( f U ) Q3 30 67 30 97 kittens – 37 kittens – 91 3. Determine:the smallest value within the two inner fences The largest value within the two inner fences 4. Draw a horizontal line and mark observation values on it so that all data in the set are covered. 5. Draw a box that has Q1 as one side and Q3 as the other. Draw a vertical line at Q2. whisker whisker This shows the kitten data is skewed to the right. (the right whisker is longer) | 35 | 40 | 45 | 50 | 55 | 60 | 65 | 70 | 75 | 80 | 85 | 90 Make a mark at the lowest value inside the lower inner fence. Make a mark at the largest value inside the upper inner fence. Connect these to the box. These are the whiskers Any values outside the two inner fences would be indicated by asterisks. These would be outliers. For the kittens there are none outside the fences. So no outliers for the kittens. Another example: | 95 Problem 30 page 114 This is the number of runners left on bases by each of 30 Major league teams. 15.5 8th 3 4 4 5 5 5 5 5 6 6 6 6 6 6 6 7 7 8 8 8 8 8th 8 Place of Q2 8 9 9 10 10 11 13 18 30 1 15.5 Remember this is the place of the value. 2 Q2 67 6.5 2 Place of Q1 15 1 8 2 Therefore Q2 5 Place of Q1 15 1 8 2 Therefore Q2 5 , Q3 8 IQR Q3 Q1 8 5 3 1.5 IQR 1.5 3 4.5 Lower inner fence: f L 5 4.5 .5 Upper inner fence: fU 8 4.5 12.5 Notice, there are two values not within the fences: 13 and 18. These are outliers and we represent them with asterisks * | 0 | 2 | 4 | 6 | 8 | 10 | 12 * | 14 | 16 | 18 | 20 We can see from the box and whisker plot that the data is skewed right with two outliers.