Section 2.4 Numerical Measures of Central Tendency

2.4.1 Definitions

Mean: The mean of a quantitative dataset is the sum of the observations in the dataset divided by the number of observations in the dataset.

Median: The median (m) of a quantitative dataset is the middle number when the observations are arranged in ascending order.

Mode: The mode of a dataset is the observation that occurs most frequently in the dataset.

2.4.2 How to calculate these

Mean: There are two means, the population mean μ and the sample mean x̄. The calculation of both is the same, except that μ is calculated for the entire population while x̄ is calculated for a sample taken from that population. From now on we will work with x̄, since in practice we never calculate μ; estimating μ from a sample, rather than calculating it, is the whole point of inferential statistics.

Dataset: x₁, x₂, x₃, x₄, x₅, ..., xₙ, so there are n observations in the dataset.

Sample mean: x̄ = (Σ xᵢ)/n, where the sum runs over i = 1, ..., n.

Median: Arrange the n observations in order from smallest to largest, then:
if n is odd, the median (m) is the middle number;
if n is even, the median is the mean of the middle two numbers.

Given a histogram, the median is the point on the x-axis such that half the area under the histogram lies to the left of it and half lies to the right. An example of finding the median from a histogram with class intervals is shown below.

[Figure: histogram with the median marked; 50% of the area lies on each side of the median]

Mode: If given a dataset, the mode is simply the value with the highest relative frequency. If given a relative frequency distribution with class intervals, the mode is taken to be the midpoint of the class interval with the highest relative frequency. This class interval is called the modal class. The mode measures data concentration and so can be used to locate the region in a large dataset where much of the data is concentrated.

NOTE: unlike the mean and median, the mode of a raw dataset must be an element of the original dataset.

2.4.3 Example

Calculate the mean, median and mode for the following datasets:

Example A: Dataset: 5, 3, 8, 5, 6

x̄ = (5 + 3 + 8 + 5 + 6)/5 = 5.4
Mode = 5
Median: ordered data 3, 5, 5, 6, 8, so m = 5
Note: 5.4 is not one of the original values in the dataset.

Example B: Dataset: 11, 140, 98, 23, 45, 14, 56, 78, 93, 200, 123, 165

n = 12, x̄ = 1046/12 = 87.167
Median: ordered data 11, 14, 23, 45, 56, 78, 93, 98, 123, 140, 165, 200
m = (78 + 93)/2 = 85.5

Example C: Generate a dataset containing 9 numbers using the day, month and year of your birth (DD/MM/YY) and those of the people sitting to your left and right, and repeat the calculations.

Example D: Data given as a frequency distribution with class intervals:

Class Interval   Frequency
2 -< 4           3
4 -< 6           18
6 -< 8           9
8 -< 10          7

Modal class is 4 -< 6, as its frequency of 18 is the highest; the mode is taken to be the midpoint of this interval, so mode = 5.

Mean = (3*3 + 5*18 + 7*9 + 9*7)/(3 + 18 + 9 + 7) = 225/37 = 6.081 (using the interval midpoints).

Median: There are 37 observations in this dataset, so the median is the 19th ordered observation. There are 3 observations in the first class interval 2 -< 4, and since 19 - 3 = 16 we need the 16th observation in the class interval 4 -< 6. Assuming the observations are distributed uniformly within each class interval, the 16th of the 18 observations in this interval should lie 16/18 = 0.89 of the way between 4 and 6. The width of the interval is 2 units, and 2 * 0.89 = 1.78, so:

median (m) = 4 + 1.78 = 5.78
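The calculations above can be checked with a few lines of code. The following is a minimal Python sketch (not part of the original notes): it uses the standard statistics module for the raw data of Example A and reproduces the grouped-data estimates of Example D, assuming, as in that example, that observations are spread uniformly within each class interval.

```python
from statistics import mean, median, mode

# Example A: raw data
data_a = [5, 3, 8, 5, 6]
print(mean(data_a), median(data_a), mode(data_a))   # 5.4  5  5

# Example D: data summarised by class intervals (lower bound, upper bound, frequency)
classes = [(2, 4, 3), (4, 6, 18), (6, 8, 9), (8, 10, 7)]
n = sum(f for _, _, f in classes)

# Mode: midpoint of the modal class (the interval with the highest frequency)
lo, hi, _ = max(classes, key=lambda c: c[2])
grouped_mode = (lo + hi) / 2

# Mean: use the midpoint of each interval as a stand-in for its observations
grouped_mean = sum((lo + hi) / 2 * f for lo, hi, f in classes) / n

# Median: walk through the intervals until the one containing the middle
# observation is reached, then interpolate assuming a uniform spread within it
target = (n + 1) // 2          # the 19th observation when n = 37 (n odd, as in Example D)
cum = 0
for lo, hi, f in classes:
    if cum + f >= target:
        grouped_median = lo + (hi - lo) * (target - cum) / f
        break
    cum += f

print(grouped_mode, round(grouped_mean, 3), round(grouped_median, 2))  # 5.0  6.081  5.78
```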
2.4.4 Mean vs Median vs Mode - which measures the centre best?

Choosing which of these three measures to use in practice can sometimes seem like a difficult task. However, if we understand a little about the relative merits of each we should at least be able to make an informed decision.

If the distribution is symmetric, then Mean = Median.
If the distribution is positively skewed (skewed to the right), then Mean > Median.
If the distribution is negatively skewed (skewed to the left), then Median > Mean.

So the difference between the mean and the median can be used to measure the skewness of a dataset.

[INSERT SLIDE]

Note: The presence of outliers affects the mean but not the median. This can be seen from the diagrams and from the following example.

2.4.5 Example

Ten statistics graduates who are now working as statisticians are surveyed for their annual salary. The survey produced the following dataset:

£60,000  £20,000  £19,000  £22,000  £21,500  £21,000  £18,000  £16,000  £17,500  £20,000

Calculate the mode, median and mean:

Mode = £20,000
Median = £20,000
Mean = £23,500

Notice that the distribution is positively skewed: the presence of the one high earner has pulled the mean up, making it £1,500 higher than the highest of all the salaries excluding the £60,000. For this dataset the mean is therefore not a good measure of the centre. Notice also that the median would be unaffected if the £60,000 were changed to a value such as £23,000, which is more in line with the rest of the data.

Because of this sensitivity of the mean to outliers, and because the median is completely insensitive to outliers, a revised version of the mean is sometimes used, called the trimmed mean.

2.4.6 Definition: Trimmed Mean

NOTE: This definition is NOT in the textbook.

A trimmed mean is computed by first ordering the data values from smallest to largest, then deleting a selected number of values from each end of the ordered list, and finally averaging the remaining values. The trimming percentage is the percentage of values deleted from EACH end of the ordered list. So if a dataset contained 10 observations and we wanted a 20% trimmed mean, we would delete 2 observations from the top of the ordered dataset and 2 from the bottom, leaving 6 values. The mean of these 6 remaining values is the 20% trimmed mean.

Example: Compute a 10% trimmed mean for the dataset in Example 2.4.5 and compare it with the previous measures.

There are 10 observations in the dataset, and 10% of 10 is 1, so we delete the largest and smallest observations, i.e. the values £60,000 and £16,000. The mean of the remaining values is then calculated:

10% trimmed mean = (£17,500 + £18,000 + £19,000 + £20,000 + £20,000 + £21,000 + £21,500 + £22,000)/8 = £19,875

This is very similar to the median and mode for this data.
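As a quick illustration (the function and its name are mine, not the textbook's), here is a minimal Python sketch of the trimmed-mean calculation applied to the salary data:

```python
def trimmed_mean(data, trim_pct):
    """Delete trim_pct% of the values from EACH end of the ordered data,
    then average what remains."""
    ordered = sorted(data)
    k = int(len(ordered) * trim_pct / 100)   # number of values cut from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

salaries = [60000, 20000, 19000, 22000, 21500, 21000, 18000, 16000, 17500, 20000]

print(sum(salaries) / len(salaries))   # ordinary mean: 23500.0
print(trimmed_mean(salaries, 10))      # 10% trimmed mean: 19875.0
```

If SciPy is available, scipy.stats.trim_mean(salaries, 0.1) performs the same calculation, with the trimming fraction given as a proportion rather than a percentage.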
2.4.7 Some more Examples

Sometimes we are not presented with a dataset but with a histogram or a stem-and-leaf diagram. It is still possible to measure the centre of the dataset from these graphs.

[INSERT MPG Histogram and Stem & Leaf diagram]

2.4.8 Example

Measurements were taken of the pulses of a number of UCD students; the observations are listed below. Find the median and mode of this dataset. What is the best way to present this data so that the median and mode can be calculated more easily?

2.4.9 Examples

Would you expect the datasets described below to possess relative frequency distributions which are symmetric, skewed to the right or skewed to the left?

A. The salaries of people employed by UCD
B. The grades on an easy exam
C. The grades on a difficult exam
D. The amount of time spent by students in a difficult 3 hour exam
E. The amount of time students in this class studied last week
F. The age of cars on a used car lot

2.4.10 Example

The median age of the population in Ireland is now 32 years; the median age of the Irish population in 1986 was 27. Interpret these values and explain the trend. What implications does this data have for Irish society? What are the consequences for the entertainment industry in Ireland?

Section 2.5 Numerical Measures of Variability

When we want to describe a dataset, providing a measure of its centre is only part of the story. Consider the following two distributions:

[Figure: two symmetric distributions, A and B, with the same centre but different spreads]

Both of these distributions are symmetric and meanA = meanB, modeA = modeB and medianA = medianB. However these two distributions are obviously different: the data in A is quite spread out compared to the data in B. This spread is technically called variability, and in this section we will examine how best to measure it.

2.5.1 Definitions

Range: The range of a quantitative dataset is equal to the largest value minus the smallest value.

Sample Variance: The sample variance, s², is equal to the sum of the squared distances from the mean divided by n - 1:

s² = Σ(xᵢ - x̄)² / (n - 1), where the sum runs over i = 1, ..., n.

An easier formula to use when calculating the variance by hand is:

s² = [ Σ xᵢ² - (Σ xᵢ)²/n ] / (n - 1)

Sample Standard Deviation: The sample standard deviation, s, is defined as the positive square root of the sample variance, s².

2.5.2 Which is best?

The meaning of the range is easily seen from its definition. It is a very crude measure of the variability contained in a dataset, as it looks only at the largest and smallest values and ignores the variability of the rest of the dataset.

Example A: These two datasets have the same range, but do they have the same variability?

Dataset 1: 1, 5, 5, 5, 9
Dataset 2: 1, 2, 5, 8, 9

No. Dataset 2 is obviously more spread out than Dataset 1, which has three values clustered at 5.

The sample variance is a much better measure of the variability in the whole dataset. This is because the term (xᵢ - x̄) in s² measures the distance of each observation from the centre of the dataset (as measured by the sample mean). As some of the xᵢ are smaller than x̄ and some are larger, these distances tend to cancel each other out. For this reason we square each (xᵢ - x̄) term before adding them together and dividing by n - 1, to get an average measure of the squared distance of each observation from the mean. The sample variance will therefore be small if all observations are close to the sample mean, and large if the observations are far from the mean. This is best illustrated by comparing the calculation of s² for the two datasets in Example A above.

Dataset 1: 1, 5, 5, 5, 9 with x̄ = 5
s² = [(1-5)² + (5-5)² + (5-5)² + (5-5)² + (9-5)²]/4
   = [(-4)² + 0² + 0² + 0² + 4²]/4
   = [16 + 0 + 0 + 0 + 16]/4 = 8

Dataset 2: 1, 2, 5, 8, 9 with x̄ = 5
s² = [(1-5)² + (2-5)² + (5-5)² + (8-5)² + (9-5)²]/4
   = [(-4)² + (-3)² + 0² + 3² + 4²]/4
   = [16 + 9 + 0 + 9 + 16]/4 = 12.5

So the increased spread in Dataset 2 is indeed measured by s².
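The two formulas for s² can be checked against each other in a few lines of Python. This is a minimal sketch, not part of the notes; note that statistics.variance in the standard library also uses the n - 1 divisor.

```python
from math import sqrt
from statistics import variance   # statistics.variance also divides by n - 1

def s2_definition(data):
    """Sample variance via the defining formula: sum of squared deviations / (n - 1)."""
    n = len(data)
    xbar = sum(data) / n
    return sum((x - xbar) ** 2 for x in data) / (n - 1)

def s2_shortcut(data):
    """Sample variance via the computational shortcut formula."""
    n = len(data)
    return (sum(x ** 2 for x in data) - sum(data) ** 2 / n) / (n - 1)

dataset1 = [1, 5, 5, 5, 9]
dataset2 = [1, 2, 5, 8, 9]

for d in (dataset1, dataset2):
    print(s2_definition(d), s2_shortcut(d), variance(d), sqrt(s2_definition(d)))
# Dataset 1: s² = 8.0  (s ≈ 2.83);  Dataset 2: s² = 12.5  (s ≈ 3.54)
```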
2.5.3 Samples and Populations

You will have noticed that although we described s² as an average of the squared distances from the sample mean, we in fact divided the sum of squares not by n but by n - 1. Since there are n observations in the dataset, surely the natural thing would be to divide by n rather than n - 1. The reason we divide by n - 1 is that, as always, we are interested in inferential statistics: we want to use s² (the sample variance) to estimate the population variance, which we denote by σ² (sigma squared). We will see later that s² with the n - 1 divisor provides a more accurate estimator of σ². So again we have a sample and a population, and two population characteristics estimated by two sample statistics:

Population Characteristic          Sample Statistic
Population Variance σ²             Sample Variance s²
Population Standard Deviation σ    Sample Standard Deviation s

2.5.4 Example

Two samples are chosen from a population:

Sample 1: 10, 0, 1, 9, 10, 0, 8, 1, 1, 9
Sample 2: 0, 5, 10, 5, 5, 5, 6, 5, 6, 5

Answer the following questions based on these two samples:

A. Examine both samples and identify which has the greater variability.
B. Calculate the range for each sample. Does your result agree with your answer to A?
C. Calculate the standard deviation for each sample. Does this result agree with your answer to A?
D. Which of the two, range or standard deviation, provides the better measure of variability?

Answers: Range1 = 10, Range2 = 10; s1 = 4.5814, s2 = 2.3944

2.5.5 Example

Once upon a time there were two lecturers, A and B, each delivering the same course to two different classes. When exam time came, both classes had the same average mark of 70%. The marks for Lecturer A's class, however, had a standard deviation of 25%, whereas the standard deviation for Lecturer B's class was 5%. Whose class would you rather be in?

Section 2.6 Interpreting the Standard Deviation
Chebyshev's Rule and the Empirical Rule

We have seen that the variance, and hence the standard deviation, of a dataset provides a relative measure of the variability it contains: given two datasets, the one with the larger standard deviation exhibits the greater variability. Is it possible for the standard deviation to give more than a relative measure of variability? Can we actually say how spread out the data is? The answer is yes. We will see later how to give detailed answers for particular distributions; in the meantime there are two rules which provide a good deal of information about general datasets.

2.6.1 Chebyshev's Rule

This rule applies to any dataset (population or sample), regardless of the shape of its frequency distribution. For k > 1, the proportion of observations which lie within k standard deviations of the mean is at least 1 - 1/k².

Computing this for several values of k gives:

k (number of standard deviations)   Proportion of observations within k standard deviations of the mean
2                                   at least 1 - 1/4 = 0.75
3                                   at least 1 - 1/9 = 0.89
4                                   at least 1 - 1/16 = 0.94
4.472                               at least 1 - 1/20 = 0.95
5                                   at least 1 - 1/25 = 0.96
10                                  at least 1 - 1/100 = 0.99

Note: Chebyshev's Rule gives us an idea of the spread of distributions. Because it is meant to work for all distributions regardless of their shape, it does not give definite, specific results; instead it tells us that "at least" a certain proportion of observations lie in a specified interval. The proportions in Chebyshev's Rule are therefore very conservative, and for certain distributions we may find a much higher proportion of observations within these intervals.
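A short Python sketch (mine, not from the notes) that regenerates the table above and checks the rule against Sample 1 from Example 2.5.4:

```python
from statistics import mean, stdev

def chebyshev_bound(k):
    """Lower bound on the proportion of observations within k standard deviations
    of the mean, valid for any distribution (k > 1)."""
    return 1 - 1 / k**2

for k in (2, 3, 4, 4.472, 5, 10):
    print(k, round(chebyshev_bound(k), 2))

# Checking the rule against Sample 1 from Example 2.5.4
sample1 = [10, 0, 1, 9, 10, 0, 8, 1, 1, 9]
xbar, s, k = mean(sample1), stdev(sample1), 2
within = sum(1 for x in sample1 if abs(x - xbar) <= k * s) / len(sample1)
print(within, ">=", chebyshev_bound(k))   # 1.0 >= 0.75 — the bound holds
```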
The Empirical Rule, by contrast, makes some definite statements about the proportion of observations in a specified interval, but it only works for symmetric, bell-shaped (mound-shaped) distributions. Even then the rule is an approximation, and more or less data than it indicates may lie in each interval.

2.6.2 The Empirical Rule

For a symmetric, bell-shaped distribution:
• Approximately 68% of the observations are within 1 standard deviation of the mean.
• Approximately 95% of the observations are within 2 standard deviations of the mean.
• Approximately 99.7% of the observations are within 3 standard deviations of the mean.

2.6.3 Some Examples

Example A: The following is a list of the times (in minutes) it takes 12 UCD students to get to college in the morning:

12, 23, 56, 14, 17, 21, 33, 42, 45, 38, 51, 29

Calculate x̄ and s, then calculate the percentage of the data between x̄ - 2s and x̄ + 2s and between x̄ - 3s and x̄ + 3s. Compare these results with the predictions of Chebyshev's Rule. Assuming that the data is distributed in an approximate bell shape, use the Empirical Rule to state the percentage of the data within 2 standard deviations of the mean and within 3 standard deviations of the mean. Comment on your results.

x̄ = 31.75, s = 14.78, 2s = 2 * 14.78 = 29.56, 3s = 44.34

x̄ - 2s = 31.75 - 29.56 = 2.19        x̄ + 2s = 31.75 + 29.56 = 61.31
x̄ - 3s = 31.75 - 44.34 = -12.59 (take as 0, since journey times cannot be negative)        x̄ + 3s = 31.75 + 44.34 = 76.09

Interval                          Actual   Chebyshev's     Empirical
x̄ - 2s to x̄ + 2s (2.19 to 61.31)  100%     at least 75%    approx. 95%
x̄ - 3s to x̄ + 3s (0 to 76.09)     100%     at least 89%    approx. 99.7%

This table illustrates very clearly how Chebyshev's Rule generally understates the amount of data in each interval, since it only gives a lower bound. The Empirical Rule provides, in this case, more accurate results.
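The following minimal Python sketch (not part of the notes) reproduces the calculations for Example A: it computes x̄ and s, counts how much of the data actually falls within 2 and 3 standard deviations of the mean, and prints the Chebyshev bound and Empirical Rule figure alongside.

```python
from statistics import mean, stdev

times = [12, 23, 56, 14, 17, 21, 33, 42, 45, 38, 51, 29]
xbar, s = mean(times), stdev(times)        # 31.75 and about 14.78

for k in (2, 3):
    lo, hi = xbar - k * s, xbar + k * s    # for k = 3 the lower limit is negative,
                                           # which for journey times effectively means 0
    actual = sum(1 for t in times if lo <= t <= hi) / len(times)
    chebyshev = 1 - 1 / k**2               # lower bound for any distribution
    empirical = {2: 0.95, 3: 0.997}[k]     # approximation for bell-shaped data
    print(f"within {k} sd ({lo:.2f} to {hi:.2f}): "
          f"actual {actual:.0%}, Chebyshev at least {chebyshev:.0%}, empirical approx {empirical:.1%}")
```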
Example B: A lecturer in UCD has assigned some problems to be done by the 120 students in her class. When it comes time to collect the problems, 9 students inform her that "the dog ate my homework". From many years of teaching classes of this size she has observed that the mean number of homeworks actually eaten by pets of all kinds is 3 homeworks, with a standard deviation of 0.8 homeworks. Should the lecturer believe that the homeworks of all 9 students were eaten by their dogs?

By Chebyshev's Rule, at least 1 - 1/k² of the observations should lie in the interval (x̄ - ks, x̄ + ks). This gives the following table (negative lower limits are set to 0, since counts cannot be negative):

k (number of standard deviations)   Interval     "At least" percentage of observations in interval
2                                   1.4, 4.6     75%
3                                   0.6, 5.4     89%
4                                   0, 6.2       93%
5                                   0, 7         96%
6                                   0, 7.8       97%
7                                   0, 8.6       98%
8                                   0, 9.4       98.4%

From this table we can see that at most about 2% of such observations should be as extreme as 9 eaten homeworks, since 9 lies more than 7 standard deviations above the mean. Remembering that Chebyshev's Rule is extremely conservative, we can conclude that the chances are very high that some of the students simply did not do their homework.

Example C: In Tombstone, Arizona Territory, people used Colt .45 revolvers, but they used different ammunition. Wyatt Earp knew that his brothers and Doc Holliday were the only ones in the territory who used Colt .45s with Winchester ammunition. The Earp brothers conducted tests on many different combinations of weapons and ammunition. They found that the dataset of observations produced by the combination of a Colt .45 with Winchester shells had a mean velocity of 936 feet/second and a standard deviation of 10 feet/second; the measurements were taken at a distance of 15 feet from the gun. When Wyatt examined the body of a cowboy shot in the back in cold blood, he concluded that the victim was shot from a distance of 15 feet and that the velocity of the bullet at impact was 1,000 feet/second. The dastardly Ike Clanton claimed that this cowboy was shot by the Earp brothers or Doc Holliday. Was Wyatt able to clear his good name using the Empirical Rule?

The distribution of this bullet velocity data should be approximately bell-shaped, so the Empirical Rule should give a good estimate of the percentage of the data within each interval.

k (number of standard deviations)   Interval      Chebyshev's "at least" percentage   Empirical approximate percentage
2                                   916, 956      75%                                 95%
3                                   906, 966      89%                                 99.7%
4                                   896, 976      93%                                 ~100%
5                                   886, 986      96%                                 ~100%
6                                   876, 996      97%                                 ~100%
7                                   866, 1006     98%                                 ~100%

This table quite clearly demonstrates that, since the bullet velocity in the shooting was 1,000 ft/sec and this lies more than 6 standard deviations above the mean, the probability is extremely high that the Earps were not responsible for this shooting. This is especially evident from the Empirical Rule column: practically 100% of bullet velocities should lie between 896 and 976 ft/sec.

Example C2: During "The Troubles" in Northern Ireland, both Republicans and Loyalists used 9mm handguns, but they used different brands of handgun and ammunition. The security forces in NI knew that the Republicans used Heckler & Koch 9mm handguns with Winchester ammunition. The security forces conducted tests on many different combinations of weapons and ammunition. They found that the dataset of observations produced by the combination of an H&K 9mm with Winchester shells had a mean velocity of 936 feet/second and a standard deviation of 10 feet/second; the measurements were taken at a distance of 15 feet from the gun. Forensic scientists examining the body of a shooting victim concluded that he was shot from a distance of 15 feet and that the velocity of the bullet at impact was 1,000 feet/second. Describe the distribution of the bullet velocities. Did they conclude that the shooter was a member of a Republican organisation or a Loyalist organisation?

The distribution of this bullet velocity data should be approximately bell-shaped, so the Empirical Rule should give a good estimate of the percentage of the data within each interval. The table is identical to that in Example C: since the bullet velocity in the shooting was 1,000 ft/sec, which lies more than 6 standard deviations above the mean, the probability is extremely high that Republicans were not responsible for this shooting. Practically 100% of bullet velocities should lie between 896 and 976 ft/sec.

2.6.4 Example to illustrate the difference between Chebyshev's Rule, the Empirical Rule and some actual data

A survey was conducted to measure the height of 14 year olds; a sample of 1052 children was measured and it was found that:

x̄ = 62.484 inches, s = 2.390 inches

A bell-shaped symmetric distribution provided a good fit to the data. Applying Chebyshev's Rule and the Empirical Rule we get:

k (number of SDs)   Interval (x̄ - ks, x̄ + ks)   Actual % of obs. in interval   Empirical Rule %   Chebyshev's Rule %
1                   60.094 - 64.874              72.1%                          68%                at least 0%
2                   57.704 - 67.264              96.2%                          95%                at least 75%
3                   55.314 - 69.654              99.2%                          99.7%              at least 89%

Clearly in this instance Chebyshev's Rule understates the proportions very severely.
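Examples B and C above both come down to asking how many standard deviations an observation lies from the mean, and how often an observation should lie that far out. Here is a minimal sketch of that calculation (the helper name is mine; the 1/k² tail figure is just Chebyshev's Rule rearranged):

```python
def sds_from_mean(x, mu, sigma):
    """How many standard deviations the observation x lies from the mean."""
    return (x - mu) / sigma

# Example B: 9 eaten homeworks, against a mean of 3 and sd of 0.8
k = sds_from_mean(9, 3, 0.8)
print(k, 1 / k**2)       # 7.5 sds above the mean; at most ~1.8% of observations lie that far out

# Example C / C2: impact velocity 1000 ft/sec, against a mean of 936 and sd of 10
k = sds_from_mean(1000, 936, 10)
print(k, 1 / k**2)       # 6.4 sds above the mean; at most ~2.4% by Chebyshev

# For bell-shaped data the Empirical Rule is far stronger: virtually nothing
# should lie beyond 3 standard deviations, never mind 6.4 or 7.5.
```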
2.6.5 Estimating the Standard Deviation from the Range

According to the Empirical Rule, for bell-shaped distributions almost all of the data should lie in the interval (x̄ - 3s, x̄ + 3s). So the range should be approximately 6s, i.e. (x̄ + 3s) - (x̄ - 3s). This gives us a crude but useful estimate of the standard deviation:

Standard deviation ≈ Range/6

Section 2.7 Numerical Measures of Relative Standing

While it is useful to be able to measure the centre of a dataset and its variability, we often also want to compare one observation with the rest of the observations in the dataset. Is one observation larger than many of the others? For example, suppose you get 35% on the exam for this course. You will probably feel quite bad about your performance, but what if 90% of the class actually did worse than you? Then you might feel a bit better about your 35%. So in some cases knowing how one observation compares with the others can be more useful than just knowing the value of that observation. This section introduces some different ways of measuring relative standing.

2.7.1 Definitions

Percentile: For any dataset, the pth percentile is the observation which is greater in value than p% of all the numbers. Consequently this observation is smaller than (100 - p)% of the data.

Z-Score: The z-score of an observation is the distance between that observation and the mean, expressed in units of standard deviations. So:

Sample z-score for an observation x:      Z = (x - x̄)/s
Population z-score for an observation x:  Z = (x - μ)/σ

The numerical value of the z-score reflects the relative standing of the observation. A large positive z-score implies that the observation is larger than most of the other observations. A large negative z-score indicates that the observation is smaller than almost all the other observations. A z-score of zero, or close to zero, means that the observation is located close to the mean of the dataset.

2.7.2 Examples

Example A: The 50th percentile of a dataset is the median (the median, remember, is the value which is larger than half of the data).

Example B: Dataset: 15, 3, 1, 7, 5, 17, 19, 11, 9, 13

In this dataset the 80th percentile is the value 15, as 15 is greater than or equal to 80% of the data. This is easily seen if we arrange the data in ascending order: 1, 3, 5, 7, 9, 11, 13, 15, 17, 19.

Exercise 2.79 in the textbook: The distribution of scores on a nationally administered college achievement test has a median of 520 and a mean of 540.
a. Explain how it is possible for the mean to exceed the median for this distribution.
b. Suppose you are told that the 90th percentile is 660. What does this mean?
c. Suppose you are told that you scored at the 94th percentile. What does this mean?

Answers:
a. The distribution is positively skewed (skewed to the right).
b. 90% of the test scores are below 660 and 10% are above.
c. 94% of the test scores were below yours and only 6% were above.

Example D: A sample of 120 statistics students was chosen and their exam results summarised; the mean and standard deviation were found to be x̄ = 53% and s = 7%. Eric and Kenny are two students in this class. Eric's exam result was 47%; what was his z-score? If Kenny's z-score is 2, what was his percentage on the exam?
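A minimal Python sketch (the helper names are mine, not the textbook's) showing the z-score formula applied to Example D, and the percentile convention used in Example B:

```python
def z_score(x, xbar, s):
    """Sample z-score: how many standard deviations x lies from the sample mean."""
    return (x - xbar) / s

def from_z(z, xbar, s):
    """Invert the z-score formula to recover the original observation."""
    return xbar + z * s

# Example D: class mean 53%, standard deviation 7%
print(z_score(47, 53, 7))    # Eric: about -0.86, a little below the mean
print(from_z(2, 53, 7))      # Kenny: a z-score of 2 corresponds to a mark of 67%

def percentile_rank(x, data):
    """Percentage of the data that is less than or equal to x
    (the convention used in Example B above)."""
    return 100 * sum(1 for v in data if v <= x) / len(data)

# Example B: 15 is greater than or equal to 80% of the dataset
data_b = [15, 3, 1, 7, 5, 17, 19, 11, 9, 13]
print(percentile_rank(15, data_b))   # 80.0
```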
2.7.3 Z-scores and the Empirical Rule

For a bell-shaped distribution the Empirical Rule tells us the following about z-scores:
1. Approximately 68% of the observations have a z-score between -1 and 1.
2. Approximately 95% of the observations have a z-score between -2 and 2.
3. Approximately 99.7% of the observations have a z-score between -3 and 3.

Example 2.14 in the textbook: Suppose a female bank employee believes that her salary is low as a result of sex discrimination. To substantiate her belief, she collects information on the salaries of her male counterparts. She finds that their salaries have a mean of $34,000 and a standard deviation of $2,000. Her salary is $27,000. Does this information support her claim of sex discrimination?

Answer: Calculate her z-score with respect to the distribution of her male counterparts' salaries:

Z = (x - x̄)/s = ($27,000 - $34,000)/$2,000 = -3.5

So the woman's salary is 3.5 standard deviations below the mean of the male salary distribution. If the male salaries are distributed in a bell shape, then the Empirical Rule tells us that very few salaries in this distribution should have a z-score below -3. Therefore a z-score of -3.5 represents either a highly unusual observation from the male salary distribution, or an observation from a different distribution altogether.

Do you think her claim of sex discrimination is justified?

Answer: We need more data: on the collection technique the woman used, on the length of time she has been in her job, on her competence at her job, and so on. If she truly chose a representative sample, if she had been employed there as long as the others, and if she was good at her job, then one might conclude that she was discriminated against.