Chapter 6: Putting Statistics to Work

In Chapter 5 we learned the five basic steps of a statistical study. We learned about different methods of sampling and the strengths and weaknesses of each (step 2). We also learned about experiments and observational studies and the benefits of both; these are both setups for collecting data (step 3). The only mention of step 4 was the margin of error and the confidence interval in surveys and polls. The only thing we heard about conclusions (step 5) is that correlation may hint that there could be a cause-and-effect relationship between the variables. In other words, Chapter 5 does not cover steps 4 and 5 in much detail. This brings us to Chapter 6.

Chapter 6 begins with a discussion of how to characterize data and how "average" can have several meanings and be misleading. The shape of the distribution can lead to large discrepancies between the mean, median and mode. Furthermore, it is not always clear how the average was calculated. Then the focus moves to the variation of a distribution. We learn about the five-number summary, the range of a distribution, and the standard deviation. These will be important concepts for inferring population parameters from sample statistics. In Unit 6C, the "bell curve", or more accurately the normal distribution, is analyzed; its significance and characteristics are discussed. Finally, we get to the pot of gold at the end of the rainbow. This is where the power of statistics lies: statistical inference. This sets guidelines for properly drawing conclusions from a study and for understanding what the statistics actually tell you. We learn that all statistics can really tell us is how likely it is that we got our data by chance. The less likely it is that the results are just chance, the more confident we can be that something is actually going on.

Unit 6A: Characterizing Data

The purpose of this section is to discuss the various ways of measuring the "average" and to see the effects of the shape of the distribution on the different "averages." In particular, if the distribution is skewed left or right, the skewness will pull the mean in its direction more than the median or the mode. Other ways of describing distributions are also discussed.

What is average?

The word "average" is thrown around in statistics frequently. However, there are three types of averages: mean, median and mode. All three are distinct for most data sets.

Typically, in statistical studies the "average" is the mean: the sum of all the values divided by the number of values. If you were to make a histogram of the data and support it from below with one finger placed at the mean, the histogram would balance on your finger.

Another type of average is the median. This is just the "center" of the distribution: half of the data values are above it, and half are below it. This average is less commonly used. However, the advantage of the median is that outliers do not affect it as much as they affect the mean.

The mode is rarely used in statistics; it is just the value that occurs most frequently. If you had a frequency table, it would be the category with the highest number in the frequency column. Unlike the mean and the median, the mode can be used for qualitative data. However, most studies use numerical variables, so there is some sort of natural ordering. For example, there is no natural ordering on hair color, but there is for salary.

Deciding which "average" to use can be a hard decision. A general rule of thumb is that if you have one or more outliers, your mean can be way off, so the median should be used.
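Since the notes keep coming back to mean versus median, here is a minimal Python sketch, using a small made-up data set of hourly wages, that computes all three averages with the standard library and shows how a single outlier drags the mean much further than the median.

```python
from statistics import mean, median, mode

# A small, made-up data set: hourly wages (in dollars) for nine employees.
wages = [12, 13, 13, 14, 15, 15, 15, 16, 17]

print(mean(wages))    # about 14.4, the balance point of the histogram
print(median(wages))  # 15, the middle value
print(mode(wages))    # 15, the most frequent value

# Add one outlier (the owner's "wage") and watch the mean jump
# while the median barely moves.
wages_with_outlier = wages + [250]
print(mean(wages_with_outlier))    # 38, no longer representative of anyone
print(median(wages_with_outlier))  # 15.0, essentially unchanged
```

The statistics module implements exactly the definitions given above, so it is a convenient way to check hand calculations.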
Alternatively, in reality some studies throw out outliers and then calculate the mean as the average. This is skating on thin ice: there needs to be a very good reason to throw out a data point. Unfortunately, this is done more often than you would think. The main reason is that the statistical community at large prefers the mean as the average, and since outliers can make the mean unrepresentative of the data, outliers are often disregarded when the mean is calculated.

For example, let's say you are measuring the speed of the processors in the new Apple G6 computers. You get a random sample of 5 computers and test each one with a simple program, recording how long it takes to finish. You get the following times (in seconds): 1.32, 1.35, 1.34, 1.37, 0.15. Here is the problem: a time of 0.15 seconds would require the processor to perform each operation faster than the speed of light allows, so it is physically impossible for the program to have completed in 0.15 seconds. Now, if we compute the mean with this value included, we get 1.106 seconds. If we compute the mean without this value, we get 1.345 seconds. Finally, if we compute the median of all five measurements, we get 1.34 seconds. So the median is essentially the same as the mean computed without the outlier (the two will typically be close). In this case we would be justified in throwing out the 0.15-second result, because it is impossible by the laws of physics. It would be misleading to include it, since the resulting average time would be lower than all but one of the measurements and would make the processors look faster than they really are. The moral of the story is that outliers can have a dramatic effect on the mean, and one must be very careful about how the "average" of a data set is computed so that it is not misleading. In cases where there is an outlier, the median is often a better measure of the "average." See the textbook for its examples on Confusion About "Average".

Shapes of Distributions

There are several ways of describing a distribution. To get the best description, if possible, all three of the following should be used: number of peaks, skewness (or symmetry), and variation. Below are three example distributions.

[Figure: three example histograms, titled "Two Peaks," "Right Skewed," and "Left Skewed."]

The number of peaks is pretty obvious: you look for humps in the histogram and count them. The first graph has two peaks, and the other two have only one peak each. The number of peaks usually tells you something about the data being measured. For example, if you measured the heights of adult Americans to the nearest inch, you would probably notice two peaks. This would tell you that there may be a natural stratification of the data; in this case, men and women.

The skewness or symmetry of the data is also apparent from a histogram. If there is a tail on one side of the data that is not on the other side, then the data is skewed in the direction of the tail. There are many possible explanations for skewness. For example, if you look at the distribution of grades in a large class where most of the students do well, there will be a tail towards the lower grades. This is likely because you can't score higher than 100%, but you can score as low as 0%. Since most of the students do well, their grades are bunched around As and Bs, whereas the rest of the class is spread out through Cs, Ds and Fs. So, there would probably be a tail towards the Fs. When data is skewed, the mean is pulled in the direction of the skewness. The median may also be pulled in that direction, but not as much as the mean.
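To see that pull in numbers, here is a minimal Python sketch using a small, made-up right-skewed data set (hypothetical household incomes in thousands of dollars): the long right tail drags the mean well above the median.

```python
from statistics import mean, median

# Hypothetical household incomes (in thousands of dollars).
# Most values are modest, with a few very high incomes forming a long
# right tail, so the distribution is skewed to the right.
incomes = [28, 31, 34, 36, 38, 41, 44, 47, 52, 60, 75, 120, 450]

print(median(incomes))  # 44, the middle household
print(mean(incomes))    # about 81, pulled toward the high-income tail
```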
Symmetry is also clear from the histogram. If the left side is pretty much a mirror image of the right side, then the data is symmetric. For example, if you plotted the heights of adult American men, you would expect the distribution to be symmetric. When data is symmetric, the mean and median are the same, and if there is a single peak the mode matches them as well. Typically, when there is just random variation in the data, the distribution will be symmetric. However, one thing the book is not clear about (and may even be misleading on) is that the values in a tail may or may not be outliers. The book points to the tail as outliers, but there may be a good reason for the data to be skewed. For example, the distribution of the incomes of all American families is skewed towards the high incomes. There are good reasons for this data to be skewed, and the "outliers" (according to the book) tell a story.

Finally, the variation of the data tells how spread out it is. If all of the data is close together, there is little variation. If the data take on a wide range of values, there is large variation. However, be careful about interpreting variation qualitatively, like "a lot of variation" or "a little variation." Changing the scale of the x-axis can make "a lot of variation" look like "a little variation" and vice versa.

Unit 6B: Measures of Variation

Variation is a useful idea in a quantitative sense. In a way, it tells you how close the data are to the "average" (more on which average later). There are three main ways of measuring variation. The crudest is the range. A little better are the quartiles and the five-number summary. The best is the standard deviation.

The range is simply the difference between the maximum value and the minimum value. This is a barbaric way of telling how spread out the data is. The plus side is that it is VERY easy to calculate. The down side is that it is not very useful, since it says nothing about any measure of average and nothing about the shape of the distribution.

Quartiles and the Five-Number Summary

Quartiles are a way of dividing the data up into four chunks with equal numbers of data values in each chunk. The quartiles are calculated by computing medians. To get the 2nd quartile (or middle quartile), you simply compute the median of the data. To get the 1st quartile (or lower quartile), you take all the data points below the median (the lower half of the values) and compute their median. Similarly, for the 3rd (or upper) quartile, you find the median of the data points above the middle quartile. To complete the five-number summary, you include the minimum and maximum values.

Let's say you want to know the five-number summary for the final grades in introductory physics. The grades were (in descending order): 98, 96, 94, 89, 84, 83, 83, 81, 79, 78, 76, 76, 75, 75, 73, 71, 70, 69, 64, 52. The median is 77 (since there are 20 grades, and 77 is halfway between the middle two grades of 76 and 78). So, the middle quartile is 77. This naturally breaks the data into two parts: those above the median and those below it. Looking at the grades above the median (98, 96, 94, 89, 84, 83, 83, 81, 79, and 78), we can calculate their median: 83.5. This is the upper quartile, since it is the median of the top half of the data. Similarly, the lower half of the data (76, 76, 75, 75, 73, 71, 70, 69, 64 and 52) has a median of 72. That is, 72 is the lower quartile. To finish off the five-number summary, we add the minimum value of 52 and the maximum value of 98. Putting it all together, the five-number summary is: 52, 72, 77, 83.5, 98.
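Here is a minimal Python sketch that computes the five-number summary exactly the way it is described above (the median of the whole set, then the medians of the lower and upper halves). Library functions such as numpy's percentile use slightly different quartile conventions, so the halves are split by hand here; the output reproduces 52, 72, 77, 83.5, 98.

```python
from statistics import median

grades = [98, 96, 94, 89, 84, 83, 83, 81, 79, 78,
          76, 76, 75, 75, 73, 71, 70, 69, 64, 52]

def five_number_summary(data):
    values = sorted(data)            # work from lowest to highest
    n = len(values)
    mid = median(values)             # middle (2nd) quartile
    lower_half = values[:n // 2]     # values below the median
    upper_half = values[-(n // 2):]  # values above the median
    return (min(values), median(lower_half), mid,
            median(upper_half), max(values))

print(five_number_summary(grades))   # (52, 72.0, 77.0, 83.5, 98)
```

The same helper works for any list of numbers, though keep in mind that different textbooks split the halves slightly differently when the number of values is odd.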
This summary gives you, at a glance, both the range and the median. Furthermore, it gives you some idea of the skewness, by comparing the gaps between the median and the lower/upper quartiles and between the quartiles and the minimum/maximum. However, looking at numbers is not always as telling as a picture. Therefore, we have the boxplot, a plot designed to represent the five-number summary clearly.

[Figure 1: Box plot of the grades, showing the five-number summary 52, 72, 77, 83.5, 98 on a scale from 50 to 100.]

This plot tells us a lot more than the range. It gives us clues about symmetry or skewness and about the spread (variation). The "whisker" from the lower quartile (72) down to the minimum (52) is longer than the whisker from the upper quartile (83.5) up to the maximum (98). This suggests the grades are skewed to the left, with a tail toward the low grades, just as we discussed for grade distributions in Unit 6A.

Standard Deviation

There is a reason this is called the "standard" deviation: it is the most common measure of variation (and the most precise). However, unlike the five-number summary, which is based on medians, the standard deviation measures how much the values deviate from the mean. It is a mathematical quantity used in a lot of the theory behind drawing conclusions from statistical analysis. Technically,

standard deviation = √( (sum of (value − mean)²) / (number of values − 1) ).

To calculate this, follow these steps:

1. Compute the mean, and then for each value calculate its deviation from the mean (data value − mean).
2. Square each of these deviations.
3. Add up the squared deviations.
4. Divide the sum by one less than the number of values (number of values − 1).
5. Take the square root of the answer from step 4.

As an example, we will use the same grade data we made the box plot with. The mean of the data is 1566/20 = 78.3. We will calculate everything in a table:

value   deviation   deviation squared
 98       19.7          388.09
 96       17.7          313.29
 94       15.7          246.49
 89       10.7          114.49
 84        5.7           32.49
 83        4.7           22.09
 83        4.7           22.09
 81        2.7            7.29
 79        0.7            0.49
 78       -0.3            0.09
 76       -2.3            5.29
 76       -2.3            5.29
 75       -3.3           10.89
 75       -3.3           10.89
 73       -5.3           28.09
 71       -7.3           53.29
 70       -8.3           68.89
 69       -9.3           86.49
 64      -14.3          204.49
 52      -26.3          691.69

The sum of the squared deviations is 2312.2. Now we divide this by 20 − 1 = 19, giving 2312.2/19 ≈ 121.7. Finally, we take the square root: √121.7 ≈ 11.0. Therefore, the standard deviation of this set of class grades is about 11.0.

Finally, the book gives a rule of thumb for estimating the standard deviation: the standard deviation is approximately 1/4 of the range. In our example, the range is 98 − 52 = 46, so range/4 = 11.5. This is pretty close to our calculated value of 11.0. It is not precise, but it is in the ballpark. So, this rule of thumb is alright for approximating the standard deviation (but should not be used in place of it!).
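The steps above are easy to translate into code. This minimal Python sketch recomputes the mean and the sample standard deviation of the grades, both by following the five steps literally and with the standard library, as a check on the table.

```python
from math import sqrt
from statistics import stdev

grades = [98, 96, 94, 89, 84, 83, 83, 81, 79, 78,
          76, 76, 75, 75, 73, 71, 70, 69, 64, 52]

# Step 1: the mean, and each value's deviation from it.
mean = sum(grades) / len(grades)              # 78.3
deviations = [x - mean for x in grades]

# Steps 2 and 3: square the deviations and add them up.
sum_of_squares = sum(d ** 2 for d in deviations)   # about 2312.2

# Steps 4 and 5: divide by (n - 1) and take the square root.
sd = sqrt(sum_of_squares / (len(grades) - 1))

print(round(mean, 1), round(sd, 1))   # 78.3 11.0
print(round(stdev(grades), 1))        # 11.0, the same thing via the library
```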
Unit 6C: The Normal Distribution

The famous "bell curve" is a normal distribution. The reason this distribution is so important is that we know a lot about it mathematically. The normal distribution is symmetric with one distinct peak, and the highest point of the peak is the mean, median and mode of the distribution. A distribution is likely to be normal if the variation is expected to be random, there is a single peak (corresponding to the average value), and values become less likely the farther they are from the average. For example, the price of gasoline across the country would likely follow a normal distribution. The heights of all adult Americans would probably NOT be normal, since there is a different average height for men and for women.

The Standard Deviation in Normal Distributions

The standard deviation is an integral part of the normal distribution (it appears in the equation for the curve, which we do not discuss here). It also tells us a lot about how much of the data lies above or below a given point. In fact, there is a rule about normal distributions called the 68-95-99.7 rule. It asserts that 68% of the data lies between (mean − 1 standard deviation) and (mean + 1 standard deviation), 95% lies within 2 standard deviations of the mean, and 99.7% lies between (mean − 3 standard deviations) and (mean + 3 standard deviations).

[Figure: a normal curve with the axis marked at −3σ, −2σ, −σ, mean, +σ, +2σ and +3σ, showing the 68%, 95% and 99.7% regions.]

This rule allows us to calculate other things as well. For example, you can determine what percentage of the data points lie above (mean + 1 standard deviation). Since 68% of the data is between (mean − σ) and (mean + σ), we know that 32% is outside this range. Since the curve is symmetric, half of this is below (mean − σ) and half is above (mean + σ). Therefore, 16% of the data lies above (mean + σ).

The 68-95-99.7 rule also gives us some power in computing the percentile a given data point lies in, just by knowing the mean and standard deviation of the distribution and the value of the data point. However, there is a more precise way of determining the percentile of a data point: calculate the z-score (or standard score) and then look up the percentile in a table. The z-score measures how many standard deviations the value lies from the mean. The equation for calculating it is

z = (data value − mean) / (standard deviation) = (x − µ)/σ.

After computing the z-score, you look it up in a table to determine the percentile.

It is common for people to want to know their percentile on standardized tests (SAT, ACT or IQ tests). The SAT verbal section (and the math section) has a mean of 500 and a standard deviation of 100. From the 68-95-99.7 rule, 68% of the people who take the test score between 400 and 600 on the verbal section, and only 16% score above 600. This means that if you get a 600 on the verbal section, you are in the 84th percentile. If you go up one more standard deviation to 700, only 2.5% of the population scored higher than you! Finally, only 0.15% of the population scores 800 or above on the verbal section. In other words, only about 15 out of every 10,000 students taking the SAT get a perfect score on the verbal section. The same goes for math.

The power of the normal distribution is apparent from the preceding discussion. The mean and standard deviation alone determine the entire distribution. Furthermore, given a data value, it is possible to find out what percent of the data lies below it. This is a powerful tool for things like standardized tests, grading a class on a "curve," and so on.
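Instead of a printed z-score table, a short Python sketch can do the lookup: the percentile for a z-score is the cumulative area under the standard normal curve, which can be written with the error function from the math module. The SAT figures below are carried over from the example above (mean 500, standard deviation 100).

```python
from math import erf, sqrt

def z_score(value, mean, sd):
    """How many standard deviations the value lies from the mean."""
    return (value - mean) / sd

def percentile(z):
    """Percent of a normal distribution lying below a given z-score."""
    return 100 * 0.5 * (1 + erf(z / sqrt(2)))

# SAT verbal example: mean 500, standard deviation 100.
for score in (600, 700, 800):
    z = z_score(score, 500, 100)
    print(score, round(percentile(z), 1))
# 600 -> 84.1  (the 84th percentile, as the rule of thumb predicted)
# 700 -> 97.7  (about 2.3% score higher; the rule's 2.5% is a rounding)
# 800 -> 99.9  (only about 0.1% score higher)
```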
Unit 6D: Statistical Inference

The power of statistics lies in statistical inference. This is the ability to say that something is actually going on. However, the way we must go about it is to rule out the possibility that the results were due to chance. When an event is not likely to have occurred by chance alone, it is said to be statistically significant. For example, if the average adult male can throw a baseball at 60 miles per hour and a college student can throw a baseball at 93 miles per hour, the college student's throwing speed is statistically significant. Similarly, if we look at the President's approval rating, we see that on September 20, 2001 his approval rating was about 70%, while now (May 2005) it is about 42%. This change is statistically significant: the margin of error in each survey is about 5%, and the change is large compared to the margin of error. Therefore, we can say with confidence that his approval has gone down. However, if you compare a year ago to today, his rating has gone from 45% to 42%. This change is not statistically significant, since it is small compared to the margin of error. Therefore, we cannot say that his approval has gone down in the last year without more information.

However, simply saying whether or not an event is statistically significant is not quite enough. One has to specify the criterion used to decide that the event was "not likely to have occurred by chance." You specify how "not likely" by saying something is statistically significant at some level. For example, statistical significance at the 0.05 level means that there is less than a 5% chance that the observed result occurred by chance. Similarly, if there is less than a 1% chance that the result occurred by chance, it is statistically significant at the 0.01 level. With one of these specifications, it is clear what you mean by a statistically significant event.

Margin of Error and Confidence Intervals

In Chapter 5 we mentioned the margin of error and the confidence interval, but we did not get into the details of how they are calculated or what they mean, technically. We know that a sample statistic only approximates the actual value for the population (the population parameter). However, the population parameter is very likely to fall within the confidence interval. The "very likely" technically means that there is a 95% chance that the population parameter lies somewhere inside the confidence interval. (More precisely, it means that there is at most a 5% chance of getting measurements like these when the population parameter is actually outside our confidence interval.)

In fact, if a survey is carried out many times with the same number of people surveyed each time, the distribution of the measured proportions will be approximately a normal distribution. This gives us all of the tools described in Unit 6C. This distribution has a special name: the sampling distribution. The name comes from the fact that it is a distribution of the values of a sample statistic over many trials. The Central Limit Theorem is the mathematical result that describes these distributions, and it tells us that the standard deviation of the sampling distribution of a proportion is at most 1/(2√n). From the 68-95-99.7 rule, 95% of the values lie between (mean − 2 standard deviations) and (mean + 2 standard deviations), that is, between (mean − 1/√n) and (mean + 1/√n), where n is the number of participants in each survey. Therefore, the margin of error for 95% confidence in a survey is about 1/√n, and the 95% confidence interval is found by adding and subtracting 1/√n to the sample statistic (the sample proportion).

This backs up something that seems to make sense: the more people you survey, the more accurate the survey is (i.e. the survey has a smaller margin of error when you survey more people). One thing to note is that a poll of only 500 people has a margin of error of about 4.5 percent. A 5% margin of error corresponds to a sample of only 400 people. A poll of 1000 people has a margin of error of about 3.2%. Most surveys reported on the news have a margin of error of 3% or 5%; a 3% margin of error corresponds to a sample of about 1111 people. In other words, when the news gives you the results of a "national" survey or poll with a 3% margin of error, they really only asked about 1000 people.

Moreover, 95% confidence means that there is a 5% chance that the population parameter is actually outside the confidence interval. In other words, on average, 1 out of every 20 of these surveys has a population parameter outside its confidence interval!
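The 1/√n rule is easy to check. This minimal Python sketch computes the approximate 95% margin of error for the sample sizes mentioned above and builds a confidence interval around a made-up sample proportion (52% in a hypothetical poll of 1000 people).

```python
from math import sqrt

def margin_of_error(n):
    """Approximate 95% margin of error for a survey of n people."""
    return 1 / sqrt(n)

for n in (400, 500, 1000, 1111):
    print(n, round(100 * margin_of_error(n), 1), "%")
# 400 -> 5.0 %, 500 -> 4.5 %, 1000 -> 3.2 %, 1111 -> 3.0 %

# 95% confidence interval for a hypothetical poll: 52% of 1000 people.
p_hat = 0.52
moe = margin_of_error(1000)
print(round(p_hat - moe, 3), "to", round(p_hat + moe, 3))  # 0.488 to 0.552
```

Note that 1/√n is the quick approximation used here; more precise formulas also depend on the sample proportion itself, which is why 1/(2√n) above is described as an upper bound on the standard deviation.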
Hypothesis Testing

This is one of the most powerful techniques in statistics. The technique is used to REJECT a claim. It works by assuming the population parameter is a specific value; this assumption is called the null hypothesis. In experiments, the null hypothesis is the assumption that the treatment had NO EFFECT. The alternative hypothesis is accepted (i.e. there was a difference between the control and treatment groups) ONLY IF the null hypothesis is rejected. So, in a way, it is possible to get an affirmative result by rejecting the null hypothesis. The method can ONLY say that the null hypothesis is most likely false. If we cannot reject the null hypothesis, we do not know that the null hypothesis is true. All we can say in that case is that there is not sufficient evidence to reject the null hypothesis and accept the alternative.

Now that we know what the possible outcomes are, we have to decide how to accept or reject a null hypothesis. Rejecting the null hypothesis is saying that it is not likely that we would have obtained our data from a population with the assumed parameter value. There is always some probability that we did get this data from a population whose parameter really is the value in the null hypothesis, but that probability can be very small. When it is small, it is not likely that the population parameter is the value in the null hypothesis. If, for instance, the probability of getting values as extreme as ours from a population with the null hypothesis's parameter is less than 1 in 100 (i.e. 0.01, or 1%), the test is significant at the 0.01 level. This means that if the null hypothesis were true, there would be less than a 1% chance of seeing data like ours, which gives us strong reason to believe the population parameter is different from the value in the null hypothesis. In this sense we are 99% confident in rejecting the null hypothesis. Typically, significance at the 0.05 level (data this extreme would occur by chance less than 1 time in 20, or 5% of the time, if the null hypothesis were true) is enough to reject the null hypothesis. However, sometimes one needs to be very careful, such as in medical studies, and significance at the 0.01 level is usually required (for example, by the FDA). If, on the other hand, this probability is greater than 1 in 20, there is not sufficient evidence to reject the null hypothesis. In that case, you can draw no conclusion other than that there is not enough reason to doubt the null hypothesis.
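A concrete way to see "how likely is data this extreme if the null hypothesis is true" is to simulate it. This minimal Python sketch uses a made-up example in the spirit of this section: the null hypothesis is that a coin is fair, we observe 62 heads in 100 flips, and we estimate by simulation how often a fair coin does at least that well. If that probability is below 0.05 (or 0.01), the result is significant at that level.

```python
import random

random.seed(0)

observed_heads = 62   # made-up observation: 62 heads in 100 flips
flips = 100
trials = 100_000

# Null hypothesis: the coin is fair (heads with probability 0.5).
# Count how often a fair coin gives a result at least as extreme as ours.
extreme = 0
for _ in range(trials):
    heads = sum(random.random() < 0.5 for _ in range(flips))
    if heads >= observed_heads:
        extreme += 1

p_value = extreme / trials
print(p_value)                          # roughly 0.01 under the null hypothesis
print(p_value < 0.05, p_value < 0.01)   # True if significant at the 0.05 / 0.01 level
```

The simulation stands in for the table lookup a textbook test would use; the logic is the same, namely asking how rare the observed result would be if nothing were going on.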
Chapter 6 Definitions

distribution: the distribution of a variable (or data set) describes the values taken on by the variable and the frequency (or relative frequency) of those values.

mean: a measure of the average, defined as (sum of all values)/(total number of values).

median: a measure of the average, defined as the middle value of the data set. If there is an even number of values, it is halfway between the two middle values.

mode: a measure of the average, defined as the most common value (or group of values) in a distribution.

outlier: a data value that is much higher or much lower than almost all other values.

symmetric distribution: a distribution whose left half is a mirror image of its right half.

left-skewed distribution: a distribution whose values are more spread out on the left side (the left side has a tail).

right-skewed distribution: a distribution whose values are more spread out on the right side (the right side has a tail).

variation: a measure of how widely data values are spread about the center of a distribution.

range: (of a data set) the difference between its highest and lowest data values.

lower quartile: (or first quartile) the median of the data values in the lower half of the data set.

middle quartile: (or second quartile) the median of the entire data set.

upper quartile: (or third quartile) the median of the data values in the upper half of the data set.

five-number summary: a summary of a data set consisting of the lowest value, lower quartile, middle quartile, upper quartile and highest value.

boxplot: a graphical representation of the five-number summary. The lower and upper quartiles are the ends of a box, the median is a line inside the box, and whiskers extend to the lowest and highest values.

standard deviation: a (good) measure of variation, based on how far the values deviate from the mean.

deviation: the difference between a data value and the mean (data value − mean).

normal distribution: a symmetric, bell-shaped distribution with a single peak. Its peak is the mean, median AND mode of the distribution.

68-95-99.7 rule: in a normal distribution, 68% of the values are between (mean − 1 standard deviation) and (mean + 1 standard deviation), 95% of the values are between (mean − 2 standard deviations) and (mean + 2 standard deviations), and 99.7% of the values are between (mean − 3 standard deviations) and (mean + 3 standard deviations).

standard score: (or z-score) the number of standard deviations a data value lies from the mean.

nth percentile: the smallest value in the set with the property that n% of the data values are less than or equal to it.

statistically significant: describes a result that is unlikely to have occurred by chance.

sampling distribution: the distribution of the values of a sample statistic (such as a sample mean or proportion) over many samples.

null hypothesis: a claim that the population parameter is a specific value (usually the claim that what you are testing has NO EFFECT).

alternative hypothesis: the claim that is accepted if the null hypothesis is rejected.

Chapter 6 Equations

µ = mean = (sum of all values) / (total number of values)

range = highest value (max) − lowest value (min)

σ = standard deviation = √( (sum of (deviations from the mean)²) / (total number of data values − 1) )

z = standard score = (value − mean) / (standard deviation)