Inference: statistical confidence · confidence intervals · confidence interval for a population mean · choosing the sample size

Inference distinguishes chance variation from permanent features of a phenomenon:
◦ Give the SAT to an SRS of 500 Indiana seniors; the sample mean is 461. What does this say about µ, the mean SAT score of all high-school seniors in Indiana?
◦ Are 12/20 improvements in a treatment group vs. 8/20 in a control group strong enough evidence in favor of a drug?

Methods of formal inference rely on the assumption that the data come from a properly randomized design, for example an SRS. The field of statistics gives methods that produce correct results a high percentage of the time (if repeated many times). The two most prominent methods are:
1. confidence intervals
2. tests of significance (hypothesis testing)

Example 1: Observe 15 plots of corn with yields, in bushels: 138, 139.1, 113, 132.5, 140.7, 109.7, 118.9, 134.8, 109.6, 127.3, 115.6, 130.4, 130.2, 111.7, 105.5. The sample mean is 123.8. What can be said about the mean yield µ of this variety of corn for the population? Assume that yield is N(µ, σ) with unknown µ and σ = 10 (for now, just assume σ is known). Then

x̄ ~ N(µ, σ/√n) = N(µ, 10/√15) = N(µ, 2.58)

By the 68-95-99.7% rule, 95% of the time the sample mean falls within 2 standard deviations of the population mean, i.e., within 2 × 2.58 = 5.16 of µ. Thus, 95% of the time:

µ − 5.16 ≤ x̄ ≤ µ + 5.16, or equivalently x̄ − 5.16 ≤ µ ≤ x̄ + 5.16

The random interval (x̄ − 5.16, x̄ + 5.16) covers the unknown (but nonrandom) population parameter µ 95% of the time; our confidence is 95%. We need to be extremely careful when interpreting this result. Here, µ is estimated to lie between x̄ ± 5.16 = 123.8 ± 5.16, i.e., in (118.64, 128.96). This particular confidence interval may or may not contain µ. However, the systematic method behind it gives intervals that cover the population mean µ in 95% of cases.

The Big Idea: the sampling distribution of x̄ tells us how close to µ the sample mean x̄ is likely to be. All confidence intervals we construct have a form similar to:

estimate ± margin of error

A level C confidence interval for a parameter has two parts:
◦ an interval calculated from the data, of the form estimate ± margin of error;
◦ a confidence level C, the probability that the interval will capture the true parameter value in repeated samples. In other words, the confidence level is the success rate of the method. It does not give the probability that our parameter is inside this particular interval.

Confidence interval for the mean of a Normal population: choose an SRS of size n from a population having unknown mean µ and known standard deviation σ. A level C confidence interval for µ is

x̄ ± z* σ/√n

The critical value z* is found from the standard Normal distribution. For C = 95%, find z* from Table A; see also the last row of Table D. In Table D, for a particular confidence level C, the appropriate z* value is just above it.

Example: Tim Kelley has weighed himself once a week for several years. Last month he weighed himself 4 times, with an average of 190.5 pounds. Tim's past data reveal that over relatively short periods of time his weight measurements are approximately Normal with a standard deviation of about 3 pounds. Find a 90% confidence interval for his mean weight last month; then find a 99% confidence interval. (More confidence: wider interval. Less confidence: narrower interval.) Now suppose Tim had weighed himself only once last month, with the single observation x = 190.5. Estimate µ with 90% confidence. (A worked sketch of all of these follows below.)
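As a quick check of these examples, here is a minimal Python sketch (not part of the original course materials), using only the standard library's statistics.NormalDist for critical values; the helper name z_interval is ours.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(xbar, sigma, n, level):
    """Level-C z confidence interval for a Normal mean with known sigma."""
    z_star = NormalDist().inv_cdf((1 + level) / 2)  # critical value z*
    margin = z_star * sigma / sqrt(n)               # margin of error
    return (xbar - margin, xbar + margin)

# Example 1 (corn): x-bar = 123.8, sigma = 10, n = 15
print(z_interval(123.8, 10, 15, 0.95))  # ~(118.74, 128.86); the slides round z* to 2, giving (118.64, 128.96)

# Tim Kelley: x-bar = 190.5, sigma = 3
print(z_interval(190.5, 3, 4, 0.90))    # ~(188.03, 192.97)
print(z_interval(190.5, 3, 4, 0.99))    # ~(186.64, 194.36)
print(z_interval(190.5, 3, 1, 0.90))    # one observation: ~(185.57, 195.43)
```

Note how the 99% interval is wider than the 90% interval, and how going from n = 4 to n = 1 doubles the margin of error (a factor of √4 = 2).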
More sample size: narrower interval. Less sample size: wider interval.

The z confidence interval for the mean of a Normal population illustrates several important properties shared by all confidence intervals in common use. The user chooses the confidence level, and the margin of error follows. We would like high confidence and a small margin of error: high confidence means our method almost always gives correct answers; a small margin of error means we have pinned down the parameter precisely. The margin of error for the z confidence interval is

z* σ/√n

To decrease the margin of error, we can:
◦ make z* smaller (the same as choosing a lower confidence level C);
◦ get a bigger n: since n is under the square root sign, we must take four times as many observations to cut the margin of error in half;
◦ make σ smaller (usually not possible).

The spread of the sampling distribution of the mean is a function of the number of individuals per sample: the standard error is σ/√n. The larger the sample size n, the smaller the spread of the distribution of the sample mean; the spread decreases in proportion to 1/√n.

Choosing the sample size: to achieve a desired margin of error m, we need m ≥ z* σ/√n, so take a sample of size at least

n = (z* σ / m)²

You may need a certain margin of error (e.g., in drug trials or manufacturing specs). In most cases we have no control over the population variability (σ), but we can choose the number of measurements (n). Remember, though, that sample size is not always stretchable at will: there are typically costs and constraints associated with large samples. The best approach is to use the smallest sample size that can give you useful results.

Example: Tim wants a margin of error of only 2 pounds with 95% confidence. How many times must he weigh himself to achieve this goal? (A sketch of the calculation follows after the cautions below.)

Cautions:
◦ The data should be an SRS from the population. The confidence interval and sample size formulas are not correct for other sampling methods, and inference cannot rescue badly produced data.
◦ Confidence intervals are not resistant to outliers.
◦ If n is small (< 15) and the population is not Normal, the true confidence level will differ from C.
◦ The standard deviation σ of the population must be known. We will learn what to do when σ is unknown in Chapter 7.
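A sketch of the sample-size calculation for Tim's goal, under the same assumptions (Normal weights, known σ = 3); the helper name sample_size is ours, and we round up because n must be an integer.

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size(sigma, m, level):
    """Smallest n whose z margin of error is at most m: n >= (z* sigma / m)^2."""
    z_star = NormalDist().inv_cdf((1 + level) / 2)
    return ceil((z_star * sigma / m) ** 2)  # always round up, never down

n = sample_size(sigma=3, m=2, level=0.95)
print(n)                      # (1.96 * 3 / 2)^2 = 8.64, so n = 9 weighings
print(1.96 * 3 / sqrt(n))     # check: resulting margin ~1.96 <= 2 pounds
```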
Tests of significance (overview): the reasoning of tests of significance · stating hypotheses · test statistics · P-values · statistical significance · test for a population mean · two-sided significance tests and confidence intervals

Confidence intervals are one of the two most common types of statistical inference; use a confidence interval when your goal is to estimate a population parameter. The second common type, tests of significance, has a different goal: to assess evidence in the data about some claim concerning a population. A test of significance is a formal procedure for comparing observed data with a claim (also called a hypothesis) whose truth we want to assess. The claim is a statement about a parameter, like the population proportion p or the population mean µ. We express the results of a significance test in terms of a probability, called the P-value, that measures how well the data and the claim agree.

Example: Suppose a basketball player claims to be an 80% free-throw shooter. To test this claim, we have him attempt 50 free throws. He makes 32 of them, so his sample proportion of made shots is 32/50 = 0.64. What can we conclude about the claim based on this sample? We can use software to simulate 400 sets of 50 shots each under the assumption that the player really is an 80% free-throw shooter. The strength of the evidence against the player's claim is given by the probability that he would make as few as 32 out of 50 free throws if he really makes 80% in the long run. Assuming the actual parameter value is p = 0.80, the observed statistic is so unlikely that it gives convincing evidence that the player's claim is not true.

More examples:
1. Are Tim Kelley's weight measurements compatible with the claim that his true mean weight is 187 pounds?
2. In a random sample of 100 light bulbs, 7 are found defective. Is this compatible with the manufacturer's claim that only 5% of the light bulbs produced are defective?

What are we in favor of or against, and how do we state this in terms of an appropriate hypothesis? A hypothesis is a statement about the parameters of a population or model, not about the data at hand (we have the data and can answer questions about it directly). The results of a test are expressed in terms of a probability that measures how well the data and the hypothesis agree. This is reminiscent of confidence, but altogether different.

In hypothesis testing we state two hypotheses:
◦ The null hypothesis H0: the claim that is initially favored or believed to be true, often a default or uninteresting situation of "no effect" or "no difference". We then determine whether there is strong enough evidence against it; the test of significance is designed to assess the strength of the evidence against H0.
◦ The alternative hypothesis Ha: the claim that we "hope" or "suspect" is true instead of H0. Sometimes it is easier to begin with Ha and then set up H0 as the statement that the hoped-for effect is not present.

For the weight example:
◦ H0: µ = 187 (his true mean weight is 187 pounds) vs. Ha: µ > 187 (he weighs more than 187 pounds): a one-sided alternative, looking for a departure in one direction (suspect the weight is higher).
◦ H0: µ = 187 vs. Ha: µ < 187: suspect the weight is lower; one-sided Ha.
◦ H0: µ = 187 vs. Ha: µ ≠ 187: suspect the weight is different; two-sided Ha.
Note: you must decide on the setting, based on general knowledge, before you see the data or other measurements.

Translate each of the following research questions into appropriate hypotheses:
◦ Census Bureau data show that the mean household income in the area served by a shopping mall is $62,500 per year. A market research firm questions shoppers at the mall to find out whether the mean household income of mall shoppers is higher than that of the general population.
◦ Last year, your company's service technicians took an average of 2.6 hours to respond to trouble calls from business customers who had purchased service contracts. Do this year's data show a different average response time?

Back to Tim Kelley: his driver's license gives his weight as 187 pounds. Recall that last month's mean weight was 190.5, with a sample size of 4, and the population standard deviation is 3. What is the probability of observing a sample mean of 190.5 or larger when the true population mean is 187? This probability is the P-value: the probability, computed assuming that H0 is true, that the test statistic would take a value as extreme as or more extreme than the one actually observed. (A sketch of this calculation follows below.)
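A minimal sketch of both P-value calculations above, assuming the stated models (Binomial(50, 0.8) for the free throws; Normal with known σ = 3 for Tim); standard library only.

```python
from math import comb, sqrt
from statistics import NormalDist

# Free throws: P(making 32 or fewer of 50) if p = 0.80 -- exact binomial tail
p_shots = sum(comb(50, k) * 0.80**k * 0.20**(50 - k) for k in range(33))
print(p_shots)                       # below 1%: convincing evidence against the 80% claim

# Tim Kelley: H0: mu = 187 vs Ha: mu > 187, with sigma = 3, n = 4, x-bar = 190.5
z = (190.5 - 187) / (3 / sqrt(4))    # = 3.5 / 1.5 = 2.33
p_value = 1 - NormalDist().cdf(z)    # upper-tail probability
print(z, p_value)                    # z ~ 2.33, P ~ 0.0098
```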
Example 1 (Tim Kelley): P-value = P(Z ≥ 2.33) ≈ 0.0098. This is the P-value of the test (or of the data, given the testing procedure). If it is small, it serves as evidence against H0. Note that we need to know the distribution of the test statistic under H0 to calculate the P-value.

When the P-value is small, there are two possibilities:
1. the null hypothesis is true and our observed effect is extremely rare; or, more likely,
2. the null hypothesis is false, and our data are telling us this through the small P-value.

So we need a cut-off point (a decisive value) to compare our P-value against in order to draw a conclusion or make a decision. In other words: how much evidence do we need to reject H0? This cut-off point is the significance level. It is announced in advance and serves as a standard for how much evidence against H0 we need in order to reject H0. It is usually denoted α, with typical values 0.05 and 0.01; if not stated otherwise, assume α = 0.05.

Statistical significance. There is no ironclad rule for how small a P-value should be in order to reject H0; it is a matter of judgment and depends on the specific circumstances. But we can compare the P-value with a fixed value that we regard as decisive: the significance level, written α (the Greek letter alpha). If the P-value is smaller than α, we say that the data are statistically significant at level α. When we use a fixed significance level to draw a conclusion in a significance test:
◦ P-value < α → reject H0 → conclude Ha (in context);
◦ P-value ≥ α → fail to reject H0 → cannot conclude Ha (in context).

So if the P-value is smaller than a fixed significance level α, we reject the null hypothesis in favor of the alternative; otherwise we do not have enough evidence to reject the null. (If we don't reject the null, do we accept it? No; see the cautions later.) Always report a P-value with your conclusion, and write the conclusion in terms of the problem. Conclusion for Example 1 (Tim Kelley): since P ≈ 0.0098 < α = 0.05, we reject H0 and conclude that Tim's true mean weight last month was greater than 187 pounds.

Tests of significance: four steps.
1. State the null and alternative hypotheses.
2. Calculate the value of the test statistic.
3. Find the P-value for the observed data.
4. State a conclusion.
We will learn the details of many tests of significance in the following chapters; the proper test statistic is determined by the hypotheses and the data collection design. Reject H0 when the P-value is smaller than the significance level α; otherwise, do not reject. This rule is valid in other settings, too.

If, based on previous data or experience, we expect an "increase", "more", "better" (or a "decrease", "less", "worse"), then we can use a one-sided test. Otherwise, by default, we use a two-sided test. Key words: "different", "departures", "changed"...

Example 3: A group of 72 male executives in the 35-44 age group has mean systolic blood pressure 126.07. Is this career group's mean pressure different from that of the general population of males in this age group, which is N(128, 15)? (α is not given, so assume α = 0.05. A worked sketch follows below.)

A test of significance is based on a statistic that estimates the parameter that appears in the hypotheses. When H0 is true, we expect the estimate to be near the parameter value specified by H0; values of the estimate far from that value give evidence against H0. The test statistic calculated from the sample data measures how far the data diverge from what we would expect if the null hypothesis H0 were true:

z = (estimate − hypothesized value) / (standard deviation of the estimate)

Large values of the statistic show that the data are not consistent with H0.
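A hedged sketch applying this z statistic to Example 3, the executive blood-pressure question above; for the two-sided alternative we double the tail area.

```python
from math import sqrt
from statistics import NormalDist

# Example 3: H0: mu = 128 vs Ha: mu != 128, with sigma = 15, n = 72, x-bar = 126.07
z = (126.07 - 128) / (15 / sqrt(72))     # standardized distance of x-bar from mu0
p_value = 2 * NormalDist().cdf(-abs(z))  # two-sided: double the tail area
print(z, p_value)                        # z ~ -1.09, P ~ 0.28
# P = 0.28 > alpha = 0.05: fail to reject H0; no evidence of a difference from 128
```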
A significance test can be done in a black-and-white manner: we reject H0 if P < α, and otherwise we do not reject H0. But reporting the P-value is a better way to summarize a test than simply stating whether or not H0 is rejected, because P quantifies how strong the evidence against H0 is: the smaller the value of P, the greater the evidence. On the other hand, P does not provide specific information about the true population mean µ; if you want a likely range of values for the parameter, use a confidence interval.

Two-sided significance tests and confidence intervals. A level α two-sided significance test rejects H0: µ = µ0 exactly when µ0 falls outside a level 1 − α confidence interval for µ:
◦ if µ0 is in the CI → fail to reject H0;
◦ if µ0 is not in the CI → reject H0;
◦ NOTE: Ha must be two-sided ("≠")!

Example: An agro-economist examines the cellulose content of a variety of alfalfa hay. Suppose that the cellulose content in the population has a standard deviation of 8 mg. A sample of 15 cuttings has a mean cellulose content of 145 mg, while a previous study claimed that the mean cellulose content was 140 mg. The 95% confidence interval is (140.95, 149.05). Use the confidence interval to determine whether the mean cellulose content is different from 140 mg; then, just for practice, run the test using a test statistic instead of the confidence interval. The result should be the same. (Both are sketched below.)
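A sketch checking the duality on the alfalfa example, under the stated assumptions (known σ = 8, n = 15, x̄ = 145): the 95% interval and the α = 0.05 two-sided test must agree.

```python
from math import sqrt
from statistics import NormalDist

xbar, mu0, sigma, n = 145, 140, 8, 15    # alfalfa cellulose data (mg)

# 95% confidence interval
z_star = NormalDist().inv_cdf(0.975)
margin = z_star * sigma / sqrt(n)
print(xbar - margin, xbar + margin)      # ~(140.95, 149.05); 140 is outside -> reject H0

# Equivalent two-sided z test at alpha = 0.05
z = (xbar - mu0) / (sigma / sqrt(n))
p_value = 2 * NormalDist().cdf(-abs(z))
print(z, p_value)                        # z ~ 2.42, P ~ 0.016 < 0.05 -> reject H0, same verdict
```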
Cautions about significance tests: choosing a significance level · what statistical significance does not mean · don't ignore lack of significance · beware of searching for significance.

Choosing a significance level: α = 0.05 is the accepted standard, but
◦ if the conclusion that Ha is true has "costly" implications, a smaller α may be appropriate;
◦ we do not always need to make a decision: describing the evidence by the P-value may be enough;
◦ there is no sharp border between statistically significant and insignificant.

A statistically significant effect may be small. Example ("executive" blood pressure): µ0 = 128, σ = 15, n = 1000 observations, sample mean = 127. Then z = (127 − 128)/(15/√1000) = −2.11, and the P-value for the two-sided Ha is 2 × 0.0174 = 0.0348. Significant?? Statistical significance is not necessarily practical significance. Plot your results and the confidence interval to see whether the effect is worth your attention. Conversely, important effects may have a large P-value if the sample size is too small. Outliers may produce or destroy statistical significance.

Don't ignore lack of significance. Consider this provocative title from the British Medical Journal: "Absence of evidence is not evidence of absence." Having no proof that a particular suspect committed a murder does not imply that the suspect did not commit the murder. Indeed, failing to find statistical significance in results means only that "the null hypothesis is not rejected". This is very different from actually accepting the null hypothesis; the sample size, for instance, could be too small to overcome large variability in the population. When comparing two populations, lack of significance does not imply that the two populations are the same: they might be different but have similar statistical properties.

Statistical inference, no matter how well done, cannot fix basic flaws in the design:
◦ bias due to sampling (like voluntary response), incorrect experimental design, poorly worded questions, etc.;
◦ any other problems we discussed in Chapter 3 can affect the validity of the inference.

Beware of searching for significance. Example: take 100 executive-rank employees and measure blood pressure, height, weight, bone density, metabolism rate, etc.
◦ Test whether their blood pressure is different, using α = 0.05.
◦ Test whether their height is different, using α = 0.05.
◦ Test whether their weight is different, using α = 0.05.
◦ ...
If we perform 40 significance tests, how many do we expect to be statistically significant just by chance? Remember, the significance level controls what we call a "rare" result. In normal practice rare results do occur, but rarely! If α = 0.05, then we will rarely (5% of the time) get a rare result, and this is exactly what we call statistical significance. In summary: if you search for significance by running tests over and over, you will find it, but this is terrible statistics. We would much rather run the one significance test we are actually interested in, at a single α = 0.05. (A toy simulation of this effect is sketched after the summary below.)

Summary:
◦ Data: an SRS (formulas for other randomized designs are available); haphazard data give unreliable confidence intervals.
◦ The population need not be Normal, but outliers pose a threat to the validity of conclusions.
◦ We will learn how to estimate σ in Chapter 7.
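Referring back to the searching-for-significance caution: a toy simulation (names and seed are illustrative, not from the course) that runs 40 two-sided z tests on data generated with H0 exactly true, so every "significant" result is a false alarm.

```python
import random
from math import sqrt
from statistics import NormalDist

random.seed(1)                                    # illustrative seed for reproducibility
alpha, n_tests, n = 0.05, 40, 100
false_positives = 0
for _ in range(n_tests):
    data = [random.gauss(0, 1) for _ in range(n)]  # H0 is true: mu = 0, sigma = 1
    z = (sum(data) / n) / (1 / sqrt(n))            # one-sample z statistic, sigma known
    if 2 * NormalDist().cdf(-abs(z)) < alpha:      # two-sided P-value below alpha?
        false_positives += 1
print(false_positives)   # expect about alpha * n_tests = 2 "discoveries" by chance alone
```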