Assumptions and Conditions - Inference for Means

Distinguish Between Assumptions and Conditions

All of mathematics is based on "If..., then..." statements. Students have seen such statements before. For example, they know (we hope) not to apply the Pythagorean theorem unless they have a right triangle. The fact that it's a right triangle is the assumption that guarantees the equation a² + b² = c² works, so we should always check that we are working with a right triangle before proceeding.

The same is true in statistics -- we just don't see the theorems themselves very often. All those theorems, though, are in that same "If..., then..." form. The "If" part sets out the underlying assumptions used to prove that the statistical method works. If those assumptions are violated, the method may fail. The assumptions are about populations and models -- things that are unknown and usually unknowable -- and that presents a big problem, because we may never know whether an assumption is true.

In fact, there are three types of assumptions:

1. Unverifiable. We must simply accept these as reasonable -- after careful thought.

2. Plausible, based on evidence. We test a condition to see whether it's reasonable to believe that the assumption is true.

3. False, but close enough. We know the assumption is not true, but some procedures can provide very reliable results even when an assumption is not fully met. In such cases a condition may offer a rule of thumb that tells us whether we can safely override the assumption and apply the procedure anyway.

A condition, then, is a testable criterion that supports or overrides an assumption.

Inference for Means

Whenever samples are involved, we check the Random Sample Condition and the 10 Percent Condition. Beyond that, inference for means is based on t-models, because we can never know the standard deviation of the population. The theorems proving that the sampling model for sample means follows a t-distribution rest on the...

Normal Population Assumption: The data were drawn from a population that's Normal. We can never know whether this is true, but we can look for warning signals. We've done that earlier in the course, so students should know how to check the...

Nearly Normal Condition: A histogram of the data appears to be roughly unimodal, symmetric, and without outliers. If so, it's okay to proceed with inference based on a t-model.

But what does "nearly" Normal mean? If the sample is small, we must worry about outliers and skewness, but as the sample size increases, the t-procedures become more robust. By the time the sample gets to be 30-40 or more, we really need not be too concerned. Then our Nearly Normal Condition can be supplanted by the...

Large Sample Condition: The sample size is at least 30 (or 40, depending on your text).

Understanding why we need these assumptions and how to check the corresponding conditions helps students know what to do. It also prevents the "memory dump" approach, in which they list every condition they ever saw -- like np ≥ 10 for means, a clear indication that there's little if any comprehension there.
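To make the one-sample workflow concrete, here is a minimal sketch in Python (my illustration, not part of the original discussion; the data, seed, and numeric values are invented). It follows the order above: examine the shape of the sample first, and only then, if the Nearly Normal Condition seems satisfied, build a t-interval for the mean.

```python
# A minimal sketch (illustrative only; the data are simulated, not real)
# of checking the Nearly Normal Condition and then running a one-sample
# t-interval. Assumes numpy and scipy are installed.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=100, scale=15, size=25)  # hypothetical data, n = 25

# Nearly Normal Condition: a histogram (e.g., plt.hist(sample)) is the
# usual classroom check; here we also summarize skewness numerically.
print("skewness:", stats.skew(sample))

# Because n = 25 is small, the Nearly Normal Condition carries the load.
# If the shape looks okay, build a 95% t-interval for the mean:
n = len(sample)
mean = sample.mean()
se = sample.std(ddof=1) / np.sqrt(n)          # standard error of the mean
lo, hi = stats.t.interval(0.95, df=n - 1, loc=mean, scale=se)
print(f"95% CI for the mean: ({lo:.2f}, {hi:.2f})")
```

With a sample of 30-40 or more, the Large Sample Condition would let us proceed even if the histogram showed moderate skewness.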
Inference for the Difference of Two Means

By now students know the basic issues. They check the Random Condition (a random sample or random allocation to treatment groups) and the 10 Percent Condition (for samples) for both groups. They also must check, for each group, either the Nearly Normal Condition (by showing two separate histograms) or the Large Sample Condition, to be sure that it's okay to use t.

And there's more. As was the case for two proportions, determining the standard error for the difference between two group means requires adding variances, and that's legitimate only if we feel comfortable with the Independent Groups Assumption. Again, there's no condition to check; we have to think about the way the data were collected.

Inference for Matched Pairs

Whenever the two sets of data are not independent, we cannot add variances, and hence the independent-sample procedures won't work. Such situations arise often. We might collect data from husbands and their wives, from people before and after they take a training course, or from individuals performing tasks with both their left and right hands. Matching is a powerful design because it controls many sources of variability, but we cannot treat the data as though they came from two independent groups. Instead we have the...

Paired Data Assumption: The data come from matched pairs. There's no condition to be tested; instead, students must think carefully about the design. This helps them understand that there is no "choice" between two-sample procedures and matched-pairs procedures. Either the data came from independent groups or they were paired. The design dictates the procedure we must use.

Looking at the paired differences gives us just one set of data, so we apply our one-sample t-procedures. We already know the appropriate assumptions and conditions.

Independence Assumption: The individuals are independent of each other. We base plausibility on the Random Condition.

Normal Distribution Assumption: The population of all such differences can be described by a Normal model. We check the plausibility of this assumption with the...

Nearly Normal Condition: The histogram of the differences looks roughly unimodal and symmetric. Note that there's just one histogram for students to show here; we don't care about the two groups separately, as we did when they were independent. As before, the Large Sample Condition may apply instead.
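To underscore that pairing turns two columns of data into a single set of differences, here is a minimal Python sketch (again my illustration, with invented before/after scores). The paired t-test is literally a one-sample t-test on the differences, which is exactly the reduction described above.

```python
# A minimal sketch (data invented for illustration) showing why matched
# pairs reduce to a one-sample problem: we analyze only the differences.
import numpy as np
from scipy import stats

before = np.array([72, 80, 65, 90, 77, 84, 69, 75, 88, 73], dtype=float)
after  = np.array([75, 85, 66, 94, 80, 83, 74, 79, 91, 78], dtype=float)

diffs = after - before  # one set of data; the pairing is in the design

# Nearly Normal Condition: ONE histogram, of the differences
# (e.g., plt.hist(diffs)) -- not two separate histograms as we would
# show for independent groups.

# Paired t-test == one-sample t-test on the differences:
t_stat, p_val = stats.ttest_1samp(diffs, popmean=0)
print(f"t = {t_stat:.3f}, p = {p_val:.4f}")

# stats.ttest_rel(after, before) returns the identical result, which is
# the point: pairing collapses two columns into one set of differences.
```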