Chance Models, Hypothesis Testing, Power
Q560: Experimental Methods in Cognitive Science, Lecture 6

Which is correct in the "real" world?
[Diagram: an observed sample of Stick/Switch choices, and two candidate populations (all Stick, or a mix of Stick and Switch) it might have come from.]

Probability and Samples
So far, we've talked about samples of size 1. In an experiment, we take a sample of several observations and try to make generalizations back to the population. How do we estimate how good a representation of the population our sample is? The distribution of sample means contains all sample means of size n that can be obtained from a population.

Sample Means
Let's do an example with a very small population of 4 scores: X = 2, 4, 6, 8.
We construct a distribution of sample means for n = 2.
Step 1: Write down all 16 possible samples.
Step 2: Draw the distribution of sample means.
Things to note about the distribution:
1. The mean of the sample means equals the mean of the population.
2. The shape looks normal.
3. We can use this distribution to answer questions about probabilities.

Central Limit Theorem
For any population with mean μ and standard deviation σ, the distribution of sample means (DSM) for samples of size n
1) will have a mean of μ_M = μ,
2) will have a standard deviation of σ_M = σ/√n, and
3) will approach a normal distribution as n approaches infinity, even if the population itself is not normally distributed.
Even though we can't compute all possible samples of size n from a real population, the Central Limit Theorem guarantees these three properties for any DSM.
The mean of the distribution of sample means is called the expected value of M. The standard deviation of the distribution of sample means is called the standard error of M.
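The tiny-population example above (X = 2, 4, 6, 8, all 16 samples of size n = 2) is small enough to verify by brute-force enumeration. A minimal sketch:

```python
from itertools import product
from statistics import mean, pstdev

# Tiny population from the slides: X = 2, 4, 6, 8
population = [2, 4, 6, 8]
n = 2

# Step 1: all 16 possible samples of size 2 (sampling with replacement)
samples = list(product(population, repeat=n))
sample_means = [mean(s) for s in samples]

mu = mean(population)        # population mean = 5.0
sigma = pstdev(population)   # population SD = sqrt(5) ≈ 2.236

# Properties promised by the Central Limit Theorem:
print(mean(sample_means))    # mean of sample means = mu = 5.0
print(pstdev(sample_means))  # standard error = sigma / sqrt(n) ≈ 1.581
```

Enumerating the full distribution of sample means is only feasible for toy populations like this one; that is exactly why the Central Limit Theorem is useful for real data.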
standard error: σ_M = σ/√n
Standard deviation: the standard distance between a score X and the population mean μ.
Standard error: the standard distance between a sample mean M and the population mean μ.

Law of Large Numbers
The larger a sample, the better its mean approximates the mean of the population.
Visualizing sampling distributions and the CLT: http://onlinestatbook.com/stat_sim/sampling_dist/index.html

Probability and the DSM
We can use the distribution of sample means to find probabilities (= proportions!). For example: given a population, how likely is it to obtain a sample of size n with a certain M?
Example: SAT scores (μ = 500, σ = 100). Take a sample of n = 25. What is p(M > 540)? p = .0228.
Another example: SAT scores (μ = 500, σ = 100). Take a sample of n = 25. What range of values for M can be expected 80% of the time (prediction)?

Using the Standard Error
The standard error tells us how much error, on average, should exist between a sample mean and the population mean. As the sample size n increases, the standard error decreases.

Hypothesis Testing
What is Hypothesis Testing?
A hypothesis test uses sample data to evaluate a hypothesis about a population parameter. The basic logic of hypothesis testing:
1. State a hypothesis about a population.
2. Obtain a random sample from the population.
3. Compare the sample data with the population:
   - if consistent, accept the hypothesis
   - if inconsistent, reject the hypothesis

An Example
[Figure: basic experimental situation]

Four Steps
The "4 Steps" of Hypothesis Testing:
1. State the hypothesis
2. Set decision criteria
3. Collect data and compute the sample statistic
4. Make a decision (accept/reject)

Step 1: State Hypothesis

Step 2: Set Criteria
Consider the distribution of sample means if H0 is true. Divide the distribution into two sections:
1. Sample means likely to be obtained if H0 is true.
2. Sample means very unlikely to be obtained if H0 is true.
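The earlier SAT example (μ = 500, σ = 100, n = 25) can be checked numerically with the standard normal CDF, built here from `math.erf` so no external packages are needed:

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mu, sigma, n = 500, 100, 25
se = sigma / math.sqrt(n)   # standard error = 100 / 5 = 20

# p(M > 540): convert M to a z-score on the distribution of sample means
z = (540 - mu) / se         # z = 2.0
p = 1 - normal_cdf(z)
print(round(p, 4))          # 0.0228, matching the slide

# Middle 80% of sample means: table value z = ±1.28
lower, upper = mu - 1.28 * se, mu + 1.28 * se   # ≈ 474.4 to 525.6
print(lower, upper)
```

The key move in both questions is the same: work on the distribution of sample means (standard error 20), not the raw score distribution (standard deviation 100).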
Step 2: Set Criteria
Distribution of sample means: [figure]
Examples for boundaries: [figure]

Step 3: Collect Data/Statistics
Select a random sample and perform the "experiment". Compute a sample statistic, e.g. the sample mean. Locate the sample statistic within the hypothesized distribution (use a z-score). Is the sample statistic located within the critical region?

Step 4: Decision
Possibility 1: the sample statistic is within the critical region. Reject H0.
Possibility 2: the sample statistic is not within the critical region. Do not reject H0.
We reject or fail to reject the null; we cannot prove the alternate hypothesis. It is easier to demonstrate that a hypothesis is false than to demonstrate that it is true.

Hypothesis Testing: An Example
It is known that corn in Bloomington grows to an average height of μ = 72 inches (σ = 6) six months after being planted. We are studying the effect of "Plant Food 6000" on corn growth. We randomly select a sample of 40 seeds from the above population and plant them, using PF-6000 each week for six months. At the end of the six-month period, our sample has a mean height of M = 78 inches. Go through the steps of hypothesis testing and draw a conclusion about PF-6000.
1. State hypotheses
   • Null and alternate, in both sentence and parameter notation
2. Determine the critical region in the chance model
   • Calculate and draw the distribution of sample means (DSM)
   • Determine the alpha level (.05)
   • Calculate the upper and lower cutoffs for means that will be considered "unlikely" due to chance: M_crit = μ ± z_crit · σ_M
   • (for α = .05, z_crit = ±1.96)
3. Collect data/compute the test statistic (done for us)
4. Hypothesis decision and conclusion
   • Does M_obt exceed M_crit?
   • If yes, reject the null; if no, we cannot reject the null.

Step 1: State Hypotheses
In words:
Null: PF-6000 will not have an effect on corn growth.
Alt: PF-6000 will have an effect on corn growth.
In symbols:
H0: μ = 72
H1: μ ≠ 72

Step 2: Chance Model and Critical Value
a) Distribution of sample means: n = 40
μ_M = 72
σ_M = σ/√n = 6/√40 = 0.95
Draw the sampling distribution.
b) Set the alpha level: α = .05, so z_crit = ±1.96. Shade in the critical region on the sampling distribution.
c) Compute the critical values corresponding to z_crit:
M_lower = μ − z_crit · σ_M = 72 − 1.96(0.95) = 72 − 1.86 = 70.14
M_upper = μ + z_crit · σ_M = 72 + 1.96(0.95) = 72 + 1.86 = 73.86
This is the range of means we will tolerate as due to chance. Beyond these values, the obtained sample mean is unlikely to have come from the expected sampling distribution. Pencil these values onto the sampling distribution.

Step 3: Do Experiment
This is the part where we actually draw the sample, conduct the experiment, and compute the sample statistic (so far, the mean). For this question, that part has already been done for us; we just need to compare the obtained sample mean to our chance model to determine whether any discrepancy between our sample and the original population is due to:
1. Sampling error
2. A true effect of our manipulation

Step 4: Decision and Conclusion
• M_crit is 70.14 (lower) or 73.86 (upper).
• If M_obt exceeds either of these critical values (i.e., is outside the "chance" range), we reject H0. Otherwise, we cannot reject H0.
M_obt = 78 exceeds M_crit = 73.86, so we reject H0.
Conclusion: We reject the null hypothesis that the chemical produces no difference, and conclude that PF-6000 has an effect on corn growth.

Directional Tests
Directional = one-tailed. In a one-tailed test the hypotheses make a statement about the expected direction of the effect.
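The whole corn/PF-6000 z-test above can be sketched in a few lines of code; this just mechanizes Steps 2 and 4 from the slides:

```python
import math

# Known population (from the slides): corn height mu = 72 in, sigma = 6 in
mu, sigma = 72.0, 6.0
n, M_obt = 40, 78.0
z_crit = 1.96                # two-tailed, alpha = .05

se = sigma / math.sqrt(n)    # standard error ≈ 0.95
M_lower = mu - z_crit * se   # ≈ 70.14
M_upper = mu + z_crit * se   # ≈ 73.86

# Step 4: reject H0 if the obtained mean falls outside the "chance" range
reject_H0 = M_obt < M_lower or M_obt > M_upper
print(round(M_lower, 2), round(M_upper, 2), reject_H0)   # 70.14 73.86 True
```

Since M_obt = 78 lies well above the upper cutoff of 73.86, the code reaches the same decision as the slides: reject H0.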
Example: experimental test of a dietary drug (expected: a reduction in food intake).
H0: no reduction in food intake
H1: food intake is reduced

Errors and Uncertainty
A hypothesis test may produce an erroneous result (a wrong decision). Two types of errors can be made:
Type I error: concluding there is an effect when there really is not.
Type II error: concluding there is no effect when there really is.

Type I error: H0 is rejected, while in fact the treatment has no effect. Example: an experimental treatment (behavioral, drug, etc.) actually has no effect, but the sample data make it look that way (due to sampling error). The alpha level α is the probability that the test will lead to a Type I error. The researcher controls the magnitude of Type I error by setting α.

Type II error: the treatment effect really exists, but the hypothesis test fails to detect it. Example: the treatment effect may be small. Symbol: β.

Summary of possible outcomes of a statistical decision:

                                   Real World
Experimenter's Decision     H0 true (no effect)        H0 false (real effect)
Reject H0 ("effect")        Type I error (α)           Correct: Power (1 − β)
Retain H0 ("no effect")     Correct retention (1 − α)  Type II error (β)

Statistical Power
Another way of defining power: power is the probability of obtaining sample data in the critical region when H0 is actually false — "the probability of detecting an effect if indeed one exists." Power is difficult to specify because it depends in part on the magnitude of any treatment effect.
Example:
Power if the treatment effect is 20 points: [figure]
Power if the treatment effect is 40 points: [figure]

Factors Affecting Power
1. Alpha (lowering α reduces power)
2. Sample size (increasing n increases power, because the standard error goes down)
3. Effect size (the bigger the effect, the greater the power, because the distance between the distributions is bigger)
4. Tails (a one-tailed hypothesis test is more powerful than a two-tailed hypothesis test)

p and α
Sample means located in the critical region have p < α (reject H0).
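The 20-point vs. 40-point power example above can be checked with a quick Monte Carlo simulation. The population parameters below (μ = 500, σ = 100, n = 25, two-tailed α = .05) are illustrative assumptions; the slides do not specify them.

```python
import math
import random

def simulated_power(effect, mu=500, sigma=100, n=25, z_crit=1.96, trials=20000):
    """Monte Carlo estimate of power: the fraction of sample means drawn
    from the treated population (mean mu + effect) that land in the
    critical region of the H0 chance model."""
    se = sigma / math.sqrt(n)
    lower, upper = mu - z_crit * se, mu + z_crit * se
    hits = 0
    for _ in range(trials):
        M = random.gauss(mu + effect, se)   # a sample mean under the treatment
        if M < lower or M > upper:
            hits += 1
    return hits / trials

random.seed(0)
p20 = simulated_power(20)   # smaller effect -> lower power
p40 = simulated_power(40)   # bigger effect -> higher power
print(round(p20, 2), round(p40, 2))
```

The simulation illustrates factor 3 from the list above: doubling the effect size (20 to 40 points) pushes far more of the treated sampling distribution past the critical boundary, so power rises sharply.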
Sample means located outside the critical region have p > α (retain H0).

Why not a z-test: An Example
It is thought that we are genetically hardwired to recognize human faces. In a preferential looking paradigm, newborns are presented with two stimuli: one representing a face, and one containing the same features but in a different configuration. The experimenter records how long the infants look at the face stimulus during a 60-second presentation (let's assume they always look at one or the other). By chance, we would expect them to look at the face stimulus for only 30 seconds, but they look for 35 seconds. Is this effect significant?

Sample Variance
We don't know the variability of the population, but we do know the variability of the sample.
Sample variance: s² = SS/(n − 1) = SS/df
Sample standard deviation: s = √s²

Estimated Standard Error
We can use the estimated standard error as an estimate of the real standard error.
Standard error: σ_M = σ/√n
Estimated standard error: s_M = s/√n = √(s²/n)

t-statistic
Substituting the estimated standard error into the formula for the z-score gives us:
t = (M − μ)/s_M
The t-statistic approximates a z-score, using the sample variance instead of the population variance (which is unknown). How well does that work?

Degrees of Freedom and the t Statistic
Degrees of freedom describes the number of scores in a sample that are free to vary: df = n − 1. The greater df, the better the t-statistic approximates the z-score. The set of t statistics for a given df forms a t distribution. For large df (large n), the t distribution approximates the normal distribution.

t distribution: Shape
[figure]

Hypothesis Tests Using the t Statistic
Same procedure as with z-scores, except using the t statistic instead.
Step 1: State the hypothesis in terms of the population parameter μ.
Step 2: Determine the critical region, using α and df to look up t_crit.
Step 3: Collect data and calculate the value of t using the estimated standard error.
Step 4: Decide, based on whether the t value for the sample falls within the critical region.

One-Sample t Test: An Example
We'll go back to our preferential looking paradigm and newborn babies. We show them the two stimuli for 60 seconds and measure how long they look at the facial configuration. Our null assumption is that they will not look at it for longer than half the time: μ ≤ 30. Our alternate hypothesis is that they will look at the face stimulus longer, because face recognition is hardwired in their brain rather than learned (directional). Our sample of n = 26 babies looks at the face stimulus for M = 35 seconds, s = 16 seconds. Test our hypotheses (α = .05, one-tailed).

Step 1: Hypotheses
In words:
Null: Babies look at the face stimulus for less than or equal to half the time.
Alternate: Babies look at the face stimulus for more than half the time.
In symbols:
H0: μ ≤ 30
H1: μ > 30

Step 2: Determine Critical Region
The population variance is not known, so we use the sample variance to estimate it. n = 26 babies, so df = n − 1 = 25. Look up the value of t at the limit of the critical region in a critical-values-of-t table, with α = .05, one-tailed: t_crit = +1.708.

Step 3: Calculate the t Statistic from the Sample
a) Sample variance: s² = 16² = 256
b) Estimated standard error: s_M = √(s²/n) = √(256/26) = 3.14
c) t statistic: t = (M − μ)/s_M = (35 − 30)/3.14 = 1.59

Step 4: Decision and Conclusion
t_obt = 1.59 does not exceed t_crit = 1.708, so we must retain the null hypothesis.
Conclusion: Babies do not look at the face stimulus more often than chance, t(25) = +1.59, n.s., one-tailed. Our results do not support the hypothesis that face processing is innate.
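The one-sample t test above can likewise be sketched from the summary statistics alone (the critical value 1.708 is taken from a t table, as in Step 2):

```python
import math

# Slide example: n = 26 babies, M = 35 s, s = 16 s, H0: mu <= 30
mu0, M, s, n = 30.0, 35.0, 16.0, 26
t_crit = 1.708                    # from a t table: df = 25, alpha = .05, one-tailed

s_M = math.sqrt(s**2 / n)         # estimated standard error ≈ 3.14
t_obt = (M - mu0) / s_M           # ≈ 1.59

reject_H0 = t_obt > t_crit        # one-tailed test
print(round(t_obt, 2), reject_H0) # 1.59 False -> retain H0
```

If the raw looking times were available, `scipy.stats.ttest_1samp` would compute the same t statistic; with only summary statistics, computing t directly as above is the natural route.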