Student Manual for Fundamental Statistics for the Behavioral Sciences (7th edition)

David C. Howell
The University of Vermont

Contents

Chapter 1 Introduction
Chapter 2 Basic Concepts
Chapter 3 Displaying Data
Chapter 4 Measures of Central Tendency
Chapter 5 Measures of Variability
Chapter 6 The Normal Distribution
Chapter 7 Basic Concepts of Probability
Chapter 8 Sampling Distributions and Hypothesis Testing
Chapter 9 Correlation
Chapter 10 Regression
Chapter 11 Multiple Regression
Chapter 12 Hypothesis Tests Applied to Means: One Sample
Chapter 13 Hypothesis Tests Applied to Means: Two Related Samples
Chapter 14 Hypothesis Tests Applied to Means: Two Independent Samples
Chapter 15 Power
Chapter 16 One-way Analysis of Variance
Chapter 17 Factorial Analysis of Variance
Chapter 18 Repeated-Measures Analysis of Variance
Chapter 19 Chi-Square
Chapter 20 Nonparametric and Distribution-Free Statistical Tests
Chapter 21 Choosing the Appropriate Analysis

Preface

The purpose of this manual is to provide answers to students using the accompanying text, Fundamental Statistics for the Behavioral Sciences, 7th ed. I have provided complete answers to all of the odd-numbered questions. I am often asked for answers to even-numbered exercises as well. I do not provide those because many instructors want to have exercises without answers. I am attempting to balance the two competing needs. You may find on occasion that you do not have the same answer that I do. Much of this will depend on the degree to which you or I round off intermediate steps. Sometimes it will make a surprising difference. If your answer looks close to mine, and you did it the same way that I did, then don't worry about small differences. It is even possible that I made an error. I know that there will be errors in some of these answers. There always are. Even the most compulsive problem solver is bound to make errors, and it has been a long time since anyone accused me of being compulsive.
I do try, honest I do, but something always slips past—sometimes they even slip past while I am correcting another error. So I maintain a page on the web listing the errors that I and others have found. If you find an error (minor and obvious typos don't count unless they involve numbers), please check there and let me know if it is a new one. Some classes even compete to see who can find the most errors—it's rough when you have to compete with a whole class. The address for the main web page is http://www.uvm.edu/~dhowell/fundamentals/ , and the link to the Errata is there. Important note: Due to the way hypertext links are shown by Microsoft Word, the underlining often obscures a single underline character, as in "More_Stuff." If you see a space in an address, it is often really a "_".

Chapter 1-Introduction

1.1 A good example is the development of tolerance to caffeine. People who do not normally drink caffeinated coffee are often startled by the effect of one or two cups of regular coffee, whereas those who normally drink regular coffee see no such effect. To test for a context effect of caffeine, you would first need to develop a dependent variable measuring the alerting effect of caffeine, which could be a vigilance task. You could test for a context effect by serving a group of users of decaffeinated coffee two cups of regular coffee every morning in their office for a month, but have them drink decaf the rest of the time. The vigilance test would be given shortly after the coffee, and tolerance would be seen by an increase in errors over days. At the end of the month, they would be tested after drinking caffeinated coffee in the same and in a different setting. The important points here are:
1. Tolerance is shown by an increase in errors on the vigilance task.
2. To see the effect of context, subjects need to be presented with caffeine in two different contexts.
3. There needs to be a difference between the vigilance performance in the two contexts.
1.3 Context affects people's responses to alcohol, to off-color jokes, and to observed aggressive behavior.

1.5 The sample would be the addicts that we observe.

1.7 Not all people in the city are listed in the phone book. In particular, women and children are underrepresented. A phone book is particularly out of date as a random selection device with the increase in the use of cell phones. Many telephone surveys miss much of the general population, and instead focus on a restricted population dominated by male adults.

1.9 In the tolerance study discussed in the text, we really do not care what the mean length of paw-lick latency is. No one would be excited to know that a mouse can stand on a surface at 105 degrees for 3.2 seconds without licking its paws. But we do very much care that the population mean of paw-lick latencies for morphine-tolerant mice is longer in one context than in another.

1.11 I would expect that your mother would continue to wander around in a daze, wondering what happened.

1.13 Three examples of measurement data: performance on a vigilance task; typing speed; blood alcohol level.

1.15 Relationship: The relationship between stress and susceptibility to disease; the relationship between driving speed and accident rate.

1.17 You could have one group of mice trained and tested in the same condition, one group trained in one condition and tested in the other, and a group given a placebo in the training context but given morphine in the testing condition.

1.19 This is an Internet search exercise without a fixed answer. The Statistics Homepage is an online statistics text. Various departments offer data sets, computing advice, and clarifying examples.

Chapter 2-Basic Concepts

2.1 Nominal: names of students in the class; Ordinal: the order in which students hand in their first exam; Interval: the student's grade on that first exam; Ratio: the amount of time that the student spent studying for that exam.
2.3 If the rat lies down to sleep in the maze after performing successfully for several trials, this probably says little about what the animal has learned in the task. It may say more about the animal's level of motivation. In this exercise I am trying to get the students to see that there is often quite a difference between what you and I think our variable is measuring and what it actually measures. Just because we label something as a measure of learning does not make it so. Just because the numbers increase on a ratio scale (twice as much time in the maze) doesn't mean that what those numbers are actually measuring is ratio (twice as much learning).

2.5 We have to assume the following at the very least (and I am sure I left out some):
1. Mice are adequate models for human behavior.
2. Morphine tolerance effects in mice are like heroin tolerance effects in humans.
3. Time on a warm surface is in some way analogous to a human response to heroin.
4. A context shift for mice is analogous to a context shift for humans.
5. A drug overdose is analogous to pain tolerance.

2.7 The independent variables are the sex of the subject and the sex of the other person.

2.9 The experimenter expected to find that women would eat less in the presence of a male partner than in the presence of a female partner. Men, on the other hand, were not expected to vary the amount that they ate as a function of the sex of their partner.

2.11 We would treat a discrete variable as if it were continuous if it had many different levels and were at least ordinal.

2.13 When I drew 50 numbers 3 times I obtained 29, 26, and 19 even numbers, respectively. For my third drawing only 38 percent of my numbers were even, which is probably less than I might have expected—especially if I didn't have a fair amount of experience with similar exercises.
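The sampling variability illustrated in 2.13 is easy to check by simulation. Here is a minimal Python sketch (my own illustration, not part of the manual); I assume draws of integers from 1 to 100, which the exercise does not specify:

```python
import random

def count_evens(n_draws=50, low=1, high=100, seed=None):
    """Draw n_draws integers uniformly from [low, high] and count the evens."""
    rng = random.Random(seed)
    return sum(1 for _ in range(n_draws) if rng.randint(low, high) % 2 == 0)

# Three replications, as in the exercise; the counts bounce around 25,
# which is why 29, 26, and 19 are all unremarkable results.
for rep in range(3):
    print(count_evens())
```

Running this a few times makes the point of the exercise concrete: even with a fair 50/50 process, individual samples of 50 routinely stray several counts from 25.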
2.15 Eyes level condition:
a) X3 = 2.03; X5 = 1.05; X8 = 1.86
b) ∑X = 14.82
c) ∑Xi summed from i = 1 to 10 = 14.82, the same quantity as in (b) written with explicit limits.

2.17 Eyes level condition:
a) (∑X)² = 14.82² = 219.6324; ∑X² = 1.65² + ... + 1.73² = 23.22
b) ∑X/N = 14.82/10 = 1.482
c) This is the mean, a type of average. The quantities in (a) are the building blocks for the variance and standard deviation. You really aren't going to do much more calculation than this.

2.19 Putting the two sets of data together:
a) Multiply pairwise.
b) ∑XY = 22.7496
c) ∑X∑Y = 14.82 × 14.63 = 216.8166
d) ∑XY ≠ ∑X∑Y. They do differ, as you would expect.
e) covXY = (∑XY − ∑X∑Y/N)/(N − 1) = (22.7496 − 216.8166/10)/9 = 1.0679/9 = .1187

2.21
 X     X + 4
 5       9
 7      11
 3       7
 6      10
 3       7
∑X = 24   ∑(X + 4) = 44 = 24 + 5 × 4

2.23 In the text I spoke about room temperature as an ordinal scale of comfort (at least up to some point). Room temperature is a continuous measure, even though with respect to comfort it only measures at an ordinal level.

2.25 The Beth Perez story:
a) The dependent variable is the weekly allowance, measured in dollars and cents, and the independent variable is the sex of the child.
b) We are dealing with a selected sample—the children in her class.
c) The age of the students would influence the overall mean. The fact that these children are classmates could easily lead to socially appropriate responses—or what the children deem to be socially appropriate in their setting.
d) At least within her school, Beth could randomly sample by taking a student roster, assigning each student a number, and matching those up with numbers drawn from a random number table. Random assignment to Sex would obviously be impossible.
e) I don't see negative aspects of the lack of random assignment here because that is the nature of the variable under consideration. It would be better if we could randomly assign a child to a sex and see the result, but we clearly can't.
f) The outcome of the study could be influenced by the desire of some children to exaggerate their allowance, or to minimize it so as not to appear too different from their peers. I would suspect that boys would be likely to exaggerate.
g) The descriptive features of the study are her statements that the boys in her class received $3.18 per week in allowance, on average, while the girls received an average of $2.63. The inferential aspects are the inferences to the population of all children, concluding that "boys" get more than "girls."

2.27 I would record the sequence number of each song that is played and then plot them on a graph. I can't tell if they are truly random, but if I see a pattern to the points I can be quite sure that they are not random.

I think that it is important for students to become involved with the Internet early on. There is so much material out there that will be helpful, and you have to start finding it now. I find it impossible to believe that my explanations of concepts are always the best explanations that could be given and that they serve each student equally well. If one explanation doesn't make sense, you can find others that may.

Chapter 3-Displaying Data

3.1 Katz et al. (1990) No Passage group:

[Figure: histogram of test scores (Frequency vs. Test Score).]

There is too little data to say very much about the shape of this distribution, but it certainly isn't looking normally distributed.

3.3 I would use stems of 3*, 3., 4*, 4., 5*, and 5. for this display.

3.5 Compared to those who read the passages:
a) Almost everyone who read the passages did better than the best person who did not read them. Certainly knowing what you are talking about is a good thing (though not always practiced).
b)
NoPassage | Stem | Passage
        4 |  3*  |
    68966 |  3.  |
    44343 |  4*  |
  6669697 |  4.  |
    42102 |  5*  |
    57557 |  5.  | 5669
          |  6*  |
          |  6.  | 66
          |  7*  | 21232231
          |  7.  | 5
          |  HI  | 91 93

Notice that I have entered the data in the order in which I encountered them, rather than in increasing order. It makes the plot easier to construct.
c) It is obvious that the two groups are very different in their performance. We would be worried if they weren't.
d) This is an Internet exercise with no fixed answer. That source is far more advanced than the students would be at this time, but I think that they should be able to read it if they just skip over what they don't understand.

3.7 The following is a plot (as a histogram) of reaction times collapsed across all variables.

3.9 Histogram of GPA scores

3.11 (1) Mexico has very many young people and very few old people, while Spain has a more even distribution. (2) The difference between males and females is more pronounced at most ages in Spain than it is in Mexico. (3) You can see the high infant mortality rate in Mexico.

3.13 The distribution of those whose attendance is poor is far more spread out than the distribution of normal attendees. This would be expected because a few very good students can score well on tests even when they don't attend, but most of the poor attenders are generally poor students who would score badly no matter what. The difference between the average grades of these two groups is obvious.

3.15 As the degree of rotation increases, the distribution of reaction time scores appears to move from left to right—which is also an increase.

I think it is a good idea to really think through this problem, rather than to just take the answer as given. It is important to see that looking at data can lead to conclusions to scientific questions, even without formal statistical tests. Many students have a hard time seeing the relationship between data and a question they would like to ask. (Probably many older adults do as well.)

3.17 The data points are probably not independent in that data set. As time went on, there would be changes in the subject's performance.
At first he might get better with practice, but then fatigue would start to set in. Since the data are given in the order in which they were collected, at least within each condition, data nearer in time should be more similar than data farther apart in time.

3.19 The amount of shock that a subject delivers to a white participant does not vary as a function of whether or not that subject has been insulted by the experimenter. However, the black participants do suffer more shocks when the subject has been insulted.

3.21 Wikipedia gives an excellent set of data on HIV/AIDS prevalence at http://en.wikipedia.org/wiki/List_of_countries_by_HIV/AIDS_adult_prevalence_rate

3.23 There is a tremendous increase in Down's Syndrome in children born to older mothers. This increase doesn't really take off until mothers are in their 40s, but with parents delaying having children, this is a potential problem.

[Figure: percent of infants with Down's Syndrome as a function of maternal age (15 to 50).]

3.25 Smoking and low birthweight: The data are given as the percentage of births for each group that were less than 2500 grams.

[Figure: percent of births under 2500 grams by year (1989–1993), plotted separately for smokers and nonsmokers.]

The relationship is unlikely to be a fluke because it is so consistent year after year. You can see that within each group there is very little variability.

Students often wonder why behavioral scientists care about what appears to be a public health problem. But public health problems are very often behavioral problems. Psychologists spend a great deal of time dealing with the behavioral consequences of low birthweight, trying to find ways of lowering the rate, and with addictions such as smoking.

3.27 White females have a longer life expectancy than black females, but the difference has shrunk considerably since 1920, though recent changes have been modest.
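Displays like the stem-and-leaf plots in 3.3 and 3.5 can be generated programmatically. Here is a small Python sketch (my own illustration, not the manual's); the score list is the NoPassage data as recovered from the stem-and-leaf display in 3.5:

```python
def stem_and_leaf(scores):
    """Return the lines of a simple stem-and-leaf display with split stems.

    '3*' holds leaves 0-4 of the thirties, '3.' holds leaves 5-9, and so on.
    Leaves are sorted within each stem (unlike the hand-drawn display).
    """
    stems = {}
    for s in scores:
        stem, leaf = divmod(s, 10)
        label = f"{stem}{'*' if leaf < 5 else '.'}"
        stems.setdefault(label, []).append(str(leaf))
    # Sort by stem value, with '*' (low leaves) before '.' (high leaves).
    order = sorted(stems, key=lambda k: (int(k[:-1]), k[-1] == '.'))
    return [f"{k} | {''.join(sorted(stems[k]))}" for k in order]

# NoPassage scores recovered from the Exercise 3.5 display (N = 28)
no_passage = [54, 52, 51, 50, 36, 55, 44, 46, 57, 44, 43, 52, 38, 46,
              55, 34, 44, 39, 43, 36, 55, 57, 36, 46, 49, 46, 49, 47]
for line in stem_and_leaf(no_passage):
    print(line)
```

These scores reproduce the manual's own summary statistics for the NoPassage group (mean = 46.57, median = 46), which is a useful consistency check on the display.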
Chapter 4-Measures of Central Tendency

4.1 Mode = 72; Median = 72; Mean = 70.18

4.3 Even without reading the passage, students are still getting about twice as many items correct as they would by chance. This suggests that the test, while testing reading comprehension, is also testing something else. I am not surprised at these results because most students can guess at better than chance levels.

I think it is worth pointing out that these data suggest that the test measures something other than reading comprehension. Most students just say "they were able to guess intelligently," without realizing that this means that the test is somehow measuring guessing ability. This will become more obvious when we talk about correlation in Chapter 9.

4.5 The mean falls above the median. Any positively skewed distribution will have a mean greater than the median.

4.7 Rats running a straight alley maze:
∑X = 320; Mean = ∑X/N = 320/15 = 21.33; Median = 21

4.9 Multiplying by a constant (5):
Original data: 8 3 5 5 6 2; Mean = 4.833, Mode = 5, Median = 5
Revised data: 40 15 25 25 30 10; Mean = 24.17 = 5 × 4.833, Mode = 25 = 5 × 5, Median = 25 = 5 × 5

4.11 Measures of central tendency for ADDSC and GPA:
ADDSC: Mode = 50; Median = 50; Mean = 4629/88 = 52.6
GPA: Mode = 3.00; Median = 2.635; Mean = 216.15/88 = 2.46

4.13 The means are very nearly the same for the two conditions.

4.15 The only measure that is acceptable for nominal data is the mode, because the mode is the only one that does not depend on the relationships among the points on the scale.

4.17 Class attendance:
Regular attendees: Mean = 276.42; Median = 276
Poor attendees: Mean = 248.33; Median = 256
The two groups were 20 points apart in terms of the medians, and about 28 points apart in terms of the means. Clearly, those students who come to class do better. Because this is not a true experiment (we don't assign subjects to groups at random), we don't know exactly what it means.
I would like to think that students did poorly because they didn't hear my brilliant presentations, but it could also be that poorer students in general are less likely to come to class. This is an issue of confounding, and it is a good example of why random assignment is preferable, in a situation with which most students can identify.

4.19 This is an Internet activity in which there is no fixed answer.

4.21
a) mean = 46.57; 10% trimmed mean = 46.67
b) mean = 28.4; 10% trimmed mean = 25.0
c) Trimming was more effective in the second example because the second distribution was quite positively skewed.

4.23 The Male Optimists had a mean of 1.016, while the Male Pessimists had a mean of 0.945. This difference is very reliable.

Chapter 5-Measures of Variability

5.1 Variability of NoPassage group:
Range = 57 − 34 = 23; St. Dev. = 6.83; Variance = 46.62

5.3 The variability of the NoPassage group is much smaller than the variability of the Passage group. If this difference turns out to be reliable, it could possibly be explained by the fact that the questions for the Passage group are asking for more than guessing and test-taking skills, and there may be greater variability due to variability in knowledge. On the other hand, it is not uncommon to find one standard deviation equal to two to three times another in small samples.

5.5 Percentages within two standard deviations in Exercise 5.2:
s = 10.61
Mean ± 2s = 70.18 ± 2(10.61) = 70.18 ± 21.22 = 48.96 to 91.40
16 scores (or 94%) lie within 2 standard deviations of the mean.

5.7 Multiplying or dividing by a constant:
Original: 2 3 4 4 5 5 9; Mean = 4.57, s = 2.23
X × 2: 4 6 8 8 10 10 18; Mean = 9.14, s = 4.45
X/2: 1 1.5 2 2 2.5 2.5 4.5; Mean = 2.29, s = 1.11

5.9 Convert revised data to mean = 0:
Since adding or subtracting a constant will not change the standard deviation, but will change the mean, I can subtract 3.27 from every score for X2 in Exercise 5.8, making the mean = 0 and keeping s = 1.0.
The new values are:
X3: −0.889, 0.539, −1.842, 0.539, −0.413, 1.016, 1.016; Mean = 0, s = 1

5.11 Boxplot for Exercise 5.1:
Median location = (N + 1)/2 = 29/2 = 14.5
Median = 46
Hinge location = (median location + 1)/2 = 15/2 = 7.5
Hinges = 43 and 52
H-spread = 52 − 43 = 9
Inner fences = hinges ± 1.5 × H-spread = hinges ± 13.5 = 29.5 and 65.5
Adjacent values = 34 and 57

[Figure: boxplot of the Exercise 5.1 scores, plotted on a scale from 30 to 60.]

5.13 Boxplot for ADDSC:
Median location = (N + 1)/2 = 89/2 = 44.5
Median = 50
Hinge location = (median location + 1)/2 = 45/2 = 22.5
Hinges = 44.5 and 60.5
H-spread = 60.5 − 44.5 = 16
Inner fences = hinges ± 1.5 × H-spread = hinges ± 24 = 20.5 and 84.5
Adjacent values = 26 and 78

[Figure: boxplot of the ADDSC scores, plotted on a scale from 30 to 100.]

5.15 Variance when you add a score equal to the mean:
s² = ∑(X − mean)²/(N − 1) = 46.6257
∑(X − mean)² = (N − 1)s² = 27(46.6257) = 1258.8933
Adding a score equal to the mean will not change the sum of squared deviations but will increase the denominator to 28:
s²new = 1258.8933/28 = 44.96
Note that the new variance is (1 − 1/N) times the old variance.

The point that I was trying to make here is that adding scores that don't deviate from the mean actually decreases the variance, because they decrease the average deviation from the mean.

5.17 Angle of rotation: [Figure omitted.]

5.19 The following is a cut-and-paste from the JMP help screen. (I don't expect students to make all of these distinctions from what they are given, because many of the lines overlap.) [Screen capture omitted.]

5.21 Treatment of anorexia: I would hypothesize that the two treatment groups would show more of a weight gain than the control group, but I have no reason to predict which treatment group would do better. I would assume that the variability would be about the same within each group.

Complete (Before and After) data for the three groups, from which difference scores were derived:

           Cognitive Behavioral   Family Therapy   Control
Mean              3.01                 7.26         −0.45
Median            1.40                 9.00         −0.35
St. Dev.          7.31                 7.16          7.99

[Figures: histograms of weight change for the Cognitive Behavior Therapy group (Mean = 3.0, St. Dev. = 7.31, N = 29), the Family Therapy group (Mean = 7.3, St. Dev. = 7.16, N = 17), and the Control group (Mean = −0.4, St. Dev. = 7.99, N = 26).]

If we look at the weight gain or loss, it would appear that the Control group remained stable, but the two treatment groups gained weight. The gain is greater for the Family Therapy group.

5.23 The descriptive statistics from SPSS are given below. The variable labels should be clear. [Output omitted.]

Notice that the Winsorized variance is considerably greater than the trimmed variance, as it should be. However, it is lower than the variance of the original data, reflecting the fact that the extreme values have been replaced. Cognitive behavior scores were positively skewed, with several quite high values and one or two low values. Trimming and Winsorizing reduced the influence of those values, which is why the Winsorized variance is considerably smaller than the original variance. The trimmed mean is considerably smaller than the original mean, but the Winsorized mean is only slightly smaller.

Chapter 6-The Normal Distribution

6.1 Distribution of original values:

[Figure: histogram of the distribution (Frequency vs. Score / Deviation / z).]

For the first distribution the abscissa would take on the values: 1 2 3 4 5 6 7
For the second distribution the values would be: −3 −2 −1 0 1 2 3
For the third distribution the values would be: −1.90 −1.27 −0.63 0 0.63 1.27 1.90
In these calculations I used the parameters as given, rather than the statistics calculated on the sample.

6.3 Psychology 1 exam grades:
z = (165 − 195)/30 = −1.0; z = (225 − 195)/30 = 1.0
a) The percentage between 165 and 225 is the percentage between z = −1.0 and z = 1.0. This is twice the area between z = 0 and z = 1 = 2 × 0.3413 = .6826.
b) The percentage below 195 is just the percentage below z = 0 = .500.
c) The percentage below z = 1 is the percentage in the larger portion = .8413.

6.5 Guessing on the Psychology 1 exam:
a) We know the mean and standard deviation if the students guess; they are 75 and 7.5, respectively. We also know that a z score of 1.28 cuts off the upper 10%. We simply need to convert z = 1.28 to a raw score:
1.28 = (X − 75)/7.5, so X = 75 + 1.28 × 7.5 = 75 + 9.6 = 84.6
b) For the top 25% of the students the logic is the same except that z = 0.675:
X = 75 + 0.675 × 7.5 = 75 + 5.0625 = 80.0625
c) For the bottom 5% the cutoff will be z = −1.645:
X = 75 − 1.645 × 7.5 = 75 − 12.34 = 62.66
d) I would conclude that students were not just guessing, and could make use of test-taking skills that they had acquired over the years.

There is a difference between Exercises 6.3 and 6.4 on the one hand, and 6.5 on the other. In the first two we are talking about performance on the test if students take it normally. There the mean is 195. In Exercise 6.5 we are talking about performance if the students just guessed purely at random without seeing the questions, but only the answers. Here the mean is 75, with a standard deviation of 7.5. These parameters are given by the binomial distribution with N = 300, p = .25, and q = .75, though the students would certainly not be expected to know this.

6.7 Reading scores for fourth and ninth grade children:
a) [Figure: the two normal distributions of reading scores, plotted over the range 15 to 45.]
b) To do better than the average 9th grade student, the 4th grader would have to have a score of 30 or higher.
z = (30 − 25)/5 = 1.00
The probability that a fourth grader would exceed a score of 30 is the probability of a z greater than 1.00 = .1587.
c) The probability of a 9th grader doing worse than the average 4th grader is the probability of a 9th grader getting a score below 25, which is the probability of being more than half a standard deviation below the mean, which is .3085.
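The normal-curve areas used in 6.7 (and throughout this chapter) can be computed rather than read from the appendix. A minimal Python sketch (my own illustration), using the standard normal CDF built from the error function:

```python
from math import erf, sqrt

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# 6.7b: P(fourth grader scores above 30), with mean 25 and sd 5
print(round(1 - phi((30 - 25) / 5), 4))   # .1587

# 6.7c: P(ninth grader scores below 25), with mean 30 and sd 10
print(round(phi((25 - 30) / 10), 4))      # .3085
```

Both values match the tabled answers, which is a handy way to check your hand calculations against rounding error.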
6.9 Diagnostically meaningful cutoffs:
1.28 = (X − 50)/10, so X = 50 + 1.28 × 10 = 62.8
A T score of 62.8 is the score that cuts off the top 10% of the distribution, and is therefore a diagnostically meaningful cutoff.

6.11 Seat belt study:
a) [Figure: normal distribution of the scores.]
b) We need the probability of getting a 62 if the mean is 44 with a standard deviation of 7.
z = (62 − 44)/7 = 2.57
The probability of z > 2.57 = .0051. This is such a small probability that we will probably conclude that the student just made up the data, rather than collecting them honestly.

6.13 Distribution of correct responses:
a) [Figure omitted.]
b) The easiest way to find the cutoff for the lowest 10% is to simply take the sample data and count them, empirically finding the point with 10% of the scores below it.

6.15 Reaction time data:
0.675 = (X − 1.626)/0.6373, so X = 1.626 + 0.675 × 0.6373 = 2.06
For a normal distribution we would expect 75% of the scores to be equal to or less than 2.06 seconds. In our data the 75th percentile is 1.88 seconds.

6.17 Identifying the highest 2% of Behavior Problem scores:
The upper 2% is cut off by z = 2.05.
2.05 = (X − 50)/10, so X = 50 + 2.05 × 10 = 70.5
The critical cutoff is a score of 70.5.

6.19 The statisticians were upset because, by defining "overweight" as weighing more than 95% of peers (i.e., above the 95th percentile), the article seemed to be suggesting that there were 22% of children in the top 5%. Moreover, the article says that in 1986 only 8% of children were in the top 15%. That is just silly—it is analogous to "all of the children are above average." I assume that they meant to say that 22% (etc.) were above what the 95th percentile was some years ago, but that is a different thing. Even if that is the case, the results still look too extreme to be likely.

6.21 Histogram of combined data on emotional stability:

[Figure: histogram of the combined data (Mean = 100.0, Std. Dev. = 13.44, N = 10,000), with a normal curve superimposed.]

Notice that we have combined two normal distributions with the same mean, but the resulting distribution is not normal, as can be seen by comparing it to the superimposed normal curve. If the means were very different the distribution would become bimodal.

Chapter 7-Basic Concepts of Probability

7.1 Views of probability:
a) Analytic: If two tennis players are exactly equally skillful, so that the outcome of their match is random, the probability is .50 that Player A will win the upcoming match.
b) Relative frequency: If in past matches Player A has beaten Player B on 13 of the 17 occasions on which they have played, then, unless something has changed, Player A has a probability of 13/17 = .76 of winning their upcoming match.
c) Subjective: Player A's coach feels that she has a probability of .90 of winning her upcoming match with Player B.

7.3 More raffle tickets:
a) The probability of winning second prize given that you did not win first is 1/999 = .001.
b) The probability that mom comes in first and you are second = 1/1000 × 1/999 = .000001.
c) The probability of you first and mom second = 1/1000 × 1/999 = .000001.
d) The probability that the two of you will take the top two prizes is .000001 + .000001 = .000002.

7.5 Part a) of Exercise 7.3 dealt with conditional probabilities.

7.7 What is the probability that you will feel better about your life given that you seek psychological counseling? The research hypothesis is that those who seek help when they need it feel better about life than those who refuse to seek help.

7.9 The mother and child are both sleeping for 11 hours, so the probabilities must be based on the remaining 13 hours.
p(mom looking) = 2/13 = .154; p(baby looking) = 3/13 = .231; p(both looking) = .154 × .231 = .036

7.11 We would expect 3.33 percent of the fliers to end up in the trash if the message and the behavior were independent. In fact, Geller et al.
found 4.5 percent of those fliers in the trash. This may look like a very small difference, but given the number of fliers that were handed out, it is a reliable one. It would appear that having a message on a flier increases its probability of being disposed of properly.

7.13 A continuous variable that is routinely treated as if it were discrete is children's learning abilities, where placement in classes often assumes that the child falls within one category or another.

7.15 If we assume that we know nothing about the applicant, the probability of their being admitted is the probability that they fall above the 80th percentile (which equals .20) times the probability that they will be admitted if they do, which is 10/100 = .10. The probability is .20 × .10 = .02. Alternatively, we know that 10 out of 500 are admitted, so we could take the probability as being 10/500 = .02, which is the same thing.

7.17 ADDSC: N = 88, Mean = 52.6, s = 12.42 [calculated from data set]
z = (50 − 52.6)/12.42 = −0.21
The probability associated with z = −0.21 is .5832.

7.19 Dropouts with ADDSC > 60:
p(dropout|ADDSC > 60) = 7/25 = .28

7.21 Conditional and unconditional probability of dropping out:
p(dropout) = 10/88 = .11
p(dropout|ADDSC > 60) = .28
Students are much more likely to drop out of school if they scored at or above ADDSC = 60 in elementary school.

7.23 If there is no discrimination in housing, then a person's race and whether or not they are offered a particular unit of housing are independent events. We could calculate the probability that a particular unit (or a unit in a particular section of the city) will be offered to anyone in a specific income group. We can also calculate the probability that the customer is a member of an ethnic minority. We can then calculate the probability of that person being shown the unit assuming independence and compare that answer against the actual proportion of times a member of an ethnic minority was offered such a unit.
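The independence check described in 7.23 amounts to comparing p(shown) × p(minority) with the observed joint proportion. A small Python sketch with made-up counts (the exercise supplies no data, so all of the numbers below are purely illustrative):

```python
# Hypothetical counts for the housing-discrimination check in 7.23
# (illustrative numbers only, not data from the exercise).
shown, minority, shown_and_minority, total = 300, 200, 30, 1000

p_shown = shown / total
p_minority = minority / total
expected_joint = p_shown * p_minority        # joint probability if independent
observed_joint = shown_and_minority / total  # joint proportion actually seen

# If the two probabilities differ substantially, independence is suspect.
print(expected_joint, observed_joint)        # 0.06 vs 0.03
```

With these made-up counts the observed joint proportion is half of what independence predicts, which is the pattern that would suggest discrimination; a formal test of the difference waits until the chi-square chapter.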
7.25 The data again would appear to show that the U.S. Attorneys are more likely to request the death penalty when the victim was White than when the victim was Nonwhite. (This finding is statistically significant, though we won't address that question until Chapter 19.)

7.27 In this situation we begin with the hypothesis that African Americans are fairly represented in the jury pool. If so, we would expect 0.43% of the pool of 2124 people from which juries are drawn to be African American. That comes out to be an expectation of 9.13 people. But the pool actually had only 4 African Americans. We would not expect exactly 9 people; we might have 7 or 8. But 4 sounds awfully small. That is such an unlikely event if the pool is fair that we would probably conclude that the pool is not a fair representation of the population of Vermont. An important point here is that this is a conditional probability: if the pool is fair, the probability of this event is only .05, an unlikely result.

Chapter 8-Sampling Distributions and Hypothesis Testing

8.1 Last night's hockey game:
a) Null hypothesis: The game was actually an NHL hockey game.
b) On the basis of that null hypothesis I expected that each team would score somewhere between 0 and 6 points. I then looked at the actual points and concluded that they were way out of line with what I would expect if this were an NHL hockey game. I therefore rejected the null hypothesis. Notice that I haven't drawn a conclusion about what type of game it actually was, because that is not what I set out to test.

8.3 A Type I error would be concluding that I was shortchanged when in fact I was not.

8.5 The rejection region is the set of outcomes for which we would reject the null hypothesis. The critical value would be the minimum amount of change below which I would reject the null. It is the border of the rejection region.

8.7 For the Mode test I would draw a very large number of samples and calculate the mode, range, and their ratio (M).
I would then plot the resulting values of M.

8.9 Guessing the height of the chapel.
a) The null hypothesis is that the average of two guesses is as accurate as one guess. The alternative hypothesis is that the average guess is more accurate than the single guess.
b) A Type I error would be to reject the null hypothesis when the two kinds of guesses are equally accurate. A Type II error would be failing to reject the null hypothesis when the average guess is better than the single guess.
c) I would be tempted to use a one-tailed test simply because it is hard to imagine that the average guess would be less accurate, on average, than the single guess.

8.11 A sampling distribution is just a special case of a general distribution in which the thing that we are plotting is a statistic which is the result of repeated sampling.

8.13 Magen et al. (2008) study
a) The null hypothesis is that the phrasing of the question will not affect the outcome—the means of the two groups are equal in the population. The alternative hypothesis is that the mean will depend on which condition the person is in.
b) I would compare the two group means.
c) If the difference is significant I would conclude that the phrasing of the choice makes a real difference in the outcome.

8.15 Rerunning Exercise 8.14 for α = .01: We first have to find the cutoff for α = .01 under a normal distribution. The critical value of z = 2.33 (one-tailed), which corresponds to a raw score of 42.69 (from a population with μ = 59 and σ = 7). We then find where 42.69 lies relative to the distribution under H1:

z = (42.69 - 50)/7 = -1.04

From the appendix we find that 85.08% of the scores fall above this cutoff. Therefore β = .851.

8.17 To determine whether there is a true relationship between grades and course evaluations I would find a statistic that reflected the degree of relationship between two variables. (The students will see such a statistic (r) in the next chapter.)
I would then calculate the sampling distribution of that statistic in a situation in which there is no relationship between two variables. Finally, I would calculate the statistic for a representative set of students and classes and compare my sample value with the sampling distribution of that statistic.

8.19 Allowances for fourth-grade students:
a) The null hypothesis in this case would be the hypothesis that boys and girls receive the same allowance on average.
b) I would use a two-tailed test because I want to reject the null whenever there is a difference in favor of one gender over the other.
c) I would reject the null whenever the obtained difference between the average allowances was greater than I would be led to expect if they were paid the same in the population.
d) I would increase the sample size and get something other than a self-report of allowances.

8.21 Hypothesis testing and the judicial system
The judicial system operates in ways similar to our standard logic of hypothesis testing. However, in a court we are particularly concerned with the danger of convicting an innocent person. In a trial the null hypothesis is equivalent to the assumption that the accused person is innocent. We set a very small probability of a Type I error, which is far smaller than we normally do in an experiment. Presumably the jury tries to set that probability as close to 0 as they reasonably can. By setting the probability of a Type I error so low, they knowingly allow the probability of a Type II error (releasing a guilty person) to rise, because that is thought to be the lesser evil.

Chapter 9-Correlation

9.1 Low birthweight statistics: The two outliers would appear to have a distorting effect on the correlation coefficient. However, if you replot the data without those points the relationship is still apparent and the correlation only drops to -.54.

9.3 With 24 degrees of freedom, a two-tailed test at α = .05 would require |r| > .388.
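The critical value of r quoted in 9.3 can be recovered from the corresponding critical value of t, since r = t/√(t² + df). A minimal sketch (not from the manual; the t cutoff of 2.064 for 24 df at a two-tailed α of .05 is taken from a standard t table):

```python
# Recover the critical r for df = 24 from the critical t (two-tailed, alpha = .05).
import math

df = 24
t_crit = 2.064  # two-tailed .05 cutoff for 24 df, from a t table
r_crit = t_crit / math.sqrt(t_crit**2 + df)

print(round(r_crit, 3))  # 0.388
```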
9.5 We can conclude that infant mortality is closely tied to both income and the availability of contraception. Infants born to people living in poverty are much more likely to die before their first birthday, and the availability of contraception significantly reduces the number of infants put at risk in the first place.

9.7 Because both income and contraception are related to mortality, we might expect that using them together would lead to a substantial increase in predictability. But note that they are correlated with each other, and therefore share some of the same variance.

9.9 Psychologists have a professional interest in infant mortality because some of the variables that contribute to infant mortality are behavioral ones, and we care about understanding, and often controlling, behavior. Psychologists have an important role to play in world health that has little to do with pills and irrigation systems.

There is a great deal of data available on these issues, and you can easily find it on the Internet. If you are interested in this question, you might also be interested in searching for similar literature on HIV/AIDS. This question was partly intended to make students think about the fact that all sorts of things are of interest to psychologists. We don’t just run animals in a maze or inquire into people’s dirty minds. In addition, low birthweight is a risk factor for all sorts of infant outcomes.

9.11 The relationship is extremely curvilinear, even though the linear correlation is quite high. You can see that the best fitting line misses almost all of the data points at each end of the distribution.

9.13 The relationship between test scores in Katz’ study and SAT scores for application purposes is a relevant question because we would not be satisfied with a set of data that used SAT questions and yet gave answers that were not in line with SAT performance. We want to know that the tests are measuring at least roughly the same thing.
In addition, by knowing the correlation between SATs and performance without seeing the questions, we get a better understanding of some of what the SAT is measuring.

9.15 Correlation for the data in Exercise 9.14:

SAT:  mean = 598.57   ∑X = 16760   st. dev. = 61.57
Test: mean = 46.21    ∑Y = 1294    st. dev. = 6.73
∑XY = 780500

covYX = (∑XY - ∑X∑Y/N)/(N - 1) = (780500 - (16760)(1294)/28)/27 = 220.3175

r = covYX/(sY sX) = 220.3175/((6.73)(61.57)) = .53

With 26 df we would need a correlation of .374 to be significant. Since our value exceeds that, we can conclude that the relationship between test scores and the SAT is reliably different from 0.

9.17 When we say that two correlations are not significantly different, we mean that they are sufficiently close that they could both have come from samples from populations with exactly the same population correlation coefficient.

9.19 The answer to this question depends on the students’ expectations.

9.21 It is sometimes appropriate to find the correlation between two variables even if you know that the relationship is slightly curvilinear. A straight line often does a remarkably good job of fitting a curved function, provided that it is not too curved.

9.23 The amount of money that a country spends on health care may have little to do with life expectancy because to change a country’s life expectancy you have to change the health of a great many individuals. Spending a great deal of money on one person, even if it were to extend her life by dozens of years, would not change the average life expectancy in any noticeable way. Often the things that make a major change in life expectancy, like inoculations, really cost very little money. The African Red Cross estimates that there are 300-500 million cases of malaria each year, resulting in 1.5 to 2.5 million deaths. In particular, more than 90% of the deaths are in children under 5 years of age, and they occur predominantly in sub-Saharan Africa.
Malaria cases could be cut by up to a third with insecticide treated bednets, which are very cheap by U.S. healthcare standards.

9.25 Extremely exaggerated data on male and female weight and height to show a negative slope within gender but a positive slope across gender:

Height  Weight  Gender
  68     185     Male
  72     175     Male
  66     190     Male
  69     180     Male
  70     180     Male
  66     135     Fem.
  60     155     Fem.
  64     145     Fem.
  65     140     Fem.
  63     150     Fem.

[Scatterplot of Weight against Height, with the two genders plotted separately.]

What we are effectively plotting here is the relationship between Gender and Weight, more than between Height and Weight.

9.27 We have confounding effects here. If we want to claim that red wine consumption lowers the incidence of heart disease, we have a problem because the consumption of red wine is highest in those areas with the greatest solar radiation, which is another potential cause of the effect. We would have to look at the relationship between red wine and heart disease controlling for the effects of solar radiation.

9.29 This is an Internet search with no fixed answer.

Chapter 10-Regression

10.1 Regression equation predicting low birthweight from high-risk fertility.
Y = Low Birthweight Percentage: mean Y = 6.70, sY = 0.698, sY² = 0.487
X = High-risk Fertility: mean X = 46.00, sX = 6.289, sX² = 39.553
covXY = 2.7245

b = covXY/sX² = 2.7245/39.553 = 0.069
a = mean Y - b(mean X) = 6.70 - (0.069)(46.00) = 3.53
Ŷ = 0.069X + 3.53

10.3 If the high risk fertility rate jumped to 70, we would predict that the incidence of birthweight < 2500gr would go to 8.35.

Ŷ = bX + a = 0.0689(70) + 3.53 = 8.35

This assumes that there is a causal relationship, which is plausible in some ways, but not proven. It may be trivial to point this out, but here we have a real world situation where we can say something about changing trends in society and their possible effects.
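The slope, intercept, and prediction in 10.1 and 10.3 can be verified from the summary statistics alone. A minimal Python sketch (not part of the manual), using the covariance, variance, and means reported above:

```python
# Sketch of the 10.1 regression and the 10.3 prediction from summary statistics.
cov_xy = 2.7245   # covariance of X and Y
var_x = 39.553    # variance of X (high-risk fertility)
mean_x, mean_y = 46.00, 6.70

b = cov_xy / var_x        # slope = covariance / variance of X
a = mean_y - b * mean_x   # intercept = mean Y - b * mean X

print(round(b, 4))           # 0.0689
print(round(a, 2))           # 3.53
print(round(b * 70 + a, 2))  # 8.35 -- predicted incidence at a fertility rate of 70
```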
10.5 I would be more comfortable speaking about the effects on Senegal because it is already at approximately the mean income level and we are not extrapolating for an extreme country. This may have little to do with a statistics course in psychology, but there have been some noticeable improvements in infant mortality in Senegal, and one device that has made a difference is a warm table on which newborn infants can be placed. This may interest students who probably think of advances in medicine in terms of MRIs. http://www.usaid.gov/stories/senegal/pc_sn_infant.html

10.7 Prediction of Symptoms score for a Stress score of 45:
Regression equation: Ŷ = 0.7831X + 73.891
If X = 45: Ŷ = 0.7831(45) + 73.891 = 109.13
Predicted Symptoms = 109.13

10.9 Subtracting 10 points from every X or Y score would not change the correlation in the slightest. The relationship between X and Y would remain the same.

10.11 Diagram to illustrate Exercise 10.10: [Scatterplot showing the 10-point difference between the two regression lines.]

10.13 Adding a constant to Y: [Scatterplot of Y against X with the raised regression line.]
a) From this figure you can see that adding 2.5 to Y simply raised the regression line by 2.5 units.
b) The correlation would be unaffected.

10.15 Predicting GPA (Y) from ADDSC (X):

b = covXY/sX² = -6.580/154.431 = -0.0426
a = mean Y - b(mean X) = 2.456 + 0.0426(52.602) = 4.699
Ŷ = -0.0426X + 4.699

When Hans Huessy and I first collected these data I was somewhat disheartened by how well we were doing (and to some extent I still am). We can take a measure in elementary school that is quickly filled out by the teacher, and make an excellent prediction about how the student will do in high school. That may be nice statistically, but I don’t think we like to feel that children are that locked in.

10.17 The correlation dropped to -.478 when I added and subtracted .04 from each Y value. This drop was caused by the addition of error variance.
One way to solve for the point at which they become equal is to plot a few predicted values and draw regression lines. Where the lines cross is the point at which they are equal. A more exact way is to set the two equations equal to each other and solve for X.

0.9X + 31 = 1.5X + 18

Collecting terms we get

31 - 18 = 1.5X - 0.9X
13 = 0.6X
X = 13/0.6 = 21.67

To check this, substitute 21.67 in both equations:

0.9(21.67) + 31 = 50.503
1.5(21.67) + 18 = 50.505

10.19 Weight as a function of height for males:

[Scatterplot of Weight against Height for Males, with the regression line.]

The regression solution that follows is a modification of printout from SPSS.

Equation Number 1   Dependent Variable.. WEIGHT
Variable(s) Entered on Step Number 1.. HEIGHT

Multiple R           .60368
R Square             .36443
Adjusted R Square    .35287
Standard Error     14.99167

Analysis of Variance
              DF   Sum of Squares   Mean Square
Regression     1       7087.79984    7087.79984
Residual      55      12361.25279     224.75005
F = 31.53637    Signif F = .0000

------------------ Variables in the Equation ------------------
Variable             B        SE B      Beta        T   Sig T
HEIGHT        4.355868     .775656   .603680    5.616   .0000
(Constant) -149.933617   54.916943            -2.730   .0085

b) The intercept is given as the “constant” and is -149.93, which has no interpretable meaning with these data. The slope of 4.356 tells us that a one-unit increase in height is associated with a 4.356 increase in weight.
c) The correlation is .60, telling us that for males 36% of the variability in weight is associated with variability in height.
d) Both the correlation and the slope are significantly different from 0, as shown by an F of 31.54 and an (equivalent) t of 5.616.

10.21 Predicting my own weight, for which I use the equation from Exercise 10.19:
Ŷ = 4.356*height - 149.93
Ŷ = 4.356*68 - 149.93 = 146.28
a) The residual is Y - Ŷ = 156 - 146.28 = 9.72. (I have gained some weight since I last used this example.)
b) If the students who supplied the data gave biased responses, then, to the degree that the data are biased, the coefficients are biased and the prediction will not apply accurately to me.

10.23 Predictions for a 5’6” male and female
For the male, Ŷ = 4.356*66 - 149.93 = 137.57
For a female, Ŷ = 2.578*66 - 44.859 = 125.29
Difference = 12.28 pounds

10.25 Plot of Reaction Time against Trials for only the Yes/5-stimuli trials:

[Scatterplot of Reaction Time against Trial.]

The following regression solution is a modification of SPSS printout.

Equation Number 1   Dependent Variable.. RXTIME
Variable(s) Entered on Step Number 1.. TRIAL

Multiple R           .01640
R Square             .00027
Adjusted R Square   -.02056
Standard Error     12.76543

Analysis of Variance
              DF   Sum of Squares   Mean Square
Regression     1          2.10363       2.10363
Residual      48       7821.89637     162.95617
F = .01291    Signif F = .9100

------------------ Variables in the Equation ------------------
Variable           B        SE B       Beta       T   Sig T
TRIAL       -.014214     .125100   -.016397   -.114   .9100
(Constant)  67.805186  28.267795              2.399   .0204

The slope is only -0.014, and it is not remotely significant. For this set of data we can conclude that there is not a linear trend for reaction times to change over time. From the scatterplot above we can see no hint that there is any nonlinear pattern, either.

10.27 The evils of television:

[Plot of NAEP scores against hours spent watching TV, separately for boys and girls.]

Regression equations:
Boys:  Ŷ = -4.821X + 283.61
Girls: Ŷ = -3.460X + 268.39

b) The slopes are roughly equal, given the few data points we have, with a slightly greater decrease with increased time for boys. The difference in intercepts reflects the fact that the line for the girls is about 9 points below that for boys.
c) Television cannot be used as an explanation for poorer scores in girls, because we see that girls score below boys even when we control for television viewing.
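The two reported regression lines in 10.27 can be compared directly: a short sketch (not part of the manual) that evaluates the boy-girl gap at each viewing level, showing that the gap shrinks slightly as hours increase because the slopes are not identical.

```python
# Compare the boys' and girls' regression lines from 10.27 at each viewing level.
def boys(hours):
    return -4.821 * hours + 283.61   # boys' fitted NAEP score

def girls(hours):
    return -3.460 * hours + 268.39   # girls' fitted NAEP score

for hours in range(7):
    gap = boys(hours) - girls(hours)  # predicted boy-girl difference
    print(hours, round(gap, 2))
```

At 0 hours the predicted gap is the difference in intercepts (15.22 points); by 6 hours it has narrowed somewhat because the boys' line falls faster.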
10.29 Draw a scattering of 10 data points and drop your pencil on it.
b) As you move the pencil vertically you are changing the intercept.
c) As you rotate the pencil you are changing the slope.
d) You can come up with a very good line simply by rotating and raising or lowering your pencil so as to make the deviations from the lines as small as possible. (We really minimize squared deviations, but I don’t expect anyone’s eyes to be good enough to do that.)

10.31 Galton’s data
a) The correlation is .459 and the regression equation is Ŷ = .646×midparent + 23.942. (Remember to weight cases by “freq”.)
b) I reran the regression requesting that SPSS save the Unstandardized prediction and residual.
c)
d) The children in the lowest quartile slightly exceed their parents’ mean (67.12 vs 66.66) and those in the highest quartile average slightly shorter than their parents (68.09 vs 68.31).
e) It is easiest if you force both axes to have the same range and specify that the regression line is Ŷ = 1×X + 0. (If you prefer, you can use an intercept of 0.22 to equate the means of the parents and children.)

Chapter 11-Multiple Regression

11.1 Predicting quality of life:
a) All other variables held constant, a difference of +1 degree in Temperature is associated with a difference of -.01 in perceived Quality of Life. A difference of $1000 in median income, again with all other variables held constant, is associated with a +.05 difference in perceived Quality of Life. A similar interpretation applies to b3 and b4. Since values of 0 cannot reasonably occur for all predictors, the intercept has no meaningful interpretation.
b) Ŷ = 5.37 - .01(55) + .05(12) + .003(500) - .01(200) = 4.92
c) Ŷ = 5.37 - .01(55) + .05(12) + .003(100) - .01(200) = 3.72

11.3 Religious Influence and religious Hope contribute significantly to the prediction, but not religious Involvement.
It is worth pointing out here that even though religious Involvement does not contribute significantly to the multiple regression, it does have a significant simple correlation with Optimism. The matrix of correlations (where N = 600; all off-diagonal ps = .000) is

            OPTIMISM  RELINVOL   RELINF  RELHOPE
OPTIMISM      1.0000     .1667    .2725    .2663
RELINVOL       .1667    1.0000    .4487    .5439
RELINF         .2725     .4487   1.0000    .4187
RELHOPE        .2663     .5439    .4187   1.0000

11.5 I would have speculated that religious Involvement was not a significant predictor because of its overlap with the other predictors, but the tolerances kick a hole in that theory to some extent. That’s what happens when you ask a question before you are sure of the answer.

11.7 Adjusted R² for 15 cases in Exercise 11.6: R² = .173

est R² = 1 - (1 - R²)(N - 1)/(N - p - 1) = 1 - (1 - .173)(14)/(15 - 4 - 1) = 1 - 1.158 = -.158

Since a squared value cannot be negative, we will declare it undefined. This is all the more reasonable in light of the fact that we cannot reject H0: R* = 0.

11.9 The multiple correlation between the predictors and the percentage of births under 2500 grams is .855. The incidence of low birthweight increases when there are more mothers under 17, when mothers have fewer than 12 years of education, and when mothers are unmarried. All of the predictors are associated with young mothers. (As the question noted, there are too few observations for a meaningful analysis of the variables in question.)

11.11 The multiple correlation between Depression and the three predictor variables was significant, with R = .49 [F(3,131) = 14.11, p = .0000]. Thus approximately 25% of the variability in Depression can be accounted for by variability in these predictors.
The results show us that depression among students who have lost a parent through death is positively associated with an elevated level of perceived vulnerability to future loss and negatively associated with the level of social support. The age at which the student lost his or her parent does not appear to play a role.

11.13 The fact that the frequency of the behavior was not a factor in reporting is an interesting finding. My first thought would be that it is highly correlated with the Offensiveness, and that Offensiveness is carrying the burden. But a look at the simple correlation shows that the two variables are correlated at less than r = .20.

11.15 Using random variables as predictors: I drew the following data directly from the random number tables in the appendix (and I didn’t cheat).

Y   X1  X2  X3  X4  X5
5    3   7   2   7   5
2    1   6   0   9   5
3    5   2   9   1   2
6    4   1   8   7   9
9    1   0   2   9   4
2    7   6   7   1   7
6    9   2   8   8   1
3    7   3   0   4   9
9    3   3   7   9   4
8    5   6   5   6   4

The multiple correlation for these data is .739, which is astonishingly high. Fortunately, the F test on the regression is not significant. Notice that we have only twice as many subjects as predictors.

This question is bound to lead to the question of how many cases we need per variable. There is no good answer to this question. Some will tell you that there should be at least 10 cases per predictor. I know of no argument in defense of such a rule. Harris (1985) has suggested a rule that says that N should exceed the number of predictors by at least 50. Cohen (1988) has argued from the point of view of power, and gives the example that a population correlation coefficient of .30 would require a sample size of 187 to have power = .80. This latter is sobering, but it is not a good argument here because we have not yet discussed power in any meaningful way.

11.17 Predicting weight:

11.19 The weighted average is 3.68, which is very close to the regression coefficient for Height when we control for Gender.
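Looking back at Exercise 11.7, the adjusted-R² formula used there is easy to sketch in code (this check is not part of the manual); it makes clear why the estimate goes negative when R² is small relative to the number of predictors.

```python
# Sketch of the adjusted (shrunken) R^2 estimate from 11.7.
def adjusted_r2(r2, n, p):
    """est R^2 = 1 - (1 - R^2)(N - 1)/(N - p - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

est = adjusted_r2(0.173, 15, 4)
print(round(est, 3))  # -0.158 -- negative, hence treated as undefined
```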
11.21 Sex is important to include in this relationship because women tend to be smaller than men, and thus probably have smaller, though not less effective, brains, but we probably don’t want that contamination in our data. However, note that Sex was not significant in the previous answer, though the sample size (and hence power) is low.

11.23 I could argue that PctSAT is a nuisance variable because we are not particularly interested in the variable itself, but only in controlling it to allow us to have a clearer view of Expend, which is the variable in which we are interested. At the same time, it is an important contributor to the prediction of Combined, but we are led away from noticing that because of our predominant interest in Expend.

11.25 The scatterplot follows and shows that the squared correlation is .434, which is just what we found from the regression solution.

[Scatterplot not reproduced here.]

Chapter 12—Hypothesis Tests Applied to Means: One Sample

12.1 Distribution of 100 random digits:

[Histogram of the frequencies of the digits 0-9.]

12.3 The mean and standard deviation of the sample are 4.1 and 2.82, respectively, which are reasonably close to the parameters of the population from which the sample was drawn (4.5 and 2.6, respectively). The mean of the distribution of means is 4.28, which is somewhat closer to the population mean, and the standard deviation is 1.22.
a) The Central Limit Theorem would predict a sampling distribution of the mean with a mean of 4.5 and a standard deviation of 2.6/√5 = 1.16.
b) These values are close to the values that we would expect.

12.5 If you had drawn 50 samples of size 15, the mean of the sampling distribution should still approximate the mean of the population, but the standard error of that distribution would now be only 2.67/√15 = 0.689.

12.7 Why doesn’t the previous question address the issue of the terrible state of North Dakota’s educational system?
These students are certainly not a random sample of high school students in North Dakota or elsewhere. Moreover, they scored above the mean of 500, which would certainly not be expected if North Dakota’s system were inadequate. In addition, there is no definition of what is meant by “a terrible state,” nor any idea of whether or not the SAT measures such a concept.

12.9 Unlike the results in the two previous questions, this interval probably is a fair estimate of the confidence interval for P/T ratio across the country. It is not itself biased by the bias in the sampling of SAT scores.

12.11 Weight gain exercise: For these data the mean weight gain was 3.01 pounds, with a standard deviation of 7.3 pounds. This gives us

t = (X̄ - μ)/(s/√N) = (3.01 - 0)/(7.3/√29) = 3.01/1.357 = 2.22

With 28 df the critical value at α = .05, two-tailed, is 2.048, which will allow us to reject the null hypothesis and conclude that the girls gained weight at better than chance levels in this experiment.

There is an important movement within statistics right now in the direction of laying a much heavier emphasis on confidence limits than on null hypothesis tests. I think this is a very good example of a place where a behavioral scientist might make good use of a confidence interval. I didn’t ask you to calculate these limits, but they are 0.227 and 5.787. You should think about what these limits mean and about why they are useful.

12.13 Effect size measure for data in Exercise 12.11: One effect size measure would simply be the mean weight gain of 3.01 pounds. That statistic has real meaning to us, especially if we keep the size of a standard deviation in mind. A dubious alternative method would be to calculate an estimate of d̂ using the standard deviation of the gain scores as our base.

d̂ = X̄/s = 3.01/7.3 = 0.41

If I knew the standard deviation at baseline, that would make a good denominator.
Unfortunately that information is not available, and 7.3 is the standard deviation of weight gains, and it is difficult to see how that creates a reasonable metric.

12.15 I needed to solve for t in Exercise 12.14 because I did not know the population variance.

12.17 Testing the null hypothesis that children under stress report lower levels of anxiety:

t = (X̄ - μ)/(s/√N) = (11.00 - 14.55)/(6.085/√36) = -3.55/1.014 = -3.50

With 35 df the critical value of t at α = .05 (two-tailed) is ±2.03. We can reject H0 and conclude that children under stress show significantly lower levels of anxiety than normal children. Here is another situation where the data say that children report lower levels of anxiety, but it was necessary to first verify that their reports could be relied upon.

12.19 Yes, the results in Exercise 12.18 are consistent with the t test in Exercise 12.17. The t test showed that these children showed lower levels of anxiety than the normal population, and the confidence interval did not include 14.55.

Chapter 13—Hypothesis Tests Applied to Means: Two Related Samples

13.1 Sexual satisfaction of married couples. (Dependent variable = 1 for never fun and 4 for always fun.)

           Husband   Wife    Difference
Mean        2.725    2.791    -0.066
St. Dev.    1.165    1.080     1.298
N = 91; st. error of the difference = 0.136

t = (D̄ - 0)/(sD/√N) = -0.066/(1.298/√91) = -0.066/0.136 = -0.48

With 90 df the critical value of t is approximately ±1.98, so we cannot reject the null hypothesis. We have no reason to conclude that wives are more or less satisfied, on the average, than their husbands. This is a matched-sample t because responses came from married couples. I would hope that there is some relationship between the sexual satisfaction of one member of the couple and the satisfaction of the other—but perhaps that is hoping for too much.

13.3 Scatterplot of data from Exercise 13.1: (The frequencies of each combination are shown above the points.)
The correlation is .33, which is significant at α = .05. This analysis finally addresses the degree of compatibility between couples, rather than mean differences. The correlation is significant, but it is not very large. That scatterplot is not very informative because of the discreteness of the scale and hence the overlapping of points.

13.5 The most important thing about a t test is the assumption that the mean (or difference between means) is normally distributed. Even though the individual values can only range over the integers 1 – 4, the mean of 91 subjects can take on a large number of possible values between 1 and 4. It is a continuous variable for all practical purposes, and can exhibit substantial variability. I drew 10,000 random samples of N = 91, treating the Husband scores as a population. The distribution of means follows.

13.7 We used a paired-t test for the data in Exercise 13.6 because the data were paired in the sense of coming from the same subject. Some subjects generally showed more beta-endorphins at any time than others, and we wanted to eliminate this subject-to-subject variability that has nothing to do with stress. In fact, there isn’t much of a relationship between the two measures, but we can’t fairly ignore it anyway. (Even though the correlation is not statistically significant, I think that we would look foolish if we did not treat these as paired data.)

13.9 If you look at the actual numbers given in Exercise 13.6, you would generally be led to expect that whatever was used to measure beta-endorphins was only accurate to the nearest half unit. Fair enough, but then where did values of 5.8 and 4.7 come from? If we can tell the difference to a tenth of a unit, why are most, but not all, of the scores reported to the nearest .5? It’s a puzzle.

13.11 You would not want to use a repeated measures design in any situation where the first measure will “tip off” or sensitize a subject to what comes next.
Thus if you are going to show a subject something and then ask him to recall it, the next time you show any item the subject will expect to have to recall it. Similarly we should be careful about repeated measures in drug studies because drugs often last surprisingly long in the body.

13.13 How many subjects do we need? First of all, in Exercise 13.6 we had 19 subjects, giving us 18 df. This means that for a one-tailed test at α = .01 we will need a t of at least 2.552 to be significant. So we can substitute everything we know about the data except for the N, and solve for N.

t = D̄/(sD/√N) = 7.70/(13.52/√N) ≥ 2.552

Therefore

√N ≥ (2.552)(13.52)/7.70 = 4.481
N ≥ 4.481² = 20.078, so we need 21 subjects.

This exercise should be a good lead in to power, because you should be able to see the logic of this without knowing a thing about power. But in the chapter on power we are really doing the same thing but disguising it behind a bunch of Greek symbols. (Well, perhaps that’s a bit unfair.) Notice that we had to guess at N to get the critical value of t before we could calculate the needed N. Using the sample size they had is a reasonable approximation.

13.15 As the correlation between the two variables increases, the standard error of the difference will decrease, and the resulting t will increase.

13.17 First guess versus average guess: Notice that this is the same t as we had in Exercise 13.12. This is because there is a perfect linear relationship between first, second, and average guesses. (If you know the first guess and the average, you can compute what the second guess must have been.)

13.19 If I had subtracted the Before scores from the After scores I would simply change the sign of the mean and the sign of the t. There would be no other effect.

13.21 There is no answer I can give for this question because it asks the students to design a study.
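The algebra in 13.13 can be checked with a few lines of code (a sketch, not part of the manual), using the mean difference (7.70), standard deviation (13.52), and critical t (2.552) given above.

```python
# Sketch of the required-N calculation in 13.13.
import math

mean_d, s_d, t_crit = 7.70, 13.52, 2.552

# Solve t = mean_d / (s_d / sqrt(N)) >= t_crit for sqrt(N):
sqrt_n = t_crit * s_d / mean_d
n = math.ceil(sqrt_n ** 2)   # round up to a whole subject

print(round(sqrt_n, 3))  # 4.481
print(n)                 # 21
```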
Chapter 14—Hypothesis Tests Applied to Means: Two Independent Samples

14.1 Reanalysis of Exercise 13.1 as if the observations were independent:

        Males    Females
Mean    2.725     2.791
s       1.165     1.080
N         91        91

t = (X̄1 - X̄2)/√(s1²/N1 + s2²/N2) = (2.725 - 2.791)/√(1.165²/91 + 1.080²/91) = -0.066/√.028 = -0.066/0.166 = -0.40

[t.05(180) = ±1.98] Do not reject the null hypothesis. We can conclude that we have no reason to doubt the hypothesis that males and females are equal with respect to sexual satisfaction. There was no need to pool the variances here because the sample sizes were equal. If we did pool them, the pooled variance would have been 1.262.

14.3 The difference between the t in Exercises 13.1 and 14.1 is small because the relationship between the two variables was so small.

14.5 Random assignment plays the role of assuring (as much as is possible) that there were no systematic differences between the subjects assigned to the two groups. Without random assignment it might be possible that those who signed up for the family therapy condition were more motivated, or had more serious problems, than those in the control group.

14.7 You cannot use random assignment to homophobic categories for a study like the study of homophobia because the group assignment is a property of the participants themselves. The lack of random assignment here will not invalidate the findings.

14.9 In Exercise 14.8 it could well have been that there was much less variability in the schizophrenic group than in the normal group because the number of TATs showing positive parent-child relationship could have had a floor effect at 0.0. This did not happen, but it is important to check for it anyway.
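The independent-samples t in 14.1 (no pooling, since the Ns are equal) can be reproduced from the summary statistics alone. A minimal sketch, not part of the manual:

```python
# Sketch of the unpooled independent-samples t from 14.1.
import math

m1, s1, n1 = 2.725, 1.165, 91   # males
m2, s2, n2 = 2.791, 1.080, 91   # females

se = math.sqrt(s1**2 / n1 + s2**2 / n2)  # standard error of the difference
t = (m1 - m2) / se

print(round(t, 2))  # about -0.40, far from the +/-1.98 cutoff
```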
14.11 Experimenter bias effect:

              Expect Good   Expect Poor
   Mean         18.778        17.625
   s             3.930         4.173
   N               9             8

   sp² = [(N1 − 1)s1² + (N2 − 1)s2²]/(N1 + N2 − 2)
       = [8(15.44) + 7(17.41)]/(9 + 8 − 2) = 16.359

   t = (X̄1 − X̄2)/√(sp²/N1 + sp²/N2)
     = (18.778 − 17.625)/√(16.359/9 + 16.359/8)
     = 1.153/√3.863 = 1.153/1.965 = 0.587          [t.05(15) = ±2.131]

Do not reject the null hypothesis. We cannot conclude that our data show the experimenter bias effect.

14.13 Confidence limits for Exercise 14.8:

   Mean difference = 1.45   standard error = 0.545   t.05(38) = 2.02

   CI.95 = (X̄1 − X̄2) ± t.05(38)·sX̄1−X̄2 = 1.45 ± 2.02(0.545) = 1.45 ± 1.10
   0.35 ≤ μ1 − μ2 ≤ 2.55

Note that the answers to Exercises 14.11 and 14.12 are in line with the hypothesis tests, in that when we rejected the null hypothesis the confidence limits did not include 0, and when we did not reject the null, they did include 0.

14.15 Comparing GPA for those with low and high ADDSC scores:

   X̄L = 2.59   sL² = 0.658   NL = 75
   X̄H = 1.68   sH² = 0.560   NH = 13

   sp² = [(N1 − 1)s1² + (N2 − 1)s2²]/(N1 + N2 − 2) = [74(0.658) + 12(0.560)]/(75 + 13 − 2) = 0.644

   t = (X̄1 − X̄2)/√(sp²/N1 + sp²/N2) = (2.59 − 1.68)/√(0.644/75 + 0.644/13)
     = 0.91/√.058 = 0.91/0.241 = 3.77          [t.05(86) = ±1.98]

Reject H0 and conclude that people with high ADDSC scores in elementary school have lower grade point averages in ninth grade than people with lower scores. Here I pooled the variances even though the Ns were substantially different because the variance estimates were so similar.

14.17 The answer to 14.15 tells you that ADDSC scores have significant predictability of grade point average several years later. Moreover the answer to Exercise 14.16 tells you that this difference is substantial. This is a nice example of a situation in which it is easy to see a test of means as a test of predictability.

14.19 Anger with a reason is just fine.
   X̄NoAttrib = 3.40   sNoAttrib = 2.0736   NNoAttrib = 41
   X̄Attrib = 5.02    sAttrib = 2.7556    NAttrib = 41

   sp² = [(n1 − 1)s1² + (n2 − 1)s2²]/(n1 + n2 − 2)
       = [(41 − 1)2.0736² + (41 − 1)2.7556²]/(41 + 41 − 2) = 5.9466

   t = (X̄NoAttrib − X̄Attrib)/√(sp²/N1 + sp²/N2)
     = (3.40 − 5.02)/√(5.9466/41 + 5.9466/41) = −1.62/0.538 = −3.01

The critical value is approximately 2.00, so we will reject the null hypothesis and conclude that when given a reason for a woman's anger, she is given more status than when no reason was given for the anger.

14.21 If the two variances are equal, they will be equal to the pooled variance. If you have a problem seeing this, you can take any two equal variances and unequal Ns and try it for yourself. The answer becomes obvious when you do.

Chapter 15—Power

15.1 The statement on skiing is intended to point out that just because two things are different doesn't mean that the larger (better, greater, etc.) one will always come out ahead. To take a different example, one treatment might be better than another for anorexia, but I would be very surprised if the difference was statistically significant every time, or even that its mean was always greater than the other mean. I just hope that it is significant most of the time.

15.3 Power for socially desirable responses: Assume the population mean = 4.39 and the population standard deviation = 2.61.

a) Effect size:  d = (μ1 − μ0)/σ = (4.39 − 3.87)/2.61 = 0.52/2.61 = 0.20

b) delta:  δ = d√N = 0.20√36 = 1.20

c) power = .22

Notice that the value of δ here is exactly the same as the value of t in that example. This is as it should be.

15.5 For Exercise 15.3 we would need δ approximately equal to 2.50, 2.80, and 3.25 for power of .70, .80, and .90, respectively. Since δ = d√N, N = (δ/d)²:

   N = (2.50/.20)² = 156.25
   N = (2.80/.20)² = 196
   N = (3.25/.20)² = 264.06

Notice how quickly the required sample sizes increase, and how, as the desired power increases, the required N increases faster and faster.
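The Chapter 15 machinery used in these answers (effect size d, the noncentrality parameter δ, and the N needed for a target δ) can be sketched in Python. This is my addition, not the text's, and the function names are mine:

```python
import math

# Chapter 15 machinery in code (sketch; all names are mine):
# d = effect size, delta = d adjusted for sample size, N = (delta/d)^2.
def delta_one_sample(d, n):
    return d * math.sqrt(n)

def delta_two_sample(d, n1, n2):
    nh = 2 * n1 * n2 / (n1 + n2)        # harmonic mean of the two sample sizes
    return d * math.sqrt(nh / 2)

def n_required(delta_target, d):
    return (delta_target / d) ** 2

d = (4.39 - 3.87) / 2.61                       # Exercise 15.3: d = .20
print(round(d, 2), round(delta_one_sample(0.20, 36), 2))   # 0.2 1.2
print(round(n_required(2.80, 0.20)))           # 196 subjects for power = .80
print(round(delta_two_sample(0.50, 20, 15), 2))            # 1.46 (Exercise 15.11)
```

The same three functions reproduce the arithmetic of 15.3, 15.5, and the unequal-n case of 15.11.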
15.7 Diagram of Exercise 15.6: [Figure: the two sampling distributions, centered at 2.75 and at 3.87, with the cutoff at 2.987 = 3.87 − 2.03(2.61)/√36.]

15.9 Avoidance behavior in rabbits using a one-sample t test:

a) For power = .50 we need δ = 1.95.

   d = (μ1 − μ0)/σ = (5.8 − 4.8)/2 = 1.0/2 = .50
   N = (δ/d)² = (1.95/.50)² = 15.21

b) For power = .80 we need δ = 2.80.

   N = (δ/d)² = (2.80/.50)² = 31.36

Because subjects come in whole units, we would need 16 subjects for power = .50 and 32 subjects for power = .80.

15.11 Avoidance behavior in rabbits with unequal sample sizes:

   d = .50
   Nh = 2N1N2/(N1 + N2) = 2(20)(15)/(20 + 15) = 17.14
   δ = d√(Nh/2) = .50√(17.14/2) = 1.46

With δ = 1.46, power = .31.

15.13 Cognitive development of LBW and normal babies at 1 year—modified data:

a) Power calculations:

   d = (μ1 − μ2)/σ = (25 − 28)/8 = −.375
   δ = d√(N/2) = −.375√(20/2) = −1.19

With δ = −1.19, power = .22.

b) t test:

   t = (X̄1 − X̄2)/√(sp²/N1 + sp²/N2) = (25 − 28)/√(64/20 + 64/20) = −1.19          [t.05(38) = ±2.02]

Do not reject the null hypothesis.

c) The t is numerically equal to δ, although t is calculated from statistics and δ is calculated from parameters. In other words, δ is equal to the t we would get if the data came out with statistics equal to the parameters.

15.15 The significant t with the smaller N is more impressive, because that test had less power than the other, so the underlying difference is probably greater. The fact that a significant difference with a small N is more impressive should not lead you to conclude that small sample sizes are to be preferred.

15.17 Social awareness of ex-delinquents—which subject pool would be better to use?

         College   Normal   H.S. Dropout
   X̄        38       35          30
   N        50      100          25

For the comparison with the Normal group:  Nh = 2(50)(100)/(50 + 100) = 66.67
For the comparison with the Dropout group: Nh = 2(50)(25)/(50 + 25) = 33.33

Since δ = [(X̄1 − X̄2)/σ]√(Nh/2), and σ is the same for both comparisons, we can compare (38 − 35)√(66.67/2) = 17.32 with (38 − 30)√(33.33/2) = 32.66. Assuming equal standard deviations, the H.S. dropout group of 25 would result in a higher value of δ, and therefore higher power.

15.19 Total Sample Sizes Required for Power = .60, α = .05, Two-Tailed (δ = 2.20)

   Effect Size                Small   Medium   Large
   d                           0.20     0.50    0.80
   One-Sample t                 121       20       8
   Two-Sample t (per group)     242       39      16
   Two-Sample t (overall)       484       78      32

15.21 When can power = .50?
The mean under H1 should fall at the critical value under H0. The question implies a one-tailed test. Thus the mean is 1.645 standard errors above μ0, which is 100.

   μ = 100 + 1.645·σ/√N = 100 + 1.645(15)/√25 = 104.935

When μ = 104.935, power would equal .50.

15.23 The power of the comparison of TATs of parents of schizophrenic and normal subjects:

   sp² = (3.523 + 2.412)/2 = 2.968;   sp = √2.968 = 1.723
   d = (μ1 − μ2)/sp = (3.55 − 2.10)/1.723 = 1.45/1.723 = 0.842
   δ = d√(N/2) = 0.842√(20/2) = 2.66

Power = .75.

15.25 Aronson's research on stereotype threat:

   sp² = [(n1 − 1)s1² + (n2 − 1)s2²]/(n1 + n2 − 2)
       = [(11 − 1)3.17² + (12 − 1)3.02²]/(11 + 12 − 2) = 9.56
   sp = 3.09

   d̂ = (X̄1 − X̄2)/sp = (9.64 − 6.58)/3.09 = 0.99

   nh = 2n1n2/(n1 + n2) = 2(11)(12)/(11 + 12) = 11.48

   δ = d̂√(nh/2) = 0.99√(11.48/2) = 0.99(2.396) = 2.37

From Appendix D5 the power of this experiment, if these are accurate estimates of the parameters, is .658.

Chapter 16—One-way Analysis of Variance

I am assuming that most people would prefer to see the solutions to these problems as computer printout. (I will use SPSS for consistency.)

16.1 Analysis of Eysenck's data:

a) The analysis of variance:

   ----- ONEWAY -----
   Variable RECALL  By Variable GROUP (Group Membership)

   Source            D.F.   Sum of Squares   Mean Squares   F Ratio   F Prob.
   Between Groups      1        266.4500       266.4500     25.2294    .0001
   Within Groups      18        190.1000        10.5611
   Total              19        456.5500

   Group   Count   Mean      Standard Deviation   Standard Error   95 Pct Conf Int for Mean
   Grp 1     10    19.3000        2.6687               .8439        17.3909 TO 21.2091
   Grp 2     10    12.0000        3.7417              1.1832         9.3234 TO 14.6766
   Total     20    15.6500        4.9019              1.0961        13.3558 TO 17.9442

b) t test:

   t-tests for Independent Samples of GROUP (Group Membership)

   Variable   Number of Cases   Mean      SD      SE of Mean
   RECALL
     Young          10          19.3000   2.669      .844
     Older          10          12.0000   3.742     1.183

   Mean Difference = 7.3000
   Levene's Test for Equality of Variances: F = .383   P = .544

   t-test for Equality of Means
   Variances   t-value     df     2-Tail Sig   SE of Diff   95% CI for Diff
   Equal         5.02      18        .000        1.453      (4.247, 10.353)
   Unequal       5.02      16.27     .000        1.453      (4.223, 10.377)

Notice that if you square the t value of 5.02 you obtain 25.20, which is the same as the F in the analysis of variance. Notice also that the analysis of variance procedure produces confidence limits on the means, whereas the t procedure produces confidence limits on the difference of means.

16.3 Expanding on Exercise 16.2:

a) Combine the Low groups together and the High groups together:

   Variable RECALL  By Variable LOWHIGH

   Source            D.F.   Sum of Squares   Mean Squares   F Ratio   F Prob.
   Between Groups      1        792.1000       792.1000     59.4505    .0000
   Within Groups      38        506.3000        13.3237
   Total              39       1298.4000

   Group   Count   Mean      Standard Deviation   Standard Error   95 Pct Conf Int for Mean
   Grp 1     20     6.7500        1.6182               .3618         5.9927 TO  7.5073
   Grp 2     20    15.6500        4.9019              1.0961        13.3558 TO 17.9442
   Total     40    11.2000        5.7699               .9123         9.3547 TO 13.0453

Here we have compared recall under conditions of Low versus High processing, and can conclude that higher levels of processing lead to significantly better recall.
b) The answer is still a bit difficult to interpret because both groups contain both younger and older subjects, and it is possible that the effect holds for one age group but not for the other.

16.5 η² and ω² for the data in Exercise 16.1:

   SSgroup = 266.45   SStotal = 456.55   MSerror = 10.564   k = 2

   η² = SSgroup/SStotal = 266.45/456.55 = .58

   ω² = [SSgroup − (k − 1)MSerror]/(SStotal + MSerror)
      = [266.45 − (2 − 1)10.564]/(456.55 + 10.564) = 255.886/467.114 = .55

16.7 Foa et al. (1991) study:

a)
   Group    n    Mean    S.D.    Variance    Total
   SIT     14   11.07    3.95     15.6025     155
   PE      10   15.40   11.12    123.6544     154
   SC      11   18.09    7.13     50.8369     199
   WL      10   19.50    7.11     50.5521     195
   Total   45   15.622                        703

   X̄.. = 703/45 = 15.622

   SStreat = Σnj(X̄j − X̄..)²
           = 14(11.07 − 15.622)² + 10(15.40 − 15.622)² + 11(18.09 − 15.622)² + 10(19.50 − 15.622)²
           = 507.840

   MSerror = Σ(ni − 1)si²/Σ(ni − 1)
           = [13(15.6025) + 9(123.6544) + 10(50.8369) + 9(50.5521)]/41 = 55.587

   SSerror = Σ(ni − 1)·MSerror = 41(55.587) = 2279.067

From these values we can fill in the complete summary table and compute the F value.

   Source      df      SS         MS        F
   Treatment    3     507.840   169.280    3.04
   Error       41    2279.067    55.587
   Total       44    2786.907
                                        [F.05(3,41) = 2.84]

We can reject the null hypothesis and conclude that there are significant differences between groups. Some treatments are more effective than others.

b) [Figure: bar graph of the mean number of reported symptoms for the SIT, PE, SC, and WL treatment conditions.]

c) It would appear that the more interventionist treatments lead to fewer symptoms than the less interventionist ones, although we would have to run multiple comparisons to tell exactly which groups are different from which other groups.

16.9 If the sample sizes in Exercise 16.7 were twice as large, that would double the SStreat and MStreat. However it would have no effect on MSerror, which is simply the average of the group variances. The result would be that the F value would be doubled.

16.11 Effect size for tests in Exercise 16.10. It only makes sense to calculate an effect size for significant comparisons in this study, so we will deal with SIT versus SC.
   d̂ = (X̄SC − X̄SIT)/√MSerror = (18.09 − 11.07)/√55.579 = 7.02/7.455 = 0.94

The SIT group is nearly a full standard deviation lower in symptoms when compared to the SC group, which is a control group.

16.13 ANOVA on GPAs for the ADDSC data:

   Variable GPA  By Variable Group

   Source            D.F.   Sum of Squares   Mean Squares   F Ratio   F Prob.
   Between Groups      2        22.5004        11.2502      22.7362    .0000
   Within Groups      85        42.0591          .4948
   Total              87        64.5595

   Group   Count   Mean     Standard Deviation   Standard Error   95 Pct Conf Int for Mean
   Grp 1     14    3.2536        .5209                .1392        2.9528 TO 3.5543
   Grp 2     49    2.5920        .6936                .0991        2.3928 TO 2.7913
   Grp 3     25    1.7436        .8020                .1604        1.4125 TO 2.0747
   Total     88    2.4563        .8614                .0918        2.2737 TO 2.6388

There is a significant difference between the groups, telling us that there is a relationship between ADDSC score in elementary school and the GPA the student has in 9th grade. From the means it is clear that the GPA declines as the ADDSC score increases. These are real data, and they tell us that a teacher in elementary school can already pick out those students who will do well and badly in high school. I have always found these results depressing and worrisome, even though psychologists are supposed to like to be able to predict. There are some things I wish weren't so predictable.

16.15 Analysis of Darley and Latané data:

   Group    n    Mean    Total
   1       13    0.87    11.31
   2       26    0.72    18.72
   3       13    0.51     6.63
   Total   52            36.66

   X̄.. = 36.66/52 = 0.705

   SStreat = Σnj(X̄j − X̄..)²
           = 13(0.87 − 0.705)² + 26(0.72 − 0.705)² + 13(0.51 − 0.705)² = 0.8541

   MSerror = 0.053 (given in text)

   SSerror = Σ(ni − 1)·MSerror = 49(0.053) = 2.597

From these values we can fill in the complete summary table and compute the F value.

   Source      df     SS       MS       F
   Treatment    2    0.854    0.427    8.06
   Error       49    2.597    0.053
   Total       51    3.451
                                    [F.05(2,49) = 3.18]

We can reject the null hypothesis and conclude that subjects are less likely to summon help quickly if there are other bystanders around.

16.17 Bonferroni test on data in Exercise 16.2: Both of these comparisons will be made using t tests.
The means are given in Exercise 16.2 above.

   t = (X̄i − X̄j)/√(MSerror/ni + MSerror/nj)

For Young/Low versus Old/Low:

   t = (6.5 − 7.0)/√(6.6278/10 + 6.6278/10) = −0.5/1.151 = −0.434

For Young/High versus Old/High:

   t = (19.3 − 12.0)/√(6.6278/10 + 6.6278/10) = 7.3/1.151 = 6.34

For 36 df for error and for 2 comparisons at a familywise error rate of α = .05, the critical value of t is 2.34. There is clearly not a significant difference between young and old subjects on tasks requiring little cognitive processing, but there is a significant difference for tasks requiring substantial cognitive processing. The probability that at least one of these statements represents a Type I error is at most .05. It is worth pointing out that when we are using MSerror as our variance estimate, and have equal sample sizes, the computations are very simple because we only need to calculate the denominator once.

16.19 Effect size for WL versus SIT:

   d̂ = (X̄WL − X̄SIT)/sWL = (19.50 − 11.07)/7.11 = 8.43/7.11 = 1.18

The two groups differ by over a standard deviation.

16.21 Spilich et al. data on a cognitive task:

   Variable ERRORS  By Variable SMOKEGRP

   Source            D.F.   Sum of Squares   Mean Squares   F Ratio   F Prob.
   Between Groups      2      2643.3778      1321.6889      4.7444    .0139
   Within Groups      42     11700.4000       278.5810
   Total              44     14343.7778

   Group   Count   Mean      Standard Deviation   Standard Error   95 Pct Conf Int for Mean
   Grp 1     15    28.8667       14.6866              3.7921        20.7335 TO 36.9998
   Grp 2     15    39.9333       20.1334              5.1984        28.7838 TO 51.0828
   Grp 3     15    47.5333       14.6525              3.7833        39.4191 TO 55.6476
   Total     45    38.7778       18.0553              2.6915        33.3534 TO 44.2022

Here we have a task that involves more cognitive involvement, and it does show a difference due to smoking condition. The non-smokers performed with fewer errors than the other two groups, although we will need to wait until the next exercise to see the multiple comparisons.

16.23 Spilich et al. data on driving simulation:

   Variable ERRORS  By Variable SMOKEGRP

   Source            D.F.   Sum of Squares   Mean Squares   F Ratio   F Prob.
   Between Groups      2       437.6444       218.8222      9.2584    .0005
   Within Groups      42       992.6667        23.6349
   Total              44      1430.3111

   Group   Count   Mean     Standard Deviation   Standard Error   95 Pct Conf Int for Mean
   Grp 1     15    2.3333        2.2887               .5909        1.0659 TO  3.6008
   Grp 2     15    6.8000        5.4406              1.4048        3.7871 TO  9.8129
   Grp 3     15    9.9333        6.0056              1.5506        6.6076 TO 13.2591
   Total     45    6.3556        5.7015               .8499        4.6426 TO  8.0685

Here we have a case in which the active smokers again performed worse than the non-smokers, and the differences are significant.

16.25 Attractiveness of faces:

a) The research hypothesis would be the hypothesis that faces averaged over more photographs would be judged more attractive than faces averaged over fewer photographs.

b) Data analysis

c) Conclusions: The group means are significantly different. From the descriptive statistics we can see that the means consistently rise as we increase the number of faces over which the composite was created.

Chapter 17—Factorial Analysis of Variance

17.1 Thomas and Wang (1996) study:

a) This design can be characterized as a 3 × 2 factorial, with 3 levels of Strategy and 2 levels of Delay.

b) I would expect that recall would be better when subjects generated their own key words, and worse when subjects were in the rote learning condition. I would also expect better recall for the shorter retention interval. (But what do I know?)
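The protected t used for the Bonferroni comparisons in 16.17 (a common MSerror in the denominator) reduces to a one-liner. A Python sketch, my addition with my own function name:

```python
import math

# Multiple-comparison t using a common MSerror (sketch; the name is mine).
def protected_t(mi, mj, ms_error, ni, nj):
    return (mi - mj) / math.sqrt(ms_error / ni + ms_error / nj)

# Young/High vs. Old/High recall means from 16.17, with MSerror = 6.6278
t = protected_t(19.3, 12.0, 6.6278, 10, 10)
print(round(t, 2))  # 6.34
```

With equal ns, the denominator is computed once and reused for every pair, which is the point made at the end of 16.17.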
c)
   Summaries of RECALL by levels of STRATEGY and DELAY

   Variable                  Mean        Std Dev    Cases
   For Entire Population     11.602564   7.843170     78
   STRATEGY 1                 9.461538   6.906407     26
     DELAY 1                 14.923077   5.330127     13
     DELAY 2                  4.000000   2.516611     13
   STRATEGY 2                11.269231   9.606488     26
     DELAY 1                 20.538462   1.983910     13
     DELAY 2                  2.000000   1.471960     13
   STRATEGY 3                14.076923   6.183352     26
     DELAY 1                 15.384615   5.454944     13
     DELAY 2                 12.769231   6.796492     13

17.3 Analysis of variance on data in Exercise 17.1:

   RECALL by STRATEGY, DELAY (UNIQUE sums of squares; all effects entered simultaneously)

   Source of Variation     Sum of Squares   DF   Mean Square       F     Sig of F
   Main Effects                2510.603      3      836.868     42.992     .000
     STRATEGY                   281.256      2      140.628      7.224     .001
     DELAY                     2229.346      1     2229.346    114.526     .000
   2-Way Interactions           824.538      2      412.269     21.179     .000
     STRATEGY DELAY             824.538      2      412.269     21.179     .000
   Explained                   3335.141      5      667.028     34.267     .000
   Residual                    1401.538     72       19.466
   Total                       4736.679     77       61.515

There are significant differences due to both Strategy and Delay, but, more importantly, there is a significant interaction. This is a good example for showing all three effects. The Delay and Interaction effects are obvious, but the overall Strategy effect is harder to see. You would do well to calculate the Strategy means, which are 9.46, 11.27, and 14.08, respectively. It will help if you draw those means on the figure for Exercise 17.2.
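Each F in the summary table above is simply the effect's mean square divided by the residual mean square, and that is easy to verify. A quick Python check (my addition; the function name is mine):

```python
# Recompute F ratios in the 17.3 summary table from SS, df, and MSresidual.
def f_ratio(ss_effect, df_effect, ms_residual):
    return (ss_effect / df_effect) / ms_residual

f_strategy = f_ratio(281.256, 2, 19.466)
f_interaction = f_ratio(824.538, 2, 19.466)
print(round(f_strategy, 3), round(f_interaction, 3))  # 7.224 21.179
```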
17.5 Bonferroni tests to clarify simple effects for data in Exercise 17.4:

   t = (X̄i − X̄j)/√(MSerror/ni + MSerror/nj)

For the data at the 5-minute delay (MSerror = 20.7009):

   Generated versus Provided: t = (14.92 − 20.54)/√(20.7009/13 + 20.7009/13) = −5.62/1.784 = −3.15
   Generated versus Rote:     t = (14.92 − 15.38)/√(20.7009/13 + 20.7009/13) = −0.46/1.784 = −0.26
   Provided versus Rote:      t = (20.54 − 15.38)/√(20.7009/13 + 20.7009/13) = 5.16/1.784 = 2.89

For the data at the 2-day delay (MSerror = 18.2308):

   Generated versus Provided: t = (4.00 − 2.00)/√(18.2308/13 + 18.2308/13) = 2.00/1.674 = 1.19
   Generated versus Rote:     t = (4.00 − 12.77)/√(18.2308/13 + 18.2308/13) = −8.77/1.674 = −5.24
   Provided versus Rote:      t = (2.00 − 12.77)/√(18.2308/13 + 18.2308/13) = −10.77/1.674 = −6.43

For 6 comparisons with 36 df, the critical value of t is 2.80. For the 5-minute delay, the condition with the key words provided by the experimenter is significantly better than both the condition in which the subjects generate their own key words and the rote learning condition. The latter two are not different from each other. For the 2-day delay, the rote learning condition is better than either of the other two conditions, which do not differ between themselves. We clearly see a different pattern of differences at the two delay conditions. The most surprising result (to me) is the superiority of rote learning with a 2-day interval.

In running these Bonferroni tests, I had a choice. I could have thought of each simple effect as a family of comparisons, and obtained the critical value of t with 3 comparisons for each. Instead I chose to treat the whole set of 6 comparisons as a family and adjust the Bonferroni for 6 tests. There is no hard and fast rule here, and many might do it the other way. The results would not change regardless of what I decided.

17.7 The results in the last few exercises have suggested to me that if I were studying for a Spanish exam, I would fall back on rote learning, painful as it sounds and as much against common wisdom as it is.
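All six comparisons in 17.5 reuse the same denominator within a delay condition, so they collapse to one helper function in code. A Python sketch (my addition; the name is mine):

```python
import math

# The t used for each Bonferroni comparison in 17.5: equal cell sizes, so the
# denominator is sqrt(2 * MSerror / n) for every pair (sketch; name is mine).
def simple_effect_t(mi, mj, ms_error, n_per_cell):
    return (mi - mj) / math.sqrt(2 * ms_error / n_per_cell)

# Generated vs. Provided at the 5-minute delay
t = simple_effect_t(14.92, 20.54, 20.7009, 13)
print(round(t, 2))  # -3.15
```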
17.9 In this experiment we have as many primiparous mothers as multiparous ones, which certainly does not reflect the population. Similarly, we have as many LBW infants as full-term ones, which is certainly not a reflection of reality. The mean for primiparous mothers is based on an equal number of LBW and full-term infants, which we know is not representative of the population of all primiparous births. Comparisons between groups are still legitimate, but it makes no sense to take the mean of all primiparous moms combined as a reflection of any meaningful population mean. Many of our experiments are run this way (with equal sample sizes across groups that are not equally represented in the population), and it is important to distinguish between the legitimacy of between-group comparisons and the legitimacy of combined means.

17.11 Simple effects versus t tests for Exercise 17.10.

a) If I had run a t test between those means my result would simply be the square root of the F = 1.328 that I obtained.

b) If I used MSerror for my estimated error term it would give me a t that is the square root of the F that I would have had if I had used the overall MSerror, instead of the MSerror obtained in computing the simple effect.

17.13 Analysis of variance for Spilich et al. study: The main effect of Task and the interaction are significant. The main effect of Task is of no interest because there is no reason why different tasks should be equally difficult. We don't care about the main effect of Smoking either because it is created by large effects for two levels of Task and no effect for the third. What is important is the interaction.

This is a good example of a situation in which main effects are of little interest. For example, saying that smoking harms performance is not really accurate. Smoking harms performance on some tasks, but not on others. Often main effects are still interpretable in the presence of an interaction, but not here.
17.15 Simple effects to clarify the Spilich et al. example. We have already seen these simple effects in Chapter 16, in Exercises 16.18, 16.19, and 16.21.

17.17 Factorial analysis of the data in Exercise 16.2: Here we see that we have a significant effect due to age, with younger subjects outperforming older subjects, and a significant effect due to the level of processing, with better recall of material processed at a higher level. Most importantly, we have a significant interaction, reflecting the fact that there is no important difference between younger and older subjects for the task with low levels of processing, but there is a big difference when the task calls for a high level of processing—younger subjects seem to benefit more from that processing (or do more of it).

17.19 Nurcombe et al. study of maternal adaptation.

b) The program worked as intended and there was no interaction between groups and educational level.

17.21 Effect size for Level of Processing in Exercise 17.17:

   d̂ = (X̄Hi − X̄Low)/√MSerror = (15.65 − 6.75)/√6.628 = 8.90/2.574 = 3.46

This is a very large effect size, but the data show an extreme difference between the two levels of processing. I used the square root of MSerror here because that was in line with what I did in the text. But a good case could be made for adding the Age and interaction sums of squares back in and calculating a new error term. That would produce

   d̂ = (X̄Hi − X̄Low)/√MSerror(revised) = 8.90/√13.323 = 8.90/3.65 = 2.44

which is considerably smaller but still a very large effect.
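The two d̂ values in 17.21 differ only in the variance placed in the denominator, which a few lines of Python make explicit (my addition; the function name is mine):

```python
import math

# d-hat for 17.21 (sketch; name is mine): mean difference over an SD estimate,
# where the SD is the square root of whichever error variance you choose.
def cohens_d(m1, m2, variance):
    return (m1 - m2) / math.sqrt(variance)

d1 = cohens_d(15.65, 6.75, 6.628)    # using MSerror
d2 = cohens_d(15.65, 6.75, 13.323)   # using the revised (larger) error term
print(round(d1, 2), round(d2, 2))    # 3.46 2.44
```

The choice of denominator is a judgment call, exactly as the answer says; a larger error variance necessarily shrinks d̂.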
17.23 Set of data for a 2 × 2 design with no main effects but an interaction:

   Cell means:
            Col 1   Col 2
   Row 1      8      12
   Row 2     12       8

   [Figure: plot of the four cell means, with the two lines crossing.]

17.25 Magnitude of effect for Exercise 17.1. Summary table from Exercise 17.1:

   Source     df      SS         MS         F
   Strategy    2    281.256    140.628     7.224
   Delay       1   2229.346   2229.346   114.526
   S × D       2    824.538    412.269    21.179
   Error      72   1401.538     19.466
   Total      77   4736.679

   η²Strategy = SSStrategy/SStotal = 281.256/4736.679 = .06
   ω²Strategy = [SSStrategy − (s − 1)MSerror]/(SStotal + MSerror)
              = [281.256 − (3 − 1)19.466]/(4736.679 + 19.466) = .05

   η²Delay = SSDelay/SStotal = 2229.346/4736.679 = .47
   ω²Delay = [SSDelay − (d − 1)MSerror]/(SStotal + MSerror)
           = [2229.346 − (2 − 1)19.466]/(4736.679 + 19.466) = .46

   η²SD = SSSD/SStotal = 824.538/4736.679 = .17
   ω²SD = [SSSD − (s − 1)(d − 1)MSerror]/(SStotal + MSerror)
        = [824.538 − (3 − 1)(2 − 1)19.466]/(4736.679 + 19.466) = .16

17.27 Magnitude of effect for Exercise 17.13. Summary table from Exercise 17.13:

   Source      df      SS          MS          F
   Task         2   28661.526   14330.763   132.895
   SmokeGrp     2    1813.748     906.874      8.41
   T × S        4    1269.452     317.363     2.943
   Error      126   13587.200     107.835
   Total      134   45331.926

   η²Task = 28661.526/45331.926 = .63
   ω²Task = [28661.526 − (3 − 1)107.835]/(45331.926 + 107.835) = .63

   η²Smoke = 1813.748/45331.926 = .04
   ω²Smoke = [1813.748 − (3 − 1)107.835]/(45331.926 + 107.835) = .04

   η²TS = 1269.452/45331.926 = .03
   ω²TS = [1269.452 − (3 − 1)(3 − 1)107.835]/(45331.926 + 107.835) = .02

17.29 The two magnitude of effect measures (η² and ω²) will agree when the error term is small relative to the effect in question, and will disagree when there is a substantial amount of error relative to the effect. But notice that this is a comparison of MSerror and a sum of squares, and sums of squares can be large when there are many degrees of freedom for them. So to some extent, all other things equal, the two terms will be in closer agreement when there are several degrees of freedom for the treatment effect.
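The η² and ω² formulas in 17.25 and 17.27 are the same computation with different table entries, so one small function covers them all. A Python sketch (my addition; the name is mine):

```python
# eta^2 and omega^2 as computed in 17.25 (sketch; the function name is mine).
def eta_omega(ss_effect, df_effect, ss_total, ms_error):
    eta2 = ss_effect / ss_total
    omega2 = (ss_effect - df_effect * ms_error) / (ss_total + ms_error)
    return eta2, omega2

# The Delay effect from the 17.25 summary table
eta2, omega2 = eta_omega(2229.346, 1, 4736.679, 19.466)
print(round(eta2, 2), round(omega2, 2))  # 0.47 0.46
```

Because MSerror (19.466) is tiny relative to SSDelay, the two measures nearly agree here, which illustrates the point made in 17.29.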
17.31 You should restrict the number of simple effects you examine to those in which you are particularly interested (on a priori grounds), because the familywise error rate will increase as the number of tests increases. Although we routinely talk about familywise error rates with respect to multiple comparison procedures, they really apply whenever you run more than one test, whether you consider them tests on main effects and interactions, or tests on simple effects, or tests on multiple contrasts. A test is a test as far as the error rate is concerned.

   Source       df      SS        MS        F
   Gender        1    223.49    223.49    10.78
   Condition     1      1.35      1.35      <1
   G × C         1      0.69      0.69      <1
   Error        56   1161.44     20.74

Chapter 18—Repeated-Measures Analysis of Variance

18.1 Descriptive statistics on study of migraines:

   [Figure: plot of the weekly mean migraine severity scores for weeks 1-5.]

18.3 I would have liked to collect data from students on the use of pain killers and other ways of dealing with migraines. I might also like to have data on stress levels over time so that I could possibly rule out the effects of stress. Here again we are getting into issues of experimental design, which underlie all meaningful analyses. This design differs from the one in the "suggestions" section of the Resource Manual for Chapter 16. In that design we had separate groups tested at the different times.

18.5 Repeated-measures analysis of variance of data used in Exercise 18.4:

   Source     df      SS        MS        F
   Subjects    8    612.00
   Weeks       1    554.50    554.50    14.424
   Error       8    302.00     37.75
   Total      17   1159.70
                                   [F.05(1,24) = 4.26]

There is a significant decrease in severity over time. F = t² = 3.798² = 14.424.

18.7 Effect size for Exercise 18.4: We will use the square root of MSerror as our estimate of the standard deviation, because this is a standard deviation corrected for any differences due to subject effects.
   d̂ = (X̄0 − X̄3)/√MSerror = (20.00 − 9.00)/√10.22 = 11.00/3.20 = 3.44

The decrease in severity from baseline to training is a reduction of approximately three and one half standard deviations. (I used the standard deviation of the baseline scores in line with what I said in the text.)

18.9 d̂ for difference in Exercise 18.8: I would standardize the difference in means using the square root of the average of the variances of the two baseline measures. This would leave individual differences as part of the standard deviation, which seems appropriate. The average variance is 77.97, so the standard deviation is 8.83.

   d̂ = (X̄baseline − X̄training)/s = (20.39 − 7.19)/8.83 = 13.20/8.83 = 1.49

On average, the severity of headaches decreased by nearly 1.50 standard deviations from baseline to training.

18.11 Exercise 18.10 tested the null hypothesis that condom use did not change over time. We would have hoped to see that the intervention worked and that condom use increased, but that was not what we found. There was an increase, but it was not significant.

18.13 It would appear that without the intervention, condom use would actually have declined. This suggests that the intervention may have prevented that decline, in which case that non-significant result is actually a positive finding.

18.15 Bonferroni t tests to compare the beginning and end of Baseline, and the beginning and end of Training for the data in Table 18.1. We can use a standard t test because the error term has been corrected by the repeated-measures analysis of variance, which has already removed between-subject variability. The Bonferroni alpha level would be .05/3 = .0167. We will reject all of the null hypotheses because each p value is less than .0167.
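The standardization in 18.9 (difference in means over the square root of the average baseline variance) takes only a couple of lines. A Python check of the arithmetic (my addition, not the text's):

```python
import math

# 18.9: standardize the baseline-to-training change by the square root of the
# average of the two baseline variances (sketch; the variable names are mine).
avg_variance = 77.97
sd = math.sqrt(avg_variance)
d = (20.39 - 7.19) / sd
print(round(sd, 2), round(d, 2))  # 8.83 1.49
```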
Chapter 19—Chi-Square

19.1 Popularity of psychology professors:

             Anderson   Klansky   Kamm   Total
   Observed     25         32      10      67
   Expected     22.3       22.3    22.3    67

   χ² = Σ(O − E)²/E = (25 − 22.3)²/22.3 + (32 − 22.3)²/22.3 + (10 − 22.3)²/22.3 = 11.33
                                                              [χ².05(2) = 5.99]

We will reject the null hypothesis and conclude that students do not enroll at random.

19.3 Sorting one-sentence characteristics into piles:

              1     2     3     4     5    Total
   Observed    8    10    20     8     4     50
   Expected    5    10    20    10     5     50
   Exp. %    10%   20%   40%   20%   10%   100%

   χ² = Σ(O − E)²/E = (8 − 5)²/5 + (10 − 10)²/10 + (20 − 20)²/20 + (8 − 10)²/10 + (4 − 5)²/5 = 2.4
                                                              [χ².05(4) = 9.49]

Do not reject the null hypothesis that my daughter's sorting behavior is in line with my theory. Notice that here is a case where my theory is aligned with the null hypothesis.

19.5 Racial choice in dolls (Clark & Clark, 1939):

              Black   White   Total
   Observed     83     169     252
   Expected    126     126     252

   χ² = Σ(O − E)²/E = (83 − 126)²/126 + (169 − 126)²/126 = 29.35
                                                              [χ².05(1) = 3.84]

We can reject H0 and conclude that the children did not choose dolls at random, but chose white dolls more often than black. This may be an old study, but it is a very important one. It played a role in Brown vs. Board of Education. It was used to argue the case that separate but equal was not equal, and that it was destructive to the self-esteem of black children.

19.7 Combining the two experiments:

            Black         White          Total
   1939     83 (106.42)   169 (145.58)    252
   1970     61 (37.58)     28 (51.42)      89
   Total   144            197             341

   χ² = Σ(O − E)²/E
      = (83 − 106.42)²/106.42 + (169 − 145.58)²/145.58 + (61 − 37.58)²/37.58 + (28 − 51.42)²/51.42
      = 34.184
                                                              [χ².05(1) = 3.84]

Reject the null hypothesis and conclude that the distribution of choices between Black and White dolls was different in the two studies. Choice is not independent of the study, and could easily be related to the time at which the studies were run. We are no longer asking whether one color of doll is preferred over the other color, but whether the pattern of preference is constant across studies. In analysis of variance terms we are dealing with an interaction.
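The goodness-of-fit χ² used throughout this chapter is a one-line sum. A Python sketch (my addition; the name is mine), entering the expected values as the manual rounds them (67/3 = 22.3):

```python
# Goodness-of-fit chi-square, as in 19.1 (sketch; the function name is mine).
def chi_square(observed, expected):
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

chi2 = chi_square([25, 32, 10], [22.3, 22.3, 22.3])
print(round(chi2, 2))  # 11.33
```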
19.9 There are several ways this study could be modified. We could simply rerun the present analysis by defining smokers and non-smokers on the basis of the partner's smoking behavior. Alternatively, we could redefine the Smoker variable as "neither," "mother," "father," or "both."

19.11 Howell and Huessy (1985) study of attention deficit disorder:

   Classification   Remedial English   Nonremedial English   Total
   Normal             22 (28.374)        187 (180.626)        209
   ADD                19 (12.626)         74 (80.374)          93
   Total              41                 261                   302

   χ² = Σ(O − E)²/E
      = (22 − 28.374)²/28.374 + (187 − 180.626)²/180.626 + (19 − 12.626)²/12.626 + (74 − 80.374)²/80.374
      = 5.38
                                                              [χ².05(1) = 3.84]

We can reject the null hypothesis and conclude that achievement level during high school varies as a function of performance during elementary school.

19.13 A one-way chi-square test on the data in the first column of Exercise 19.12 would be asking if the students are evenly distributed among the eight categories. What we really tested in Exercise 19.12 is whether that distribution, however it appears, is the same for those who later took remedial English as it is for those who later took nonremedial English.

19.15 Inescapable shock and implanted tumor rejection:

              Inescapable Shock   Escapable Shock   No Shock     Total
   Reject        8 (14.52)          19 (14.52)      18 (15.97)     45
   No Reject    22 (15.48)          11 (15.48)      15 (17.03)     48
   Total        30                  30              33             93

   χ² = Σ(O − E)²/E = (8 − 14.52)²/14.52 + (19 − 14.52)²/14.52 + ... + (15 − 17.03)²/17.03 = 8.85
                                                              [χ².05(2) = 5.99]

The ability to reject a tumor is affected by the shock condition. I like this example particularly because it makes it clear that psychological variables have very clear effects on physical health. We often say this, but here are some quite dramatic data.

19.17 This is another place where we see the important relationship between sample size and power.
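For contingency tables like the one in 19.11, each expected frequency is (row total)(column total)/N. A Python sketch of the whole computation (my addition; the names are mine):

```python
# Chi-square for a contingency table, with expected = row total * column
# total / N, as in 19.11 (sketch; the function name is mine).
def contingency_chi_square(table):
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    return sum((table[i][j] - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
               for i in range(len(rows)) for j in range(len(cols)))

chi2 = contingency_chi_square([[22, 187], [19, 74]])  # Normal/ADD by English class
print(round(chi2, 2))  # 5.38
```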
19.19 Testosterone and childhood delinquency (expected frequencies in parentheses):

                         Not Delinquent       Delinquent        Total
  High testosterone        366 (391.824)        80 (54.176)       446
  Normal testosterone     3554 (3528.176)      462 (487.824)     4016
  Total                   3920                 542               4462

  χ² = Σ(O − E)²/E
     = (366 − 391.824)²/391.824 + ... + (462 − 487.824)²/487.824
     = 15.57

  [χ².05(1) = 3.84]

a) These results show that there is a significant relationship between the two variables: χ²(1) = 15.57.

b) Testosterone levels in adults are related to the behavior of those individuals when they were children.

c) This result shows that we can tie the two variables (delinquency and testosterone) together historically. I would assume that people who have high testosterone levels now also had high levels when they were children, but that is just an assumption.

19.21 We could ask a series of similar questions, evenly split between "right" and "wrong" answers. We could then sort the replies into positive and negative categories and ask whether faculty were more likely than students to give negative responses.

19.23 Racial differences in desired weight gain. For white females, the odds of wishing to lose weight were 352/183 = 1.9235, meaning that white females are nearly twice as likely to wish to lose weight as to stay the same or gain weight. For African-American females, the corresponding odds are 47/52 = .9038. The odds ratio is 1.9235/.9038 = 2.1281. This means that the odds of wishing to lose weight were more than twice as high among white females as among African-American females.

19.25 Death penalty study: The chi-square statistic is clearly significant. Nonwhite defendants are sentenced to death at a significantly higher rate than white defendants.
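The odds-ratio arithmetic in Exercise 19.23 is worth spelling out, because odds ("lose" divided by "not lose") are easy to confuse with proportions ("lose" divided by the total). A short sketch with the counts given above:

```python
# Counts from Exercise 19.23: wish to lose weight vs. stay the same or gain.
lose_white, other_white = 352, 183
lose_black, other_black = 47, 52

odds_white = lose_white / other_white   # odds, not a proportion
odds_black = lose_black / other_black
odds_ratio = odds_white / odds_black    # about 2.13: white females' odds of
                                        # wishing to lose weight are roughly
                                        # twice those of African-American females
print(round(odds_white, 4), round(odds_black, 4), round(odds_ratio, 4))
```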
Chapter 20—Nonparametric and Distribution-Free Tests

20.1 Inferences in children's story summaries (McConaughy, 1980):

a) Analysis using the Mann-Whitney test (also known as Wilcoxon's rank-sum test):

  Younger children (N = 7)     Older children (N = 6)
  Raw score   Rank             Raw score   Rank
      0        1.5                 4         7.5
      1        3                   7        11.5
      0        1.5                 6        10
      3        6                   4         7.5
      2        4.5                 8        13
      5        9                   7        11.5
      2        4.5
  ΣR = 30                      ΣR = 61

  Ws = ΣR for the group with the smaller N = 61
  W's = 2W̄ − Ws = 84 − 61 = 23

  W's < Ws; therefore evaluate W's in Appendix E, doubling the probability level for a two-tailed test.

  W.025(6,7) = 27 > 23

b) Reject the null hypothesis and conclude that older children include more inferences in their summaries.

20.3 The analysis in Exercise 20.2 using the normal approximation:

  z = [Ws − n1(n1 + n2 + 1)/2] / sqrt[n1 n2 (n1 + n2 + 1)/12]
    = [53 − 9(9 + 11 + 1)/2] / sqrt[9(11)(9 + 11 + 1)/12]
    = −3.15

  Two-tailed p(|z| ≥ 3.15) = 2(.0009) = .0018 < .05

We will reject the null hypothesis and come to the same conclusion we came to earlier.

20.5 Hypothesis formation in psychiatric residents (Nurcombe & Fitzhenry-Coor, 1979):

a)
  Before   After   Diff.   Rank   Signed rank
    8        7      −1      2        −2
    4        9      +5      8.5      +8.5
    2        3      +1      2        +2
    2        6      +4      7        +7
    4        3      −1      2        −2
    8       10      +2      4.5      +4.5
    3        6      +3      6        +6
    1        7      +6     10       +10
    3        8      +5      8.5      +8.5
    9        7      −2      4.5      −4.5

  T+ = Σ(positive ranks) = 46.5
  T− = Σ(negative ranks) = 8.5
  T = smaller of |T+| or |T−| = 8.5
  n = 10
  T.05(10) = 8 < 8.5   Do not reject H0

b) We cannot conclude that we have evidence supporting the hypothesis that there is a reliable increase in hypothesis generation and testing over time. (Here is a case in which alternative methods of breaking ties could lead to different conclusions.) Here you might discuss how we could go about deciding how to break ties, putting the emphasis on a priori decisions.

20.7 Independence of first-born children:

a)
  First   Second   Diff.   Rank   Signed rank
   12       10      +2       4       +4
   18       12      +6      17.5    +17.5
   13       15      −2       4       −4
   17       13      +4      11      +11
    8        9      −1       1       −1
   15       12      +3       8       +8
   16       13      +3       8       +8
    5        8      −3       8       −8
    8       10      −2       4       −4
    2        7      −5      14      −14
    5        7      −2       4       −4
   15       13      +2       4       +4
   12        8      +4      11      +11
   13        8      +5      14      +14
    5        9      −4      11      −11
   14        8      +6      17.5    +17.5
   20       10     +10      20      +20
   19       14      +5      14      +14
   17       11      +6      17.5    +17.5
   18       12      +6      17.5    +17.5

  T+ = Σ(positive ranks) = 164
  T− = Σ(negative ranks) = 46
  T = smaller of |T+| or |T−| = 46
  n = 20
  T.05(20) = 52 > 46

b) We can reject the null hypothesis and conclude that first-born children are more independent than their second-born siblings. Here is a good example of where we would use a "matched sample" test even though the same children do not perform in both conditions (nor could they). We are assuming that brothers and sisters are more similar to each other than they are to other children. Thus if the first-born is particularly independent, we would guess that the second-born has a higher than chance expectation of also being independent. They share a common environment.

20.9 Data in Exercise 20.7 plotted as a function of the first-born's score:

  [Scatterplot omitted: the FirstBorn − SecondBorn difference on the vertical axis plotted against the FirstBorn score on the horizontal axis.]

The scatterplot shows that the difference between the pairs is heavily dependent upon the score of the first-born.

20.11 The Wilcoxon matched-pairs signed-ranks test tests the null hypothesis that paired scores were drawn from identical populations or from symmetric populations with the same mean (and median). The corresponding t test tests the null hypothesis that the paired scores were drawn from populations with the same mean and assumes normality. This is an illustration of the argument that you buy things with assumptions. By making the more stringent assumptions of a t test, we buy greater specificity in our conclusions. However, if those assumptions are false, we may have used an inappropriate test.

20.13 Rejection of the null hypothesis by a t test is a more specific statement than rejection using the appropriate distribution-free test because, by making assumptions about normality and homogeneity of variance, the t test refers specifically to population means—although it is also dependent on those assumptions.
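The mechanical part of these signed-ranks tests—ranking the absolute differences with ties averaged, then summing the positive and negative ranks separately—can be sketched in a few lines of plain Python. The helper name `tied_ranks` is just for illustration; the data are the Before/After scores from Exercise 20.5:

```python
def tied_ranks(values):
    """Rank values from 1 upward, giving tied values the average of
    the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

before = [8, 4, 2, 2, 4, 8, 3, 6, 3, 9]   # see note below
before = [8, 4, 2, 2, 4, 8, 3, 1, 3, 9]
after  = [7, 9, 3, 6, 3, 10, 6, 7, 8, 7]
diffs = [a - b for a, b in zip(after, before)]
ranks = tied_ranks([abs(d) for d in diffs])
t_plus  = sum(r for r, d in zip(ranks, diffs) if d > 0)
t_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
print(t_plus, t_minus)    # T is the smaller of the two sums
```

The same `tied_ranks` helper works for the rank-sum test in Exercise 20.1: pool the two groups, rank the pooled scores, and sum the ranks within each group.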
20.15 Truancy and home situation of delinquent adolescents, analyzed using the Kruskal-Wallis one-way analysis of variance:

  Natural home       Foster home        Group home
  Score   Rank       Score   Rank       Score   Rank
   15     18          16     19          10      9
   18     22          14     16          13     13.5
   19     24.5        20     26          14     16
   14     16          22     27          11     10
    5      4.5        19     24.5         7      6.5
    8      8            5      4.5        3      2
   12     11.5        17     20           4      3
   13     13.5        18     22          18     22
    7      6.5        12     11.5         2      1
  ΣRi = 124.5        ΣRi = 170.5        ΣRi = 83

  N = 27; n = 9 in each group

  H = [12 / (N(N + 1))] Σ(Ri²/ni) − 3(N + 1)
    = [12 / (27 × 28)] (124.5²/9 + 170.5²/9 + 83²/9) − 3(28)
    = 6.757

  [χ².05(2) = 5.99] We can reject the null hypothesis and conclude that the placement of these adolescents has an effect on truancy rates. This analysis doesn't directly answer the question the psychologist wanted answered, because he wanted to show that the group home was better than the others. He might follow this up with Mann-Whitney tests serving in the role of multiple comparison procedures, applying a Bonferroni correction (although it might be difficult to find the necessary critical values). Alternatively, he could just run a single Mann-Whitney test between the group home and the combined data of the other two placements.

20.17 The study in Exercise 20.16 has the advantage over the one in Exercise 20.15 in that it eliminates the influence of individual differences (differences in overall level of truancy from one person to another).

20.19 For the data in Exercise 20.5:

a) Analyzed by chi-square:

             More   Fewer   Total
  Observed     7      3       10
  Expected     5      5       10

  χ² = Σ(O − E)²/E = (7 − 5)²/5 + (3 − 5)²/5 = 1.60

  [χ².05(1) = 3.84] We cannot reject the null hypothesis.

b) Analyzed by Friedman's test:

  Before            After
  Score   Rank      Score   Rank
    8      2          7      1
    4      1          9      2
    2      1          3      2
    2      1          6      2
    4      2          3      1
    8      1         10      2
    3      1          6      2
    1      1          7      2
    3      1          8      2
    9      2          7      1
  Total   13         Total  17

  N = 10; k = 2

  χ²F = [12 / (Nk(k + 1))] ΣRi² − 3N(k + 1)
      = [12 / (10 × 2 × 3)] (13² + 17²) − 3(10)(3)
      = 1.60

  [χ².05(1) = 3.84] These are exactly equivalent tests.
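The Kruskal-Wallis computation in Exercise 20.15 is also easy to verify by machine. A plain-Python sketch, using the truancy scores from the table above (the `avg_rank` dictionary simply averages ranks across tied scores in the pooled data):

```python
natural = [15, 18, 19, 14, 5, 8, 12, 13, 7]
foster  = [16, 14, 20, 22, 19, 5, 17, 18, 12]
group   = [10, 13, 14, 11, 7, 3, 4, 18, 2]

# Rank the pooled scores, averaging across ties.
pooled = natural + foster + group
s = sorted(pooled)
avg_rank = {v: sum(i + 1 for i, x in enumerate(s) if x == v) / s.count(v)
            for v in set(s)}

n_total = len(pooled)
h = (12 / (n_total * (n_total + 1))
     * sum(sum(avg_rank[x] for x in g) ** 2 / len(g)
           for g in (natural, foster, group))
     - 3 * (n_total + 1))
print(round(h, 3))   # compare against chi-square(.05, 2) = 5.99
```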
20.21 "The mathematics of a lady tasting tea":

  First cup         Second cup        Third cup
  Score   Rank      Score   Rank      Score   Rank
    8      3          3      2          2      1
   15      3         14      2          4      1
   16      2         17      3         14      1
    7      3          5      2          4      1
    9      3          3      1          6      2
    8      2          9      3          4      1
   10      3          3      1          4      2
   12      3         10      2          2      1
  Total   22         Total  16         Total  10

  N = 8; k = 3

  χ²F = [12 / (Nk(k + 1))] ΣRi² − 3N(k + 1)
      = [12 / (8 × 3 × 4)] (22² + 16² + 10²) − 3(8)(4)
      = 9.00

  [χ².05(2) = 5.99] We can reject the null hypothesis and conclude that people don't really like tea made with used tea bags.

Chapter 21—Choosing the Appropriate Analysis

[N.B. Please review the disclaimer concerning these answers at the beginning of Chapter 21. There are many different ways to think about a study.]

21.1 This test involves comparing two proportions, and the easiest way to do that is to set up a 2 × 2 contingency table with Group on one dimension and Mastery on the other.

21.3 This is a repeated-measures analysis of variance with assessment periods as the repeated measure and condition as the between-subjects variable. If measurements were taken on several occasions I would like to see the data plotted over time, but all we currently have is the data at the end of the treatment phase.

21.5 This is a t test for two independent groups—children of divorced families and children of intact families.

21.7 This is a standard one-way analysis of variance. I would be most likely to conduct a planned comparison of the positive and negative conditions.

21.9 This is a two-way analysis of variance, with secure and insecure as one dimension and mortality vs. dental as the other. No multiple contrasts are called for because there are only two levels of each variable.

21.11 This could be treated as a two-way analysis of variance if we break the data down by race and by Afrocentric facial features. A problem with this is that we would presumably have more Afrocentric features for black inmates, which would lead to unequal sample sizes (i.e., an unbalanced design).
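Because each judge ranks only her own three cups, the Friedman computation in Exercise 20.21 reduces to ranking within rows and summing by column. A plain-Python sketch with the tea-tasting scores from the table above (there happen to be no ties within any judge's row, so a simple sort suffices):

```python
scores = [  # (first cup, second cup, third cup) for each of 8 judges
    (8, 3, 2), (15, 14, 4), (16, 17, 14), (7, 5, 4),
    (9, 3, 6), (8, 9, 4), (10, 3, 4), (12, 10, 2),
]
k = 3                       # conditions (cups)
n = len(scores)             # judges

rank_sums = [0] * k
for row in scores:
    for col, value in enumerate(row):
        # Rank within the judge: 1 = smallest score. No within-row ties here.
        rank_sums[col] += sorted(row).index(value) + 1

chi2_f = (12 / (n * k * (k + 1))) * sum(r * r for r in rank_sums) - 3 * n * (k + 1)
print(rank_sums, chi2_f)
```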
21.13 This is a regression problem where time is one variable and the difference in happiness between liberals and conservatives (by year) is the other variable.

21.15 The most important thing to do would be to plot the data over time looking for trends. A repeated-measures analysis of variance would tell you if differences are significant, but it is the direction of differences, and whether they return to baseline, that is likely to be most informative. The authors further broke down the participants in terms of their preoccupation with 9/11 and looked at differences between those groups. Interestingly, even the least preoccupied group showed changes over time.

21.17 This is a difficult one, partly because it depends on what Payne wants to know. I assume that she wants to know how rankings of characteristics agree across sexes or across years. She could first find the mean rank assigned to each characteristic separately for each sex and year. Because the raw data were originally ranks, I would probably be inclined to then rank these mean values. She could then calculate Spearman's rS between males and females for each year, or between years within each sex. The correlations would be obtained for the ten pairs of scores (one per characteristic).

21.19 This is a 2 × 4 analysis of variance with two levels of sex and four levels of occupation. The major emphasis is on the occupations, so multiple comparisons of those means would be appropriate.

21.21 There are two independent groups in this experiment. The authors should use a Mann-Whitney test to compare average locus of control scores.

21.23 This is a situation for a chi-square goodness-of-fit test. The conditions are Rotated versus Stationary, and the count is the number of subjects choosing that condition as giving stronger contours. The expected value for each condition would be 37/2 = 18.5. The data are sufficiently extreme that a test is superfluous.

21.25 This is another complex repeated-measures analysis of variance.
The comparison of recall of the two lists (one learned before administration of the drug and the other learned after) is a repeated measurement because the same subjects are involved. The comparison of the Drug versus Saline groups is a between-subjects effect because the groups involve different subjects.

21.27 This is basically a correlational study, where we separately correlate the two dependent variables with amount of alcohol consumed. Given the 14-year gap, and all of the other factors that affect development, we should not expect very strong correlations even under the best of conditions.