Ideal Answers to Chapter 2 Questions

QUESTION 2.1a. In the case of these data, the median seems like a much better indicator of central tendency than does the mode. After all, the mode also happens to be the lowest score in these data. In contrast, the median or middle score is in close agreement with the mean of 11.8. Unless there are some real outliers in a set of data, the mean is generally considered the best indicator of what score is typical in a data set.

QUESTION 2.1b. Yes, if the data were to more closely approximate a normal distribution, the mode and the median would be more likely to be in agreement. In fact, in a perfectly normal distribution, the mean, the median, and the mode are all exactly the same. The mode, for example, falls precisely under the peak of the normal curve because it is the score that occurs with the highest frequency, and in a perfectly normal distribution the mean and the median fall at that very same point.

QUESTION 2.2a. The mean value for shoe size in my SPSS output file is 11.8. Further, the mode is 10.5, and the median is 11.5. I pasted my output below.

QUESTION 2.2b. As I suggested above, the mean is probably the best measure of central tendency, followed closely by the median. If we randomly and repeatedly pulled one of these five Pelham guys out of a very large hat (with replacement) and wanted to predict his shoe size as accurately as possible before each selection, the mean would do the best job of minimizing errors of prediction in the long term.

QUESTION 2.2c. To have a better sense of whether these men tend to have big feet, it would be nice to know the population standard deviation for shoe size. If a score 1.8 sizes above the mean is 2 standard deviations above the mean, we could say that these guys, on average, have really big feet. In the domain of height, for example, a man who is 2 standard deviations above the mean is roughly 6'5" (based on a mean of 5'10" and a standard deviation of 3.5 inches).
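The raw shoe sizes are not reproduced here, so as a quick sanity check, here is a small Python sketch using hypothetical sizes. The sizes themselves are made up; only the summary statistics from 2.2a (mean 11.8, median 11.5, mode 10.5) come from the output above:

```python
from statistics import mean, median, mode

# Hypothetical sizes for the five men, invented only to be consistent
# with the summary statistics reported above (mean 11.8, median 11.5,
# mode 10.5); the actual raw scores are not shown in this answer key.
sizes = [10.5, 10.5, 11.5, 12.9, 13.6]

print(round(mean(sizes), 1))   # 11.8
print(median(sizes))           # 11.5
print(mode(sizes))             # 10.5

# The height analogy from 2.2c: 2 SDs above a 5'10" (70-inch) mean,
# with an SD of 3.5 inches, is 77 inches, i.e., roughly 6'5".
print(70 + 2 * 3.5)            # 77.0
```

Any other five sizes with these same summary statistics would, of course, print the same three values.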
But if the shoe size standard deviation were as high as 3.6, this would put them only half a standard deviation above the mean, roughly like saying a guy is 5'11½" tall: certainly not a shrimp, but not a giant. Unfortunately, we probably cannot get a very good estimate of the population standard deviation for shoe size based on these five men because they are all related. Assuming foot size has a pretty big genetic component, this would make for a smaller standard deviation among a group of related people than for the entire population. This being said, common sense tells us that Barry has pretty darned big feet, and Brett is just behind him. Of course, maybe Barry is 6'5", in which case his feet are probably perfectly normal for a tall guy.

QUESTION 2.3a. This is one of those interesting cases where the mean, the median, and the mode are not in agreement at all. To begin with the median, which is the most commonly used measure when extreme outliers are possible: the median is $20.00. So there is a sense in which the average American reported spending $20 yesterday. In other words, about half the people reported spending less than $20 and half reported spending more. Using the median would prevent a small percentage of people who spent a huge amount of money yesterday from inflating the value. But of course, if we are just trying to minimize errors of prediction, the mean would work better. The mean was $71.10. That's way bigger than the median, and it must reflect the influence of the 11 people (out of the n = 120 who answered the question) who said they spent $200 or more yesterday. One guy even said he spent $900! (I hope he enjoys the new flat-screen Hitachi.) If you think that spending $900 every once in a while isn't all that crazy, then the mean of $71.10 is not so unreasonable. At the other extreme, our mode was actually $0.00, that is, no money spent at all.
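The way a few big purchases pull the mean far above the median, while the mode stays at zero, is easy to demonstrate with a toy example. The spending amounts below are entirely hypothetical, made up only to mimic the pattern described above (lots of $0 answers, a median near $20, and one $900 spender):

```python
from statistics import mean, median, mode

# Made-up daily spending reports (in dollars); NOT the actual survey data.
# The shape mimics the answer above: several $0s, a $20 median, one $900.
spending = [0, 0, 0, 10, 20, 20, 35, 60, 75, 900]

print(mode(spending))    # mode: 0 (the most common single answer)
print(median(spending))  # median: 20.0 (half spent less, half spent more)
print(mean(spending))    # mean: 112 (dragged far above the median by the $900 report)
```

Even though only one of the ten hypothetical respondents is a big spender, the mean lands more than five times higher than the median, which is exactly the pattern in the real data.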
Almost 26% of these respondents reported having spent nothing yesterday (once we ignore rent, routine bills, and the other things they were asked to cross off the list). If you just care about what was most typical (common) for American consumers, the mode is the most reasonable measure of central tendency. One worry with accepting the mode as the best measure of spending, though, is that it lacks face validity. Did a quarter of the sample really spend nothing, or did they just forget about that decaf latte they picked up at Starbucks before heading home to sleep all day? Taking the data at face value, I'd have to say that this is one of those cases where there is no single answer to the question of how much money the typical American spends. There are pros and cons for each of these three very different answers to the question. Accepting conventional economic wisdom for measuring central tendency where money is involved would probably mean settling on the median, which would make the economists happy. Median income, for example, is almost always preferred by statisticians and economists over mean income.

QUESTION 2.3b. Fully 40% of survey respondents failed to answer the spending question. This is a very high figure, and it raises worries about nonresponse bias. Those who were willing and able to report their spending may have had different spending habits than those who were unwilling or unable to answer. For example, if seniors and/or poor people are more likely to be unable or unwilling to answer the spending question, then our estimate would be an overestimate for the total population. On the other hand, if most Americans believe that reckless spending is frowned upon, people who spend extravagant amounts of money might be more likely to refuse to answer the question, which would mean that, in the absence of any other biases, our estimate would be too low.
To the degree that memory lapses rather than social desirability biases are responsible for nonresponses to the spending question, elderly respondents might be overrepresented among nonresponders. If elderly people tend to have lower than average incomes, or more frugal spending habits, this would mean that our observed values overestimate spending in the total population. Because almost all survey respondents are willing to report their age, this last possibility could be tested empirically in a complete data set. In the best of all possible worlds, these various sources of bias would all cancel one another out. However, this lucky state of affairs seems unlikely. In the absence of empirical data on which kinds of people are least willing or able to report their spending, we can only add the caveat that our measure of daily spending is subject to nonresponse bias.

QUESTION 2.4a. The table below shows the U.S. states with the highest and lowest percentages of Whites, Latinos, Blacks, and Asians in 2007. The table shows that there is tremendous variation across U.S. states in their ethnic composition. Whereas more than 95% of those living in Maine were White, less than 25% of those living in Hawaii were White. Further, almost 40% of Hawaiians in 2007 were Asian.

States with Highest and Lowest Percentages of People in Four Ethnic Groups in 2007

Group      State with Highest Percentage    State with Lowest Percentage
Whites     Maine (95.5%)                    Hawaii (24.7%)
Latinos    New Mexico (44.4%)               West Virginia (1.1%)
Blacks     Mississippi (37.2%)              Montana (0.6%)
Asians     Hawaii (39.9%)                   MT, WV, WY (tied at 0.7%)

Note. Actual percentages for each ethnic group are listed in parentheses after each state.

QUESTION 2.4b. (a) According to our indicator of ethnic diversity, Hawaii is the most ethnically diverse U.S. state and California is the second most diverse.
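Our standard-deviation-based diversity index can be sketched in a few lines of Python. The idea is that the more even the split across the four group percentages, the smaller their spread, so a lower standard deviation indicates a more diverse state. Note that several of the percentages below are only rough values pieced together from this answer; in particular, the Latino and Black shares for Hawaii and the White and Asian shares for California are my own approximations, not census numbers:

```python
from statistics import pstdev

# Approximate 2007 shares (%) of Whites, Latinos, Blacks, and Asians.
# Maine's figures come from the answers above; several Hawaii and
# California figures are rough approximations for illustration only.
states = {
    "Hawaii":     [24.7, 8.0, 3.0, 39.9],
    "California": [43.0, 36.0, 7.0, 12.0],
    "Maine":      [95.5, 1.0, 1.0, 1.0],
}

# Lower spread across the four shares = a more even mix = more diverse.
diversity = {name: pstdev(shares) for name, shares in states.items()}

for name in sorted(diversity, key=diversity.get):
    print(f"{name}: {diversity[name]:.1f}")
```

Even with these rough inputs, the ordering matches the answer above: Hawaii has the smallest spread (most diverse), California is close behind, and Maine's huge White majority produces a very large spread (least diverse).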
The two states are very similar in that non-Hispanic Whites are actually a statistical minority in both states (though more dramatically so in Hawaii). Further, each state also has one other ethnic group that is represented in very large numbers: in Hawaii it is Asians (who make up almost 40% of the population), whereas in California it is Latinos (who make up 36% of the population). African Americans are the only group not represented in very large numbers in either state (about 7% in California and about 3% in Hawaii). It is also worth noting that whereas Hawaii has very few Blacks and very few Latinos, California is more diverse than Hawaii in the sense that Whites, Asians, and Latinos all make up at least 10% of the population. On the other hand, by the simplest possible definition of ethnic diversity, namely the percentage of non-Hispanic Whites in a U.S. state, Hawaii is the winner hands down.

(b) Using the measure based on the standard deviation, the two least ethnically diverse states are Maine and Vermont. If we trust Wikipedia (http://en.wikipedia.org/wiki/List_of_U.S._states_by_income), both of these New England states are roughly in the middle of the pack in per capita income relative to other U.S. states, and politically speaking, both are "blue" states. Based on our own data, the two states have strikingly similar (and strikingly homogeneous) ethnic compositions. Both are about 95% White, about 1% Black, 1% Latino, and 1% Asian. According to census data, one area in which these two otherwise similar states differ somewhat is the percentage of residents aged 25 or older with a bachelor's degree or higher. Vermont ranks 7th nationally whereas Maine ranks 24th. More specifically, about 34% of Vermont residents, but only about 27% of Maine residents, are college educated. See http://www.census.gov/statab/ranks/rank19.html

(c) The two states that are the most typical in terms of ethnic diversity are Arkansas and Tennessee.
They rank 25th and 26th, respectively, on our ethnic diversity measure. Further, in both states more than three-quarters of residents are White and about one-sixth are Black. Both have only a small percentage (3-5%) of Latinos and almost no Asians. The same census data just cited also show that these two states are well below the national median in the percentage of adults who are college educated. Tennessee ranks 42nd and Arkansas ranks 48th nationally in the percentage of residents with a college diploma.

Criticism: These results would surely have differed somewhat if we had defined ethnic diversity based on national origin (e.g., Germans, Nigerians, etc.) rather than focusing on the four most common American ethnic groups. Further, if we had simply included data from a couple of other ethnic groups (e.g., Native American Indians, people of mixed ethnicity), the results might have differed appreciably for at least some states. Alaska and Oklahoma, for example, would probably have climbed higher in the rankings based on the relatively high percentage of Native American Indians in those states.

Teaching Moment: Instructors may wish to point out that we could improve noticeably on the measure of ethnic diversity (without having to consult the U.S. Census any further) by creating one additional ethnic group, namely "all other ethnicities." This would be a catch-all category for groups such as Native American Indians, Middle Easterners who do not consider themselves Asian, and people of mixed ethnicity. Putting all of these very small minorities in one group (by simply subtracting the four ethnic groups listed from 100%) would effectively be all-inclusive and would bypass the statistical snags associated with using, say, 12 ethnic groups, eight of which would be very tiny in most states.

QUESTION 2.5. Here are the calculations for the margin of error based on an observed sample percentage of 48% yes responses and a sample size of n = 1,000.
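The margin-of-error arithmetic can also be scripted as a quick check. Here is a minimal Python sketch, assuming the conventional 1.96 multiplier for a 95% confidence interval:

```python
from math import sqrt

def moe_95(p, n):
    """Half-width of the 95% confidence interval for a sample proportion p
    based on a sample of size n (normal approximation, 1.96 standard errors)."""
    return 1.96 * sqrt(p * (1 - p) / n)

# A 48% result with n = 1,000: margin of error of about 3.1 points.
print(round(100 * moe_95(0.48, 1000), 1))  # 3.1
# A 4% result with the same n: a much smaller margin, about 1.2 points.
print(round(100 * moe_95(0.04, 1000), 1))  # 1.2
```

The helper reproduces both of the hand calculations worked out next.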
Sp = square root of [.48 x .52/1,000] = .0158

So the margin of error is P ± (1.96 x .0158), which is .48 ± .031. The 95% confidence interval (converted to percentages) is thus 44.9-51.1%. With rounding, then, there is about a 95% chance that 45-51% of Canadians believe that there is intelligent life elsewhere in the universe.

Going through the same calculations for a poll result of 4% rather than 48% yields:

Sp = square root of [.04 x .96/1,000] = .0062

So the margin of error is P ± (1.96 x .0062), which is .04 ± .012. The 95% confidence interval (converted to percentages) is thus 2.8-5.2%. Rounding to the nearest percentage point, then, there is roughly a 95% chance that the true percentage of Canadians who think their taxes are too low is between 3 and 5%. We have roughly a one-point margin of error here (with a 4% result) instead of a roughly three-point margin of error (with a 48% result).

A look at the simple standard deviations of the two sets of 20 scores provided (Set 5 and Set 50) clarifies why margins of error get smaller as a poll result becomes more extreme. The standard deviation of the scores is simply smaller for more extreme results. Based on my hand calculations, the S.D. for the scores in Set 50 is 0.500. When I asked SPSS to calculate it using "Descriptives," by the way, I got .51299. I assume that this is because SPSS used n - 1 to estimate the population standard deviation rather than just giving the simple descriptive S.D. My hand calculations for the S.D. for Set 5 yielded a much smaller value of .2179. This much lower value makes sense because 19 of the 20 scores in Set 5 are identical to one another! So, all else being equal, margins of error are smaller when a sample percentage deviates more from 50% because the standard deviation associated with the scores gets smaller as you deviate further from an even 50-50 split.

QUESTION 2.6. The skewness score for the variable "bimodal" was exactly zero (0.00).
This was the case because the scores to the right of the mean (6.0) are a perfect mirror image of the scores to the left of the mean (e.g., the lowest score of 1.0 is exactly 5 points below the mean, and the highest score of 11.0 is exactly 5 points above the mean). Further, the two modes (3 and 9) are each exactly 3 points from the mean. When I created a new variable whose n = 21 scores were twenty 1s and just one 7, I observed a skewness score of 4.58, which indicates extreme positive skew. This makes sense because there is one and only one score (the 7) that is an extreme positive outlier. In fact, this one outlier is more than 4 standard deviations above the mean of about 1.3.

QUESTION 2.7. The variable "normal" has a kurtosis score of 0.11, which is very close to zero. This makes sense because "normal" very closely approximates a normal distribution. The ever-so-slightly positive kurtosis score means that these scores are ever so slightly clustered (i.e., bunched together) at the mean. In contrast, the variable "low_kurt" has a low kurtosis score of -1.89. This low score reflects the fact that this bimodal distribution of scores is heavily clustered in the tails. The mean is 6.0, but there is only one score of exactly 6 in the distribution, whereas there are seven scores of 4 (2 points below the mean) and seven scores of 8 (2 points above the mean). Thus 14 of the 21 scores (two-thirds) lie in the extreme tails of the distribution. One might describe this as an "anti-normal" distribution. The variable "high_kurt" has a high kurtosis score of +5.68, which is a reflection of the fact that fully 17 of the 21 scores are exactly the mean score of 6.0. There are exactly two scores below the mean (4 and 5) and exactly two scores above the mean (7 and 8). Thus, this distribution of scores is very sharply peaked at the mean. The kurtosis score for "fingers" is even larger: +18.44.
This much higher kurtosis score is presumably a reflection of the fact that only two (rather than four) of the 21 scores deviate at all from the median score of 10 fingers.

Supplemental Answer: To see whether kurtosis is more sensitive to the number of scores that deviate from the central tendency or to the amount by which the scores deviate, I created a couple of variables in which I varied the number of scores that deviate from the mean as well as the degree of that deviation. The kurtosis statistic was highly sensitive to changes in the number of scores that deviate from the mean (at least for the extremely limited range of scores I tested) but not sensitive at all to the absolute degree of deviation for the one or two scores that differed from the central tendency. In fact, the n = 7 scores in Set A (0, 10, 10, 10, 10, 10, 20) have the same kurtosis score (+3.0) as the n = 7 scores in Set B (9, 10, 10, 10, 10, 10, 11). At first blush this seemed odd, but if you first subtract 9 from all of the scores in Set B and then multiply these new scores by 10, you get the scores in Set A. This observation supports a basic principle of statistics: a simple linear transformation of a set of scores does not change shape-related properties of the distribution, such as skewness or kurtosis. Just as multiplying height by two would not change the correlation between height and weight, applying the transformation 10 * (x - 9) to the scores in Set B does not change their kurtosis. This suggests that by adding just a few more scores to the positive and negative tails of a distribution (so that there is better information about the true variability of the scores), one might see that the extremity of the scores does matter for kurtosis.
The n = 9 scores in Set A' (-1, 0, 10, 10, 10, 10, 10, 20, 21) are exactly the same as the scores in Set A except that they contain a single score that is one point more negative and a single score that is one point more positive than the original low and high scores in Set A. Kurtosis did not merely drop; it fell from +3.0 all the way to slightly negative (-0.25). Contrast this with what happened when I examined the kurtosis of the scores in Set B'. These scores were (8, 9, 10, 10, 10, 10, 10, 11, and 12). Note that these scores are exactly the same as the scores in Set B except that they, too, contain a single score that is one point more negative and a single score that is one point more positive than the original low and high scores. The kurtosis of this set of scores dropped much less, from +3.0 to +1.26, and remained clearly positive, meaning that this distribution is still noticeably more peaked than a perfectly normal distribution. This makes sense because in Set B' the scores of 8, 9, 11, and 12 all sit quite close to the large cluster of scores at the mean, so the distribution is still dominated by its sharp central peak. In Set A', by contrast, four of the nine scores (-1, 0, 20, and 21) sit far out in the tails, a long way from the central cluster, so the distribution begins to resemble the flat, "anti-normal" plateau shape of "low_kurt." So once the tails contain more than a stray score or two, the extremity of those tail scores really does matter for kurtosis.

QUESTION 2.8a. Wickens would clearly say that we should not trust this correlation or its associated p value. If we were to delete the two people with a very unusual number of fingers, there could be no correlation, by definition, because there would be no variance in number of fingers. Further, if we were to switch the heights of the person with 9.5 fingers and the person with 12 fingers, we would observe a correlation of r = -.48, p < .05. That's a significant negative instead of a significant positive correlation! Another way to think about this is that the highly leptokurtic nature of this distribution means that the standard deviation for number of fingers is very small (0.46).
Thus a score that is only a little higher or a little lower than the typical score of 10 fingers has tremendous weight in the analysis. In fact, the score of 12 fingers is more than 4 standard deviations above the mean. Scores in excess of three standard deviations above the mean are usually treated as potentially worrisome outliers. There is a sense in which we are essentially dealing with three observations. Assuming the 19 participants with 10 fingers are a representative sample, the 19 height scores of this group should be normally distributed around the mean height of the population from which they were sampled. This means that most of those 19 scores should be at or near the mean in height. Thus, we essentially have a group of 19 scores that are mostly near the mean in height, all of which are precisely at the median in number of fingers. Then we have only one person with a below-average number of fingers and only one person with an above-average number of fingers. If one of these two unusually fingered people happens to be tall and one happens to be short, we are essentially guaranteed to observe either a substantial positive or a substantial negative correlation (as confirmed by the switching demonstration noted above). In fact, even if we were to make the 9.5-fingered person perfectly average in height (67 inches), the observed correlation between number of fingers and height would still be r = .49, p = .026. So, in essence, this entire correlation is being driven by a single 12-fingered person. Finally, it seems obvious that a correlation based on n = 3 (1 df) would not be significant at p = .05. Just to be sure, I created a sample that included only 3 of these 21 participants: the person with 9.5 fingers, the person with 12 fingers, and a person of average height (67 inches) with 10 fingers. Because of the enormous reduction of variation in height scores among people with 10 fingers, the correlation increased dramatically to r = .966.
However, with only 1 df, even this extremely positive correlation was not significant, p = .166.

QUESTION 2.8b. If number of fingers were highly platykurtic and height were normally distributed, I would not be very worried that one or two observations could create a spurious correlation between number of fingers and height. In fact, this is a perfect description of the properties of the two variables that are involved in a traditional independent-samples t-test or one-way ANOVA. The independent variable is a categorical variable with 2 or more levels (each of which usually has roughly equal frequencies), and the dependent measure is assumed to be normally distributed. For example, in the simple case of gender, if there were 10 men and 11 women in these data, there obviously would be no reason to worry that any one person was an outlier on gender. Further, if we happened to sample one very tall or very short woman, her height would merely be one of 11 scores rather than the sole score representing her entire gender group.
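As a closing check, the skewness and kurtosis values quoted in Questions 2.6 and 2.7 can be reproduced by hand. Below is a Python sketch of the standard adjusted formulas (which appear to be the ones SPSS uses, since they reproduce the output quoted above); the score vectors are reconstructed from the verbal descriptions of the variables:

```python
from math import sqrt

def spss_skew(xs):
    """Adjusted sample skewness (the G1 statistic SPSS reports)."""
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n   # second central moment
    m3 = sum((x - m) ** 3 for x in xs) / n   # third central moment
    return (m3 / m2 ** 1.5) * sqrt(n * (n - 1)) / (n - 2)

def spss_kurt(xs):
    """Adjusted sample excess kurtosis (the G2 statistic SPSS reports)."""
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n   # second central moment
    m4 = sum((x - m) ** 4 for x in xs) / n   # fourth central moment
    g2 = m4 / m2 ** 2 - 3
    return ((n + 1) * g2 + 6) * (n - 1) / ((n - 2) * (n - 3))

skewed    = [1] * 20 + [7]                # twenty 1s and one 7 (Question 2.6)
high_kurt = [6] * 17 + [4, 5, 7, 8]       # Question 2.7
fingers   = [10] * 19 + [9.5, 12]         # Questions 2.7 and 2.8a
set_a     = [0, 10, 10, 10, 10, 10, 20]   # Supplemental Answer, Set A
set_b     = [9, 10, 10, 10, 10, 10, 11]   # Supplemental Answer, Set B

print(round(spss_skew(skewed), 2))     # 4.58
print(round(spss_kurt(high_kurt), 2))  # 5.68
print(round(spss_kurt(fingers), 2))    # 18.44
print(round(spss_kurt(set_a), 2))      # 3.0
print(round(spss_kurt(set_b), 2))      # 3.0
```

The matching +3.0 results for Sets A and B confirm the point made in the Supplemental Answer: a simple linear transformation leaves kurtosis unchanged.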