CHAPTER 3 DESCRIBING DATA: SUMMARY MEASURES MULTIPLE CHOICE QUESTIONS In the following multiple-choice questions, please circle the correct answer. 1. Which of the following are the three most common measures of central location? a. Mean, median and mode b. Average, variance and standard deviation c. Mode, sample mean, and sample variance d. Mean, median, and average ANSWER: a 2. Which of the following are considered measures of association? a. Mean and variance b. Variance and correlation c. Covariance and correlation d. Covariance and variance ANSWER: c 3. The difference between the first and third quartile is called the a. interquartile range b. interdependent range c. unimodal range d. common occurrence range e. none of the above ANSWER: a 32 . Describing Data: Summary Measures 4. If a value represents the 95th percentile, this means: a. 95% of all values are below this point b. 95% of all values are above this point c. 95% of the time you will observe this value d. there is a 5% chance that this value is incorrect e. none of the above ANSWER: a 5. For a boxplot, the point inside the box indicates the location of the __________. a. mean b. median c. minimum value d. maximum value ANSWER: a 6. For a boxplot, the vertical line inside the box indicates the location of the______. a. mean b. median c. mode d. minimum value e. maximum value ANSWER: b 7. Suppose that a histogram of a data set is approximately symmetric and "bell shaped". Approximately percent of the observations are within three standard deviation of the mean. a. 50 b. 68 c. 95 d. 99.7 e. 100 ANSWER: 8. d The correlation coefficient is always: a. between – 1 and 0 b. between 0 and +1 c. between 0.0 and 100 d. between –1 and +1 ANSWER: d 33 Chapter 3 9. Which of the following are the two most commonly used measures of variability? a. Variance and mode b. Variance and standard deviation c. Sample mean, and sample variance d. Mean and range ANSWER: b 10. Boxplots are probably most useful for . a. calculating the mean of the data b. calculating the median of the data c. comparing the mean to the median d. comparing two populations graphically e. comparing the correlation and the covariance ANSWER: d 11. Suppose that a histogram of a data set is approximately symmetric and "bell shaped". Approximately percent of the observations are within one standard deviation of the mean. a. 50 b. 68 c. 95 d. 99.7 e. 100 ANSWER: 12. Suppose that a histogram of a data set is approximately symmetric and "bell shaped". Approximately percent of the observations are within two standard deviation of the mean. a. 50 b. 68 c. 95 d. 99.7 e. 100 ANSWER: 13. b c The mode is best described as: a. the middle observation b. the same as the average c. the 50th percentile d. the most frequently occurring value ANSWER: d 34 Describing Data: Summary Measures 14. The median can also be described as: a. the middle observation when the data is arranged in ascending order b. the second quartile c. the 50th percentile d. all of the above e. none of the above ANSWER: d 15. For a boxplot, the box itself represents a. 25 b. 50 c. 75 d. 90 e. 100 ANSWER: 16. percent of the observations. b Which of the following best describes a data warehouse? a. A physical facility full of databases b. A storage facility for government information c. A database specifically structured to enable data mining d. A database used to track warehousing information e. None of the above ANSWER: c 35 Chapter 3 TEST QUESTIONS QUESTIONS 17 AND 18 ARE BASED ON THE FOLLOWING INFORMATION: Statistics professor has just given a final examination in his statistical inference course. He is particularly interested in learning how his class of 40 students performed on this exam. The scores are shown below. 75 81 88 80 17. 79 82 83 76 72 79 90 84 75 71 82 84 77 73 79 80 71 89 62 68 78 74 73 74 83 75 88 76 84 93 76 70 71 74 76 91 What are the mean and median scores on this exam? ANSWER: Mean = 78.40, Median = 77.50 18. Explain why the mean and median are different. ANSWER: There are few higher exam scores that tend to pull the mean away from the middle of the distribution. While there is a slight amount of positive skewness in the distribution (skewness = 0.182), the mean and the median are essentially equivalent in this case. QUESTIONS 19 AND 20 ARE BASED ON THE FOLLOWING INFORMATION: The average time in minutes, for a sample of 40 people living in Detroit metropolitan area to travel to work and back home each day is shown below. 42.8 51.8 41.3 61.5 19. 47.1 43.1 32.6 36.5 46.3 51.5 34.1 48.9 42.2 45.5 31.9 40.6 55.9 40.7 42.9 48.1 35.0 57.6 48.5 39.8 41.4 36.5 46.7 35.6 37.8 32.3 40.9 44.4 33.9 41.9 46.6 38.6 36.6 37.0 29.5 45.6 Find the most representative average commute time. ANSWER: The most representative average commute time for Detroit metropolitan area is slightly greater than 41 minutes. The median is the best measure of central location due to the distribution positive skewness (skewness = 0.573). 36 Describing Data: Summary Measures 20. Does it appear that the distribution of average commute time is approximately symmetric? Explain why or why not. ANSWER: The distribution is slightly skewed to the right; it is essentially not symmetric. Notice that the mean is slightly greater than the median here; the mean is influenced by the existence of some relatively large commute times in the sample. QUESTIONS 21 AND 22 ARE BASED ON THE FOLLOWING INFORMATION: A production manager for Bell Computers is interested in determining the variability of the proportion defective items in a shipment of one of the computer components used in their product. The results below were calculated using information on the percent defective from 50 randomly selected shipments collected over a 6-week period. Count Mean Median Standard deviation Minimum Maximum Variance First quartile Third quartile 21. 50.000 0.210 0.206 0.057 0.104 0.299 0.003 0.153 0.248 Use the above information to determine the rules of thumb in this case. ANSWER: Approximately 68% of the values will fall between 0.210 0.057, that is between 0.153 and 0.267. Approximately 95% of the values will fall between 0.210 2 (0.057), that is between 0.096 and 0.324. Approximately 99.7% of the values will fall between 0.210 3 (0.057), that is between 0.039 and 0.381. 22. Explain what would cause the mean to be higher than the median in this case. ANSWER: The data seems to be skewed to the right. This is causing the mean to be slightly higher than the median. 37 Chapter 3 QUESTIONS 23 THROUGH 25 ARE BASED ON THE FOLLOWING INFORMATION: A manager for Marko Manufacturing, Inc. has recently been hearing some complaints that women are being paid less than men for the same type of work in one of their manufacturing plants. The boxplots shown below represent the annual salaries for all salaried workers in that facility (40 men and 34 women). Female_Salary Male_Salary 0 23. 20000 40000 60000 80000 Would you conclude that there is a difference between the salaries of women and men in this plant? Justify your answer. ANSWER: Yes. The men seem to have higher salaries than the women do in many cases. We can see from the boxplots that the mean and median values for the men are both higher than for the women. You can also see from the boxplots that the middle 50% of salaries for men is above the median for women. This means that if you were in the 25th percentile for men, you would be above the 50th percentile for women. You can also see that the mean and median salaries for the men are about $10,000 above those for the women. 24. How large must a person’s salary should be to qualify as an outlier on the high side? How many outliers are there in these data? ANSWER: A person’s salary should be somewhere above $70,000. There is one male salary that would be considered an outlier (at approximately $80,000) 38 Describing Data: Summary Measures 25. What can you say about the shape of the distributions given the boxplots above? ANSWER: They both appear to be slightly skewed to the right (both have a mean > median). The total variation seems to be close for both distributions (with one outlier for the male salaries), but there seems to be more variation in the middle 50% for the women than for the men. There seem to be more men’s salaries clustered more closely around the mean than for the women. QUESTIONS 26 THROUGH 28 ARE BASED ON THE FOLLOWING INFORMATION: Below you will find current annual salary data and related information for 30 employees at Gamma Technologies, Inc. These data include each selected employees gender (1 for female; 0 for male), age, number of years of relevant work experience prior to employment at Gamma, number of years of employment at Gamma, the number of years of post-secondary education, and annual salary. The tables of correlations and covariances are presented below. Table of Correlations Gender Gender 1.000 Age -0.111 Prior_Exp 0.054 Gamma_Exp -0.203 Education -0.039 Salary -0.154 Age 1.000 0.800 0.916 0.518 0.923 Prior Exp Gamma Exp Education Salary 1.000 0.587 0.434 0.723 1.000 0.342 0.870 1.000 0.617 1.000 Table of Covariances (variances on the diagonal) Gender Age Prior Exp Gamma Exp Education Salary 26. Gender 0.259 -0.633 0.117 -0.700 -0.033 -1825.97 Age Prior Exp Gamma Exp Education Salary 134.051 39.060 72.047 9.951 249702.35 19.045 17.413 3.140 73699.75 49.421 3.987 143033.29 2.947 24747.68 584640062 Which two variables have the strongest linear relationship with annual salary? ANSWER: Age at 0.923 and Gamma experience at 0.870 have the strongest linear relationship with annual salary. 39 Chapter 3 27. For which of the two variables, number of years of prior work experience or number of years of post-secondary education, is the relationship with salary stronger? Justify your answer. ANSWER: Prior work experience is stronger at 0.723 versus 0.617 for number of years of post-secondary education. 28. How would you characterize the relationship between gender and annual salary? ANSWER: It is a somewhat weak relationship at –0.154. Also, the negative value tells us that the salaries are decreasing as the gender value increases. This indicates that the salaries are lower for females (1) than for males (0). QUESTIONS 29 THROUGH 31 ARE BASED ON THE FOLLOWING INFORMATION: The data shown below contains family incomes (in thousands of dollars) for a set of 50 families; sampled in 1980 and 1990. Assume that these families are good representatives of the entire United States. 1980 58 6 59 71 30 38 36 33 72 100 1 27 22 141 72 165 79 1990 54 2 55 57 26 34 32 29 68 96 0 23 47 166 97 190 104 1980 33 14 48 20 24 82 95 12 93 100 51 22 50 124 113 118 96 1990 29 10 44 16 20 78 97 8 89 102 47 18 75 149 138 143 121 1980 73 26 64 59 11 70 31 92 115 62 23 34 36 125 121 88 40 1990 69 22 70 55 7 66 27 88 111 58 19 30 61 150 146 113 Describing Data: Summary Measures 29. Find the mean, median, standard deviation, first and third quartiles, and the 95th percentile for family incomes in both years. ANSWER: Mean Median Standard deviation First quartile Third quartile 95th percentile 30. Income 1980 62.820 59.000 39.786 30.250 92.750 124.550 Income 1990 67.120 57.500 48.087 27.500 97.000 149.55 The Republicans claim that the country was better off in 1990 than in 1980, because the average income increased. Do you agree? ANSWER: It is true that the mean increased slightly, but the median decreased and the standard deviation increased. The 95th percentile shows that the mean increase might be because the rich got riches. 31. Generate a boxplot to summarize the data. What does the boxplot indicate? ANSWER: The boxplot shows that there is not much difference between the two populations. Income_1990 Income_1980 0 40 80 41 120 160 20 Chapter 3 QUESTIONS 32 THROUGH 35 ARE BASED ON THE FOLLOWING INFORMATION: In an effort to provide more consistent customer service, the manager of a local fast-food restaurant would like to know the dispersion of customer service times about their average value for the facility’s drive-up window. The data below represents the customer service times (in minutes) for a sample of 47 customers collected over the past week. Count Mean Median Standard deviation Minimum Maximum Variance First quartile Third quartile 32. 47.000 0.914 0.822 0.511 0.095 2.372 0.261 0.563 1.180 Interpret the variance and standard deviation of these sample data. ANSWER: The variance = 0.261 and this represents the average of the squared deviations from the mean. The standard deviation = 0.511 and is the square root of the variance. The standard deviation is in minutes (the original units). 33. Determine the rules of thumb in this case. ANSWER: Approximately 68% of the values will fall between 0.914 0.511, that is between 0.403 and 1.425. Approximately 95% of the values will fall between 0.914 2(0.511), that is between (-0.108 and 1.936. Approximately 99.7% of the values will fall between 0.914 3(0.511), that is between -0.632 and 2.447. 34. Does it appear as though most of the data falls within 3 standard deviations of the mean? Explain. ANSWER: Since the minimum value is 0.095 and the maximum value is 2.372, then all of the data falls within three standard deviations of the mean. 42 Describing Data: Summary Measures 35. Explain what would cause the mean to be higher than the median in this case. ANSWER: The data is skewed to the right. This is causing the mean to be higher than the median. It is important to understand that service times are bounded on the lower end by zero (or it is impossible for the service time to be negative). However, there is no bound on the maximum service time. Therefore, the larger service times are causing the mean to be somewhat higher than the median. QUESTIONS 36 THROUGH 39 ARE BASED ON THE FOLLOWING INFORMATION: The percentage of the US population without health insurance coverage is shown below with samples from the 50 states and District of Columbia for both 1997 and 1998. State Alabama Alaska Arizona Arkansas California Colorado Connecticut Delaware District of Columbia Florida Georgia Hawaii Idaho Illinois Indiana Iowa Kansas Kentucky Louisiana Maine Maryland Massachusetts Michigan Minnesota Mississippi Missouri % in 97 19.9 14.0 20.9 18.1 21.8 13.1 11.1 14.2 17.1 17.9 16.9 9.9 14.7 12.1 11.2 10.4 13.6 15.9 19.9 13.8 13.3 13.2 11.5 10.2 18.5 12.9 % in 98 14.2 13.2 21.1 18.6 21.3 15.5 9.5 16.4 18.0 19.0 18.6 9.6 14.7 11.7 13.3 12.0 13.1 15.3 21.2 14.2 16.0 11.8 10.4 8.7 20.4 15.3 State Montana Nebraska Nevada New Hampshire New Jersey New Mexico New York North Carolina North Dakota Ohio Oklahoma Oregon Pennsylvania Rhode Island South Carolina South Dakota Tennessee Texas Utah Vermont Virginia Washington West Virginia Wisconsin Wyoming 43 % in 97 14.3 11.4 16.4 12.6 13.7 23.8 16.7 14.0 9.1 11.7 18.5 13.8 11.3 12.2 14.9 10.7 10.9 24.9 12.2 9.3 12.7 13.4 16.9 9.6 16.1 % in 98 13.4 9.7 19.4 10.7 14.9 26.3 15.9 15.0 9.0 12.6 19.9 13.2 10.6 13.6 15.3 10.1 15.5 25.2 12.4 13.9 14.2 13.1 16.6 8.0 16.6 Chapter 3 36. Describe the distribution of state percentages of Americans without health insurance coverage in 1998. Be sure to employ both measures of central location and dispersion in developing your characterization of this sample. ANSWER: Count Mean Median Standard deviation Minimum Maximum First quartile Third quartile Skewness 51.000 14.855 14.200 4.098 8.000 26.300 12.200 16.500 0.699 This distribution is slightly skewed to the right, as evidenced by the fact that its mean (14.855%) exceeds its median (14.2%). The standard deviation of the 1998 distribution is approximately 4.10%. 37. Compare the 1998 distribution of percentages with the corresponding set of percentages taken in 1997. How are these two sets similar? In what ways are they different? ANSWER: Count Mean Median Standard deviation Minimum Maximum First quartile Third quartile Skewness Percentage in 97 Percentage in 98 51.000 51.000 14.455 14.855 13.700 14.200 3.724 4.098 9.100 8.000 24.900 26.300 11.600 12.200 16.800 16.500 0.910 0.699 The two distributions are similar in that both are somewhat positively skewed (even though the 1997 distribution is more skewed than the 1998 distribution). They are different inasmuch as the 1998 distribution has higher measures of central location and dispersion than does the 1997 distribution. 44 Describing Data: Summary Measures 38. Compute a correlation measure for the two given sets of percentages. What does the correlation tell you in this case? ANSWER: Table of correlations Percent 97 Percent 98 Percent 97 1.000 0.903 Percent 98 1.000 There is a very large positive correlation between these two sets of percentages. This indicates that the percentages tend to move together, although not perfectly. 39. Based on your answers to Questions 37 and 38, what would you expect to find upon analyzing similar data for 1999? ANSWER: Based on the answers to Questions 37 and 38 above, it is reasonable to expect that the percent of the U.S. population without health insurance coverage in 1999 will be slightly greater than in the previous year. Also, we expect that there will be greater variability about the mean in 1999 than was the case in 1998. QUESTIONS 40 THROUGH 43 ARE BASED ON THE FOLLOWING INFORMATION: Below you will find summary measures on salaries for classroom teachers across the United States. You will also find a list of selected states and their average teacher salary. All values are in thousands of dollars. State Alabama Colorado Connecticut Delaware Nebraska Nevada New Hampshire New Jersey New Mexico South Carolina South Dakota Tennessee Texas Utah Vermont Virginia Wyoming Salary 31.3 35.4 50.3 40.5 31.5 36.2 35.8 47.9 29.6 31.6 26.3 33.1 32.0 30.6 36.3 35.0 31.6 Count Mean Median Standard deviation Minimum Maximum Variance First quartile Third quartile 45 Salary 51.000 35.890 35.000 6.226 26.300 50.300 38.763 31.550 40.050 Chapter 3 40. Which of the states listed paid their teachers average salaries that exceed at least 75% of all average salaries? ANSWER: Connecticut at 50.3; Delaware at 40.5; and New Jersey at 47.9 (all those > 40.05). 41. Which of the states listed paid their teachers average salaries that are below 75% of all average salaries? ANSWER: Alabama at 31.3; Nebraska at 31.5; New Mexico at 29.6; South Dakota at 26.3; and Utah at 30.6 (all those < 31.55). 42. What salary amount represents the second quartile? ANSWER: $35,000 (median) 43. How would you describe the salary of Virginia’s teachers compared to those across the entire United States? Justify your answer. ANSWER: Virginia = $35,000 which is also the median. Virginia is at the 50th percentile or 50% of the teachers’ salaries across the U.S. are below Virginia and 50% of the salaries are above theirs. QUESTIONS 44 THROUGH 48 ARE BASED ON THE FOLLOWING INFORMATION: A manager at Abby Technologies, Inc. (ATI) is interested in analyzing the compensation of employees at her company. ATI currently has 25 employees and below you will find summary measures for the 25 employees. Count Mean Median Standard deviation Minimum Maximum Variance First quartile Third quartile 25.00 43957.20 42220.00 12478.24 18440.00 71250.00 155706456.42 37250.00 51125.00 46 Describing Data: Summary Measures 44. Interpret the mean and median values. ANSWER: The mean value of $43,957.20 represents the average compensation for the 25 employees. The median value of $42,220.00 represents the 50th percentile or the second quartile. This means that 50% of the values are below $42,220. 45. Why is there a difference in the mean and median values? Which of these two values do you believe is more appropriate in describing this distribution? ANSWER: The mean could be somewhat inflated due to one or two large values (such as the maximum compensation of $71,250). The median may be more appropriate. 46. Explain what the first quartile and third quartile values represent. ANSWER: The first quartile means that 25% of the employees have a compensation package below $37,250 (also means that 75% of the employees have a compensation package above this amount). The third quartile means that 75% of the employees have a compensation package below $51,125 (also means that 25% of the employees have a compensation package above this amount). 47. What range of values would contain approximately 95% of the observations? ANSWER: Approximately $43,957.20 2(12,478.24) or from $19,000.72 to $68,913.68. 48. Do any of the values fall outside of the range calculated in Question 47? ANSWER: Yes, the minimum value ($18,440) and the maximum value ($71,250) fall outside of this range. QUESTIONS 49 THROUGH 55 ARE BASED ON THE FOLLOWING INFORMATION: The following are two sets of data, each with a sample of size n = 9 Set 1 Set 2 5 15 2 12 4 14 2 12 3 13 47 2 12 10 20 2 12 12 22 Chapter 3 49. For each data, compute the mean, median, and mode ANSWER: Mean Median Mode 50. Set 1 4.667 3 2 Set 2 14.667 13 12 Compare your results in Question 49, and summarize your findings ANSWER: The measures of central tendency (mean, median, and mode) for Set 2 are all 10 more than the comparable statistics for Set 1. 51. Compare the first sampled item in each set, compare the second sampled item in each set, and so on. Briefly describe your findings in light of your summary in Question 50. ANSWER: The data values in Set 2 are each 10 more than the corresponding values in Set 1. The is the reason behind the results in Question 50. 52. For each set, compute the range, interquartile range, variance and standard deviation. ANSWER: Range Interquartile range Variance Standard deviation 53. Set 1 10 5 14.25 3.7749 Set 2 10 5 14.25 3.7749 Describe the shape of each set. ANSWER: Since the mean is greater than the median for each data set, the distributions are both right-skewed. 54. Compare your results in Questions 52 and 53, and discuss your findings. ANSWER: Because the data values in Set 2 are each 10 more than the corresponding values in Set 1, the measures of spread among the data values remain the same across the two sets. Set 2 is a reflection of Set 1 simply shifted up the scale 10 units, so the distributions are also reflections of each other. 48 Describing Data: Summary Measures 55. On the basis of your answers to Questions 49 through 54, what can you generalize about the properties of central tendency, variation, and shape? ANSWER: Generally stated, when a second data set is an additive shift from an original set, the measures of central tendency for the second set are equal to the comparable measures for the original set plus the value, or distance, of the shift; the measures of spread for the second set are equal to the corresponding measures for the original set, and the shape of the second distribution will be a reflection of the shape of the original distribution. 49 Chapter 3 TRUE / FALSE QUESTIONS 56. If all values in a data set are negative, the value of the standard deviation may be either positive or negative. ANSWER: F 57. Assume that the histogram of a data set is symmetric and bell shaped, with mean of 72 and standard deviation of 10. Then, using the “rules of thumb”, we can say that 95% of the data values were between 52 and 92. ANSWER: T 58. If a histogram has a single peak and looks approximately the same to the left and right of the peak, we should expect no difference in the values of the mean, median, and mode. ANSWER: T 59. The mean is a measure of central location. ANSWER: T 60. In a negatively skewed distribution, the mean is larger than the median and the median is larger than the mode. ANSWER: F 61. Since the population is always larger than the sample, the population mean is always larger than the sample mean. ANSWER: F 62. The length of the box in the boxplot portrays the interquartile range. ANSWER: T 63. In a positively skewed distribution, the mean is smaller than the median and the median is smaller than the mode. ANSWER: F 64. The interquartile range is considered the weakest measure of central location. ANSWER: F 65. The value of the standard deviation always exceeds that of the variance. ANSWER: F 66. The difference between the first and third quartiles is called the interquartile range. ANSWER: T 50 Describing Data: Summary Measures 67. The standard deviation is measured in original units, such as dollars and pounds. ANSWER: T 68. The median is one of the most frequently used measures of variability. ANSWER: F 69. Each of the covariance and correlation measures the strength and direction of a linear relationship between two numerical variables. ANSWER: T 70. Abby has been keeping track of what she spends to rent movies. The last seven week's expenditures, in dollars, were 6, 4, 8, 9, 6, 12, and 4. The mean amount Abby spends on renting movies is $7. ANSWER: T 71. If two data sets have the same range, the smallest and largest observations in both sets will be the same ANSWER: F 72. The variance is a measure of the linear relationship between two variables ANSWER: F 73. Generally speaking, if two variables are unrelated, the covariance will be a positive or negative number close to zero ANSWER: T 74. The correlation between two variables is a unitless quantity that is always between –1 and +1. ANSWER: T 75. The variance is the positive square root of the standard deviation. ANSWER: F 76. It is possible that the data points are close to a curve and have a correlation close to 0, because correlation is relevant only for measuring linear relationships. ANSWER: T 77. Expressed in percentiles, the interquartile range is the difference between the 25th and 75th percentiles. ANSWER: T 78. A data sample has a mode 0f 140, a median of 130, and a mean of 120. The distribution of the data is positively skewed. ANSWER: F 51 Chapter 3 79. A student scores 85, 75, and 80 on three exams during the semester and 90 on the final exam. If the final is weighted double and the three others weighted equally, the student's final average would be 80. ANSWER: F 80. Suppose that a sample of 10 observations has a standard deviation of 3, then the sum of the squared deviations from the sample mean is 30. ANSWER: F 81. The value of the mean times the number of observations equals the sum of all of the data values. ANSWER: T 82. The difference between the largest and smallest values in a data set is called the range. ANSWER: T 83. There are four quartiles that divide the values in a data set into four equal parts. ANSWER: F 52