Chapter 4: Describing your Data

1. a. Quantitative variables involve values that come in meaningful (not arbitrary) numbers.
   b. Qualitative variables are variables whose values fall into some category, indicating a quality or property of an object.
   c. Continuous variables are quantitative variables whose values can take on a wide range of possible values over a continuous spectrum.
   d. Ordinal variables are qualitative variables whose categories can be put into some natural order.
   e. Nominal variables are qualitative variables whose categories cannot be put into a natural order.

2. A distribution is skewed if most of the values are clustered toward the left or right edge of the distribution. A positively skewed distribution is a skewed distribution in which the values are clustered toward the left (lower) end of the distribution. A negatively skewed distribution is a skewed distribution in which the values are clustered toward the right (upper) end of the distribution.

3. A stem and leaf plot is a graphical representation of a distribution in which the values of the distribution are used to create the stems and leaves of the plot. One advantage of the stem and leaf plot over a histogram is that the values of the distribution can be inferred from the plot. A disadvantage is that you do not have control over the size and placement of the bins, and the plot can be difficult to manage for large data sets in which the stems need to display a large number of leaves.

4. The interquartile range is the range between the first quartile and the third quartile.

5. False. The range of a distribution depends only on the distribution's maximum and minimum values. Those values are not determined by the distribution's variability, especially in the case of outliers.

6. An outlier is an unusually extreme value from a distribution.
   If the value is greater than the third quartile plus 1.5 x the interquartile range (IQR), or less than the first quartile minus 1.5 x the IQR, it is considered a moderate outlier. If the value is greater than the third quartile plus 3 x IQR, or less than the first quartile minus 3 x IQR, it is considered an extreme outlier.

7. False. Extreme values may be a natural part of the data. By removing those values, you may be removing an important (and perhaps the most interesting) part of the distribution.

8. A boxplot is a graphical representation of the distribution that displays the interquartile range, the maximum and minimum values, and the existence of moderate or extreme outliers. One advantage is that the boxplot displays several of the important summary statistics. A disadvantage is that the boxplot may be difficult to interpret for the untrained eye.

9. a. Data Values
      30 30 60 100 110 120 120 180 200 200 210 210 210 220 240 290 300 340 450 610 900
   b. Positively skewed.
   c. The mean is approximately 244.29. The median is 210.
   d. Moderate outliers: 610. Extreme outliers: 900.

10. a. Positively skewed.
    b.

11. a. Univariate Statistics

       Statistic             Square Feet
       Count                 117
       Sum                   193,501
       Average               1,653.85
       Median                1,549
       Trimmed Mean (0.2)    1,594.33
       Minimum               837
       Maximum               3,750
       Range                 2,913
       Standard Deviation    523.723
       Variance              274,285.574
       Standard Error        48.418
       Skewness              1.188
       Kurtosis              1.634
       Smallest (2)          900
       Largest (2)           2,931
       1st Percentile        911
       5th Percentile        1,029
       10th Percentile       1,106
       25th Percentile       1,280
       50th Percentile       1,549
       75th Percentile       1,894
       90th Percentile       2,489
       95th Percentile       2,680
       99th Percentile       2,929
       Interquartile Range   614

    b. The smallest house is 837 square feet. The largest house is 3,750 square feet.
    c. 85.3%
    d. The boxplot appears as:
       [Boxplot: Square Feet]
    e. The house that is 3,750 square feet.
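The fence rule from answer 6 can be checked against the data in answer 9 and the quartiles reported in 11.a. The following is a minimal sketch, not the method used to produce the manual's output: the function name `classify_outlier` is hypothetical, and the quartile convention (linear interpolation over the sorted data, Python's "inclusive" method) is an assumption, since other conventions can shift the fences slightly.

```python
import statistics


def classify_outlier(value, q1, q3):
    """Label a value with the 1.5 x IQR and 3 x IQR fences from answer 6."""
    iqr = q3 - q1
    if value > q3 + 3 * iqr or value < q1 - 3 * iqr:
        return "extreme"
    if value > q3 + 1.5 * iqr or value < q1 - 1.5 * iqr:
        return "moderate"
    return "none"


# Data values from answer 9.a.
data9 = [30, 30, 60, 100, 110, 120, 120, 180, 200, 200, 210,
         210, 210, 220, 240, 290, 300, 340, 450, 610, 900]

# Assumption: quartiles by linear interpolation over the sorted data
# (the "inclusive" method); other conventions can move the fences.
q1, _median, q3 = statistics.quantiles(data9, n=4, method="inclusive")

labels9 = {x: classify_outlier(x, q1, q3) for x in data9}
outliers9 = {x: label for x, label in labels9.items() if label != "none"}
print(outliers9)

# Answer 11: quartiles as reported in 11.a (1,280 and 1,894 square feet).
print(classify_outlier(3750, 1280, 1894))
```

Under these assumptions the sketch labels 610 as a moderate outlier and 900 as an extreme outlier, in agreement with 9.d, and it places the 3,750-square-foot house beyond the upper outer fence (1,894 + 3 x 614 = 3,736).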
    f. The boxplot appears as:
       [Boxplots: Square Feet for Corner Lot = No and Corner Lot = Yes]
    g. Houses on corner lots appear to be larger (in general) than houses that are not on corner lots, but this is not always the case. There are some corner-lot houses that are small.
    h. The house that was an outlier when we considered all houses is not an outlier when compared only to other corner-lot houses. Whether this observation is an outlier depends upon the context.
    i. Square Feet

       Statistic             Corner_Lot = "No"   Corner_Lot = "Yes"   Overall
       Count                 90                  27                   117
       Sum                   135,477             58,024               193,501
       Average               1,505.30            2,149.04             1,653.85
       Median                1,447               2,116                1,549
       Trimmed Mean (0.2)    1,465.22            2,127.17             1,594.33
       Minimum               837                 1,080                837
       Maximum               2,931               3,750                3,750
       Range                 2,094               2,670                2,913
       Standard Deviation    393.338             602.583              523.723
       Variance              154,715.044         363,106.652          274,285.574
       Standard Error        41.462              115.967              48.418
       Skewness              1.156               0.467                1.188
       Kurtosis              2.052               0.359                1.634
       Smallest (2)          900                 1,348                900
       Largest (2)           2,774               2,921                2,931
       1st Percentile        893                 1,150                911
       5th Percentile        1,016               1,364                1,029
       10th Percentile       1,051               1,438                1,106
       25th Percentile       1,219               1,710                1,280
       50th Percentile       1,447               2,116                1,549
       75th Percentile       1,715               2,590                1,894
       90th Percentile       1,923               2,785                2,489
       95th Percentile       2,191               2,899                2,680
       99th Percentile       2,791               3,534                2,929
       Interquartile Range   496                 880                  614

    j. The new variable is a quantitative variable. The first few values are:

       Price      Square Feet   PPSQF
       87,400     1,236         70.71
       110,900    1,740         63.74
       95,000     1,715         55.39
       87,000     1,273         68.34
       73,900     970           76.19
       77,000     900           85.56
       133,000    1,850         71.89
       116,000    1,720         67.44
       102,000    1,606         63.51

    k. The histogram appears as:
       [Histogram of PPSQF]
    l. The distribution is symmetric. There are two moderate outliers with values of 34.53 and 99.24.

12. a.
       Statistic             Employees
       Count                 50
       Sum                   6,777
       Average               135.54
       Median                52
       Trimmed Mean (0.2)    75.28
       Minimum               16
       Maximum               1,400
       Range                 1,384
       Standard Deviation    256.084
       Variance              65,578.825
       Standard Error        36.216
       Skewness              4.118
       Kurtosis              17.522
       Smallest (2)          19
       Largest (2)           1,200
       1st Percentile        17
       5th Percentile        21
       10th Percentile       25
       25th Percentile       33
       50th Percentile       52
       75th Percentile       125
       90th Percentile       215
       95th Percentile       403
       99th Percentile       1,302
       Interquartile Range   93

    b. Average = 135.54. Median = 52. The median is probably a better measure of the size of a typical business in the data set, since it is less influenced by large values. In this case, the sample average is even larger than the third quartile (125).
    c. The distribution is positively skewed with several extreme outliers.
       [Boxplot: Employees]
    d. The first few values of the log(Employees) variable are:

       Employees   LEmployees   Name
       500         2.699        Jockey International, 2300 6th St., Kenosha 53140
       1400        3.146        Astronautics, 4115 N. Teutonia Ave., Milwaukee 53209
       280         2.447        Pleasant Company, 8400 Fairway Place, Middleton 53562
       163         2.212        Dawes Transport, 9180 N. 107th St., Milwaukee 53212
       285         2.455        East Capitol Drive Foods Inc., 709 E. Capitol Drive, Milwaukee 53212
       140         2.146        TSR Inc., 201 Sheridan Springs Road, Lake Geneva 53147
       52          1.716        Ricom Electronics, 5139 W. Clinton Ave., Milwaukee 53233
       46          1.663        Rollette Oil Co., 2025 Beloit Ave., Janesville 53548
       125         2.097        Mainline Industrial Distributors, 3441 Highview Drive, P.O. Box 297, Appleton 54912

       The boxplot of the log(Employees) variable appears as:
       [Boxplot: LEmp]
       The distribution is more symmetric than that of the raw employee counts. Also, there are few outliers, and those outliers are moderate in size, not extreme.
    e.
       The statistics are:

       Statistic             LEmployees
       Count                 50.000
       Sum                   91.974
       Average               1.839
       Median                1.712
       Trimmed Mean (0.2)    1.791
       Minimum               1.204
       Maximum               3.146
       Range                 1.942
       Standard Deviation    0.434
       Variance              0.188
       Standard Error        0.061
       Skewness              1.106
       Kurtosis              1.314
       Smallest (2)          1.279
       Largest (2)           3.079
       1st Percentile        1.241
       5th Percentile        1.328
       10th Percentile       1.394
       25th Percentile       1.512
       50th Percentile       1.712
       75th Percentile       2.097
       90th Percentile       2.331
       95th Percentile       2.589
       99th Percentile       3.113
       Interquartile Range   0.585

       The skewness of the raw counts is 4.118; for the log counts, the skewness is 1.106. The larger skewness for the raw counts shows that their distribution is more skewed than the distribution of the log values. The kurtosis for the raw counts is 17.522, while for the log counts it is 1.314. This indicates that the distribution of the raw counts has heavier tails.
    f. The average of log(Employees) is 1.839. In terms of the number of employees, this is 10^1.839, or 69.10. This is equal to the geometric mean of the raw counts.
    g. The difference in sales between the first two companies is $40,000; however, one company is 5x the size of the other. The difference in sales between the second two companies is $50,000, but the smaller company is 89% of the size of the larger one. Thus, even though the difference between the second pair is greater in absolute numbers, those companies are more similar in terms of their scale.

13. a.
       The statistics are:

       Statistic             High Income Minority   High Income White   Minority     White
       Count                 20                     20                  20           20
       Sum                   550.3                  226.0               737.63       312.5
       Average               27.515                 11.300              36.8815      15.625
       Median                29.1                   9.8                 37.40        15.8
       Mode                  #N/A                   #N/A                #N/A         16.0
       Trimmed Mean (0.2)    28.381                 10.519              36.7519      15.075
       Minimum               5.8                    2.2                 10.60        3.7
       Maximum               41.3                   26.8                62.20        32.4
       Range                 35.5                   24.6                51.60        28.7
       Standard Deviation    10.9220                6.5164              13.05090     7.7958
       Variance              119.2908               42.4632             170.32590    60.7746
       Standard Error        2.4422                 1.4571              2.91827      1.7432
       Skewness              -0.598                 0.994               0.004        0.415
       Kurtosis              -0.616                 0.880               -0.373       -0.122
       Smallest (2)          8.0                    3.6                 20.90        5.5
       Largest (2)           41.1                   25.1                55.90        29.7
       1st Percentile        6.2                    2.5                 12.56        4.0
       5th Percentile        7.9                    3.5                 20.39        5.4
       10th Percentile       11.0                   4.1                 22.88        5.6
       25th Percentile       21.3                   7.4                 26.43        9.2
       50th Percentile       29.1                   9.8                 37.40        15.8
       75th Percentile       37.0                   15.1                45.25        20.2
       90th Percentile       39.3                   17.0                51.94        23.9
       95th Percentile       41.1                   25.2                56.22        29.8
       99th Percentile       41.3                   26.5                61.00        31.9
       Interquartile Range   15.7                   7.7                 18.83        11.0

    b. The boxplots appear as:
       [Boxplots: High Income Minority, High Income White, Minority, White]

14. a.
       The statistics are:

       Statistic             Area = "North"    Area = "South"    Area = "West"     Overall
       Count                 21                17                13                51
       Sum                   820,651           622,717           544,099           1,987,467
       Average               39,078.63         36,630.40         41,853.78         38,969.95
       Median                39,200            35,328            41,261            37,411
       Mode                  43,472            #N/A              #N/A              (43472 ; 36230.4)
       Trimmed Mean (0.2)    39,000.75         35,921.60         40,380.07         38,261.81
       Minimum               28,952            29,509            33,550            28,952
       Maximum               49,085            54,384            66,368            66,368
       Range                 20,133            24,875            32,818            37,416
       Standard Deviation    5,960.871         5,686.171         8,197.974         6,687.082
       Variance              35,531,977.545    32,332,535.680    67,206,782.950    44,717,069.549
       Standard Error        1,300.769         1,379.099         2,273.709         936.379
       Skewness              0.029             2.044             2.439             1.583
       Kurtosis              -1.164            5.557             7.368             4.549
       Smallest (2)          31,333            31,261            35,746            29,509
       Largest (2)           48,269            43,498            46,611            54,384
       1st Percentile        29,428            29,789            33,814            29,230
       5th Percentile        31,333            30,910            34,868            31,297
       10th Percentile       32,421            32,146            35,791            32,520
       25th Percentile       33,502            33,504            36,230            34,391
       50th Percentile       39,200            35,328            41,261            37,411
       75th Percentile       43,472            37,411            41,624            42,508
       90th Percentile       47,152            41,553            46,001            46,611
       95th Percentile       48,269            45,675            54,514            48,677
       99th Percentile       48,922            52,642            63,997            60,376
       Interquartile Range   9,970             3,907             5,394             8,117

    b. The boxplots appear as:
       [Boxplots: Salary Boxplot, Teacher Salaries by Area (North, South, West)]
    c. The salary distribution for the Northern region is symmetric with no outliers. There is an extreme outlier and a moderate outlier in the Southern region. The distribution of the Southern region is positively skewed. There is an extreme outlier in the Western region. The outlier comes from the state of Alaska. The cost of living in Alaska is very high compared to other states in the union; therefore, teacher salaries may be higher to compensate.
    d. The statistics are:
       Statistic             Area = "North"   Area = "South"   Area = "West"   Overall
       Count                 21               17               13              51
       Sum                   133.39           121.24           92.47           347.10
       Average               6.3517           7.1317           7.1132          6.8058
       Median                6.25             7.34             6.86            6.77
       Mode                  #N/A             #N/A             #N/A            #N/A
       Trimmed Mean (0.2)    6.3203           7.1454           7.0707          6.7614
       Minimum               4.91             5.45             4.97            4.91
       Maximum               7.98             8.61             9.73            9.73
       Range                 3.07             3.15             4.76            4.82
       Standard Deviation    0.78332          0.83879          1.50144         1.07652
       Variance              0.61359          0.70356          2.25431         1.15890
       Standard Error        0.17093          0.20344          0.41642         0.15074
       Skewness              0.383            -0.153           0.169           0.435
       Kurtosis              -0.167           -0.188           -0.967          -0.085
       Smallest (2)          5.37             5.96             5.00            4.97
       Largest (2)           7.68             8.40             8.73            8.73
       1st Percentile        5.00             5.53             4.97            4.94
       5th Percentile        5.37             5.86             4.99            5.19
       10th Percentile       5.44             6.14             5.14            5.45
       25th Percentile       5.79             6.55             6.25            6.09
       50th Percentile       6.25             7.34             6.86            6.77
       75th Percentile       6.87             7.41             8.36            7.40
       90th Percentile       7.50             8.16             8.73            8.36
       95th Percentile       7.68             8.44             9.13            8.66
       99th Percentile       7.92             8.57             9.61            9.23
       Interquartile Range   1.09             0.86             2.10            1.31

    e. The boxplots appear as:
       [Boxplots: Salary Pupil Ratio by Area (North, South, West)]
    f. Alaska is not an outlier for this variable. Alaska is at the 2nd percentile for salary/pupil ratio, but at the 100th percentile for salary. Even though these teachers are paid the highest, in terms of the salary/pupil ratio they are among the lowest paid.

15. a. The histogram appears as:
       [Histogram: Salary, bins 0 to 2,500]
    b. The frequency table is:

       Salary   Freq   Cum. Freq.   %        Cum. %
       0        23     23           8.75%    8.75%
       100      44     67           16.73%   25.48%
       200      33     100          12.55%   38.02%
       300      22     122          8.37%    46.39%
       400      23     145          8.75%    55.13%
       500      19     164          7.22%    62.36%
       600      14     178          5.32%    67.68%
       700      29     207          11.03%   78.71%
       800      12     219          4.56%    83.27%
       900      11     230          4.18%    87.45%
       1,000    7      237          2.66%    90.11%
       1,100    4      241          1.52%    91.63%
       1,200    4      245          1.52%    93.16%
       1,300    5      250          1.90%    95.06%
       1,400    1      251          0.38%    95.44%
       1,500    1      252          0.38%    95.82%
       1,600    2      254          0.76%    96.58%
       1,700    0      254          0.00%    96.58%
       1,800    2      256          0.76%    97.34%
       1,900    4      260          1.52%    98.86%
       2,000    0      260          0.00%    98.86%
       2,100    1      261          0.38%    99.24%
       2,200    0      261          0.00%    99.24%
       2,300    0      261          0.00%    99.24%
       2,400    2      263          0.76%    100.00%
       2,500    0      263          0.00%    100.00%

    c. The 10th percentile is 100. The 90th percentile is 1049.
    d. The players in the upper 10% of the salary range are:

       First     Last         Salary
       Darryl    Strawberry   1220
       Don       Mattingly    1975
       George    Bell         1175
       Wade      Boggs        1600
       Cal       Ripken       1350
       Jesse     Barfield     1238
       Kent      Hrbek        1310
       Von       Hayes        1300
       Leon      Durham       1183
       Tony      Pena         1150
       Kirk      Gibson       1300
       Rickey    Henderson    1670
       Carney    Lansford     1200
       Ozzie     Smith        1940
       Paul      Molitor      1260
       Eddie     Murray       2460
       Dale      Murphy       1900
       Jack      Clark        1300
       Andre     Thornton     1100
       Gary      Carter       1926
       Jim       Rice         2413
       Keith     Hernandez    1800
       Dave      Winfield     1861
       George    Brett        1500
       Mike      Schmidt      2127
       Ron       Cey          1050
       Steve     Garvey       1450

    e. Average = 541.48. Median = 425. Percentile = 59.3%

16. a. The histogram appears as:
       [Histogram: Salary, bins 0.0 to 22.0]
    b. The frequency table is:

       Salary   Freq   Cum. Freq.   %        Cum. %
       0.0      184    184          40.98%   40.98%
       0.5      47     231          10.47%   51.45%
       1.0      25     256          5.57%    57.02%
       1.5      21     277          4.68%    61.69%
       2.0      12     289          2.67%    64.37%
       2.5      15     304          3.34%    67.71%
       3.0      21     325          4.68%    72.38%
       3.5      13     338          2.90%    75.28%
       4.0      13     351          2.90%    78.17%
       4.5      12     363          2.67%    80.85%
       5.0      12     375          2.67%    83.52%
       5.5      14     389          3.12%    86.64%
       6.0      10     399          2.23%    88.86%
       6.5      8      407          1.78%    90.65%
       7.0      10     417          2.23%    92.87%
       7.5      3      420          0.67%    93.54%
       8.0      5      425          1.11%    94.65%
       8.5      3      428          0.67%    95.32%
       9.0      2      430          0.45%    95.77%
       9.5      3      433          0.67%    96.44%
       10.0     3      436          0.67%    97.10%
       10.5     0      436          0.00%    97.10%
       11.0     1      437          0.22%    97.33%
       11.5     1      438          0.22%    97.55%
       12.0     6      444          1.34%    98.89%
       12.5     1      445          0.22%    99.11%
       13.0     2      447          0.45%    99.55%
       13.5     1      448          0.22%    99.78%
       14.0     0      448          0.00%    99.78%
       14.5     0      448          0.00%    99.78%
       15.0     0      448          0.00%    99.78%
       15.5     0      448          0.00%    99.78%
       16.0     0      448          0.00%    99.78%
       16.5     0      448          0.00%    99.78%
       17.0     0      448          0.00%    99.78%
       17.5     0      448          0.00%    99.78%
       18.0     0      448          0.00%    99.78%
       18.5     0      448          0.00%    99.78%
       19.0     0      448          0.00%    99.78%
       19.5     0      448          0.00%    99.78%
       20.0     0      448          0.00%    99.78%
       20.5     0      448          0.00%    99.78%
       21.0     0      448          0.00%    99.78%
       21.5     0      448          0.00%    99.78%
       22.0     1      449          0.22%    100.00%

    c. The 10th percentile is 0.2148. The 90th percentile is 6.6672.
    d.
       The player list is:

       Number   Player              Salary
       3        Alex Rodriguez      22.000
       31       Mike Piazza         13.571
       42       Mo Vaughn           13.167
       24       Manny Ramirez       13.033
       2        Derek Jeter         12.600
       51       Bernie Williams     12.357
       25       Carlos Delgado      12.200
       33       Larry Walker        12.167
       15       Shawn Green         12.167
       88       Albert Belle        12.049
       21       Sammy Sosa          12.000
       43       Raul Mondesi        11.500
       25       Mark McGwire        11.000
       10       Chipper Jones       10.333
       25       Barry Bonds         10.300
       22       Juan Gonzalez       10.000
       35       Frank Thomas        9.927
       10       Gary Sheffield      9.917
       30       Ken Griffey         9.710
       33       Brian Jordan        9.100
       11       Barry Larkin        9.000
       25       Rafael Palmeiro     8.621
       7        Ivan Rodriguez      8.600
       4        Robin Ventura       8.500
       25       Andruw Jones        8.200
       25       Jim Thome           8.175
       16       Ray Lankford        8.100
       33       Jay Bell            8.000
       7        Kenny Lofton        8.000
       8        Javy Lopez          7.750
       7        Craig Biggio        7.750
       7        Dean Palmer         7.500
       23       Greg Vaughn         7.479
       23       Eric Karros         7.375
       12       Roberto Alomar      7.343
       24       Brian Giles         7.333
       2        Carl Everett        7.333
       5        Nomar Garciaparra   7.250
       19       Vinny Castilla      7.250
       8        Johnny Damon        7.100
       28       David Justice       7.000
       19       Dante Bichette      7.000
       9        Todd Zeile          6.833
       30       Jose Offerman       6.750
       5        John Olerud         6.700

    e. Average = 2.4545. Median = 0.9. The average salary is at the 64.4 percentile.
    f. The skewness of the 1985 salaries is 1.5717. The skewness of the 2002 salaries is 1.912. Salaries are more skewed in 2002 than they were in 1985.

17. a. The boxplots are:
       [Boxplots: Bladder Cancer Rate by Cig Use Category (0, 1, 2)]
       [Boxplots: Kidney Cancer Rates by Cig Use Category (0, 1, 2)]
       [Boxplots: Leukemia Rates by Cig Use Category (0, 1, 2)]
       [Boxplots: Lung Cancer Rates by Cig Use Category (0, 1, 2)]
    b.
       The univariate statistics are:

       Bladder
       Statistic             Cig_Use_Category = 0   Cig_Use_Category = 1   Cig_Use_Category = 2
       Count                 14                     15                     15
       Sum                   47.70                  61.48                  72.15
       Average               3.4071                 4.0987                 4.8100
       Median                3.07                   4.09                   4.78
       Mode                  2.90                   3.72                   4.46
       Trimmed Mean (0.2)    3.3058                 4.1038                 4.8008
       Minimum               2.89                   2.86                   3.20
       Maximum               5.14                   5.27                   6.54
       Range                 2.25                   2.41                   3.34
       Standard Deviation    0.71710                0.77775                0.87097
       Variance              0.51424                0.60490                0.75859
       Standard Error        0.19165                0.20081                0.22488
       Skewness              1.609                  -0.089                 0.031
       Kurtosis              1.683                  -1.033                 0.377
       Smallest (2)          2.90                   2.93                   3.46
       Largest (2)           4.65                   5.21                   5.98
       1st Percentile        2.89                   2.87                   3.24
       5th Percentile        2.90                   2.91                   3.38
       10th Percentile       2.90                   3.04                   3.69
       25th Percentile       2.92                   3.62                   4.46
       50th Percentile       3.07                   4.09                   4.78
       75th Percentile       3.56                   4.71                   5.21
       90th Percentile       4.47                   5.08                   5.83
       95th Percentile       4.82                   5.23                   6.15
       99th Percentile       5.08                   5.26                   6.46
       Interquartile Range   0.64                   1.09                   0.75

       Kidney
       Statistic             Cig_Use_Category = 0   Cig_Use_Category = 1   Cig_Use_Category = 2
       Count                 14                     15                     15
       Sum                   33.82                  43.59                  45.55
       Average               2.4157                 2.9060                 3.0367
       Median                2.32                   2.90                   3.03
       Mode                  #N/A                   2.75                   2.66
       Trimmed Mean (0.2)    2.3842                 2.9169                 2.9862
       Minimum               1.59                   2.13                   2.41
       Maximum               3.62                   3.54                   4.32
       Range                 2.03                   1.41                   1.91
       Standard Deviation    0.54092                0.35958                0.45492
       Variance              0.29260                0.12930                0.20695
       Standard Error        0.14457                0.09284                0.11746
       Skewness              0.716                  -0.197                 1.491
       Kurtosis              0.571                  0.640                  3.896
       Smallest (2)          1.77                   2.45                   2.55
       Largest (2)           3.11                   3.43                   3.36
       1st Percentile        1.61                   2.17                   2.43
       5th Percentile        1.71                   2.35                   2.51
       10th Percentile       1.85                   2.55                   2.59
       25th Percentile       2.08                   2.75                   2.75
       50th Percentile       2.32                   2.90                   3.03
       75th Percentile       2.72                   3.07                   3.18
       90th Percentile       3.04                   3.37                   3.36
       95th Percentile       3.29                   3.46                   3.65
       99th Percentile       3.55                   3.52                   4.19
       Interquartile Range   0.63                   0.32                   0.43

       Leukemia
       Statistic             Cig_Use_Category = 0   Cig_Use_Category = 1   Cig_Use_Category = 2
       Count                 14                     15                     15
       Sum                   94.34                  107.04                 99.13
       Average               6.7386                 7.1360                 6.6087
       Median                6.71                   7.00                   6.82
       Mode                  6.71                   7.38                   #N/A
       Trimmed Mean (0.2)    6.6975                 7.1038                 6.6892
       Minimum               5.82                   6.41                   4.90
       Maximum               8.15                   8.28                   7.27
       Range                 2.33                   1.87                   2.37
       Standard Deviation    0.64412                0.51788                0.66076
       Variance              0.41489                0.26820                0.43660
       Standard Error        0.17215                0.13372                0.17061
       Skewness              0.604                  0.719                  -1.339
       Kurtosis              0.353                  0.066                  1.842
       Smallest (2)          5.95                   6.56                   5.78
       Largest (2)           7.48                   7.80                   7.23
       1st Percentile        5.84                   6.43                   5.02
       5th Percentile        5.90                   6.52                   5.52
       10th Percentile       5.99                   6.58                   5.90
       25th Percentile       6.26                   6.82                   6.30
       50th Percentile       6.71                   7.00                   6.82
       75th Percentile       6.98                   7.42                   7.10
       90th Percentile       7.46                   7.76                   7.22
       95th Percentile       7.71                   7.94                   7.24
       99th Percentile       8.06                   8.21                   7.26
       Interquartile Range   0.72                   0.60                   0.81

       Lung
       Statistic             Cig_Use_Category = 0   Cig_Use_Category = 1   Cig_Use_Category = 2
       Count                 14                     15                     15
       Sum                   233.27                 284.94                 346.53
       Average               16.6621                18.9960                23.1020
       Median                16.41                  19.50                  23.03
       Mode                  #N/A                   #N/A                   #N/A
       Trimmed Mean (0.2)    16.3175                18.9500                23.3338
       Minimum               12.01                  12.11                  15.92
       Maximum               25.45                  26.48                  27.27
       Range                 13.44                  14.37                  11.35
       Standard Deviation    3.63450                3.62722                2.70780
       Variance              13.20959               13.15674               7.33219
       Standard Error        0.97136                0.93654                0.69915
       Skewness              1.005                  0.013                  -1.101
       Kurtosis              1.365                  0.308                  2.670
       Smallest (2)          12.12                  14.20                  20.94
       Largest (2)           20.55                  22.72                  25.95
       1st Percentile        12.02                  12.40                  16.62
       5th Percentile        12.08                  13.57                  19.43
       10th Percentile       12.56                  14.73                  20.96
       25th Percentile       14.23                  16.65                  22.06
       50th Percentile       16.41                  19.50                  23.03
       75th Percentile       17.56                  20.98                  24.79
       90th Percentile       20.49                  22.39                  25.92
       95th Percentile       22.27                  23.85                  26.35
       99th Percentile       24.81                  25.95                  27.09
       Interquartile Range   3.33                   4.34                   2.73

    c. There does not appear to be any relationship between cigarette use and leukemia rate in the states. There is some evidence for a relationship between cigarette use and kidney cancer, since the kidney cancer rate increases with increased rate of cigarette use. The strongest evidence occurs between cigarette use and bladder cancer or lung cancer. In most cases, as the rate of cigarette use increases, so does the cancer rate.
    d. Wyoming, with a lung cancer rate of 15.92 and a cigarette use category of "high".

18.
    a. The statistics are:

       Statistic   Diff80    Ratio80
       Count       14        14
       Average     -16.600   0.730
       Median      -10.9     0.507

    b. The boxplots are:
       [Boxplot: Diff80]
       [Boxplot: Ratio80]
       The extreme outlier in the Diff80 boxplot comes from New York City.
    c. The statistics without New York City are:

       Statistic   Diff80    Ratio80
       Count       13        13
       Average     -10.215   0.774
       Median      -8.6      0.538

       The boxplots are:
       [Boxplot: Diff80]
       [Boxplot: Ratio80]
    d. There is evidence that air quality has improved (whether we include or exclude the data from New York City). Of the two statistics, the ratio is less affected by the New York City results, though it does contain two moderate outliers from other cities.

19. a. The statistics are:

       Statistic             REACTION
       Count                 106
       Sum                   18.264
       Average               0.17230
       Median                0.171
       Mode                  (0.175 ; 0.167 ; 0.164 ; 0.162)
       Trimmed Mean (0.2)    0.17107
       Minimum               0.124
       Maximum               0.224
       Range                 0.100
       Standard Deviation    0.020556
       Variance              0.000423
       Standard Error        0.001997
       Skewness              0.495
       Kurtosis              0.101
       Smallest (2)          0.137
       Largest (2)           0.224
       1st Percentile        0.137
       5th Percentile        0.143
       10th Percentile       0.148
       25th Percentile       0.158
       50th Percentile       0.171
       75th Percentile       0.186
       90th Percentile       0.200
       95th Percentile       0.214
       99th Percentile       0.224
       Interquartile Range   0.028

    b. The boxplot appears as:
       [Boxplot: REACTION]
       There are no outliers and the shape of the distribution appears to be symmetric.
    c. The stem and leaf plot appears as follows:

       Stem x 0.01   REACTION
       12            4
       13            7
       14            0011344567778
       15            22455566677899
       16            0011222233444455777789
       17            00111222334444555667
       18            22233455666777899
       19            0122338
       20            2238
       21            3477
       22            244

    d. The interquartile range is 0.0275. The lower inner fence (the first quartile minus 1.5 x IQR) is equal to 0.117. The lower outer fence (the first quartile minus 3 x IQR) is equal to 0.0757. Because 0.1 seconds lies below the inner fence but above the outer fence, it would be considered a moderate outlier.
    e.
       The reaction statistics by order of finish are:

       Group        Count   Average   Median   Minimum   Maximum   Standard Deviation
       FINISH = 1   12      0.16842   0.172    0.137     0.189     0.017154
       FINISH = 2   12      0.16750   0.166    0.140     0.214     0.022134
       FINISH = 3   12      0.17033   0.161    0.141     0.222     0.024570
       FINISH = 4   12      0.16717   0.164    0.145     0.187     0.014509
       FINISH = 5   12      0.17575   0.175    0.146     0.217     0.019410
       FINISH = 6   12      0.17267   0.171    0.142     0.224     0.023631
       FINISH = 7   12      0.17125   0.174    0.124     0.209     0.023038
       FINISH = 8   12      0.17925   0.176    0.148     0.224     0.022519
       FINISH = 9   10      0.17960   0.177    0.155     0.213     0.018234

       In general, those who finished lower seemed to have slower reaction times, but there is a great deal of overlap in the data. For example, one of the first-place finishers had a reaction time of 0.189 seconds, while one of the last-place finishers had a reaction time of 0.155 seconds. A slower reaction time need not mean a lower finish.
    f. The boxplot is:
       [Boxplots: Reaction Times by Order of Finish (FINISH = 1 through FINISH = 9)]
    g. There is no strong evidence of a relationship between a runner's reaction time and the result of the race.
    h. The statistics are:

       REACTION        Count   Average   Median   Minimum   Maximum   Std. Dev.
       Group = "1-3"   36      0.16875   0.166    0.137     0.222     0.020919
       Group = "4-6"   36      0.17186   0.170    0.142     0.224     0.019314
       Group = "7-9"   34      0.17653   0.175    0.124     0.224     0.021268

       The boxplot is:
       [Boxplots: Reaction Times by Finish Group ("1-3", "4-6", "7-9")]
       There appears to be a trend to slower reaction times for runners in the lower finish groups. Reaction time may be related to finish order only in a very general way.
       For example, you cannot tell whether a runner will finish first by having a fast reaction time, but it may be more likely that runners who finish in the top 3 have, on average, faster reaction times than runners who finish in the bottom group. However, given the amount of overlap in the boxplots, even this very general conclusion may not be valid.

20. a. The statistics are:

       Statistic             Assoc      Asst       Full
       Count                 50         50         50
       Sum                   2,571.35   2,163.86   3,695.56
       Average               51.4270    43.2772    73.9112
       Median                50.65      42.45      72.00
       Trimmed Mean (0.2)    50.9388    42.8240    73.4815
       Minimum               41.80      35.30      55.90
       Maximum               70.00      56.40      96.50
       Range                 28.20      21.10      40.60
       Standard Deviation    5.70100    4.61775    10.16325
       Variance              32.50135   21.32362   103.29158
       Standard Error        0.80624    0.65305    1.43730
       Skewness              1.012      1.060      0.419
       Kurtosis              1.623      1.144      -0.468
       Smallest (2)          42.00      35.90      56.10
       Largest (2)           64.40      56.20      93.30
       1st Percentile        41.90      35.59      56.00
       5th Percentile        43.29      37.65      58.64
       10th Percentile       45.97      38.79      62.94
       25th Percentile       47.83      40.40      66.80
       50th Percentile       50.65      42.45      72.00
       75th Percentile       54.68      44.80      79.68
       90th Percentile       57.32      50.09      90.23
       95th Percentile       62.98      51.06      92.03
       99th Percentile       67.26      56.30      94.93
       Interquartile Range   6.85       4.40       12.88

       The greatest variability is found in the salaries of full professors.
    b. In general, all three groups have fairly symmetric distributions. There are no apparent outliers in the stem and leaf plot.

       Stem x 10   Assoc                 Asst                             Full
       3
       3                                 557888899
       4           1224                  000000011111111222223333334444
       4           5666667777888899999   6679
       5           000001111223344       00011
       5           55556777              66                               566
       6           144                                                    12334
       6                                                                  55566777889
       7           0                                                      0000112233
       7                                                                  555667899
       8                                                                  233
       8                                                                  567
       9                                                                  00123
       9                                                                  6

    c. The boxplot is:
       [Boxplots: Professor Salaries for Assoc, Asst, Full]
       There is some overlap in the salaries. This is probably due to the fact that an experienced associate professor may make as much as or more than a new full professor.
       Likewise, experienced assistant professors may make salaries comparable to those of less experienced associate professors.
    d. The percentiles are:

       Statistic         Assoc   Asst    Full
       25th Percentile   47.83   40.40   66.80
       75th Percentile   54.68   44.80   79.68

    e. $92,030, since this represents the upper 5% of the salary range for full professors at the universities in the sample.

21. a. The statistics are:

       Statistic            Year 68   Year 72
       Count                19        19
       Average              0.4932    0.5268
       Median               0.50      0.53
       Minimum              0.34      0.35
       Maximum              0.63      0.64
       Standard Deviation   0.06799   0.07079

    b. 1968 geometric mean = 0.4885. 1972 geometric mean = 0.522.
    c. The boxplot is:
       [Boxplots: Women in the Labor Force, % of Women for Year 68 and Year 72]
       There is one outlier from 1972, coming from the city of St. Louis. The boxplot seems to indicate that the percent of women in the workforce has increased during this period.
    d. The distributions are close to symmetric. The mean and median values are similar, so you could use either the mean or the median.

22. a. There appears to be a slight downward trend in the draft numbers as the year progresses, but there is a lot of overlap in the boxplots.

       Group        Count   Average   Median   Standard Deviation
       Month = 1    31      201.48    215      99.756
       Month = 2    29      202.97    210      103.955
       Month = 3    31      225.81    256      95.825
       Month = 4    30      203.67    225      109.371
       Month = 5    31      207.97    226      114.974
       Month = 6    30      195.73    208      117.866
       Month = 7    31      181.55    188      109.614
       Month = 8    31      173.45    145      112.731
       Month = 9    30      157.30    168      87.151
       Month = 10   31      182.45    201      96.781
       Month = 11   30      148.73    132      94.398
       Month = 12   31      121.55    100      95.059

       [Boxplots: Draft Numbers by Month (Month = 1 through Month = 12)]
    b. The statistics and boxplot are:
       Group         Count   Average   Median   Standard Deviation
       Quarter = 1   91      210.24    216      99.332
       Quarter = 2   91      202.52    219      112.974
       Quarter = 3   92      170.91    168      103.310
       Quarter = 4   92      150.93    134      97.677

       [Boxplots: Draft Numbers by Quarter (Quarter = 1 through Quarter = 4)]
       The first two quarters of the year appear to be similar, but there is a drop-off in the third quarter and then another drop-off in the fourth quarter.
    c. The statistics and boxplot are:

       Group      Count   Average   Median   Standard Deviation
       Half = 1   182     206.38    218      106.149
       Half = 2   184     160.92    155      100.757

       [Boxplots: Draft Numbers by Half (Half = 1, Half = 2)]
       There appears to be a drop-off in the draft numbers from the first half of the year to the second.
    d. On average, it would appear that persons born in the second half of the year received a lower draft number and were therefore more likely to be selected in the draft. This effect is difficult to see until you create groups that are large enough to compensate for the variability between individual numbers. One possible cause could be how the draft numbers were randomized. If the draft numbers were poured into the drum with the early birth dates poured in first and the later dates poured in last, the later birth dates would be located near the top of the pile. Rotating the drum would randomize the order somewhat, but there would still be a larger proportion of later dates near the top of the drum, causing those dates to receive lower draft numbers.

23. a. The boxplot is:
       [Boxplots: Hedge Sparrow, Meadow Pipit, Pied Wagtail, Robin, Tree Pipit, Wren]
    b.
       The statistics are:

       Species         Count   Average   Median   Minimum   Maximum   Std. Dev.   Variance   Mode
       Hedge Sparrow   14      23.1214   23.05    20.85     25.05     1.06874     1.14220    23.05
       Meadow Pipit    45      22.2989   22.25    19.65     24.45     0.92063     0.84756    22.05
       Pied Wagtail    15      22.9033   23.05    21.05     24.85     1.06762     1.13981    (24.05 ; 21.85)
       Robin           16      22.5750   22.55    21.05     23.85     0.68459     0.46867    23.05
       Tree Pipit      15      23.0900   23.25    21.05     24.05     0.90143     0.81257    (24.05 ; 23.25)
       Wren            15      21.1300   21.05    19.85     22.25     0.74374     0.55314    (22.05 ; 21.05 ; 20.85)
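Returning to the grouping argument in answer 22.d, a rough numerical check is possible using the standard error of a group mean (standard deviation divided by the square root of the count, the same relationship that links the Standard Deviation and Standard Error rows in this chapter's tables). The sketch below is illustrative only: the function name `standard_error` and the dictionary layout are hypothetical, and the counts, averages, and standard deviations are copied from the answer 22 tables.

```python
import math


def standard_error(std_dev, count):
    """Standard error of a group mean: std_dev / sqrt(count)."""
    return std_dev / math.sqrt(count)


# (count, average, standard deviation) copied from the answer 22 tables.
groups = {
    "Month = 1": (31, 201.48, 99.756),
    "Month = 12": (31, 121.55, 95.059),
    "Half = 1": (182, 206.38, 106.149),
    "Half = 2": (184, 160.92, 100.757),
}

for label, (count, average, std_dev) in groups.items():
    se = standard_error(std_dev, count)
    print(f"{label}: average {average:.2f}, standard error {se:.2f}")
```

With roughly 30 birth dates per month, a monthly average carries a standard error of about 17 to 18 draft numbers, while each half-year average carries a standard error under 8, which is small relative to the gap of about 45 between the two halves. This is one way to see why the drop-off becomes clearer as the groups get larger.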