Assignments - Wharton Statistics Department

Assignments:

1. A. To discover the average number of children in a family, Fred, a young college professor at a large Midwestern university, surveyed a random sample of 1,000 students in Psychology 101, asking each how many children (including themselves) were in their family. He added the responses together, divided by 1,000, and got 3.5. The most recent census found that the average number of children in a U.S. family is 2.5.
   a. Why are these two numbers different?
   b. Should Fred have designed his study differently to get the right answer?
   c. How?

B. Prenatal screening for Down syndrome is usually recommended for mothers over the age of 35. A non-invasive test is about 95% accurate; that is, if the fetus has Down syndrome, the test will detect it 95% of the time. If the fetus does not have Down’s, the test will correctly say so 80% of the time. We know that Down’s is not very common, affecting only about one in every 200 fetuses whose mothers are over age 35.
   a. What is the probability that, if the test says the fetus has Down’s, the test is correct?
   b. What is the probability that, if the test says the fetus doesn’t have Down’s, the test is correct?

C. A survey of hospitals found that the hospitals with the highest proportion of female births also tended to have the fewest births of any hospitals in the survey. The Jones family, having already had a son, decided to boost their chances of a daughter by going to one of the hospitals that, so far this year, had the highest proportion of female births.
   a. Is this a sensible strategy?
   b. If so, why? If not, why not?
   c. How does this shed light on why the best-performing mutual funds are usually small?
   d. Should this guide our investment strategy? If so, why? If not, why not?

2. A. Find data displays in the mass media that illustrate at least two of the most common errors.
You can find one display with multiple flaws, or two displays with one flaw apiece. Redo the displays correctly. Explain (i) where you found the displays, (ii) what you believe the point of each display was, (iii) what the flaws were, and (iv) what you did to fix them. (For an example, see http://flowingdata.com/2009/11/26/fox-news-makes-the-best-pie-chart-ever/)

B. What were the key lessons in Arbuthnot’s (1710) paper? Compare the explanations for the change in the number of christenings in 1704 with that in 1665-1666.

C. Find one wonderful display in the mass media. Explain (i) where you found the display, (ii) what you believe the point of the display was, and (iii) why you think it is wonderful.

3. A. With your knowledge of improved methods of multivariate display, develop a display of the following data set:

                                   Antibiotic
Bacteria                           Penicillin   Streptomycin   Neomycin   Gram staining
Aerobacter aerogenes               870          1              1.6        negative
Brucella abortus                   1            2              0.02       negative
Brucella anthracis                 0.001        0.01           0.007      positive
Diplococcus pneumoniae             0.005        11             10         positive
Escherichia coli                   100          0.4            0.1        negative
Klebsiella pneumoniae              850          1.2            1          negative
Mycobacterium tuberculosis         800          5              2          negative
Proteus vulgaris                   3            0.1            0.1        negative
Pseudomonas aeruginosa             850          2              0.4        negative
Salmonella (Eberthella) typhosa    1            0.4            0.008      negative
Salmonella schottmuelleri          10           0.8            0.09       negative
Staphylococcus albus               0.007        0.1            0.001      positive
Staphylococcus aureus              0.03         0.03           0.001      positive
Streptococcus fecalis              1            1              0.1        positive
Streptococcus hemolyticus          0.001        14             10         positive
Streptococcus viridans             0.005        10             40         positive

The entries of the table are the minimum inhibitory concentration (MIC) in µg/ml, a measure of the effectiveness of the antibiotic. The MIC represents the concentration of antibiotic required to prevent growth in vitro. The covariate “gram staining” describes the reaction of the bacteria to Gram staining. Gram-positive bacteria are those that are stained dark blue or violet; Gram-negative bacteria do not react that way.

B.
Smoothing problem: One might think that if life expectancy is high, the murder rate cannot be. But although murder does not take a huge toll on a population, perhaps it is an indicator of other life-threatening processes going on in society.
   (a) Plot life expectancy as a function of murder rate, then
   (b) smooth life expectancy by adding the 53h smooth (running medians of 5, then of 3, then hanning) to the plot. What have you learned?
   (c) Make a separate plot of residuals from the smooth vs. murder rate. What has this taught you?
   (d) Add a straight-line fit to the plot. Does this help us to understand things better? Or does it hide things that the smooth has told us? Explain.

STATE NAME        LIFE EXPECT.   MURDER   HSGRAD   INCOME   ILLITERACY
Alabama           69.1           15.1     41.3     3624     2.1
Alaska            69.3           11.3     66.7     6315     1.5
Arizona           70.6            7.8     58.1     4530     1.8
Arkansas          70.7           10.1     39.9     3378     1.9
California        71.7           10.3     62.6     5114     1.1
Colorado          72.1            6.8     63.9     4884     0.7
Connecticut       72.5            3.1     56.0     5348     1.1
Delaware          70.1            6.2     54.6     4809     0.9
Florida           70.7           10.7     52.6     4815     1.3
Georgia           68.5           13.9     40.6     4091     2.0
Hawaii            73.6            6.2     61.9     4963     1.9
Idaho             71.9            5.3     59.5     4119     0.6
Illinois          70.1           10.3     52.6     5107     0.9
Indiana           70.9            7.1     52.9     4458     0.7
Iowa              72.6            2.3     59.0     4628     0.5
Kansas            72.6            4.5     59.9     4669     0.6
Kentucky          70.1           10.6     38.5     3712     1.6
Louisiana         68.8           13.2     42.2     3545     2.8
Maine             70.4            2.7     54.7     3694     0.7
Maryland          70.2            8.5     52.3     5299     0.9
Massachusetts     71.8            3.3     58.5     4755     1.1
Michigan          70.6           11.1     52.8     4751     0.9
Minnesota         73.0            2.3     57.6     4675     0.6
Mississippi       68.1           12.5     41.0     3098     2.4
Missouri          70.7            9.3     48.8     4254     0.8
Montana           70.6            5.0     59.2     4347     0.6
Nebraska          72.6            2.9     59.3     4508     0.6
Nevada            69.0           11.5     65.2     5149     0.5
New Hampshire     71.2            3.3     57.6     4281     0.7
New Jersey        70.9            5.2     52.5     5237     1.1
New Mexico        70.3            9.7     55.2     3601     2.2
New York          70.6           10.9     52.7     4903     1.4
North Carolina    69.2           11.1     38.5     3875     1.8
North Dakota      72.8            1.4     50.3     5087     0.8
Ohio              70.8            7.4     53.2     4561     0.8
Oklahoma          71.4            6.4     51.6     3983     1.1
Oregon            72.1            4.2     60.0     4660     0.6
Pennsylvania      70.4            6.1     50.2     4449     1.0
Rhode Island      71.9            2.4     46.4     4558     1.3
South Carolina    68.0           11.6     37.8     3635     2.3
South Dakota      72.1            1.7     53.3     4167     0.5
Tennessee         70.1           11.0     41.8     3821     1.7
Texas             70.9           12.2     47.4     4188     2.2
Utah              72.9            4.5     67.3     4022     0.6
Vermont           71.6            5.5     57.1     3907     0.6
Virginia          70.1            9.5     47.8     4701     1.4
Washington        71.7            4.3     63.5     4864     0.6
West Virginia     69.5            6.7     41.6     3617     1.4
Wisconsin         72.5            3.0     54.5     4468     0.7
Wyoming           70.3            6.9     62.9     4566     0.6

4. A. Exact exponential growth: Fred and Alice were born the same year, and each began life with $500. Fred added $100 each year but kept his treasure under his mattress, so he earned no interest. Alice added nothing, but earned interest at 7.5% annually. After 25 years, Fred and Alice are getting married. Who has more money? How much does each have? Alice’s cousin Charlie thinks that Fred is a paranoid loser and that Alice is cheap. He used a combined strategy: he added $100 a year and earned 7.5% interest. How much did he have after 25 years? All three continued with their strategies in the hope of using the money to fund retirement. How much did each have at age 65?
   a. Generate accumulations for each person for 65 years.
   b. Plot both series.
   c. Answer the questions.
   d. Fit a linear function to Fred.
   e. Based on this experiment, which retirement savings strategy works better: (a) add money regularly or (b) start early?

B. Table 2 below contains a number of state statistics. Some are correct and some are made up.
   a. Through plots, correlations, and regression lines, discuss the relationship between the correct data and their imaginary counterparts.
   b. Compare the four NAEP scores and see if the mean NAEP score adequately represents all states.
   c. How would you characterize Gore and Bush states vis-à-vis their income and academic performance?
   d. Has this characterization changed for the 2004 election?
   e. And what about obesity (Table 3)? Include in your answer some discussion of fat blue states and thin red ones (i.e. states with large residuals).

Table 2.
Correct state data on income and academic accomplishment

                  Median    NAEP scores                      Mean   '00
State             income    Math-4  Rdg-4  Math-8  Rdg-8     NAEP   election   IQ    FakeIncome
Massachusetts     $50,587   242     228    287     273       257    Gore       111   24059
New Hampshire     $53,549   243     228    286     271       257    Bush       102   18834
Vermont           $41,929   242     226    286     271       256    Gore       102   20049
Minnesota         $54,931   242     223    291     268       256    Gore       113   26979
Connecticut       $53,325   241     228    284     267       255    Gore        99   18287
North Dakota      $36,717   238     222    287     270       254    Bush       111   26457
South Dakota      $38,755   237     222    285     270       254    Bush       100   18226
Montana           $33,900   236     223    286     270       254    Bush       100   18727
Wyoming           $40,499   241     222    284     267       253    Bush       102   20398
Iowa              $41,827   238     223    284     268       253    Gore       109   23534
New Jersey        $53,266   239     225    281     268       253    Gore       103   21451
Virginia          $49,974   239     223    282     268       253    Bush        99   18202
Kansas            $42,523   242     220    284     266       253    Bush       101   20253
Maine             $37,654   238     224    282     268       253    Gore        99   19508
Colorado          $49,617   235     224    283     268       252    Bush       104   21608
Wisconsin         $46,351   237     221    284     266       252    Gore       105   22974
Ohio              $43,332   238     222    282     267       252    Bush       107   20299
North Carolina    $38,432   242     221    281     262       252    Bush       106   21218
Nebraska          $43,566   236     221    282     266       251    Bush       101   21278
Washington        $44,252   238     221    281     264       251    Gore        92   15353
Indiana           $41,581   238     220    281     265       251    Bush       105   22934
Missouri          $43,955   235     222    279     267       251    Bush        92   16854
New York          $42,432   236     222    280     265       251    Gore        90   16558
Delaware          $50,878   236     224    277     265       250    Gore        90   16062
Utah              $48,537   235     219    281     264       250    Bush        89   17423
Oregon            $42,704   236     218    281     264       250    Gore       100   20629
Idaho             $38,613   235     218    280     264       249    Bush        96   19376
Pennsylvania      $43,577   236     219    279     264       249    Gore        99   20124
Michigan          $45,335   236     219    276     264       249    Gore        99   18624
Illinois          $45,906   233     216    277     266       248    Gore        93   17667
Maryland          $55,912   233     219    278     262       248    Gore        95   19084
Kentucky          $37,893   229     219    274     266       247    Bush        94   18043
Texas             $40,659   237     215    277     259       247    Bush        98   18835
South Carolina    $38,460   236     215    277     258       246    Bush        87   15325
Florida           $38,533   234     218    271     257       245    Bush        87   16067
West Virginia     $30,072   231     219    271     260       245    Bush        92   16534
Alaska            $55,412   233     212    279     256       245    Bush        92   17892
Rhode Island      $44,311   230     216    272     261       245    Gore        89   15989
Oklahoma          $35,500   229     214    272     262       244    Bush        98   19397
Georgia           $43,316   230     214    270     258       243    Bush        93   15065
Arkansas          $32,423   229     214    266     258       242    Bush        98   21603
Tennessee         $36,329   228     212    268     258       241    Bush        90   16198
Arizona           $41,554   229     209    271     255       241    Bush        92   18130
Nevada            $46,289   228     207    268     252       239    Bush        92   15439
Hawaii            $49,775   227     208    266     251       238    Gore        94   17341
California        $48,113   227     206    267     251       238    Gore        94   17119
Louisiana         $33,312   226     205    266     253       238    Bush        99   20266
Alabama           $36,771   223     207    262     253       236    Bush        90   15712
Mississippi       $32,447   223     205    261     255       236    Bush        90   16220
New Mexico        $35,251   223     203    263     252       235    Gore        85   14088

NAEP data were gathered in February 2003.

Table 3

State             % Obese   Voted for
Hawaii            17        Kerry
Colorado          17        Bush
Connecticut       18        Kerry
Massachusetts     18        Kerry
New Hampshire     18        Kerry
Utah              18        Bush
California        19        Kerry
Maryland          19        Kerry
New Jersey        19        Kerry
Rhode Island      19        Kerry
Vermont           19        Kerry
Florida           19        Bush
Montana           19        Bush
Oregon            20        Kerry
Arizona           20        Bush
Idaho             20        Bush
New Mexico        20        Bush
Wyoming           20        Bush
Maine             21        Kerry
New York          21        Kerry
Washington        21        Kerry
D.C.              21        Kerry
South Dakota      21        Bush
Delaware          22        Kerry
Illinois          22        Kerry
Minnesota         22        Kerry
Wisconsin         22        Kerry
Nevada            22        Bush
Alaska            23        Bush
Iowa              23        Bush
Kansas            23        Bush
Missouri          23        Bush
Nebraska          23        Bush
North Dakota      23        Bush
Ohio              23        Bush
Oklahoma          23        Bush
Pennsylvania      24        Kerry
Arkansas          24        Bush
Georgia           24        Bush
Indiana           24        Bush
North Carolina    24        Bush
Virginia          24        Bush
Michigan          25        Kerry
Kentucky          25        Bush
Tennessee         25        Bush
Alabama           26        Bush
Louisiana         26        Bush
South Carolina    26        Bush
Texas             26        Bush
Mississippi       27        Bush
West Virginia     28        Bush

Source: Obesity data from the NY Times, Feb. 1, 2004, page 12; Centers for Disease Control & Prevention.

5. A. What is the pricing structure of convertibles? How would you answer someone who asked “how much does a convertible cost?
Do the costs of convertibles fall into specific groups?” A transformation is most useful in revealing the underlying price structure. Include informative displays and a narrative explaining both what you did and what you found.

Car                                     Price
Acura NSX-T                             $88,725
Aston Martin DB7 Volante                $136,300
Audi Cabrio                             $35,100
Bentley Azure                           $329,400
BMW 318i                                $33,720
BMW 328i                                $41,960
BMW Z3 1.9                              $29,995
BMW Z3 2.8                              $36,470
Chevrolet Camaro                        $22,295
Chevrolet Camaro RS                     $23,695
Chevrolet Camaro Z28                    $26,045
Chevrolet Cavalier LS                   $18,435
Chevrolet Corvette convertible          $46,000
Chrysler Sebring JX                     $20,685
Chrysler Sebring JXi                    $25,295
Dodge Viper RT/10                       $66,700
Ferrari F355 Spider                     $137,075
Ferrari F50                             $487,000
Ford Mustang                            $21,280
Ford Mustang Cobra                      $28,660
Ford Mustang GT                         $24,510
Honda del Sol                           $15,475
Jaguar XK8                              $70,480
Lamborghini Diablo Roadster VT          $275,100
Mazda Miata M-Edition                   $24,935
Mazda MX-5 Miata                        $19,575
Mercedes-Benz SL320                     $80,195
Mercedes-Benz SL500                     $90,495
Mercedes-Benz SL600                     $123,795
Mercedes-Benz SLK230                    $40,295
Mitsubishi Eclipse Spyder GS            $20,360
Mitsubishi Eclipse Spyder GS-T Turbo    $26,200
Pontiac Firebird                        $23,609
Pontiac Firebird Formula                $27,049
Pontiac Firebird Trans Am               $28,969
Pontiac Sunfire SE                      $19,399
Porsche 911 Cabriolet                   $73,765
Porsche 911 Carrera 4 Cabriolet         $79,115
Porsche Boxster                         $40,745
Saab 900 SE Talladega Turbo             $42,520
Saab 900 SE Turbo                       $41,995
Saab 900 SE V6                          $43,495
Saab 900S                               $36,195
Toyota Celica GT                        $24,858
Toyota Paseo                            $17,188
Volkswagen Cabrio                       $18,425
Volkswagen Cabrio Highline              $22,175

Source: The New York Times, 8-Jun-97, Section 11, page 1

B. The table below lists life insurance premiums. Find the underlying policy that Jackson National applied in setting rates for the four groups shown. HINT: plotting the rates will help you uncover a sensible transformation, after which some sort of decomposition may be helpful. Accompany your result with a descriptive narrative.

Jackson National's 10 Year Level-term Policy
Monthly Life Insurance Premiums for $100,000

        Male                   Female
Age     NonSmoker   Smoker     NonSmoker   Smoker
30       12.34       22.34     10.85        17.71
31       12.51       23.23     11.03        17.89
32       12.69       24.21     11.29        18.07
33       12.78       25.19     11.46        18.25
34       13.04       26.26     11.55        18.42
35       13.21       27.41     11.81        18.60
36       13.74       29.01     12.16        19.49
37       14.35       30.71     12.51        20.47
38       14.96       32.57     12.95        21.54
39       15.58       34.53     13.39        22.61
40       16.28       36.67     13.91        23.85
41       17.15       39.25     14.44        25.10
42       17.94       42.10     15.05        26.43
43       18.81       45.12     15.66        27.86
44       19.78       48.51     16.28        29.37
45       20.83       52.07     17.06        30.97
46       22.14       55.18     17.85        32.66
47       23.63       58.56     18.73        34.44
48       25.20       62.21     19.60        36.40
49       27.04       66.04     20.65        38.45
50       28.79       70.13     21.70        40.67
51       30.63       74.67     22.75        43.25
52       32.64       79.57     23.89        46.01
53       34.83       84.73     25.11        49.04
54       37.01       90.34     26.43        52.24
55       39.55       96.30     27.83        55.71
56       42.53      103.06     29.14        58.65
57       45.76      110.27     30.71        61.68
58       49.35      118.01     32.29        64.97
59       53.20      126.38     34.13        68.35
60       57.31      135.37     35.88        72.00
61       63.35      150.77     39.29        79.30
62       70.18      168.03     43.23        87.40
63       77.61      187.35     47.51        96.39
64       85.93      208.97     52.41       106.36
65       95.38      233.18     57.75       117.48
66      106.49      260.59     63.18       130.56
67      119.26      291.30     69.30       145.25
68      133.70      325.74     75.95       161.71
69      149.89      364.37     83.30       180.05
70      168.09      407.62     91.44       200.52

6. Fit a linear model to an average male’s growth and compare its predictions with the growth of Robert Wadlow. What characterizes Wadlow’s deviance (Slope? Intercept? Both?). Would a logistic function fit the data better? An advanced project might be to fit the sum of two logistics; try it only if you feel adventuresome.

        Average Male   Robert Wadlow
Age     HT (in)        HT (in)
1       30.1
2       34.1
3       37.9
4       41.4
5       44.6
6       47.4
7       49.9
7.5     50.9
8       51.9            72.0
9       53.7            74.5
10      55.5            77.0
11      57.5            79.0
12      60.4            82.5
13      63.8            86.0     60.0 .
13.5    65.4
14      66.8            89.5
14.5    67.8
15      68.6            92.0
15.5    69.2
16      69.6            94.5
16.5    69.9
17      70.1            96.5
17.5    70.2
18      70.3            99.5
18.5    70.4
19      70.5           101.5
20                     103.5
21                     104.5
22                     107.0

7. A.
Decompose the table below using iterative median polish and display the final result in a compelling tabular format. Then display the result graphically. Accompany your result with a verbal description of what you have found.

Infant mortality rates in the United States, all races, 1964-1966
(Entries are numbers of deaths per 1000 live births)

                  Education of father
Region            <8      9 to 11   12      13-15   >16
Northeast         25.3    25.3      18.2    18.3    16.3
North Central     32.1    29.0      18.8    24.3    19.0
South             38.8    31.0      19.3    15.7    16.8
West              25.4    21.1      20.3    24.0    17.5

B. Reanalyze the data in (A) above using means. Compare the results of the two analyses.

C. Find a table of reasonable size (e.g. at least 5 x 10) in a scientific journal of your choice (e.g. Journal of the American Medical Association, Science, Nature, Psychological Bulletin, etc.) and:
   (i) Revise it according to the rules in Reference 20 (or Ref. 14, chapter 10).
   (ii) Describe what you found that was not obvious initially.
Be sure to include the initial table, the revision, and details about where the table came from and what inferences the original authors were drawing from it.

8. A. In the relatively recent past, a newspaper article reported that circumcision among men helped to prevent cervical cancer among women.
   a. Describe what sorts of data were likely to have been used to derive this causal conclusion.
   b. What would be the ideal data-gathering experiment to allow such an inference?
   c. How close is (a) to (b)?

B. Schools sometimes advise parents that their child’s academic future would be rosier if she/he repeated kindergarten.
   a. What sorts of prior evidence do you think the teacher was using to justify such a recommendation?
   b. What would be the ideal data-gathering experiment to allow such an inference?
   c. How close is (a) to (b)?

9. M&M (12.38 in 7th edition): Do poets die young? Parts a, b, c.

10. Cereals were analyzed by their protein content. It was also noted that different kinds of cereal were placed on different shelves. The mean and standard deviation of protein content are shown in the table below by shelf position, as are the results of an analysis of variance and box plots of the results.

Analysis of Variance

Source    Sum of Squares   DF   Mean Square   F-ratio   P-value
Shelf     12.4              2   6.2           5.8       0.004
Error     78.7             74   1.1
Total     91.1             76

Means and Std. Deviations

Shelf Level   n    Mean   Standard Deviation
1             20   2.65   1.46
2             21   1.90   0.99
3             36   2.86   0.72

[Box plots of protein content (g, axis 0 to 6) by shelf level 1, 2, 3]

   a) What are the null and alternative hypotheses being tested in the ANOVA?
   b) What do the ANOVA results say about the null hypothesis? Be sure to report in terms of protein content and shelves.
   c) Can we conclude that cereals on shelf 2 have a lower mean protein content than cereals on shelf 3? Can we conclude that cereals on shelf 2 have a lower mean protein content than cereals on shelf 1? What can we conclude?
   d) To check for significant differences between the shelf means we can use a Bonferroni test. Do so and show all of the pairwise comparisons. What does it say about the questions in part c?

11. M&M 12.38 in 7th edition: this time answer question (f), doing all pairwise comparisons using the Bonferroni inequality with an overall α = 0.05.

12. University of Pennsylvania Professor Ted Hershberg uses the results obtained by North Carolina researcher William Sanders in his plans to revamp American public education. Specifically, he cites Sanders’s finding that teacher quality is the largest factor in students’ performance, and that big improvements in student performance are caused by their teacher. Sanders makes this inference by looking at the gain (value-added) in test scores for each student over the year that student was in a specific teacher’s class, and adjusts for all other factors by using them as covariates.
   a. What issues would concern you about this inference?
   b. How would you design a study that would allow such inferences?
   c. How close do you think the data-gathering scheme from Sanders’ observational study in Tennessee comes to the ideal case you have described in (b)?
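
A note on the mechanics of Problem 7A: iterative median polish repeatedly sweeps row medians and then column medians out of a two-way table, leaving an overall level, row effects, column effects, and residuals. The following is only a minimal sketch of that idea in Python. The function name median_polish, its parameters, and the small demonstration table are illustrative assumptions rather than part of the assignment, and the demonstration table is made up so that it is exactly additive.

```python
from statistics import median


def median_polish(table, max_iter=10, tol=1e-9):
    """Split table[i][j] into overall + row_eff[i] + col_eff[j] + resid[i][j].

    Hypothetical helper for illustration; not taken from the assignment.
    """
    resid = [[float(v) for v in row] for row in table]
    n_rows, n_cols = len(resid), len(resid[0])
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols

    for _ in range(max_iter):
        # Sweep each row's median out of the residuals into the row effects.
        row_med = [median(row) for row in resid]
        for i in range(n_rows):
            row_eff[i] += row_med[i]
            for j in range(n_cols):
                resid[i][j] -= row_med[i]
        # Move the median of the column effects into the overall level.
        shift = median(col_eff)
        overall += shift
        col_eff = [c - shift for c in col_eff]

        # Sweep each column's median out of the residuals into the column effects.
        col_med = [median(resid[i][j] for i in range(n_rows))
                   for j in range(n_cols)]
        for j in range(n_cols):
            col_eff[j] += col_med[j]
            for i in range(n_rows):
                resid[i][j] -= col_med[j]
        # Move the median of the row effects into the overall level.
        shift = median(row_eff)
        overall += shift
        row_eff = [r - shift for r in row_eff]

        # Stop once a full pass no longer changes anything noticeably.
        if max(abs(m) for m in row_med + col_med) < tol:
            break

    return overall, row_eff, col_eff, resid


if __name__ == "__main__":
    # Made-up additive table (not the infant mortality data).
    demo = [[6, 9, 13],
            [7, 10, 14],
            [9, 12, 16]]
    overall, row_eff, col_eff, resid = median_polish(demo)
    print("overall:", overall)
    print("row effects:", row_eff)
    print("column effects:", col_eff)
    print("residuals:", resid)
```

At every step the four pieces still add back to the original entry, so the fit can be checked by reconstruction. Replacing median with a mean in both sweeps gives the means-based analysis that Problem 7B asks you to compare against.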
