Assignments - Wharton Statistics Department

Assignments:
1.
A. In order to discover the average number of children, Fred, a young college
professor at a large Midwestern university, conducted a survey in which a random
sample of 1,000 students in Psychology 101 were asked how many children
(including themselves) were in their family. The researcher added all the data
together and divided by 1,000 and uncovered the answer – 3.5. The most recent
census has found that the average number of children in a U.S. family is 2.5.
a. Why are these two numbers different?
b. Should Fred have designed his study differently to get the right answer?
c. How?
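One way to explore the gap between the two averages is to compare a per-family mean with the mean obtained when every child reports his or her own family size. The sketch below is only an illustration: the distribution in `family_counts` is hypothetical (it is not census data), and the function names are made up for this example.

```python
# Hypothetical distribution: number of children -> number of families.
# These counts are invented for illustration; they are not census figures.
family_counts = {1: 30, 2: 40, 3: 20, 4: 10}


def mean_per_family(counts):
    """Average number of children when each FAMILY is counted once."""
    families = sum(counts.values())
    children = sum(k * n for k, n in counts.items())
    return children / families


def mean_reported_by_children(counts):
    """Average family size when each CHILD is counted once.

    A family with k children is reported k times, so large families
    are over-represented in a sample of children.
    """
    children = sum(k * n for k, n in counts.items())
    reports = sum(k * k * n for k, n in counts.items())
    return reports / children


print(mean_per_family(family_counts))
print(mean_reported_by_children(family_counts))
```

Note that families with no children can never be reported by a student at all, which pushes a student-based average up further.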
B. Prenatal screening for Down syndrome for mothers over the age of 35 is usually
recommended. A non-invasive test is about 95% accurate. That is, if the fetus
has Down syndrome it will be detected 95% of the time. And if the fetus does
not have Down’s it will correctly say so 80% of the time. We know that Down’s is
not very common, affecting only about one in every 200 fetuses whose mothers
are over age 35.
a. What is the probability that if the test says the fetus has Down’s, that the test
is correct?
b. What is the probability that if the test says the fetus doesn’t have Down’s,
that the test is correct?
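Both questions are applications of Bayes' rule. The sketch below is one possible way to set up the calculation, taking the 95% detection rate as the sensitivity, the 80% figure as the specificity, and 1 in 200 as the prior probability; the function and argument names are assumptions made for this example.

```python
def screening_posteriors(prevalence, sensitivity, specificity):
    """Return (P(condition | positive test), P(no condition | negative test))."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * (1 - specificity)
    true_neg = (1 - prevalence) * specificity
    false_neg = prevalence * (1 - sensitivity)
    p_condition_given_pos = true_pos / (true_pos + false_pos)
    p_healthy_given_neg = true_neg / (true_neg + false_neg)
    return p_condition_given_pos, p_healthy_given_neg


# Figures taken from the problem statement.
pos_correct, neg_correct = screening_posteriors(
    prevalence=1 / 200, sensitivity=0.95, specificity=0.80
)
print(round(pos_correct, 4), round(neg_correct, 4))
```

Changing `prevalence` shows how strongly the answer to (a) depends on how rare the condition is.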
C. In a survey of hospitals it was found that those hospitals that had the highest
proportion of female births tended to also have the fewest births of any of the
hospitals in the survey. The Jones family, having already had a son, decided to
boost their chances of a daughter by going to one of the hospitals that, so far
this year, had the highest likelihood of female births.
a. Is this a sensible strategy?
b. If so, why? If not, why not?
c. How does this shed light on why the best performing mutual funds are usually
small?
d. Should this guide our investment strategy? If so, why? If not, why not?
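A quick way to think about parts (a) through (c) is to ask how much an observed proportion can vary by chance alone at different sample sizes. The sketch below assumes births are independent with a fixed probability of a girl; the default value 0.49 and the two hospital sizes are illustrative assumptions, not data from the survey.

```python
from math import sqrt


def se_proportion(n, p=0.49):
    """Standard error of an observed proportion based on n independent births.

    p is the assumed probability of a female birth (an illustrative value).
    """
    return sqrt(p * (1 - p) / n)


# Hypothetical hospital sizes, chosen only to contrast small with large.
for births in (50, 5000):
    print(births, round(se_proportion(births), 4))
```

The same arithmetic applies to any average computed from few observations, such as the return of a fund with few holdings or a short track record.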
2.
A. Find data displays in the mass media that illustrate at least two of the
most common errors. You can find one display with multiple flaws, or two
displays with one flaw apiece. Redo the displays correctly. Explain (i)
where you found the displays, (ii) what you believe the point of the
display was, (iii) what were the flaws, and (iv) what you did to fix them.
(e.g. see http://flowingdata.com/2009/11/26/fox-news-makes-the-best-pie-chart-ever/)
B. What were the key lessons in Arbuthnot’s (1710) paper? Compare
the explanations for the change in the number of christenings in 1704
with that in 1665-1666.
C. Find one wonderful display in the mass media. Explain (i) where you
found the display, (ii) what you believe the point of the display was,
(iii) why you think it is wonderful.
3. A. With your knowledge of improved methods of multivariate display,
develop a display of the following data set:
                                          Antibiotic
Bacteria                          Penicillin  Streptomycin  Neomycin  Gram staining
Aerobacter aerogenes                870          1            1.6      negative
Brucella abortus                      1          2            0.02     negative
Brucella anthracis                    0.001      0.01         0.007    positive
Diplococcus pneumoniae                0.005     11           10        positive
Escherichia coli                    100          0.4          0.1      negative
Klebsiella pneumoniae               850          1.2          1        negative
Mycobacterium tuberculosis          800          5            2        negative
Proteus vulgaris                      3          0.1          0.1      negative
Pseudomonas aeruginosa              850          2            0.4      negative
Salmonella (Eberthella) typhosa       1          0.4          0.008    negative
Salmonella schottmuelleri            10          0.8          0.09     negative
Staphylococcus albus                  0.007      0.1          0.001    positive
Staphylococcus aureus                 0.03       0.03         0.001    positive
Streptococcus fecalis                 1          1            0.1      positive
Streptococcus hemolyticus             0.001     14           10        positive
Streptococcus viridans                0.005     10           40        positive

The entries of the table are the minimum inhibitory concentration (MIC) in
ug/ml, a measure of the effectiveness of the antibiotic. The MIC represents
the concentration of antibiotic required to prevent growth in vitro. The
covariate “gram staining” describes the reaction of the bacteria to Gram
staining. Gram-positive bacteria are those that are stained dark blue or
violet; Gram-negative bacteria do not react that way.
B. Smoothing problem – One might think that if life expectancy is
great the murder rate cannot be. But although murder does not take
a huge toll on a population, perhaps it is an indicator of other
life-threatening processes going on in society.
(a) Plot life expectancy as a function of murder rate, then
(b) smooth life expectancy by adding the 53h smooth to the plot. What
have you learned?
(c) Make a separate plot of residuals from the smooth vs. murder rate.
What has this taught you?
(d) Add a straight-line fit to the plot. Does this help us to understand
things better? Or does it hide things that the smooth has told us?
Explain.
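For part (b), a 53h smooth is usually described as running medians of 5, then running medians of 3, then hanning (a weighted running mean with weights 1/4, 1/2, 1/4). The sketch below follows that description. The end-point handling is an assumption (the median window shrinks symmetrically near the ends, and hanning leaves the two end values unchanged); other conventions exist. The function names are made up for this example. The values must first be ordered by the x variable (murder rate) before smoothing.

```python
from statistics import median


def running_median(values, span):
    """Running median with an odd span; the window shrinks near the ends."""
    n = len(values)
    half = span // 2
    out = []
    for i in range(n):
        h = min(half, i, n - 1 - i)  # largest symmetric window that fits
        out.append(median(values[i - h:i + h + 1]))
    return out


def hanning(values):
    """Weighted running mean (1/4, 1/2, 1/4); end values are copied as is."""
    out = list(values)
    for i in range(1, len(values) - 1):
        out[i] = 0.25 * values[i - 1] + 0.5 * values[i] + 0.25 * values[i + 1]
    return out


def smooth_53h(values):
    """Medians of 5, then medians of 3, then hanning."""
    return hanning(running_median(running_median(values, 5), 3))


# Usage sketch with the first eight states of the table (murder, life expectancy).
murder = [15.1, 11.3, 7.8, 10.1, 10.3, 6.8, 3.1, 6.2]
life = [69.1, 69.3, 70.6, 70.7, 71.7, 72.1, 72.5, 70.1]
ordered = sorted(zip(murder, life))           # order by murder rate
smoothed = smooth_53h([y for _, y in ordered])
residuals = [y - s for (_, y), s in zip(ordered, smoothed)]
```

The `residuals` list is what part (c) asks to be plotted against murder rate.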
STATE NAME       LIFE EXPECT.  MURDER  HSGRAD  INCOME  ILLITERACY
Alabama              69.1       15.1    41.3    3624      2.1
Alaska               69.3       11.3    66.7    6315      1.5
Arizona              70.6        7.8    58.1    4530      1.8
Arkansas             70.7       10.1    39.9    3378      1.9
California           71.7       10.3    62.6    5114      1.1
Colorado             72.1        6.8    63.9    4884      0.7
Connecticut          72.5        3.1    56.0    5348      1.1
Delaware             70.1        6.2    54.6    4809      0.9
Florida              70.7       10.7    52.6    4815      1.3
Georgia              68.5       13.9    40.6    4091      2.0
Hawaii               73.6        6.2    61.9    4963      1.9
Idaho                71.9        5.3    59.5    4119      0.6
Illinois             70.1       10.3    52.6    5107      0.9
Indiana              70.9        7.1    52.9    4458      0.7
Iowa                 72.6        2.3    59.0    4628      0.5
Kansas               72.6        4.5    59.9    4669      0.6
Kentucky             70.1       10.6    38.5    3712      1.6
Louisiana            68.8       13.2    42.2    3545      2.8
Maine                70.4        2.7    54.7    3694      0.7
Maryland             70.2        8.5    52.3    5299      0.9
Massachusetts        71.8        3.3    58.5    4755      1.1
Michigan             70.6       11.1    52.8    4751      0.9
Minnesota            73.0        2.3    57.6    4675      0.6
Mississippi          68.1       12.5    41.0    3098      2.4
Missouri             70.7        9.3    48.8    4254      0.8
Montana              70.6        5.0    59.2    4347      0.6
Nebraska             72.6        2.9    59.3    4508      0.6
Nevada               69.0       11.5    65.2    5149      0.5
New Hampshire        71.2        3.3    57.6    4281      0.7
New Jersey           70.9        5.2    52.5    5237      1.1
New Mexico           70.3        9.7    55.2    3601      2.2
New York             70.6       10.9    52.7    4903      1.4
North Carolina       69.2       11.1    38.5    3875      1.8
North Dakota         72.8        1.4    50.3    5087      0.8
Ohio                 70.8        7.4    53.2    4561      0.8
Oklahoma             71.4        6.4    51.6    3983      1.1
Oregon               72.1        4.2    60.0    4660      0.6
Pennsylvania         70.4        6.1    50.2    4449      1.0
Rhode Island         71.9        2.4    46.4    4558      1.3
South Carolina       68.0       11.6    37.8    3635      2.3
South Dakota         72.1        1.7    53.3    4167      0.5
Tennessee            70.1       11.0    41.8    3821      1.7
Texas                70.9       12.2    47.4    4188      2.2
Utah                 72.9        4.5    67.3    4022      0.6
Vermont              71.6        5.5    57.1    3907      0.6
Virginia             70.1        9.5    47.8    4701      1.4
Washington           71.7        4.3    63.5    4864      0.6
West Virginia        69.5        6.7    41.6    3617      1.4
Wisconsin            72.5        3.0    54.5    4468      0.7
Wyoming              70.3        6.9    62.9    4566      0.6

4. A. Exact exponential growth – Fred and Alice were born the same year, and
each began life with $500. Fred added $100 each year but kept his treasure
under his mattress, so he earned no interest. Alice added nothing, but earned
interest at 7.5% annually. After 25 years, Fred and Alice are getting
married. Who has more money? How much does each have? Alice’s cousin
Charlie thinks that Fred is a paranoid loser and that Alice is cheap. He used
a combined strategy: he added $100 a year and obtained 7.5% interest. How
much did he have after 25 years? All three continued with their strategies
in the hopes of using the money to fund retirement. How much did each have
at age 65?
a. Generate accumulations for each person for 65 years.
b. Plot both series.
c. Answer the questions.
d. Fit a linear function to Fred.
e. Based on this experiment, which retirement savings strategy works
better: (a) add money regularly or (b) start early?
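The accumulations in part (a) can be generated with a single recurrence. The sketch below is one possible setup and rests on assumptions the problem does not spell out: interest is compounded once a year on the previous balance, and each $100 deposit is added at the end of the year. The function name `accumulate` is made up for this example.

```python
def accumulate(start, deposit, rate, years):
    """Yearly balances: balances[t] is the amount held after t years.

    Assumes annual compounding on last year's balance, with the deposit
    added at the end of each year (a timing assumption).
    """
    balances = [float(start)]
    for _ in range(years):
        balances.append(balances[-1] * (1 + rate) + deposit)
    return balances


fred = accumulate(500, 100, 0.0, 65)       # deposits, no interest
alice = accumulate(500, 0, 0.075, 65)      # interest, no deposits
charlie = accumulate(500, 100, 0.075, 65)  # both

for age in (25, 65):
    print(age, round(fred[age], 2), round(alice[age], 2), round(charlie[age], 2))
```

Plotting the three lists against age, on both a linear and a logarithmic vertical scale, is a natural next step for part (b).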
B. In Table 2 below are a number of state statistics. Some are correct
and some are made up.
a. Through plots, correlations and regression lines discuss the
relationship between the correct data and their imaginary
counterparts.
b. Compare the four NAEP scores and see if the mean NAEP
score adequately represents all states.
c. How would you characterize Gore and Bush states vis-à-vis
their income and academic performance?
d. Has this characterization changed for the 2004 election?
e. And what about obesity (Table 3)? Include in your answer
some discussion of fat blue states and thin red ones (i.e.
states with large residuals).
Table 2. Correct state data on income and academic accomplishment
                  Median             NAEP Scores             mean   '00
State             Income    Math-4  Rdg-4  Math-8  Rdg-8    NAEP   election   IQ   FakeIncome
Massachusetts     $50,587    242     228    287     273     257    Gore      111    24059
New Hampshire     $53,549    243     228    286     271     257    Bush      102    18834
Vermont           $41,929    242     226    286     271     256    Gore      102    20049
Minnesota         $54,931    242     223    291     268     256    Gore      113    26979
Connecticut       $53,325    241     228    284     267     255    Gore       99    18287
North Dakota      $36,717    238     222    287     270     254    Bush      111    26457
South Dakota      $38,755    237     222    285     270     254    Bush      100    18226
Montana           $33,900    236     223    286     270     254    Bush      100    18727
Wyoming           $40,499    241     222    284     267     253    Bush      102    20398
Iowa              $41,827    238     223    284     268     253    Gore      109    23534
New Jersey        $53,266    239     225    281     268     253    Gore      103    21451
Virginia          $49,974    239     223    282     268     253    Bush       99    18202
Kansas            $42,523    242     220    284     266     253    Bush      101    20253
Maine             $37,654    238     224    282     268     253    Gore       99    19508
Colorado          $49,617    235     224    283     268     252    Bush      104    21608
Wisconsin         $46,351    237     221    284     266     252    Gore      105    22974
Ohio              $43,332    238     222    282     267     252    Bush      107    20299
North Carolina    $38,432    242     221    281     262     252    Bush      106    21218
Nebraska          $43,566    236     221    282     266     251    Bush      101    21278
Washington        $44,252    238     221    281     264     251    Gore       92    15353
Indiana           $41,581    238     220    281     265     251    Bush      105    22934
Missouri          $43,955    235     222    279     267     251    Bush       92    16854
New York          $42,432    236     222    280     265     251    Gore       90    16558
Delaware          $50,878    236     224    277     265     250    Gore       90    16062
Utah              $48,537    235     219    281     264     250    Bush       89    17423
Oregon            $42,704    236     218    281     264     250    Gore      100    20629
Idaho             $38,613    235     218    280     264     249    Bush       96    19376
Pennsylvania      $43,577    236     219    279     264     249    Gore       99    20124
Michigan          $45,335    236     219    276     264     249    Gore       99    18624
Illinois          $45,906    233     216    277     266     248    Gore       93    17667
Maryland          $55,912    233     219    278     262     248    Gore       95    19084
Kentucky          $37,893    229     219    274     266     247    Bush       94    18043
Texas             $40,659    237     215    277     259     247    Bush       98    18835
South Carolina    $38,460    236     215    277     258     246    Bush       87    15325
Florida           $38,533    234     218    271     257     245    Bush       87    16067
West Virginia     $30,072    231     219    271     260     245    Bush       92    16534
Alaska            $55,412    233     212    279     256     245    Bush       92    17892
Rhode Island      $44,311    230     216    272     261     245    Gore       89    15989
Oklahoma          $35,500    229     214    272     262     244    Bush       98    19397
Georgia           $43,316    230     214    270     258     243    Bush       93    15065
Arkansas          $32,423    229     214    266     258     242    Bush       98    21603
Tennessee         $36,329    228     212    268     258     241    Bush       90    16198
Arizona           $41,554    229     209    271     255     241    Bush       92    18130
Nevada            $46,289    228     207    268     252     239    Bush       92    15439
Hawaii            $49,775    227     208    266     251     238    Gore       94    17341
California        $48,113    227     206    267     251     238    Gore       94    17119
Louisiana         $33,312    226     205    266     253     238    Bush       99    20266
Alabama           $36,771    223     207    262     253     236    Bush       90    15712
Mississippi       $32,447    223     205    261     255     236    Bush       90    16220
New Mexico        $35,251    223     203    263     252     235    Gore       85    14088
NAEP data were gathered in February, 2003.
Table 3
State             % Obese   Voted For
Hawaii              17      Kerry
Colorado            17      Bush
Connecticut         18      Kerry
Massachusetts       18      Kerry
New Hampshire       18      Kerry
Utah                18      Bush
California          19      Kerry
Maryland            19      Kerry
New Jersey          19      Kerry
Rhode Island        19      Kerry
Vermont             19      Kerry
Florida             19      Bush
Montana             19      Bush
Oregon              20      Kerry
Arizona             20      Bush
Idaho               20      Bush
New Mexico          20      Bush
Wyoming             20      Bush
Maine               21      Kerry
New York            21      Kerry
Washington          21      Kerry
D.C.                21      Kerry
South Dakota        21      Bush
Delaware            22      Kerry
Illinois            22      Kerry
Minnesota           22      Kerry
Wisconsin           22      Kerry
Nevada              22      Bush
Alaska              23      Bush
Iowa                23      Bush
Kansas              23      Bush
Missouri            23      Bush
Nebraska            23      Bush
North Dakota        23      Bush
Ohio                23      Bush
Oklahoma            23      Bush
Pennsylvania        24      Kerry
Arkansas            24      Bush
Georgia             24      Bush
Indiana             24      Bush
North Carolina      24      Bush
Virginia            24      Bush
Michigan            25      Kerry
Kentucky            25      Bush
Tennessee           25      Bush
Alabama             26      Bush
Louisiana           26      Bush
South Carolina      26      Bush
Texas               26      Bush
Mississippi         27      Bush
West Virginia       28      Bush
Fat data from NY Times, Feb. 1, 2004, Page 12; Centers for Disease Control & Prevention
5. A. What is the pricing structure of convertibles? How would you answer
someone who asked “How much does a convertible cost?” Do the costs of
convertibles fall into specific groups? A transformation is most useful in
revealing the underlying price structure. Include informative
displays and a narrative explaining both what you did and what you found.
Car                                     Price
Acura NSX-T                             $88,725
Aston Martin DB7 Volante                $136,300
Audi Cabrio                             $35,100
Bentley Azure                           $329,400
BMW 318i                                $33,720
BMW 328i                                $41,960
BMW Z3 1.9                              $29,995
BMW Z3 2.8                              $36,470
Chevrolet Camaro                        $22,295
Chevrolet Camaro RS                     $23,695
Chevrolet Camaro Z28                    $26,045
Chevrolet Cavalier LS                   $18,435
Chevrolet Corvette convertible          $46,000
Chrysler Sebring JX                     $20,685
Chrysler Sebring JXi                    $25,295
Dodge Viper RT/10                       $66,700
Ferrari F355 Spider                     $137,075
Ferrari F50                             $487,000
Ford Mustang                            $21,280
Ford Mustang Cobra                      $28,660
Ford Mustang GT                         $24,510
Honda del Sol                           $15,475
Jaguar XK8                              $70,480
Lamborghini Diablo Roadster VT          $275,100
Mazda Miata M-Edition                   $24,935
Mazda MX-5 Miata                        $19,575
Mercedes-Benz SL320                     $80,195
Mercedes-Benz SL500                     $90,495
Mercedes-Benz SL600                     $123,795
Mercedes-Benz SLK230                    $40,295
Mitsubishi Eclipse Spyder GS            $20,360
Mitsubishi Eclipse Spyder GS-T Turbo    $26,200
Pontiac Firebird                        $23,609
Pontiac Firebird Formula                $27,049
Pontiac Firebird Trans Am               $28,969
Pontiac Sunfire SE                      $19,399
Porsche 911 Cabriolet                   $73,765
Porsche 911 Carrera 4 Cabriolet         $79,115
Porsche Boxster                         $40,745
Saab 900 SE Talladega Turbo             $42,520
Saab 900 SE Turbo                       $41,995
Saab 900 SE V6                          $43,495
Saab 900S                               $36,195
Toyota Celica GT                        $24,858
Toyota Paseo                            $17,188
Volkswagen Cabrio                       $18,425
Volkswagen Cabrio Highline              $22,175
Source: The New York Times, 8-Jun-97, Section 11, page 1
B. In the table below are life insurance premiums. Find the underlying policy
that Jackson National applied in setting rates for the four groups shown.
HINT: plotting rates will help you uncover a sensible transformation, after
which some sort of decomposition may be helpful. Accompany your result
with a descriptive narrative.
Jackson National's 10 Year Level-term Policy
Monthly Life Insurance Premiums for $100,000
              Male                  Female
Age    NonSmoker   Smoker    NonSmoker   Smoker
30       12.34      22.34      10.85      17.71
31       12.51      23.23      11.03      17.89
32       12.69      24.21      11.29      18.07
33       12.78      25.19      11.46      18.25
34       13.04      26.26      11.55      18.42
35       13.21      27.41      11.81      18.60
36       13.74      29.01      12.16      19.49
37       14.35      30.71      12.51      20.47
38       14.96      32.57      12.95      21.54
39       15.58      34.53      13.39      22.61
40       16.28      36.67      13.91      23.85
41       17.15      39.25      14.44      25.10
42       17.94      42.10      15.05      26.43
43       18.81      45.12      15.66      27.86
44       19.78      48.51      16.28      29.37
45       20.83      52.07      17.06      30.97
46       22.14      55.18      17.85      32.66
47       23.63      58.56      18.73      34.44
48       25.20      62.21      19.60      36.40
49       27.04      66.04      20.65      38.45
50       28.79      70.13      21.70      40.67
51       30.63      74.67      22.75      43.25
52       32.64      79.57      23.89      46.01
53       34.83      84.73      25.11      49.04
54       37.01      90.34      26.43      52.24
55       39.55      96.30      27.83      55.71
56       42.53     103.06      29.14      58.65
57       45.76     110.27      30.71      61.68
58       49.35     118.01      32.29      64.97
59       53.20     126.38      34.13      68.35
60       57.31     135.37      35.88      72.00
61       63.35     150.77      39.29      79.30
62       70.18     168.03      43.23      87.40
63       77.61     187.35      47.51      96.39
64       85.93     208.97      52.41     106.36
65       95.38     233.18      57.75     117.48
66      106.49     260.59      63.18     130.56
67      119.26     291.30      69.30     145.25
68      133.70     325.74      75.95     161.71
69      149.89     364.37      83.30     180.05
70      168.09     407.62      91.44     200.52
6. Fit a linear model to an average male’s growth and compare its predictions
with the growth of Robert Wadlow. What characterizes Wadlow’s deviance
(Slope? Intercept? Both?). Would a logistic function fit the data better? An
advanced project might be to fit the sum of two logistics; try it only if you
feel adventuresome.
         Average Male    Robt. Wadlow
Age      HT(in)          HT(in)
1        30.1
2        34.1
3        37.9
4        41.4
5        44.6
6        47.4
7        49.9
7.5      50.9
8        51.9            72.0
9        53.7            74.5
10       55.5            77.0
11       57.5            79.0
12       60.4            82.5
13       63.8            86.0
60.0
.
13.5     65.4
14       66.8            89.5
14.5     67.8
15       68.6            92.0
15.5     69.2
16       69.6            94.5
16.5     69.9
17       70.1            96.5
17.5     70.2
18       70.3            99.5
18.5     70.4
19       70.5            101.5
20                       103.5
21                       104.5
22                       107.0
7. A. Decompose the table below using iterative median polish and display the
final result in a compelling tabular format. Then display the result
graphically. Accompany your result with a verbal description of what you
have found.
Infant mortality rates in the United States, all races, 1964-1966
(Entries are numbers of deaths per 1000 live births)
                         Education of father
Region           <8     9 to 11    12     13-15    >16
Northeast       25.3     25.3     18.2    18.3    16.3
North Central   32.1     29.0     18.8    24.3    19.0
South           38.8     31.0     19.3    15.7    16.8
West            25.4     21.1     20.3    24.0    17.5
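Median polish repeatedly sweeps row medians and then column medians out of the table until little is left to remove, leaving an overall level, row effects, column effects, and residuals. The sketch below is one minimal version applied to the table above; the stopping rule (`max_iter`, `tol`) and all names are assumptions made for this example, and starting with columns instead of rows can give a slightly different fit.

```python
from statistics import median


def median_polish(table, max_iter=20, tol=1e-9):
    """Decompose table[i][j] = overall + row_eff[i] + col_eff[j] + resid[i][j]."""
    n_rows, n_cols = len(table), len(table[0])
    resid = [[float(v) for v in row] for row in table]
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(max_iter):
        moved = 0.0
        # Sweep the median out of each row.
        for i in range(n_rows):
            m = median(resid[i])
            row_eff[i] += m
            resid[i] = [v - m for v in resid[i]]
            moved += abs(m)
        m = median(col_eff)
        overall += m
        col_eff = [c - m for c in col_eff]
        # Sweep the median out of each column.
        for j in range(n_cols):
            m = median([resid[i][j] for i in range(n_rows)])
            col_eff[j] += m
            for i in range(n_rows):
                resid[i][j] -= m
            moved += abs(m)
        m = median(row_eff)
        overall += m
        row_eff = [r - m for r in row_eff]
        if moved < tol:  # nothing left to sweep out
            break
    return overall, row_eff, col_eff, resid


regions = ["Northeast", "North Central", "South", "West"]
mortality = [
    [25.3, 25.3, 18.2, 18.3, 16.3],
    [32.1, 29.0, 18.8, 24.3, 19.0],
    [38.8, 31.0, 19.3, 15.7, 16.8],
    [25.4, 21.1, 20.3, 24.0, 17.5],
]
overall, row_eff, col_eff, resid = median_polish(mortality)
```

Replacing `median` with a mean in the two sweeps gives the means-based analysis, which settles after a single pass.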
B. Reanalyze the data in (A) above using means. Compare the results
of the two analyses.
C. Find a table of reasonable size (e.g. at least 5 x 10) in a scientific
journal of your choice (e.g. Journal of the American Medical
Association, Science, Nature, Psychological Bulletin, etc.) and:
(i) Revise it according to the rules in Reference 20 (or Ref. 14,
chapter 10).
(ii) Describe what you found that was not obvious initially.
Be sure to include the initial table, the revision, and details about
where the table came from and what inferences the
original authors were making from the table.
8. A. In the relatively recent past there was a news article in the paper that
reported that circumcision among men helped to prevent cervical cancer
among women.
a. Describe what sorts of data were likely to have been used to derive
this causal conclusion.
b. What would be the ideal data gathering experiment to allow such an
inference?
c. How close is (a) to (b)?
B. Schools sometimes advise parents that their child’s academic future
would be rosier if she/he repeated kindergarten.
a. What sorts of prior evidence do you think the teacher was using to
justify such a recommendation?
b. What would be the ideal data gathering experiment to allow such an
inference?
c. How close is (a) to (b)?
9. M&M (12.38 in 7th edition) Do poets die young? Parts a, b, c,
10. Cereals were analyzed by their protein content. It was also noted that
different kinds of cereal were placed on different shelves. The mean and
standard deviation of protein content are shown in the table below by shelf
position as is the results of an analysis of variance and box plots of the
results.
Analysis of Variance
Source    Sum of Squares    DF    Mean Square    F-ratio    P-value
Shelf         12.4           2        6.2          5.8       0.004
Error         78.7          74        1.1
Total         91.1          76

Means and Std. Deviations
Shelf Level     n     Mean    Standard Deviation
1              20     2.65    1.46
2              21     1.90    0.99
3              36     2.86    0.72

[Box plots of protein (g), axis from 0 to 6, by shelf level (1, 2, 3)]
a) What are the null and alternative hypotheses being tested in the
ANOVA?
b) What do the ANOVA results say about the null hypothesis? Be
sure to report in terms of protein content and shelves.
c) Can we conclude that cereals on shelf 2 have a lower mean protein
content than cereals on shelf 3? Can we conclude that cereals on
shelf 2 have a lower mean protein content than cereals on shelf 1?
What can we conclude?
d) To check for significant differences between the shelf means we
can use a Bonferroni test. Do so and show all of the pairwise
comparisons. What does it say about the questions in part c?
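The pairwise comparisons in part (d) can be set up from the summary tables alone. The sketch below uses the pooled error mean square (78.7/74) and divides the overall significance level by the number of pairs; the overall level of 0.05 is an assumption, since this problem does not state one. The critical value comes from the standard normal distribution as a stand-in for the t distribution with 74 degrees of freedom, which is also an assumption of this example: the exact t critical value is slightly larger and should be taken from a t table or statistical software.

```python
from itertools import combinations
from math import sqrt
from statistics import NormalDist

# Shelf level -> (n, mean protein), from the table of means.
shelves = {1: (20, 2.65), 2: (21, 1.90), 3: (36, 2.86)}
mse = 78.7 / 74                     # error sum of squares / error DF

pairs = list(combinations(sorted(shelves), 2))
alpha_each = 0.05 / len(pairs)      # Bonferroni share; overall 0.05 is assumed
# Normal approximation to the two-sided t critical value (assumption).
critical = NormalDist().inv_cdf(1 - alpha_each / 2)

t_stats = {}
for a, b in pairs:
    (n_a, mean_a), (n_b, mean_b) = shelves[a], shelves[b]
    se = sqrt(mse * (1 / n_a + 1 / n_b))
    t_stats[(a, b)] = (mean_a - mean_b) / se

for pair, t in t_stats.items():
    print(pair, round(t, 2), "exceeds critical value" if abs(t) > critical else "does not")
```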
11. M&M 12.38 in 7th edition: this time answer question (f), doing all pairwise
comparisons using the Bonferroni inequality with an overall α = 0.05.
12. University of Pennsylvania Professor Ted Hershberg uses the results
obtained by North Carolina researcher William Sanders in his plans to
revamp American public education. Specifically, he cites Sanders’s finding
that teacher quality is the largest factor in students’ performance, and
that big improvements in student performance are caused by their teacher.
Sanders makes this inference by looking at the gain (value-added) in test
scores for each student over the year that student was in a specific
teacher’s class, adjusting for all other factors by using them as covariates.
a. What issues would concern you about this inference?
b. How would you design a study that would allow such inferences?
c. How close do you think the data-gathering scheme from Sanders’
observational study in Tennessee comes to the ideal case you have
described in (b)?