Statistics 101 - Wharton Statistics Department

advertisement
Statistics 101
Final Exam
December 20, 2004
Notes:
1. The exam is open book and notes. Calculators are permitted, but not computers.
2. Please include all of your work in the blue book.
3. Please indicate the null and alternative hypotheses when appropriate.
The questions on this exam are based on the operations of Wharton’s undergraduate placement office.
The data are all simulated to protect confidentiality.
1. (21 pts) Wharton School students who graduated in the class of 2005 were surveyed to
obtain employment information. There were 625 graduates, but only 400 returned the
survey. Sixty six respondents did not have a job at the time of the survey. The students
without jobs were eliminated from the analysis of salary data.
Summary data in units of thousands of dollars: Average salary=62.6; standard
deviation=7.4; n=334.
A. Are the salaries sufficiently high to make the claim that the mean salary µ exceeds 60?
Use =.01
B. i) If the true mean salary µ was really 61, what is the probability that the test in part A
would incorrectly retain the null hypothesis? Assume that σ=7.4
ii) What sample size of salaries would be required so that the probability of incorrectly
retaining the null hypothesis is .05 when the true mean is 61? Assume that σ=7.4
and =.01
C. One of the concerns of the study is that 225 graduates did not respond. Twenty of the
non-respondents were called back and their salaries were recorded. These twenty nonrespondents were matched (based on a variety of characteristics including major and
location of job) to twenty from among the original 334 respondents.
Summary data of the differences between the salaries of the original respondents and
the respondents who provided salaries after they were called back: Average=3.25;
standard deviation=7.15; n=20.
Are these data sufficient in concluding that the mean salary of initial respondents is
different from the mean salary of those who did not respond initially? Use =.05
2. (21 pts) Data were collected on the number of job offers that each of the initial 400
respondents received (See Figure 2a).
Figure 2a Distribution of number of job offers
Distributions
J obs
11
. 01
10
. 05 . 10
. 25
. 50
. 75
. 90 . 95
. 99
9
8
7
6
5
4
3
2
1
0
-1
-3
-2
-1
0
1
2
3
Normal Quantile Plot
Quantiles
100.0% maximum
99.5%
97.5%
90.0%
75.0%
quartile
50.0%
median
25.0%
quartile
10.0%
2.5%
0.5%
0.0%
minimum
10.000
10.000
9.975
5.000
4.000
2.000
1.000
0.000
0.000
0.000
0.000
Mome nts
Mean
Std Dev
Std Err Mean
upper 95% Mean
lower 95% Mean
N
2.77
2.1078918
0.1053946
2.9771981
2.5628019
400
A. i) If you were going to approximate the above data by a normal distribution what
normal distribution would you use?
ii) Use appropriate output to comment on whether a normal distribution is a reasonable
description for the above data?
B. i) What would be a 95% confidence interval for the mean assuming that =2.11?
ii) What would be the required sample size so that the 95% confidence interval in Bi)
has a margin of error of .2 jobs? Assume that =2.11
Data are collected that relate whether a person did not receive any job or at least one
job to choice of major (See Figure 2b).
Figure 2b. Relationship between major and whether graduate received any job offer
Contingency Analysis of Major By Job?
Freq: Frequency
Contingency Table
Major
Job?
Count Finance
Total %
Col %
Row %
No
Yes
Management Marketing
14
3.50
9.52
21.21
133
33.25
90.48
39.82
147
36.75
30
7.50
21.74
45.45
108
27.00
78.26
32.34
138
34.50
7
1.75
19.44
10.61
29
7.25
80.56
8.68
36
9.00
Other
15
3.75
18.99
22.73
64
16.00
81.01
19.16
79
19.75
66
16.50
334
83.50
400
Te sts
Source
Model
Error
C. Total
N
DF
3
394
397
400
-LogLike RSquare (U)
4.53140
0.0089
504.30675
508.83816
Test
ChiSquare Prob>ChiSq
Likelihood Ratio
9.063
0.0285
Pearson
8.523
0.0364
C. Find a 95% confidence interval for the proportion of students who did not receive any
job offer?
D. i) Describe in one or two sentences how the fraction of students without jobs vary with
respect to major.
ii) The P-value of .0364 (labeled Prob>ChiSq in the Pearson row) is the P-value
corresponding to the null hypothesis that there is no relationship between Major and
whether a student does not receive any job offer. Explain what the P-value means in
the context of this problem and how you would use this P-value in testing the null hypothesis.
3. (20 pts) Assume that the number of job offers has the following probability distribution:
x:
0 1 2 3 4 5 6
7
8
9
10
p(x): .15 .15 .25 .20 .11 .07 .02 .02 .01 .01 .01
A. What is µ for this probability distribution?
B. Determine whether the sample we observed plausibly comes from this distribution by
testing whether 66 respondents out of 400 without a job is sufficient in rejecting the null hypothesis
that the true proportion is .15 Use =.05
One thousand samples of size 400 are generated from the above distribution
(see Figure 3).
Figure 3 Distribution of averages of 400 from the above p(x)
Quantiles
Av erage
. 001 . 01 . 05. 10 . 25 . 50 . 75 . 90. 95 . 99 . 999
2.9
2.8
2.7
2.6
2.5
2.4
2.3
2.2
-4
-3
-2
-1
0
1
2
3
4
100.0% maximum
99.5%
97.5%
90.0%
75.0%
quartile
50.0%
median
25.0%
quartile
10.0%
2.5%
0.5%
0.0%
minimum
2.9200
2.8598
2.7749
2.7025
2.6375
2.5725
2.5025
2.4400
2.3850
2.3250
2.1575
Normal Quantile Plot
Mome nts
Mean
Std Dev
Std Err Mean
upper 95% Mean
lower 95% Mean
N
2.57099
0.1008198
0.0031882
2.5772463
2.5647337
1000
C. i) Explain whether or not the 1000 samples of averages is well-approximated by a
normal distribution by referring to appropriate output.
ii) The 95% confidence interval for the mean is obtained using the standard
formula for each sample. Of these 1000 intervals 961 included the true mean.
The null hypothesis is that the standard formula produces confidence intervals
that include the mean 95% of the time. Would the null hypothesis be rejected at =.05?
4. (12 points) A regression is run relating number of job offers to salary. In Figure 4a all 400
individuals are included using a salary of zero for those who did not receive any offer.
In Figure 4b the 66 respondents without a job offer are eliminated and only the remaining 334
individuals were used.
Figures 4a and 4b appear on the next page.
A. Explain whether the results of the two models are similar in terms of
i) predicting salaries from the number of job offers
ii) the accuracy of these predictions.
B. Explain which of the two models is more sensible by referring to appropriate output.
Figure 4a Regression relating salary to job
offers for all 400 respondents
Figure 4b Regression relating salary to job
offers for 334 respondents with jobs
Bivariate Fit of Salary By Job Offers
Bivariate Fit of Salary By Job Offers
90
90
80
80
70
50
Salary
Salary
60
40
70
60
30
20
50
10
0
40
-1
0
1
2
3
4
5
6
7
8
9 10 11
1
2
3
4
Job Off ers
Linear Fit
6
7
8
9
10
Linear Fit
Linear Fit
Linear Fit
Salary = 29.885839 + 9.4377944 Job Of f ers
Salary = 56.727649 + 2.0693021 Job Of f ers
Sum mary of Fit
Sum mary of Fit
RSquare
RSquare Adj
Root Mean Square Error
Mean of Response
Observ ations (or Sum Wgts)
0.458412
0.457051
17.85332
52.27701
400
RSquare
RSquare Adj
Root Mean Square Error
Mean of Response
Observ ations (or Sum Wgts)
Analys is of Variance
Source
Model
Error
C. Total
DF Sum of Squares Mean Square
F Ratio
1
107376.03
107376 336.8756
398
126858.89
319 Prob > F
399
234234.92
<.0001
Term
Intercept
Job Off ers
0.178728
0.176254
6.713941
62.60719
334
Analys is of Variance
Source
Model
Error
C. Total
Param e te r Estim ate s
DF Sum of Squares Mean Square
1
3256.854
3256.85
332
14965.564
45.08
333
18222.418
F Ratio
72.2509
Prob > F
<.0001
Param e te r Estim ate s
Estimate Std Error
29.885839 1.511665
9.4377944 0.514204
t Ratio Prob>|t|
19.77 <.0001
18.35 <.0001
Term
Intercept
Job Off ers
30
20
10
10
Residual
Residual
5
Job Off ers
-10
-30
Estimate Std Error
56.727649 0.783211
2.0693021 0.243446
t Ratio Prob>| t|
72.43 <.0001
8.50 <.0001
0
-10
-20
-50
-1
0
1
2
3
4
5
6
Job Off ers
7
8
9
10
11
1
2
3
4
5
6
Job Off ers
7
8
9
10
5. (18 points) Investment banking companies often ask probability questions of applicants. Parts A
and B are two such questions.
A. A recent Wall Street Journal article indicated how physicists from MIT were able to beat the
roulette wheel by 6% by tracking the movement of the ball and entering the information into a
computer. They were subsequently banned from playing. Specifically, assume that the expected profit
on each roll is $60 with a standard deviation of $1,000
If they play for 10,000 rolls (it is correct to assume that the rolls are independent), what is the
probability that they make over $500,000?
B. Another favorite question involves two fair coins (i.e., P(Head)=.5) and one coin with a probability
of a head of .75. You flip each of the coins in any order. You want to maximize the chance of getting
at least two heads in a row. Is it best to flip the unfair coin first, second or third or does it matter?
Explain your answer.
C. A manager of the company wants to determine whether getting the probability question correct has
any relation to whether a person is going to be a successful investor as determined by being in the
“Top 10% Club”. Of the 63 employees who got the probability question correct, 10 of them are in the
“Top 10% Club”. Is this sufficient in showing that those who get the probability question correct have
greater than a .1 chance of being in the “Top 10% Club”? Use =.05
6. (8 points) For each of the following questions please state whether the value with the new
information would be
a) Lower
b) The same
c) Higher
d) Cannot say without further information
than the corresponding value assuming the information that was given originally.
a) If the sample size for the test in Question 1 part a. were doubled and the sample average and
standard deviation remained the same, what would happen to the P-value?
b) If the assumed value of the mean in Question 1 part b. was changed from 61 to 62
what would happen to the probability of making an error?
c) In the regression of salary and number of job offers (without the zero job offer respondents) it was
determined that 30 of the respondents with one job offer accepted the job that they received from
summer employment between their Junior and Senior year. Since these jobs are prestigious, the
salaries of these students were relatively high. If these 30 respondents were removed from the analysis
what effect would this have on the slope?
d) If the analysis as in Figure 4b were restricted to only Finance majors, what effect would this have
on the R-squared value?
Download