Statistics 101 Final Exam December 20, 2004 Notes: 1. The exam is open book and notes. Calculators are permitted, but not computers. 2. Please include all of your work in the blue book. 3. Please indicate the null and alternative hypotheses when appropriate. The questions on this exam are based on the operations of Wharton’s undergraduate placement office. The data are all simulated to protect confidentiality. 1. (21 pts) Wharton School students who graduated in the class of 2005 were surveyed to obtain employment information. There were 625 graduates, but only 400 returned the survey. Sixty six respondents did not have a job at the time of the survey. The students without jobs were eliminated from the analysis of salary data. Summary data in units of thousands of dollars: Average salary=62.6; standard deviation=7.4; n=334. A. Are the salaries sufficiently high to make the claim that the mean salary µ exceeds 60? Use =.01 B. i) If the true mean salary µ was really 61, what is the probability that the test in part A would incorrectly retain the null hypothesis? Assume that σ=7.4 ii) What sample size of salaries would be required so that the probability of incorrectly retaining the null hypothesis is .05 when the true mean is 61? Assume that σ=7.4 and =.01 C. One of the concerns of the study is that 225 graduates did not respond. Twenty of the non-respondents were called back and their salaries were recorded. These twenty nonrespondents were matched (based on a variety of characteristics including major and location of job) to twenty from among the original 334 respondents. Summary data of the differences between the salaries of the original respondents and the respondents who provided salaries after they were called back: Average=3.25; standard deviation=7.15; n=20. Are these data sufficient in concluding that the mean salary of initial respondents is different from the mean salary of those who did not respond initially? Use =.05 2. (21 pts) Data were collected on the number of job offers that each of the initial 400 respondents received (See Figure 2a). Figure 2a Distribution of number of job offers Distributions J obs 11 . 01 10 . 05 . 10 . 25 . 50 . 75 . 90 . 95 . 99 9 8 7 6 5 4 3 2 1 0 -1 -3 -2 -1 0 1 2 3 Normal Quantile Plot Quantiles 100.0% maximum 99.5% 97.5% 90.0% 75.0% quartile 50.0% median 25.0% quartile 10.0% 2.5% 0.5% 0.0% minimum 10.000 10.000 9.975 5.000 4.000 2.000 1.000 0.000 0.000 0.000 0.000 Mome nts Mean Std Dev Std Err Mean upper 95% Mean lower 95% Mean N 2.77 2.1078918 0.1053946 2.9771981 2.5628019 400 A. i) If you were going to approximate the above data by a normal distribution what normal distribution would you use? ii) Use appropriate output to comment on whether a normal distribution is a reasonable description for the above data? B. i) What would be a 95% confidence interval for the mean assuming that =2.11? ii) What would be the required sample size so that the 95% confidence interval in Bi) has a margin of error of .2 jobs? Assume that =2.11 Data are collected that relate whether a person did not receive any job or at least one job to choice of major (See Figure 2b). Figure 2b. Relationship between major and whether graduate received any job offer Contingency Analysis of Major By Job? Freq: Frequency Contingency Table Major Job? Count Finance Total % Col % Row % No Yes Management Marketing 14 3.50 9.52 21.21 133 33.25 90.48 39.82 147 36.75 30 7.50 21.74 45.45 108 27.00 78.26 32.34 138 34.50 7 1.75 19.44 10.61 29 7.25 80.56 8.68 36 9.00 Other 15 3.75 18.99 22.73 64 16.00 81.01 19.16 79 19.75 66 16.50 334 83.50 400 Te sts Source Model Error C. Total N DF 3 394 397 400 -LogLike RSquare (U) 4.53140 0.0089 504.30675 508.83816 Test ChiSquare Prob>ChiSq Likelihood Ratio 9.063 0.0285 Pearson 8.523 0.0364 C. Find a 95% confidence interval for the proportion of students who did not receive any job offer? D. i) Describe in one or two sentences how the fraction of students without jobs vary with respect to major. ii) The P-value of .0364 (labeled Prob>ChiSq in the Pearson row) is the P-value corresponding to the null hypothesis that there is no relationship between Major and whether a student does not receive any job offer. Explain what the P-value means in the context of this problem and how you would use this P-value in testing the null hypothesis. 3. (20 pts) Assume that the number of job offers has the following probability distribution: x: 0 1 2 3 4 5 6 7 8 9 10 p(x): .15 .15 .25 .20 .11 .07 .02 .02 .01 .01 .01 A. What is µ for this probability distribution? B. Determine whether the sample we observed plausibly comes from this distribution by testing whether 66 respondents out of 400 without a job is sufficient in rejecting the null hypothesis that the true proportion is .15 Use =.05 One thousand samples of size 400 are generated from the above distribution (see Figure 3). Figure 3 Distribution of averages of 400 from the above p(x) Quantiles Av erage . 001 . 01 . 05. 10 . 25 . 50 . 75 . 90. 95 . 99 . 999 2.9 2.8 2.7 2.6 2.5 2.4 2.3 2.2 -4 -3 -2 -1 0 1 2 3 4 100.0% maximum 99.5% 97.5% 90.0% 75.0% quartile 50.0% median 25.0% quartile 10.0% 2.5% 0.5% 0.0% minimum 2.9200 2.8598 2.7749 2.7025 2.6375 2.5725 2.5025 2.4400 2.3850 2.3250 2.1575 Normal Quantile Plot Mome nts Mean Std Dev Std Err Mean upper 95% Mean lower 95% Mean N 2.57099 0.1008198 0.0031882 2.5772463 2.5647337 1000 C. i) Explain whether or not the 1000 samples of averages is well-approximated by a normal distribution by referring to appropriate output. ii) The 95% confidence interval for the mean is obtained using the standard formula for each sample. Of these 1000 intervals 961 included the true mean. The null hypothesis is that the standard formula produces confidence intervals that include the mean 95% of the time. Would the null hypothesis be rejected at =.05? 4. (12 points) A regression is run relating number of job offers to salary. In Figure 4a all 400 individuals are included using a salary of zero for those who did not receive any offer. In Figure 4b the 66 respondents without a job offer are eliminated and only the remaining 334 individuals were used. Figures 4a and 4b appear on the next page. A. Explain whether the results of the two models are similar in terms of i) predicting salaries from the number of job offers ii) the accuracy of these predictions. B. Explain which of the two models is more sensible by referring to appropriate output. Figure 4a Regression relating salary to job offers for all 400 respondents Figure 4b Regression relating salary to job offers for 334 respondents with jobs Bivariate Fit of Salary By Job Offers Bivariate Fit of Salary By Job Offers 90 90 80 80 70 50 Salary Salary 60 40 70 60 30 20 50 10 0 40 -1 0 1 2 3 4 5 6 7 8 9 10 11 1 2 3 4 Job Off ers Linear Fit 6 7 8 9 10 Linear Fit Linear Fit Linear Fit Salary = 29.885839 + 9.4377944 Job Of f ers Salary = 56.727649 + 2.0693021 Job Of f ers Sum mary of Fit Sum mary of Fit RSquare RSquare Adj Root Mean Square Error Mean of Response Observ ations (or Sum Wgts) 0.458412 0.457051 17.85332 52.27701 400 RSquare RSquare Adj Root Mean Square Error Mean of Response Observ ations (or Sum Wgts) Analys is of Variance Source Model Error C. Total DF Sum of Squares Mean Square F Ratio 1 107376.03 107376 336.8756 398 126858.89 319 Prob > F 399 234234.92 <.0001 Term Intercept Job Off ers 0.178728 0.176254 6.713941 62.60719 334 Analys is of Variance Source Model Error C. Total Param e te r Estim ate s DF Sum of Squares Mean Square 1 3256.854 3256.85 332 14965.564 45.08 333 18222.418 F Ratio 72.2509 Prob > F <.0001 Param e te r Estim ate s Estimate Std Error 29.885839 1.511665 9.4377944 0.514204 t Ratio Prob>|t| 19.77 <.0001 18.35 <.0001 Term Intercept Job Off ers 30 20 10 10 Residual Residual 5 Job Off ers -10 -30 Estimate Std Error 56.727649 0.783211 2.0693021 0.243446 t Ratio Prob>| t| 72.43 <.0001 8.50 <.0001 0 -10 -20 -50 -1 0 1 2 3 4 5 6 Job Off ers 7 8 9 10 11 1 2 3 4 5 6 Job Off ers 7 8 9 10 5. (18 points) Investment banking companies often ask probability questions of applicants. Parts A and B are two such questions. A. A recent Wall Street Journal article indicated how physicists from MIT were able to beat the roulette wheel by 6% by tracking the movement of the ball and entering the information into a computer. They were subsequently banned from playing. Specifically, assume that the expected profit on each roll is $60 with a standard deviation of $1,000 If they play for 10,000 rolls (it is correct to assume that the rolls are independent), what is the probability that they make over $500,000? B. Another favorite question involves two fair coins (i.e., P(Head)=.5) and one coin with a probability of a head of .75. You flip each of the coins in any order. You want to maximize the chance of getting at least two heads in a row. Is it best to flip the unfair coin first, second or third or does it matter? Explain your answer. C. A manager of the company wants to determine whether getting the probability question correct has any relation to whether a person is going to be a successful investor as determined by being in the “Top 10% Club”. Of the 63 employees who got the probability question correct, 10 of them are in the “Top 10% Club”. Is this sufficient in showing that those who get the probability question correct have greater than a .1 chance of being in the “Top 10% Club”? Use =.05 6. (8 points) For each of the following questions please state whether the value with the new information would be a) Lower b) The same c) Higher d) Cannot say without further information than the corresponding value assuming the information that was given originally. a) If the sample size for the test in Question 1 part a. were doubled and the sample average and standard deviation remained the same, what would happen to the P-value? b) If the assumed value of the mean in Question 1 part b. was changed from 61 to 62 what would happen to the probability of making an error? c) In the regression of salary and number of job offers (without the zero job offer respondents) it was determined that 30 of the respondents with one job offer accepted the job that they received from summer employment between their Junior and Senior year. Since these jobs are prestigious, the salaries of these students were relatively high. If these 30 respondents were removed from the analysis what effect would this have on the slope? d) If the analysis as in Figure 4b were restricted to only Finance majors, what effect would this have on the R-squared value?