Master of Applied Statistics
Applied Statistics Comprehensive Exam
May 2015

Directions: This is a closed book exam with a 3-hour time limit. Attached you will find the relevant computer output, formula pages, and tables for the normal, t, χ², and F distributions. You may use a non-programmable, non-graphing calculator.

Answer Only Five of the Six Questions.

1. There are many gambling newsletters that purport to improve a bettor’s odds of winning bets on NFL football games. To investigate whether a particular newsletter’s betting schemes are profitable, a random sample of 50 games is selected. By following the recommendations of the newsletter, 30 of the 50 games produced winning wagers.

a) Test whether the newsletter can be said to significantly increase the odds of winning over what one could expect by selecting the winner at random (that is, a 50% chance of winning). Use a 0.05 level of significance.

b) Describe a type II error of your test in (a) in words. Describe the general relationship between the probability of a type I error and the probability of a type II error.

c) Suppose that the power of your test in (a) is 0.396 when the probability of winning a wager on an NFL game following the newsletter’s betting schemes is 0.60. Interpret this power.

d) Suppose the number of games sampled is increased from 50 to 100. Describe how the power would change (no need to do a calculation). Similarly, describe how the power would change when changing the level of significance from 0.05 to 0.01.

2. The following table presents a summary (sample size, mean, and standard deviation) of data regarding SAT scores (verbal and math) for some randomly selected high school students who intended to major in engineering or in language/literature.

                      Prospective major
            Engineering (n = 15)    Language/literature (n = 15)
Verbal      x̄ = 446, s = 42         x̄ = 534, s = 45
Math        x̄ = 548, s = 57         x̄ = 517, s = 52

a) Is there sufficient evidence to indicate a difference in mean verbal SAT scores for high school students intending to major in engineering and in language/literature? Conduct an appropriate test at a 0.05 level of significance. State your conclusion clearly.

b) Depending on the test you performed, one assumption of your analysis in part (a) might have been equal variances of verbal SAT scores for the two populations of high school students intending to major in engineering and in language/literature. What test would you use to check this assumption, and what assumptions are needed for your test?

c) If one wants to test whether students intending to major in engineering on average have a higher Math SAT score than Verbal SAT score, should one use a two-sample t test or a paired t test? Justify your choice. Is it possible to conduct the test based only on the summary statistics shown in the table rather than using all the raw data? Explain your reasoning.

3. A fire insurance company wants to relate the amount of fire damage in major residential fires to the distance between the burning house and the nearest fire station. The study is to be conducted in a large suburb of a major city; a sample of 15 recent fires in this suburb is selected. The amount of damage (y, in thousands of dollars) and the distance between the fire and the nearest fire station (x, in miles) are collected (raw data not shown). The scatterplot on the provided output shows a linear pattern between the damage and distance, and thus a simple linear regression model is fitted. Answer the following questions based on the output.

a) Do you think that the simple linear regression model provides a good fit to this data set? If so, give your reasons; if not, make suggestions on how to improve the model fit.

b) Based on the residual plots in the output, comment on the validity of the assumptions required for simple linear regression analysis.

c) Interpret the estimate of the slope parameter in the simple linear regression model.

d) Is it appropriate to interpret the estimated intercept for this model? Explain.

e) Let ρ denote the population coefficient of correlation between the random variables y = damage and x = distance in the study. Find an estimate of ρ based on the sample data here. What does ρ = 0 imply about the regression line? Find a p-value for testing H0: ρ = 0 vs. Ha: ρ ≠ 0 based on the sample data and output, and be sure to clearly state your conclusion.

4. A small company processes parcels. Let y = number of parcels processed in a day, and we have data for twenty days (for this analysis, treat y as a continuous variable). The goal of the analysis is to get a handle on the productivity of the various company workers. There are five workers and a foreman; the foreman supervises but also works with the others and was always present during this twenty-day period. In this particular period, workers 1 and 5 were always present / absent at the same time. For each day, define the regressors

x1 = 1 if worker 1 was present, x1 = 0 otherwise.
x2 = 1 if worker 2 was present, x2 = 0 otherwise.
x3 = 1 if worker 3 was present, x3 = 0 otherwise.
x4 = 1 if worker 4 was present, x4 = 0 otherwise.
x5 = 1 if worker 5 was present, x5 = 0 otherwise.

The quality manager conducted a multiple regression of y on x1-x4. All regression assumptions were checked with no problems found. Answer the following questions using the output page for problem 4.

a) Interpret the estimated regression coefficient for x3 for the company owner, who knows no statistics (no jargon please).

b) Interpret the estimated intercept for the model for the company owner, also without jargon.

c) Is there strong evidence that worker 2 is productive (in the only sense that this model can detect)? Give hypotheses, test statistic value, p-value, and a careful interpretation.

d) Tomorrow, only worker 3 and the foreman will be present. Give a prediction for the number of parcels produced tomorrow, along with an approximate 95% margin of error for your prediction.

e) Why was x5 not included in this multiple regression?

f) Is there strong collinearity in the regression on x1-x4? Why or why not?

5. A young scientist was investigating natural fly repellents. She placed 20 flies in a jar on its side; the open end was covered in cheesecloth soaked in a natural fly repellent (e.g. lemon juice, citric acid, etc.; see Figure 1 on the Problem 5 output page). After a time, she noted y = number of flies “repelled” by the liquid (treat y as a continuous variable for this analysis). For the control (distilled water) and each of the 6 natural fly repellents, she repeated this experiment 7 times, using a new set of 20 flies each time. All 49 runs were performed in completely random order.

a) Is there a name for this particular kind of experimental design (what is it)?

b) What are the formal statistical assumptions of a one-way analysis of variance to be performed on this data? Based only on the boxplots, do you have any concerns about the assumptions? What further assumption checks would you perform? Ignore any concerns about assumptions for the rest of this question.

c) The F statistic for the ANOVA test of homogeneity is 13.55. Interpret this result for a layperson.

d) With a design like this, one could decide to analyze the data in several different ways. Pretending you haven't yet seen the data, discuss the pros and cons of making your only analysis be (i) the ANOVA F test, (ii) Dunnett's MCC multiple comparison method, or (iii) Tukey's HSD multiple comparison method.

e) Because there is a control treatment, the experimenter initially chose to use Dunnett’s method for multiple comparisons. After looking at the results and noting that treatment 4 (Lemon Juice) might be superior to all other treatments, she changed her mind and wanted to also compare treatment 4 to all others. What method should she use to do this if she wants to do all comparisons with family-wise type I error rate at most 0.05?

6. Salary equity studies (simulated data, but the writer has seen real data very much like this): for each of 96 employees performing a similar job, let y = annual salary. We are interested in comparing mean y between genders, but of course we should, at the very least, adjust for x = years of experience. We have n1 = 80 males and n2 = 16 females; refer to “output for problem 6” in answering the following questions.

a) The scatter plot display shows (first row) y vs. x for females and males, and (second row) log10(y) vs. x. Which response variable do you prefer for the ANCOVA analysis, and why? For the rest of this problem, we will use LY = log10(y) as response variable; ANCOVA assumptions for this response variable were checked and deemed palatable.

b) The first step in many ANCOVA analyses is to test for equality of slopes. In language understandable to the company’s human resources (HR) officer, who knows little statistics, explain what it would mean in this context if the slopes were unequal.

c) After testing, the equal-slopes ANCOVA model was found to be palatable. Selected results are shown below the scatter plots in “Problem 6 output”. What do these analyses say about the adjusted mean (log10) salaries of men vs. women in this company? Support your interpretations with confidence statements and/or p-values as appropriate, but somewhere in your discussion should be an interpretation for the company’s HR officer.
Here are the commands used in producing the output shown:

    PROC GLM;
      CLASS GENDER;
      MODEL LY = GENDER X;
      LSMEANS GENDER / CL PDIFF;