HW2 Solutions

Q1. Which type of chart should you use to plot the number of MBA students in each track, by gender?
> Bar chart.

Q2. A survey of 60 customers included two questions:
1. What is your primary residence? (House, apartment, condo, room/residence, other)
2. What device are you most likely to use to watch a movie? (TV, computer, tablet, smartphone, other)
To investigate if there is a relationship between primary residence and most likely device, what method would you use?
> Chi-square test.

Q3. We wish to know if infant mortality rate, internet broadband subscribers (per 100 people), energy usage per capita, and population size (big vs. not big) can be used to predict a country's GDP. Which method is most appropriate?
> Multiple regression

Q4. The dependent variable is Employment Status (employed or not employed). The predictor variables are age, gender (male, female, other), and years of education. What is the best method to use?
> Logistic regression

Q5. After surveying 70 random Vancouverites and 73 random Torontonians, you wish to know if the proportion of Vancouverites who can drive a standard transmission is the same or different from the proportion of Torontonians who can drive a standard transmission. What is the best method to use?
> Two-sample z-test

Q6. This CBC article (http://www.cbc.ca/news/politics/grenier-uselection-debate-polls-1.3782098) from the fall of 2016 discusses the problem with online polls. Read the first five paragraphs and then answer the following question: Why are the results of these polls not representative of the population? Give 3 reasons in bullet form, quoting the article directly if you like.
> Lots of good answers.

Q7. Random working adults from three cities were surveyed to find out which form of transportation they would most prefer to use to commute to and from work, if all options were available.
The results are shown here:

                 Vancouver  Calgary  Edmonton
Public Transit          45       23        39
Car                     77       44        35
Bike                    12       18         8
Other                   15       12        19

What is the null hypothesis?
> The distribution of commuting preference is the same across all three cities.

Q8. Continuing with the question about commuting... What is the test statistic?
> 18.15

Q9. What is the P-value? Report your answer to at least 4 decimal places.
> 0.0059

Q10. What is the conclusion?
> Reject the null hypothesis.

Q11. What is the best way to tell if there is a nonlinear relationship between height (X) and salary (Y) in a particular sample?
> Plot the data

Q12. Use the Wage data (from class) to create a simple linear regression model with Wage as the Y variable and Age as the X variable. Consider the hypothesis test of whether or not the model, as a whole, is “useful”. What is the test statistic? Report your answer to at least 2 decimal places.
> 0.27

Q13. What is the P-value that is associated with the hypothesis test in the above question? Report your answer to at least 4 decimal places.
> 0.6037

Q14. Now do the same thing with Wage as the Y variable and Educ as the X variable. For the hypothesis test of whether or not the model is “useful”, find the test statistic and the P-value. What is the test statistic? Report your answer to at least 2 decimal places.
> 17.26

Q15. What is the P-value?
> 0.0002

Q16. We could create a multiple regression model based on the Wage data using all four X variables. What principle suggests we should possibly include only one or two of the X variables in our final version of the model?
> Parsimony

Q17. Thinking about multiple linear regression... More variables will always result in a better model.
> False

Q18. Thinking about multiple linear regression... More variables will always result in a better fit (or at least as good a fit) of the sample data.
> True

Q19. Thinking about multiple linear regression... The F-test tells us if the model is useful and which coefficients to include.
> False

Q20. Thinking about multiple linear regression... The t-test for a coefficient tells us if that X variable has a nonlinear relationship with Y.
> False

Q21. Thinking about multiple linear regression... Each residual plot should show roughly equal spread along the range of that X variable, if the model is properly constructed.
> True

Q22. Download the GPA SAT dataset containing data on 14 students. The following variables are included in this dataset:
Uni_GPA: GPA after first 2 years at university.
Verbal: Verbal score on the SAT.
Math: Math score on the SAT.
HS_GPA: High school GPA upon graduation.
Create a correlation table for these four variables and upload it (preferably as a pdf).
> The correlation table must show all pairs of correlations:

          Uni_GPA  Verbal   Math  HS_GPA
Uni_GPA         -
Verbal      0.851       -
Math        0.777   0.795      -
HS_GPA      0.897   0.797  0.709       -

Q23. We wish to use multiple regression to predict Uni_GPA using the other three variables. Without running the model, based only on the correlations between each X variable and the Y variable (Uni_GPA), which variables might we expect to have significant coefficients and why? (Provide a brief answer of no more than a couple sentences. Because you are just guessing, I'm interested in the "why" part of your answer.)
> There are different ways to answer this. Examples include:
• The correlations are all high, so we might expect all of the coefficients to be significant. In regression, a high correlation between an X variable and the Y variable often means that X variable is useful in explaining Y, so the t-test for its coefficient would show significance (e.g., have a P-value < 0.05).
• OR The correlations are all high, including among the X variables. This may result in only some of the coefficients being significant (e.g., only some of the t-tests for X variable coefficients will have P-values less than alpha=0.05).
• OR The correlations are all high, but they’re also correlated with each other.
So perhaps only the highest one, HS_GPA, will have a significant coefficient. The others may not provide additional information.

Q24. Create a multiple regression model with all three potential X variables. Then remove one variable at a time (if any) based on its P-value until all coefficients are significant. Which X variables are in this final model? (select all correct answers)
> HS_GPA

Q25. Why do you think the final model achieved in the above question is what it is? If this contradicts what you predicted based only on the correlations calculated earlier, make sure you comment on this difference. (Provide a brief answer of no more than 2–3 sentences)
> The idea here is that the X variables are highly correlated with each other, so the marginal value of each additional X variable is low. In other words, the best one, HS_GPA, explains Y so well that the others don’t add much. (Note that this concept is called multicollinearity; you don’t need to know this term for BABS 550).

Q26. Even though t-tests are intended to be used with quantitative data, they can also be used with categorical data if the nearly normal condition is satisfied.
> False

Q27. A company has developed a special diet for athletes. They decide to test this diet out by recruiting 19 participants and giving them this diet for 4 weeks. They timed these participants running a 10K race before the diet and again after the diet. The company wants to show specifically that there is an improvement in race times. Download this data, conduct a hypothesis test, and report your results... What are the hypotheses?
> H0: μd = 0
> HA: μd > 0 (or could be < if you set up differently)

Q28. What is the test statistic?
> 1.64 (or could be negative if you set up differently)

Q29. What is the P-value?
> 0.0587

Q30. What is the conclusion?
> Fail to reject the null hypothesis.

Q31.
Suppose your colleague has used a two-sample z-test to see if there is a difference between the proportion of male employees and female employees receiving promotions within their first year of employment. The P-value is 0.043, suggesting there is evidence that the two proportions are not the same. But then you learn that your colleague decided to remove one observation and redo the analysis in order to get a P-value of 0.052. He uses this result to claim there is no statistically significant difference between promotions within the first year of employment for the two populations. Which of the following statements are true?
> This is unethical
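
Worked check for Q7 to Q10. The sketch below recomputes the chi-square statistic and P-value from the commuting table using only the Python standard library. It is one possible way to verify the reported answers, not necessarily the tool used in class. The function names are illustrative, and the P-value helper relies on a closed-form tail formula that holds only when the degrees of freedom are even (here df = 6).

```python
import math

# Observed counts from the Q7 table (rows: commute mode; columns: Vancouver, Calgary, Edmonton).
observed = {
    "Public Transit": [45, 23, 39],
    "Car": [77, 44, 35],
    "Bike": [12, 18, 8],
    "Other": [15, 12, 19],
}


def chi_square_statistic(table):
    """Return (statistic, degrees of freedom) for a two-way table of counts."""
    rows = list(table.values())
    row_totals = [sum(row) for row in rows]
    col_totals = [sum(col) for col in zip(*rows)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(rows):
        for j, obs in enumerate(row):
            # Expected count if preference had the same distribution in every city.
            expected = row_totals[i] * col_totals[j] / n
            stat += (obs - expected) ** 2 / expected
    df = (len(rows) - 1) * (len(col_totals) - 1)
    return stat, df


def chi_square_upper_tail_even_df(x, df):
    """P(X > x) for a chi-square variable; this closed form is valid only for even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive, even df")
    half = x / 2
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))


stat, df = chi_square_statistic(observed)
p_value = chi_square_upper_tail_even_df(stat, df)
print(f"chi-square = {stat:.2f}, df = {df}, P-value = {p_value:.4f}")
```

After rounding, the statistic and P-value should agree with the answers given for Q8 and Q9 (18.15 and 0.0059). Since the P-value is below 0.05, this is consistent with rejecting the null hypothesis in Q10.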