Uploaded by Izzy H

HW2 Solutions BABS 550 300 2020W Application of Statistics in Management

Q1. Which type of chart should you use to plot the number of MBA students in each track, by gender?
> Bar chart.
Q2. A survey of 60 customers included two questions:
1. What is your primary residence? (House, apartment, condo, room/residence, other)
2. What device are you most likely to use to watch a movie? (TV, computer, tablet, smartphone, other)
To investigate if there is a relationship between primary residence and most likely device, what method would
you use?
> Chi-square test.
Q3. We wish to know if infant mortality rate, internet broadband subscribers (per 100 people), energy usage per
capita, and population size (big vs. not big) can be used to predict a country's GDP. Which method is most
appropriate?
> Multiple regression
Q4. The dependent variable is Employment Status (employed or not employed). The predictor variables are
age, gender (male, female, other), and years of education. What is the best method to use?
> Logistic regression
Q5. After surveying 70 random Vancouverites and 73 random Torontonians, you wish to know if the proportion
of Vancouverites who can drive a standard transmission is the same or different from the proportion of
Torontonians who can drive a standard transmission. What is the best method to use?
> Two-sample z-test
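As a rough illustration of the two-sample z-test for proportions named above, here is a minimal sketch using only the Python standard library. The sample sizes (70 and 73) come from the question. The driver counts (28 and 22) are hypothetical placeholders, since the question gives no data, and the function name `two_proportion_z_test` is made up for this example.

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for H0: p1 == p2, using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided P-value from the standard normal: 2 * P(Z > |z|) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Sample sizes are from the question; the counts x1 and x2 are hypothetical.
z, p_value = two_proportion_z_test(x1=28, n1=70, x2=22, n2=73)
print(f"z = {z:.3f}, P-value = {p_value:.4f}")
```

The pooled proportion is used in the standard error because the null hypothesis assumes both cities share one common proportion.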
Q6. This CBC article (http://www.cbc.ca/news/politics/grenier-uselection-debate-polls-1.3782098) from the fall of
2016 discusses the problem with online polls. Read the first five paragraphs and then answer the following
question:
Why are the results of these polls not representative of the population? Give 3 reasons in bullet form, quoting
the article directly if you like.
> Lots of good answers.
Q7. Random working adults from three cities were surveyed to find out which form of transportation they would
most prefer to use to commute to and from work, if all options were available. The results are shown here:
                 Vancouver   Calgary   Edmonton
Public Transit       45         23        39
Car                  77         44        35
Bike                 12         18         8
Other                15         12        19
What is the null hypothesis?
> The distribution of commuting preference is the same across all three cities.
Q8. Continuing with the question about commuting...
What is the test statistic?
> 18.15
Q9. What is the P-value?
Report your answer to at least 4 decimal places.
> 0.0059
Q10. What is the conclusion?
> Reject the null hypothesis.
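The answers to Q8 through Q10 can be checked with a short calculation. The sketch below uses the survey counts from Q7 (transport modes as rows; Vancouver, Calgary, Edmonton as columns) and only the Python standard library. The helper `chi_square_upper_tail_even_df` is a made-up name, and it relies on a closed form that holds only when the degrees of freedom are even, which happens to be the case here (df = 6).

```python
import math

# Observed counts from Q7: rows are transport modes,
# columns are Vancouver, Calgary, Edmonton.
observed = [
    [45, 23, 39],  # Public Transit
    [77, 44, 35],  # Car
    [12, 18, 8],   # Bike
    [15, 12, 19],  # Other
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Sum of (observed - expected)^2 / expected over all cells.
chi_square = 0.0
for i, row in enumerate(observed):
    for j, count in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi_square += (count - expected) ** 2 / expected

df = (len(observed) - 1) * (len(observed[0]) - 1)

def chi_square_upper_tail_even_df(x, df):
    """P(X > x) for a chi-square variable; closed form valid only for even df."""
    if df % 2 != 0:
        raise ValueError("closed form only holds for even df")
    half = x / 2
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))

p_value = chi_square_upper_tail_even_df(chi_square, df)
print(f"chi-square = {chi_square:.2f}, df = {df}, P-value = {p_value:.4f}")
```

Because the P-value is well below 0.05, the calculation supports rejecting the null hypothesis, in line with the answer to Q10.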
Q11. What is the best way to tell if there is a nonlinear relationship between height (X) and salary (Y) in a
particular sample?
> Plot the data
Q12. Use the Wage data (from class) to create a simple linear regression model with Wage as the Y variable
and Age as the X variable. Consider the hypothesis test of whether or not the model, as a whole, is “useful”.
What is the test statistic?
Report your answer to at least 2 decimal places.
> 0.27
Q13. What is the P-value that is associated with the hypothesis test in the above question?
Report your answer to at least 4 decimal places.
> 0.6037
Q14. Now do the same thing with Wage as the Y variable and Educ as the X variable. For the hypothesis test of
whether or not the model is “useful”, find the test statistic and the P-value.
What is the test statistic?
Report your answer to at least 2 decimal places.
> 17.26
Q15. What is the P-value?
> 0.0002
Q16. We could create a multiple regression model based on the Wage data using all four X variables. What
principle suggests we should possibly include only one or two of the X variables in our final version of the
model?
> Parsimony
Q17. Thinking about multiple linear regression...
More variables will always result in a better model.
> False
Q18. Thinking about multiple linear regression...
More variables will always result in a better fit (or at least as good a fit) of the sample data.
> True
Q19. Thinking about multiple linear regression...
The F-test tells us if the model is useful and which coefficients to include.
> False
Q20. Thinking about multiple linear regression...
The t-test for a coefficient tells us if that X variable has a nonlinear relationship with Y.
> False
Q21. Thinking about multiple linear regression...
Each residual plot should show roughly equal spread along the range of that X variable, if the model is properly
constructed.
> True
Q22. Download the GPA SAT dataset containing data on 14 students. The following
variables are included in this dataset:
Uni_GPA — GPA after first 2 years at university.
Verbal — Verbal score on the SAT.
Math — Math score on the SAT.
HS_GPA — High school GPA upon graduation.
Create a correlation table for these four variables and upload it (preferably as a pdf).
> The correlation table must show all pairs of correlations:
           Uni_GPA   Verbal   Math    HS_GPA
Uni_GPA       -
Verbal      0.851      -
Math        0.777    0.795     -
HS_GPA      0.897    0.797   0.709      -
Q23. We wish to use multiple regression to predict Uni_GPA using the other three variables. Without running
the model, based only on the correlations between each X variable and the Y variable (Uni_GPA), which
variables might we expect to have significant coefficients and why?
(Provide a brief answer of no more than a couple sentences. Because you are just guessing, I'm interested in
the "why" part of your answer.)
> There are different ways to answer this. Examples include:
• The correlations are all high, so we might expect all of the coefficients to be significant. In
regression, a high correlation between an X variable and the Y variable often means that X variable is useful in
explaining Y, so the t-test for its coefficient would show significance (e.g., have a P-value < 0.05).
• OR The correlations are all high, including among the X variables. This may result in only some of the
coefficients being significant (e.g., only some of the t-tests for X variable coefficients will have P-values less
than alpha=0.05).
• OR The correlations with Uni_GPA are all high, but the X variables are also highly correlated with each other.
So perhaps only the X variable with the highest correlation, HS_GPA, will have a significant coefficient. The
others may not provide additional information.
Q24. Create a multiple regression model with all three potential X variables. Then remove one variable at a time
(if any) based on its P-value until all coefficients are significant. Which X variables are in this final model?
(select all correct answers)
> HS_GPA
Q25. Why do you think the final model achieved in the above question is what it is? If this contradicts what you
predicted based only on the correlations calculated earlier, make sure you comment on this difference.
(Provide a brief answer of no more than 2–3 sentences)
> The idea here is that the X variables are highly correlated with each other, so the marginal value of each
additional X variable is low. In other words, the best one, HS_GPA, explains Y so well that the others don’t add
much. (Note that this concept is called multicollinearity; you don’t need to know this term for BABS 550).
Q26. Even though t-tests are intended to be used with quantitative data, they can also be used with categorical
data if the nearly normal condition is satisfied.
> False
Q27. A company has developed a special diet for athletes. They decide to test this diet out by recruiting 19
participants and giving them this diet for 4 weeks. They timed these participants running a 10K race before the
diet and again after the diet. The company wants to show specifically that there is an improvement in race
times. Download this data, conduct a hypothesis test, and report your results...
What are the hypotheses?
> H0: μd = 0
> HA: μd > 0 (or could be < if you set up differently)
Q28. What is the test statistic?
> 1.64 (or could be negative if you set up differently)
Q29. What is the P-value?
> 0.0587
Q30. What is the conclusion?
> Fail to reject the null hypothesis.
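The paired t-test in Q27 through Q30 can be sketched as follows, using only the Python standard library. The race-time data are not reproduced here, so the example only recomputes a one-sided P-value from the rounded test statistic 1.64 with 18 degrees of freedom (19 participants). The function names are made up for illustration, differences are assumed to be before minus after, and the tail probability is approximated numerically with Simpson's rule rather than taken from a statistics library.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(before, after):
    """t statistic for H0: mu_d = 0, with d = before - after."""
    diffs = [b - a for b, a in zip(before, after)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

def t_upper_tail(t, df, steps=2000):
    """P(T > t) for Student's t, integrating the density from 0 to t (Simpson's rule)."""
    const = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def pdf(x):
        return const * (1 + x * x / df) ** (-(df + 1) / 2)

    h = t / steps  # steps must be even for Simpson's rule
    total = pdf(0) + pdf(t)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * pdf(k * h)
    # The density is symmetric, so P(T > t) = 0.5 - integral from 0 to t.
    return 0.5 - total * h / 3

# 19 participants give df = 18; 1.64 is the rounded statistic reported in Q28.
p_value = t_upper_tail(1.64, df=18)
print(f"one-sided P-value = {p_value:.4f}")
```

Since the input statistic is rounded to two decimals, the result should land close to, but not necessarily exactly on, the reported 0.0587. Either way it sits above 0.05, which is why the conclusion is to fail to reject.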
Q31. Suppose your colleague has used a two-sample z-test to see if there is a difference between the
proportion of male employees and female employees receiving promotions within their first year of
employment. The p-value is 0.043, suggesting there is evidence that the two proportions are not the same. But
then you learn that your colleague decided to remove one observation and redo the analysis in order to get a
p-value of 0.052. He uses this result to claim there is no statistically significant difference between promotions
within the first year of employment for the two populations.
Which of the following statements are true?
> This is unethical