Uploaded by Mina Hoang

Final Exam Preparation Guide

BUS 445
Final Exam Preparation
Summer 2023
The final exam will have the following types of questions:
• Interpretation: You will be provided with computer output, which may be model
output or graphical visualizations, and be asked to explain and interpret it.
o Strategic advice for your interpretation: The best strategy is to write the
minimum you need to clearly indicate understanding. Students sometimes
write pages to explain something that can be addressed in 3 or 4 sentences, in
the hope that somewhere they will hit on something they can get marks for.
To the person marking (yes, me), this approach makes it obvious that the
understanding is not there! A question answered correctly in two sentences
will get a better grade than if those two sentences are buried somewhere in a
page of jargon.
• Some managerial communication questions, requiring you to explain some
analytics concepts/metrics/results to non-technical managers.
• Some simple calculations to show understanding of concepts.
• Some short answer questions.
A brief list of areas covered:
Major Methods: Contingency Tables, Linear and Logistic Regression, Trees, Forests, Cluster
Analysis, Principal Components, Applied Segmentation
Building Blocks and Frameworks: mental models, measurement scales, missing values,
coefficients and p-values, effect size, non-linear effects, correlated predictors, overfitting and
measures of model fit, oversampling, lift charts, CRISP-DM
Of course, none of these topics stand alone. For example, measurement scales affect everything;
overfitting is closely tied to lift charts.
Sample questions
Short Answer Easy Questions
• A survey of data scientists asks them to rank their most preferred software, that is, first,
second, third, etc. The data from the survey is coded as 1, 2, 3, etc. in a spreadsheet. What
would be the risk if an inexperienced analyst uses this coding?
• In developing a logistic regression model, what can an analyst do to reduce the likelihood
of overfitting using only the data on which the model is estimated?
• A family-owned manufacturer makes vegan nut cheese (a cheese-like product made using
nuts instead of dairy). The store tracks which of its 10 types of vegan cheese is the most
popular by counting how many kilograms (kg) of each were sold each month. The
measurement scale of this variable is a _______________ scale.
Short Answer Medium Difficulty Questions
• Why should you not use “statistical significance” from regression models as the sole
indication of predictor importance?
• A linear regression model has both continuous and categorical predictors. Explain the
difference in how the estimated coefficients of each type of variable are interpreted.
• A random forest model, designed to predict uptake of a promotional program for a meal
delivery service, shows that the two most important variables are the customers’ numbers
of past deliveries and their total purchases. In a regression model using the same
variables, total purchases does not even appear to be significant. What is the most likely
reason these two models are so different? Your explanation needs to include the
mechanism of the random forest algorithm that allows both variables to appear
important.
• A weakness of tree models compared to regression models is that they can miss weak
predictors after detecting the strongest predictors. Explain why this difference occurs, in
terms of the difference in the estimation process of the two models.
An Example Interpretation Question
The Canadian Immigration and Refugee Board decides whether to allow or deny refugee status to
refugee claimants. Claimants who have been denied refugee status may ask the Federal Court of
Appeal for permission to appeal the negative ruling. A judge then either gives or denies leave to
appeal the ruling. Imagine that you are an immigration and refugee activist and that you have
collected the following data on cases requesting leave to appeal the negative ruling of the Board.
The data is as follows:
judge - Name of judge hearing case. A factor with levels: Desjardins, Heald, Hugessen,
Iacobucci, MacGuigan, Mahoney, Marceau, Pratte, Stone, Urie.
merit - Judgment of merit of the case by an independent (not the judge) rater. A factor with
levels: no, case has no merit; yes, case has some merit (leave to appeal should be
granted).
decision - Judge's decision. A factor with levels: no, leave to appeal not granted; yes, leave to
appeal granted.
language - Language of case. A factor with levels: English, French.
location - Location of original refugee claim. A factor with levels: Montreal, other,
Toronto.
success - Success rate, for all cases from the applicant's nation.
You run a logistic regression model with decision as the target variable, with the model developed
so that the probability of “yes” is modeled (Prob = 1 is a certainty of a “yes” decision). The
following output is generated:
Coefficients:
                     Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)           0.51916     0.68266    0.760  0.446962
judge[T.Heald]       -1.36324     0.53655   -2.541  0.011062 *
judge[T.Hugessen]    -1.49779     0.5289    -2.832  0.004630 **
judge[T.Iacobucci]   -2.70031     0.72730   -3.713  0.000205 ***
judge[T.MacGuigan]   -1.28781     0.4616    -2.789  0.005280 **
judge[T.Mahoney]     -0.84209     0.5348    -1.574  0.115413
judge[T.Marceau]      1.07194     0.59673    1.796  0.072435 .
judge[T.Pratte]      -2.00107     0.59556   -3.360  0.000779 ***
judge[T.Stone]       -1.66145     0.55652   -2.985  0.002832 **
judge[T.Urie]        -0.07157     0.75393   -0.095  0.924373
language[T.French]   -0.19384     0.60281   -0.322  0.747785
location[T.other]     1.19430     0.67761    1.763  0.077981 .
location[T.Toronto]   0.94914     0.60813    1.561  0.118579
merit[T.yes]          1.40494     0.27475    5.114  3.16e-07 ***
success               1.60878     0.30155    5.335  9.55e-08 ***
1.1 Which variable is continuous, and what is its effect on the decision, in plain language? (2 marks)
1.2 If you were a refugee claimant, which judge would you prefer and why? The “why” requires a
short explanation of how to interpret the output. (3 marks)
1.3 If you wanted to improve the AIC of the model by removing variables, which variables would
you want to investigate further before removing, why would you want to investigate them before
removing, and how would you investigate them? Use your judgement of the problem context to
address the “why”. (4 marks)
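As a study aid for reading logistic regression output like the table in this question, a coefficient on the log-odds scale can be turned into an odds ratio by exponentiating it. The following is a minimal sketch in Python and is not part of the exam; the dictionary and function names are illustrative assumptions, and only a few estimates are copied from the output.

```python
import math

# Estimates copied from the logistic regression output (log-odds scale).
# Only a few rows are included; the names used here are illustrative.
coefficients = {
    "judge[T.Iacobucci]": -2.70031,
    "judge[T.Marceau]": 1.07194,
    "merit[T.yes]": 1.40494,
    "success": 1.60878,
}


def odds_ratio(log_odds_coefficient):
    """Convert a log-odds coefficient into an odds ratio."""
    return math.exp(log_odds_coefficient)


for name, estimate in coefficients.items():
    print(f"{name:20s} estimate={estimate:+.5f}  odds ratio={odds_ratio(estimate):.3f}")
```

An odds ratio above 1 indicates higher odds of a “yes” decision, and one below 1 indicates lower odds, with the other predictors held fixed. For a factor, the comparison is against the reference level, which for judge appears to be Desjardins (the only level with no row in the output). For a continuous predictor, it is per one-unit increase.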
An Example Interpretation and Communication Question
The following Classification tree was calibrated from a study of predictors of Registered
Retirement Savings Plan Contribution for customers of a bank. The target variable has the value
“Y” for individuals who contributed and “N” otherwise, with “Y” oversampled to create a 50/50
balance in the analysis data set. The selected predictor variables are
payr12        = 1 if customer uses payroll deposit, 0 otherwise
TOTDEP        average total monthly deposits over previous 12 months
newMRGGBAL    average monthly mortgage balance over previous 12 months
newLOANBAL    average monthly personal loan balance over previous 12 months
Create a PowerPoint presentation for management of the financial institution consisting of
• one slide that gives a few brief managerial (no jargon) bullet points to introduce how any
tree is created, and
• one slide that gives a few brief managerial (no jargon) bullet points to guide your
discussion of how any tree is interpreted, and
• one slide that gives a few brief managerial (no jargon) bullet points to guide your
discussion explaining the substance of what this particular tree is saying.
Sketch the slides in your exam booklet. Include brief notes attached to each slide, which you
would use to remind yourself what to say in the presentation.
• only put concise bullet points on each slide
• do not reproduce the tree. Assume your audience can look at the tree on a projection
or handout as you are describing it.
e.g.,
Slide 1: How ANY Tree Is Created (at least 3 points), with Notes
Slide 2: Interpretation (at least 4 points), with Notes
Slide 3: This Tree Means (at least 3 points), with Notes
(9 marks)
An Example Easy Calculation Question
Best Buy was planning an email promotional flyer to send to customers who have an online
account and had signed up to receive regular promotional information. The creative group at
Best Buy had some differences of opinions on what type of email design generates the best
response. Cas, the chief analytic strategist, thus suggested an experiment to resolve the question.
The group then designed three different email flyers that were sent out to a small sample of 600
of their online customers to find out if there would be any difference and what would work best.
Below is a contingency table summarizing the data.
               Creative A   Creative B   Creative C   Total
No purchase           100          150          150     400
< $40                  30           30           40     100
$40 - $100             40           15            5      60
> $100                 30            5            5      40
Total                 200          200          200     600
Produce a contingency table of row proportions. What pattern do you spot? (4 marks)
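For checking the arithmetic by hand, a row proportion is each cell divided by its row total, so every row sums to 1. The following is a minimal sketch in Python; the variable and function names are illustrative assumptions, and the counts are copied from the table in this question.

```python
# Counts copied from the contingency table.
# Rows are purchase amounts; columns are Creative A, Creative B, Creative C.
counts = {
    "No purchase": [100, 150, 150],
    "< $40": [30, 30, 40],
    "$40 - $100": [40, 15, 5],
    "> $100": [30, 5, 5],
}


def row_proportions(table):
    """Divide every cell by its row total, so each row sums to 1."""
    result = {}
    for label, row in table.items():
        total = sum(row)
        result[label] = [cell / total for cell in row]
    return result


proportions = row_proportions(counts)
print(f"{'':12s} {'A':>6s} {'B':>6s} {'C':>6s}")
for label, row in proportions.items():
    print(f"{label:12s} " + " ".join(f"{p:6.3f}" for p in row))
```

Reading across a row shows how the customers in that purchase band are split among the three creatives, which is the comparison the question asks about.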
Suppose the p-value of the chi-square statistic for the contingency table is < 0.001. What does
that mean, in plain English? (2 marks)
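To see where such a chi-square statistic comes from, the sketch below computes it from the counts in the table, comparing each observed count with the count expected if purchase amount and creative were unrelated (row total times column total, divided by the grand total). This is an illustrative Python sketch with assumed names. It computes only the statistic and the degrees of freedom; the p-value itself would come from statistical software or a chi-square table.

```python
# Observed counts copied from the contingency table (totals excluded).
# Rows: No purchase, < $40, $40 - $100, > $100. Columns: Creative A, B, C.
observed = [
    [100, 150, 150],
    [30, 30, 40],
    [40, 15, 5],
    [30, 5, 5],
]


def chi_square_statistic(table):
    """Return the Pearson chi-square statistic and its degrees of freedom."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (obs - expected) ** 2 / expected

    degrees_of_freedom = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, degrees_of_freedom


statistic, degrees_of_freedom = chi_square_statistic(observed)
print(f"chi-square = {statistic:.2f}, degrees of freedom = {degrees_of_freedom}")
```

The larger the statistic is relative to its degrees of freedom, the further the observed counts are from what independence would predict.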