BUS 445 Final Exam Preparation Summer 2023 The final exam will have the following types of questions: • Interpretation: You will be provided with computer output, which may be model output or graphical visualizations, and be asked to explain and interpret. o Strategic Advice for your interpretation: The best strategy is to write the minimum you need to clearly indicate understanding. Students sometimes write pages to explain something that can be addressed in 3 or 4 sentences, in the hopes that somewhere they will hit on something they can get marks for. To the person marking (yes, me), this approach makes it obvious that the understanding is not there! A question answered correctly in two sentences will get a better grade than if those two sentences are buried somewhere in a page of jargons. • Some managerial communication questions, requiring you to explain some analytics concepts/metrics/results to non-technical managers • Some simple calculations to show understanding of concepts. • Some short answer questions. A brief list of areas covered: Major Methods: Contingency Tables, Linear and Logistic Regression, Trees, Forests, Cluster Analysis, Principal Components, Applied Segmentation Building Blocks and Frameworks: mental models, measurement scales, missing values, coefficients and p-values, effect size, non-linear effects, correlated predictors, overfitting and measures of model fit, oversampling, lift charts, CRISP-DM Of course, none of these topics stand alone. For example, measurement scales affect everything; overfitting is closely tied to lift charts. Sample questions Short Answer Easy Questions • A survey of data scientists asks them to rank their most preferred software, that is first, second, third, etc. The data from the survey is coded as 1,2,3 etc. in a spreadsheet. What would be the risk if an inexperienced analyst uses this coding? • In developing a logistic regression model, what can an analyst do to reduce the likelihood of overfitting using only data on which the model is estimated? • A family-owned manufacturer makes vegan nut cheese (a cheese-like product made using nuts instead of dairy). The store tracks which of its 10 types of vegan cheese is the most popular, by counting how many Kilograms (Kg) of each was sold each month. This measurement scale of this variable is a _______________scale Short Answer Medium Difficulty Questions • Why should you not use “statistical significance” from regression models as the sole indication of predictor importance? • A linear regression model has both continuous and categorical predictors. Explain the difference in how the estimated coefficients of each type of variable is interpreted. • A random forest model, designed to predict uptake of a promotional program for a meal delivery service, shows that the two most important variables are the customers’ numbers of past deliveries and their total purchases. In a regression model using the same variables, total purchases does not even appear to be significant. What is the most likely reason these two models are so different? Your explanation needs to include the mechanism of the random forest algorithm that allows both variables to appear important. • A weakness of tree models compared to regression models is that they can miss weak predictors after detecting the strongest predictors. Explain why this difference occurs, in terms of the difference in the estimation process of the two models. An Example Interpretation Question The Canadian Immigration and Refugee Board decides whether to allow or deny refugee status to refugee claimants. Claimants who have been denied refugee status may ask the Federal Court of Appeal for permission to appeal the negative ruling. A judge then either gives or denies leave to appeal the ruling. Imagine that you are an immigration and refugee activist and that you have collected the following data on cases requesting leave to appeal the negative ruling of the Board. The data is as follows: -Name of judge hearing case. A factor with levels: Desjardins, Heald, Hugessen, Iacobucci, MacGuigan, Mahoney, Marceau, Pratte, Stone, Urie. merit -Judgment of merit of the case by an independent (not the judge) rater. A factor with levels: no, case has no merit; yes, case has some merit (leave to appeal should be granted). decision -Judge's decision. A factor with levels: no, leave to appeal not granted; yes, leave to appeal granted. language -Language of case. A factor with levels: English, French. location -Location of original refugee claim. A factor with levels: Montreal, other, Toronto. success -success rate, for all cases from the applicant's nation. judge You run a logistic regression model with decision as the target variable, with the model developed so that the probability of “yes” is modeled (Prob = 1 is a certainty of a “yes” decision). The following output is generated: Coefficients: Estimate Std. Error (Intercept) 0.51916 0.68266 judge[T.Heald] -1.36324 0.53655 judge[T.Hugessen] -1.49779 0.5289 judge[T.Iacobucci] -2.70031 0.72730 judge[T.MacGuigan] -1.28781 0.4616 judge[T.Mahoney] -0.84209 0.5348 judge[T.Marceau] 1.07194 0.59673 judge[T.Pratte] -2.00107 0.59556 judge[T.Stone] -1.66145 0.55652 judge[T.Urie] -0.07157 0.75393 language[T.French] -0.19384 0.60281 location[T.other] 1.19430 0.67761 location[T.Toronto] 0.94914 0.60813 merit[T.yes] 1.40494 0.27475 success 1.60878 0.30155 z value 0.760 -2.541 -2.832 -3.713 -2.789 -1.574 1.796 -3.360 -2.985 -0.095 -0.322 1.763 1.561 5.114 5.335 Pr(>|z|) 0.446962 0.011062 * 0.004630 ** 0.000205 *** 0.005280 ** 0.115413 0.072435 . 0.000779 *** 0.002832 ** 0.924373 0.747785 0.077981 . 0.118579 3.16e-07 *** 9.55e-08 *** 1.1 Which variable is continuous, and what is its effect on the decision, in plain language? (2 marks) 1.2 If you were a refugee claimant, which judge would you prefer and why? The “why” requires a short explanation of how to interpret the output. (3 marks) 1.3 If you wanted to improve the AIC of the model by removing variables, which variables would you want to investigate further before removing, why would you want to investigate them before removing, and how would you investigate them? Use your judgement of the problem context to address the “why”. (4 marks) An Example Interpretation and Communication Question The following Classification tree was calibrated from a study of predictors of Registered Retirement Savings Plan Contribution for customers of a bank. The target variable has the value “Y” for individuals who contributed and “N” otherwise, with “Y” oversampled to create a 50/50 balance in the analysis data set. The selected predictor variables are payr12 TOTDEP newMRGGBAL newLOANBAL = 1 if customer uses payroll deposit, 0 otherwise average total monthly deposits over previous 12 months average monthly mortgage balance over previous 12 months average monthly personal loan balance over previous 12 months Create a powerpoint presentation for management of the financial institution consisting of • one slide that gives a few brief managerial (no jargon) bullet points to introduce how any tree is created, and • one slide that gives a few brief managerial (no jargon) bullet points to guide your discussion of how any tree is interpreted, and • one slide that gives a few brief managerial (no jargon) bullet points to guide your discussion explaining the substance of what this particular tree is saying. Sketch the slides in your exam booklet. Include brief notes attached to each slide, which you would use to remind yourself what to say in the presentation. • only put concise bullet points on each slide • do not reproduce the tree. Assume your audience can look at the tree on a projection or handout as you are describing it. e.g., How ANY Tree Created (at least 3 points) Notes Interpretation (at least 4 points) - This Tree means (at least 3 points) Notes Notes (9 marks) An Example Easy Calculation Question Best Buy was planning an email promotional flyer to send to customers who have an online account and had signed up to receive regular promotional information. The creative group at Best Buy had some differences of opinions on what type of email design generates the best response. Cas, the chief analytic strategist, thus suggested an experiment to resolve the question. The group then designed three different email flyers that were sent out to a small sample of 600 of their online customers to find out if there would be any difference and what would work best. Below is a contigency table summarizing the data. Creative A Creative B Creative C Total No purchase 100 150 150 400 < $40 30 30 40 100 $40 - $100 40 15 5 60 30 5 5 40 > $100 Total 200 200 200 600 Produce a contingency table of row proportions. What pattern do you spot? (4 marks) If the p-value of the chi-square statistic for the contigency table is < 0.001. What does that mean, in plain English? (2 marks)