Licensed to Elany Rubet, Rensselaer Polytechnic Institute (rubete@rpi.edu). Do not copy or distribute.

TUTORIAL
Predictive Modeling

Copyright © 2017 by DecisionPro Inc. This document is primarily intended to be used in conjunction with the Enginius software suite. To order copies or request permission to reproduce materials, go to http://www.enginius.biz. No part of this publication may be reproduced, stored in a retrieval system, used in a spreadsheet, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without the permission of DecisionPro, Inc. v180516

Overview

Predictive modeling is an individual-level response model that helps analyze and explain the choices individual customers make in a market. It helps firms understand the extent to which factors such as the price of a brand or its ease of installation influence a customer's choice. A brand's purchase probability at the individual level can be aggregated to determine the brand's market share at the market level. Firms can also use predictive modeling to develop marketing programs tailored to specific market segments, or even to individual customers.

Further, if a company has purchase data about its products versus those of its competitors (product choice data), as well as some observed independent variables (e.g., gender, price, promotion), it can use predictive modeling to answer such questions as:
o Does a customer’s gender influence his or her purchase decision regarding our product(s)?
o Do competitors’ promotions affect the purchase of our product(s)?
o How do our promotions affect our sales rates?

Getting Started

Predictive modeling allows you to use your own data directly or to use a preformatted template.
Because predictive modeling requires a specific data format, users with their own data should review the preformatted template to learn about the appropriate structure.

Step 1 - Creating a template

The screen capture below shows the dialog box that results from using Enginius Templates (Predictive Modeling).

The options are as follows:

Target variable: There are four types of data that can be used in predictive modeling:
1. Choice between 2 alternatives (0/1): This data format is suitable when customers have two choice alternatives, such as a choice between “buy” and “don’t buy.” That is, it requires a yes-no decision process.
2. Choice between multiple alternatives (A/B/C): This data format considers customer choices across a subset of related competitors, such as brands A, B, and C. Therefore, it requires a one-out-of-many decision process.
3. Continuous (X): This data format is suitable when the variable can take any value in a continuous range, such as amount spent.
4. Discrete-continuous (0/X): This data format is suitable when the variable takes a continuous value only if an event occurs (e.g., 0 if it doesn’t occur, X if it does).

Calibration data:
1. Number of predictive variables: The number of independent variables you collected or observed during the study, such as respondent gender, whether the product was on sale, and so forth.
2. Number of observations: The total number of respondents (customers) in your study.

Out-of-sample predictions: When checked, the template will include an additional data block for entering observations that are used to assess the predictive validity of the model.

Note: The check box at the bottom of the dialog box will cause the template to populate with sample (random) data, which allows you to run predictive modeling immediately and preview the output produced.
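The four target-variable formats listed above can be illustrated with a short Python sketch. This is not part of Enginius: the function name `guess_target_type` and the sample columns are hypothetical, and the rules are a simple heuristic (for example, a continuous variable that happens to contain a zero would be labeled discrete-continuous). In Enginius itself, you choose the format explicitly in the dialog box.

```python
def guess_target_type(values):
    """Heuristically match a target column to one of the four formats.

    Hypothetical helper for illustration only; Enginius asks the user
    to select the target type rather than inferring it.
    """
    if not values:
        raise ValueError("the target column is empty")
    # Text labels such as brand names suggest a one-out-of-many choice.
    if any(isinstance(v, str) for v in values):
        return "multiple alternatives (A/B/C)"
    distinct = set(values)
    # Only zeros and ones: a yes-no decision.
    if distinct <= {0, 1}:
        return "two alternatives (0/1)"
    # Zeros mixed with other amounts: 0 if the event does not occur, X if it does.
    if 0 in distinct:
        return "discrete-continuous (0/X)"
    return "continuous (X)"


# Made-up sample columns, one per format.
print(guess_target_type([0, 1, 1, 0]))
print(guess_target_type(["A", "B", "C", "A"]))
print(guess_target_type([12.5, 80.0, 43.2]))
print(guess_target_type([0, 35.0, 0, 120.0]))
```

The sketch only shows what a column of each type looks like; preparing the data in the template remains a manual step.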
After selecting the desired model options, click Run to generate the data collection template, as shown below:

Step 2 - Entering your data

Predictive modeling requires:

Predictive variables: A column represents each variable specified for the study, and all independent variables should have consistent value ranges. That is, the data for a single independent variable should be scaled within the same range. Independent variables can take on discrete values if they are appropriately specified using dummy-variable coding.

Target variable: The target variable column will consist of the customer’s choice. Examples include:
1. 0 or 1 for buy/don’t buy analysis
2. Big Spender/Small Spender/Inactive for multiple alternative analysis
3. 280, 5, 175, 595, 1625, 100 for continuous data

Optional data requirements:

Out-of-sample data: If you selected Out-of-sample predictions when setting up your template, you will see a data block for the out-of-sample data.

Step 3 - Running analyses

To run a predictive modeling analysis on the data you have loaded or prepared, click on the Predictive Modeling icon on the left side of the Enginius dashboard. The following example shows the data that is contained in the Predictive Modeling tutorial.

The above dialog box will allow you to specify the parameters for the analysis you are about to run.

Target Variable allows you to select the type of choice data that you have available.
Calibration data allows you to specify the data block that contains your predictive variables and the data block that contains your target variable.
  o Box-Cox transforms the predictors: See Appendix A
  o Cross-validation: See Appendix A
Out-of-sample predictions allow you to assess the predictive validity of the model by applying the results from the calibration to an additional data block.

Make the desired selections for the above data blocks and click the Run button.

Reminder: Clicking the world icon beside the “Run” option will allow you to choose a different output format for the report.

The predictive modeling analysis will be run with the chosen selections and the analysis report will be generated. The analysis described below was created with the selections shown above.

When the analysis is complete, the following dialog box will appear:

Interpreting the results

Confusion matrix

The confusion matrix section of the report assesses the model performance. It contains two matrices: numerical counts and percentages of the same data. The diagonal of both matrices shows the cases in which the predicted outcome matches the observed outcome. A high value or percentage on the diagonal indicates strong agreement between observed and predicted behavior.

Model predictions

The model predictions table shows how the model's predictions compare to actual results.

Gain chart and lift

A gain chart is a useful representation of how good a predictive model is at identifying the most likely responses. The x-axis represents the population sorted in decreasing order of choice likelihood, and the y-axis represents the actual choices. The diagonal represents a random selection process, and the red line represents the actual data. The model performance improves as the green line (representing the model) departs from random selection and approaches the red line.
The dashed green line represents the gain chart obtained on the entire calibration data, without cross-validation, whereas the green area represents the gain chart obtained by cross-validation. The latter sometimes shows degraded but more realistic performance.

Appendix A

Box-Cox transforms the predictors

In predictive models, the distribution of some variables might be highly skewed. Typically, the number of past customers' transactions or past purchases will be skewed: a large number of customers have made just 1 purchase in the past, many customers have made about 10 purchases, and only a handful have made 100 purchases or more. The same problem often occurs with purchase amounts, income, etc.

Since many predictive models (linear and logistic regressions) work best when predictors and target variables follow a more normal-like distribution, the Box-Cox transformation reduces the skew of such variables so that they become more balanced. It is an automatic process that does not require the user's intervention.

A Box-Cox transformation will automatically transform a variable X into a new variable Y with a more normal-like distribution. Even though X -> Y is always defined, Y -> X might not be. For that reason, while a Box-Cox transform can be applied to predictors, it cannot be applied to the target variable. In the case of target variables, only log transforms are available.

Cross-Validation

Cross-validation is a technique to evaluate predictive models by partitioning the original sample into a training set to train the model, and a test set to evaluate it.

In k-fold cross-validation, the original sample is randomly partitioned into k equal-sized subsamples. In the case of 10-fold cross-validation, for instance, the model is estimated on 90% of the data set and tested on the remaining 10%. The operation is repeated 10 times, with a different test set each time, so that every observation is used for testing exactly once.
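The partitioning step of k-fold cross-validation can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the procedure Enginius runs internally; the function name `k_fold_indices` and its `seed` parameter are assumptions made for the example.

```python
import random


def k_fold_indices(n_observations, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.

    Hypothetical helper for illustration; it only splits row indices
    and does not estimate any model.
    """
    indices = list(range(n_observations))
    # Shuffle so that the sample is partitioned at random.
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds whose sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        # Every fold except the i-th one forms the training set.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


# 10-fold example on 100 observations: each round trains on 90 rows, tests on 10.
splits = list(k_fold_indices(n_observations=100, k=10))
for fold_number, (train_idx, test_idx) in enumerate(splits, start=1):
    print(fold_number, len(train_idx), len(test_idx))
```

In a full workflow, the model would be estimated on the rows in `train_idx` and scored on the rows in `test_idx` in each round, and the ten sets of test results would then be combined.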