1. In fitting a model to predict whether a person viewing an ecommerce web site will click on a particular link, a certain company drew the training data from web logs of the browsing records of prior visitors. Various variables were found to be useful in predicting the target, including a binary variable indicating whether or not the person made a purchase. How should that variable be handled?

- It should be included as a predictor, due to its likely predictive power.
- It should be excluded, since it is uncorrelated with the target variable.
- It should be excluded, since it will not be available in new data.
- It should be included, but only in models that rely on binary input variables.

Explanation: A purchase will, in most cases, be the last thing that occurs in a web browsing session, and it will occur after the user has clicked a number of other links. For a newly arriving customer, whether they will ultimately make a purchase will probably not be known at the point when you are trying to predict whether they will click on a particular link. This topic is covered in our Predictive Analytics series.

2. A dataset has 2000 records and 50 variables, with 6% of the values missing, spread randomly throughout the records and variables. An analyst decides to remove records that have missing values. What is the approximate probability that a given record will be removed?

- 50
- 0.06
- 0.03
- 0.95

Explanation: The probability that a given record will have a value for the first variable is 0.94. The probability that it will have values for both the first and second variables is 0.94 x 0.94; for the first, second, and third, it is 0.94 x 0.94 x 0.94. Carrying on through all 50 variables, the probability that a record will have values for all of them is 0.94^50 = about 0.045. So the probability that a record will be missing at least one value, and will have to be removed, is 1 - 0.045, or about 0.95 (see the short computation after question 3). The basic probability calculations are covered in Statistics 1 and 2, and the issue of missing data in predictive models is handled in Predictive Analytics 1 and 2. Different ways of dealing with missing data are described in our Missing Data course.

3. Two predictive models have been fit to data with a binary target variable, using the 100+ predictor variables that are available. One is a logistic regression model; the other is a neural net using the maximum number of variables, layers, nodes, and cycles permitted by the software. Which of the following is/are true?

(i) The logistic regression model is less likely to overfit the training data.
(ii) The difference in accuracy between training and validation data is likely to be greater for the neural net than for the logistic regression.
(iii) A simpler neural net may perform worse on the training data, but better on the validation data.

- All of the above
- i and ii only
- ii only
- i and iii only

Explanation: A deep (complex) neural net left to run in unconstrained fashion can fit the training data to perfection, i.e. with zero error. By contrast, a logistic regression's predictions have a global structure defined by a limited number of parameters, which will leave residuals for most predicted values. All three statements follow from these facts (a small demonstration appears below). This subject is covered in Predictive Analytics 1 and 2.
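The arithmetic in question 2 can be checked in a few lines. This is a minimal sketch in base Python; the variable names are ours, and nothing beyond the standard library is assumed:

# Each of the 50 values is present independently with probability 1 - 0.06 = 0.94.
p_present = 1 - 0.06
n_vars = 50

p_complete = p_present ** n_vars    # probability a record has all 50 values, ~0.0453
p_removed = 1 - p_complete          # probability the record is dropped, ~0.9547

print(f"P(complete) = {p_complete:.4f}, P(removed) = {p_removed:.4f}")

At roughly 95%, removing incomplete records would leave only about 0.045 x 2000 = 90 of the original 2000 records, which is why case-wise deletion is usually a poor choice at this level of missingness.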
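The train-versus-validation gap in question 3 is also easy to reproduce. The sketch below is illustrative only, assuming scikit-learn is available; the synthetic dataset and model settings are our own, not part of the question:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic binary-target data with 100+ predictors, echoing the question.
X, y = make_classification(n_samples=500, n_features=120, n_informative=15,
                           random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3,
                                                      random_state=0)

logit = LogisticRegression(max_iter=5000).fit(X_train, y_train)

# A large, lightly regularized net is more prone to fitting the training data exactly.
net = MLPClassifier(hidden_layer_sizes=(200, 200), alpha=1e-6,
                    max_iter=2000, random_state=0).fit(X_train, y_train)

for name, model in [("logistic regression", logit), ("neural net", net)]:
    gap = model.score(X_train, y_train) - model.score(X_valid, y_valid)
    print(f"{name}: train-validation accuracy gap = {gap:.3f}")

On runs like this the neural net typically reaches near-perfect training accuracy while giving up more on the validation set, consistent with statements (i) and (ii); shrinking the net to a single small hidden layer usually narrows the gap, consistent with (iii).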
4. A business wishes to segment its customers into a small number of groups so that it can effectively target marketing efforts at different customer types. Which of the following correctly describes the process and the order of the steps?

a. This is a predictive modeling task: select variables, decide number of clusters, apply clustering method(s), normalize the data, describe the clusters.
b. This is an unsupervised learning task: select variables, normalize the data, apply clustering method(s), decide number of clusters, denormalize the data, describe the clusters.
c. This is a clustering task: select variables, apply clustering method(s), normalize the data, decide number of clusters, denormalize the data, describe the clusters.
d. This is an unsupervised learning task: decide number of clusters, apply clustering method(s), select variables, normalize the data, describe the clusters.

Explanation: Normalizing the data, if deemed necessary, should be done before the clustering; otherwise the input variables may contribute in very imbalanced fashion. Deciding on the number of clusters typically follows the clustering itself (for example, by comparing solutions for different numbers of clusters), and denormalizing allows the clusters to be described in the original units. A sketch of this workflow appears after question 6. The use of clustering methods is described in Predictive Analytics 3 and Cluster Analysis.

5. Souvenir sales at a beach resort in Queensland, Australia are shown in this figure as raw data and as transformed data. Choose the statement below that is most accurate:

- The appropriate function to represent the trend relationship between demand (Y) and time (X) is a linear one, since the logarithmic transformation of the y-axis produces an approximate linear relationship with respect to trend, and annual seasonality effects.
- The appropriate model for these data is an annual seasonal one, since there are annual sales peaks and there is no trend involved.
- The lower figure is a logarithmic transformation of the upper figure, and it shows that the appropriate model for the relationship between Y (demand) and X (time) is an exponential one, with seasonal effects.
- The lower figure is an exponential transformation of the data in the upper figure, and its purpose is to account for the seasonal effects.

Explanation: The annual peaks in sales correspond to summers in the southern hemisphere, which makes sense for a beach resort. The Y-axis in the lower figure is a logarithmic scale, indicating a log transform. The fact that the log representation is more or less linear indicates that the relationship in the raw data is an exponential one (a numerical illustration appears after question 6). This topic is covered in the Forecasting course.

6. The attached figure is a set of association rules derived from transactional data on cosmetic purchases. Which of the following statements is most accurate?

- The underlying data are in the form of a count matrix (columns = products, rows = customers, cells = number of purchases over time), and Support (c) indicates the percentage of all customers who purchase the Consequent items.
- The underlying data are in the form of a count matrix (columns = products, rows = customers, cells = number of purchases over time), and Support (c) indicates the percentage of all transactions in which the Consequent item is purchased.
- The underlying data are a binary matrix (columns = products, rows = transactions, cells = 0/1 for purchase or no purchase), and the lift ratio, applied to transactions, = P(Consequent|Antecedent)/P(Consequent).
- The 4th rule is interpreted as follows: "If Brushes are purchased, the probability is 0.5636 that Bronzer + Concealer + Nail Polish will be purchased in a subsequent transaction."

Explanation: Data used in association rules are typically in binary matrix form, in which rows are transactions, columns are products, and 0/1 indicates purchase/no purchase. Rules about "what goes with what" pertain to single transactions (see the toy computation below). This subject is covered in our Predictive Analytics 3 course.
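The step ordering in question 4's correct sequence maps onto a standard workflow. Here is a minimal sketch with scikit-learn k-means; the random stand-in data and the choice of k are ours:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))   # stand-in for the selected customer variables

# Normalize first, so no single variable dominates the distance calculations.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply the clustering method, then decide the number of clusters,
# e.g., by comparing within-cluster dispersion (inertia) across candidate k.
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    print(k, round(km.inertia_, 1))

# Denormalize to describe the chosen clusters in the original units.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)
centers_in_original_units = scaler.inverse_transform(km.cluster_centers_)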
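The reasoning in question 5, that a roughly linear log plot implies an exponential raw relationship, can be illustrated numerically. In this sketch the growth rate and seasonal amplitude are invented for illustration; only numpy is assumed:

import numpy as np

months = np.arange(84)                             # seven years of monthly observations
seasonal = 0.4 * np.sin(2 * np.pi * months / 12)   # annual seasonal swing
y = 100 * np.exp(0.05 * months + seasonal)         # exponential trend with seasonality

# A straight line explains log(y) much better than it explains y itself.
for label, series in [("raw y", y), ("log y", np.log(y))]:
    slope, intercept = np.polyfit(months, series, 1)
    resid = series - (slope * months + intercept)
    r2 = 1 - resid.var() / series.var()
    print(f"{label}: linear-fit R^2 = {r2:.3f}")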
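The definitions in question 6's correct option, a binary transaction matrix and lift = P(Consequent|Antecedent) / P(Consequent), can be computed directly. A toy sketch with pandas; the six transactions and product names are invented:

import pandas as pd

# Binary matrix: rows = transactions, columns = products, 1 = purchased.
df = pd.DataFrame({
    "Brushes": [1, 1, 0, 1, 0, 1],
    "Bronzer": [1, 1, 0, 0, 1, 1],
    "Mascara": [0, 1, 1, 0, 0, 1],
})

antecedent, consequent = "Brushes", "Bronzer"

support_c = df[consequent].mean()                            # P(Consequent)
confidence = df.loc[df[antecedent] == 1, consequent].mean()  # P(Consequent | Antecedent)
lift = confidence / support_c

print(f"support(C) = {support_c:.2f}, confidence = {confidence:.2f}, lift = {lift:.2f}")

A lift above 1 means the antecedent raises the probability of the consequent within the same transaction; nothing in these rules refers to a subsequent transaction, which is what makes the fourth option in question 6 wrong.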
7. Consider two different text mining tasks: (i) mining the "contact us" submissions from a web site to predict purchase/no purchase, and (ii) mining internal email correspondence in a natural resource company to determine relevance to an environmental enforcement action. Think carefully about the process of preparing the text for predictive modeling, and the scenario involved. Which of the following is most true?

- Normalizing all email addresses (replacing each with a single term denoting "emailtoken") would probably be appropriate in case (i) but not case (ii).
- The email addresses in case (ii) will probably occur with very low and roughly equal frequency and not be meaningful for prediction.
- Some email addresses in case (i) will probably occur with highly unequal frequency, and merit stemming.
- Developing concepts will be important in case (i) for the purpose of extracting meaning from individual documents.

Explanation: The "contact us" submissions will contain email addresses that are all different from one another, so they will not contain useful systematic information to predict sales. The company emails, by contrast, occur at different frequencies that can be correlated with other data to yield useful predictive information (a sketch of the normalization step appears after question 8). This subject is covered in our Text Mining course.

8. In considering the use of logistic regression, neural networks, and classification & regression trees for prediction and classification, choose the best answer:

- Logistic regression is best at capturing a linear relationship for predicting continuous data outcomes, while both neural nets and classification & regression trees excel at capturing interactions in the predictors.
- Both neural nets and classification & regression trees produce "black box" models, while logistic regression requires more computation time.
- Unlike neural nets, both logistic regression and classification & regression trees produce models that help explain the effect of predictor variables on the target.
- Logistic regression is computationally efficient, while both neural nets and classification & regression trees produce decision rules that are easily explained to non-statisticians.

Explanation: Logistic regression is computationally fast and yields predictor coefficients for predicting binary (not continuous) outcomes; neural networks are highly flexible but, unlike classification and regression trees, do not provide interpretable information about relationships. These methods are covered in detail in our Predictive Analytics series.
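The email normalization mentioned in question 7 is a small preprocessing step. A sketch using Python's re module; the pattern is a deliberately simplified approximation of email syntax:

import re

# Simplified email pattern; real-world address matching is more involved.
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")

def normalize_emails(text: str) -> str:
    """Replace every email address with a single shared token."""
    return EMAIL.sub("emailtoken", text)

print(normalize_emails("Contact jane.doe@example.com or sales@shop.co.uk"))
# -> Contact emailtoken or emailtoken

In case (i) this collapses thousands of one-off addresses into a single feature; in case (ii) the specific senders and recipients may themselves carry signal, so you would likely leave them intact.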
9. A direct response advertising firm, in a test of a popup web offer presented to all visitors, gets a response rate of 1.5% with no predictive model applied. It develops a logistic regression model to estimate the probability that visitors will respond. In validating the model on a holdout sample, it gets a lift of 2 on the top decile. Which of the following is true?

- The predictive model will effectively lift the response probability for the average visitor by 50%.
- The predictive model will increase the response probability for the average visitor by 100%.
- Those 10% of the visitors predicted as most likely to respond will respond at an average rate of 3%.
- Those 2% of the visitors predicted as most likely to respond will respond at an average rate of 10%.
- The popup offer will yield a full (100%) response rate if limited to the top 1.5%.

Explanation: In predictive models, customers are ordered according to probability of response, and decile lift, calculated by decile from top to bottom, shows how much better you do compared to the average no-model response rate. A lift of 2 indicates a doubling of the response rate: the top decile responds at 2 x 1.5% = 3% (a worked computation appears after question 10). This is covered in detail in our Predictive Analytics series.

10. A political consultant wants to predict how individual voters will vote, and has data on whether the voter has voted in the past 10 years' worth of primary and general elections, data on 100+ demographic attributes of the neighborhood in which the voter lives, as well as purchased data on 200+ consumer spending variables. Which of the following would NOT be useful in dealing with the issues of dimension reduction and feature selection?

- Correlation analysis
- Principal components analysis
- Replacing some raw variables with derived variables
- Using a neural net with fewer layers than normal
- Variable elimination based on domain knowledge

Explanation: Altering the structure of the neural net will not affect the number of variables (dimensions), nor will it provide information that can easily be used to eliminate or consolidate variables. The other options all either reduce the number of inputs or consolidate them into fewer, more informative ones (see the PCA sketch below).
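Decile lift, as used in question 9, is computed from a scored holdout sample. In this numpy sketch the scores and responses are simulated so that the overall response rate is about 1.5% and the top decile responds at roughly twice that rate, mirroring the question:

import numpy as np

rng = np.random.default_rng(0)
n = 100_000
scores = rng.uniform(size=n)                      # model-estimated response probabilities
responded = rng.uniform(size=n) < 0.03 * scores   # simulated responses, ~1.5% overall

overall_rate = responded.mean()

# Rank by score, take the top 10%, and compare its response rate to the overall rate.
order = np.argsort(scores)[::-1]
top_decile = responded[order[: n // 10]]
lift = top_decile.mean() / overall_rate

print(f"overall rate = {overall_rate:.4f}, "
      f"top-decile rate = {top_decile.mean():.4f}, lift = {lift:.2f}")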
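Among the useful options in question 10, principal components analysis is the easiest to demonstrate mechanically. A scikit-learn sketch; the 300-column matrix is random stand-in data built from a handful of latent factors:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
latent = rng.normal(size=(5000, 10))     # a few underlying factors
loadings = rng.normal(size=(10, 300))
X = latent @ loadings + 0.5 * rng.normal(size=(5000, 300))   # 300 observed variables

# PCA on standardized data, keeping enough components for 90% of the variance.
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.90).fit(X_std)
X_reduced = pca.transform(X_std)

print(X_reduced.shape)   # roughly (5000, 10): far fewer derived columns than 300

As the explanation notes, shrinking a neural net does not do this: the 300 inputs remain 300 inputs, with no guidance on which to drop or combine.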