180119 Predictive

1. In fitting a model to predict whether a person viewing an ecommerce web site will click on a particular link, a certain
company drew the training data from web logs of the browsing records of prior visitors. Various variables were found
to be useful in predicting the target, including a binary variable indicating whether or not the person made a purchase.
How should that variable be handled:

It should be included as a predictor due to its likely predictive power.

It should be excluded since it is uncorrelated with the target variable.

It should be excluded since it will not be available in new data.

It should be included, but only in models that rely on binary input variables.
Explanation: A purchase will, in most cases, be the last thing that occurs in a web browsing session, and it will occur
after the user has clicked a number of other links. For a newly arriving customer, whether they ultimately make a
purchase will probably not be known when you are trying to predict whether they will click on a particular link. This
topic is covered in our Predictive Analytics series.
2. A dataset has 2000 records and 50 variables with 6% of the values missing, spread randomly throughout the
records and variables. An analyst decides to remove records that have missing values. What is the approximate
probability that a given record will be removed?

50

0.06

0.03

0.95
Explanation: The probability that a given record will have a value for the first variable is 0.94. The probability that a
given record will have a value for the first variable and the second variable is 0.94 x 0.94. For the first, second and
third, it is 0.94 x 0.94 x 0.94. Carry on through all 50 variables, and the probability that a record will have values for all
of them is 0.94^50 = 0.045. So the probability that a record will not have values for all of them, and will have to be
removed, is (1 - 0.045) = about 0.95. The basic probability calculations are covered in Statistics 1 and 2, and the
issue of missing data in predictive models is handled in Predictive Analytics 1 and 2. Different ways of dealing with
missing data are described in our Missing Data course.
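The arithmetic in this explanation can be checked with a few lines of Python. This is only a sketch of the calculation described above; it assumes, as the question does, that values are missing independently of one another, and the variable names are illustrative.

```python
# Probability that a record must be dropped because at least one of its
# values is missing, assuming each value is missing independently.
p_missing = 0.06   # share of values missing (from the question)
n_vars = 50        # variables per record (from the question)
n_records = 2000   # records in the dataset (from the question)

p_complete = (1 - p_missing) ** n_vars   # all 50 values present
p_removed = 1 - p_complete               # at least one value missing

print(f"P(record is complete) = {p_complete:.3f}")
print(f"P(record is removed)  = {p_removed:.3f}")
print(f"Expected complete records: {n_records * p_complete:.0f} of {n_records}")
```

Under these assumptions, only a small fraction of the 2000 records would survive listwise deletion.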
3. Two predictive models have been fit to data with a binary target variable, using the 100+ predictor variables that
are available. One is a logistic regression model, the other is a neural net using the maximum number of variables,
layers, nodes and cycles permitted by the software. Which of the following is/are true: (i) The logistic regression
model is less likely to overfit the training data; (ii) The difference in accuracy between training and validation data is
likely to be greater for the neural net than the logistic regression; (iii) A simpler neural net may perform worse on the
training data, but better on the validation data

All of the above

i and ii only

ii only

i and iii only
Explanation: It is possible that a deep (complex) neural net left to run in unconstrained fashion can fit the training
data perfectly, that is, with 0 error. By contrast, a logistic regression's predictions have a global structure defined
by a limited number of parameters, which will leave residuals for most predicted values. All three statements in the
question follow from these facts. This subject is covered in Predictive Analytics 1 and 2.
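The contrast between a model that can memorize the training data and one with only a few parameters can be illustrated with a toy sketch. Note the substitutions: a 1-nearest-neighbour lookup stands in for the unconstrained neural net, and a fixed one-parameter threshold rule stands in for the logistic regression; neither is the actual method named in the question. The data, the noise rate and the function names are all hypothetical.

```python
import random

random.seed(0)

def make_data(n):
    # One predictor x in [0, 1]; the true rule is y = 1 when x > 0.5,
    # with roughly 20% of the labels flipped at random as noise.
    data = []
    for _ in range(n):
        x = random.random()
        y = int(x > 0.5)
        if random.random() < 0.2:
            y = 1 - y
        data.append((x, y))
    return data

train, valid = make_data(200), make_data(200)

def simple_model(x):
    # Stand-in for a constrained model: a single global threshold.
    return int(x > 0.5)

def memorizer(x):
    # Stand-in for an unconstrained model: return the label of the
    # closest training point, which reproduces the training noise.
    return min(train, key=lambda p: abs(p[0] - x))[1]

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

for name, model in [("simple", simple_model), ("memorizer", memorizer)]:
    print(name, "train:", accuracy(model, train), "validation:", accuracy(model, valid))
```

The memorizer reproduces every training label, noise included, so any gap between its training and validation accuracy is a direct measure of overfitting; the threshold rule cannot fit the flipped labels and leaves residuals on the training data.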
4. A business wishes to segment its customers into a small number of groups so that it can effectively target
marketing efforts at different customer types. Which of the following correctly describes the process and the order of
the steps:

a. This is a predictive modeling task: select variables, decide number of clusters, apply clustering method(s),
normalize the data, describe the clusters.

b. This is an unsupervised learning task: select variables, normalize the data, apply clustering method(s),
decide number of clusters, denormalize the data, describe the clusters.

c. This is a clustering task: select variables, apply clustering method(s), normalize the data, decide number
of clusters, denormalize the data, describe the clusters.

d. This is an unsupervised learning task: decide number of clusters, apply clustering method(s), select
variables, normalize the data, describe the clusters.
Explanation: Normalizing the data, if deemed necessary, should be done before the clustering; otherwise variables
measured on larger scales will dominate the distance calculations and the input variables will contribute in a very
imbalanced fashion. The use of clustering methods is described in Predictive Analytics 3 and Cluster Analysis.
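A small sketch can show why normalization belongs before the clustering step. The customer records below are hypothetical, and z-score scaling is only one of several possible normalizations; the point is how the distances that a clustering method relies on change once the variables are put on a common scale.

```python
import math
from statistics import mean, pstdev

# Hypothetical customers: (annual income in dollars, store visits per month)
customers = [(30000, 2), (32000, 12), (90000, 3), (91000, 11)]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def zscore_columns(rows):
    # Rescale each column to mean 0 and standard deviation 1.
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) for c in cols]
    return [tuple((v - m) / s for v, m, s in zip(row, mus, sds)) for row in rows]

normalized = zscore_columns(customers)

# Customers 0 and 1 differ mainly in visits; customers 0 and 2 mainly in income.
raw_ratio = euclidean(customers[0], customers[1]) / euclidean(customers[0], customers[2])
norm_ratio = euclidean(normalized[0], normalized[1]) / euclidean(normalized[0], normalized[2])

print(f"raw distance ratio:        {raw_ratio:.3f}")
print(f"normalized distance ratio: {norm_ratio:.3f}")
```

On the raw data the income difference swamps the difference in visits, so the first pair looks almost identical; after normalization both variables contribute on a comparable footing.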
5. Souvenir sales at a beach resort in Queensland, Australia are shown in this figure as raw data and as transformed
data. Choose the statement, below, that is most accurate:

The appropriate function to represent the trend relationship between demand (Y) and time (X) is a linear
one, since the logarithmic transformation of the y-axis produces an approximate linear relationship with
respect to trend, and annual seasonality effe

The appropriate model for these data is an annual seasonal one, since there are annual sales peaks and
there is no trend involved.

The lower figure is a logarithmic transformation of the upper figure, and it shows that the appropriate model
for the relationship between Y (demand) and X (time) is an exponential one, with seasonal effects.

The lower figure is an exponential transformation of the data in the upper figure, and its purpose is to
account for the seasonal effects.
Explanation: The annual peaks in sales correspond to summers in the southern hemisphere, which makes sense for
a beach resort. The Y-axis in the lower figure is a logarithmic scale, indicating a log-transform. The fact that the log
representation is more or less linear indicates that the relationship in the raw data is an exponential one. This topic is
covered in the Forecasting course.
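The link between a straight line on a log scale and exponential growth in the raw data can be verified numerically. The series below is hypothetical (5% growth per period, no seasonality) and is not the souvenir sales data from the figure.

```python
import math

# Hypothetical demand series: starts at 100 and grows 5% per period.
demand = [100 * 1.05 ** t for t in range(12)]
log_demand = [math.log(y) for y in demand]

# Period-to-period changes in the raw and log-transformed series.
raw_steps = [b - a for a, b in zip(demand, demand[1:])]
log_steps = [b - a for a, b in zip(log_demand, log_demand[1:])]

print("raw steps grow over time:", [round(s, 2) for s in raw_steps])
print("log steps are constant:  ", [round(s, 4) for s in log_steps])
```

The raw increments keep growing, while the log increments are all equal to log(1.05), which is what a linear trend on the log scale means.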
6. The attached figure is a set of association rules derived from transactional data on cosmetic purchases. Which of
the following statements is most accurate:

The underlying data are in the form of a count matrix (columns = products, rows = customers, cells =
number of purchases over time), and Support (c) indicates the percentage of all customers who purchase the
Consequent items.

The underlying data are in the form of a count matrix (columns = products, rows = customers, cells =
number of purchases over time), and Support (c) indicates the percentage of all transactions in which the
Consequent item is purchased.

The underlying data are a binary matrix (columns = products, rows = transactions, cells = 0/1 for purchase or
no purchase), and the lift ratio, applied to transactions, = P(Consequent|Antecedent)/P(Consequent).

The 4th rule is interpreted as follows: "If Brushes are purchased, the probability is 0.5636 that Bronzer +
Concealer + Nail Polish will be purchased in a subsequent transaction."
Explanation: Data used in association rules are typically in binary matrix form, in which rows are transactions,
columns are products and 0/1 indicates purchase/no-purchase. Rules about "what goes with what" pertain to single
transactions. This subject is covered in our Predictive Analytics 3 course.
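The lift ratio P(Consequent|Antecedent)/P(Consequent) can be computed directly from a binary transaction matrix. The sketch below uses a small made-up matrix, not the data behind the figure; the function names are illustrative.

```python
# Hypothetical binary matrix: rows = transactions, columns = products,
# 1 = product purchased in that transaction.
products = ["Brushes", "Nail Polish", "Bronzer"]
transactions = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
    [1, 1, 0],
    [0, 0, 0],
    [0, 1, 1],
]

def support(items):
    # Share of transactions that contain every item in `items`.
    idx = [products.index(p) for p in items]
    hits = sum(all(row[i] for i in idx) for row in transactions)
    return hits / len(transactions)

def lift(antecedent, consequent):
    # Confidence = P(consequent | antecedent); lift = confidence / P(consequent).
    confidence = support(antecedent + consequent) / support(antecedent)
    return confidence / support(consequent)

print("support(Brushes) =", support(["Brushes"]))
print("lift(Brushes -> Nail Polish) =", lift(["Brushes"], ["Nail Polish"]))
```

A lift above 1 means the consequent appears more often in transactions containing the antecedent than in transactions overall.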
7. Consider two different text mining tasks: (i) Mining the "contact us" submissions from a web site, to predict
purchase/no-purchase, (ii) Mining internal email correspondence in a natural resource company to determine
relevance to an environmental enforcement action. Think carefully about the process of preparing the text for
predictive modeling, and the scenario involved. Which of the following is most true:

Normalizing all email addresses (and replacement with a single term denoting "emailtoken") would probably
be appropriate in case (i) but not case (ii)

The email addresses in case (ii) will probably occur with very low and roughly equal frequency and not be
meaningful for prediction.

Some email addresses in case (i) will probably occur with highly unequal frequency, and merit stemming

Developing concepts will be important in case (i) for the purpose of extracting meaning from individual
documents.
Explanation: The "contact us" submissions will contain email addresses that are all different from one another, so
they will not contain useful systematic information to predict sales. The addresses in the company emails, by
contrast, occur at different frequencies that can be correlated with other data to yield useful predictive information.
This subject is covered in our Text Mining course.
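Replacing every email address with a single token, as in the first answer choice, might look like the sketch below. The regular expression is a deliberately simple approximation of an email address, and the sample message and function name are hypothetical.

```python
import re

# Simplified email pattern; an assumption for illustration, not a full
# implementation of the email address grammar.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def normalize_emails(text, token="emailtoken"):
    # Replace each address with one shared token so that unique addresses
    # do not become thousands of one-off terms in the vocabulary.
    return EMAIL_RE.sub(token, text)

msg = "Please reply to jane.doe@example.com or j_smith@example.org about my order."
cleaned = normalize_emails(msg)
print(cleaned)
```

In a scenario like case (ii), one would skip this step and keep the addresses as terms, since who wrote to whom may itself be predictive.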
8. In considering the use of logistic regression, neural networks, and classification & regression trees for prediction
and classification, (choose the best answer)

Logistic regression is best at capturing a linear relationship for predicting continuous data outcomes, while
both neural nets and classification & regression trees excel at capturing interactions in the predictors.

Both neural nets and classification & regression trees produce "black box" models, while logistic regression
requires more computation time.

Unlike neural nets, both logistic regression and classification & regression trees produce models that help
explain the effect of predictor variables on the target.

Logistic regression is computationally efficient, while both neural nets and classification & regression trees
produce decision rules that are easily explained to non-statisticians.
Explanation: Logistic regression is computationally fast and yields predictor coefficients for predicting binary (not
continuous) outcomes; neural networks are highly flexible but, unlike classification and regression trees, do not
provide interpretable information about relationships. These methods are covered in detail in our Predictive Analytics
series.
9. A direct response advertising firm, in a test of a popup web offer presented to all visitors, gets a response rate of
1.5% with no predictive model applied. It develops a logistic regression model to estimate the probability that visitors
will respond. In validating the model on a holdout sample, it gets a lift of 2 on the top decile. Which of the following is
true?

The predictive model will effectively lift the response probability for the average visitor by 50%.

The predictive model will increase the response probability for the average visitor by 100%.

Those 10% of the visitors predicted as most likely to respond will respond at an average rate of 3%.

Those 2% of the visitors predicted as most likely to respond will respond at an average rate of 10%.

The popup offer will yield a full (100%) response rate if limited to the top 1.5%.
Explanation: In predictive models, customers are ordered according to probability of response, and decile lift,
calculated by decile from top to bottom, shows how much better you do compared to the average no-model response
rate. A lift of 2 on the top decile indicates a doubling of the response rate for that decile: 2 x 1.5% = 3%. This is
covered in detail in our Predictive Analytics series.
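Top-decile lift can be computed from holdout scores and outcomes as sketched below. The holdout set is made up to match the numbers in the question (a 1.5% overall response rate and a top decile that responds at 3%); the function name and the scores are hypothetical.

```python
def top_decile_lift(scores, responses):
    # Rank visitors by predicted probability, highest first.
    ranked = sorted(zip(scores, responses), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, len(ranked) // 10)
    top_rate = sum(r for _, r in ranked[:n_top]) / n_top
    overall_rate = sum(responses) / len(responses)
    return top_rate / overall_rate

# Hypothetical holdout of 1000 visitors with 15 responders (1.5% overall).
# Visitor i gets score 1 - i/1000, so visitors 0..99 form the top decile.
n = 1000
scores = [1 - i / n for i in range(n)]
responders = {5, 40, 77} | {150 + 70 * k for k in range(12)}  # 3 fall in the top decile
responses = [1 if i in responders else 0 for i in range(n)]

print("overall response rate:", sum(responses) / n)
print("top-decile lift:", top_decile_lift(scores, responses))
```

Here 3 of the 100 top-scored visitors respond, giving a top-decile rate of 3% against 1.5% overall.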
10. A political consultant wants to predict how individual voters will vote, and has data on whether each voter voted
in the primary and general elections of the past 10 years, data on 100+ demographic attributes of the
neighborhood in which the voter lives, as well as purchased data on 200+ consumer spending variables. Which of the
following would NOT be useful in dealing with the issues of dimension reduction and feature selection:

Correlation analysis

Principal components analysis

Replacing some raw variables with derived variables

Using a neural net with fewer layers than normal

Variable elimination based on domain knowledge
Explanation: Altering the structure of the neural net will not affect the number of variables (dimensions), nor will it
provide information that can easily be used to eliminate or consolidate variables.