

Predictive Analytics 1 – Machine Learning Tools
Week 1: Assignment 1 answers
Problem 2.1
2.1 Assuming that data mining techniques are to be used in the following cases, identify whether the
task required is supervised or unsupervised learning.
a. Deciding whether to issue a loan to an applicant based on demographic and financial data
(with reference to a database of similar data on prior customers).
Answer to 2.1.a:
This is supervised learning, because the database includes information on whether the loan was
approved or not.
b. In an online bookstore, making recommendations to customers concerning additional items
to buy based on the buying patterns in prior transactions.
Answer to 2.1.b:
This is unsupervised learning, because there is no apparent outcome (e.g., whether the
recommendation was adopted or not).
c. Identifying a network data packet as dangerous (virus, hacker attack) based on comparison to
other packets whose threat status is known.
Answer to 2.1.c:
This is supervised learning, because for the other packets the status is known.
d. Identifying segments of similar customers.
Answer to 2.1.d:
This is unsupervised learning, because there is no known outcome (though once you use unsupervised
learning to identify segments, you could use supervised learning to classify new customers into those
segments).
e. Predicting whether a company will go bankrupt based on comparing its financial data to those
of similar bankrupt and non-bankrupt firms.
Answer to 2.1.e:
This is supervised learning, because the status of the similar firms is known.
f. Estimating the repair time required for an aircraft based on a trouble ticket.
Answer to 2.1.f:
This is supervised learning, because there is likely to be knowledge of actual (historic) repair times of
similar repairs.
g. Automated sorting of mail by zip code scanning.
Answer to 2.1.g:
This is supervised learning, because the true zip codes of previously sorted mail are known and can
serve as labels for training the scanner.
h. Printing of custom discount coupons at the conclusion of a grocery store checkout based on
what you just bought and what others have bought previously.
Answer to 2.1.h:
This is unsupervised learning, if we assume that we do not know what will be purchased in the future.
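The two-stage idea mentioned in answer 2.1.d (first discover segments without any labels, then use them to classify new customers) can be sketched in a few lines. The customer features, cluster count, and initialization below are hypothetical, chosen only to illustrate the pattern:

```python
import numpy as np

rng = np.random.default_rng(1)
# hypothetical customer features: [annual spend ($1000s), store visits per month]
customers = np.vstack([rng.normal([2.0, 1.0], 0.3, (20, 2)),   # low-activity group
                       rng.normal([8.0, 6.0], 0.3, (20, 2))])  # high-activity group

# Unsupervised step: 2-means clustering discovers segments; no outcome labels are used.
centroids = customers[[0, -1]].copy()              # crude initialization
for _ in range(10):
    dists = ((customers[:, None, :] - centroids) ** 2).sum(axis=-1)
    labels = dists.argmin(axis=1)                  # segment assignment per customer
    centroids = np.array([customers[labels == k].mean(axis=0) for k in range(2)])

# Supervised step: the discovered segments now act as class labels, and a
# nearest-centroid classifier assigns a new customer to one of them.
new_customer = np.array([7.5, 5.5])
segment = int(((new_customer - centroids) ** 2).sum(axis=-1).argmin())
```

The clustering itself never sees an outcome; only after the segments exist do they become the "labels" for a supervised classification of new customers.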
Problem 2.6
2.6 In fitting a model to classify prospects as purchasers or nonpurchasers, a certain company drew the
training data from internal data that include demographic and purchase information. Future data to be
classified will be lists purchased from other sources, with demographic (but not purchase) data included.
It was found that “refund issued” was a useful predictor in the training data. Why is this not an
appropriate variable to include in the model?
Answer to 2.6:
The variable “refund issued” is unknown prior to the actual purchase, and is therefore not useful in a
predictive model of future purchase behavior. In fact, “refund issued” can be present only for actual
purchases and never for non-purchases. This explains why it was found to be so closely related to the
purchase outcome in the training data: it leaks the outcome rather than predicting it, and it will not
even be available in the purchased lists the model must score.
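To make the leakage concrete, here is a minimal sketch with made-up records; the field names and values are illustrative, not from the problem's actual database:

```python
# Hypothetical internal training records: "refund_issued" can exist only when a
# purchase actually happened (None = no purchase, so no refund was possible).
training = [
    {"age": 35, "refund_issued": 1,    "purchased": 1},
    {"age": 42, "refund_issued": 0,    "purchased": 1},
    {"age": 51, "refund_issued": 0,    "purchased": 1},
    {"age": 29, "refund_issued": None, "purchased": 0},
    {"age": 33, "refund_issued": None, "purchased": 0},
]

# A "predictor" that just checks whether the refund field is present at all.
predict = lambda r: int(r["refund_issued"] is not None)

# It is perfect on the training data -- the signature of leakage.
train_accuracy = sum(predict(r) == r["purchased"] for r in training) / len(training)

# But a future prospect record (purchased list, demographics only) carries no
# refund field, so the model cannot even be applied to the data it must score.
prospect = {"age": 40}
```

The perfect training accuracy comes entirely from the leak, not from any genuine demographic signal.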
Problem 2.7
2.7 A dataset has 1000 records and 50 variables with 5% of the values missing, spread randomly
throughout the records and variables. An analyst decides to remove records with missing values. About
how many records would you expect to be removed?
Answer to 2.7:
For a record to have all values present, it must avoid having a missing value (P = 0.95) in each of its
50 variables. The chance that a given record escapes having a missing value in two variables is 0.95 *
0.95 = 0.9025. The chance that a given record escapes having a missing value in all 50 variables is
(0.95)^50 = 0.0769. This implies that 1 - 0.0769 = 0.9231 (92.31%) of all records will have at least
one missing value and would be deleted: about 923 of the 1000 records.
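The arithmetic can be checked directly:

```python
# Each of the 50 values in a record is independently missing with probability 0.05,
# so a record is complete only if all 50 values are present.
p_complete = 0.95 ** 50              # probability a record has no missing values
n_records = 1000
expected_removed = n_records * (1 - p_complete)
print(p_complete, expected_removed)  # roughly 0.077 and 923
```
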
Problem 2.10
2.10 Two models are applied to a dataset that has been partitioned. Model A is considerably
more accurate than model B on the training data, but slightly less accurate than model B on the
validation data. Which model are you more likely to consider for final deployment?
Answer to 2.10:
We prefer the model with the lowest error on the validation data. Model A might be overfitting the
training data. We would therefore select model B for deployment on new data.
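A minimal numpy sketch with synthetic data illustrates the point; the dataset, polynomial degrees, and partition below are all hypothetical stand-ins for "flexible model A" and "simple model B":

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 30)
y = 2.0 * x + rng.normal(0.0, 0.2, size=x.size)        # noisy linear relationship
train, val = np.arange(0, 30, 2), np.arange(1, 30, 2)  # simple partition

def fit_and_score(degree):
    """Fit a polynomial on the training rows; return (training MSE, validation MSE)."""
    coefs = np.polyfit(x[train], y[train], degree)
    mse = lambda idx: float(np.mean((np.polyval(coefs, x[idx]) - y[idx]) ** 2))
    return mse(train), mse(val)

tr_a, va_a = fit_and_score(10)   # flexible "model A"
tr_b, va_b = fit_and_score(1)    # simple "model B"

# Model A always fits the training rows at least as well (its family nests model B's),
# but deployment goes to whichever model wins on the validation rows.
best = "A" if va_a < va_b else "B"
```

The degree-10 fit is guaranteed to have the lower training error, yet that tells us nothing about performance on new data; only the validation comparison does.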