

Predictive Analytics 1 – Machine Learning Tools
Week 1: Assignment 1 answers
Problem 2.1
2.1 Assuming that data mining techniques are to be used in the following cases, identify whether the
task required is supervised or unsupervised learning.
a. Deciding whether to issue a loan to an applicant based on demographic and financial data
(with reference to a database of similar data on prior customers).
Answer to 2.1.a:
This is supervised learning, because the database includes information on whether the loan was
approved or not.
b. In an online bookstore, making recommendations to customers concerning additional items
to buy based on the buying patterns in prior transactions.
Answer to 2.1.b:
This is unsupervised learning, because there is no apparent outcome (e.g., whether the
recommendation was adopted or not).
c. Identifying a network data packet as dangerous (virus, hacker attack) based on comparison to
other packets whose threat status is known.
Answer to 2.1.c:
This is supervised learning, because for the other packets the status is known.
d. Identifying segments of similar customers.
Answer to 2.1.d:
This is unsupervised learning, because there is no known outcome (though once you use unsupervised
learning to identify segments, you could use supervised learning to classify new customers into those
segments).
e. Predicting whether a company will go bankrupt based on comparing its financial data to those
of similar bankrupt and non-bankrupt firms.
Answer to 2.1.e:
This is supervised learning, because the status of the similar firms is known.
f. Estimating the repair time required for an aircraft based on a trouble ticket.
Answer to 2.1.f:
This is supervised learning, because there is likely to be knowledge of actual (historic) repair times of
similar repairs.
g. Automated sorting of mail by zip code scanning.
Answer to 2.1.g:
This is supervised learning, because the true zip codes of previously sorted mail are known and can
serve as labels for training the scanner.
h. Printing of custom discount coupons at the conclusion of a grocery store checkout based on
what you just bought and what others have bought previously.
Answer to 2.1.h:
This is unsupervised learning, if we assume that we do not know what will be purchased in the future.
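The two-stage idea mentioned in answer 2.1.d (first discover segments without any labels, then use them to classify new customers) can be sketched in a few lines. The customer features, cluster count, and initialization below are hypothetical, chosen only to illustrate the pattern:

```python
import numpy as np

rng = np.random.default_rng(1)
# hypothetical customer features: [annual spend ($1000s), store visits per month]
customers = np.vstack([rng.normal([2.0, 1.0], 0.3, (20, 2)),   # low-activity group
                       rng.normal([8.0, 6.0], 0.3, (20, 2))])  # high-activity group

# Unsupervised step: 2-means clustering discovers segments; no outcome labels are used.
centroids = customers[[0, -1]].copy()              # crude initialization
for _ in range(10):
    dists = ((customers[:, None, :] - centroids) ** 2).sum(axis=-1)
    labels = dists.argmin(axis=1)                  # segment assignment per customer
    centroids = np.array([customers[labels == k].mean(axis=0) for k in range(2)])

# Supervised step: the discovered segments now act as class labels, and a
# nearest-centroid classifier assigns a new customer to one of them.
new_customer = np.array([7.5, 5.5])
segment = int(((new_customer - centroids) ** 2).sum(axis=-1).argmin())
```

The clustering itself never sees an outcome; only after the segments exist do they become the "labels" for a supervised classification of new customers.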
Problem 2.6
2.6 In fitting a model to classify prospects as purchasers or nonpurchasers, a certain company drew the
training data from internal data that include demographic and purchase information. Future data to be
classified will be lists purchased from other sources, with demographic (but not purchase) data included.
It was found that “refund issued” was a useful predictor in the training data. Why is this not an
appropriate variable to include in the model?
Answer to 2.6:
The variable “refund issued” is unknown prior to the actual purchase, and is therefore not useful in a
predictive model of future purchase behavior. In fact, “refund issued” can be present only for actual
purchases and never for non-purchases. This explains why it was found to be so closely related to the
purchase outcome in the training data: it leaks the outcome rather than predicting it, and it will not
even be available in the purchased lists the model must score.
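To make the leakage concrete, here is a minimal sketch with made-up records; the field names and values are illustrative, not from the problem's actual database:

```python
# Hypothetical internal training records: "refund_issued" can exist only when a
# purchase actually happened (None = no purchase, so no refund was possible).
training = [
    {"age": 35, "refund_issued": 1,    "purchased": 1},
    {"age": 42, "refund_issued": 0,    "purchased": 1},
    {"age": 51, "refund_issued": 0,    "purchased": 1},
    {"age": 29, "refund_issued": None, "purchased": 0},
    {"age": 33, "refund_issued": None, "purchased": 0},
]

# A "predictor" that just checks whether the refund field is present at all.
predict = lambda r: int(r["refund_issued"] is not None)

# It is perfect on the training data -- the signature of leakage.
train_accuracy = sum(predict(r) == r["purchased"] for r in training) / len(training)

# But a future prospect record (purchased list, demographics only) carries no
# refund field, so the model cannot even be applied to the data it must score.
prospect = {"age": 40}
```

The perfect training accuracy comes entirely from the leak, not from any genuine demographic signal.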
Problem 2.7
2.7 A dataset has 1000 records and 50 variables with 5% of the values missing, spread randomly
throughout the records and variables. An analyst decides to remove records with missing values. About
how many records would you expect to be removed?
Answer to 2.7:
For a record to have all values present, it must avoid having a missing value (P = 0.95) in each of its
50 variables. The chance that a given record escapes having a missing value in two variables is 0.95 *
0.95 = 0.9025. The chance that a given record escapes having a missing value in all 50 variables is
(0.95)^50 = 0.0769. This implies that 1 - 0.0769 = 0.9231 (92.31%) of all records will have at least
one missing value and would be deleted: about 923 of the 1000 records.
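The arithmetic can be checked directly:

```python
# Each of the 50 values in a record is independently missing with probability 0.05,
# so a record is complete only if all 50 values are present.
p_complete = 0.95 ** 50              # probability a record has no missing values
n_records = 1000
expected_removed = n_records * (1 - p_complete)
print(p_complete, expected_removed)  # roughly 0.077 and 923
```
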
Problem 2.10
2.10 Two models are applied to a dataset that has been partitioned. Model A is considerably
more accurate than model B on the training data, but slightly less accurate than model B on the
validation data. Which model are you more likely to consider for final deployment?
Answer to 2.10:
We prefer the model with the lowest error on the validation data. Model A might be overfitting the
training data. We would therefore select model B for deployment on new data.
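A minimal numpy sketch with synthetic data illustrates the point; the dataset, polynomial degrees, and partition below are all hypothetical stand-ins for "flexible model A" and "simple model B":

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 30)
y = 2.0 * x + rng.normal(0.0, 0.2, size=x.size)        # noisy linear relationship
train, val = np.arange(0, 30, 2), np.arange(1, 30, 2)  # simple partition

def fit_and_score(degree):
    """Fit a polynomial on the training rows; return (training MSE, validation MSE)."""
    coefs = np.polyfit(x[train], y[train], degree)
    mse = lambda idx: float(np.mean((np.polyval(coefs, x[idx]) - y[idx]) ** 2))
    return mse(train), mse(val)

tr_a, va_a = fit_and_score(10)   # flexible "model A"
tr_b, va_b = fit_and_score(1)    # simple "model B"

# Model A always fits the training rows at least as well (its family nests model B's),
# but deployment goes to whichever model wins on the validation rows.
best = "A" if va_a < va_b else "B"
```

The degree-10 fit is guaranteed to have the lower training error, yet that tells us nothing about performance on new data; only the validation comparison does.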