Q1. For each data mining task below, indicate whether it is predictive modeling, association analysis, cluster analysis or anomaly detection. (5 points)

(a) Deciding whether to issue a loan to an applicant, based on demographic and financial data (with reference to a database of similar data on prior customers).
Answer: Predictive

(b) In an online bookstore, making recommendations to customers concerning additional items to buy, based on the buying patterns in prior transactions.
Answer: Predictive / Association

(c) Identifying a network data packet as dangerous (virus, hacker attack), based on comparison to other packets whose threat status is known.
Answer: Anomaly Detection

(d) Identifying segments of similar customers.
Answer: Cluster

(e) Printing of custom discount coupons at the conclusion of a grocery store checkout, based on what you just bought and what others have bought previously.
Answer: Association / Predictive

Q2. Classify the following attributes as binary, discrete or continuous. Also classify them as qualitative (nominal or ordinal) or quantitative (interval or ratio). (10 points)

Income: => Continuous - quantitative - ratio
Property area: => Continuous - quantitative - ratio
Ownership of boat (yes/no): => Binary - qualitative - nominal
Days of the week (coded Mon, Tue, Wed, …): => Discrete - qualitative - nominal
Number of beds in a hospital: => Discrete - quantitative - ratio
Final grades in an MBA class (A+, A, …): => Discrete - qualitative - ordinal
Petal length: => Continuous - quantitative - ratio
Iris flower type (virginica, etc.): => Discrete - qualitative - nominal
Shirt size (XS, S, M, L, XL): => Discrete - qualitative - ordinal
Frequent flier miles accumulated: => Continuous - quantitative - ratio

Q3.
TRUE/FALSE questions

__F__ Association Rule Mining is equivalent to Classification because in both cases rules are derived.
__T__ Dealing with high dimensionality is often a challenge in data mining.
__T__ The median and the mean are the same for a population that is normally distributed.
__T__ Outliers are data objects with characteristics that are considerably different than most of the other data objects in the data set.
__F__ Euclidean distance is the only similarity measure that can be used in cluster analysis.
__T__ Scatterplots are good data visualization tools.
__F__ Cluster analysis always provides a scientific, clear-cut answer to a segmentation problem.

Q4. MULTIPLE CHOICE QUESTIONS: (5 points)

5.1 Similarity between data points, when deciding on whether or not they belong to the same cluster, is measured by:
a. Distance measure
b. Whether or not they belong to the same prediction class
c. (a) and (b)
d. (a) or (b)
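
Note on distance measures (Q3 and 5.1): the sketch below illustrates why Euclidean distance is not the only measure available for cluster analysis. It compares two made-up data points under three common measures. The function names and the example points are illustrative assumptions and are not part of the exam material.

```python
import math

def euclidean(p, q):
    # Straight-line distance: square root of the summed squared differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # City-block distance: sum of the absolute differences.
    return sum(abs(a - b) for a, b in zip(p, q))

def cosine_similarity(p, q):
    # Cosine of the angle between the two vectors (undefined for a zero vector).
    dot = sum(a * b for a, b in zip(p, q))
    norms = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norms

# Hypothetical data points with two numeric attributes each.
x = (1.0, 0.0)
y = (4.0, 4.0)

print("Euclidean:", euclidean(x, y))
print("Manhattan:", manhattan(x, y))
print("Cosine similarity:", round(cosine_similarity(x, y), 4))
```

The same pair of points gets a different score under each measure, so the choice of measure can change which points end up grouped together.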