Solution Sketches
Final Exam COSC 4335 Data Mining, May 12, 2015

Your Name:                                Your student id:

Problem 1 --- Association and Sequence Mining [16]
Problem 2 --- PageRank [9]
Problem 3 --- Anomaly/Outlier Detection [9]
Problem 4 --- Classification and Other [16]
Problem 5 --- Preprocessing [10]

Grade:

The exam is "open books" and you have 95 minutes to complete the exam. The exam will count approx. 20% towards the course grade.

1) Association Rule and Sequence Mining [16]

a) Assume the Apriori-style sequence mining algorithm described on pages 429-435 of our textbook is used and the algorithm generated the frequent 3-sequences listed below. What candidate 4-sequences are generated from this 3-sequence set? Which of the generated 4-sequences survive the pruning step? Use the format of Figure 7.6 in the textbook on page 435 to describe your answer! [5]

Frequent 3-sequences    Candidate Generation    Candidates that survived pruning
<(1) (2 3)>             <(1) (2 3 4)>
<(1) (3 4)>             <(1) (2 3) (4)>
<(1 2) (3)>             <(1) (3 4) (5)>
<(2 3 4)>               <(1 2) (3) (4)>         none
<(2 3) (4)>             <(1 2) (3 4)>
<(2) (3) (4)>           <(2) (3 4) (5)>
<(2) (3 4)>
<(3 4) (5)>

Please verify my solution! One error: at most [4]; two errors: at most [3]; three errors: at most [1.5].

b) Assume you run APRIORI with a given support threshold on a supermarket transaction database and you receive a single frequent 7-itemset and no frequent 8-itemsets. What can be said about the total number of itemsets that are frequent in this case? [3]

2^7 [2.5], or to be exact 2^7 - 1, as the empty set is not a frequent itemset [3]; every non-empty subset of the frequent 7-itemset must itself be frequent.

c) Assume the itemset {A, B, D} is frequent; what rules will be generated from this itemset and how will it be determined that they satisfy the confidence threshold? [4]

All possible rules that contain all elements of {A, B, D} are generated; that is, 6 rules whose left-hand sides are {A}, {B}, {D}, {A,B}, {A,D}, and {B,D}, with the missing item(s) on the right-hand side. Each generated rule's confidence is computed by dividing the support of {A,B,D} by the support of the rule's left-hand side; as the left-hand side is a subset of {A,B,D}, it is frequent, and therefore its support is stored in the frequent-itemset table that has been computed by the APRIORI algorithm.

d) What are the difficulties in using association rule mining for data sets that contain a lot of continuous attributes? [4]

Association rule mining techniques have been designed for the discrete domain, and in order to apply this approach to datasets with continuous attributes, the continuous attributes have to be discretized. Discretization might lose valuable information and is quite difficult in practice. For example, if the intervals are too long, we may lose some patterns because of their lack of confidence; if the intervals are too short, we may lose some patterns because of their lack of support. Finally, rules that have neighboring intervals in their left-hand side and right-hand side might need to be postprocessed to capture application semantics.

2) PageRank [9]

a) Give the equation system that PAGERANK would set up for the webpage structure given below; d is assumed to be 0.8. [4]

[Figure omitted: the link structure over webpages P1, P2, P3, and P4; the links can be read off the solution equations below.]

PR(P1) = 0.2 + 0.8 * [ PR(P4)/2 ]
PR(P2) = 0.2 + 0.8 * [ PR(P1)/1 + PR(P3)/2 + PR(P4)/2 ]
PR(P3) = 0.2
PR(P4) = 0.2 + 0.8 * [ PR(P3)/2 ]

One error: at most [2.5] points; two errors: at most [1] point.

b) How is the equation system obtained as your answer to question a used to obtain the page rank of the 4 webpages? [2]

The PR values are initialized (e.g., with 1) and the equations are applied iteratively to update them until the values no longer change; the final PR values are reported as the page ranks of the 4 webpages.
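To make the iterative procedure described in b) concrete, here is a minimal Python sketch (not part of the original exam) that solves the equation system from a) by repeated updating; the convergence tolerance of 1e-9 and the iteration cap are illustrative assumptions.

    # Iteratively solve the PageRank equation system of Problem 2a (d = 0.8).
    d = 0.8
    pr = {p: 1.0 for p in ("P1", "P2", "P3", "P4")}   # initialize all PR values with 1

    for _ in range(1000):                             # update until convergence
        new = {
            "P1": (1 - d) + d * (pr["P4"] / 2),
            "P2": (1 - d) + d * (pr["P1"] / 1 + pr["P3"] / 2 + pr["P4"] / 2),
            "P3": (1 - d),                            # P3 has no incoming links
            "P4": (1 - d) + d * (pr["P3"] / 2),
        }
        converged = max(abs(new[p] - pr[p]) for p in pr) < 1e-9   # no more change
        pr = new
        if converged:
            break

    print(pr)   # the final values are the page ranks of P1..P4

For this particular equation system the iteration converges to PR(P1) = 0.312, PR(P2) = 0.6416, PR(P3) = 0.2, and PR(P4) = 0.28.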
c) What does the parameter d, the damping factor, model in PageRank's equation systems? How does it impact the solution of the equation system? [3]

It models the probability that a web surfer navigates through the web by following links [1.5]. If d is smaller, the page ranks of the less important pages increase (or, equivalently, the variance of the PageRank distribution over the webpages decreases) [1.5].

3) Anomaly/Outlier Detection [9]

a) Give a sketch of an approach that uses boxplots for outlier/anomaly detection! [4]

One possible solution: outliers are defined as objects whose distance above or below the box exceeds a given threshold; e.g., values above the 75th percentile + 1.5*IQR or below the 25th percentile - 1.5*IQR are defined as outliers.

b) How do model-based approaches to outlier detection work? Give a specific example of a model-based approach to outlier detection! [5]

Model-based approaches fit a probabilistic model / a density function M to the observations and then define outliers as observations whose probability is below a given threshold [2.5]; e.g., a normal distribution N(μ,σ) is fitted to the observations of a single attribute, and x is defined as an outlier if d_{μ,σ}(x) < θ, where θ is a density threshold and d_{μ,σ} is the density function of N(μ,σ) [2.5].

4) Classification and Other [16]

a) Compare k-Nearest Neighbor Classifiers and Decision Trees! What are the main differences between the two approaches? [5]

Nearest Neighbor Classifiers:
– Consider only the neighborhood to classify; a local classifier [0.5]
– Use multiple decision boundaries that are edges of convex polygons that can be computed using Voronoi tessellations [1]
– Lazy learner, so no model of the training data is built [0.5]
– No training cost [0.5]
– The distance function is critical! [0.5]

Decision Trees:
– Hierarchical decision-making scheme [0.5]
– Use multiple decision boundaries which are axis-parallel / rectangular [1]
– Build a model of the training data [0.5]
– (Somewhat) expensive to learn [0.5]
– Do not rely on distance functions but rather on the ordering of the attribute values [0.5]

At most 5 points; other answers might deserve credit!

b) The following dataset is given (depicted below), with A being a continuous attribute, and GINI is used as the evaluation function. What root test would be generated by the decision tree induction algorithm? Please justify your answer! [4]

A      Class
0.22   0
0.31   0
0.31   1
0.31   0
0.33   1
0.41   1
0.43   1

Root test: A >= 0.32 (i.e., the split between A = 0.31 and A = 0.33).

Possible splits, with (number of class-0, number of class-1) examples in the left and right partition:
A <= 0.22: (1,0); (2,4)
A <= 0.31: (3,1); (0,3)
A <= 0.33: (3,2); (0,2)
A <= 0.41: (3,3); (0,1)

As the split at A <= 0.31 creates partitions with purities of 75% and 100%, which is much higher than the purity of the other splits, this split will be selected. Alternatively, the GINI index for the 4 splits could be computed and the split with the highest GINI gain would be selected.
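The alternative mentioned above, computing the GINI index explicitly, can be sketched in a few lines of Python (not part of the original exam; the data and candidate thresholds are taken from the table above):

    # Evaluate the candidate splits of Problem 4b with the GINI index.
    data = [(0.22, 0), (0.31, 0), (0.31, 1), (0.31, 0),
            (0.33, 1), (0.41, 1), (0.43, 1)]

    def gini(labels):
        """GINI impurity of a list of 0/1 class labels: 1 - sum_c p_c^2."""
        n = len(labels)
        if n == 0:
            return 0.0
        p1 = sum(labels) / n                  # fraction of class-1 examples
        return 1.0 - (p1 ** 2 + (1 - p1) ** 2)

    for t in (0.22, 0.31, 0.33, 0.41):        # candidate tests of the form A <= t
        left  = [c for a, c in data if a <= t]
        right = [c for a, c in data if a > t]
        n = len(data)
        weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
        print(f"A <= {t}: weighted GINI = {weighted:.3f}")

Running this prints weighted GINI values of about 0.381, 0.214, 0.343, and 0.429, so the split at A <= 0.31 has the lowest weighted GINI (i.e., the highest GINI gain), confirming the choice of root test above.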
c) Assume you use 10-fold cross-validation to determine the accuracy of a classifier. How is the accuracy of the classifier computed when this approach is chosen? [3]

When this approach is chosen, there are 10 training-set/test-set pairs, and 10 models are generated, one from each training set; for each example in a test set it is determined whether or not it is classified correctly by the learnt model. As each example occurs in exactly one of the 10 test sets(1), the accuracy can be computed straightforwardly by: (number of examples classified correctly) / (total number of examples).

(1) If this fact is not mentioned: at most [2].

d) Assume you want to use support vector machines for classification problems involving more than two classes. Propose a generalization of support vector machines of your own preference for non-binary classification problems! [4 + 2 extra points]

Many possible solutions: e.g., you could learn k binary SVM classifiers, one per class, where the classifier for class Ck determines whether or not an example belongs to Ck; the class is then determined by voting, where the amount of each classifier's vote is the example's distance to the hyperplane if the classifier assigns the example to its class, and the negative distance otherwise.

5) Preprocessing [10]

a) Why is dimensionality reduction important in data mining? [2]
– Avoids the curse of dimensionality [0.5]
– Reduces the amount of time and memory required by data mining algorithms [0.5]
– Allows data to be more easily visualized [0.5]
– May increase accuracy by eliminating irrelevant and redundant features [0.5]
– Might reduce noise [0.5]
At most 2 points.

b) What is the goal of Principal Component Analysis? What is a principal component? How are principal components used for dimensionality reduction? [4]

The goal is to reduce the dimensionality of a dataset by selecting a subset of the principal components that captures a large proportion of the variation in the dataset [2]. Principal components are linearly uncorrelated variables (linear combinations of the original attributes in the dataset) [2]. Principal components are computed in the order of their contribution to the variation of the dataset, and this process is stopped once a certain percentage of the variation in the dataset is captured by the first k principal components [1].
At most 4 points!

c) Assume you use association rule mining for a very, very large transaction database. How could sampling be used to mine the transaction database in a reasonable time? [4]

Many possible solutions: e.g., create a sample of the transaction database (e.g., take 10,000 of 1,000,000 transactions), mine association rules on the sample, re-compute the confidence and support of the obtained association rules on the whole transaction database, and drop those rules from the obtained rule set whose support or confidence is below the respective threshold.
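As an illustration of this sampling strategy, here is a minimal Python sketch (not part of the original exam). The sample size, the thresholds, and the helper mine_rules, which stands in for any association rule miner such as an Apriori implementation, are assumptions made for the example.

    import random

    # Sampling-based association rule mining, as sketched in 5c.
    # `transactions` is a list of itemsets (Python sets of items); `mine_rules`
    # is a hypothetical rule miner returning a list of (lhs, rhs) set pairs.

    def support(itemset, transactions):
        """Fraction of transactions that contain every item of `itemset`."""
        return sum(1 for t in transactions if itemset <= t) / len(transactions)

    def mine_with_sampling(transactions, mine_rules, sample_size=10_000,
                           min_sup=0.01, min_conf=0.5):
        # 1. Mine rules on a random sample only (cheap).
        sample = random.sample(transactions, min(sample_size, len(transactions)))
        candidate_rules = mine_rules(sample, min_sup, min_conf)

        # 2. Re-compute support and confidence on the full database and keep
        #    only the rules that still satisfy both thresholds.
        kept = []
        for lhs, rhs in candidate_rules:
            sup_all = support(lhs | rhs, transactions)
            conf = sup_all / support(lhs, transactions)
            if sup_all >= min_sup and conf >= min_conf:
                kept.append((lhs, rhs, sup_all, conf))
        return kept

Rules that satisfied the thresholds only on the sample are filtered out in the second step, which is exactly the re-checking on the whole database described in the answer above.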