CMPT-741 Data Mining, Fall 2009
Martin Ester

Sample Exam with Solutions

Problem 1 (Multiple Choice)

Mark the correct answers for the following questions either directly in the table or on a separate sheet. Note that multiple answers may be correct.

a) What clustering algorithms can find clusters of arbitrary shape?
   k-means | DBSCAN | CLARANS | Single-Link (agglomerative hierarchical)

b) The silhouette coefficient is a method to determine the natural number of clusters for ...
   partitioning algorithms | hierarchical algorithms | density-based algorithms | subspace clustering algorithms

c) What clustering algorithms optimize an objective function?
   k-means | DBSCAN | CLARANS | Single-Link (agglomerative hierarchical)

d) What classifiers are normally considered to be easy to interpret?
   SVM | Linear Regression | Decision trees | k-Nearest Neighbor

e) Disjoint training and test datasets are required to estimate the classification performance on ...
   the training dataset | the test dataset | the entire population

f) The confidence of the estimate of classification performance (question e) increases with ...
   increasing training dataset size | decreasing training dataset size | increasing test dataset size | decreasing test dataset size

g) A common weakness of association rule mining is that ...
   it is too inefficient | it produces too many rules | it produces not enough interesting rules | it produces too many frequent itemsets

h) Which of the following are interestingness measures for association rules?
   accuracy | recall | compactness | lift

i) Knowledge discovered in a KDD process should ...
   be implicit in the given data | be explicit in the given data | always confirm the expectations of a domain expert | be useful for a given application

j) What are typical preprocessing tasks?
   data discretization | clustering | visualization | feature selection


Problem 2 (Association Rules)

Consider the following transaction database:

TransID   Items
T100      A, B, C, D
T200      A, B, C, E
T300      A, B, E, F, H
T400      A, C, H

Suppose that minimum support is set to 50% and minimum confidence to 60%.

a) List all frequent itemsets together with their support.

1-itemsets: A 100%;  B 75%;  C 75%;  E 50%;  H 50%
2-itemsets: A,B 75%;  A,C 75%;  A,E 50%;  A,H 50%;  B,C 50%;  B,E 50%
3-itemsets: A,B,C 50%;  A,B,E 50%

b) Which of the itemsets from a) are closed? Which of the itemsets from a) are maximal?

Closed:  A;  A,B;  A,C;  A,H;  A,B,C;  A,B,E
Maximal: A,H;  A,B,C;  A,B,E

c) For all frequent itemsets of maximal length, list all corresponding association rules satisfying the requirements on (minimum support and) minimum confidence together with their confidence.

From A,B,C (support 50%):
A,B → C     66%
A,C → B     66%
B,C → A    100%
B   → A,C   66%
C   → A,B   66%

From A,B,E (support 50%):
A,B → E     66%
A,E → B    100%
B,E → A    100%
B   → A,E   66%
E   → A,B  100%

(The rules A → B,C and A → B,E have only 50% confidence and therefore do not qualify.)

d) The lift of an association rule is defined as follows: lift = confidence / support(head). Compute the lift for the association rules from c).

A,B → C     66%    lift = 0.66 / 0.75 = 0.89
A,C → B     66%    lift = 0.66 / 0.75 = 0.89
B,C → A    100%    lift = 1.00 / 1.00 = 1.00
B   → A,C   66%    lift = 0.66 / 0.75 = 0.89
C   → A,B   66%    lift = 0.66 / 0.75 = 0.89

A,B → E     66%    lift = 0.66 / 0.50 = 1.33
A,E → B    100%    lift = 1.00 / 0.75 = 1.33
B,E → A    100%    lift = 1.00 / 1.00 = 1.00
B   → A,E   66%    lift = 0.66 / 0.50 = 1.33
E   → A,B  100%    lift = 1.00 / 0.75 = 1.33

Why are only those association rules interesting that have a lift (significantly) larger than 1.0?

The lift measures the ratio of the confidence to the expected confidence, assuming independence of the body and the head of the association rule. Under this independence assumption, the expected support of the rule is support(body) * support(head), so that

expected_confidence = support(body) * support(head) / support(body) = support(head)

and therefore

lift = confidence / support(head) = confidence / expected_confidence.

Only those association rules are interesting for which the actual confidence is significantly larger than the expected confidence.
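For a database of this size, the supports, confidences and lifts above can be verified by brute-force enumeration. The following is a minimal sketch in plain Python; the helper names and output format are illustrative choices, not prescribed by the exam. It lists all frequent itemsets and derives the rules of c) and d) from the two maximal-length itemsets.

```python
from itertools import combinations

# Transaction database from Problem 2
transactions = {
    "T100": {"A", "B", "C", "D"},
    "T200": {"A", "B", "C", "E"},
    "T300": {"A", "B", "E", "F", "H"},
    "T400": {"A", "C", "H"},
}
MIN_SUP, MIN_CONF = 0.5, 0.6
N = len(transactions)

def support(itemset):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in transactions.values()) / N

# Brute-force enumeration of all frequent itemsets (fine for 8 distinct items).
items = sorted(set().union(*transactions.values()))
frequent = {}
for size in range(1, len(items) + 1):
    for combo in combinations(items, size):
        s = support(frozenset(combo))
        if s >= MIN_SUP:
            frequent[frozenset(combo)] = s

for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(",".join(sorted(itemset)), f"{s:.0%}")

# Rules body -> head from the maximal-length frequent itemsets A,B,C and A,B,E.
for itemset in (frozenset("ABC"), frozenset("ABE")):
    for k in range(1, len(itemset)):
        for body in map(frozenset, combinations(sorted(itemset), k)):
            head = itemset - body
            conf = frequent[itemset] / support(body)
            if conf >= MIN_CONF:
                lift = conf / support(head)
                print(f"{','.join(sorted(body))} -> {','.join(sorted(head))}:",
                      f"conf = {conf:.2f}, lift = {lift:.2f}")
```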
Problem 3 (Clustering Mixed Data)

The EM clustering algorithm is usually applied to numerical data, assuming that the points of a cluster follow a multi-dimensional Normal distribution. With a modified cluster representation (probability distribution), the algorithm can also handle data with both numerical and categorical attributes. Suppose that data records have the format (x_1, ..., x_d, x_{d+1}, ..., x_{d+e}), where the first d attributes are numerical and the remaining e attributes are categorical.

a) How can you represent the (conditional) probability distribution of a single categorical attribute A_{d+i} with k values a_1, ..., a_k (given class c_j)?

Use a simple histogram of the relative frequencies Frequency(a_1), ..., Frequency(a_k) and set P(x_{d+i} | c_j) = Frequency(x_{d+i}).

b) The d numerical attributes are still assumed to follow a d-dimensional Normal distribution, i.e.

    P(x_1, \ldots, x_d \mid c_j) = \frac{1}{\sqrt{(2\pi)^d \, |\Sigma|}} \; e^{-\frac{1}{2} ((x_1, \ldots, x_d) - \mu)^T \, \Sigma^{-1} \, ((x_1, \ldots, x_d) - \mu)}

What assumption on the dependency of the numerical attributes and the categorical attributes can you make that allows you to compute the (conditional) joint probability distribution of all d + e attributes?

Assumption: each categorical attribute is conditionally independent (given the class) of the d numerical attributes and of the other categorical attributes.

c) With the solutions to a) and b), what is the (conditional) joint probability distribution of all d + e attributes given class c_j?

    P(x_1, \ldots, x_d, x_{d+1}, \ldots, x_{d+e} \mid c_j) = P(x_1, \ldots, x_d \mid c_j) \cdot \prod_{i=1}^{e} P(x_{d+i} \mid c_j)
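The factorization in c) translates directly into code: evaluate the Gaussian density for the numerical part and multiply in one histogram lookup per categorical attribute. The sketch below is a minimal illustration using NumPy/SciPy; the class name MixedClusterModel and the parameter layout are assumptions made here, not a prescribed interface.

```python
import numpy as np
from scipy.stats import multivariate_normal

class MixedClusterModel:
    """One cluster c_j: a d-dimensional Normal distribution for the numerical
    attributes plus one relative-frequency histogram per categorical attribute."""

    def __init__(self, mean, cov, histograms):
        self.mean = np.asarray(mean)       # mu, shape (d,)
        self.cov = np.asarray(cov)         # Sigma, shape (d, d)
        self.histograms = histograms       # list of dicts: value -> relative frequency

    def likelihood(self, x_num, x_cat):
        """P(x_1,...,x_d, x_{d+1},...,x_{d+e} | c_j) via the factorization in c)."""
        p = multivariate_normal.pdf(x_num, mean=self.mean, cov=self.cov)
        for hist, value in zip(self.histograms, x_cat):
            p *= hist.get(value, 0.0)      # P(x_{d+i} | c_j) = Frequency(x_{d+i})
        return p

# Example: d = 2 numerical attributes, e = 1 categorical attribute with values a1, a2, a3.
cluster = MixedClusterModel(
    mean=[0.0, 1.0],
    cov=[[1.0, 0.2], [0.2, 2.0]],
    histograms=[{"a1": 0.5, "a2": 0.3, "a3": 0.2}],
)
print(cluster.likelihood(x_num=[0.1, 1.2], x_cat=["a2"]))
```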
Problem 4 (XML Classification)

Consider the task of classifying XML documents. Compared to simple text documents, XML documents not only have textual content, but also a hierarchical structure represented by the XML tags. An XML document can be modeled as a labeled, ordered and rooted tree, where a node represents an element, an attribute, or a value. We simplify the situation by ignoring attributes. Thus, inner nodes of an XML tree represent elements, and leaf nodes represent values, i.e. the textual content (a string). The following is an (incomplete) example of an XML document representing a scientific paper:

<paper>
  <title>TTTTTTT1 TTTTTT2</title>
  <author>AAAAA</author>
  <author>BBBBBB</author>
  <section>
    <title>RRRR1 RRRRRRR2</title>
    <text>SSS1 SSSS2 SSS3 SSS4</text>
  </section>
  <section>
    <title>UUUU1 UUUUUUU2</title>
    <text>VVVV1 VVV2 VVVV3</text>
  </section>
  <references>WWW1 WW2 WW3 W4</references>
</paper>

The XML structure defines the context of a given term, and one and the same term can appear in different contexts (elements) with different meanings. As an example, it makes a difference whether a term appears within the title of the paper or within the body of one of its sections. As another example, it makes a difference whether a term appears within one of its author elements or within the references element (as the author of a referenced paper).

a) In order to apply standard classification methods to XML documents, we want to represent each XML document by a feature vector. The features shall capture the content (terms) as well as the structure (elements in which the terms appear). Define an appropriate set of features and provide some illustrating examples.

We define one feature for every combination of an element tag and a term that appears in at least a certain percentage of the documents, e.g. <author> AAAAA, <author> BBBBBB, <title> TTTTTTT.

b) Suppose that we have 1000 relevant terms and 50 relevant XML elements (tags). How many features (dimensions) do you generate? What problems does such a dimensionality of the feature vectors create? What standard classification method do you recommend? Explain your answers.

There are up to 1000 * 50 = 50,000 features. Note that we only consider combinations of an element and a term that actually appear in the document set. Most features have a value of zero for most of the XML documents. Text feature vectors are always sparse, but the problem gets worse with our XML feature transformation, since it further blows up the number of features. SVMs are the most appropriate classifiers, since they can deal well with very high-dimensional datasets.

c) Suppose that you want to consider the entire context of terms. For example, you want to distinguish between a term appearing in the title of the paper and the same term appearing in the title of one of the sections. How can you extend your feature set (solution to a) to accommodate this requirement? Provide some illustrating examples.

We define one feature for every combination of a path of element tags (starting from the root element) and a term, e.g. <title> TTTTTTT, <section> <title> RRRR, <section> <text> SSS.
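The two feature transformations from a) and c) can be sketched with Python's standard xml.etree.ElementTree, here applied to a shortened version of the example document; the function names and the (tag, term) / (path, term) feature keys are illustrative choices.

```python
from collections import Counter
from xml.etree import ElementTree

doc = """<paper>
  <title>TTTTTTT1 TTTTTT2</title>
  <author>AAAAA</author>
  <section><title>RRRR1 RRRRRRR2</title><text>SSS1 SSSS2</text></section>
  <references>WWW1 WW2</references>
</paper>"""

def tag_term_features(xml_string):
    """Solution a): one feature per (element tag, term) combination."""
    features = Counter()
    for element in ElementTree.fromstring(xml_string).iter():
        for term in (element.text or "").split():
            features[(element.tag, term)] += 1
    return features

def path_term_features(xml_string):
    """Solution c): one feature per (root-to-element path, term) combination."""
    features = Counter()

    def walk(element, path):
        path = path + (element.tag,)
        for term in (element.text or "").split():
            features[(path, term)] += 1
        for child in element:
            walk(child, path)

    walk(ElementTree.fromstring(xml_string), ())
    return features

# a) cannot tell a paper title from a section title: both map to the tag 'title'.
print(tag_term_features(doc))
# c) keeps them apart: path ('paper', 'title') vs. ('paper', 'section', 'title').
print(path_term_features(doc))
```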
Problem 5 (Clustering and Classification Algorithms)

To compare the capabilities of some popular clustering and classification algorithms, provide sample datasets that cannot be dealt with accurately by one algorithm, but can be by the other. In every case, explain why one of the algorithms fails to discover the correct clusters or classes.

a) Draw a 2-dimensional dataset with two clusters that can be discovered with 100% accuracy by DBSCAN, but not by k-means.

[Figure: a 2-dimensional point set forming two arbitrarily shaped clusters. Solid boundaries: k-means result; dashed boundaries: DBSCAN result.]

k-means represents a cluster by its centroid (a point) and assigns every point to the closest centroid; it can therefore only produce convex clusters and fails on the two arbitrarily shaped clusters that DBSCAN discovers correctly.

b) Draw a 2-dimensional dataset with two clusters that can be discovered with 100% accuracy by k-means, but not by DBSCAN.

[Figure: two clusters, each with a dense core and a sparser periphery, connected by a sparse bridge. Solid boundaries: k-means result; dashed boundaries: DBSCAN result.]

The two clusters have a dense core and a sparser periphery. With a high MinPts value, DBSCAN finds only the dense cores, but misses the periphery. With a low MinPts value, DBSCAN merges both clusters into one, since the periphery of the two clusters has the same density as the bridge between them.

c) Draw a 2-dimensional dataset with two classes that can be classified with 100% accuracy (on the training dataset) by a decision tree, but not by a linear SVM.

[Figure: training examples of two classes (+ and -) that are not linearly separable. Solid: decision-tree boundary; dashed: SVM boundary.]

These two classes are not linearly separable, but can easily be separated by a first vertical and a second horizontal split.

d) Draw a 2-dimensional dataset with two classes that can be classified with 100% accuracy (on the training dataset) by a linear SVM, but not by a 3-Nearest-Neighbour classifier.

[Figure: linearly separable + and - training examples; a solid circle marks a positive example that lies close to the class boundary, with three negative examples as its nearest neighbours. Solid: 3NN; dashed: SVM.]

The two classes are linearly separable. The 3NN classifier misclassifies the positive training example marked by the solid circle, because the three negative training examples below the separating hyperplane are closer to it than any other training example.
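The scenario in d) can be reproduced numerically with scikit-learn. The point coordinates below are an illustrative configuration along the lines of the description, not taken from the exam figure: the classes are linearly separable, so a linear SVM with a large C (approximating a hard margin) reaches 100% training accuracy, while the 3-Nearest-Neighbour classifier outvotes the lone positive example near the boundary.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Linearly separable classes: positives (label 1) above the line y ~ 0.85,
# negatives (label 0) below it. The positive example at (3, 1.2) has three
# negative examples as its nearest neighbours.
X = np.array([[0, 5], [2, 5], [4, 5], [3, 1.2],      # positive examples
              [2.5, 0.5], [3, 0.5], [3.5, 0.5],      # negatives just below the boundary
              [0, 0], [5, 0], [1, 0.5]])             # further negatives
y = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])

svm = SVC(kernel="linear", C=100.0).fit(X, y)        # large C ~ hard-margin linear SVM
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

print("linear SVM training accuracy:", svm.score(X, y))   # 1.0
print("3-NN training accuracy:      ", knn.score(X, y))   # 0.9: (3, 1.2) is outvoted
```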