Final Exam
COSC 6335 Data Mining
December 6, 2011
Your Name:
Your student id:
Problem 1 --- Clustering [17]
Problem 2 --- Association and Sequence Mining [14]
Problem 3 --- kNN and Support Vector Machines [16]
Problem 4 --- Preprocessing [11]
Problem 5 --- Spatial Data Mining [5]
Problem 6 --- Data Mining in General [7]
:
Grade:
The exam is “open book” and you have 95 minutes to complete it.
The exam counts approximately 33% towards the course grade.
1) Clustering [17]
a) How do hierarchical clustering algorithms, such as AGNES, form dendrograms? [2]
Starting with a clustering in which each object forms its own cluster, the two clusters
that are closest to each other with respect to a distance metric (such as the average or
minimum distance between clusters) are merged, until a single cluster is obtained which
contains all objects of the dataset. Clusters that are merged are connected in the
dendrogram.
b) Compute the Silhouette for the following clustering that consists of 2 clusters:
{(0,0), (0,1), (3,3)}, {(4,3), (4,4)}; use Manhattan distance for distance computations.
Compute each point’s silhouette; interpret the results (what do they say about the
clustering of the 5 points; the overall clustering?)! [6]
Zechun adds the solution for the first part (see the sketch below).
Computing the silhouettes: 4 points; one error: at most 2 points; two errors: 0 points.
Interpretation: should say that the silhouette for (3,3) is bad [0.5], that the silhouette
for cluster 1 is not good [0.5], and that all other silhouettes are reasonably good [1].
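A minimal sketch for the first part, assuming the standard silhouette coefficient
s(i) = (b(i) - a(i)) / max(a(i), b(i)) with Manhattan distance (variable names are ours,
not from the exam):

    def manhattan(p, q):
        return sum(abs(a - b) for a, b in zip(p, q))

    clusters = [[(0, 0), (0, 1), (3, 3)], [(4, 3), (4, 4)]]

    for ci, cluster in enumerate(clusters):
        for p in cluster:
            # a(i): average distance to the other points of p's own cluster
            a = sum(manhattan(p, q) for q in cluster if q != p) / (len(cluster) - 1)
            # b(i): average distance to the points of the nearest other cluster
            b = min(sum(manhattan(p, q) for q in other) / len(other)
                    for cj, other in enumerate(clusters) if cj != ci)
            print(p, round((b - a) / max(a, b), 3))
    # Prints roughly: (0,0) 0.533, (0,1) 0.538, (3,3) -0.727, (4,3) 0.786,
    # (4,4) 0.824 -- consistent with the interpretation above: (3,3) is badly
    # placed, cluster 1 is mediocre overall, and cluster 2 is well separated.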
c) Assume you apply k-means to a dataset which contains outliers; in particular, assume
you apply k-means to the 2D dataset Y={(-100,-100), (0,0), (1,1), (0,1), (1,2), (5,4),
(5,5), (5,6)} with k=2. How do outliers (e.g. the point (-100,-100)) impact the k-means
clustering result? Propose an approach that alleviates the strong influence of outliers on
clustering results. If your approach were used to cluster dataset Y, how would the
result change? [6]
Leads to a clustering where the outlier forms a single cluster and all remaining points
end up in the other cluster [2].
Method, e.g. removing outliers in preprocessing or using a medoid-based algorithm
such as K-medoids [2.5].
After the method is applied, points 2-5 form one cluster and the last 3 points form the
other [1.5].
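A small sketch of the effect, assuming scikit-learn (the remedy shown, dropping the
outlier first, is only one of several acceptable methods):

    import numpy as np
    from sklearn.cluster import KMeans

    Y = np.array([[-100, -100], [0, 0], [1, 1], [0, 1],
                  [1, 2], [5, 4], [5, 5], [5, 6]])
    # With the outlier present, k-means "wastes" one of its two centroids on
    # (-100,-100); the remaining seven points all land in the other cluster.
    print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Y))
    # After removing the outlier, the natural two-cluster structure emerges:
    # points 2-5 in one cluster, the last three points in the other.
    print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Y[1:]))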
d) The Top Ten Data Mining Algorithms article suggests using single-link hierarchical
clustering to post-process K-means clustering results, by running the hierarchical
clustering algorithm for a few iterations on the K-means clustering result. What
advantage(s) do you see in this approach? [3]
Clusters with non-convex shapes can be obtained!
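A sketch of the idea, assuming scikit-learn: over-cluster with K-means, then merge the
centroids with single-link; on a dataset like two moons this can recover the non-convex
clusters:

    import numpy as np
    from sklearn.cluster import KMeans, AgglomerativeClustering
    from sklearn.datasets import make_moons

    X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)
    km = KMeans(n_clusters=20, n_init=10, random_state=0).fit(X)  # over-cluster
    # single-link merging of the 20 centroids down to 2 clusters
    merge = AgglomerativeClustering(n_clusters=2, linkage="single")
    centroid_labels = merge.fit_predict(km.cluster_centers_)
    labels = centroid_labels[km.labels_]  # map each point to its merged cluster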
2) Association Rule and Sequence Mining [14]
a) Assume the Apriori-style sequence mining algorithm described on pages 429-435 of
our textbook is used and the algorithm generated 3-sequences listed below:
Frequent 3-sequences | Candidate Generation | Candidates that survived pruning
<(1) (2 3)>
<(1) (2) (4)>
<(1) (3) (4)>
<(1 2) (3)>
<(2 3 4)>
<(2 3) (4)>
<(2) (3) (4)>
What candidate 4-sequences are generated from this 3-sequence set? Which of the
generated 4-sequences survive the pruning step? Use the format of Figure 7.6 in the
textbook on page 435 to describe your answer! [5]
Zechun adds the solution (see the sketch below).
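A minimal sketch of the candidate generation and pruning steps, assuming the merge rule
of the textbook’s Apriori-style (GSP) algorithm: s1 and s2 merge when dropping s1’s first
item yields the same sequence as dropping s2’s last item. Sequences are represented as
tuples of elements; elements are tuples of items:

    F3 = [((1,), (2, 3)), ((1,), (2,), (4,)), ((1,), (3,), (4,)),
          ((1, 2), (3,)), ((2, 3, 4),), ((2, 3), (4,)), ((2,), (3,), (4,))]

    def drop_first(seq):
        # sequence obtained by removing the first item of the first element
        rest = seq[0][1:]
        return ((rest,) if rest else ()) + seq[1:]

    def drop_last(seq):
        # sequence obtained by removing the last item of the last element
        rest = seq[-1][:-1]
        return seq[:-1] + ((rest,) if rest else ())

    def merge(s1, s2):
        if drop_first(s1) != drop_last(s2):
            return None
        last = s2[-1][-1]
        if len(s2[-1]) == 1:                   # last item was an element by itself,
            return s1 + ((last,),)             # so it starts a new element
        return s1[:-1] + (s1[-1] + (last,),)   # otherwise it joins s1's last element

    def one_item_subsequences(seq):
        # all 3-subsequences obtained by removing a single item
        for i, elem in enumerate(seq):
            for j in range(len(elem)):
                rest = elem[:j] + elem[j + 1:]
                yield seq[:i] + ((rest,) if rest else ()) + seq[i + 1:]

    candidates = [c for s1 in F3 for s2 in F3 if (c := merge(s1, s2)) is not None]
    survivors = [c for c in candidates
                 if all(s in F3 for s in one_item_subsequences(c))]
    print(candidates)  # <(1)(2 3 4)>, <(1)(2 3)(4)>, <(1 2)(3)(4)>
    print(survivors)   # only <(1)(2 3)(4)> survives the pruning step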
b) Give a sketch of an algorithm which creates association rules from the frequent
itemsets (which have been computed using APRIORI)! [5]
No solution given; one possible sketch appears below.
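A brute-force sketch (ours, not from the answer key): for every frequent itemset, try
each non-empty proper subset as an antecedent and keep the rules that reach the
confidence threshold. It assumes APRIORI’s output is a dict mapping frozenset itemsets
to support counts; since all subsets of a frequent itemset are frequent, every
antecedent lookup succeeds:

    from itertools import combinations

    def gen_rules(freq, minconf):
        """freq: dict {frozenset itemset: support count}; yields (A, B, conf)
        for every rule A -> B with conf = sup(A u B) / sup(A) >= minconf."""
        for itemset, sup in freq.items():
            if len(itemset) < 2:
                continue
            for r in range(1, len(itemset)):
                for ante in map(frozenset, combinations(itemset, r)):
                    conf = sup / freq[ante]
                    if conf >= minconf:
                        yield ante, itemset - ante, conf

The efficient textbook version additionally prunes: if A -> B fails the confidence
threshold, every rule that moves further items from A into the consequent fails as well,
because shrinking the antecedent can only increase sup(A) and thus lower the confidence.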
c) Assume you run APRIORI with a given support threshold on a supermarket
transaction database and there is a 40-itemset which is frequent. What can be said about
how many frequent itemsets APRIORI needs to compute for this example (hint: answer
this question by giving a lower bound for the number of itemsets which are frequent in
this case)? Based on your answer to the previous question, do you believe it is
computationally feasible to compute all frequent itemsets? [4]
As the 40-itemset is frequent, all of its subsets are frequent; consequently there are at
least 2^40 ≈ 10^12 frequent itemsets [3]; it will not be computationally feasible to
compute such a large number of frequent itemsets [1].
3) kNN, SVM, & Ensembles [16]
a) The soft margin support vector machine solves the following optimization problem:
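Presumably the standard soft-margin primal, with slack variables ξi and trade-off
parameter C:

    \min_{\mathbf{w},\,b,\,\boldsymbol{\xi}} \ \frac{1}{2}\lVert\mathbf{w}\rVert^{2}
        + C\sum_{i}\xi_{i}
    \quad \text{subject to} \quad
    y_{i}(\mathbf{w}\cdot\mathbf{x}_{i}+b) \ge 1-\xi_{i},
    \ \ \xi_{i}\ge 0\ \ \forall i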
What does the first term minimize? Depict all non-zero ξi in the figure below! What is
the advantage of the soft margin approach over the linear SVM approach? [5]
The first term minimizes ||w||², which maximizes the width of the margin between the
class-1 and class-2 hyperplanes [1]. Depiction [2; 2 errors = 0 points]. The soft margin
approach can deal with classification problems in which the examples are not linearly
separable [2].
b) Referring to the figure above, explain how examples are classified by SVMs! What is
the relationship between ξi and example i being classified correctly? [4]
Examples which are above the straight-line hyperplane belong to the round class, and
examples below the line belong to the square class [1.5]. An example will be classified
correctly if ξi is less than or equal to half of the width of the margin; the width is the
distance between the class-1 and class-2 hyperplanes [2.5].
c) How does kNN (k-nearest-neighbor) predict the class label of a new example? [2]
Find the k nearest neighbors of the example which needs to be classified; take a
majority vote based on the class labels of the k nearest neighbors found.
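A minimal sketch of this prediction rule (function and variable names are ours):

    from collections import Counter

    def knn_predict(train, query, k, dist):
        """train: list of (point, label) pairs; dist: any distance function.
        Returns the majority label among the k training points closest to query."""
        neighbors = sorted(train, key=lambda ex: dist(ex[0], query))[:k]
        return Counter(label for _, label in neighbors).most_common(1)[0][0]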
d) Assume you want to use a nearest neighbor classifier for a particular classification
task. How would you approach the problem of choosing the parameter k of a nearest
neighbor classifier? [3]
Use N-fold cross-validation to assess the testing accuracy of the kNN classifier for
different values of k; choose for your final classifier the k for which the testing
accuracy is highest!
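A sketch of this model-selection loop, assuming scikit-learn and a stand-in dataset:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)        # stand-in dataset for illustration
    scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                 X, y, cv=10).mean()
              for k in range(1, 22, 2)}      # odd k values avoid voting ties
    best_k = max(scores, key=scores.get)     # k with best cross-validated accuracy
    print(best_k, round(scores[best_k], 3))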
e) What can be said about the number of decision boundaries a kNN classifier uses? [2]
Many [2]; as many as N-1, where N is the number of training examples [1 extra point].
4) Preprocessing [11]
a) What is the goal of dimensionality reduction techniques, such as PCA—for what are
they used in practice? [4]
– Avoid the curse of dimensionality: it is much more difficult to find interesting
patterns in high-dimensional spaces [1]
– Reduce the amount of time and memory required by data mining algorithms [1]
– Facilitate data visualization [1]
– Eliminate irrelevant features [0.5]
– Reduce noise [0.5]
– Reduce redundancies [0.5]
At most 4 points!
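As an illustration of the first two points, a small sketch assuming scikit-learn, with
hypothetical low-rank data:

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    latent = rng.normal(size=(200, 5))           # 5 latent dimensions...
    X = (latent @ rng.normal(size=(5, 50))
         + 0.1 * rng.normal(size=(200, 50)))     # ...embedded noisily in 50
    pca = PCA(n_components=0.95)   # keep enough components for 95% of variance
    X_red = pca.fit_transform(X)
    print(X.shape, "->", X_red.shape)            # (200, 50) -> (200, ~5)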
b) Assume you have to mine association rules for a very large transaction database which
contains 9,000,000 transactions. How could sampling be used to speed up association
rule mining? Give a sketch of an approach that uses sampling to speed up association
rule mining! [5]
One solution:
1. Run the association rule mining algorithm on a much smaller sample (e.g.
500,000 transactions) with slightly lower support and confidence thresholds [-1 if
the same thresholds are used], obtaining a set of association rules R.
2. For each rule in R, go through the complete transaction database and compute its
support and confidence, and prune rules which violate the confidence or support
thresholds.
3. Return the surviving rules.
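A sketch of this sample-then-verify scheme; mine_rules stands for any rule miner (e.g.
APRIORI plus rule generation) and is a hypothetical placeholder, as is the choice of 0.9
for loosening the thresholds:

    import random

    def sampled_rules(db, mine_rules, minsup, minconf, sample_size=500_000):
        """db: list of transactions (frozensets of items).
        1) mine rules on a random sample with loosened thresholds,
        2) verify each candidate rule against the full database."""
        sample = random.sample(db, min(sample_size, len(db)))
        candidates = mine_rules(sample, 0.9 * minsup, 0.9 * minconf)  # hypothetical
        n, kept = len(db), []
        for ante, cons in candidates:
            sup_a = sum(ante <= t for t in db)            # transactions containing A
            sup_ab = sum((ante | cons) <= t for t in db)  # transactions with A u B
            if sup_ab / n >= minsup and sup_ab / max(sup_a, 1) >= minconf:
                kept.append((ante, cons))
        return kept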
c) What does it mean if an attribute is irrelevant for a classification problem? [2]
It does not provide any useful information for distinguishing between the different
classes.
5) Spatial Data Mining [5]
What are the main challenges in mining spatial datasets? How does mining spatial
datasets differ from mining business datasets?
Autocorrelation/Tobler’s first law [1.5], attribute space is continuous [0.5], no
clearly defined transactions [0.5], complex spatial datatypes such as polygons [1],
separation between spatial and non-spatial attributes [1], importance of putting
results on maps [0.5], regional knowledge [0.5], a lot of objects/a lot of patterns [0.5].
At most 5 points!
6) Data Mining in General [7]
Assume you own an online book store which sells books over the internet. How can your
business benefit from data mining? Limit your answer to 7-10 sentences!
No solution given!