Review Questions for November 11

advertisement
Questions and Topics Review
November 11, 2010
1. Discussion of Midterm Exam
2. Assume an association rule if smoke then cancer has a confidence of 86% and a high lift of
5.4. What does this tell you about the relationship of smoking and cancer?
3. Assume you have to do feature selection for a classification task. What are the characteristics
of features (attributes) you might remove from the dataset prior to learning the classification
algorithm?
4. APRIORI has been generalized for mining sequential patterns. How is the APRIORI property
defined and used in the context of sequence mining?
5. What are the difficulties in using association rule mining for data sets that contain a lot of
continuous attributes? [3]
6. Assume the Apriori-style sequence mining algorithm described at pages 429-435 is used and
the algorithm generated 3-sequences listed below:
Frequent 3-sequences
<(1) (2) (3)>
<(1 2 3)>
<(1) (2) (4)>
<(1) (3) (4)>
<(1 2) (3)>
<(2 3) (4)>
<(2) (3) (4)>
<(3) (4 5)>
Candidate Generation
Candidates that survived pruning
Christoph F. Eick
Answers Review October 13, 2009
1.
How does post decision tree post-pruning work? What is the purpose of applying post-pruning in decision tree
learning?
No answer first question! To obtain a low generalization error! To find the correct amount of model complexity that leads
to a low generalization error.
1. What are the characteristics of representative-based/ prototype-based clustering algorithms—what do they all
have in common?
a) The form clusters by assigning objects in the dataset to the closest prototype/representative.(using 1-NN
queries b) They are iterative algorithms that change the current partitioning until a predefined termination
condition is met [c)cluster shapes are limited to convex polygons]
1. K-means is one of the most popular clustering algorithms. Give reasons why K-means is that popular!
K-means is popular because it is relatively efficient (runtime complexity is basically O(n)and storage complexity is
O(n)) and easy to use. It uses implicit fitness function (SSE) and terminates at local optimal for this fitness
function. Its properties are well understood.
1. What of the following cluster shapes K-means is capable to discover? a) triangles b) clusters inside clusters
c) the letter ‘T ‘d) any polygon of 5 points e) the letter ’I’
Only a and e!!
5. Assume we apply K-medoids for k=3 to a dataset consisting of 5 objects numbered 1,..5 with the following distance
matrix: The current set of representatives is {1,3,4}; indicate all computations k-medoids (PAM) performs in its next iteration
Distance Matrix:
0 2 4 5 1 object1
0233
015
02
Christoph F. Eick
0
Answers Review October 13, 2009 Cont.
6. What are the characteristics of a border point in DBSCAN?
It is not a core point but it is within the radius  of one or more core points.
6. If you increase the MinPts parameter of DBSCAN; how will this affect the clustering
results?
There will be more outliers! It is hard to say if the number of clusters will increase or
decrease (two effects interact: some clusters die(less clusters); some other bigger
clusters will be split into multiple smaller clusters(more clusters))
6. DBSCAN supports the notion of outliers. Why is this desirable?
a) More descriptive and compact clusters b)no need to remove outliers prior to clustering
6. DBSCAN has a complexity of O(n**2) which can be reduced by using spatial index
structures to O(log(n)*n). Explain!
For each point in the dataset we have to decide if it is a core point or not, which takes
O(n) without supportive data structures; because there are n points in the dataset we
obtain O(n**2). For each core c point we also have to compute all the points that are
density-reachable from c, but this is O(n) or less…
6. How is region discovery in spatial datasets different from traditional clustering?
a) Supports plug-in fitness functions b) Finds clusters in the subspace of spatial
attributes and not in the complete attribute space!
Christoph F. Eick
Download