
Solution Sketched Final Exam
COSC 4335 Data Mining
May 12, 2015
Your Name:
Your student id:
Problem 1 --- Association and Sequence Mining [16]
Problem 2 --- PageRank [9]
Problem 3 --- Anomaly/Outlier Detection [9]
Problem 4 --- Classification and Other [16]
Problem 5 --- Preprocessing [10]
:
Grade:
The exam is “open book” and you have 95 minutes to complete it.
The exam will count approx. 20% towards the course grade.
1) Association Rule and Sequence Mining [16]
a) Assume the Apriori-style sequence mining algorithm described on pages 429-435 of
our textbook is used and the algorithm generated the frequent 3-sequences listed below:
Frequent 3-sequences:
<(1) (2 3)>
<(1) (3 4)>
<(1 2) (3)>
<(2 3 4)>
<(2 3) (4)>
<(2) (3) (4)>
<(2) (3 4)>
<(3 4) (5)>

Candidate Generation:
<(1) (2 3 4)>
<(1) (2 3) (4)>
<(1) (3 4) (5)>
<(1 2) (3) (4)>
<(1 2) (3 4)>
<(2) (3 4) (5)>

Candidates that survived pruning:
none
Please verify my solution! One error: at most [4]; two errors: at most [3]; three errors: at most [1.5].
What candidate 4-sequences are generated from this 3-sequence set? Which of the
generated 4-sequences survive the pruning step? Use the format of Figure 7.6 on page 435
of the textbook to describe your answer! [5]
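For illustration only, the following Python sketch (not part of the original exam) re-checks the pruning step for the six candidate 4-sequences listed above; the tuple-of-itemsets representation and helper names are assumptions made for this sketch.

# Sketch: verify that no candidate 4-sequence survives the Apriori pruning step.
frequent_3 = {
    ((1,), (2, 3)), ((1,), (3, 4)), ((1, 2), (3,)),
    ((2, 3, 4),), ((2, 3), (4,)), ((2,), (3,), (4,)),
    ((2,), (3, 4)), ((3, 4), (5,)),
}
candidate_4 = [
    ((1,), (2, 3, 4)), ((1,), (2, 3), (4,)), ((1,), (3, 4), (5,)),
    ((1, 2), (3,), (4,)), ((1, 2), (3, 4)), ((2,), (3, 4), (5,)),
]

def three_subsequences(seq):
    # Yield every 3-subsequence obtained by dropping exactly one item.
    for i, element in enumerate(seq):
        for item in element:
            shrunk = tuple(x for x in element if x != item)
            rest = (shrunk,) if shrunk else ()
            yield seq[:i] + rest + seq[i + 1:]

survivors = [c for c in candidate_4
             if all(s in frequent_3 for s in three_subsequences(c))]
print(survivors)  # [] -- none of the candidates survive pruning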
b) Assume you run APRIORI with a given support threshold on a supermarket
transaction database and you obtain a single frequent 7-itemset and no frequent
8-itemsets. What can be said about the total number of itemsets that are frequent in this
case? [3]
2**7 [2.5] or, to be exact, 2**7-1 = 127, as the empty set is not a frequent itemset [3]; every non-empty subset of the frequent 7-itemset must itself be frequent.
c) Assume the itemset {A, B, D} is frequent; what rules will be generated from this
itemset and how will it be determined that they satisfy the confidence threshold? [4]
All possible rules that contain all elements of {A, B, D} are generated; that is, 6 rules that
have {A}, {B}, {D}, {A,B}, {A,D}, and {B,D} as the left-hand side and the missing
item(s) on their right-hand side are created, and each generated rule's confidence is computed
by dividing the support of {A,B,D} by the support of the rule's left-hand side; as the left-hand
side is a subset of {A,B,D}, it is frequent, and therefore its support is stored in the
frequent itemset table that has been computed by the APRIORI algorithm.
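A minimal Python sketch of this rule-generation and confidence computation; the support counts below are made-up numbers standing in for the frequent itemset table produced by APRIORI.

from itertools import combinations

# Hypothetical support counts from the frequent itemset table (illustration only).
support = {frozenset("ABD"): 40, frozenset("AB"): 50, frozenset("AD"): 60,
           frozenset("BD"): 55, frozenset("A"): 80, frozenset("B"): 90,
           frozenset("D"): 70}
minconf = 0.6
itemset = frozenset("ABD")

for r in range(1, len(itemset)):                 # every non-empty proper subset is a LHS
    for lhs in map(frozenset, combinations(itemset, r)):
        rhs = itemset - lhs                      # the missing items form the RHS
        conf = support[itemset] / support[lhs]   # support({A,B,D}) / support(LHS)
        if conf >= minconf:
            print(sorted(lhs), "->", sorted(rhs), f"confidence={conf:.2f}")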
Problem 1 continued
d) What are the difficulties in using association rule mining for data sets that contain a lot
of continuous attributes? [4]
Association rule mining techniques have been designed for the discrete domain, and in
order to apply this approach to datasets with continuous attributes, the continuous
attributes have to be discretized. Discretization might lose valuable information and is quite
difficult in practice. For example, if the intervals are too long, we may lose some patterns
because of their lack of confidence; if the intervals are too short, we may lose some patterns
because of their lack of support. Finally, rules that have neighboring intervals in their left-hand
side and right-hand side might need to be postprocessed to capture application semantics.
2) PageRank [9]
a) Give the equation system that PAGERANK would set up for the webpage structure
given below, and d is assumed to be 0.8. [4]
[Figure: link structure over the webpages P1, P2, P3, and P4 (graph omitted).]
PR(P1) = 0.2 + 0.8 [ PR(P4)/2 ]
PR(P2) = 0.2 + 0.8 [ PR(P1)/1 + PR(P3)/2 + PR(P4)/2 ]
PR(P3) = 0.2
PR(P4) = 0.2 + 0.8 [ PR(P3)/2 ]
One error at most [2.5] points; 2 errors at most [1] point.
b) How is the equation system obtained as your answer to question a) used to obtain the
page rank of the 4 webpages? [2]
The PR values are initialized (with 1) and the equations are applied iteratively until there is
no more change; the final PR values for the webpages are reported as their page ranks.
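A minimal Python sketch of this fixed-point iteration for the equation system from part a) (d = 0.8); the number of iterations is an arbitrary illustrative choice.

def pagerank(iterations=50):
    pr = {"P1": 1.0, "P2": 1.0, "P3": 1.0, "P4": 1.0}   # initialize all PR values with 1
    for _ in range(iterations):
        pr = {
            "P1": 0.2 + 0.8 * (pr["P4"] / 2),
            "P2": 0.2 + 0.8 * (pr["P1"] / 1 + pr["P3"] / 2 + pr["P4"] / 2),
            "P3": 0.2,
            "P4": 0.2 + 0.8 * (pr["P3"] / 2),
        }
    return pr

print(pagerank())   # the converged PR values are reported as the page ranks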
c) What does the parameter d, the damping factor, model in PageRank's equation
system? How does it impact the solution of the equation system? [3]
It models the probability that a web surfer navigates through the web by following
links (rather than jumping to a random page) [1.5]. If d is smaller, the page rank of the less
important pages increases (or, equivalently, the variance of the PageRank distribution over
the webpages decreases) [1.5].
3) Anomaly/Outlier Detection [9]
a) Give a sketch of an approach that uses boxplots for outlier/anomaly detection! [4]
One possible solution: Outliers are defined as objects that lie more than a given distance
above or below the box; e.g. values above the 75th percentile + 1.5*IQR or below
the 25th percentile - 1.5*IQR are flagged as outliers.
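A minimal Python sketch of this boxplot (1.5*IQR) rule for a single attribute; the sample values are made up for illustration.

import numpy as np

values = np.array([2.1, 2.4, 2.5, 2.6, 2.8, 3.0, 9.7])
q1, q3 = np.percentile(values, [25, 75])        # 25th and 75th percentiles
iqr = q3 - q1
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
print(outliers)   # values outside the whiskers are reported as outliers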
b) How do model-based approaches to outlier detection work? Give a specific example of
a model-based approach to outlier detection! [5]
Model-based approaches fit a probabilistic model/a density function M to the
observations and then define outliers as observations whose probability (density) is below a
given threshold [2.5]; e.g. a normal distribution N(μ,σ) is fitted to the observations of a single
attribute, and x is defined as an outlier if d_{μ,σ}(x) < t, where t is a density threshold and
d_{μ,σ} is the density function of N(μ,σ) [2.5].
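A minimal Python sketch of this model-based approach for one attribute, assuming SciPy is available; the data and the density threshold are chosen purely for illustration.

import numpy as np
from scipy.stats import norm

x = np.array([4.9, 5.1, 5.0, 5.2, 4.8, 5.1, 9.5])
mu, sigma = x.mean(), x.std()                 # fit N(mu, sigma) to the observations
density = norm.pdf(x, loc=mu, scale=sigma)
threshold = 0.05                              # density threshold (illustration only)
print(x[density < threshold])                 # low-density observations are outliers, here 9.5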
4) Classification and Other [16]
a) Compare k-Nearest Neighbor Classifiers and Decision Trees! What are the main
differences between the two approaches? [5]
Nearest Neighbor Classifiers:
– Considers only the neighborhood to classify: a local classifier [0.5]
– Uses multiple decision boundaries that are edges of convex polygons which can be computed using Voronoi tessellations [1]
– Lazy learner; doesn't build a model of the training data [0.5]
– No training cost [0.5]
– Distance function is critical! [0.5]

Decision Trees:
– Hierarchical decision-making scheme [0.5]
– Uses multiple decision boundaries which are axis-parallel/rectangular [1]
– Builds a model of the training data [0.5]
– (Somewhat) expensive to learn [0.5]
– Does not rely on distance functions but rather on the ordering of the attribute values [0.5]
At most 5 points; other answers might deserve credit!
b) The following dataset is given (depicted below) with A being a continuous attribute
and GINI is used as the evaluation function. What root test would be generated by the
decision tree induction algorithm? Please justify your answer! [4]
Root test: A >= ___
A      Class
0.22   0
0.31   0
0.31   1
0.31   0
0.33   1
0.41   1
0.43   1
Possible splits (class-0/class-1 counts in the two partitions):
A ≤ 0.22: (1,0); (2,4)
A ≤ 0.31: (3,1); (0,3)
A ≤ 0.33: (3,2); (0,2)
A ≤ 0.41: (3,3); (0,1)
As A ≤ 0.31 has a purity of 75%/100%, which is much higher than the purity of the other
splits, this split will be selected.
Alternatively, the GINI for the 4 splits could be computed and the split with the highest
GINI gain (lowest weighted GINI) would be selected.
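A minimal Python sketch of this alternative: compute the weighted GINI of the four candidate splits for the dataset above (the A ≤ 0.31 split comes out best).

A      = [0.22, 0.31, 0.31, 0.31, 0.33, 0.41, 0.43]
labels = [0,    0,    1,    0,    1,    1,    1]

def gini(part):
    if not part:
        return 0.0
    p1 = sum(part) / len(part)               # fraction of class-1 examples
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2

for t in [0.22, 0.31, 0.33, 0.41]:
    left  = [c for a, c in zip(A, labels) if a <= t]
    right = [c for a, c in zip(A, labels) if a > t]
    weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(A)
    print(f"A <= {t}: weighted GINI = {weighted:.3f}")
# A <= 0.31 has the lowest weighted GINI (0.214), i.e. the highest GINI gain.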
c) Assume you use 10-fold cross validation to determine the accuracy of a classifier. How
is the accuracy of the classifier computed when this approach is chosen? [3]
When this approach is chosen, there will be 10 training-set/test-set pairs and a model will
be generated from each of the 10 training sets; for each example in a test set it is determined
whether it is classified correctly or not by the learnt model. As each example occurs in exactly
one of the 10 test sets¹, the accuracy can be computed straightforwardly as (number of examples
classified correctly)/(total number of examples).
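A minimal Python sketch of this computation, assuming scikit-learn; the dataset and classifier are placeholders and any classifier would do.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
# Each example lands in exactly one of the 10 test folds, so the pooled predictions
# yield accuracy = (#correctly classified examples) / (total #examples).
predictions = cross_val_predict(DecisionTreeClassifier(), X, y, cv=10)
print(accuracy_score(y, predictions))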
d) Assume you want to use support vector machines for classification problems involving
more than two classes. Propose a generalization of support vector machines of your own
preference for non-binary classification problems! [4+2 extra points]
Many possible solutions: e.g. you could learn k binary SVM classifiers, one per class, where
classifier k determines whether an example belongs to class Ck or not; the class of a new example
is then determined by voting, where the vote of classifier k is the example's distance to its
hyperplane if the classifier assigns the example to Ck, and the negative of that distance
otherwise (one-versus-rest).
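A minimal Python sketch of this one-versus-rest scheme, assuming scikit-learn; the dataset is a placeholder and the signed distance to each hyperplane serves as the vote.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)
classes = np.unique(y)
# Train one binary "class k vs. rest" SVM per class.
models = [LinearSVC(max_iter=10000).fit(X, (y == k).astype(int)) for k in classes]
# Each classifier votes with the signed distance of the example to its hyperplane;
# the class whose classifier gives the largest vote wins.
votes = np.column_stack([m.decision_function(X) for m in models])
predicted = classes[np.argmax(votes, axis=1)]
print((predicted == y).mean())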
5) Preprocessing [10]
a) Why is dimensionality reduction important in data mining? [2]
– Avoid curse of dimensionality [0.5]
– Reduce amount of time and memory required by data mining algorithms [0.5]
– Allow data to be more easily visualized [0.5]
– May increase accuracy by eliminating irrelevant and redundant features [0.5]
– Might reduce noise [0.5]
At most 2 points
b) What is the goal of Principal Component Analysis? What is a principal component?
How are principal components used for dimensionality reduction? [4]
The goal is to reduce the dimensionality of a dataset by selecting a subset of the principal
components that captures a large proportion of the variation in the dataset [2]. Principal
components are linearly uncorrelated variables (linear combinations of the original
attributes in the dataset) [2]. Principal components are computed in the order of their
contribution to the variation in the dataset, and this process is stopped once a certain
percentage of the variation in the dataset is captured by the first k principal components
[1].
At most 4 points!
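A minimal Python sketch of this use of principal components, assuming scikit-learn; the dataset and the 95% variance cutoff are illustrative choices.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=0.95)       # keep the first components that explain 95% of the variance
X_reduced = pca.fit_transform(X)   # project the data onto those components
print(X_reduced.shape, pca.explained_variance_ratio_)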
c) Assume you use association rule mining for a very very large transaction database.
How could sampling be used to mine the transaction database in a reasonable time? [4]
Many possible solutions: e.g. create a sample of the transaction database (e.g. take
10,000 of 1,000,000 transactions), mine association rules on the sample, re-compute the
confidence and support of the obtained association rules on the whole transaction database,
and drop from the obtained rule set those rules whose support or confidence is below the
respective threshold.
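A minimal Python sketch of this strategy; mine_association_rules is a hypothetical stand-in for any rule miner, transactions are assumed to be a list of item sets, and rules are (LHS, RHS) pairs of frozensets.

import random

def mine_with_sampling(transactions, mine_association_rules,
                       sample_size=10000, minsup=0.01, minconf=0.6):
    # 1. Mine candidate rules on a random sample of the transaction database.
    sample = random.sample(transactions, min(sample_size, len(transactions)))
    candidate_rules = mine_association_rules(sample, minsup, minconf)
    # 2. Re-compute support and confidence of each rule on the whole database
    #    and drop the rules that fall below the thresholds.
    n = len(transactions)
    kept = []
    for lhs, rhs in candidate_rules:
        lhs_count  = sum(1 for t in transactions if lhs <= t)
        both_count = sum(1 for t in transactions if (lhs | rhs) <= t)
        if lhs_count and both_count / n >= minsup and both_count / lhs_count >= minconf:
            kept.append((lhs, rhs))
    return kept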
¹ If they do not mention this fact: at most [2].