Exam

Selected Topics in Data and Knowledge Engineering:
Data Mining
2010
This exam is open book and open notes. You have 100 minutes to complete it. During the exam,
communicating with other participants and using mobile phones are both forbidden.
First and last name:
__________________________________
Student signature:
__________________________________
Problem    1    2    3    4    5    6    Total    Grade
Maximum    10   10   10   10   10   10   60
Points
1. Web mining [10 points]
[Figure: web graph on pages 1, 2, 3, 4, 5; links not reproduced]
1. Set up the equations to compute PageRank (matrix M) for the web graph shown in the
figure above. Calculate the PageRank of pages 1, 2, …, 5 assuming no taxation (perform 1
iteration). [5 points]
2. Define the matrix A (HITS algorithm) for the web graph shown in the figure above.
Calculate the hubbiness and authority of pages 1, 2, …, 5 using the HITS algorithm (perform 1
iteration). [5 points]
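As a minimal sketch of what one iteration looks like, the Python below runs one PageRank step without taxation and one HITS step. The edge list `EDGES` is hypothetical, because the links of the exam's graph are not available here, so the numbers it produces are not the answer to this problem. The conventions are also assumptions: M is column-stochastic, PageRank starts from the uniform vector, HITS starts from hubbiness 1 for every page, and each HITS vector is scaled so that its largest entry is 1.

```python
# Hypothetical edge list (page -> pages it links to); NOT the exam's graph.
EDGES = {1: [2, 3], 2: [4], 3: [1, 4, 5], 4: [5], 5: [1]}
NODES = sorted(EDGES)


def transition_matrix(edges, nodes):
    """Column-stochastic M: M[i][j] = 1/outdeg(j) if page j links to page i."""
    idx = {n: k for k, n in enumerate(nodes)}
    m = [[0.0] * len(nodes) for _ in nodes]
    for src, targets in edges.items():
        for dst in targets:
            m[idx[dst]][idx[src]] = 1.0 / len(targets)
    return m


def link_matrix(edges, nodes):
    """Link matrix A for HITS: A[i][j] = 1 if page i links to page j."""
    idx = {n: k for k, n in enumerate(nodes)}
    a = [[0.0] * len(nodes) for _ in nodes]
    for src, targets in edges.items():
        for dst in targets:
            a[idx[src]][idx[dst]] = 1.0
    return a


def mat_vec(m, v):
    return [sum(row[j] * v[j] for j in range(len(v))) for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


def scale_to_max(v):
    top = max(v)
    return [x / top for x in v]


def pagerank_step(edges, nodes):
    """One iteration v' = M v from the uniform vector, no taxation."""
    v0 = [1.0 / len(nodes)] * len(nodes)
    return mat_vec(transition_matrix(edges, nodes), v0)


def hits_step(edges, nodes):
    """One iteration: a = A^T h, then h = A a, each scaled to max 1."""
    a_mat = link_matrix(edges, nodes)
    h0 = [1.0] * len(nodes)
    authority = scale_to_max(mat_vec(transpose(a_mat), h0))
    hub = scale_to_max(mat_vec(a_mat, authority))
    return hub, authority


if __name__ == "__main__":
    print("PageRank:", pagerank_step(EDGES, NODES))
    print("Hubs, authorities:", hits_step(EDGES, NODES))
```

Replacing `EDGES` with the links from the figure gives the matrices M and A that the problem asks for.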
2. Clustering [10 points]
Consider the following Euclidean distance matrix for objects p1, p2, …, p5.
      P1     P2     P3     P4     P5
P1    0      0.90   0.59   0.45   0.65
P2    0.90   0      0.36   0.53   0.02
P3    0.59   0.36   0      0.56   0.15
P4    0.45   0.53   0.56   0      0.24
P5    0.65   0.02   0.15   0.24   0
Apply the hierarchical agglomerative clustering algorithm to cluster objects p1, p2, …, p5.
Show the result of the clustering by drawing a dendrogram. The dendrogram should clearly
show the order in which the objects are merged. Use the “maximum distance” (complete-link)
measure as the distance between clusters.
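The merge order can be checked mechanically. The following is a minimal sketch of agglomerative clustering with the maximum-distance measure, applied to the distance matrix of this problem. The function and variable names are illustrative only.

```python
from itertools import combinations

LABELS = ["P1", "P2", "P3", "P4", "P5"]
DIST = [
    [0.00, 0.90, 0.59, 0.45, 0.65],
    [0.90, 0.00, 0.36, 0.53, 0.02],
    [0.59, 0.36, 0.00, 0.56, 0.15],
    [0.45, 0.53, 0.56, 0.00, 0.24],
    [0.65, 0.02, 0.15, 0.24, 0.00],
]


def complete_linkage(dist, labels):
    """Return the merges as (cluster_a, cluster_b, height), in merge order."""
    clusters = [(i,) for i in range(len(labels))]  # clusters as index tuples
    merges = []

    def cluster_distance(c1, c2):
        # Maximum distance: the farthest pair of members decides.
        return max(dist[i][j] for i in c1 for j in c2)

    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest maximum distance.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: cluster_distance(*pair))
        height = cluster_distance(a, b)
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(tuple(sorted(a + b)))
        merges.append(([labels[i] for i in a], [labels[i] for i in b], height))
    return merges


if __name__ == "__main__":
    for left, right, height in complete_linkage(DIST, LABELS):
        print(left, "+", right, "at height", height)
```

The heights returned are the levels at which the dendrogram branches join. Swapping `max` for `min` inside `cluster_distance` would give the minimum-distance (single-link) variant instead.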
3. Association and multilevel association analysis [10 points]
Part 1 [6 points]
1. Let A, A1, A2 and B denote items in an item hierarchy (taxonomy), and let A1 and A2
denote two children of A in the item hierarchy (see the figure below).
        A
       / \
     A1   A2
Circle the correct answer and justify it in one sentence.
(a) sup(A → B) = sup(A1 → B) + sup(A2 → B)                    T    F
(b) if sup(A1 → B) > minsup, then sup(A → B) > minsup         T    F
(c) if sup(A → B) > minsup, then sup(A1 → B) > minsup         T    F
2. Let p and q denote a pair of items, and let P and Q be their corresponding ancestors (parents)
in the item hierarchy. Assume that the association rule p → q satisfies conf(p → q) > minconf.
Which of the following rules are guaranteed to have confidence higher than minconf?
   conf(p → Q) > minconf    T    F
   conf(P → q) > minconf    T    F
   conf(P → Q) > minconf    T    F
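When reasoning about statements like these, it can help to compute support and confidence on a small example. The sketch below assumes the usual multilevel convention that a transaction containing an item also counts as containing every ancestor of that item. The taxonomy `PARENT` and the list `TRANSACTIONS` are hypothetical toy data, not part of the exam.

```python
# Hypothetical taxonomy (child -> parent) and toy transactions.
PARENT = {"A1": "A", "A2": "A"}
TRANSACTIONS = [
    {"A1", "B"},
    {"A2", "B"},
    {"A1", "A2", "B"},
    {"A1"},
    {"B"},
]


def extend(transaction):
    """Add every ancestor of each item to the transaction."""
    extended = set(transaction)
    for item in transaction:
        while item in PARENT:
            item = PARENT[item]
            extended.add(item)
    return extended


def support(itemset, transactions):
    """Fraction of (extended) transactions containing all items of itemset."""
    extended = [extend(t) for t in transactions]
    return sum(set(itemset) <= t for t in extended) / len(extended)


def confidence(lhs, rhs, transactions):
    """conf(lhs -> rhs) = sup(lhs and rhs together) / sup(lhs)."""
    both = set(lhs) | set(rhs)
    return support(both, transactions) / support(lhs, transactions)


if __name__ == "__main__":
    print("sup(A, B)  =", support({"A", "B"}, TRANSACTIONS))
    print("sup(A1, B) =", support({"A1", "B"}, TRANSACTIONS))
    print("sup(A2, B) =", support({"A2", "B"}, TRANSACTIONS))
    print("conf(A1 -> B) =", confidence({"A1"}, {"B"}, TRANSACTIONS))
    print("conf(A -> B)  =", confidence({"A"}, {"B"}, TRANSACTIONS))
```

A single toy dataset can only refute a claimed guarantee, never prove one, so a one-sentence justification still needs an argument from the definitions.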
Part 2 [4 points]
Consider a frequent itemset ABCD. Assume that the association rule AB → CD satisfies
conf(AB → CD) < minconf.
Which of the following rules are guaranteed to have confidence lower than minconf?
   conf(BCD → A) < minconf    T    F
   conf(A → BCD) < minconf    T    F
4. Classification [10 points]
Consider the following training set of examples.
A    B    Decision attribute
1    0    No
1    0    No
0    0    Yes
0    1    No
0    1    Yes
1    1    No
0    1    Yes
0    1    Yes
0    0    Yes
1    0    No
Generate a decision tree from the given training set of examples using the SPRINT algorithm
(based on the Gini split index).
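The quantity that drives the choice of split is the Gini split index. The sketch below computes it for a binary split on A and on B. It covers only the index itself, not SPRINT's attribute lists or presorting. The list `RECORDS` is an assumption about how the training set reads: ten records, with the ten values of A, the ten values of B and the ten decisions taken in order.

```python
from collections import Counter

# (A, B, decision); assumed reading of the training set, ten records.
RECORDS = [
    (1, 0, "No"), (1, 0, "No"), (0, 0, "Yes"), (0, 1, "No"), (0, 1, "Yes"),
    (1, 1, "No"), (0, 1, "Yes"), (0, 1, "Yes"), (0, 0, "Yes"), (1, 0, "No"),
]


def gini(labels):
    """gini(S) = 1 - sum over classes of p_c squared."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_split(records, attr_index):
    """Weighted Gini index of the partitions induced by one attribute."""
    n = len(records)
    total = 0.0
    for value in {r[attr_index] for r in records}:
        part = [r[-1] for r in records if r[attr_index] == value]
        total += len(part) / n * gini(part)
    return total


if __name__ == "__main__":
    print("gini(all)     =", gini([r[-1] for r in RECORDS]))
    print("gini_split(A) =", gini_split(RECORDS, 0))
    print("gini_split(B) =", gini_split(RECORDS, 1))
```

The attribute with the lowest split index is chosen for the node, and the same computation is then repeated on each partition that still mixes both classes.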
5. Multidimensional association rules [10 points]
Consider the dataset shown below.
RecId    Temperature    Pressure    Alarm1
1        95             1105        No
2        85             1041        Yes
3        103            1090        Yes
4        97             1085        Yes
5        80             1030        No
6        101            1087        Yes
7        83             1026        Yes
8        86             1028        Yes
9        102            1101        Yes
The first two attributes (Temperature, Pressure) are continuous, while Alarm1 is a
symmetric binary attribute. Suppose we apply the following equal-width discretization scheme
to the attributes Temperature and Pressure:
Temperature: (80 – 89), (90 – 99), (100 – 109)
Pressure:    (1000 – 1049), (1050 – 1099), (1100 – 1149)
1. Construct a binarized version of the data set [2 points].
2. Derive all frequent itemsets with support ≥ 30% [6 points].
3. Generate one multidimensional association rule from the derived 3-itemset and compute its
confidence [2 points].
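The three steps can be sketched in Python as follows. This is a brute-force enumeration that counts every combination of up to three items. It is not Apriori or any other pruning algorithm, which is acceptable only because the data set is tiny. The item names such as `Temperature=80-89` are an illustrative naming choice. Because Alarm1 is symmetric, both `Alarm1=Yes` and `Alarm1=No` are treated as items.

```python
from itertools import combinations

# (RecId, Temperature, Pressure, Alarm1) as given in the dataset.
ROWS = [
    (1, 95, 1105, "No"), (2, 85, 1041, "Yes"), (3, 103, 1090, "Yes"),
    (4, 97, 1085, "Yes"), (5, 80, 1030, "No"), (6, 101, 1087, "Yes"),
    (7, 83, 1026, "Yes"), (8, 86, 1028, "Yes"), (9, 102, 1101, "Yes"),
]
TEMP_BINS = [(80, 89), (90, 99), (100, 109)]
PRES_BINS = [(1000, 1049), (1050, 1099), (1100, 1149)]


def bin_item(name, value, bins):
    """Map a continuous value to the item naming its interval."""
    for low, high in bins:
        if low <= value <= high:
            return f"{name}={low}-{high}"
    raise ValueError(f"{name} value {value} falls outside all intervals")


def to_items(row):
    """Binarize one record: one item per attribute."""
    _, temp, pres, alarm = row
    return frozenset({
        bin_item("Temperature", temp, TEMP_BINS),
        bin_item("Pressure", pres, PRES_BINS),
        f"Alarm1={alarm}",
    })


TRANSACTIONS = [to_items(r) for r in ROWS]


def frequent_itemsets(transactions, minsup):
    """Brute force: count every itemset of size 1 to 3, keep the frequent."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    result = {}
    for size in range(1, 4):  # each record holds exactly three items
        for candidate in combinations(items, size):
            count = sum(set(candidate) <= t for t in transactions)
            if count / n >= minsup:
                result[candidate] = count
    return result


def confidence(lhs, rhs, transactions):
    """conf(lhs -> rhs) = count(lhs and rhs together) / count(lhs)."""
    both = set(lhs) | set(rhs)
    n_both = sum(both <= t for t in transactions)
    n_lhs = sum(set(lhs) <= t for t in transactions)
    return n_both / n_lhs


if __name__ == "__main__":
    for itemset, count in frequent_itemsets(TRANSACTIONS, 0.30).items():
        print(itemset, count)
```

The binarized table of step 1 has one 0/1 column per item name, set to 1 in a row when that item is in the row's transaction. A rule for step 3 can then be scored with `confidence`, passing part of the frequent 3-itemset as `lhs` and the rest as `rhs`.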
6. Clustering [10 points]
(A) Given below is a list of restaurants in Poznan with positive and negative ratings given to
some of them by students. Apply the hierarchical agglomerative clustering algorithm to cluster
the students. Show the result of the clustering by drawing a dendrogram, but stop building the
dendrogram when the number of clusters reaches k = 3. The dendrogram should clearly show the
order in which the objects are merged. Use the “minimum distance” measure as the
distance between clusters. [7 points]
(B) For the cluster of students that includes Dave, define the ranking of restaurants. Which
restaurant should you recommend to Dave? [3 points]
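[Table: restaurant ratings by students Alice, Bob, Cindy, Dave, Estie, Fred. Empty cells are
not preserved, so the student who gave each rating cannot be recovered; the ratings of each
restaurant are listed in row order.]
Casa Mia:  No, No
Bazar:     Yes, Yes
Dorota:    No
Piano:     Yes
Azja:      Yes, No, No, No, Yes, Yes
Rodeo:     No, No
Farma:     Yes, Yes
Sfinks:    No, No, No
Haveli:    Yes, Yes, No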