Add a clustering mining model

Assignment 6 – Clustering and Other Data Mining Methods
Note: Show all your work.
Problem 1 (20 points). Consider the following two clusters as shown in the figure. Here,
filled circles are Cluster1 objects and unfilled circles are Cluster2 objects.
[Figure: scatter plot of the Cluster1 and Cluster2 objects; both axes run from 0 to 9.]
Cluster1: (2,2) (2,4) (4,4) (4,6)
Cluster2: (4,2) (2,6) (2,8) (4,8)
Using two iterations of the k-Means clustering algorithm, show the two clusters at the
end of each iteration. You don’t need to draw figures like above. It is enough that you
indicate which objects belong to each cluster at the end of each iteration. Again, show all
your work and use the Manhattan distance metric to compute the distance between two
objects.
Cluster1: (2,2) (2,4) (4,4) (4,6)
Cluster2: (4,2) (2,6) (2,8) (4,8)
$C_1(x) = \frac{2+2+4+4}{4} = 3$
$C_1(y) = \frac{2+4+4+6}{4} = 4$
$C_1 = (3,4)$
$C_2(x) = \frac{4+2+2+4}{4} = 3$
$C_2(y) = \frac{2+6+8+8}{4} = 6$
$C_2 = (3,6)$
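Distance from each object to the centroids $C_1 = (3,4)$ and $C_2 = (3,6)$. The distances below are Euclidean. With the Manhattan metric that the problem asks for, the values become 3 in place of 2.24 and 5 in place of 4.12 (1 stays 1), and every assignment is unchanged.
(2,2): $d(C_1) = [(2-3)^2 + (2-4)^2]^{1/2} = 2.24$, $d(C_2) = [(2-3)^2 + (2-6)^2]^{1/2} = 4.12$
(2,4): $d(C_1) = [(2-3)^2 + (4-4)^2]^{1/2} = 1$, $d(C_2) = [(2-3)^2 + (4-6)^2]^{1/2} = 2.24$
(2,6): $d(C_1) = [(2-3)^2 + (6-4)^2]^{1/2} = 2.24$, $d(C_2) = [(2-3)^2 + (6-6)^2]^{1/2} = 1$
(2,8): $d(C_1) = [(2-3)^2 + (8-4)^2]^{1/2} = 4.12$, $d(C_2) = [(2-3)^2 + (8-6)^2]^{1/2} = 2.24$
(4,2): $d(C_1) = [(4-3)^2 + (2-4)^2]^{1/2} = 2.24$, $d(C_2) = [(4-3)^2 + (2-6)^2]^{1/2} = 4.12$
(4,4): $d(C_1) = [(4-3)^2 + (4-4)^2]^{1/2} = 1$, $d(C_2) = [(4-3)^2 + (4-6)^2]^{1/2} = 2.24$
(4,6): $d(C_1) = [(4-3)^2 + (6-4)^2]^{1/2} = 2.24$, $d(C_2) = [(4-3)^2 + (6-6)^2]^{1/2} = 1$
(4,8): $d(C_1) = [(4-3)^2 + (8-4)^2]^{1/2} = 4.12$, $d(C_2) = [(4-3)^2 + (8-6)^2]^{1/2} = 2.24$
End of iteration 1:
Cluster1: (2,2) (2,4) (4,2) (4,4)
Cluster2: (2,6) (2,8) (4,6) (4,8)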
$C_1(x) = \frac{4+4+2+2}{4} = 3$
$C_1(y) = \frac{4+4+2+2}{4} = 3$
$C_2(x) = \frac{4+4+2+2}{4} = 3$
$C_2(y) = \frac{6+6+8+8}{4} = 7$
$C_1 = (3,3)$
$C_2 = (3,7)$
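Distance from each object to the centroids $C_1 = (3,3)$ and $C_2 = (3,7)$:
(2,2): $d(C_1) = [(2-3)^2 + (2-3)^2]^{1/2} = 1.41$, $d(C_2) = [(2-3)^2 + (2-7)^2]^{1/2} = 5.1$
(2,4): $d(C_1) = [(2-3)^2 + (4-3)^2]^{1/2} = 1.41$, $d(C_2) = [(2-3)^2 + (4-7)^2]^{1/2} = 3.16$
(2,6): $d(C_1) = [(2-3)^2 + (6-3)^2]^{1/2} = 3.16$, $d(C_2) = [(2-3)^2 + (6-7)^2]^{1/2} = 1.41$
(2,8): $d(C_1) = [(2-3)^2 + (8-3)^2]^{1/2} = 5.1$, $d(C_2) = [(2-3)^2 + (8-7)^2]^{1/2} = 1.41$
(4,2): $d(C_1) = [(4-3)^2 + (2-3)^2]^{1/2} = 1.41$, $d(C_2) = [(4-3)^2 + (2-7)^2]^{1/2} = 5.1$
(4,4): $d(C_1) = [(4-3)^2 + (4-3)^2]^{1/2} = 1.41$, $d(C_2) = [(4-3)^2 + (4-7)^2]^{1/2} = 3.16$
(4,6): $d(C_1) = [(4-3)^2 + (6-3)^2]^{1/2} = 3.16$, $d(C_2) = [(4-3)^2 + (6-7)^2]^{1/2} = 1.41$
(4,8): $d(C_1) = [(4-3)^2 + (8-3)^2]^{1/2} = 5.1$, $d(C_2) = [(4-3)^2 + (8-7)^2]^{1/2} = 1.41$
End of iteration 2:
Cluster1: (2,2) (2,4) (4,2) (4,4)
Cluster2: (2,6) (2,8) (4,6) (4,8)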
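The two iterations can be cross-checked with a short script. This is a minimal sketch, not the course's reference solution: it assumes centroids are coordinate-wise means (as in the work above) and that assignment uses the Manhattan distance the problem asks for. The helper names `manhattan`, `centroid`, and `kmeans_step` are illustrative.

```python
from statistics import mean


def manhattan(p, q):
    """Manhattan (L1) distance between two 2-D points."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return (mean(x for x, _ in points), mean(y for _, y in points))


def kmeans_step(clusters):
    """One iteration: recompute centroids, then reassign every point."""
    centers = [centroid(c) for c in clusters]
    new_clusters = [[] for _ in clusters]
    for point in sorted(p for c in clusters for p in c):
        distances = [manhattan(point, c) for c in centers]
        new_clusters[distances.index(min(distances))].append(point)
    return centers, new_clusters


clusters = [
    [(2, 2), (2, 4), (4, 4), (4, 6)],  # initial Cluster1 from the problem
    [(4, 2), (2, 6), (2, 8), (4, 8)],  # initial Cluster2 from the problem
]
history = []
for iteration in (1, 2):
    centers, clusters = kmeans_step(clusters)
    history.append((centers, clusters))
    print(f"Iteration {iteration}: centroids={centers}")
    for name, members in zip(("Cluster1", "Cluster2"), clusters):
        print(f"  {name}: {members}")
```

Under these assumptions the memberships are the same after both iterations, so the clustering has already stabilized after the first reassignment.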
Problem 2 (10 points). Consider the following two clusters:
[Figure: two clusters, labeled C0 and C1, plotted on a grid whose axes run from 0 to 9.]
Compute the distance between the two clusters (1) using minimum distance and (2) using
average distance. These distance measures are defined on page 410 of the textbook. Use
the Manhattan distance measure.
This was on page 461 in my book.
Min Distance = 1
$[(4-5)^2 + (5-5)^2]^{1/2} = 1$
Average distance = 3.5
$[(2.5-6)^2 + (5-5)^2]^{1/2} = 3.5$
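For reference, the two measures can be sketched in code. This assumes the textbook uses the usual definitions: minimum distance is the smallest distance over all cross-cluster pairs, and average distance is the mean over all cross-cluster pairs (which generally differs from the distance between the two cluster means). The coordinates below are hypothetical, since the figure's points are not recoverable from the text, and the function names are illustrative.

```python
from itertools import product


def manhattan(p, q):
    """Manhattan (L1) distance between two points of equal dimension."""
    return sum(abs(a - b) for a, b in zip(p, q))


def min_distance(ci, cj):
    """Smallest distance between any object of ci and any object of cj."""
    return min(manhattan(p, q) for p, q in product(ci, cj))


def avg_distance(ci, cj):
    """Mean distance over all (p, q) pairs with p in ci and q in cj."""
    total = sum(manhattan(p, q) for p, q in product(ci, cj))
    return total / (len(ci) * len(cj))


# Hypothetical coordinates, NOT read from the assignment's figure.
c0 = [(1, 5), (2, 4), (3, 6)]
c1 = [(5, 5), (6, 4), (7, 6)]

print("minimum distance:", min_distance(c0, c1))
print("average distance:", round(avg_distance(c0, c1), 3))
```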
Problem 3 (10 points). Suppose that you issued a query against a search engine and the
search engine retrieved 125 documents. Suppose that there are a total of 250 documents on
the whole internet that match your query (i.e., these are relevant documents) and 40 of them
were included among 125 retrieved documents. Compute precision, recall, and F_score.
$P = \frac{40}{125} = .32$
$R = \frac{40}{250} = .16$
I must be missing it but I do not see F_score in the lecture notes.
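If F_score refers to the balanced F-measure (an assumption, since the lecture notes are not available here), it is the harmonic mean of precision and recall, $F = \frac{2PR}{P+R}$. A minimal sketch with the numbers from this problem follows; the function name is illustrative.

```python
def precision_recall_f1(relevant_retrieved, retrieved, relevant):
    """Return (precision, recall, balanced F-score) from raw counts."""
    precision = relevant_retrieved / retrieved
    recall = relevant_retrieved / relevant
    # Balanced F-score: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


p, r, f = precision_recall_f1(relevant_retrieved=40, retrieved=125, relevant=250)
print(f"P={p:.2f} R={r:.2f} F={f:.3f}")
```

With P = .32 and R = .16 this gives F = 2(.32)(.16)/(.48), or about .213.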
Problem 4 (10 points total). The following is a short dictionary of words found in
analyzing the inaugural speeches of the past few presidents:
{America, Bible, century, ideal, life, nation, people, story, time, today, word,
world}
The following table shows the frequencies of the occurrences of these words in the
speeches of the past four presidents:
President   Word      Frequency
Bush Jr     ideal     2
            life      1
            nation    1
            people    1
            story     7
            world     2
Clinton     America   4
            ideal     1
            life      1
            time      1
            word      1
            world     3
Bush Sr     America   1
            Bible     2
            nation    1
            today     2
            word      2
Reagan      Bible     1
            century   1
            people    1
            time      2
            today     1
            world     2
Consider each speech as a document, and let the speech of Bush Jr. be D1, that of Clinton
be D2, that of Bush Sr. be D3, and that of Reagan be D4. (1). Represent each document
as a binary vector. (2). Compute the similarities among the four documents and decide
which two are closest to each other. Use the cosine measure for document similarity.
Pair     V*W   ||V||   ||W||   cosƟ   Ɵ
D1,D2    9     2√15    √29     .22    77.3
D1,D3    1     2√15    √14     .03    88.0
D1,D4    5     2√15    2√3     .19    79
D2,D3    6     √29     √14     .3     72.5
D2,D4    8     √29     2√3     .43    64.5
D3,D4    4     √14     2√3     .31    71.9
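The similarity computation can be sketched in code for both readings of the problem: binary vectors, as the problem statement asks, and frequency-weighted vectors, as in the table above. The word counts are an assumption: they come from pairing the words and frequencies of the problem's table in the order listed. All names (`VOCAB`, `SPEECHES`, `vector`, `cosine`, `similarities`) are illustrative.

```python
from itertools import combinations
from math import sqrt

VOCAB = ["america", "bible", "century", "ideal", "life", "nation",
         "people", "story", "time", "today", "word", "world"]

# Word counts per speech (assumed reading of the problem's table).
SPEECHES = {
    "D1": {"ideal": 2, "life": 1, "nation": 1, "people": 1, "story": 7, "world": 2},
    "D2": {"america": 4, "ideal": 1, "life": 1, "time": 1, "word": 1, "world": 3},
    "D3": {"america": 1, "bible": 2, "nation": 1, "today": 2, "word": 2},
    "D4": {"bible": 1, "century": 1, "people": 1, "time": 2, "today": 1, "world": 2},
}


def vector(counts, binary):
    """Document vector over VOCAB: 0/1 entries if binary, raw counts otherwise."""
    values = [counts.get(word, 0) for word in VOCAB]
    return [1 if v else 0 for v in values] if binary else values


def cosine(v, w):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(v, w))
    return dot / (sqrt(sum(a * a for a in v)) * sqrt(sum(b * b for b in w)))


def similarities(binary):
    """Cosine similarity for every pair of documents."""
    vecs = {name: vector(counts, binary) for name, counts in SPEECHES.items()}
    return {(a, b): cosine(vecs[a], vecs[b]) for a, b in combinations(vecs, 2)}


binary_sims = similarities(binary=True)
freq_sims = similarities(binary=False)
for pair in binary_sims:
    print(pair, f"binary={binary_sims[pair]:.2f}", f"frequency={freq_sims[pair]:.2f}")
print("closest (binary):", max(binary_sims, key=binary_sims.get))
print("closest (frequency):", max(freq_sims, key=freq_sims.get))
```

Under these assumptions the two representations disagree on the closest pair: with binary vectors it is D1 and D2 (cosine 0.50), and with frequency vectors it is D2 and D4 (cosine about 0.43).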
Problem 5 (10 points). Refer to the following figure that shows link (or reference)
relationships among three web pages.
[Figure: link relationships among Page A, Page B, and Page C.]
Compute the page ranks of the three pages using the method described in the lecture
(Section 4.6 of Module 6). The computation is done iteratively. You have to perform the
iteration until all ranks do not change their values in the three digits after the decimal
point. Show three page rank values in each iteration. Use 0.85 as the damping factor.
C(A)=1
C(B)=2
C(C)=2
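Iteration 1:
$PR(A) = 0.15 + 0.85\left(\frac{0}{2} + \frac{0}{2}\right) = 0.15$
$PR(B) = 0.15 + 0.85\left(\frac{0.15}{1} + \frac{0}{2}\right) = 0.278$
$PR(C) = 0.15 + 0.85\left(\frac{0.278}{2}\right) = 0.268$
Iteration 2:
$PR(A) = 0.15 + 0.85\left(\frac{0.278}{2} + \frac{0.268}{2}\right) = 0.382$
$PR(B) = 0.15 + 0.85\left(\frac{0.382}{1} + \frac{0.268}{2}\right) = 0.589$
$PR(C) = 0.15 + 0.85\left(\frac{0.589}{2}\right) = 0.4$
Iteration 3:
$PR(A) = 0.15 + 0.85\left(\frac{0.589}{2} + \frac{0.4}{2}\right) = 0.57$
$PR(B) = 0.15 + 0.85\left(\frac{0.57}{1} + \frac{0.4}{2}\right) = 0.805$
$PR(C) = 0.15 + 0.85\left(\frac{0.805}{2}\right) = 0.492$

Iter   PR(A)   PR(B)   PR(C)
1      0.150   0.278   0.268
2      0.382   0.589   0.400
3      0.570   0.805   0.492
4      0.701   0.955   0.556
5      0.792   1.060   0.600
6      0.855   1.132   0.631
7      0.900   1.183   0.653
8      0.930   1.218   0.668
9      0.951   1.242   0.678
10     0.966   1.259   0.685
11     0.977   1.271   0.690
12     0.984   1.279   0.694
13     0.989   1.285   0.696
14     0.992   1.289   0.698
15     0.995   1.292   0.699
16     0.996   1.294   0.700
17     0.997   1.295   0.700
18     0.998   1.296   0.701
19     0.999   1.297   0.701
20     0.999   1.297   0.701
21     0.999   1.298   0.701
22     1.000   1.298   0.702
23     1.000   1.298   0.702
24     1.000   1.298   0.702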
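The iteration can be reproduced with a short script. This is a sketch under stated assumptions: the link structure (A links to B; B links to A and C; C links to A and B) is inferred from the out-link counts and the worked values above, not read from the figure, and each page is updated in place so that later pages in the same iteration see the new values, which is what the worked numbers suggest. The names `OUT_LINKS` and `pagerank` are illustrative.

```python
DAMPING = 0.85

# Assumed link structure, inferred from C(A)=1, C(B)=2, C(C)=2 and the
# worked values (the figure itself is not available).
OUT_LINKS = {"A": ["B"], "B": ["A", "C"], "C": ["A", "B"]}


def pagerank(out_links, damping=DAMPING, digits=3, max_iter=100):
    """Iterate PR(p) = (1 - d) + d * sum(PR(q) / C(q)) over pages q linking to p.

    Stops when no rank changes in the first `digits` decimal places.
    Returns the list of rank dictionaries, one per iteration.
    """
    ranks = {page: 0.0 for page in out_links}
    in_links = {
        page: [src for src, targets in out_links.items() if page in targets]
        for page in out_links
    }
    history = []
    for _ in range(max_iter):
        previous = dict(ranks)
        for page in ranks:  # in-place update: later pages use fresh values
            incoming = sum(ranks[src] / len(out_links[src]) for src in in_links[page])
            ranks[page] = (1 - damping) + damping * incoming
        history.append(dict(ranks))
        if all(round(ranks[p], digits) == round(previous[p], digits) for p in ranks):
            break
    return history


history = pagerank(OUT_LINKS)
for i, row in enumerate(history, start=1):
    print(i, " ".join(f"PR({page})={value:.3f}" for page, value in row.items()))
```

Because the script carries full floating-point precision between iterations, the iteration at which it stops may differ slightly from a hand calculation that rounds at every step.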
Problem 6 (total 40 points). This problem is a practice of sequence clustering using
SQL Server 2008. Lesson 4 of Intermediate Data Mining tutorial illustrates how to build
and analyze a sequence clustering model. It has five tasks. They are:
1. Creating a Sequence Clustering Mining Model Structure
2. Processing the Sequence Clustering Model
3. Exploring the Sequence Clustering Model
4. Creating a Related Sequence Clustering Model
5. Creating Predictions on a Sequence Clustering Model
Among these, you are required to perform the first four tasks, except the section titled
Generic Content Tree Viewer which is in the 3rd task.
The tutorial explains the basic concept of Microsoft sequence clustering. If you need
more information about their sequence clustering, refer to appropriate Microsoft
documentation (links to some documents are included in the tutorial).
Requirements: At the beginning of the 3rd task (Exploring the Sequence Clustering
Model), you will rename two clusters as Pacific Cluster and Largest Cluster. After that,
you will explore your model using various tabs under Mining Model Viewer.
Problem 6-1 (10 points). After you explore your mining model in the 3rd task
(Exploring the Sequence Clustering Model), click the Cluster Characteristics tab and choose
Pacific Cluster from the Cluster drop-down menu. (1) Capture the screen and paste it into
your submission. (2) List, in decreasing order of probability, the top three models that are
most frequently put in a customer’s shopping basket as the first model (or item). (3)
Choose the 2-model sequence that is most frequently put in a customer’s shopping basket.
(2) Mountain-200, Patch Kit, ML Mountain Tire
(3) Mountain-200, Hydration Pack
Problem 6-2 (10 points). This time select Largest Cluster from the Cluster drop-down menu.
(1) Capture the screen and paste it into your submission. (2) List, in decreasing order of
probability, the top three models that are most frequently put in a customer’s shopping
basket as the first model (or item). (3) Choose the 2-model sequence that is most
frequently put in a customer’s shopping basket.
(2) Sport-100, Water Bottle, Cycling Cap
(3) Mountain Bottle Cage, Water Bottle
Problem 6-3 (10 points). After you create a related sequence clustering model in the 4th
task (Creating a Related Sequence Clustering Model) in the Mining Model Viewer, click the
State Transitions tab. To the left of the window, there is a slider. Raise the slider all the
way up to All Links. Capture the screen and paste it into your submission.
Problem 6-4 (10 points). Explore the state transition diagram and answer the following
question. This question is about state transition diagrams generated by Microsoft
Sequence Clustering algorithm. A state transition diagram is one of the most fundamental
tools that is used to represent behavior of many objects, including computers and human
beings. It is intuitively easy to understand but powerful enough to model even very
complex behavior of objects. Here, it is used to model behavior of customers in terms of
in which order customers put models (or items) in their shopping basket. We can also
analyze which sequences of models appear frequently in customers’ baskets. Again, the
basic concept is briefly described in the tutorial. If you need more information, you can
consult Microsoft documentation or any book that describes state transition diagrams (the
state transition diagrams generated by Microsoft Sequence Clustering algorithm are not
different from any typical state transition diagram). You can even google with “state
transition diagram” and you will find plenty of good material.
Consider the following state transition diagram:
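[Figure: state transition diagram with the states Hydration Pack, Mountain Tire Tube, Classic Vest, Patch Kit, ML Mountain Tire, and Mountain-200. The edge probabilities 0.63, 0.09, 0.21, 0.34, 0.17, 0.44, 0.32, 0.02, and 0.11 appear in the diagram, but which transition each one labels is not recoverable from the text.]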
Note that this is not a complete state transition diagram. Some links are omitted and some
numbers are also missing. You will not see this diagram from your model viewer.
Answer the following questions using this state transition diagram.
(1) Suppose a customer just put Mountain-200 into her shopping basket. Can you
determine which item is the most likely one that will be put in the basket right
after that? If your answer is yes, which one is it? If your answer is no, explain
why.
Yes, Hydration Pack
(2) Suppose that a customer just put Mountain Tire Tube in her basket. Can you
determine which item is the most likely one that was put in the basket right before
that? If your answer is yes, which one is it? If your answer is no, explain why.
Yes, Hydration Pack
(3) Suppose a customer just put Mountain-200 into her shopping basket. What is the
probability that the customer will put in her basket the following three models (or
items), in the order as shown, Hydration Pack, Mountain Tire Tube, and Patch Kit
right after she put Mountain-200?
.44*.63*.34=.09
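The product above treats the diagram as a first-order Markov chain: the probability of a whole sequence, given its starting state, is the product of the probabilities on the edges it follows. A minimal sketch follows; the placement of the three edge weights is taken from the answer above rather than from the diagram, so it is an assumption, and `path_probability` is an illustrative name.

```python
from math import prod

# Edge probabilities as used in the answer above; which edges they label
# on the diagram is assumed from that answer.
TRANSITIONS = {
    ("Mountain-200", "Hydration Pack"): 0.44,
    ("Hydration Pack", "Mountain Tire Tube"): 0.63,
    ("Mountain Tire Tube", "Patch Kit"): 0.34,
}


def path_probability(path, transitions):
    """Probability of following `path` step by step, given its first state."""
    return prod(transitions[(a, b)] for a, b in zip(path, path[1:]))


path = ["Mountain-200", "Hydration Pack", "Mountain Tire Tube", "Patch Kit"]
probability = path_probability(path, TRANSITIONS)
print(f"{probability:.4f}")
```

A transition missing from the dictionary raises a `KeyError`, which is a reasonable signal here because the diagram is stated to be incomplete.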
Submission:
Submit all solutions in a single Word or PDF document and upload it to the course
assignments section. Please make sure that there are no spaces in the file name. Use
firstName_lastName_HW6.doc or firstName_lastName_HW6.pdf as the file name.
To capture the snapshot of a screen, with the required window selected, press CTRL-ALT-PrintScreen. Then, do a Paste in the Word document.