Assignment 6 – Clustering and Other Data Mining Methods

Note: Show all your work.

Problem 1 (20 points). Consider the following two clusters as shown in the figure. Here, filled circles are Cluster1 objects and unfilled circles are Cluster2 objects.

[Figure: scatter plot of the eight objects on a grid with both axes running from 0 to 9]

Cluster1: (2,2) (2,4) (4,4) (4,6)
Cluster2: (4,2) (2,6) (2,8) (4,8)

Using two iterations of the k-Means clustering algorithm, show the two clusters at the end of each iteration. You don’t need to draw figures like the one above. It is enough to indicate which objects belong to each cluster at the end of each iteration. Again, show all your work and use the Manhattan distance metric to compute the distance between two objects.

Iteration 1. Centroids of the initial clusters:

C1(x) = (2+2+4+4)/4 = 3
C1(y) = (2+4+4+6)/4 = 4
C1 = (3,4)
C2(x) = (4+2+2+4)/4 = 3
C2(y) = (2+6+8+8)/4 = 6
C2 = (3,6)

Manhattan distance from each object to each centroid:

Object   Distance to C1 = (3,4)    Distance to C2 = (3,6)    Closest
(2,2)    |2-3| + |2-4| = 3         |2-3| + |2-6| = 5         C1
(2,4)    |2-3| + |4-4| = 1         |2-3| + |4-6| = 3         C1
(2,6)    |2-3| + |6-4| = 3         |2-3| + |6-6| = 1         C2
(2,8)    |2-3| + |8-4| = 5         |2-3| + |8-6| = 3         C2
(4,2)    |4-3| + |2-4| = 3         |4-3| + |2-6| = 5         C1
(4,4)    |4-3| + |4-4| = 1         |4-3| + |4-6| = 3         C1
(4,6)    |4-3| + |6-4| = 3         |4-3| + |6-6| = 1         C2
(4,8)    |4-3| + |8-4| = 5         |4-3| + |8-6| = 3         C2

Clusters at the end of iteration 1:
C1 = (2,2) (2,4) (4,2) (4,4)
C2 = (2,6) (2,8) (4,6) (4,8)

Iteration 2. Centroids of the new clusters:

C1(x) = (2+2+4+4)/4 = 3
C1(y) = (2+4+2+4)/4 = 3
C1 = (3,3)
C2(x) = (2+2+4+4)/4 = 3
C2(y) = (6+8+6+8)/4 = 7
C2 = (3,7)

Manhattan distance from each object to each centroid:

Object   Distance to C1 = (3,3)    Distance to C2 = (3,7)    Closest
(2,2)    |2-3| + |2-3| = 2         |2-3| + |2-7| = 6         C1
(2,4)    |2-3| + |4-3| = 2         |2-3| + |4-7| = 4         C1
(2,6)    |2-3| + |6-3| = 4         |2-3| + |6-7| = 2         C2
(2,8)    |2-3| + |8-3| = 6         |2-3| + |8-7| = 2         C2
(4,2)    |4-3| + |2-3| = 2         |4-3| + |2-7| = 6         C1
(4,4)    |4-3| + |4-3| = 2         |4-3| + |4-7| = 4         C1
(4,6)    |4-3| + |6-3| = 4         |4-3| + |6-7| = 2         C2
(4,8)    |4-3| + |8-3| = 6         |4-3| + |8-7| = 2         C2

Clusters at the end of iteration 2 (unchanged from iteration 1):
C1 = (2,2) (2,4) (4,2) (4,4)
C2 = (2,6) (2,8) (4,6) (4,8)

Problem 2 (10 points). Consider the following two clusters:

[Figure: clusters C0 and C1 plotted on a grid with both axes running from 0 to 9]

Compute the distance between the two clusters (1) using minimum distance and (2) using average distance. These distance measures are defined on page 410 of the textbook. Use the Manhattan distance measure.

This was on page 461 in my book.

Min distance = |4 - 5| + |5 - 5| = 1
Average distance = |2.5 - 6| + |5 - 5| = 3.5

Problem 3 (10 points). Suppose that you issued a query against a search engine and the search engine retrieved 125 documents. Suppose that there are a total of 250 documents on the whole internet that match your query (i.e., these are relevant documents) and 40 of them were included among the 125 retrieved documents. Compute precision, recall, and F_score.

P = 40/125 = .32
R = 40/250 = .16

I must be missing it, but I do not see F_score in the lecture notes.

Problem 4 (10 points total). The following is a short dictionary of words found in analyzing the inaugural speeches of the past few presidents:

{America, Bible, century, ideal, life, nation, people, story, time, today, word, world}

The following table shows the frequencies of the occurrences of these words in the speeches of the past four presidents:

President   Word      Frequency
Bush Jr     ideal     2
            life      1
            nation    1
            people    1
            story     7
            world     2
Clinton     America   4
            ideal     1
            life      1
            time      1
            word      1
            world     3
Bush Sr     America   1
            Bible     2
            nation    1
            today     2
            word      2
Reagan      Bible     1
            century   1
            people    1
            time      2
            today     1
            world     2

Consider each speech as a document, and let the speech of Bush Jr. be D1, that of Clinton be D2, that of Bush Sr. be D3, and that of Reagan be D4.

(1). Represent each document as a binary vector.
(2).
Compute the similarities among the four documents and decide which two are closest to each other. Use the cosine measure for document similarity.

Pair     V·W    ||V||    ||W||    cos θ    θ (degrees)
D1,D2    9      2√15     √29      .22      77.3
D1,D3    1      2√15     √14      .03      88
D1,D4    5      2√15     2√3      .19      79
D2,D3    6      √29      √14      .3       72.5
D2,D4    8      √29      2√3      .43      64.5
D3,D4    4      √14      2√3      .31      71.9

Here V and W are the word-frequency vectors of the first and second document in each pair. D2 and D4 have the largest cosine (.43), so they are the closest to each other.

Problem 5 (10 points). Refer to the following figure that shows link (or reference) relationships among three web pages.

[Figure: link relationships among Page A, Page B, and Page C]

Compute the page ranks of the three pages using the method described in the lecture (Section 4.6 of Module 6). The computation is done iteratively. You have to perform the iteration until no rank changes in the first three digits after the decimal point. Show the three page rank values in each iteration. Use 0.85 as the damping factor.

C(A)=1 C(B)=2 C(C)=2

Iteration 1 (all ranks start at 0):
PR(A) = .15 + .85(0/2 + 0/2) = .15
PR(B) = .15 + .85(.15/1 + 0/2) = .278
PR(C) = .15 + .85(.278/2) = .268

Iteration 2:
PR(A) = .15 + .85(.278/2 + .268/2) = .382
PR(B) = .15 + .85(.382/1 + .268/2) = .589
PR(C) = .15 + .85(.589/2) = .4

Iteration 3:
PR(A) = .15 + .85(.589/2 + .4/2) = .57
PR(B) = .15 + .85(.57/1 + .4/2) = .805
PR(C) = .15 + .85(.805/2) = .492

Iter   PR(A)   PR(B)   PR(C)
1      0.150   0.278   0.268
2      0.382   0.589   0.400
3      0.570   0.805   0.492
4      0.701   0.955   0.556
5      0.792   1.060   0.600
6      0.855   1.132   0.631
7      0.900   1.183   0.653
8      0.930   1.218   0.668
9      0.951   1.242   0.678
10     0.966   1.259   0.685
11     0.977   1.271   0.690
12     0.984   1.279   0.694
13     0.989   1.285   0.696
14     0.992   1.289   0.698
15     0.995   1.292   0.699
16     0.996   1.294   0.700
17     0.997   1.295   0.700
18     0.998   1.296   0.701
19     0.999   1.297   0.701
20     0.999   1.297   0.701
21     0.999   1.298   0.701
22     1.000   1.298   0.702
23     1.000   1.298   0.702
24     1.000   1.298   0.702

Problem 6 (total 40 points). This problem is an exercise in sequence clustering using SQL Server 2008. Lesson 4 of the Intermediate Data Mining tutorial illustrates how to build and analyze a sequence clustering model. It has five tasks:

1. Creating a Sequence Clustering Mining Model Structure
2. Processing the Sequence Clustering Model
3. Exploring the Sequence Clustering Model
4. Creating a Related Sequence Clustering Model
5. Creating Predictions on a Sequence Clustering Model

You are required to perform the first four tasks, except for the section titled Generic Content Tree Viewer in the 3rd task. The tutorial explains the basic concept of Microsoft sequence clustering. If you need more information about Microsoft sequence clustering, refer to the appropriate Microsoft documentation (links to some documents are included in the tutorial).

Requirements: At the beginning of the 3rd task (Exploring the Sequence Clustering Model), you will rename two clusters as Pacific Cluster and Largest Cluster. After that, you will explore your model using the various tabs under Mining Model Viewer.

Problem 6-1 (10 points). After you explore your mining model in the 3rd task (Exploring the Sequence Clustering Model), click the Cluster Characteristics tab and choose Pacific Cluster from the Cluster drop-down menu.
(1) Capture the screen and paste it into your submission.
(2) List the top three models that are most frequently put in a customer’s shopping basket as the first model (or item), in decreasing order of probability.
(3) Choose the 2-model sequence that is most frequently put in a customer’s shopping basket.

Answers:
(2) Mountain-200, Patch Kit, ML Mountain Tire
(3) Mountain-200, Hydration Pack

Problem 6-2 (10 points). This time select Largest Cluster from the Cluster drop-down menu.
(1) Capture the screen and paste it into your submission.
(2) List the top three models that are most frequently put in a customer’s shopping basket as the first model (or item), in decreasing order of probability.
(3) Choose the 2-model sequence that is most frequently put in a customer’s shopping basket.

Answers:
(2) Sport-100, Water Bottle, Cycling Cap
(3) Mountain Bottle Cage, Water Bottle

Problem 6-3 (10 points).
After you create a related sequence clustering model in the 4th task (Creating a Related Sequence Clustering Model), click the State Transitions tab in the Mining Model Viewer. There is a slider at the left of the window. Raise the slider all the way up to All Links. Capture the screen and paste it into your submission.

Problem 6-4 (10 points). Explore the state transition diagram and answer the following questions.

These questions are about state transition diagrams generated by the Microsoft Sequence Clustering algorithm. A state transition diagram is one of the most fundamental tools used to represent the behavior of many kinds of objects, including computers and human beings. It is intuitively easy to understand but powerful enough to model even very complex behavior. Here, it is used to model the behavior of customers in terms of the order in which they put models (or items) in their shopping baskets. We can also analyze which sequences of models appear frequently in customers’ baskets. Again, the basic concept is briefly described in the tutorial. If you need more information, you can consult Microsoft documentation or any book that describes state transition diagrams (the diagrams generated by the Microsoft Sequence Clustering algorithm are no different from a typical state transition diagram). You can also search the web for “state transition diagram” and you will find plenty of good material.

Consider the following state transition diagram:

[Figure: partial state transition diagram with the states Hydration Pack, Mountain Tire Tube, Classic Vest, Patch Kit, ML Mountain Tire, and Mountain-200; the transition probabilities shown are 0.63, 0.09, 0.21, 0.34, 0.17, 0.44, 0.32, 0.02, and 0.11]

Note that this is not a complete state transition diagram. Some links are omitted and some numbers are missing. You will not see this diagram in your model viewer. Answer the following questions using this state transition diagram.

(1) Suppose a customer just put Mountain-200 into her shopping basket. Can you determine which item is most likely to be put in the basket right after that? If your answer is yes, which one is it? If your answer is no, explain why.

Yes, Hydration Pack.

(2) Suppose that a customer just put Mountain Tire Tube in her basket. Can you determine which item is most likely to have been put in the basket right before that? If your answer is yes, which one is it? If your answer is no, explain why.

Yes, Hydration Pack.

(3) Suppose a customer just put Mountain-200 into her shopping basket. What is the probability that the customer will put the following three models (or items) in her basket, in the order shown, right after she put in Mountain-200: Hydration Pack, Mountain Tire Tube, and Patch Kit?

.44 × .63 × .34 ≈ .09

Submission: Submit all solutions in a single Word or PDF document and upload it to the course assignments section. Please make sure that there are no spaces in the file name. Use firstName_lastName_HW6.doc or firstName_lastName_HW6.pdf as the file name. To capture a snapshot of the screen, select the required window and press CTRL-ALT-PrintScreen. Then paste the image into the Word document.
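
The hand computations in Problems 1, 3, and 5 can be cross-checked with a short script. What follows is a minimal Python sketch, not part of the assignment or the lecture method. The function names are hypothetical. The F-score is assumed to be the balanced F1 = 2PR/(P+R), which may differ from the definition used in the course. The PageRank link structure (A links to B; B links to A and C; C links to A and B) is an assumption inferred from the out-link counts C(A)=1, C(B)=2, C(C)=2 and the iteration formulas in Problem 5. Like the hand computation, the PageRank loop reuses each newly computed rank within the same iteration, but it does not round intermediate values.

```python
def manhattan(p, q):
    """Manhattan (city-block) distance between two 2-D points."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def centroid(points):
    """Mean point of a non-empty list of 2-D points."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def kmeans_step(clusters):
    """One k-means iteration: compute centroids, then reassign every point.

    Assumes no cluster is empty. Ties go to the lower-numbered cluster.
    Returns (centroids used, new clusters).
    """
    centers = [centroid(c) for c in clusters]
    points = sorted(p for c in clusters for p in c)
    new_clusters = [[] for _ in centers]
    for p in points:
        dists = [manhattan(p, c) for c in centers]
        new_clusters[dists.index(min(dists))].append(p)
    return centers, new_clusters


def precision_recall_f1(retrieved, relevant, relevant_retrieved):
    """Precision, recall, and balanced F1 (assumed F-score definition)."""
    p = relevant_retrieved / retrieved
    r = relevant_retrieved / relevant
    return p, r, 2 * p * r / (p + r)


def pagerank(iterations, d=0.85):
    """Iterate PR(X) = (1 - d) + d * sum(PR(T) / C(T)) for the assumed graph.

    Assumed links: A -> B; B -> A, C; C -> A, B. All ranks start at 0.
    """
    a = b = c = 0.0
    for _ in range(iterations):
        a = (1 - d) + d * (b / 2 + c / 2)   # A is linked from B and C
        b = (1 - d) + d * (a / 1 + c / 2)   # B is linked from A and C
        c = (1 - d) + d * (b / 2)           # C is linked from B
    return a, b, c


if __name__ == "__main__":
    clusters = [[(2, 2), (2, 4), (4, 4), (4, 6)],
                [(4, 2), (2, 6), (2, 8), (4, 8)]]
    for i in (1, 2):
        centers, clusters = kmeans_step(clusters)
        print("iteration", i, "centroids", centers, "clusters", clusters)
    print("P, R, F1:", precision_recall_f1(125, 250, 40))
    print("page ranks:", [round(v, 3) for v in pagerank(24)])
```

Running the script prints the centroids and cluster memberships for both k-means iterations, the precision, recall, and F1 values for Problem 3, and the page ranks after 24 iterations, which can be compared with the hand-computed values.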