Data Mining 2010
Assignment 2: Clustering and Classification

General

This assignment consists of 2 parts with a total of 3 questions. The assignment should be made in teams of two students. Solutions should be handed in no later than Friday, June 25. Send your report, containing the answers to the questions below, in Word or PDF format by e-mail to ad@cs.uu.nl. Name your file dmopdr2.studentid, where studentid is replaced by the student ID of either team member. Put your names and student numbers on your work and in the body of your e-mail. Reports should be written in Dutch unless you are not a native speaker, in which case you may write your report in English. The report will be graded pass or fail.

Part I: Animal Dental Records

We have 'dental records' on 66 animals. For each animal the following 8 attributes have been recorded:

1. Number of top incisors
2. Number of bottom incisors
3. Number of top canines
4. Number of bottom canines
5. Number of top pre-molars
6. Number of bottom pre-molars
7. Number of top molars
8. Number of bottom molars

The objective is to cluster the animals into a number of groups with similar dental records.

Question 1: k-means clustering

One possibility is to use the k-means clustering algorithm for this purpose.

a) Complete the following table by filling in the within-cluster sum of squares (also called sum of squared errors, or SSE for short) for five different random starts (round to one decimal place). In Weka, select the cluster mode Use training set. For each value of k, report the best solution found (the lowest SSE) in the final column.

             random seed
k      10     20     30     40     50     best
2
3
4
5
6

What number of clusters would you choose on the basis of the results?

b) Right-click the best result for k = 3, and select Visualize cluster assignments. Slide Jitter to its maximum value and enlarge the window to get a good visualization. Select the following X and Y attributes:

1. X = top-premolars, Y = bottom-premolars
2. X = top-molars, Y = bottom-molars
3. X = bottom-incisors, Y = bottom-canines
4. X = top-incisors, Y = top-canines

Which pair of attributes seems to give the best separation of the clusters? Explain.

c) We can also use Manhattan distance rather than Euclidean distance to find the clusters. Find the best clusters for k = 3, using the same random seeds as under a). Do you get the same clustering?

d) To get a concise description of the best clustering produced for k = 3, we are going to give it to a tree classifier. In the Visualize cluster assignments window, select Save to output the cluster assignment to a data file. In the data file, replace Cluster by class in

@attribute Cluster {cluster1,cluster2,cluster3}

Load this file, and apply J48 (disable pruning and keep the parameter M at its default value of 2). Evaluate on the training sample. Does it give a good description of the clusters? Visualize the tree. Is it what you expected considering what you found under question b)? Explain.

Select Visualize classifier errors and select the appropriate X and Y attributes (slide Jitter to its maximum and enlarge the window to get a good visualization). You can right-click on the plotted points to get additional information. Which animal is assigned to the wrong cluster? Can you get a perfect description of the clusters by changing the value of the parameter M?

Question 2: DBScan

(Note: This question requires Weka 3.6. I have requested its installation on the university computers, but this may not have happened yet on Thursday, June 10. In that case, proceed with the next question, and return to this exercise later.)

This question is more open-ended than the previous one. Cluster the data with the DBScan algorithm. Try different settings of the parameters epsilon and minPoints. Discuss the clusterings you find with different parameter settings (in particular, the number of clusters) in relation to the meaning of the parameters. Pick one of the clusterings you find and discuss its 'biological plausibility'.
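As an aid to thinking about the meaning of epsilon and minPoints, the following minimal Python sketch marks the 'core points' of a data set: records that have at least minPoints records (counting themselves) within distance epsilon. It is not Weka's implementation; Weka's distance computation and defaults may differ. The function names, the use of plain Euclidean distance, and the convention that a point counts as its own neighbour are assumptions made for illustration. The five records are taken from the dentition data of this assignment.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def core_points(records, epsilon, min_points):
    """Return the names of records that have at least min_points records
    (the record itself included, by assumption) within distance epsilon."""
    core = []
    for name, teeth in records.items():
        neighbours = sum(
            1 for other in records.values()
            if euclidean(teeth, other) <= epsilon
        )
        if neighbours >= min_points:
            core.append(name)
    return core


# Five rows of the dentition data (attribute order: I, i, C, c, P, p, M, m).
records = {
    "Beaver":    (1, 1, 0, 0, 2, 1, 3, 3),
    "Marmot":    (1, 1, 0, 0, 2, 1, 3, 3),
    "Groundhog": (1, 1, 0, 0, 2, 1, 3, 3),
    "Armadillo": (0, 0, 0, 0, 0, 0, 8, 8),
    "Wolf":      (3, 3, 1, 1, 4, 4, 2, 3),
}

# Small epsilon: only the three identical rodent records are core points.
print(core_points(records, epsilon=0.5, min_points=3))

# Larger epsilon: the Wolf now has the three rodents in its neighbourhood.
print(core_points(records, epsilon=5.0, min_points=3))
```

With epsilon = 0.5 only the three identical records qualify; with epsilon = 5.0 the Wolf (at distance of about 4.9 from the rodents) becomes a core point as well, while the Armadillo stays isolated under both settings.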
The Dentition Data

The columns I, i, C, c, P, p, M, m follow the attribute order given in Part I: incisors, canines, pre-molars, and molars, with upper case for top and lower case for bottom.

 N  Animal            I  i  C  c  P  p  M  m
 1  Opossum           5  4  1  1  3  3  4  4
 2  Hairy tail mole   3  3  1  1  4  4  3  3
 3  Common mole       3  2  1  0  3  3  3  3
 4  Star nose mole    3  3  1  1  4  4  3  3
 5  Brown bat         2  3  1  1  3  3  3  3
 6  Silver hair bat   2  3  1  1  2  3  3  3
 7  Pigmy bat         2  3  1  1  2  2  3  3
 8  House bat         2  3  1  1  1  2  3  3
 9  Red bat           1  3  1  1  2  2  3  3
10  Hoary bat         1  3  1  1  2  2  3  3
11  Lump nose bat     2  3  1  1  2  3  3  3
12  Armadillo         0  0  0  0  0  0  8  8
13  Pika              2  1  0  0  2  2  3  3
14  Snowshoe rabbit   2  1  0  0  3  2  3  3
15  Beaver            1  1  0  0  2  1  3  3
16  Marmot            1  1  0  0  2  1  3  3
17  Groundhog         1  1  0  0  2  1  3  3
18  Prairie Dog       1  1  0  0  2  1  3  3
19  Ground Squirrel   1  1  0  0  2  1  3  3
20  Chipmunk          1  1  0  0  2  1  3  3
21  Gray squirrel     1  1  0  0  1  1  3  3
22  Fox squirrel      1  1  0  0  1  1  3  3
23  Pocket gopher     1  1  0  0  1  1  3  3
24  Kangaroo rat      1  1  0  0  1  1  3  3
25  Pack rat          1  1  0  0  0  0  3  3
26  Field mouse       1  1  0  0  0  0  3  3
27  Muskrat           1  1  0  0  0  0  3  3
28  Black rat         1  1  0  0  0  0  3  3
29  House mouse       1  1  0  0  0  0  3  3
30  Porcupine         1  1  0  0  1  1  3  3
31  Guinea pig        1  1  0  0  1  1  3  3
32  Coyote            1  3  1  1  4  4  3  3
33  Wolf              3  3  1  1  4  4  2  3
34  Fox               3  3  1  1  4  4  2  3
35  Bear              3  3  1  1  4  4  2  3
36  Civet cat         3  3  1  1  4  4  2  2
37  Raccoon           3  3  1  1  4  4  3  2
38  Marten            3  3  1  1  4  4  1  2
39  Fisher            3  3  1  1  4  4  1  2
40  Weasel            3  3  1  1  3  3  1  2
41  Mink              3  3  1  1  3  3  1  2
42  Ferret            3  3  1  1  3  3  1  2
43  Wolverine         3  3  1  1  4  4  1  2
44  Badger            3  3  1  1  3  3  1  2
45  Skunk             3  3  1  1  3  3  1  2
46  River otter       3  3  1  1  4  3  1  2
47  Sea otter         3  2  1  1  3  3  1  2
48  Jaguar            3  3  1  1  3  2  1  1
49  Ocelot            3  3  1  1  3  2  1  1
50  Cougar            3  3  1  1  3  2  1  1
51  Lynx              3  3  1  1  3  2  1  1
52  Fur seal          3  2  1  1  4  4  1  1
53  Sea lion          3  2  1  1  4  4  1  1
54  Walrus            1  0  1  1  3  3  0  0
55  Grey seal         3  2  1  1  3  3  2  2
56  Elephant seal     2  1  1  1  4  4  1  1
57  Peccary           2  3  1  1  3  3  3  3
58  Elk               0  4  1  0  3  3  3  3
59  Deer              0  4  0  0  3  3  3  3
60  Moose             0  4  0  0  3  3  3  3
61  Reindeer          0  4  1  0  3  3  3  3
62  Antelope          0  4  0  0  3  3  3  3
63  Bison             0  4  0  0  3  3  3  3
64  Mountain goat     0  4  0  0  3  3  3  3
65  Musk ox           0  4  0  0  3  3  3  3
66  Mountain sheep    0  4  0  0  3  3  3  3

Part II: The Federalist Papers

We consider the problem: can we attribute a text to an author? This problem may occur for several reasons, for example because the author has remained anonymous or uses an alias. A controversial example is: have all works that have been attributed to Shakespeare really been written by him?

You have to clarify the authorship of some of the so-called Federalist Papers. These papers were written by Alexander Hamilton, John Jay, and James Madison to convince the citizens of the state of New York to ratify the constitution. The author of most papers is known, but for twelve papers authorship is contested between Hamilton and Madison.

You have at your disposal a data file that contains, for each text, a number of characteristics that have been extracted with the text analysis program Docuscope. The file contains data on papers written by Hamilton, papers written by Madison, and the contested papers.

Question 3: Classification/Clustering

The assignment is to analyse the data in order to indicate the likely author of the contested papers. It can also be a valid conclusion that nothing sensible can be said about the likely authorship. Report your findings and give a justification of the methods of analysis you have applied.

Federalist Paper Data

Below, we give a description of the available data. For more information on attributes 4-21, please consult the documentation of Docuscope on http://betterwriting.net/projects/fed01

1. TextNumber: Number of text as found on http://thomas.loc.gov/home/histdox/fedpapers.html
2. TextName: Name of file as found on http://thomas.loc.gov/home/histdox/fedpapers.html
3. Group: 1 = Hamilton, 2 = Madison, 3 = authorship contested
4. FirstPerson
5. InnerThinking
6. ThinkPositive
7. ThinkNegative
8. ThinkingAhead
9. ThinkingBack
10. WordPicture
11. SpaceInterval
12. Motion
13. PastEvents
14. ShiftingEvts
15. TimeInterval
16. CueComKnow
17. CuePriorText
18. CueReader
19. CueNotifier
20. CueMovement
21. CueReasoning