Data Mining 2010 Assignment 2: Clustering and Classification

General
This assignment consists of 2 parts with a total of 3 questions. The assignment should be completed in teams of two students. Solutions must be handed in
no later than Friday, June 25. Send your report, containing the answers to the
questions below, in Word or PDF format by e-mail to ad@cs.uu.nl. Name
your file dmopdr2.studentid, where studentid is replaced by the student number
of either team member. Put your name and student number on your work
and in the body of your e-mail. Reports should be written in Dutch unless
you are not a native speaker, in which case you may write your report in
English. The report will be graded pass or fail.
Part I: Animal Dental Records
We have ’dental records’ on 66 animals. For each animal the following 8
attributes have been recorded:
1. Number of top incisors
2. Number of bottom incisors
3. Number of top canines
4. Number of bottom canines
5. Number of top pre-molars
6. Number of bottom pre-molars
7. Number of top molars
8. Number of bottom molars
The objective is to cluster the animals into a number of groups with similar
dental records.
Question 1: k-means clustering
One possibility is to use the k-means clustering algorithm for this purpose.
a) Complete the following table by filling in the within-cluster sum of
squares (also called sum of squared errors or SSE for short) for five
different random starts (round to one decimal place). Select cluster
mode Use training set in Weka. For each value of k, report the best
solution found in the final column.
              random seed
 k    10    20    30    40    50    best
 2
 3
 4
 5
 6
What number of clusters would you choose on the basis of the results?
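As a rough illustration of what the SSE in the table measures, the sketch below runs a plain k-means (Lloyd's algorithm) from several random seeds on a small, made-up 2-D data set and keeps the lowest SSE. The data, the function names sq_dist, sse and kmeans, and the initialisation (k randomly sampled data points) are assumptions made for this example. Weka's SimpleKMeans may initialise and scale the attributes differently, so its numbers need not match a hand-written version like this one.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def sse(points, centers, labels):
    """Within-cluster sum of squares: squared distance of each point to its own center."""
    return sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))

def kmeans(points, k, seed, max_iter=100):
    rng = random.Random(seed)
    # Assumption: start from k data points picked at random.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments stable, so the algorithm has converged
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels, sse(points, centers, labels)

# Made-up data: two well-separated squares of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
results = {seed: kmeans(points, 2, seed)[2] for seed in (10, 20, 30, 40, 50)}
best = min(results.values())
```

Because k-means only finds a local optimum, the SSE can differ between seeds on less tidy data, which is why the table asks for the best of five runs. SSE also tends to decrease as k grows, so the lowest SSE alone does not determine a sensible number of clusters.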
b) Right-click the best result for k = 3, and select Visualize cluster assignments. Slide Jitter to its maximum value and enlarge the window
to get a good visualization. Select the following X and Y attributes:
1. X = top-premolars, Y = bottom-premolars
2. X = top-molars, Y = bottom-molars
3. X = bottom-incisors, Y = bottom-canines
4. X = top-incisors, Y = top-canines
Which pair of attributes seems to give the best separation of the clusters? Explain.
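Jitter matters here because the tooth counts are small integers, so many animals share exactly the same (X, Y) position and are drawn on top of each other. A minimal sketch of the idea, assuming a uniform random offset per value (the function name jitter, the amount 0.3 and the sample column are hypothetical; Weka's own jitter may be computed differently):

```python
import random

def jitter(values, amount, seed=0):
    """Shift each value by a random offset in [-amount, amount] for plotting only."""
    rng = random.Random(seed)
    return [v + rng.uniform(-amount, amount) for v in values]

# Made-up column with many identical values, as in count data.
top_molars = [3, 3, 3, 8, 3]
shifted = jitter(top_molars, amount=0.3)
```

The offsets only affect the display; the clustering itself still uses the original values.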
c) We can also use Manhattan distance rather than Euclidean distance
to find the clusters. Find the best clusters for k=3, using the same
random seeds as under a). Do you get the same clustering?
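To see why the choice of distance can change a clustering, the sketch below compares the two measures on a made-up point and two made-up cluster centers. The function names and coordinates are assumptions chosen for illustration and are not taken from the dentition data.

```python
def euclidean(p, q):
    """Straight-line distance: square root of the summed squared differences."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def manhattan(p, q):
    """City-block distance: sum of the absolute differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def nearest(point, centers, dist):
    """Index of the center closest to point under the given distance."""
    return min(range(len(centers)), key=lambda j: dist(point, centers[j]))

point = (0, 0)
centers = [(3, 3), (0, 5)]
by_euclid = nearest(point, centers, euclidean)     # distances 4.24 vs 5.0
by_manhattan = nearest(point, centers, manhattan)  # distances 6 vs 5
```

Here the same point is assigned to a different center under each measure, so the resulting clusters need not coincide. Note also that the SSE values from a) are based on squared Euclidean distance and are not directly comparable with a criterion based on Manhattan distance.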
d) To get a concise description of the best clustering produced for k = 3,
we are going to give it to a tree classifier. In the Visualize cluster
assignments window, select Save to output the cluster assignment to a
data file. In the data file, replace Cluster by class in the line
@attribute Cluster {cluster1,cluster2,cluster3}
Load this file, and apply J48 (disable pruning and keep the parameter
M on its default value 2). Evaluate on the training sample. Does it
give a good description of the clusters? Visualize the tree. Is it what
you expected considering what you found under question b? Explain.
Select Visualize classifier errors and select the appropriate X and Y
attributes (slide Jitter to maximal and enlarge the window to get a
good visualization). You can right-click on the plotted points to get
additional information. Which animal is assigned to the wrong cluster?
Can you get a perfect description of the clusters by changing the value
of the parameter M?
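The header edit described in d) can be done in any text editor. If you prefer to script it, the following is a minimal sketch; the function name rename_cluster_attribute and the sample header are hypothetical, and it assumes the saved file contains the attribute line exactly as shown above.

```python
import re

def rename_cluster_attribute(arff_text):
    """Rename the attribute 'Cluster' to 'class' on its @attribute line.

    Only the attribute name is changed; the nominal values
    {cluster1,cluster2,cluster3} are left untouched.
    """
    return re.sub(r"^(@attribute\s+)Cluster(\s)", r"\1class\2",
                  arff_text, count=1, flags=re.IGNORECASE | re.MULTILINE)

# Hypothetical fragment of a saved file, for demonstration only.
sample = "@relation teeth\n@attribute Cluster {cluster1,cluster2,cluster3}\n@data\n"
edited = rename_cluster_attribute(sample)
```

To apply it to a real file, read the file contents into a string, pass the string to the function, and write the result back under a new name so the original stays available.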
Question 2: DBScan
(Note: This question requires Weka 3.6. I have requested its installation on
the university computers, but this may not have happened yet on Thursday,
June 10. In that case, proceed with the next question, and return to this
exercise later.)
This question is more open-ended than the previous one. Cluster the data
with the DBScan algorithm. Try different settings of the parameters epsilon
and minPoints. Discuss the clusterings (in particular: the number of clusters) you find with different parameter settings in relation to the meaning of
the parameters. Pick one of the clusterings you find and discuss its ’biological
plausibility’.
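To make the roles of the two parameters concrete, here is a minimal sketch of the basic DBSCAN procedure on made-up one-dimensional data. The function name dbscan, the convention that a point counts as its own neighbour, and the label -1 for noise are assumptions of this sketch; the Weka implementation may differ in such details.

```python
NOISE = -1

def dbscan(points, eps, min_pts):
    """Label each point with a cluster index, or NOISE if it belongs to no cluster."""
    def neighbors(i):
        # All points within distance eps of point i (including i itself).
        return [j for j, q in enumerate(points)
                if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps ** 2]

    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE          # not a core point (may become a border point later)
            continue
        labels[i] = cluster            # i is a core point: start a new cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point: joins the cluster, is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point too: keep growing
        cluster += 1
    return labels

# Made-up data: two tight groups and one isolated point.
points = [(0,), (1,), (2,), (10,), (11,), (12,), (50,)]
tight = dbscan(points, eps=1.5, min_pts=2)
loose = dbscan(points, eps=10, min_pts=2)
tiny = dbscan(points, eps=0.5, min_pts=2)
```

On this toy data a moderate epsilon yields two clusters plus one noise point, a large epsilon merges the two groups into one cluster, and a very small epsilon leaves every point as noise. Raising minPoints has the opposite effect of raising epsilon: fewer points qualify as core points, so clusters shrink or disappear.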
The Dentition Data
The columns follow the attribute order given above: I/i = top/bottom incisors, C/c = top/bottom canines, P/p = top/bottom pre-molars, M/m = top/bottom molars.

 N  Animal           I  i  C  c  P  p  M  m
 1  Opossum          5  4  1  1  3  3  4  4
 2  Hairy tail mole  3  3  1  1  4  4  3  3
 3  Common mole      3  2  1  0  3  3  3  3
 4  Star nose mole   3  3  1  1  4  4  3  3
 5  Brown bat        2  3  1  1  3  3  3  3
 6  Silver hair bat  2  3  1  1  2  3  3  3
 7  Pigmy bat        2  3  1  1  2  2  3  3
 8  House bat        2  3  1  1  1  2  3  3
 9  Red bat          1  3  1  1  2  2  3  3
10  Hoary bat        1  3  1  1  2  2  3  3
11  Lump nose bat    2  3  1  1  2  3  3  3
12  Armadillo        0  0  0  0  0  0  8  8
13  Pika             2  1  0  0  2  2  3  3
14  Snowshoe rabbit  2  1  0  0  3  2  3  3
15  Beaver           1  1  0  0  2  1  3  3
16  Marmot           1  1  0  0  2  1  3  3
17  Groundhog        1  1  0  0  2  1  3  3
18  Prairie dog      1  1  0  0  2  1  3  3
19  Ground squirrel  1  1  0  0  2  1  3  3
20  Chipmunk         1  1  0  0  2  1  3  3
21  Gray squirrel    1  1  0  0  1  1  3  3
22  Fox squirrel     1  1  0  0  1  1  3  3
23  Pocket gopher    1  1  0  0  1  1  3  3
24  Kangaroo rat     1  1  0  0  1  1  3  3
25  Pack rat         1  1  0  0  0  0  3  3
26  Field mouse      1  1  0  0  0  0  3  3
27  Muskrat          1  1  0  0  0  0  3  3
28  Black rat        1  1  0  0  0  0  3  3
29  House mouse      1  1  0  0  0  0  3  3
30  Porcupine        1  1  0  0  1  1  3  3
31  Guinea pig       1  1  0  0  1  1  3  3
32  Coyote           1  3  1  1  4  4  3  3
33  Wolf             3  3  1  1  4  4  2  3
34  Fox              3  3  1  1  4  4  2  3
35  Bear             3  3  1  1  4  4  2  3
36  Civet cat        3  3  1  1  4  4  2  2
37  Raccoon          3  3  1  1  4  4  3  2
38  Marten           3  3  1  1  4  4  1  2
39  Fisher           3  3  1  1  4  4  1  2
40  Weasel           3  3  1  1  3  3  1  2
41  Mink             3  3  1  1  3  3  1  2
42  Ferret           3  3  1  1  3  3  1  2
43  Wolverine        3  3  1  1  4  4  1  2
44  Badger           3  3  1  1  3  3  1  2
45  Skunk            3  3  1  1  3  3  1  2
46  River otter      3  3  1  1  4  3  1  2
47  Sea otter        3  2  1  1  3  3  1  2
48  Jaguar           3  3  1  1  3  2  1  1
49  Ocelot           3  3  1  1  3  2  1  1
50  Cougar           3  3  1  1  3  2  1  1
51  Lynx             3  3  1  1  3  2  1  1
52  Fur seal         3  2  1  1  4  4  1  1
53  Sea lion         3  2  1  1  4  4  1  1
54  Walrus           1  0  1  1  3  3  0  0
55  Grey seal        3  2  1  1  3  3  2  2
56  Elephant seal    2  1  1  1  4  4  1  1
57  Peccary          2  3  1  1  3  3  3  3
58  Elk              0  4  1  0  3  3  3  3
59  Deer             0  4  0  0  3  3  3  3
60  Moose            0  4  0  0  3  3  3  3
61  Reindeer         0  4  1  0  3  3  3  3
62  Antelope         0  4  0  0  3  3  3  3
63  Bison            0  4  0  0  3  3  3  3
64  Mountain goat    0  4  0  0  3  3  3  3
65  Musk ox          0  4  0  0  3  3  3  3
66  Mountain sheep   0  4  0  0  3  3  3  3
Part II: The Federalist Papers
We consider the problem of authorship attribution: can we attribute a text to an author? This problem may arise for several reasons, for example because the author has remained
anonymous or uses an alias. A controversial example: were all
works that have been attributed to Shakespeare really written by him?
You have to clarify the authorship of some of the so-called Federalist
Papers. These papers were written by Alexander Hamilton, John Jay,
and James Madison to convince the citizens of the state of New York to ratify
the constitution. For most papers the author is known, but for twelve papers
authorship is contested between Hamilton and Madison. You have at your
disposal a data file that contains, for each text, a number of characteristics
that have been extracted with the text analysis program Docuscope. The
file contains data on papers written by Hamilton, papers written by Madison, and the contested
papers.
Question 3: Classification/Clustering
The assignment is to analyse the data in order to indicate the likely author of
the contested papers. It can also be a valid conclusion that nothing sensible
can be said about the likely authorship. Report your findings and give a
justification of the methods of analysis you have applied.
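One way to justify a method is to check how well it recovers the known authors before applying it to the contested papers. The sketch below illustrates that idea with a 1-nearest-neighbour classifier and leave-one-out evaluation. The feature vectors are made up for the example and are not taken from the Docuscope file, and the function names are hypothetical; any classifier could be substituted.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest_label(x, train):
    """Label of the training item closest to x (1-nearest neighbour)."""
    return min(train, key=lambda item: sq_dist(x, item[0]))[1]

def loo_accuracy(labelled):
    """Leave-one-out: classify each known item using all the other known items."""
    hits = sum(
        nearest_label(x, labelled[:i] + labelled[i + 1:]) == y
        for i, (x, y) in enumerate(labelled)
    )
    return hits / len(labelled)

# Made-up two-feature data; group 1 and group 2 mimic the coding of the Group attribute.
known = [
    ((0.10, 0.90), 1), ((0.20, 0.80), 1), ((0.15, 0.85), 1),
    ((0.80, 0.20), 2), ((0.90, 0.10), 2), ((0.85, 0.15), 2),
]
contested = [(0.82, 0.18)]

accuracy = loo_accuracy(known)
predictions = [nearest_label(x, known) for x in contested]
```

If the accuracy on the known papers is close to what guessing would achieve, the honest conclusion is that the method says little about the contested papers.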
Federalist Paper Data
Below, we give a description of the available data. For more information on
attributes 4-21, please consult the documentation of Docuscope on
http://betterwriting.net/projects/fed01
1. TextNumber: Number of text as found on
   http://thomas.loc.gov/home/histdox/fedpapers.html
2. TextName: Name of file as found on
   http://thomas.loc.gov/home/histdox/fedpapers.html
3. Group: 1 = Hamilton, 2 = Madison, 3 = authorship contested
4. FirstPerson
5. InnerThinking
6. ThinkPositive
7. ThinkNegative
8. ThinkingAhead
9. ThinkingBack
10. WordPicture
11. SpaceInterval
12. Motion
13. PastEvents
14. ShiftingEvts
15. TimeInterval
16. CueComKnow
17. CuePriorText
18. CueReader
19. CueNotifier
20. CueMovement
21. CueReasoning