Project2 Scores

Dr. Eick
Project2 COSC 6335 Fall 2013
Traditional Clustering with K-Means and DBSCAN
Individual Project
Learning Objectives:
1. Learn to use popular clustering algorithms, namely K-means and DBSCAN
2. Learn how to summarize and interpret clustering results
3. Learn to write R functions which operate on top of clustering algorithms and clustering
results
4. Learn how to make sense of unsupervised data mining results
Deadline: 10/23/2013, 11p; electronic submission
Last Updated: 10/17/2013, 2p
Datasets: In the project we will use the Complex8 and the Yeast dataset
(http://archive.ics.uci.edu/ml/datasets/Yeast). The Complex8 dataset is a 2D dataset
and Yeast is a 6D dataset1; the last attribute of each dataset denotes a class variable which
should be ignored when clustering the datasets; however, the class variable will be used in
the post-analysis of the clusters which are generated by running K-means and DBSCAN.
Project2 Tasks:
1. Write an R-function purity(a,b,outliers=FALSE) that computes the purity of a clustering
result based on an a priori given set of class labels, where a gives the assignment of objects in
O to clusters, and b is the “ground truth”. Purity is defined as follows: Let
O be a dataset
X={C1,…,Ck} be a clustering of O with Ci ⊆ O (for i=1,…,k), C1 ∪ … ∪ Ck ⊆ O, and Ci ∩ Cj = ∅
(for i ≠ j)
PUR(X) = number_of_majority_class_examples(X) / total_number_of_examples_in_clusters(X)
If the used clustering algorithm supports outliers, outliers should be ignored in purity
computations; you can assume that cluster 0 contains all the outliers, and clusters 1,2,…,k
represent “true” clusters. If the parameter outliers is set to FALSE, the function just returns a
floating point number giving the observed purity; if the parameter outliers is set to TRUE, the
function returns a vector (<purity>, <percentage_of_outliers>); e.g., if the function returns
(0.98, 0.2), this would indicate that the purity is 98%, but 20% of the objects in dataset O have
been classified as outliers. ***
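A minimal sketch of one possible purity implementation (not the official solution; it assumes a and b are equal-length vectors of cluster assignments and class labels, with cluster 0 marking outliers):
purity <- function(a, b, outliers=FALSE) {
  frac_out <- mean(a == 0)                          # fraction of objects in the outlier "cluster" 0
  if (outliers) { b <- b[a != 0]; a <- a[a != 0] }  # ignore outliers in the purity computation
  tab <- table(a, b)                                # contingency table: clusters x classes
  pur <- sum(apply(tab, 1, max)) / sum(tab)         # majority-class counts / clustered objects
  if (outliers) c(pur, frac_out) else pur
}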
1
Preprocess the dataset as follows to obtain a 6D dataset, and then cluster the dataframe d:
a<-read.csv("yeast.csv")
c<-data.frame(a=a[,2],b=a[,3],c=a[,4],d=a[,5],e=a[,6],f=a[,7],g=a[,8],h=a[,9],z=factor(a[,10]))
d<-data.frame(a=c[,1],b=c[,2],c=c[,3],d=c[,4],e=c[,7],f=c[,8],z=c[,9])
Use the “new” attribute names when discussing clustering results.
2. Run K-means twice each for k=8 and k=11 for the Complex8 dataset2. Visualize and interpret the
four clusterings obtained! Also compute their purity using the function you developed in
Task1. **
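One possible way to obtain and inspect these clusterings is sketched below (the read.csv call and the column names x, y, class are assumptions about the Complex8 file layout):
complex8 <- read.csv("Complex8.data", header=FALSE, col.names=c("x","y","class"))
km8a  <- kmeans(complex8[,1:2], centers=8)    # first run with k=8
km8b  <- kmeans(complex8[,1:2], centers=8)    # second run with k=8
km11a <- kmeans(complex8[,1:2], centers=11)   # first run with k=11
km11b <- kmeans(complex8[,1:2], centers=11)   # second run with k=11
plot(complex8$x, complex8$y, col=km8a$cluster, pch=20)   # visualize one of the clusterings
purity(km8a$cluster, complex8$class)                     # purity, using the Task 1 function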
3. Write an R-function agreement(X,Y) *** that computes the agreement between two
clusterings X and Y of dataset O; agreement should be computed as follows:
Let n be the number of objects in the dataset O, and let X and Y be clusterings3 of O.
a. Counter := 0
b. Iterate over all pairs of objects o_j and o_r of the dataset O
i. Case 1: j ≠ r (“Increase Counter if both objects are in the same cluster
in X and Y, or if both objects are in different clusters in X and Y”)
Increase Counter by one if ((X$cluster(j)=X$cluster(r) and Y$cluster(j)=Y$cluster(r))
or (X$cluster(j)≠X$cluster(r) and Y$cluster(j)≠Y$cluster(r)))
ii. Case 2: j = r (“Increase Counter if object j is an outlier in both
clusterings X and Y, or if object j belongs to a cluster in both X
and Y”)
Increase Counter by one if ((X$cluster(j)=0 and Y$cluster(r)=0) or (X$cluster(j)>0
and Y$cluster(r)>0))
c. Report Counter/(n*(n+1)/2) as the result of agreement(X,Y)!
Using the agreement function, compute the agreement for each pair of the four DBSCAN
clusterings of the Iris dataset that were obtained using the following
parameter settings:
c1<-dbscan(iris[3:4], 0.15, 3)
c2<-dbscan(iris[3:4], 0.2, 4)
c3<-dbscan(iris[3:4], 0.4, 6)
c4<-dbscan(iris[3:4], 0.55, 17)
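A sketch of one possible agreement implementation, following the pseudocode above (it assumes X and Y are clustering results, e.g. from dbscan or kmeans, whose $cluster vector uses 0 for outliers):
agreement <- function(X, Y) {
  x <- X$cluster; y <- Y$cluster
  n <- length(x)
  counter <- 0
  for (j in 1:n) {
    for (r in j:n) {                      # each unordered pair once, including j == r
      if (j == r) {
        if ((x[j] == 0 && y[j] == 0) || (x[j] > 0 && y[j] > 0)) counter <- counter + 1
      } else {
        same_x <- x[j] == x[r]; same_y <- y[j] == y[r]
        if ((same_x && same_y) || (!same_x && !same_y)) counter <- counter + 1
      }
    }
  }
  counter / (n * (n + 1) / 2)
}
# e.g., library(fpc); agreement(c1, c2)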
4. Run K-means for k=5 for the Yeast dataset. Using techniques of your own liking, try to
characterize the contents of the 5 clusters (additionally, you might compare the statistics
of the 5 clusters with those of the complete dataset, and assess the quality of each of the 5
clusters). What do the objects belonging to the same cluster have in common? More
sophisticated and thorough approaches4 will obtain more credit. Summarize your findings!
********
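One possible (by no means prescribed) starting point for characterizing the clusters, using the dataframe d built in footnote 1:
km5 <- kmeans(d[,1:6], centers=5)
aggregate(d[,1:6], by=list(cluster=km5$cluster), FUN=mean)   # per-cluster attribute means
colMeans(d[,1:6])                                            # overall attribute means, for comparison
table(cluster=km5$cluster, class=d$z)                        # class composition of each cluster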
5. Run DBSCAN multiple times for the Yeast dataset and report and interpret a single
clustering that contains between 3 and 13 clusters with less than 20% outliers. **
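A sketch of how a single (epsilon, MinPoints) setting might be probed (the parameter values below are placeholders, not known-good choices):
library(fpc)
db <- dbscan(d[,1:6], eps=0.1, MinPts=10)       # placeholder parameters
k <- length(setdiff(unique(db$cluster), 0))     # number of "true" clusters (cluster 0 = outliers)
out_frac <- mean(db$cluster == 0)               # fraction of outliers
c(clusters=k, outlier_fraction=out_frac)        # aim for 3..13 clusters and less than 20% outliers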
2
It can be found at: http://www2.cs.uh.edu/~ml_kdd/Complex&Diamond/Complex8.data ; it has been visualized
at: http://www2.cs.uh.edu/~ml_kdd/Complex&Diamond/2DData.htm
3
You can assume cluster 0 contains the outliers, if there are any.
4
Some approaches will be discussed in the course lectures in early October. In general, you will need to come up
with your own approaches for Task4 of this project and you have a lot of freedom in choosing methods to
accomplish the goals of Task4; I expect that solutions for Task4 will differ a lot between students; the key point is
to produce something reasonable which characterizes the content of the 5 clusters, and (optionally) assesses the
quality of the obtained 5 clusters.
6. Write a search procedure in R that searches for the best clustering by exploring different
settings for the (MinPoints, epsilon) parameters of DBSCAN. The procedure maximizes
purity of the obtained clustering, subject to the following constraints:
i. There should be between 2 and 20 clusters
ii. The number of outliers should be 10% or less.
The procedure returns the “best” DBSCAN clustering found and the purity it achieves as
its result; please limit the number of tested (MinPoints, epsilon) pairs to 1000 in your
implementation! *******
Explain how your automated parameter selection method works and demonstrate your
automated procedure using an example!
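A hedged sketch of one possible search (a plain grid over (MinPoints, epsilon); the grid ranges are placeholder assumptions, and the purity() function from Task 1 is assumed to be available):
library(fpc)
dbscan_search <- function(data, labels,
                          eps_grid = seq(0.05, 1.0, length.out = 50),
                          minpts_grid = seq(3, 40, by = 2)) {   # 50 x 19 = 950 <= 1000 pairs
  best <- list(purity = -Inf, clustering = NULL, epsilon = NA, MinPoints = NA)
  for (mp in minpts_grid) {
    for (eps in eps_grid) {
      cl <- dbscan(data, eps = eps, MinPts = mp)
      k  <- length(setdiff(unique(cl$cluster), 0))   # number of clusters, ignoring outliers
      if (k < 2 || k > 20) next                      # constraint i: between 2 and 20 clusters
      if (mean(cl$cluster == 0) > 0.10) next         # constraint ii: at most 10% outliers
      p <- purity(cl$cluster, labels, outliers = TRUE)[1]
      if (p > best$purity)
        best <- list(purity = p, clustering = cl, epsilon = eps, MinPoints = mp)
    }
  }
  best
}
# e.g., res <- dbscan_search(d[,1:6], d$z); res$purity; res$epsilon; res$MinPoints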
7. Apply the procedure you developed in Task6 to the Yeast and Complex8 datasets and report
the best clustering you found. Are you happy with the obtained solution?
If you did not succeed in writing the function in Task6, manually search for the best clusterings
and report those two clusterings. *
Deliverables for Project2:
A. A Report5 which contains all deliverables for the seven tasks of Project2.
B. An Appendix which describes how to run the procedure that you developed for Task 6.
C. An Appendix which contains the software you developed as part of this project; in particular
the R-functions you wrote for tasks 1, 3, 6 should be included.
D. Delivery of Project2 Reports: send an e-mail to ceick@aol.com using the subject
Project2_<your lastname>_Report and call the attached file <last name>_P2.docx (or <last
name>_P2.pdf).
5
Single-spaced; please use an 11-point or 12-point font!
COSC 6335 Project2 2013 Scores
[Per-student score table: points earned per task, task-group subtotals, an early-submission flag ("Early"), and the last digits of each student ID ("St_id"); the column layout of the original table was lost in this copy, so the individual rows are not reproduced here.]
Remark: Scores are still preliminary, and subject to verification!
Project2 Statistics:
Average (before early submission adjustments): 26.1
Adjustment for Early Submission (5% bonus, rounded to the next half point): Zhai, Wu,
Tadakaluru, Sant, Man, Latif, Jana, Kidd, Gopiseeti, Dou, Dhawad, Cao, Bector, Barman
(*:=early submission)
?:= score not finalized and might drop; need to do some checking…