Phase (3)
Clusters and Outliers
Index:
1) Introduction
2) Clusters
2.1) White Wine dataset
2.2) Breast Tissue dataset
3) Outliers
3.1) Breast Tissue dataset
3.2) White Wine dataset
4) Conclusion
1) Introduction:
In this phase we show how to apply clustering and outlier detection to two datasets (White Wine and Breast Tissue). Clustering and outlier detection can be used on both unsupervised and supervised data, because the target class plays no part in the computation of the nearest center and is ignored automatically. Clustering is very important for discovering how many groups the data contains and which records carry similar information, so we can make important decisions from these clusters. Outlier detection, on the other hand, is used to find noisy data or data outside a specific range, and to decide whether an outlying record represents new knowledge or an error.
Note: you can open wineclusers and brestclusterkmodule from the clusters folder, and DISTANCEWINE, LOFBREST, wineLOF and distancebrest from the outlier folder.
2) Clusters:
2.1) White Wine dataset:
1) K-means was used because all attributes are numeric.
Analysis: at first we set the value of k to four to get four clusters, but when we plot the clusters we see that the centers are very close to each other. You can verify this in the plot view in figure 2.1.3 and the centroid table in figure 2.1.4. For example, if we take the alcohol attribute, the centers of cluster_3, cluster_2 and cluster_1 are all approximately 10.8. Therefore, we reduce the value of k to 2, which gives centers that are well separated from each other. So, two clusters is a very good choice.
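The same experiment can be reproduced outside RapidMiner. Below is a minimal Python sketch, assuming the wine data is available as a semicolon-separated CSV named winequality-white.csv (the file name, separator and column layout are assumptions): it runs K-means for k = 4 and k = 2 with the target class dropped, prints the centroid tables, and projects the data to two dimensions with truncated SVD for plotting, mirroring figures 2.1.3 and 2.1.6.

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
import matplotlib.pyplot as plt

# Hypothetical file name and separator; adjust to the actual dataset.
data = pd.read_csv("winequality-white.csv", sep=";")
X = data.drop(columns=["quality"])  # the target class is ignored, as noted above

for k in (4, 2):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(f"k={k} centroid table:")
    print(pd.DataFrame(km.cluster_centers_, columns=X.columns).round(2))

# Project to two dimensions with SVD so the clusters can be plotted.
X2 = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)
plt.scatter(X2[:, 0], X2[:, 1], c=km.labels_, s=5)  # colored by the k=2 labels
plt.show()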
Figure 2.1.1: The process of the K-means method on the wine dataset
Figure 2.1.2: The text view result for four clusters of the K-means method
Figure 2.1.3: The plot view of the K-means method for four clusters, after using SVD to project the dataset from multiple dimensions to two
Figure 2.1.4: The centroid table for four clusters in the wine dataset
Figure 2.1.5: The text view result for two clusters of the K-means method
Figure 2.1.6: The plot view of the K-means method for two clusters, after using SVD to project the dataset from multiple dimensions to two
Figure 2.1.7: The centroid table for two clusters in the wine dataset
2.2) Breast Tissue dataset:
1) K-medoids was used because the dataset has a nominal attribute.
Analysis: in this dataset we removed the AREA attribute because it contains noisy data (see phase one): whatever value we assigned to k, there was always a cluster containing a single record, and that record held the noisy value in the AREA attribute. So, we decided to remove it. After that we tried to choose the best k, first assigning k to six and then to two. With k equal to six we found many centers close enough to be merged into one cluster; see figure 2.2.3 and figure 2.2.4. From figure 2.2.4 you can see that cluster_1, cluster_4 and cluster_5 could form one cluster, and cluster_0, cluster_2 and cluster_3 another. Therefore, we decided to set k equal to two; see figure 2.2.6 and figure 2.2.7.
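The report uses RapidMiner's K-medoids operator; as an illustration of the underlying algorithm, here is a minimal from-scratch sketch of K-medoids (Voronoi iteration over a precomputed distance matrix). The function name k_medoids is hypothetical, and plain Euclidean distance is an assumption made for brevity; for a nominal attribute the matrix would be built with a matching-based distance instead.

import numpy as np
from scipy.spatial.distance import pdist, squareform

def k_medoids(D, k, n_iter=100, seed=0):
    """Basic K-medoids: alternate between assigning points to the nearest
    medoid and moving each medoid to the cluster member that minimizes
    the total distance to the other members."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(D.shape[0], size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            if len(members):
                costs = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[c] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids, np.argmin(D[:, medoids], axis=1)

# X is assumed to hold the numeric attributes (AREA already removed).
D = squareform(pdist(X))  # Euclidean; swap in a mixed-type distance for nominal data
medoids, labels = k_medoids(D, k=2)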
Figure 2.2.1: The process of the K-medoids method on the Breast Tissue dataset
Figure 2.2.2: The text view result for six clusters of the K-medoids method
Figure 2.2.3: The centroid table for six clusters in the Breast Tissue dataset
Figure 2.2.4: The plot view of the K-medoids method, plotting the target class against the cluster assignment (six clusters)
Figure 2.2.5: The text view result for two clusters of the K-medoids method
Figure 2.2.6: The centroid table for two clusters in the Breast Tissue dataset
Figure 2.2.7: The plot view of the K-medoids method, plotting the target class against the cluster assignment (two clusters)
3) Outliers:
3.1) Breast Tissue dataset:
Analysis: From the beginning we chose the Distance outlier, which needs k, the number of neighbors, and d, the outlier distance, using the Euclidean distance. With a starting value of k equal to ten and the number of outliers also equal to ten, we got many outliers that carried no meaning as either errors or new knowledge. From phase one we know there is only one error, in the AREA attribute of instance 103. Therefore we changed the number of outliers to one while keeping k equal to ten, to keep a large number of neighbors; see figure 3.1.1.2. In the data view table of figure 3.1.1.3 you can see there is only one outlier, with a true value in the first row. When we applied the LOF outlier we confirmed that the chosen number of outliers is correct, because the LOF outlier computes how far each point lies from its nearest neighbors; see figure 3.1.2.2 and figure 3.1.2.3. From the data view table in figure 3.1.2.3 you can see that the maximum distance also belongs to instance 103. Therefore this instance is an error outlier, because it has an invalid value in the AREA attribute, probably introduced when the value was entered.
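As an illustration of the distance-based detection used here, the following is a minimal sketch assuming the Breast Tissue attributes are loaded into a numeric matrix X (the variable name is an assumption). It scores each record by the distance to its 10th nearest neighbor, matching k = 10 from the report, and flags the single highest-scoring record, which, per the report's result, should be instance 103.

import numpy as np
from sklearn.neighbors import NearestNeighbors

k = 10  # number of neighbors, as in the report
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: each point is its own 0th neighbor
dist, _ = nn.kneighbors(X)
score = dist[:, -1]  # distance to the k-th nearest neighbor

n_outliers = 1  # only one true error is expected (instance 103)
flagged = np.argsort(score)[::-1][:n_outliers]
print("Flagged instance(s):", flagged)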
3.1.1) Distance outlier:
Figure 3.1.1.1: the process of the Distance outlier on the Breast Tissue dataset
Figure 3.1.1.2: the plot view of the Distance outlier on the Breast Tissue dataset
Figure 3.1.1.3: the data view of the Distance outlier on the Breast Tissue dataset
3.1.2) LOF outlier:
Figure 3.1.2.1: the process of the LOF outlier on the Breast Tissue dataset
Figure 3.1.2.2: the plot view of the LOF outlier on the Breast Tissue dataset
Figure 3.1.2.3: the data view of the LOF outlier on the Breast Tissue dataset
3.2) White Wine dataset:
Analysis:
In this dataset we take a sample of 2000 instances, because there is not enough memory to process all the data. After that we get a single outlier using the LOF method, and we want to know why. So, we use a filter example to show only "quality = 7 && alcohol = 13", because the outlier lies in this range and we need to understand why this instance is an outlier; see figure 3.2.1.1. Then, because the outlier is not clear in figure 3.2.1.2, we zoom in around the point 0.020 on the x-axis to get figure 3.2.1.3. After that we compare the remaining instances with the outlier instance; see figure 3.2.1.5. We conclude that the outlier instance is new knowledge: a new type of white wine of quality 7 with the minimum value of volatile acidity and the maximum value of residual sugar. So, in this case we have a new-knowledge outlier, not an error outlier.
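A minimal sketch of this step, assuming data is the DataFrame loaded earlier and that n_neighbors = 10 matches the operator's setting (both assumptions): it draws the 2000-instance sample, scores it with scikit-learn's LocalOutlierFactor, prints the most outlying row, and then applies the report's filter condition for the comparison in figure 3.2.1.5.

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

sample = data.sample(n=2000, random_state=0)  # 2000-instance sample, as in the report
Xs = sample.drop(columns=["quality"])

lof = LocalOutlierFactor(n_neighbors=10).fit(Xs)
score = -lof.negative_outlier_factor_  # higher score = more outlying

print(sample.iloc[np.argsort(score)[::-1][:1]])  # the single strongest LOF outlier

# The report's filter: quality = 7 && alcohol = 13.
print(sample[(sample["quality"] == 7) & (sample["alcohol"] == 13)])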
3.2.1) LOF outlier:
Figure 3.2.1.1: the process of the LOF outlier on the White Wine dataset
Figure 3.2.1.2: the plot view of the LOF method on the White Wine dataset after applying SVD to convert from multiple dimensions to two (there is an outlier, but it is not clear)
Figure 3.2.1.3: the maximum zoom of the previous figure, showing the outlier clearly in red
Figure 3.2.1.4: the data view table of the LOF outlier, where the first row has the maximum outlier score: instance number 1449, with quality equal to seven
Figure 3.2.1.5: this table shows all rows with quality 7 and alcohol 13, comparing the outlier row with the remaining rows; the last row is the outlier row
3.2.2) Distance outlier:
Analysis: when we apply the Distance outlier to the 2000-instance sample of the White Wine dataset, we are surprised by the results: the outlier that appears with the LOF outlier does not appear here. So, we increase the number of outliers to 100 and then to 1000, but we still do not find row 1449 as an outlier with this method. Therefore, we conclude that the Distance outlier only finds points that are globally far from the rest of the data, whereas LOF measures how isolated a point is relative to the local density of its neighbors, so a point can be a strong LOF outlier while sitting inside a globally dense region. See figure 3.2.2.4, figure 3.2.2.5 and figure 3.2.2.6.
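This comparison can be checked directly in code. A short sketch, assuming knn_score and lof_score are the two score arrays from the earlier sketches computed on the same 2000-row sample, and row is the positional index of instance 1449 within that sample (all three names are assumptions):

import numpy as np

def rank_of(scores, row):
    """Rank of `row` when instances are sorted from most to least outlying."""
    order = np.argsort(scores)[::-1]
    return int(np.where(order == row)[0][0])

# Per the report, the LOF rank should be 0 (the top outlier), while the
# distance rank should fall outside even the top 1000.
print("distance rank:", rank_of(knn_score, row))
print("LOF rank:", rank_of(lof_score, row))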
Figure 3.2.2.1: the process of the Distance outlier on the 2000-instance sample of the White Wine dataset, where the number of outliers is ten and the number of neighbors k is also ten
Figure 3.2.2.2: The plot view of the White Wine dataset showing the ten outliers found by the Distance outlier
Figure 3.2.2.3: The data view table of the White Wine dataset showing the ten true outliers found by the Distance outlier
Figure 3.2.2.4: the process of the Distance outlier on the 2000-instance sample of the White Wine dataset, where the number of outliers is 100 and the number of neighbors k is 10
Figure 3.2.2.5: The plot view of the White Wine dataset showing the 100 outliers found by the Distance outlier
Figure 3.2.2.6: The data view table of the White Wine dataset showing the 100 true outliers found by the Distance outlier; row 1449 does not appear
4) Conclusion:
1- From this experiment we conclude that the LOF outlier is more accurate than the Distance outlier.
2- You can use the LOF outlier to decide whether an outlier is a genuine error or new knowledge, but the Distance outlier only counts the points that are far away from the rest of the data, so what it flags may not be a true outlier.
3- When you use K-means or K-medoids you must choose a k that gives pure centers (i.e. no two clusters that could be merged into one).