An Interval-value Clustering Approach to Data Mining
Yunfei Yin
Department of Computer Science
Guangxi Normal University
Guilin 541004, P.R. China
Faliang Huang
Institute of Computer Application
Guangxi Normal University
Guilin 541004, P.R. China
yinyunfei@vip.sina.com
huangfliang@163.com
ABSTRACT
The interval-value clustering algorithm is a product of the development of computational mathematics, and it is widely used in engineering, commerce, aviation and other fields. In order to bring the research on interval methods and theory into practice, and to find more valuable knowledge when mining and analyzing enterprise data, a data mining method based on interval clustering is presented. After introducing three interval clustering methods, we offer a method for mining association rules in interval databases. Compared with the traditional method of data mining, this method is more accurate, more effective, more novel and more useful. There is therefore considerable room for the development of this method, and it promises real practical and social significance for commercial information mining and decision making.

KEYWORDS
Interval-value, clustering, interval database, interval distance, data mining

INTRODUCTION
Since Agrawal R. put forward the idea of mining Boolean association rules [1], data mining has been a fairly active research branch. During the past ten years, Boolean association rule mining has received considerable attention from scholars, and a great many papers have been published on it. For example, [2][12] proposed a fast algorithm for mining Boolean association rules that can be used to arrange commodities in supermarkets; [13] put forward the idea of causality association rule mining; [9] offered a useful algorithm for mining negative association rules. Boolean association rule mining tries to find the regular patterns of consumer behavior in retail databases, and the mined rules can explain such patterns as “if people buy creamery, they will also buy sugar”. However, binary association rule mining restricts the application area to binary data.

In order to find more valuable association rules, [6] mined clustering association rules by means of clustering. However, because information is often incomplete and the boundary between one thing and another is often ambiguous, representing an object by interval numbers becomes the only viable choice. For example, “50 percent of males aged 40 to 65 who earn 80,000 to 1,000,000 each year own at least two villas”. In this case, we can only use the interval clustering model. Based on this, this article introduces a model method that is suited to mining such database information. Multiple real examples show that this method can find more valuable association rules.

This paper is organized as follows. In the following section, we describe three interval-value clustering methods and the relevant examples. In Section 3, we offer two interval-value mining approaches, for a common database and an interval database respectively. Section 4 presents the results of three experiments. We give a brief conclusion in Section 5.

INTERVAL-VALUE CLUSTERING METHOD
The interval-value clustering algorithm [7] is a product of the development of computational mathematics, and it is widely used in many fields such as engineering and commerce [13]. Data mining based on interval-value clustering is one application of this model, and the model involves the following three interval-value clustering methods: Number-Interval clustering, Interval-Interval clustering and Matrix-Interval clustering.
Number-interval Clustering Method
Suppose $x_1, x_2, \ldots, x_n$ are n objects whose behavior is characterized by interval values. According to the traditional clustering similarity formula [6], we can get the corresponding similarity matrix:

$$R = (r_{ij}) = \begin{pmatrix} 1 & & & \\ [t_{21}^-, t_{21}^+] & 1 & & \\ \vdots & \vdots & \ddots & \\ [t_{n1}^-, t_{n1}^+] & [t_{n2}^-, t_{n2}^+] & \cdots & 1 \end{pmatrix}$$

Matrix R is a symmetric matrix, where $r_{ij} = [t_{ij}^-, t_{ij}^+]$ is the similarity between $x_i$ and $x_j$, i, j = 1, 2, …, n.

The Number-Interval clustering method has three steps [14]:
(1) Netting: a serial number is labeled at each position on the diagonal of R. If $t_{ij}^- > \lambda_0$ (the threshold $\lambda_0$ is given by the user), the element $[t_{ij}^-, t_{ij}^+]$ is replaced by a node mark; if $t_{ij}^+ < \lambda_0$, the element is replaced by a blank; if $\lambda_0 \in [t_{ij}^-, t_{ij}^+]$, the element is replaced by “#”. A marked element is called a node, and “#” is called a similar node. We first draw a vertical line (longitude) and a horizontal line (woof) from each node to the diagonal, and use broken lines to draw the longitude and woof from each similar node to the diagonal, as shown in Figure 1.
(2) Relatively certain classification: for each node, bind the longitude and woof that pass through it; the elements at the ends of these lines are classified into the same set. Finally, the remaining elements are classified into the last set.
(3) Similar fuzzy classification: bind each similar node to the relatively certain class that is reached by the longitude or woof starting from that similar node.

Figure 1 Netting

As can be seen from Figure 1, relatively certain classification can clearly classify the objects, while similar fuzzy classification cannot. For example, the object set U = {1, 2, 3, 4, 5} can be classified into two sets, A = {1, 2, [5]} and B = {3, 4, [5]}. However, which set does object 5 actually belong to? To answer this, we introduce the concept of similar confidence.

Definition 1 (similar confidence): Suppose $A = \{x_1, x_2, \ldots, x_n, [x]\}$, $[t_i^-, t_i^+]$ is the similar coefficient of x and $x_i$, and

$$\alpha = \min\left\{ \frac{t_1^+ - \lambda_0}{t_1^+ - t_1^-}, \frac{t_2^+ - \lambda_0}{t_2^+ - t_2^-}, \ldots, \frac{t_n^+ - \lambda_0}{t_n^+ - t_n^-} \right\}$$

is called the similar confidence of x similarly attributing to set A. If x is similarly attributed to $A_1, A_2, \ldots, A_s$ at the same time, and the confidences are respectively $\alpha_1, \alpha_2, \ldots, \alpha_s$, take $\alpha_j = \max\{\alpha_1, \alpha_2, \ldots, \alpha_s\}$. When $\alpha_j \ge 0.5$, we believe that x should be attributed to set $A_j$; when $\alpha_j < 0.5$, x should be classified into a set of its own.
For example, in the above instance, the similar coefficients of the object set U are known to be as follows:

$$R = \begin{pmatrix} 1 & & & & \\ [0.8, 0.9] & 1 & & & \\ [0.5, 0.7] & [0.6, 0.72] & 1 & & \\ [0.6, 0.61] & [0.65, 0.70] & [0.78, 0.81] & 1 & \\ [0.7, 0.81] & [0.71, 0.83] & [0.7, 0.85] & [0.75, 0.92] & 1 \end{pmatrix}$$

and $\lambda_0 = 0.75$.

Then, the confidence of object 5 similarly attributing to A is:

$$\alpha_A = \min\left\{ \frac{0.81 - 0.75}{0.81 - 0.7}, \frac{0.83 - 0.75}{0.83 - 0.71} \right\} = \min\{0.55, 0.67\} = 0.55$$

The confidence of object 5 similarly attributing to B is:

$$\alpha_B = \min\left\{ \frac{0.85 - 0.75}{0.85 - 0.7}, \frac{0.92 - 0.75}{0.92 - 0.75} \right\} = \min\{0.67, 1\} = 0.67$$

Take $\alpha = \max\{\alpha_A, \alpha_B\} = \alpha_B = 0.67$.
∵ α > 0.5
∴ object 5 should be attributed to class B.
That is to say, the final classification is: A = {1, 2}, B = {3, 4, 5}.
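The netting labels and the similar confidence of Definition 1 can be checked with a few lines of Python. This is only a minimal sketch: the names `SIM`, `net` and `similar_confidence` are illustrative and do not come from the paper, and the node test is assumed to be strict (lower bound greater than the threshold), so that an interval whose lower bound equals the threshold counts as a similar node.

```python
# Lower triangle of the interval similarity matrix from the worked example.
# Keys are (row, column) object numbers; values are (lower, upper) bounds.
SIM = {
    (2, 1): (0.8, 0.9),
    (3, 1): (0.5, 0.7), (3, 2): (0.6, 0.72),
    (4, 1): (0.6, 0.61), (4, 2): (0.65, 0.70), (4, 3): (0.78, 0.81),
    (5, 1): (0.7, 0.81), (5, 2): (0.71, 0.83),
    (5, 3): (0.7, 0.85), (5, 4): (0.75, 0.92),
}
THRESHOLD = 0.75  # the user-supplied threshold lambda_0


def net(lower, upper, lam):
    """Netting step: label one interval entry against the threshold.

    Assumption: a node requires lower > lam (strict); an interval that
    contains lam, including at its endpoints, is a similar node.
    """
    if lower > lam:
        return "node"
    if upper < lam:
        return "blank"
    return "similar"


def similar_confidence(x, members, sim, lam):
    """Definition 1: the minimum of (upper - lam) / (upper - lower) over
    the similar coefficients between x and the members of the class.
    Assumes every interval involved has upper > lower."""
    ratios = []
    for m in members:
        lower, upper = sim[(max(x, m), min(x, m))]
        ratios.append((upper - lam) / (upper - lower))
    return min(ratios)


labels = {pair: net(lo, hi, THRESHOLD) for pair, (lo, hi) in SIM.items()}
alpha_a = similar_confidence(5, [1, 2], SIM, THRESHOLD)
alpha_b = similar_confidence(5, [3, 4], SIM, THRESHOLD)
best = "A" if alpha_a > alpha_b else "B"

print(labels[(2, 1)], labels[(4, 3)], labels[(5, 1)])
print(round(alpha_a, 2), round(alpha_b, 2), best)
```

Entries (2,1) and (4,3) come out as nodes, which gives the relatively certain classes {1, 2} and {3, 4}; every entry in row 5 is a similar node, so object 5 is the only one that needs the confidence comparison.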
Interval-interval Clustering Method
The Interval-Interval clustering method is the extension and generalization of the Number-Interval method; it expresses the threshold λ of Number-Interval clustering in the form of an interval.
The Interval-Interval clustering method also needs three procedures: netting, relatively certain classification and similar fuzzy classification. In order to decide which relatively certain set a similar node is finally attributed to, we extend the concept of similar confidence as follows:

Suppose λ is the interval value $[\lambda_0^-, \lambda_0^+]$, $A = \{x_1, x_2, \ldots, x_n, [x]\}$, and $[t_i^-, t_i^+]$ is the similar coefficient of x and $x_i$. According to the following information formulas:

$$\alpha_i^- = \begin{cases} 1 + \dfrac{t_i^+ - \lambda_0^+}{t_i^+ - t_i^-} \log_2 \dfrac{t_i^+ - \lambda_0^+}{t_i^+ - t_i^-} & \text{if } t_i^- \le \lambda_0^+ \le t_i^+ \\ 1 & \text{if } t_i^- \ge \lambda_0^+ \end{cases}$$

$$\alpha_i^+ = \begin{cases} 1 + \dfrac{t_i^+ - \lambda_0^-}{t_i^+ - t_i^-} \log_2 \dfrac{t_i^+ - \lambda_0^-}{t_i^+ - t_i^-} & \text{if } t_i^- \le \lambda_0^- \le t_i^+ \\ 1 & \text{if } t_i^- \ge \lambda_0^- \end{cases}$$

work out $[\alpha_i^-, \alpha_i^+]$. Then, let

$$[\alpha^-, \alpha^+] = \min\{[\alpha_1^-, \alpha_1^+], [\alpha_2^-, \alpha_2^+], \ldots, [\alpha_n^-, \alpha_n^+]\}$$

and call it the similar confidence of x similarly attributing to set A. If $\alpha_1, \alpha_2, \ldots, \alpha_s$ are the confidences with which x is attributed to $A_1, A_2, \ldots, A_s$ respectively, take $\alpha_j = \max\{\alpha_1, \alpha_2, \ldots, \alpha_s\}$. If the center of $\alpha_j \ge 0.5$, x should be attributed to $A_j$; if the center of $\alpha_j < 0.5$, x should be classified alone.
Simply put, the Interval-Interval clustering method works out the λ level set of a similar matrix whose elements are all interval values, where λ is also an interval value.

For example, in the above experiment, let λ = [0.7, 0.8]. Then, the confidence of object 5 similarly attributing to A is:

$$\alpha_A = \min\left\{ \left[ 1 + \frac{0.81 - 0.8}{0.81 - 0.7} \log_2 \frac{0.81 - 0.8}{0.81 - 0.7},\; 1 + \frac{0.81 - 0.7}{0.81 - 0.7} \log_2 \frac{0.81 - 0.7}{0.81 - 0.7} \right], \left[ 1 + \frac{0.83 - 0.8}{0.83 - 0.7} \log_2 \frac{0.83 - 0.8}{0.83 - 0.7},\; 1 \right] \right\}$$
$$= \min\{[0.69, 1], [0.51, 1]\} = [0.69, 1]$$

The confidence of object 5 similarly attributing to B is:

$$\alpha_B = \min\left\{ \left[ 1 + \frac{0.85 - 0.8}{0.85 - 0.7} \log_2 \frac{0.85 - 0.8}{0.85 - 0.7},\; 1 + \frac{0.85 - 0.7}{0.85 - 0.7} \log_2 \frac{0.85 - 0.7}{0.85 - 0.7} \right], \left[ 1 + \frac{0.92 - 0.8}{0.92 - 0.75} \log_2 \frac{0.92 - 0.8}{0.92 - 0.75},\; 1 \right] \right\}$$
$$= \min\{[0.47, 1], [0.65, 1]\} = [0.65, 1]$$

Take $\alpha = \max\{\alpha_A, \alpha_B\} = \alpha_B = [0.65, 1]$.
∵ α > 0.5
∴ object 5 should be attributed to class B.
The result is the same as the previous result of number-interval clustering.

Interval-matrix Clustering Method
Each interval $[t_{ij}^-, t_{ij}^+]$ can be written as $t_{ij}^- + (t_{ij}^+ - t_{ij}^-)u$, $0 \le u \le 1$. Given a $u_0$, the interval can be expressed by $r_{ij} = t_{ij}^- + (t_{ij}^+ - t_{ij}^-)u_0$. So, we transform the similar interval matrix R into two real matrices:

$$M = \begin{pmatrix} 1 & & & \\ r_{2,1} & 1 & & \\ \vdots & \vdots & \ddots & \\ r_{n,1} & r_{n,2} & \cdots & 1 \end{pmatrix} \quad \text{and} \quad U = \begin{pmatrix} 1 & & & \\ u_{2,1} & 1 & & \\ \vdots & \vdots & \ddots & \\ u_{n,1} & u_{n,2} & \cdots & 1 \end{pmatrix}$$

where the real matrix U is made up of the different $u_{ij}$ associated with the different interval values.
Next, after performing a composition calculation on M and U respectively, we get their fuzzy equivalence relations. Taking different λ values gives different classification results, where the classification result is the intersection of the equivalence relations of M and U. Finally, choose the reasonable classes according to the actual situation. Simply put, the Matrix-Interval clustering method transforms the interval-value matrix into two real matrices and then clusters by the fuzzy equivalence relation clustering method. The result of matrix-interval clustering changes with the values of $u_{ij}$ (all the values of $u_{ij}$ make up matrix U). Since the value of $u_{ij}$ is strongly influenced by domain knowledge, domain experts are needed to give specific guidance before satisfactory clustering results can be obtained. However, as this approach changes interval values to real ones, the efficiency increases remarkably. The corresponding example is omitted here because of its complexity.
We discuss three kinds of interval-value methods because many interval values exist in reality that cannot be correctly processed by the traditional method of data mining. Next we will discuss data mining in interval-value databases.

DATABASE INFORMATION MINING MODEL
There are two types of database information mining: one is data mining in a common database; the other is data mining in an interval-value database.
Data Mining In a Common Database
In a common database, the records are made up of values that range over certain fields. The values of each field vary within a certain range and share the same type, and the popular processing method is to divide them into several intervals according to the actual needs.
However, this produces hard dividing boundaries, so we propose the interval clustering method. With it, the classification is more reasonable and the boundaries are softened, because the thresholds can be changed according to the actual conditions. More importantly, the operation can be automated by making all the thresholds take the same value. That is to say, we can break away from the specific problem area and have all the data of each field clustered automatically (under the control of the same threshold). The algorithm is as follows:

Algorithm 1. (Automatic Data Clustering Algorithm)
Input: DB: database; Attr_List: attribute set; Thresh_Set: the threshold used to cluster all the attributes;
Output: the clustering results for all attributes;
Step 1: for each a_i ∈ Attr_List
        Uni_a_i = Transfer_ComparableType(a_i); // Transform all the attributes into comparative types, and save to Uni_a_i
Step 2: while (Thresh_Value(a_i) < Thresh_Set(a_i, Uni_a_i)) { // work out the similarities of all the values of each attribute
Step 3: for any k, j ∈ DB and k, j ∈ a_i {
Step 4: Compute_Similiarity(k, j); } // Calculate similarity
Step 5: Gen_SimilarMatrix(a_i, M_i); // Generate the similarity matrix of the values of a certain attribute
Step 6: C ← M_i; } // C is the array of similarity matrices
Step 7: for each c_i ∈ C: C = C + GetValue(c_i);
Step 8: Gen_IntervalCluster(Attr_List, C); // Clustering
Step 9: S = statistic(C); // count the support of item sets
Step 10: Arrange_Matrix(DB, C); // Merge and arrange the final mining results

After the final clustering results are obtained in the above steps, we can carry out data mining. The results of this kind of data mining are quantitative association rules [11], which describe the quantitative relations among items.

Data Mining In an Interval-value Database
This is another kind of information mining. It differs from common information mining in that it introduces the concept of an interval-value database.
Definition 2 (interval-value relation database): Suppose $D_1, D_2, \ldots, D_n$ are n real fields, and $F(D_1), F(D_2), \ldots, F(D_n)$ are respectively sets constructed from interval values in $D_1, D_2, \ldots, D_n$ [3]. Regard them as the value fields of attributes on which relations will be defined. Form the Cartesian product $F(D_1) \times F(D_2) \times \cdots \times F(D_n)$, and call a subset of this Cartesian product an interval relation over the record attributes; the database is then called an interval-value relation database. A record can be expressed by $t = (x_1, x_2, \ldots, x_n)$, where $x_i \in F(D_i)$ (i = 1, 2, …, n) is an interval of $D_i$.
Definition 3 (closed interval distance): Suppose [a, b] and [c, d] are any two closed intervals. The distance between the two intervals is defined as $d([a,b],[c,d]) = \left((a - c)^2 + (b - d)^2\right)^{1/2}$. It is easy to verify that this distance satisfies the three conditions of the definition of a distance.
Interval-value data mining classifies $F(D_i)$ by the “interval-value clustering method”, then merges the database to reduce redundant attributes and transforms it into a common quantitative database for mining. The algorithm is as follows:
Step 1: Transform $F(D_i)$ related to attribute $D_i$ in the database into comparative type by generalizing and abstracting [11];
Step 2: In the processed database, work out the interval distance between every two values in each $F(D_i)$, and regard the distance as their similarity measure. A similar matrix is generated in this way;
Step 3: Cluster according to one of the three interval-value methods;
Step 4: Decide whether the value fits the threshold; after labeling the attribute again, get the quantitative attribute;
Step 5: Carry out data mining on the quantitative attributes;
Step 6: Repeat step 3 and step 4;
Step 7: Arrange and merge the results of data mining;
Step 8: Get the quantitative association rules.
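The closed interval distance and the matrix built from it in Step 2 can be sketched in Python as below. The function names and the sample age intervals are hypothetical and serve only as illustration; how the distances are then rescaled into similarity coefficients is not specified in the text and is left out.

```python
import math


def interval_distance(p, q):
    """Closed interval distance: d([a,b],[c,d]) = sqrt((a-c)^2 + (b-d)^2)."""
    (a, b), (c, d) = p, q
    return math.sqrt((a - c) ** 2 + (b - d) ** 2)


def distance_matrix(intervals):
    """Pairwise interval distances for the values of one attribute (Step 2)."""
    return [[interval_distance(p, q) for q in intervals] for p in intervals]


# Hypothetical interval-valued ages, purely for illustration.
ages = [(20, 30), (25, 35), (40, 65)]
matrix = distance_matrix(ages)

for row in matrix:
    print([round(value, 2) for value in row])
```

The matrix is symmetric with a zero diagonal, as expected of a distance, so only its lower triangle needs to be stored when it is passed to one of the three clustering methods.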
Example Research
The following information is given:

Table 1 Career, income questionnaire
Age   Income   Career                    Number of villas owned
22    2000     Salesman for books        0
39    10000    Salesman for IT           2
50    3000     College teacher           1
28    5000     Career training teacher   0
46    49000    CEO for IT                5
36    15000    CEO for manufacturing     1

Firstly, cluster the values of each attribute:
The clustering result for ages: $I_{11}$ = {22, 28}, $I_{12}$ = {36, 39}, $I_{13}$ = {46, 50};
The clustering result for incomes: $I_{21}$ = {2000, 3000, 5000}, $I_{22}$ = {10000, 15000}, $I_{23}$ = {49000};
The clustering result for careers: $I_{31}$ = {Salesman for books, Salesman for IT}, $I_{32}$ = {College teacher, Career training teacher}, $I_{33}$ = {CEO for IT, CEO for manufacturing};
The clustering result for number of villas owned: $I_{41}$ = {0, 1}, $I_{42}$ = {2, 5}.

Then, calculate the supports:

Table 2 Statistical table for item sets
Item    Support
I11     2
I12     2
I13     2
I21     3
I22     2
I23     1
I31     2
I32     2
I33     2
I41     4
I42     2

Next, normalize the database.

Table 3 Transformed transaction database
TID   Items
1     I11, I21, I31, I41
2     I12, I22, I31, I42
3     I13, I21, I32, I41
4     I11, I21, I32, I41
5     I13, I23, I33, I42
6     I12, I22, I33, I41

Finally, if we mine the transformed database, we can find such association rules as “50 percent of males earn 2,000 to 5,000 each month and own at most one villa”, that is to say, $I_{21} \to I_{41}$ (Support = 0.5, Confidence = 1), since three of the six transactions contain $I_{21}$ and all three also contain $I_{41}$.
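Support and confidence for a rule can be counted directly over the six rows of Table 3. The sketch below is a plain counting routine in Python, not the mining algorithm used in the experiments; `TRANSACTIONS`, `support_count` and `rule_stats` are illustrative names.

```python
# Transformed transaction database, copied from Table 3.
TRANSACTIONS = {
    1: {"I11", "I21", "I31", "I41"},
    2: {"I12", "I22", "I31", "I42"},
    3: {"I13", "I21", "I32", "I41"},
    4: {"I11", "I21", "I32", "I41"},
    5: {"I13", "I23", "I33", "I42"},
    6: {"I12", "I22", "I33", "I41"},
}


def support_count(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for items in TRANSACTIONS.values() if itemset <= items)


def rule_stats(antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    both = support_count(antecedent | consequent)
    support = both / len(TRANSACTIONS)
    confidence = both / support_count(antecedent)
    return support, confidence


support, confidence = rule_stats({"I21"}, {"I41"})
print(support_count({"I21"}), support_count({"I41"}))
print(support, confidence)
```

The single-item counts reproduce the entries of Table 2, and the same two functions can be applied to any other candidate rule over the transformed items.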
ALGORITHM EVALUATION
In order to verify the effect of the above algorithm, we carried out a large amount of testing. Tests on simulated and real data showed that the aforementioned algorithm could improve the effect of data mining dramatically, and found many valuable association rules. The following are the results of processing the real databases.

Experiment One
This is the result of mining a data set named Flags, which describes the characteristics of the national flags of all the countries in the world. The characteristics include area of country, population, national flag, color, shape, size, special patterns, layout and so on. The mined results are described in Table 4:

Table 4 The results of mining a database named Flags
Scale of    Number of effective   Average values of   Threshold   Threshold    Number of
data set    attributes            each attribute      support     confidence   patterns
194         17                    2                   0.85        0.8          10
194         14                    6                   0.7         0.8          138
194         13                    8                   0.77        0.8          72

Experiment Two
This is the result of mining a data set named Zoo, which describes the characteristics of 101 kinds of animals. The characteristics include hair, eggs, milk, tail, legs, toothed and so on. The mined results are described in Table 5:

Table 5 The results of mining a database named Zoo
Scale of    Number of effective   Average values of   Threshold   Threshold    Number of
data set    attributes            each attribute      support     confidence   patterns
101         18                    3                   0.8         0.7          2
101         14                    2                   0.72        0.8          4
101         10                    2                   0.7         0.74         4

Remarks: In the above experiments, the data sets come from ftp://www.pami.sjtu.edu.cn and are real databases. The data scale refers to the number of records in the database; the scale for attributes refers to the number of attributes in the attribute set; the number of effective attributes refers to the number of attributes remaining after reduction; the average values of each attribute refers to the number of different intervals after division; the number of patterns refers to the number of patterns found by mining.
In addition, in order to enhance the effect, we labeled all the attributes during the experiments. For example, a rule “1,11→40” represents “if the country is in America and the population is within 10,000,000, there is not any vertical stripe in its flag”.
Detailed explanations of the experiments and other experimental results can be obtained by visiting ftp://ftp.glnet.edu.cn.

CONCLUSIONS
The application of interval value clustering in data mining
has been discussed in this article. We first put forward three kinds of interval-value clustering methods and explained them by examples; we then offered interval-value clustering mining methods for a common database and an interval database respectively; finally we gave a worked example. In Section 4, the results of three experiments demonstrated that the application of interval-value clustering in data mining has broad prospects for development.
This kind of data mining method based on interval values is mainly suited to mining numeric data and comparable data; it is especially significant for interval-value databases. For other types of data, it has some reference value. Worldwide, the research on the relation between interval value clustering and data mining is still at an early stage, but its development prospects are good. A series of new technologies and software will be produced. Most big commercial companies are now competing in the supermarket business, and the 21st century will be a new era in which the interval value clustering method is used in data mining.

REFERENCES
1. Agrawal, R., Imielinski, T., and Swami, A. Mining Association Rules Between Sets of Items in Large Databases. ACM SIGMOD Int. Conf. on Management of Data, 1993: 207-216.
2. Agrawal, R., and Srikant, R. Fast algorithm for mining association rules in large database. In Research Report RJ9839, IBM Almaden Research Center, San Jose, CA, June 1994.
3. He, X.G. Fuzzy Database System, Tsinghua University Press, Beijing, 1994.
4. Hu, C.Y., Xu, S.Y., and Yang, X.G. A Introduction to Interval Value Algorithm, Systems Engineering -- Theory & Practice, 2003 (4): 59-62.
5. Lent, B., Swami, A., and Widom, J. Clustering association rules. In Proc. 1997 Int. Conf. Data Engineering (ICDE’97), Birmingham, England, April 1997.
6. Luo, C.Z. A Guide to Fuzzy Set, Beijing Normal University Press, Beijing, 1989.
7. Moore, R., and Yang, C. Interval Analysis I. Technical Document, Lockheed Missiles and Space Division, Number LMSD-285875, 1959.
8. Srikant, R., and Agrawal, R. Mining Quantitative Association Rules in Large Tables. In Proceedings of ACM SIGMOD, 1996: 1-12.
9. Wu, X.D., Zhang, C.Q., and Zhang, S.C. Mining Both Positive and Negative Association Rules. In Proceedings of 19th International Conference on Machine Learning, Sydney, Australia, July 2002: 658-665.
10. Yin, Y.F., Zhang, S.C., and Xu, Z.Y. A Useful Model for Software Data Mining, Computer Application 3, 2004: 10-13.
11. Yin, Y.F., Zhong, Z., and Liu, X.S. Data Mining Based Stability Theory. Changjiang River Science Research Institute Journal 2, 2004: 22-24.
12. Zhang, C.Q., and Zhang, S.C. Association Rule Mining Models and Algorithms. Springer-Verlag, Berlin Heidelberg, 2002.
13. Zhang, S.C., and Zhang, C.Q. Discovering Causality in Large Databases, Applied Artificial Intelligence, 2002.
14. Zhang, X.F. The Cluster Analysis for Interval-valued Fuzzy Sets, Journal of Liaocheng University, 2001, 14(1): 5-7.