TV-Tree

Top-Down Clustering Method Based On TV-Tree
Zbigniew W. Ras
3/23/2016
Heuristic TV-Tree Construction
Heuristic method
• A random attribute “a” is selected from system S=(X,A,V) to
decompose S into systems S1=(X1,A1,V1) and S2=(X2,A2,V2) based
on a certain value from Dom(a), called a split point for S.
• A value va ∈ Dom(a) is an optimal split point for S if it maximizes the
total number of active dimensions in S1 and S2 [where {S1, S2} is a
balanced set].
• Active Dimension: If the distance between any two objects in S with
respect to attribute a is less than λa, the attribute a is considered an
active dimension for S.
– Definition: S is a-active if |max a – min a| < λa,
where max a = max{a(x): x ∈ X},
min a = min{a(x): x ∈ X},
and λa is a user-given threshold for attribute “a” to be active.
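The active-dimension test can be written in a few lines. The following is a minimal sketch, assuming a strict "less than" comparison as in the wording of the bullet above; the function name `is_active` and the two threshold values are illustrative assumptions, while the sample values are column a of system S in Figure 1.

```python
def is_active(values, threshold):
    """Return True if the spread (max - min) of `values` is below `threshold`."""
    return abs(max(values) - min(values)) < threshold

# Column a of system S in Figure 1 takes only the values 2 and 3.
a_values = [3, 3, 2, 3, 2, 3, 2, 3, 2, 3]

print(is_active(a_values, threshold=2))  # True: the spread is 1
print(is_active(a_values, threshold=1))  # False: 1 is not below 1
```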
Heuristic TV-Tree Construction
Heuristic method
• Middle Value is calculated for each active dimension.
• S1 and S2 are equally weighted: the constructed TV-Tree has to satisfy a
so-called weighted threshold ω, defined as [card(X1) / card(X) ≥ ω and
card(X2) / card(X) ≥ ω]. The TV-Tree is then called ω-balanced.
• Repeat the same process for S1 and S2 by picking another random
attribute from S1 and next from S2.
• This recursive process ends once all dimensions in the smallest
decomposed systems are active.
• The structure of the resulting TV-Tree depends on the random attribute
selected.
– If the random attribute selected is strongly related to other non-active
attributes, we may get more active dimensions close to the top of the tree.
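One decomposition step of the heuristic method can be sketched as follows: for an already chosen attribute, try each candidate split point, discard splits that violate the balance requirement, and keep the split with the largest total number of active dimensions. The names `active_dims` and `best_split`, the balance threshold `omega`, and all threshold values are assumptions made for illustration; only the data (system S from Figure 1) comes from the slides.

```python
def active_dims(rows, thresholds):
    """Attributes whose value spread over `rows` is below their threshold."""
    return [a for a, lam in thresholds.items()
            if max(r[a] for r in rows) - min(r[a] for r in rows) < lam]

def best_split(rows, attr, thresholds, omega):
    """Return (split_point, total_active) for `attr`, or None if no split is balanced."""
    values = sorted({r[attr] for r in rows})
    best = None
    for lo, hi in zip(values, values[1:]):
        point = (lo + hi) / 2  # candidate: midpoint of two neighbouring values
        s1 = [r for r in rows if r[attr] < point]
        s2 = [r for r in rows if r[attr] > point]
        if min(len(s1), len(s2)) / len(rows) < omega:
            continue  # one side is too small: not a balanced split
        total = len(active_dims(s1, thresholds)) + len(active_dims(s2, thresholds))
        if best is None or total > best[1]:
            best = (point, total)
    return best

# System S from Figure 1, stored column by column.
columns = {
    "a": [3, 3, 2, 3, 2, 3, 2, 3, 2, 3],
    "b": [1, 4, 6, 1, 9, 4, 9, 1, 6, 4],
    "c": [5, 0, 8, 5, 6, 7, 6, 7, 8, 7],
    "d": [6, 0, 8, 6, 3, 1, 4, 7, 7, 1],
    "e": [5] * 10,
}
rows = [dict(zip(columns, vals)) for vals in zip(*columns.values())]
thresholds = {"a": 2, "b": 4, "c": 3, "d": 3, "e": 1}  # hypothetical values

result = best_split(rows, "c", thresholds, omega=0.4)
print(result)
```

With these assumed values, 6.5 is the only split point on c that meets the balance requirement, which agrees with the decomposition of S into S1 and S2 shown in Figure 1.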
Heuristic TV-Tree Construction
Figure 1

S:
       a  b  c  d  e
X1     3  1  5  6  5
X2     3  4  0  0  5
X3     2  6  8  8  5
X4     3  1  5  6  5
X5     2  9  6  3  5
X6     3  4  7  1  5
X7     2  9  6  4  5
X8     3  1  7  7  5
X9     2  6  8  7  5
X10    3  4  7  1  5

S1:
       a  b  c  d  e
X1     3  1  5  6  5
X2     3  4  0  0  5
X4     3  1  5  6  5
X5     2  9  6  3  5
X7     2  9  6  4  5

S2:
       a  b  c  d  e
X3     2  6  8  8  5
X6     3  4  7  1  5
X8     3  1  7  7  5
X9     2  6  8  7  5
X10    3  4  7  1  5
In Search For Optimal Split Point
Figure 2
Value:          0     5     6     7     8
Split points:     2.5   5.5   6.5   7.5
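The numbers in Figure 2 are consistent with taking the distinct values of attribute c from Figure 1 and placing one candidate split point halfway between each pair of neighbouring values. That reading is an inference from the numbers, not something the slide states. A short sketch:

```python
# Column c of system S in Figure 1.
c_values = [5, 0, 8, 5, 6, 7, 6, 7, 8, 7]

distinct = sorted(set(c_values))
split_points = [(lo + hi) / 2 for lo, hi in zip(distinct, distinct[1:])]

print(distinct)      # [0, 5, 6, 7, 8]
print(split_points)  # [2.5, 5.5, 6.5, 7.5]
```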
Layout of TV-Tree
Figure 3: tree diagram of the TV-Tree layout. Each node lists the attributes a, b, c, d, e; the values shown next to attributes are a = 2.5, c = 7.5, and e = 5.
TV-Tree Construction
Figure 4
Heuristic Algorithm
Random Attribute: C

S:
      A  B  C  D  E
X1    2  1  1  4  3
X2    4  1  4  4  2
X3    4  3  5  7  3
X4    5  4  6  8  3
X5    5  4  7  9  3
X6    5  3  8  9  2

Let’s assume that the split separating X1-X3 from X4-X6 is the best split
point, i.e. the one that yields the maximum number of active dimensions in
both derived tables.

S1:
      A  B  C  D  E
X1    2  1  1  4  3
X2    4  1  4  4  2
X3    4  3  5  7  3
Active Dimensions: A, E
Middle Values: A = 3, E = 2.5

S2:
      A  B  C  D  E
X4    5  4  6  8  3
X5    5  4  7  9  3
X6    5  3  8  9  2
Active Dimensions: A, B, C, D, E
Middle Values: A = 5, B = 3.5, C = 7, D = 8.5, E = 2.5
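The active dimensions and middle values listed for S1 and S2 can be reproduced with a short script. The slides do not give the thresholds, so the values in `LAMBDA` are hypothetical, chosen only because they reproduce the listed results. The middle value is taken as (min + max) / 2, which matches every number shown in Figure 4.

```python
# System S from Figure 4.
S = {
    "X1": {"A": 2, "B": 1, "C": 1, "D": 4, "E": 3},
    "X2": {"A": 4, "B": 1, "C": 4, "D": 4, "E": 2},
    "X3": {"A": 4, "B": 3, "C": 5, "D": 7, "E": 3},
    "X4": {"A": 5, "B": 4, "C": 6, "D": 8, "E": 3},
    "X5": {"A": 5, "B": 4, "C": 7, "D": 9, "E": 3},
    "X6": {"A": 5, "B": 3, "C": 8, "D": 9, "E": 2},
}
LAMBDA = {"A": 3, "B": 2, "C": 3, "D": 2, "E": 2}  # hypothetical thresholds

def summarize(objects):
    """Map each active attribute of the subsystem to its middle value."""
    result = {}
    for attr, lam in LAMBDA.items():
        vals = [S[x][attr] for x in objects]
        if max(vals) - min(vals) < lam:          # attribute is active
            result[attr] = (max(vals) + min(vals)) / 2
    return result

s1 = summarize(["X1", "X2", "X3"])
s2 = summarize(["X4", "X5", "X6"])
print(s1)  # {'A': 3.0, 'E': 2.5}
print(s2)  # {'A': 5.0, 'B': 3.5, 'C': 7.0, 'D': 8.5, 'E': 2.5}
```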
TV-Tree Construction
Random Attribute: A

S1:
      A  B  C  D  E
X1    2  1  1  4  3
X2    4  1  4  4  2
X3    4  3  5  7  3

Assumed split point: between X1 and X2.

S3:
      A  B  C  D  E
X1    2  1  1  4  3
Active Dimensions: A, B, C, D
Middle Values: A = 2, B = 1, C = 1, D = 4

S4:
      A  B  C  D  E
X2    4  1  4  4  2
X3    4  3  5  7  3
Active Dimensions: A, C, E
Middle Values: A = 4, C = 4.5, E = 2.5
TV-Tree Construction
Random Attribute: B

S4:
      A  B  C  D  E
X2    4  1  4  4  2
X3    4  3  5  7  3

Assumed split point: between X2 and X3.

S5:
      A  B  C  D  E
X2    4  1  4  4  2
Active Dimensions: A, B, C
Middle Values: A = 4, B = 1, C = 4

S6:
      A  B  C  D  E
X3    4  3  5  7  3
Active Dimensions: A, B, C
Middle Values: A = 4, B = 3, C = 5
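The recursion walked through in Figure 4 can be sketched as a small tree builder. This is a simplified illustration, not the exact procedure from the slides: the thresholds in `LAMBDA` are hypothetical, the random attribute is drawn only from attributes that are not yet active (so that every split makes progress), and the split point is simply the midpoint of the two middle distinct values rather than an optimal one.

```python
import random

ATTRS = ["A", "B", "C", "D", "E"]
# System S from Figure 4, one tuple per object X1..X6.
ROWS = [dict(zip(ATTRS, v)) for v in [
    (2, 1, 1, 4, 3), (4, 1, 4, 4, 2), (4, 3, 5, 7, 3),
    (5, 4, 6, 8, 3), (5, 4, 7, 9, 3), (5, 3, 8, 9, 2)]]
LAMBDA = {"A": 3, "B": 2, "C": 3, "D": 2, "E": 2}  # hypothetical thresholds

def inactive(rows):
    """Attributes whose spread over `rows` is not below the threshold."""
    return [a for a in ATTRS
            if max(r[a] for r in rows) - min(r[a] for r in rows) >= LAMBDA[a]]

def build(rows, rng):
    """Split recursively until every attribute is active in every leaf."""
    todo = inactive(rows)
    if not todo:
        return {"leaf": rows}
    attr = rng.choice(todo)                    # simplification: pick an inactive attribute
    values = sorted({r[attr] for r in rows})   # at least two distinct values here
    mid = len(values) // 2
    point = (values[mid - 1] + values[mid]) / 2
    return {"attr": attr, "split": point,
            "low": build([r for r in rows if r[attr] < point], rng),
            "high": build([r for r in rows if r[attr] > point], rng)}

def leaves(node):
    """Collect the object lists stored in the leaves of the tree."""
    if "leaf" in node:
        return [node["leaf"]]
    return leaves(node["low"]) + leaves(node["high"])

tree = build(ROWS, random.Random(0))
for leaf in leaves(tree):
    print(leaf)
```

Different seeds produce different trees, which illustrates the earlier point that the resulting structure depends on the random attribute selected.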
Non-Heuristic TV-Tree Construction
Non-Heuristic method
• Search through all attributes in A for the best split point for S,
decomposing it into S1 and S2 in a way that maximizes the overall
number of active dimensions in S1 and S2, additionally assuming that
{S1, S2} is a well-balanced set.
• The remaining steps are similar to the heuristic method.
• This method is more expensive than the heuristic method but may
produce better TV-Trees and therefore a more efficient query answering
system (QAS).
• It works well if the information system S does not have a large number
of attributes.
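The exhaustive search described above differs from the heuristic step only in that it loops over every attribute instead of one randomly chosen attribute. A minimal sketch on system S from Figure 4 follows; the thresholds in `LAMBDA`, the balance threshold `OMEGA`, and the function names are assumptions made for illustration.

```python
ATTRS = ["A", "B", "C", "D", "E"]
# System S from Figure 4, one tuple per object X1..X6.
ROWS = [dict(zip(ATTRS, v)) for v in [
    (2, 1, 1, 4, 3), (4, 1, 4, 4, 2), (4, 3, 5, 7, 3),
    (5, 4, 6, 8, 3), (5, 4, 7, 9, 3), (5, 3, 8, 9, 2)]]
LAMBDA = {"A": 3, "B": 2, "C": 3, "D": 2, "E": 2}  # hypothetical thresholds
OMEGA = 0.3                                         # hypothetical balance threshold

def count_active(rows):
    """Number of attributes whose spread over `rows` is below the threshold."""
    return sum(max(r[a] for r in rows) - min(r[a] for r in rows) < LAMBDA[a]
               for a in ATTRS)

def exhaustive_split(rows):
    """Return (attribute, split_point, total_active) over all attributes."""
    best = None
    for attr in ATTRS:
        values = sorted({r[attr] for r in rows})
        for lo, hi in zip(values, values[1:]):
            point = (lo + hi) / 2
            s1 = [r for r in rows if r[attr] < point]
            s2 = [r for r in rows if r[attr] > point]
            if min(len(s1), len(s2)) / len(rows) < OMEGA:
                continue  # not a well-balanced pair
            total = count_active(s1) + count_active(s2)
            if best is None or total > best[2]:
                best = (attr, point, total)
    return best

best = exhaustive_split(ROWS)
print(best)
```

With these assumed values several splits tie for the highest total, including the split on C used in Figure 4; the sketch keeps the first one it finds. The nested loop also shows why the cost grows with the number of attributes.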
TV-Tree Based Query Answering
Exact match query
• User specifies attributes and their values.
– Example query: q=[a=5].
Range query
• User specifies attributes, their values, and a tolerance αa
for each specified attribute a (if αa is the tolerance value
for attribute a, then αa ≤ λa).
– Example query: q=[a=5,2].
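The two query types can be illustrated by evaluating them directly against a table of objects, without any tree. The sketch below assumes that a range query [a=v, α] selects objects with |a(x) − v| ≤ α and treats an exact match as tolerance 0; the function name `answer` and the query encoding are illustrative, and the data is column A of system S from Figure 4 (written in lowercase to match the queries).

```python
# Column A of system S from Figure 4.
OBJECTS = {"X1": {"a": 2}, "X2": {"a": 4}, "X3": {"a": 4},
           "X4": {"a": 5}, "X5": {"a": 5}, "X6": {"a": 5}}

def answer(query):
    """`query` maps attribute -> (value, tolerance); tolerance 0 is an exact match."""
    return [name for name, obj in OBJECTS.items()
            if all(abs(obj[attr] - value) <= tol
                   for attr, (value, tol) in query.items())]

exact = answer({"a": (5, 0)})   # q=[a=5]
ranged = answer({"a": (5, 2)})  # q=[a=5,2]
print(exact)   # ['X4', 'X5', 'X6']
print(ranged)  # ['X2', 'X3', 'X4', 'X5', 'X6']
```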
TV-Tree Based Query Answering
• Q=[a=5]•[b=3]•[c=7,1]
Tree diagram: Root, with nodes labeled a=4, λa=2; c=6, λc=2; b=6, λb=2; b=10, λb=3; b=4, λb=2.
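One possible reading of the node labels is that "a=4, λa=2" denotes a node whose middle value for a is 4 and whose values spread by at most 2, so they lie within 4 ± 1. Under that assumption, which the slide does not spell out, a node can be skipped when the query interval cannot overlap the node interval. The function name `may_match` and the tuple encoding of nodes are hypothetical.

```python
def may_match(node, query):
    """node: (attribute, middle, lam); query: attribute -> (value, tolerance)."""
    attr, middle, lam = node
    if attr not in query:
        return True  # the query does not constrain this attribute
    value, tol = query[attr]
    # Node values lie within middle +/- lam/2; the query accepts value +/- tol.
    return abs(value - middle) <= lam / 2 + tol

# Q=[a=5]•[b=3]•[c=7,1]; attributes without a tolerance are exact matches.
QUERY = {"a": (5, 0), "b": (3, 0), "c": (7, 1)}
# Node labels from the tree diagram.
NODES = [("a", 4, 2), ("c", 6, 2), ("b", 6, 2), ("b", 10, 3), ("b", 4, 2)]

kept = [node for node in NODES if may_match(node, QUERY)]
print(kept)  # [('a', 4, 2), ('c', 6, 2), ('b', 4, 2)]
```

Under this reading the query is routed past the nodes b=6 and b=10 and reaches only the node b=4.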
TV-Tree Based Query Answering 2
The TV-Tree can be built further from a leaf node by reducing the system thresholds at that leaf
node if most of the queries reaching it require scanning a large amount of data.
Q=[a=5,1]•[b=3,1]•[c=7,1]
Tree diagram: Root, with nodes labeled a=4, λa=2; c=6, λc=2; b=6, λb=2; b=10, λb=3; b=4, λb=2.
Thresholds reduced at the leaf: a=4, λa=2 → [a, λa=1]; b=4, λb=2 → [b, λb=1]; c=6, λc=2 → [c, λc=1].