Top-Down Clustering Method Based On TV-Tree
Zbigniew W. Ras
3/23/2016

Heuristic TV-Tree Construction

Heuristic method
• A random attribute "a" is selected from system S=(X,A,V) to decompose S into systems S1=(X1,A1,V1) and S2=(X2,A2,V2) based on a certain value from Dom(a), called a split point for S.
• va ∈ Dom(a) is an optimal split point for S if it maximizes the total number of active dimensions in S1 and S2 [{S1, S2} is a balanced set].
• Active Dimension: If the distance between any two objects in S with respect to attribute a is less than λa, the attribute a is considered an active dimension for S.
  – Definition: S is a-active if |max a - min a| < λa, where max a = max{a(x): x ∈ X}, min a = min{a(x): x ∈ X}, and λa is a user-given threshold for attribute "a" to be active.
• Middle Value is calculated for each active dimension.
• S1, S2 equally weighted: the constructed TV-Tree has to satisfy a so-called weighted threshold, which requires both card(X1)/card(X) and card(X2)/card(X) to reach a user-given value. The TV-Tree is then called balanced with respect to that threshold.
• Repeat the same process for S1 and S2 by picking another random attribute from S1 and next from S2.
• This recursive process ends once all dimensions in the smallest decomposed systems are active.
• The optimal structure of the TV-Tree depends on the random attribute selected.
  – If the random attribute selected is strongly related to other non-active attributes, we may get more dimensions close to the top of the tree.
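The split-point search described in the bullets above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the author's implementation: the table is the ten-object system S from Figure 1, the threshold λ = 2 for every attribute and the minimum weight of 0.3 are made-up values, candidate split points are taken as midpoints between consecutive distinct values, and objects with a value at or below the split point go to S1. The function names are hypothetical.

```python
# Table S from Figure 1: object -> values of attributes a..e.
S = {
    "X1": {"a": 3, "b": 1, "c": 5, "d": 6, "e": 5},
    "X2": {"a": 3, "b": 4, "c": 0, "d": 0, "e": 5},
    "X3": {"a": 2, "b": 6, "c": 8, "d": 8, "e": 5},
    "X4": {"a": 3, "b": 1, "c": 5, "d": 6, "e": 5},
    "X5": {"a": 2, "b": 9, "c": 6, "d": 3, "e": 5},
    "X6": {"a": 3, "b": 4, "c": 7, "d": 1, "e": 5},
    "X7": {"a": 2, "b": 9, "c": 6, "d": 4, "e": 5},
    "X8": {"a": 3, "b": 1, "c": 7, "d": 7, "e": 5},
    "X9": {"a": 2, "b": 6, "c": 8, "d": 7, "e": 5},
    "X10": {"a": 3, "b": 4, "c": 7, "d": 1, "e": 5},
}
LAMBDA = {attr: 2 for attr in "abcde"}  # assumed thresholds, one per attribute
MIN_WEIGHT = 0.3                        # assumed weighted (balance) threshold


def active_dimensions(rows, lam):
    """Attributes whose value range in `rows` is below the threshold."""
    return [a for a in lam
            if max(r[a] for r in rows) - min(r[a] for r in rows) < lam[a]]


def candidate_split_points(rows, attr):
    """Midpoints between consecutive distinct values of `attr`."""
    values = sorted({r[attr] for r in rows})
    return [(lo + hi) / 2 for lo, hi in zip(values, values[1:])]


def best_split(rows, attr, lam, min_weight):
    """Balanced split point on `attr` with the most active dimensions.

    Returns (split_point, total_active_dimensions), or None when no
    candidate satisfies the balance requirement.
    """
    best = None
    for point in candidate_split_points(rows, attr):
        left = [r for r in rows if r[attr] <= point]
        right = [r for r in rows if r[attr] > point]
        if min(len(left), len(right)) / len(rows) < min_weight:
            continue  # one side is too small: not a balanced set
        score = (len(active_dimensions(left, lam))
                 + len(active_dimensions(right, lam)))
        if best is None or score > best[1]:
            best = (point, score)
    return best


rows = list(S.values())
print(candidate_split_points(rows, "c"))           # [2.5, 5.5, 6.5, 7.5]
print(best_split(rows, "c", LAMBDA, MIN_WEIGHT))   # (6.5, 5)
```

With these assumed thresholds, splitting on attribute c at 6.5 puts X1, X2, X4, X5, X7 in S1 and the remaining objects in S2. The split points 2.5 and 7.5 would give more active dimensions but are rejected because one side holds fewer than 30% of the objects.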
Heuristic TV-Tree Construction: Figure 1

S
       a  b  c  d  e
X1     3  1  5  6  5
X2     3  4  0  0  5
X3     2  6  8  8  5
X4     3  1  5  6  5
X5     2  9  6  3  5
X6     3  4  7  1  5
X7     2  9  6  4  5
X8     3  1  7  7  5
X9     2  6  8  7  5
X10    3  4  7  1  5

S1
       a  b  c  d  e
X1     3  1  5  6  5
X2     3  4  0  0  5
X4     3  1  5  6  5
X5     2  9  6  3  5
X7     2  9  6  4  5

S2
       a  b  c  d  e
X3     2  6  8  8  5
X6     3  4  7  1  5
X8     3  1  7  7  5
X9     2  6  8  7  5
X10    3  4  7  1  5

In Search For Optimal Split Point: Figure 2

Value    Split point
0
5        2.5
6        5.5
7        6.5
8        7.5

Layout of TV-Tree: Figure 3
[Figure 3: tree diagram. The nodes list the attributes a, b, c, d, e; the values 2.5 (next to a), 7.5 (next to c), and 5 (next to e) appear as labels.]

TV-Tree Construction: Figure 4, Heuristic Algorithm

Random Attribute: C

S
      A  B  C  D  E
X1    2  1  1  4  3
X2    4  1  4  4  2
X3    4  3  5  7  3
X4    5  4  6  8  3
X5    5  4  7  9  3
X6    5  3  8  9  2

S1
      A  B  C  D  E
X1    2  1  1  4  3
X2    4  1  4  4  2
X3    4  3  5  7  3
Active Dimensions:  A  E
Middle Value:       3  2.5

S2
      A  B  C  D  E
X4    5  4  6  8  3
X5    5  4  7  9  3
X6    5  3  8  9  2
Active Dimensions:  A  B    C  D    E
Middle Value:       5  3.5  7  8.5  2.5

Let's assume this is the best split point where we yield maximum active dimensions on both derived tables.

Random Attribute: A (S1 is decomposed into S3 and S4; assumed split point)

S3
      A  B  C  D  E
X1    2  1  1  4  3
Active Dimensions:  A  B  C  D  E
Middle Value:       2  1  1  4  3

S4
      A  B  C  D  E
X2    4  1  4  4  2
X3    4  3  5  7  3
Active Dimensions:  A  C    E
Middle Value:       4  4.5  2.5

Random Attribute: B (S4 is decomposed into S5 and S6; assumed split point)

S5
      A  B  C  D  E
X2    4  1  4  4  2
Active Dimensions:  A  B  C  D  E
Middle Value:       4  1  4  4  2

S6
      A  B  C  D  E
X3    4  3  5  7  3
Active Dimensions:  A  B  C  D  E
Middle Value:       4  3  5  7  3

Non-Heuristic TV-Tree Construction

Non-Heuristic method
• Search through all attributes in A to look for the best split point for S to decompose it into S1 and S2 in a way to maximize the overall number of active dimensions in S1, S2 additionally assuming that {S1, S2} is a well-balanced set.
• The remaining steps are similar to heuristic method.
• This method is more expensive than the heuristic method but may produce better TV-trees and therefore a more efficient QAS.
• It works well if information system S does not have a large number of attributes.

TV-Tree Based Query Answering

Exact match query
• User specifies attributes and their values.
  – Example query: q=[a=5].

Range query
• User specifies attributes, their values, and a tolerance αa for each specified attribute a (if α is a tolerance value for attribute a, then α ≤ λa).
  – Example query: q=[a=5,2].

• Q=[a=5]•[b=3]•[c=7,1]
[Figure: TV-Tree with a Root node and nodes labeled a=4, λa=2; c=6, λc=2; b=6, λb=2; b=10, λb=3; b=4, λb=2.]

• A TV-Tree can be built further from a leaf node by reducing the system thresholds at that leaf node if most of the queries reaching it require scanning a large amount of data.
• Q=[a=5,1]•[b=3,1]•[c=7,1]
[Figure: the same tree extended below a leaf node; node labels a=4, λa=2 [a, λa=1]; b=4, λb=2 [b, λb=1]; c=6, λc=2 [c, λc=1].]
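To make the two query types concrete, the following Python sketch evaluates them by scanning a table directly, without any tree. It rests on several assumptions that the slides do not state: "•" is read as a conjunction of terms, a term [a=v,α] is read as |a(x) - v| ≤ α, and an exact-match term is treated as tolerance 0. The data is the six-object table S from Figure 4 with attribute names in lower case. The function name `answer` and the tuple encoding of queries are hypothetical.

```python
# Table S from Figure 4 (attribute names lower-cased for the queries).
S = {
    "X1": {"a": 2, "b": 1, "c": 1, "d": 4, "e": 3},
    "X2": {"a": 4, "b": 1, "c": 4, "d": 4, "e": 2},
    "X3": {"a": 4, "b": 3, "c": 5, "d": 7, "e": 3},
    "X4": {"a": 5, "b": 4, "c": 6, "d": 8, "e": 3},
    "X5": {"a": 5, "b": 4, "c": 7, "d": 9, "e": 3},
    "X6": {"a": 5, "b": 3, "c": 8, "d": 9, "e": 2},
}


def answer(table, query):
    """Return the objects that satisfy every term of the query.

    `query` is a list of (attribute, value, tolerance) terms; an exact
    match is encoded as tolerance 0 (an assumption of this sketch).
    """
    return [name for name, row in table.items()
            if all(abs(row[attr] - value) <= tol
                   for attr, value, tol in query)]


# Q = [a=5]•[b=3]•[c=7,1]: exact on a and b, tolerance 1 on c.
Q1 = [("a", 5, 0), ("b", 3, 0), ("c", 7, 1)]
# Q = [a=5,1]•[b=3,1]•[c=7,1]: tolerance 1 on every attribute.
Q2 = [("a", 5, 1), ("b", 3, 1), ("c", 7, 1)]

print(answer(S, Q1))  # ['X6']
print(answer(S, Q2))  # ['X4', 'X5', 'X6']
```

A TV-Tree based system would aim to reach the same answers while visiting only the nodes whose stored values and thresholds are compatible with the query, instead of scanning every object as this sketch does.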