University of Illinois at Urbana-Champaign
Peixiang Zhao Jeffrey Xu Yu Philip S. Yu
CS@UIUC SEEM@CUHK IBM T. J. Watson Research Center
September 12 th , 2007
VLDB’07 Vienna, Austria
• Graph Containment Query
• Algorithmic Framework
• Indexability of frequent
Tree s
• Discriminative graph feature selection: Δ
VLDB’07 Vienna, Austria
2 of 25 Sept. 12 th , 2007
•
Graph is a mathematical construct and a general data structure representing relations among entities
• The emergence and the dominance of graphs asks for effective graph data management and mining tools so that users can organize, access, and analyze graph data efficiently
• Structural Pattern Mining:
Given a graph database, what are the potentially interesting structural patterns and how can we find them?
• Graph Indexing and Search:
How can we index graphs and perform searching, either exactly or approximately, in large graph databases?
VLDB’07 Vienna, Austria
3 of 25 Sept. 12 th , 2007
• Graph Containment Query
• Given a graph database G = { g
1
, g
2
, …, g
N
} and a query graph q , find the set i
g ,g i i
G }
• NP, since subgraph-isomorphism checking is NP-Complete
• Infeasible to check subgraph isomorphism sequentially for every g i in
G , especially challenging when graphs in G are large, or G is large and diverse
• Graph indexing!
VLDB’07 Vienna, Austria
4 of 25 Sept. 12 th , 2007
• Index construction generates the index feature set F from the graph database G . For each feature f , sup ( f ) is maintained
• Query processing is performed in a filtering-verification fashion:
• The filtering phase uses indexing features contained in q to compute the candidate answer set
Every graph in C q contains all q 's indexed features. Therefore, the query answer set, sup ( q ), is a subset of C q
• The verification phase checks subgraph isomorphism for every graph in
C q
. False positives are pruned and the true answer set sup ( q ) is returned
VLDB’07 Vienna, Austria
5 of 25 Sept. 12 th , 2007
• The cost of processing a graph containment query q upon G , denoted C , can be modeled as below
• C f
: the filtering cost
• C v
: the verification cost (NP-Complete)
• Analysis
1.
The key issue to improve query performance is to minimize |C q
|
2.
The indexing feature set F is quite relevant to C f and |C q
|
3.
Index construction performance: the feature selection cost C fs to construct F from among G
VLDB’07 Vienna, Austria
6 of 25 Sept. 12 th , 2007
•
Path -based Indexing approach
• All existing paths up to a certain length lp are enumerated as indexing features
– Index can be constructed efficiently
– Index size is quite large when lp is not small
– Limited pruning power, mainly because the structural information exhibited in graphs is lost when breaking graphs into paths
•
GraphGrep ( PODS’02 )
• Graph -based Indexing approach
• Subgraphs of G with different characteristics are selected as indexing features
– A costly index construction process
– Compact index structure
– Great pruning power, since structural information of graph is well-preserved
• gIndex ( SIGMOD’04 , PODS’05 ), C-Tree ( ICDE’06 ), GString ( ICDE’07 ), GDIndex
( ICDE’07 ), FG-Index ( SIGMOD’07 )
VLDB’07 Vienna, Austria
7 of 25 Sept. 12 th , 2007
Tree -
• Tree : Better indexability in comparison with path and graph
– The majority of frequent graph-features of
G are usually tree-features indeed
– Frequent tree-features and graph-features share similar distributions and frequent tree-features have similar pruning power like graph-features
– tree mining can be done much more efficiently than graph mining on G
• Δ : On-demand select a small number of discriminative graph-features without conducting costly graph mining beforehand
• Orders of magnitude smaller in index size, but performs much better than existing approaches in indexing construction and query processing
VLDB’07 Vienna, Austria
8 of 25 Sept. 12 th , 2007
• Frequent features (paths, trees, graphs) expose intrinsic characteristics of a graph database, G . They are representatives to discriminate between different groups of graphs in a graph database
• Which one should we index? Path, Tree or Graph?
1. The frequent feature set size: | F |
2. The feature selection cost: C fs
3. the candidate answer set size: |C q
|
VLDB’07 Vienna, Austria
9 of 25 Sept. 12 th , 2007
• Evidences:
• Among all frequent graph-features of G , a majority of them are trees indeed
– All subtrees of a frequent graph are frequent
– There is little chance that subtrees of frequent graph g coincide with those of frequent graph g ’ , due to the structural diversity and label variety
• Frequent paths share a very small portion, because a path-feature has a simple linear structure, which has little variety in structural complexity
• In terms of feature distributions, tree-features and graph-features share a very similar distribution w.r.t.
feature size
VLDB’07 Vienna, Austria
10 of 25 Sept. 12 th , 2007
The Real Dataset
VLDB’07 Vienna, Austria
The Synthetic Dataset
11 of 25 Sept. 12 th , 2007
fs
• Given a graph database,
G , and a minimum support threshold,
σ , to discover the frequent feature set F ( F
P
/ F
T
/ F
G
) from G
Path Tree Graph
Isomorphism
Sub-Isomorphism
O( n )
O( n + m )
O( n ) P or NPC ( ?
)
O( m 3/2 n /log m ) NP-Complete
• Tree
• A good compromise between
– the more expressive, but computationally harder general graph
– the faster but less expressive path
• Specialization of general graph avoiding undesirable theoretical properties and algorithmic complexity incurred by graph
VLDB’07 Vienna, Austria
12 of 25 Sept. 12 th , 2007
q
|
• We define the pruning power power( f ) of a frequent feature f as
• The pruning power of a frequent feature set
S = { f
1
, f
2
, …, f n
}
•
Theorem 1 : Given a frequent graph-feature g , and let its frequent subtree set be T ( g ) = { t
1
, t
2
, …, t n
}. Then, power( g ) ≥ power( T ( g ))
• Theorem 2 : Given a frequent tree-feature t , and let its frequent sub-path set be P ( t ) = { p
1
, p
2
, …, p m
}. Then, power( t ) ≥ power( P ( t ))
VLDB’07 Vienna, Austria
13 of 25 Sept. 12 th , 2007
• The pruning power of all frequent subtree features,
T ( g ), of a frequent graph-feature g can be similar to the pruning power of g
• There is a big gap between the pruning power of a graphfeature g and that of all its frequent sub-path features, P ( g )
VLDB’07 Vienna, Austria
The Real Dataset
14 of 25 Sept. 12 th , 2007
• It is feasible and effective to select F
T
, instead of F
G
, as indexing features for the graph containment query problem
• The frequent tree-feature set, F
T
, dominates F
G
• Discovering frequent tree-features from G can be done much more efficiently than mining frequent general graph-features
• F
T can contribute similar pruning power like that provided by F
G
VLDB’07 Vienna, Austria
15 of 25 Sept. 12 th , 2007
• Consider a query graph q which contains a subgraph g
• If power( T ( g )) ≈ power( g ), there is no need to index the graph-feature g , because its subtrees jointly have the similar pruning power
• if power( g ) >> power( T ( g )), it will be necessary to select g as an index feature because g is more discriminative than T ( g ), in terms of pruning
• Discriminative graph-features ( w.r.t.
its subtree-features, controlled by
ε
0
) are selected from queries on-demand, without mining the whole set of frequent graph-features from G beforehand
• Discriminative graph-features are used as additional indexing features, denoted Δ , which can also be reused further to answer subsequent queries
VLDB’07 Vienna, Austria
16 of 25 Sept. 12 th , 2007
• Given two graphs
, where
• If the gap between power( g’ ) and power( g ) is large enough, g’ will be reclaimed from G ;
• Otherwise , g is discriminative enough for pruning purpose, and there is no need to reclaim g’ in the presence of g
• Approximate the discriminative computation between g’ and g , in the presence of our knowledge on frequent tree-features discovered
VLDB’07 Vienna, Austria
17 of 25 Sept. 12 th , 2007
• The occurrence probability of g in the graph database, G
• the conditional occurrence probability of g’
, w.r.t. g, models the probability to select g’ from G in the presence of g
• The upper and lower bound of
Pr ( g’
| g )
• The conditional occurrence probability of
Pr ( g’|g
), is solely upper-bounded by T ( g’
)
VLDB’07 Vienna, Austria
18 of 25 Sept. 12 th , 2007
• The Real Dataset
• The AIDS antiviral screen dataset from Developmental Theroapeutics
Program in NCI/NIH
• 42390 compounds retrieved from DTP's Drug Information System
• 63 kinds of atoms in this dataset, most of which are C, H, O, S, etc.
• Three kinds of bonds are popular in these compounds: single-bond, double-bond and aromatic-bond
• On average, compounds in the dataset has 43 vertices and 45 edges.
• The graph of maximum size has 221 vertices and 234 edges
VLDB’07 Vienna, Austria
19 of 25 Sept. 12 th , 2007
VLDB’07 Vienna, Austria
20 of 25 Sept. 12 th , 2007
• The real dataset: false positive ratio (| Cq |/|sup( q )|) w.r.t.
the database size (= 1,000; 2,000; 4,000; 8,000; 10,000)
VLDB’07 Vienna, Austria
21 of 25 Sept. 12 th , 2007
• Generated by a widely-used graph generator, which is controlled by the following parameters :
VLDB’07 Vienna, Austria
22 of 25 Sept. 12 th , 2007
VLDB’07 Vienna, Austria
23 of 25 Sept. 12 th , 2007
• Graph indexing plays a critical role in graph containment query processing on large graph databases
• Path-based and graph-based indexing approaches suffer from overly large index size, substantial index construction overhead and expensive query processing cost
• (Tree+Δ) is an effective and efficient graph indexing feature to answer graph containment queries
• (Tree+Δ) holds a compact index structure, achieves good performance in index construction and most importantly, provides satisfactory query performance for answering graph containment queries over large graph databases
VLDB’07 Vienna, Austria
24 of 25 Sept. 12 th , 2007
University of Illinois at Urbana-Champaign
VLDB’07 Vienna, Austria