Keyword Search in Graphs:
Finding r-cliques
Mehdi Kargar
Aijun An
York University, Toronto, Canada
VLDB’11
Keyword Search in Graphs: Finding r-cliques
Overview
• Keyword Search in Graphs/Relational Databases
• r-clique Definition
• Challenges in Finding r-clique
• Approximation Algorithm for Finding r-cliques
• Enumerating Top-k r-cliques in Polynomial
Delay
• Empirical Results
• Conclusion
Keyword Search in Graphs/Relational Databases
• Keyword search is a well-known mechanism for retrieving
relevant information from a set of documents.
• Google is a familiar example!
• What about structured data, such as XML documents or
relational databases?
• Current enterprise search engines over structured data
require:
• Knowledge of the schema
• Knowledge of a query language
• Knowledge of the role of the keywords
• Do users have all of this knowledge?
• The answer is NO!
Keyword Search in Graphs/Relational Databases
• Users need a simple system that receives some
keywords as input and returns a set of nodes that
together cover all or part of the input keywords as output.
• Relational databases can be modeled using graphs:
• Tuples are nodes of the graph.
• Foreign key relationships are edges that connect two nodes
(tuples) to each other.
Example: Search in Relational Databases

Cities
ID    Name       Country
22    Toronto    CA
16    New York   US

Organizations
ID    Name   Head Q.
135   UN     16
175   EU     81

Countries
Code  Name
CA    Canada
US    United States

Memberships
Country  Org.
CA       135
US       135

Part of Mondial Dataset
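
The tuples-as-nodes, foreign-keys-as-edges model can be sketched on this excerpt. The sketch below is illustrative only: the node naming, the unweighted edges and the choice of which columns act as foreign keys (Cities.Country, Organizations.Head Q., and both Memberships columns) are assumptions read off the column names, not something the slides state.

```python
from collections import defaultdict, deque

# Tuples copied from the Mondial excerpt above.
cities = [(22, "Toronto", "CA"), (16, "New York", "US")]
organizations = [(135, "UN", 16), (175, "EU", 81)]
countries = [("CA", "Canada"), ("US", "United States")]
memberships = [("CA", 135), ("US", 135)]

graph = defaultdict(set)  # node -> set of neighbouring nodes


def link(a, b):
    """Add an undirected edge for one foreign-key reference."""
    graph[a].add(b)
    graph[b].add(a)


city_ids = {cid for cid, _, _ in cities}
country_codes = {code for code, _ in countries}

for cid, _, country in cities:
    if country in country_codes:  # assumed FK: Cities.Country -> Countries.Code
        link(("city", cid), ("country", country))
for oid, _, head_q in organizations:
    if head_q in city_ids:  # assumed FK; city 81 is not part of the excerpt
        link(("org", oid), ("city", head_q))
for country, org in memberships:  # assumed FKs to Countries and Organizations
    member = ("membership", country, org)
    link(member, ("country", country))
    link(member, ("org", org))


def hops(src, dst):
    """Breadth-first search: edges on a shortest path, or None if unreachable."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, d = queue.popleft()
        if node == dst:
            return d
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return None


new_york, usa, canada = ("city", 16), ("country", "US"), ("country", "CA")
print(hops(new_york, usa))     # New York -> United States
print(hops(new_york, canada))  # New York -> UN -> membership -> Canada
```

With these assumed edges, the New York tuple is one hop from the United States tuple and three hops from the Canada tuple.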
New York is Located in United States
Keywords: “New York”, “United States”
• Same tables as in the previous example.
• The Cities tuple (16, New York, US) refers to the Countries tuple
(US, United States) through its Country value.
New York hosts UN and Canada is a member
Keywords: “New York”, “Canada”
• Same tables as in the previous example.
• The Organizations tuple (135, UN, 16) has its headquarters in the
Cities tuple (16, New York, US), and the Memberships tuple (CA, 135)
links the Countries tuple (CA, Canada) to the UN.
Previous Approaches
• Most prior work finds minimal connected trees that
contain all or part of the input keywords.
• Such a tree is called a Steiner tree.
• More recently, methods that produce sub-graphs have been
proposed. They might provide more informative
answers.
• One recent approach is the multi-center community
(ICDE 2009).
• So, what is the problem with previous approaches?
Problems with Previous Approaches
1. There might be some content nodes that are far
away from each other.
• It means that weak relationships among content nodes might exist.
• There is no guarantee on the closeness of the nodes.
• Since all keywords are equally important, all of them should be close
to each other. They are also equally important in the ranking function.
2. While searching for the answers, current methods
explore both content and non-content nodes.
• This might lead to poor performance.
r-cliques
• To solve the problems of previous approaches, we propose to
find r-cliques.
• An r-clique is a set of content nodes that together contain all
of the input keywords and in which the shortest distance
between each pair of nodes is no larger than r.
• Weight of an r-clique: Suppose that the nodes of an r-clique are
denoted as {v1, v2, … , vn}. The weight of the r-clique is
defined as:
weight = \sum_{i=1}^{n} \sum_{j=i+1}^{n} dist(v_i, v_j)
• dist(vi, vj) is the shortest distance between vi and vj.
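
A minimal sketch of the two checks implied by this definition, assuming the weight is the sum of shortest distances over all unordered pairs of nodes. The function names, the `dist` callback and the toy distances are hypothetical, and keyword coverage is assumed to have been checked separately.

```python
from itertools import combinations


def is_r_clique(nodes, dist, r):
    """True if every pair of nodes is within shortest distance r."""
    return all(dist(u, v) <= r for u, v in combinations(nodes, 2))


def r_clique_weight(nodes, dist):
    """Sum of shortest distances over all unordered pairs of nodes."""
    return sum(dist(u, v) for u, v in combinations(nodes, 2))


# Hypothetical shortest distances between three content nodes.
toy = {("v1", "v2"): 1, ("v1", "v3"): 2, ("v2", "v3"): 3}


def toy_dist(u, v):
    """Symmetric lookup in the toy distance table."""
    return toy[(u, v)] if (u, v) in toy else toy[(v, u)]


nodes = ["v1", "v2", "v3"]
print(is_r_clique(nodes, toy_dist, r=3), r_clique_weight(nodes, toy_dist))
```

In this toy case the three nodes form an r-clique for r = 3 but not for r = 2, because v2 and v3 are 3 apart.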
Benefits of Finding r-cliques
• Finding r-cliques as the answers for keyword search in
graphs does not have the problems of previous
approaches.
• All of the content nodes are reasonably close to each other.
• The weight function evaluates all of the content nodes
equally.
• The algorithm (to be discussed later) for finding r-cliques
concentrates on the content nodes rather than all of the
nodes in the graph. So, it is faster and more efficient.
• For presenting the relationships, the final answer has fewer
irrelevant nodes than a multi-center community.
An Example
Input keywords: James, John, Jack
[Figure: two candidate answers for the example query]
• First answer: r-clique weight 12, tree weight 8, community weight 8
• Second answer: r-clique weight 14, tree weight 7, community weight 7
Challenges in Finding r-cliques
• Problem 1: Given a distance threshold r, a graph G and a
set of input keywords, find an r-clique in G whose weight
is minimum.
• Theorem: Problem 1 is NP-hard.
• Proved in the paper by reduction from 3-satisfiability (3-SAT).
• Solution: an approximation algorithm with a guaranteed
ratio.
• The total number of answers is exponential in the
number of input keywords.
• It is not efficient to generate all answers and then sort them.
• Solution: enumerate answers with polynomial delay.
What We Need …
• Producing r-cliques in ranked order
• r-cliques with lower weights should be presented before those with
higher weights.
• Producing top-k r-cliques efficiently with a bound on
approximation ratio
• Each r-clique must be generated efficiently in polynomial time.
• There must be a bound on the quality of a generated r-clique
• The weight of a generated r-clique should be within some factor of the
current optimal solution
• Generating all the r-cliques if needed
• No r-clique should be missed
Heuristic and Approximate Order
• Heuristic order: the answer is expected to be close to the
optimal answer, but there is no guarantee.
• Approximate order (the desired choice): the answer is close to the
optimal answer with a provable guarantee.
Enumerating in Approximate Order
• Lawler's technique is used for finding the top-k
answers.
• In each iteration, the next r-clique is generated by finding
the top answer under constraints.
• Two problems should be solved:
1. What are the constraints?
2. How can the top answer be found efficiently under
the constraints?
Overview of the System
1. Input: the keywords and the value of k.
2. Find the best answer with no constraints.
3. Insert the best r-clique, together with its search space, into the
priority queue.
4. If the top-k answers have already been printed or the priority queue
is empty, terminate.
5. Otherwise, fetch the best r-clique from the priority queue and print it.
6. Divide the search space of that top answer into sub-spaces.
7. Find the best r-clique in each sub-space under the associated
constraints.
8. Insert each answer, together with its search space, into the priority
queue, and return to step 4.
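
The priority-queue loop of this overview can be sketched as follows. This is a sketch under stated assumptions, not the system itself: `best_in` is an exhaustive stand-in for the finder (the talk uses an approximation algorithm instead), the sub-space division is the usual Lawler-style split, and all names and toy distances are hypothetical.

```python
import heapq
from itertools import combinations, count, product


def best_in(space, dist, r):
    """Exhaustive stand-in: lightest valid r-clique in a search space, or None."""
    best = None
    for answer in product(*space):
        pairs = list(combinations(answer, 2))
        if all(dist(u, v) <= r for u, v in pairs):
            weight = sum(dist(u, v) for u, v in pairs)
            if best is None or weight < best[0]:
                best = (weight, answer)
    return best


def top_k(space, dist, r, k):
    """Return up to k (answer, weight) pairs in the order they are printed."""
    heap, tie, printed = [], count(), []
    first = best_in(space, dist, r)
    if first is not None:
        heapq.heappush(heap, (first[0], next(tie), first[1], space))
    while heap and len(printed) < k:
        weight, _, answer, sp = heapq.heappop(heap)
        printed.append((answer, weight))
        # Divide the search space of the fetched answer into sub-spaces.
        for i in range(len(sp)):
            sub = [{answer[j]} for j in range(i)]  # keep earlier choices fixed
            sub.append(sp[i] - {answer[i]})        # exclude the i-th choice
            sub.extend(sp[i + 1:])                 # later keywords stay free
            found = best_in(sub, dist, r)
            if found is not None:
                heapq.heappush(heap, (found[0], next(tie), found[1], sub))
    return printed


# Hypothetical distances between holders of two keywords.
toy = {("a1", "b1"): 1, ("a1", "b2"): 2, ("a2", "b1"): 3, ("a2", "b2"): 5}


def toy_dist(u, v):
    """Symmetric lookup in the toy distance table."""
    return toy[(u, v)] if (u, v) in toy else toy[(v, u)]


space = [{"a1", "a2"}, {"b1", "b2"}]
for answer, weight in top_k(space, toy_dist, r=4, k=3):
    print(answer, weight)
```

The counter `tie` only breaks ties between equal weights so that the heap never has to compare answers or search spaces.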
Constraints and Search Space
• Let's do it using an example!
• Suppose that the input keywords are {k1, k2, k3, k4}.
• Ci = {set of nodes that contain keyword ki}.
• The search space that contains the best r-clique can be
represented as C1 × C2 × C3 × C4.
• Assume that the best r-clique is (v1, v2, v3, v4), where vi is
a node containing keyword ki.
[Figure: the whole search space]
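
One common way to turn the best r-clique (v1, v2, v3, v4) into constraints is a Lawler-style split: the i-th sub-space fixes v1 … v(i-1), excludes vi from Ci, and leaves the remaining sets untouched. The sketch below assumes that split; the function name and the extra nodes a, b, c are hypothetical.

```python
def partition(search_space, best):
    """Split a search space into disjoint sub-spaces that exclude `best`.

    search_space: list of sets C1..Cl; best: tuple (v1..vl) with vi in Ci.
    """
    subspaces = []
    for i in range(len(search_space)):
        sub = [{best[j]} for j in range(i)]        # v1..v(i-1) are included
        sub.append(search_space[i] - {best[i]})    # vi is excluded
        sub.extend(set(c) for c in search_space[i + 1:])
        if sub[i]:                                 # drop empty sub-spaces
            subspaces.append(sub)
    return subspaces


space = [{"v1", "a"}, {"v2", "b"}, {"v3"}, {"v4", "c"}]
best = ("v1", "v2", "v3", "v4")
for sub in partition(space, best):
    print([sorted(c) for c in sub])
```

Every candidate answer other than the best one falls into exactly one sub-space, so no r-clique is generated twice and none is missed.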
Finding Best Approximate r-clique
• Step 1: For every content node n in the search space and every
keyword ki, find the node closest to n in the search space
that contains ki.
• Step 2: For every content node n, sum the distances from n to
the closest holder of each keyword ki.
• Step 3: Find the content node with the minimum sum of
distances.
• Step 4: Return that node together with its closest holder of
each keyword.
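
The four steps can be sketched as below. This is an illustrative reading rather than the published implementation: shortest distances are assumed to be available through a `dist` callback, a candidate is dropped when some keyword has no holder within distance r of it (an assumption about how r is enforced), and the node names and distances are made up.

```python
from math import inf


def best_approx_rclique(search_space, dist, r):
    """search_space: list of node collections C1..Cl, one per keyword."""
    best_sum, best_answer = inf, None
    for i, c_i in enumerate(search_space):
        for n in c_i:                      # every content node is a candidate
            answer, total = [], 0
            for j, c_j in enumerate(search_space):
                if j == i:
                    answer.append(n)       # n itself holds keyword i
                    continue
                closest = min(c_j, key=lambda v: dist(n, v), default=None)
                if closest is None or dist(n, closest) > r:
                    answer = None          # no holder of keyword j near n
                    break
                answer.append(closest)
                total += dist(n, closest)
            if answer is not None and total < best_sum:
                best_sum, best_answer = total, tuple(answer)
    return best_answer, best_sum


# Hypothetical shortest distances between content nodes.
toy = {("a1", "b1"): 1, ("a1", "c1"): 2, ("b1", "c1"): 1, ("a2", "b1"): 3,
       ("a2", "c1"): 3, ("a1", "c2"): 4, ("a2", "c2"): 1, ("b1", "c2"): 4}


def toy_dist(u, v):
    """Symmetric lookup; unknown pairs count as unreachable."""
    if u == v:
        return 0
    return toy.get((u, v), toy.get((v, u), inf))


space = [["a1", "a2"], ["b1"], ["c1", "c2"]]
print(best_approx_rclique(space, toy_dist, r=3))
```

Here the candidate b1 wins with a distance sum of 2, giving the answer (a1, b1, c1). Only distances from the candidate are bounded by r in this sketch, so by the triangle inequality the other pairs are within 2r of each other.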
Properties of the Approximation Algorithm
• Only content nodes are searched when finding the best
answer in the search space.
• The approximation ratio of the algorithm is 2.
• The weight of the answer is at most twice the weight of the
optimal answer.
• The proof can be found in the paper.
Presenting r-cliques to the User
• To show the relationship between the nodes in an r-clique,
a Steiner tree is found and presented to the user.
Keywords (in the DBLP dataset): “Parallel”, “Algorithm”, “Optimization”, “Graph”
[Figure: two answers for this query, labeled “community” and “r-clique”.
Recoverable node labels: the papers “Distributed Parallel Algorithm For
Nonlinear Optimization Without Derivatives” and “A Binding Number
Computation of Graph”; the authors Guoping He, Congying Han and Xuping
Zhang; and, marked “Irrelevant”, the paper “A New Non-interior
Continuation Method for Second-Order Cone Programming”.]
Experimental Results
• The r-clique method is compared with the multi-center community
method (called com-k).
• Our approximation algorithm is called poly-delay-k.
• Two datasets are used: DBLP and IMDb.
• The input keywords and parameters are the same as in
the community paper.
Running Time
[Figure: running time on the DBLP and IMDb datasets]
Quality of the Answers
[Figure: quality of the answers on the DBLP dataset]
Search Accuracy from a User Study
• Top-k precision: the percentage of the answers in the top-k
answers that are relevant to the query.
• The users are asked to evaluate the answers using two
methods.
• In the first approach, scores between 0 and 1 are assigned to the
nodes, and their average is used as the precision.
• In the second approach, the whole answer is evaluated and a score
is assigned to it.
• The results of the two methods are similar.
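
As a small worked example of these measures, the sketch below computes top-k precision from per-answer relevance judgements, and the node-score average used in the first approach. The function names and the judgement values are made up for illustration and are not results from the study.

```python
def top_k_precision(relevant, k):
    """Percentage of the first k answers that were judged relevant."""
    top = relevant[:k]
    return 100.0 * sum(top) / len(top)


def node_score_precision(node_scores):
    """First approach: average of the 0-1 scores given to an answer's nodes."""
    return sum(node_scores) / len(node_scores)


judgements = [True, True, False, True, False]  # hypothetical user judgements
print(top_k_precision(judgements, k=5))
print(node_score_precision([1.0, 0.5, 1.0, 0.5]))
```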
[Figure: top-k precision on the DBLP dataset]
Conclusion
• A novel and efficient approach for keyword search in
graphs has been proposed.
• All of the content nodes are reasonably close to each
other.
• An approximation algorithm with a bounded guarantee has
been proposed.
• Only content nodes are explored during the search
process.
• A Steiner tree with as few intermediate nodes as possible
is generated to reveal the relations
among content nodes.
Thank you!
Any Questions?