Querying and Mining Large
Graph Databases
Jeffrey Xu Yu
Department of Systems Engineering and
Engineering Management
The Chinese University of Hong Kong
Query Processing











Parallel query processing
Distributed query processing
Adaptive query processing
Continuous query processing
Spatial database query processing
Data warehouse, online analytical query processing (OLAP),
datacube, view management, Iceberg queries
Multi-database query processing
Data mining query processing
Object-Oriented query processing
Extensible database query processing
XML/RDF/Graph query processing
Books and Surveys on QP










Readings in Database Systems (Fourth Edition), Michael Stonebraker and
Joseph M. Hellerstein, Morgan Kaufmann,
http://redbook.cs.berkeley.edu/bib4.html
Principles of Database Query Processing for Advanced Applications,
Clement T. Yu and Weiyi Meng, Morgan Kaufmann, 1998
Matthias Jarke and Jürgen Koch: Query Optimization in Database Systems,
ACM Computing Surveys, 16(2), 111-152, 1984.
Michael V. Mannino, Paicheng Chu, and Thomas Sager: Statistical Profile
Estimation in Database Systems, ACM Computing Surveys, 20(3), 191-221,
1988.
Priti Mishra and Margaret H. Eich: Join Processing in Relational Databases,
ACM Computing Surveys, 24(1), 63-113, 1992.
D. J. DeWitt and J. Gray: Parallel Database Systems: The Future of High
Performance Database Processing, Communications of the ACM, June 1992.
Goetz Graefe: Query Evaluation Techniques for Large Databases, ACM
Computing Surveys, 25(2), 73-170, 1993.
Yannis E. Ioannidis: Query Optimization, ACM Computing Surveys, 28(1), 121-123, 1996.
Surajit Chaudhuri: An Overview of Query Optimization in Relational
Systems, PODS, 34-43, 1998.
D. Kossmann: The State of the Art in Distributed Query Processing, ACM
Computing Surveys, 32(4), 422-469, 2000.
The Main Theme of DBMSs


Structural behavior and operational behavior
More data structures and more operations?





What should be added?
At which level to support?
How to support?
The more the better?
Can it be complete?
Supporting Structural Behavior




Entity-Relationship Model, Peter P. Chen, ACM TODS,
1976
Database Abstractions: Aggregation and
Generalization, John M. Smith and Diane C.P. Smith,
ACM TODS, 1977
Molecular Objects, Abstract Data Types and Data
Models: A Framework, Don S. Batory and Alejandro P.
Buchmann, VLDB’84
Survey of Graph Database Models, Renzo Angles and
Claudio Gutierrez, ACM Computing Surveys, Vol. 40,
No.1, 2008
Adding New Structures into RDBMS (1)

Complex Objects in IBM System-R (Raymond A.
Lorie and Wil Plouffe, SIGMOD’83)



System-R is a database system built at IBM San Jose
Research in the 1970’s.
System-R introduced SQL.
Lorie and Plouffe proposed adding hidden
pointers to support tree-structured schemas.
Adding New Structures into RDBMS (2)

Implementation of Data Abstraction in RDBS
Ingres (James Ong, Dennis Fogg and Michael
Stonebraker, SIGMOD-Record, 1984)


Quel as a Data Type (Michael Stonebraker, et al.,
SIGMOD’84)
Nested Relations (Akifumi Makinouchi, VLDB’77)
OO DBMS Manifesto

Object-Oriented Database System Manifesto
(Malcolm Atkinson, François Bancilhon, David
DeWitt, Klaus Dittrich, David Maier and Stanley
Zdonik, DOOD, 1989)

Third-Generation Database System Manifesto (Michael
Stonebraker, Lawrence A. Rowe, Bruce G.
Lindsay, Jim Gray, Michael J. Carey, Michael L.
Brodie, Philip A. Bernstein, and David Beech,
DS-4, 1990)
Semistructured and XML


Querying Semistructured Information (Dallan
Quass, Anand Rajaraman, Yehoshua Sagiv,
Jeffrey D. Ullman, and Jennifer Widom,
DOOD’95)
XML query processing has been a hot issue since
1998.
Early Work To Support Graphs
in DBMSs
A Database as a Graph (GOOD)



A graph-oriented object database model (Marc
Gyssens, Jan Paredaens, Jan Van den Bussche and
Dirk Van Gucht, PODS’90)
Both schema and objects are graphs
Data manipulations are graph transformations

Based on graph grammars and pattern matching

4 operators: node/edge insertion/deletion
An abstract operator to abstract objects that share the
same set of properties
Computational Completeness




Relational complete with the 4 operators.
Simulate nested relational algebra with the abstract and the 4
operators
Support Abstract Graph Structures

A Graphical Query Language Supporting
Recursion (Isabel F. Cruz, Alberto O.
Mendelzon, Peter Wood, SIGMOD’87)

Its expressive power is comparable to relational
query languages.
Explicitly and Views

Support graphs on RDBMS explicitly


GraphDB: Modeling and Querying Graphs in
Databases (Ralf Hartmut Guting, VLDB’94)
Graph Views

Database Graph Views: A Practical Model to Manage
Persistent Graphs (Alejandro Gutierrez, Philippe
Pucheral, Hermann Steffen and Jean-Marc Thevenin,
VLDB’94)

Build a graph view on a DBMS (RDBMS or OODBMS)
Graph view operations: union, intersection, node-difference,
edge-difference, selection, and projection.
Query processing on graph views: set-oriented and pipelined


Graph Storage



Representing web graphs (Sriram Raghavan,
Hector Garcia-Molina, ICDE’03)
The Web graph is huge.
Construct a two level graph.




Partition nodes into N1, N2, ….
Construct a super-node graph where a node is a
partition Ni.
Maintain page-based connection information inside Ni,
and between Ni and Nj.
How to partition?
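The two-level construction can be sketched in a few lines of Python. This is only an illustration, assuming the partition is already given as a node-to-partition map (choosing it is the open question on the slide); the names `build_two_level` and `part_of` are hypothetical and not taken from the paper.

```python
from collections import defaultdict

def build_two_level(edges, part_of):
    """Split a directed edge list into a supernode graph plus per-partition edge lists.

    edges   : iterable of (u, v) page-level links
    part_of : dict mapping each node to its partition id (assumed to be given)
    """
    super_edges = set()            # edges of the supernode graph, Ni -> Nj
    intra = defaultdict(list)      # links that stay inside one partition Ni
    inter = defaultdict(list)      # links from Ni to Nj, keyed by (i, j)
    for u, v in edges:
        i, j = part_of[u], part_of[v]
        if i == j:
            intra[i].append((u, v))
        else:
            super_edges.add((i, j))
            inter[(i, j)].append((u, v))
    return super_edges, dict(intra), dict(inter)

# Hypothetical four-page web graph split into two partitions.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
part_of = {"a": 1, "b": 1, "c": 2, "d": 2}
super_edges, intra, inter = build_two_level(edges, part_of)
```

Only the supernode graph needs to stay resident; the per-partition lists can be loaded on demand.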
Books on Social Networks






Social and Economic Networks
by Matthew O. Jackson
Social Network Data Analysis
by Charu C. Aggarwal
Exploratory Social Network Analysis with Pajek by
Wouter de Nooy, Andrej Mrvar, and Vladimir Batagelj
Networks, Crowds, and Markets: Reasoning about a
Highly Connected World
by David Easley and Jon Kleinberg
Networks: An Introduction
by M.E.J. Newman
Managing and Mining Graph Data by Charu C. Aggarwal
and Haixun Wang
Some Online Courses



Mining of Massive Datasets (Anand
Rajaraman and Jeff Ullman)
http://infolab.stanford.edu/~ullman/mmds.html
Networks, Crowds, and Markets: Reasoning
about a highly connected world, by David
Easley and Jon Kleinberg
http://www.cs.cornell.edu/home/kleinber/networks-book
Topics in Data Management & Mining –
Social Networks, Laks V.S. Lakshmanan
http://www.cs.ubc.ca/~laks/534l/cpsc534l.html
Some Recent Tutorials



Mining Heterogeneous Information Networks
by Jiawei Han, Yizhou Sun, Xifeng Yan, and
Philip S. Yu (KDD’10)
Mining Knowledge from Databases: An
Information Network Analysis
by Jiawei Han, Yizhou Sun, Xifeng Yan, and
Philip S. Yu (SIGMOD’10)
Querying Large Graph Databases
by Yiping Ke, James Cheng, and Jeffrey Xu Yu
(DASFAA’10)
Stanford Large Network Dataset Collection
http://snap.stanford.edu/data












Social networks
Communication networks
Citation networks
Collaboration networks
Web graphs
Amazon networks
Internet networks
Road networks
Autonomous systems
Signed networks
Wikipedia networks and metadata
Twitter and Memetracker
Graph Database
http://en.wikipedia.org/wiki/Graph_database




Pregel: Google’s internal graph processing
platform
Trinity: Microsoft Research Asia
Neo4j: commercial graph database
…
Graph Reachability Query
Two Possible Solutions

Traverse G(V, E) to answer reachability
queries


Low query performance: O(|E|) query time
Precompute and store the transitive closure T


Fast query processing
Large storage requirement: O(|V|^2)
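The trade-off between the two solutions can be made concrete with a small Python sketch. It is an illustration only: the adjacency-list format and the function names are assumptions, and the closure is built naively by repeated traversal.

```python
from collections import deque

def reachable_bfs(adj, s, t):
    """Solution 1: answer one query by traversal, O(|V| + |E|) time per query."""
    seen, queue = {s}, deque([s])
    while queue:
        u = queue.popleft()
        if u == t:
            return True
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return False

def transitive_closure(adj):
    """Solution 2: precompute every reachable pair, up to O(|V|^2) stored entries."""
    nodes = set(adj) | {v for vs in adj.values() for v in vs}
    return {(s, t) for s in nodes for t in nodes if reachable_bfs(adj, s, t)}

# Hypothetical chain a -> b -> c.
adj = {"a": ["b"], "b": ["c"], "c": []}
tc = transitive_closure(adj)   # afterwards a query is a single set lookup
```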
Interval-based Labeling

For tree-structured data




Assign each node one interval [start, end] such as the pair of
[pre-order, post-order] during DFS
Reachable iff one node’s interval contains the other’s
Space-efficient: O(|V| log |V|)
Extend to DAG [Agrawal et al. SIGMOD89]



To cover all descendants, a node may contain multiple intervals
Space complexity: O(|V|^2) in the worst case.
It can be used to support directed graphs.
[Figure: interval labels on a tree: a [1,16], b [2,9], c [10,11], d [12,15], e [3,4], f [5,6], g [7,8], h [13,14]; in the DAG version, d carries the two intervals [5,8] and [12,15]]
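For the tree case, the labeling and the containment test can be sketched as follows. This is a minimal illustration, assuming a single counter that is incremented on entering and on leaving each node during DFS; the tree shape is inferred from the intervals shown on the slide, and the function names are hypothetical.

```python
def label_tree(children, root):
    """Assign each node a [start, end] interval using one DFS counter."""
    labels, counter = {}, 0

    def dfs(u):
        nonlocal counter
        counter += 1                 # number taken when the node is entered
        start = counter
        for c in children.get(u, ()):
            dfs(c)
        counter += 1                 # number taken when the node is left
        labels[u] = (start, counter)

    dfs(root)
    return labels

def reaches(labels, u, v):
    """u reaches v iff u's interval contains v's interval."""
    (su, eu), (sv, ev) = labels[u], labels[v]
    return su <= sv and ev <= eu

children = {"a": ["b", "c", "d"], "b": ["e", "f", "g"], "d": ["h"]}
labels = label_tree(children, "a")
```

With this tree the sketch produces the same intervals as the slide, for example [1,16] for a and [2,9] for b.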
2-hop Labeling [Cohen et al. SODA02]



For each node a, maintain two sets of labels (nodes): Lin(a)
and Lout(a)
For each connection (a,b),
 choose a node c on the path from a to b (center node)
 add c to Lout(a) and to Lin(b)
Then (a,b) ∈ Transitive Closure T ⇔ Lout(a) ∩ Lin(b) ≠ ∅
Minimize the sum of the label sizes
(NP-complete, so approximation is required)
[Figure: a path from a to b through the center node c]
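A query against 2-hop labels is just a set intersection. The sketch below is illustrative: the labels are hand-written for a hypothetical chain a -> c -> b, and it assumes the common convention that each node also appears in its own Lin and Lout so that single hops are covered.

```python
def reachable_2hop(l_out, l_in, a, b):
    """(a, b) is in the transitive closure iff Lout(a) and Lin(b) share a center."""
    return bool(l_out[a] & l_in[b])

# Hypothetical labels for the chain a -> c -> b, with c as the center node.
l_out = {"a": {"a", "c"}, "c": {"c"}, "b": {"b"}}
l_in = {"a": {"a"}, "c": {"c"}, "b": {"b", "c"}}
```

Here `reachable_2hop(l_out, l_in, "a", "b")` is true because both labels contain c, while the reverse query finds an empty intersection.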
2-hop Labeling (Example)
[Figure: example graph on nodes a to i with the table of Lin and Lout labels for each node]
Approximation Algorithm
What are good center nodes?
Nodes that can cover many uncovered connections.
[Figure: example graph with nodes 1 to 6]
Initial step: all connections are uncovered.
Consider the center graph of each candidate.
Center graph of node 2: I = {1, 2}, O = {2, 4, 5, 6}
initial density = |Edges| / (|I| + |O|) = 8 / (2 + 4) = 8/6 ≈ 1.33
(We can cover 8 connections with 6 cover entries.)
Density of the densest subgraph: here the same as the initial density.
Approximation Algorithm
Center graph of node 4: I = {1, 2, 3, 4}, O = {4, 5, 6}
initial density = |Edges| / (|I| + |O|) = 12 / (4 + 3) = 12/7 ≈ 1.71
Density of the densest subgraph = initial density (the center graph is complete).
Cover the connections in the subgraph with the greatest density, using the corresponding center node.
Approximation Algorithm
Next step: some connections are already covered.
Consider the center graph of each candidate again.
[Figure: the example graph and a smaller center graph that keeps only uncovered connections]
Repeat this algorithm until all
connections are covered
Theorem: Generated Cover is optimal
up to a logarithmic factor
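The greedy loop can be sketched in Python under one stated simplification: each round takes the whole center graph of the best candidate rather than its densest subgraph, so the logarithmic guarantee of the theorem is not claimed for this sketch. The function name `greedy_2hop` and the input format (a transitive closure given as a set of pairs) are assumptions for illustration.

```python
def greedy_2hop(tc, nodes):
    """Greedy 2-hop cover. tc is the set of reachable pairs (a, b) with a != b.

    Simplification: the full center graph of the chosen center is used
    in each round, not its densest subgraph.
    """
    reach = tc | {(v, v) for v in nodes}      # every node reaches itself
    l_in = {v: set() for v in nodes}
    l_out = {v: set() for v in nodes}
    uncovered = set(tc)
    while uncovered:
        best = None
        for c in nodes:
            # Uncovered connections (a, b) that have c on a path from a to b.
            pairs = {(a, b) for a, b in uncovered
                     if (a, c) in reach and (c, b) in reach}
            if not pairs:
                continue
            ins = {a for a, _ in pairs}       # side I of the center graph
            outs = {b for _, b in pairs}      # side O of the center graph
            density = len(pairs) / (len(ins) + len(outs))
            if best is None or density > best[0]:
                best = (density, c, pairs, ins, outs)
        _, c, pairs, ins, outs = best
        for a in ins:
            l_out[a].add(c)                   # c becomes a center in Lout(a)
        for b in outs:
            l_in[b].add(c)                    # and in Lin(b)
        uncovered -= pairs
    return l_in, l_out

# Hypothetical chain a -> b -> c and its transitive closure.
nodes = ["a", "b", "c"]
tc = {("a", "b"), ("a", "c"), ("b", "c")}
l_in, l_out = greedy_2hop(tc, nodes)
```

On this chain the first round already picks b, whose center graph has density 3/4 and covers all three connections with four label entries.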
2-hop Labeling Construction
T = TC
While T is not empty
  Choose S: elements of T to be processed
  Update Lins and Louts w.r.t. S
  Remove elements from T w.r.t. S

Construction Example
[Figure: the transitive closure T of the example graph on nodes a to i]
Construction Example
[Figure: T and the bipartite center graph Bh of node h]
A subgraph of Bh: 9/7
Construction Example
[Figure: T and the bipartite center graph Bf of node f]
The densest subgraph Sf of Bf, 15/8
The densest among all nodes
Construction Example
[Figure: Sf and the Lin/Lout table, with center node f added to the labels of the nodes in Sf]
Update Lins and Louts w.r.t. Sf
Construction Example
[Figure: T and the updated T after the connections covered by Sf are removed]
Construction Example
Finding the densest subgraph is not obvious
 An NP-hard problem -> SET COVER
 Use SET COVER heuristics to minimize the size

Optimizing Performance [Schenkel et al.
EDBT04, ICDE‘05]



Density of the densest subgraph of a node’s
center graph never increases when
connections are covered
Do not rank all the densest subgraphs in
every iteration.
Precompute estimates, recompute on
demand.
Is that enough?
Transitive Closure: 344,992,370 connections
2-Hop Cover: 1,289,930 entries
compression factor of ~267
queries are still fast (~7.6 entries/node)
Computation took 45 hours and 80 GB RAM!
HOPI: Divide and Conquer
Framework of an Algorithm:
(I)
Partition the graph such that the
transitive closures of the partitions fit into
memory and the weight of crossing
edges is minimized
(II)
Compute the 2-hop cover for each
partition
(III)
Combine the 2-hop covers of the
partitions into the final cover
Final Results for Index Creation
Transitive Closure: 344,992,370 connections
2-Hop Cover: 9,999,052 entries
compression factor of ~34.5
queries are still ok (~59.2 entries/node)
build time is good (~23 minutes with 1 CPU and 1 GB RAM)
Min Set-Cover: Least Overlapping vs Min
Cardinality [Cheng et al. EDBT’06]

Tested with a DAG (|V|=2,000, |E|=4,000)
R-tree Approach [Cheng et al. EDBT’06]



Do not generate transitive closure
Map the 2-hop problem onto a two-dimensional
grid, and
Compute 2-hop labeling using operations
against rectangles with help of an R-tree.
An Example
Utilizing Interval-Based Labeling
Interval Based Codes
Reachability Map
A Reachability Map Example
An R-Tree Based Algorithm
Selecting Bipartite Graphs
Hierarchical Partitioning [Cheng et al.
EDBT’08]
Reachability Queries

I/O Cost Minimization: Reachability Queries Processing
over Massive Graphs

Scaling Reachability Computation on Large Graphs

The Exact Distance to Destination in Undirected World

Label-constraint Reachability Queries

K-Reach
Keyword Search in RDB
Find Information from Relational Databases

RDBs are structured
with rich schema
information


Complex Schema
SQL query language

long learning curves
select p.title
from conference c, paper p, author a1, author a2, write w1, write w2
where c.cid = p.cid AND p.pid = w1.pid AND p.pid = w2.pid
  AND w1.aid = a1.aid AND w2.aid = a2.aid
  AND a1.name = 'John' AND a2.name = 'John'
  AND c.name = 'SIGMOD'
Keyword Search in RDB


Given a relational database, RDB, consider the
RDB as a directed graph G(V, E).
Given a set of keywords, find interconnected
tuple structures.
 Connected Trees

Minimal Total Joining Networks of Tuples
A Simple RDB
Author
AID  Name
a1   Jim
a2   Kate

Paper
PID  Title
p1   Database System
p2   Algorithm design
p3   Algorithm Analysis
p4   Database Schema
p5   Query Processing

Write
WID  AID  PID
w1   a1   p5
w2   a1   p2
w3   a1   p1
w4   a2   p1
w5   a2   p3

Cite
CID  PD1  PD2
c1   p1   p3
c2   p5   p4
c3   p5   p3

[Figure: the database drawn as a graph over the tuples c1 to c3, p1 to p5, w1 to w5, a1 and a2]
Trees (Jim Database Algorithm)
[Figure: the tables of the simple RDB, with the connected trees of tuples that contain the keywords Jim, Database and Algorithm]
MTJNT (Jim Database Algorithm)
[Figure: the tables of the simple RDB, with the minimal total joining networks of tuples for the keywords Jim, Database and Algorithm]
Two Basic Approaches

Take RDB as it is.



Generate a set of Candidate Networks (CNs)
Evaluate each CN using SQL
Take RDB as a graph by materialization.

Design a graph algorithm to find all answers
Finding Answer Trees

Intuition: travel backwards from keyword nodes till
you hit a common node
Query: sudarshan roy
[Figure: the paper node "MultiQuery Optimization" linked through writes nodes to the author nodes "Sudarshan" and "Prasan Roy"]
Backward Search: Algorithm
 Run concurrent single source shortest path iterators
from each node matching a keyword
 Traverse the graph edges in reverse direction
 Output next nearest node on each get-next() call
 Do best-first search across iterators
 Output node if in the intersection of sets of nodes
reached from each keyword
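The steps above can be sketched in Python. This is a simplified illustration, not the published algorithm: it runs one multi-source shortest-path iterator per keyword (rather than one per matching node) and interleaves them through a single heap. The edge format and the name `backward_search` are assumptions.

```python
import heapq
from collections import defaultdict

def backward_search(edges, keyword_nodes):
    """Yield (root, total_distance) for nodes that reach every keyword.

    edges         : list of (u, v, weight) directed edges
    keyword_nodes : dict keyword -> set of nodes matching that keyword
    """
    rev = defaultdict(list)
    for u, v, w in edges:
        rev[v].append((u, w))                  # traverse edges in reverse direction
    dist = {k: {} for k in keyword_nodes}      # settled distances per keyword
    heap = [(0, n, k) for k, ns in keyword_nodes.items() for n in ns]
    heapq.heapify(heap)
    while heap:
        d, n, k = heapq.heappop(heap)          # best-first across all iterators
        if n in dist[k]:
            continue
        dist[k][n] = d
        if all(n in dist[j] for j in keyword_nodes):
            # n is in the intersection of the sets reached from each keyword.
            yield n, sum(dist[j][n] for j in keyword_nodes)
        for p, w in rev[n]:
            if p not in dist[k]:
                heapq.heappush(heap, (d + w, p, k))

# Hypothetical graph for the query "sudarshan roy", all weights 1.
edges = [("paper", "w1", 1), ("w1", "sudarshan", 1),
         ("paper", "w2", 1), ("w2", "roy", 1)]
keyword_nodes = {"sudarshan": {"sudarshan"}, "roy": {"roy"}}
results = list(backward_search(edges, keyword_nodes))
```

Roots are reported in the order in which they are reached, which is not guaranteed to be the order of total tree cost.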
Backward Search: Limitations
Wasteful exploration of graph:


Frequently occurring keywords
“Hub” nodes in the graph (high in-degree)
[Figure: example query “Shashank Sudarshan Database” over a graph of author, writes and paper nodes, with a schema legend]
Bidirectional Search: Motivation
Bidir Search: Intuition

First cut solution:



Don’t go backward if a keyword matches many
nodes
Don’t go backward if a node points to a hub
Instead explore forward from other keywords
Bidir Search: Example
[Figure: example query “Shashank Sudarshan Database” over a graph of author, writes and paper nodes, with a schema legend]
Top-K Minimum Group Steiner Tree
A Parameterized Approach
Dynamic Programming [Ding et al. ICDE’07]

A Naïve Approach
Dynamic Programming Equation
The Order to Compute T(v,p)
Finding Maximal Cliques in
Massive Networks by H*-graph
James Cheng, Yiping Ke,
Ada Wai-Chee Fu, Jeffrey Xu Yu,
Linhong Zhu (SIGMOD’10)
Maximal Clique Enumeration (MCE)

A long-standing problem


Find all maximal cliques in a given graph
Applications in graph theory and other areas


Maximal common induced subgraphs, maximal
common edge subgraphs, maximal independent
set, …
Social networks, clustering in dynamic networks,
detection patterns in terrorist networks, computational
biology, financial network analysis, …
Why This Problem?



All existing algorithms are in-memory
The best algorithm requires memory space linear in
the graph size (|V|+|E|)
Real-world networks become exceedingly large
and keep growing




The Web: over 1 trillion webpages (Google)
Social networks: 400 million users (Facebook)
Citation graphs: 1.3+ million publications (DBLP)
Existing algorithms fail to handle large networks
Our Contributions

We propose the first external-memory MCE
algorithm
A recursive approach
At each recursive step
Pick a subgraph g from the large given graph G
Find all max-cliques related to g
Remove g from G
Two Questions
Which subgraph should we choose at each time?
How to ensure soundness and completeness?
What is H?

h-index
The maximum h for a scientist who has
h publications with at least h citations each
“h-index” for a graph
The maximum h for a graph G that has
h vertices with degree of at least h
H = {the h vertices}
[Figure: example graph with its vertices sorted by degree: c (7), b (6), a (5), d (5), e (5), followed by w, x, y, s, r, z, q, t with degrees of at most 4]
h = 5 ( 5 vertices with deg ≥ 5 )
H = {c, b, a, d, e}
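The graph h-index follows directly from the sorted degree list. The sketch below is illustrative; the adjacency-set format, the function name `h_vertices`, and the small example graph are assumptions and not the graph from the slide.

```python
def h_vertices(adj):
    """Return (h, H): the largest h with h vertices of degree >= h, and those vertices."""
    by_degree = sorted(adj, key=lambda v: len(adj[v]), reverse=True)
    h = 0
    for rank, v in enumerate(by_degree, start=1):
        if len(adj[v]) >= rank:
            h = rank          # the rank-th vertex still has degree >= rank
        else:
            break
    return h, set(by_degree[:h])

# Hypothetical undirected graph with degrees a: 3, b: 3, c: 2, d: 2.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"a", "b"},
}
h, H = h_vertices(adj)
```

The same scan over the degrees listed on the slide (7, 6, 5, 5, 5, then values of at most 4) stops at rank 5, which matches h = 5.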
H-graph and H+-graph



H-graph: subgraph of G induced by H
Hnb : the neighbors of h-vertices in G
H+-graph: subgraph of G induced by (H ∪ Hnb)

The + means the extension from H to neighbors
[Figure: the example graph with the H-graph and the H+-graph marked]
H*-graph

H*-graph: a graph that “lies” between H-graph and
H+-graph


Exclude edges among h-neighbors from H+-graph
Contain edges incident on at least an h-vertex
[Figure: the example graph with the H-graph, the H*-graph and the H+-graph marked]
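The three subgraphs differ only in which edges they keep, which a short Python sketch makes explicit. It is an illustration under assumptions: the graph is an undirected edge list, H is passed in as a given set, and the function names are hypothetical.

```python
def h_edges(edges, H):
    """H-graph: edges with both endpoints in H."""
    return [(u, v) for u, v in edges if u in H and v in H]

def h_star_edges(edges, H):
    """H*-graph: edges incident on at least one h-vertex.

    Edges among h-neighbors only (both endpoints outside H) are excluded,
    which is what separates the H*-graph from the H+-graph.
    """
    return [(u, v) for u, v in edges if u in H or v in H]

def h_plus_edges(edges, H):
    """H+-graph: edges of the subgraph induced by H and the neighbors of H."""
    h_nb = {v for u, v in edges if u in H} | {u for u, v in edges if v in H}
    keep = set(H) | h_nb
    return [(u, v) for u, v in edges if u in keep and v in keep]

# Hypothetical graph; H is assumed to be {a, b} for the illustration.
edges = [("a", "b"), ("a", "x"), ("b", "y"), ("x", "y"), ("y", "z")]
H = {"a", "b"}
```

In this example the edge (x, y) joins two h-neighbors, so it appears in the H+-graph but not in the H*-graph.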
H*-graph


We use H*-graph as the subgraph of G for
computing local max-cliques in recursive steps
Why H*-graph?


H*-graph gives the core of G as well as the connections
from the core to other parts of G,
so it provides rich information for MCE
H*-graph is only a small portion of G,
so it is small enough to be kept in memory
Size of H*-graph

Most real networks are scale-free




Degree distribution follows power law
R: exponent in power law (constant, normally between
-0.8 and -0.7)
n: |V(G)|
We obtain theoretical bounds on the size of H*-graph for scale-free networks
H*-graph vs. H-graph and H+-graph

Size of H-graph

Size of H+-graph

For a network G with R = -0.7 and n = 1M



H*-graph is within [12%, 15%] of G
H-graph is at most 4% of G: too small!
H+-graph can be as large as 65% of G: too large!
ExtMCE

Recursive step
Extract H*-graph from G by one scan of G
Compute local max-cliques by an existing in-memory algorithm
Extend local max-cliques to global ones
Remove H*-graph from G
Repeat until G becomes empty
Neighborhood-Privacy Protected
Shortest Distance Computing in
Cloud
This is a joint work by Jun Gao, Jeffrey Xu Yu,
Ruoming Jin, Jiashuai Zhou, Tengjiao Wang,
Dongqing Yang (SIGMOD’11)
Graph data management in cloud
Cloud Computing
Graph data applications
• Social network, knowledge network ...
Time-consuming graph operations
• The shortest distance computing takes O(n^2)
• The breadth-first search requires O(n+m)
• ......
Advantages of cloud computing
• High computational power
• Easy maintenance
• Easy re-provisioning of resources
• ……
Can we use the cloud server to manage graph data, such as to answer shortest distance queries?
Security issues in graph outsourcing

Attacks on outsourced graph
Structural Pattern Attack
Use sub-graph to re-identify the target part
Reconstruction Attack
Recover the original graph from the outsourced graph.
Security leakage
Regulation of sensitive data violated
Untrusted answers produced by cloud server
We have to strike a balance between security and
the computational cost saved by using the cloud server
Framework of graph outsourcing
[Figure: the client side holds the original graph and the link graph and performs graph transformation, query rewriting and result combination; the cloud server holds the outsourced graph and performs query evaluation]
(1) A reasonable security model on the outsourced graph
(2) An efficient method to transform the original graph into the outsourced
graph
(3) An approach to rewrite the query and combine the results
Structural Anonymization

Structural anonymization in publishing
1-neighborhood [icde08], k-degree [sigmod08], k-automorphism
[vldb09], k-isomorphism [sigmod10], etc
Using the least amount of modifications of the original graph


[Figure: the original weighted graph, its 4-isomorphism version, and the attacker’s query, which finds 4 sub-graphs]
No shortest distance preservation
No consideration of edge weight
1-Neighborhood-d-Radius Graph

Intuition: Protect the neighborhood information and the
close relationship between nodes.
[Figure: the original graph, a 2-radius outsourced graph, and the attacker’s query]
(1-neighborhood): for any node pair u and v ∈ Vo, (u, v) ∉ E
(d-radius): for any node pair u and v ∈ Vo, δG(u, v) >= d.

Privacy protection: Cannot find any meaningful results for
any query pattern
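The two conditions can be checked mechanically. The sketch below is a hedged illustration: it assumes an undirected weighted graph stored as nested dicts, takes δG to be the weighted shortest distance in the original graph, and uses hypothetical function names.

```python
import heapq
from itertools import combinations

def shortest_distance(adj, s, t):
    """Dijkstra on an undirected weighted graph given as {u: {v: weight}}."""
    dist, heap = {}, [(0, s)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in dist:
            continue
        dist[u] = d
        if u == t:
            return d
        for v, w in adj.get(u, {}).items():
            if v not in dist:
                heapq.heappush(heap, (d + w, v))
    return float("inf")

def is_1_neighborhood_d_radius(adj, v_o, d):
    """Check both conditions for every pair of outsourced nodes in v_o."""
    for u, v in combinations(sorted(v_o), 2):
        if v in adj.get(u, {}):                 # 1-neighborhood: (u, v) must not be in E
            return False
        if shortest_distance(adj, u, v) < d:    # d-radius: original distance >= d
            return False
    return True

# Hypothetical path a - b - c - d with unit weights.
adj = {
    "a": {"b": 1},
    "b": {"a": 1, "c": 1},
    "c": {"b": 1, "d": 1},
    "d": {"c": 1},
}
```

For instance, the node set {a, c} passes with d = 2 but fails with d = 3, and {a, b} fails for any d because the two nodes are adjacent.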
Utilization: Shortest Distance Computation
[Figure: the outsourced graph, the link graph, and the original graph]
Given a node pair u and v, the shortest distance can be
discovered with
δG(u, v) = min { w(u, x) + δG(x, y) + w(y, v) }
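The combination step on the client side can be sketched as a minimum over pairs of outsourced nodes. This is an illustration under assumptions: the link graph is a dict of undirected link weights from an original node to an outsourced node, the cloud server is modeled as a callable that returns the distance between two outsourced nodes, and all names and numbers are hypothetical.

```python
def combine_distance(u, v, link, outsourced_distance):
    """Evaluate the minimum over (x, y) of w(u, x) + delta(x, y) + w(y, v).

    link                : dict {(node, outsourced_node): weight} kept by the client
    outsourced_distance : callable (x, y) -> distance answered by the cloud server
    """
    best = float("inf")
    for (a, x), w_ux in link.items():
        if a != u:
            continue
        for (b, y), w_yv in link.items():
            if b != v:
                continue
            best = min(best, w_ux + outsourced_distance(x, y) + w_yv)
    return best

# Hypothetical link weights and cloud answers.
link = {("u", "o"): 3, ("u", "h"): 5, ("v", "i"): 2}
cloud = {("o", "i"): 10, ("h", "i"): 4}
d_uv = combine_distance("u", "v", link, lambda x, y: cloud[(x, y)])
```

Here the route through h wins: 5 + 4 + 2 = 11, against 3 + 10 + 2 = 15 through o.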
Graph Transformation Problem

Given a graph G = (V,E) and d, the graph
transformation produces outsourced graphs Go =
{G1, ...Gj}, and a local link graph Gl, which achieves
the following objectives:

Security
Each outsourced graph is a 1-neighborhood-d-radius graph;
Utility
The union of Go and Gl can answer the shortest distance in the
original graph;
Local computational cost
The space cost of Gl and the cost of the shortest distance
computation on the client side are minimized.
Diversified Ranking

Scalable Diversified Ranking on Large Graphs,
Rong-Hua Li and Jeffrey Xu Yu, ICDM’11

Diversifying Top-K Results,
Lu Qin, Jeffrey Xu Yu, Lijun Chang: VLDB’12
Scalable Diversified Ranking (1)



Relevance of the top-K nodes (denoted by a set S) is
achieved by large Personalized PageRank scores.
Diversity of the top-K nodes is achieved by a large
expansion ratio.
Expansion ratio of S: σ(S)=|N(S)|/n
 Larger expansion ratio implies more diversity
[Figure: two choices of S and the nodes expanded by S, with σ(S)=0.6 and σ(S)=0.9]
Scalable Diversified Ranking (2)

The k-step expansion ratio of S: σk(S)=|Nk(S)|/n

Larger expansion ratio implies more diversity
[Figure: two choices of S and the nodes expanded by S in two steps, with σ2(S)=0.8 and σ2(S)=1]
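The k-step expansion ratio can be computed with a bounded breadth-first expansion. The sketch assumes that Nk(S) contains S itself plus every node within k steps of S, and that every node appears as a key of the adjacency dict; the function name and the example graph are hypothetical.

```python
def expansion_ratio(adj, S, k=1):
    """sigma_k(S) = |N_k(S)| / n, with N_k(S) = S plus all nodes within k steps of S."""
    reached = set(S)
    frontier = set(S)
    for _ in range(k):
        # Nodes first reached in this step.
        frontier = {v for u in frontier for v in adj.get(u, ())} - reached
        reached |= frontier
    return len(reached) / len(adj)

# Hypothetical undirected path 1 - 2 - 3 - 4 - 5.
adj = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
```

On this path, S = {3} has a larger one-step ratio than S = {1}, which is the sense in which a larger expansion ratio indicates a more diverse choice.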
Graph Clustering Based on
Structural and Attribute Similarities

A desired clustering of attributed graph should
achieve a good balance between the following
two things.

Structural cohesiveness: Vertices within one cluster
are close to each other in terms of structure, while
vertices between clusters are distant from each other.

Attribute homogeneity: Vertices within one cluster
have similar attribute values, while vertices between
clusters have quite different attribute values.
A Coauthor Network Example
[Figure: coauthor graph of authors r1 to r11 with their topics (r1, r2, r4 to r8: XML; r3: XML, Skyline; r9 to r11: Skyline), shown as the traditional coauthor graph and under structure-based, attribute-based, and structural/attribute clustering]
Different Clustering Approaches on the
Graph with Multiple Attributes

Structure-based Clustering
Vertices with heterogeneous values in a cluster
Attribute-based Clustering
Lose much structure information
Structural/Attribute Cluster
Vertices with homogeneous values in a cluster
Keep most structure information
Our Proposed Clustering Solution
Attribute Augmented Coauthor Graph
Then we use neighborhood random walk distance on the
augmented graph to combine structural and attribute
similarities
Thank You!