Key-Key-Value Stores for Efficiently Processing Graph Data in the Cloud
Alexander G. Connor
Panos K. Chrysanthis
Alexandros Labrinidis
Advanced Data Management Technologies Laboratory
Department of Computer Science
University of Pittsburgh
Data in social networks
• A social network manages user profiles, updates and connections
• How to manage this data in a scalable way?
• Key-value stores offer performance under high load
• Some observations about social networks
• A profile view usually includes data from a user’s friends
• Spatial locality
• A friend’s profile is often visited next
• Temporal locality
• Requests might ask for updates from several users
• Web pages might include pieces of several user profiles
• A single request requires connecting to many machines
Connections in a Social Network
[Figure: example graph of user connections in a social network, centered on Alice]
Leveraging Locality
• Can we take advantage of the connections?
• What if we stored connected users' profiles and data in the same place?
• Locality can be leveraged
• The number of connections is reduced
• User data can be pre-fetched
• We can think of this as a graph partitioning problem…
• Partitions = machines
• Vertices = user profiles, including updates
• Edges = connections
• Objective: minimize the number of edges that cross partitions
Example – graph partitioning
• Many edges cross partitions: accessing a vertex's neighbors requires accessing many partitions; in a social network, requesting updates from followed users requires connecting to many machines
• Far fewer edges cross partitions: accessing a vertex's neighbors requires accessing few partitions; in a social network, fewer connections are made and related user data can be pre-fetched
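To make the partitioning objective concrete, here is a minimal sketch (the names and the toy follow graph are illustrative, not from the slides) that counts how many edges cross partitions for a given assignment of users to machines; this is the quantity a good partitioning minimizes.

```java
import java.util.*;

// Illustrative sketch only: count the edges whose endpoints are assigned to
// different partitions (machines) -- the quantity the objective minimizes.
public class EdgeCut {

    static int crossingEdges(List<String[]> edges, Map<String, Integer> partitionOf) {
        int cut = 0;
        for (String[] e : edges) {
            // An edge contributes to the cut if its endpoints live on different machines.
            if (!partitionOf.get(e[0]).equals(partitionOf.get(e[1]))) cut++;
        }
        return cut;
    }

    public static void main(String[] args) {
        // Hypothetical follow graph: alice-bob, alice-carol, bob-dave
        List<String[]> follows = List.of(
                new String[]{"alice", "bob"},
                new String[]{"alice", "carol"},
                new String[]{"bob", "dave"});

        // Alice separated from her friends: 2 crossing edges
        Map<String, Integer> spread = Map.of("alice", 0, "bob", 1, "carol", 1, "dave", 1);
        // Everyone co-located: 0 crossing edges
        Map<String, Integer> together = Map.of("alice", 0, "bob", 0, "carol", 0, "dave", 0);

        System.out.println(crossingEdges(follows, spread));   // 2
        System.out.println(crossingEdges(follows, together)); // 0
    }
}
```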
Key-Key-Value Stores
• Our proposed approach: extend the key-value model
• Data can be stored as key-values
• User profiles
• Data can also be stored as key-key-values
• User connections
• “Alice follows Bob”
• Use key-key-values to compute locality
• On-line graph partitioning algorithm
• Assign keys to grid locations based on connections
• Each grid cell represents a data host
• Keys that are related are kept together
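A minimal sketch of the two record shapes the model implies: an ordinary key-value for a user profile, and a key-key-value for a connection such as "Alice follows Bob". The class and field names are assumptions for illustration, not the system's actual types.

```java
// Illustrative sketch of the two record shapes in a key-key-value store.
// Class and field names are assumptions, not the system's actual API.
public class KKVModel {

    // Ordinary key-value: a user's profile data keyed by user id.
    record KeyValue(String key, String value) {}

    // Key-key-value: a relationship between two keys, e.g. "alice follows bob".
    // The on-line partitioner reads these records to decide which keys to co-locate.
    record KeyKeyValue(String key1, String key2, String value) {}

    public static void main(String[] args) {
        KeyValue profile = new KeyValue("alice", "profile data for Alice");
        KeyKeyValue edge = new KeyKeyValue("alice", "bob", "follows");
        System.out.println(profile);
        System.out.println(edge);
    }
}
```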
Outline
• Introduction
• Data in Social Networks
• Leveraging Locality
• Key-Key-Value Stores
• System Model
• Client API
• Adding a Key-Key-Value
• Load management
• On-line partitioning algorithm
• Simulation Parameters
• Results
• Conclusion
System Model
• Logical Layer: Address Table (mapping store)
  • A transactional, distributed hash table
  • Organized as a square grid; maps keys to virtual machines
  • Can maintain client sessions
• Virtual machines
  • Run the KKV store software
  • Manage cached data and replication
  • Can be moved between physical machines as needed
• Physical machines
  • Can be added or removed dynamically as demands change
• Application Layer: Client API
[Figure: system layers - application sessions, address table, virtual hosts, physical hosts]
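The layering above implies a two-step lookup: the address table maps a key to a grid cell (virtual node), and a separate mapping resolves that cell to the physical machine currently hosting it. The sketch below is a single-process illustration of that lookup under assumed names, not the actual implementation (which uses a transactional, distributed hash table).

```java
import java.util.*;

// Single-process sketch of the layered lookup: key -> grid cell (virtual node)
// -> physical host. All names here are illustrative assumptions.
public class AddressTableSketch {

    // A grid cell identifies a virtual node, e.g. (1,1) or (8,8).
    record Cell(int row, int col) {}

    // Logical layer: key -> grid cell. The real address table is a
    // transactional, distributed hash table, not a local HashMap.
    private final Map<String, Cell> keyToCell = new HashMap<>();

    // Virtual -> physical mapping: a virtual node can be moved to another
    // physical machine without changing any key-to-cell assignments.
    private final Map<Cell, String> cellToHost = new HashMap<>();

    void placeKey(String key, Cell cell)   { keyToCell.put(key, cell); }
    void placeCell(Cell cell, String host) { cellToHost.put(cell, host); }

    // Resolve the physical machine that currently serves a key.
    String hostFor(String key) { return cellToHost.get(keyToCell.get(key)); }

    public static void main(String[] args) {
        AddressTableSketch table = new AddressTableSketch();
        table.placeCell(new Cell(1, 1), "physical-host-a");
        table.placeCell(new Cell(8, 8), "physical-host-b");
        table.placeKey("alice", new Cell(1, 1));
        table.placeKey("bob",   new Cell(8, 8));
        System.out.println(table.hostFor("alice")); // physical-host-a
    }
}
```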
Client API and Sessions
• Clients use a simple API that includes the get, put and sync commands
• Data is pulled from the logical layer in blocks
  • Groups of related keys
• The client API keeps data in an in-memory cache
• Data is pushed out asynchronously to virtual nodes in blocks
• Push/pull can be done synchronously if requested by the client
  • Offers stronger consistency at the cost of performance
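The slides name get, put and sync as the client commands; the snippet below is a hypothetical usage sketch under an assumed session interface, with a trivial in-memory stub so it runs. It is not the system's real client library.

```java
import java.util.*;

// Hypothetical usage of the get/put/sync commands named on the slide.
// The session interface and the in-memory stub are illustrative assumptions.
public class ClientApiExample {

    interface ClientSession {
        String get(String key);                        // served from the local cache when possible
        void put(String key, String value);            // key-value, pushed out asynchronously in blocks
        void put(String k1, String k2, String value);  // key-key-value, e.g. "alice follows bob"
        void sync();                                   // force a synchronous push/pull for stronger consistency
    }

    // Trivial single-process stub so the example runs; the real client would
    // talk to virtual nodes through the address table and cache blocks of keys.
    static class InMemorySession implements ClientSession {
        private final Map<String, String> cache = new HashMap<>();
        public String get(String key) { return cache.get(key); }
        public void put(String key, String value) { cache.put(key, value); }
        public void put(String k1, String k2, String value) { cache.put(k1 + "->" + k2, value); }
        public void sync() { /* would flush buffered writes to the virtual nodes */ }
    }

    public static void main(String[] args) {
        ClientSession session = new InMemorySession();
        session.put("alice", "profile data for Alice");  // key-value
        session.put("alice", "bob", "follows");          // key-key-value (a connection)
        session.sync();                                  // optional: trade performance for consistency
        System.out.println(session.get("alice"));
    }
}
```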
Adding a key-key-value
put(alice, bob, follows)
• Two on-line users: Alice and Bob
• Use the address table to determine the virtual node that hosts Alice's data
• Write the key-key-value to that node
• Use the address table to determine the node that hosts Bob's data
• Write the key-key-value to that node
• The on-line partitioning algorithm later moves Alice's data to Bob's node because they are connected, so both users' data ends up on the same node

Address table: alice → 1,1 (moved to 8,8 after partitioning), bob → 8,8
Virtual hosts:
  Node 1,1: kv(alice, ...), kkv(alice, bob, follows)
  Node 8,8: kv(bob, ...), kkv(alice, bob, follows)
(A code sketch of this write path follows.)
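Below is a sketch of that write path with assumed helper names: the address table is consulted for each key and the key-key-value is written to both hosting nodes, after which the on-line partitioner may co-locate the two keys. It illustrates the flow on the slide, not the actual code.

```java
import java.util.*;

// Sketch of the write path for put(alice, bob, follows). Helper and field
// names are assumptions. The key-key-value is stored with both keys; the
// on-line partitioner may later move alice's data onto bob's node because
// the two keys are connected.
public class PutKeyKeyValueSketch {

    record Cell(int row, int col) {}

    static final Map<String, Cell> addressTable = new HashMap<>();
    static final Map<Cell, List<String>> virtualNodes = new HashMap<>();

    static void putKKV(String k1, String k2, String value) {
        // 1. Use the address table to find the virtual node hosting each key.
        Cell n1 = addressTable.get(k1);
        Cell n2 = addressTable.get(k2);
        // 2. Write the key-key-value to both nodes so either side can find it.
        String kkv = "kkv(" + k1 + ", " + k2 + ", " + value + ")";
        virtualNodes.computeIfAbsent(n1, c -> new ArrayList<>()).add(kkv);
        if (!n2.equals(n1)) {
            virtualNodes.computeIfAbsent(n2, c -> new ArrayList<>()).add(kkv);
        }
    }

    public static void main(String[] args) {
        addressTable.put("alice", new Cell(1, 1));
        addressTable.put("bob",   new Cell(8, 8));
        putKKV("alice", "bob", "follows");
        System.out.println(virtualNodes); // the kkv appears on both (1,1) and (8,8)
    }
}
```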
Splitting a Node
• If one node becomes overloaded, it can initiate a split
• To maintain the grid structure, nodes in the same row and column must also split
• Once the split is complete, new physical machines can be turned on
• Virtual nodes can be transferred to these new machines
[Figure: the virtual host grid before and after a split]
(A conceptual sketch of the split follows.)
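The slide does not spell out how keys are redistributed, so the following is only a conceptual sketch under that caveat: splitting an overloaded cell grows the grid by one row and one column, and every cell in that row or column hands roughly half of its keys to its new neighbor, keeping the structure a grid. New physical machines can then absorb the extra virtual nodes.

```java
import java.util.*;

// Conceptual sketch only (the redistribution details are assumptions): an
// overloaded virtual node (r, c) triggers a split that adds one row and one
// column to the grid; every node in row r or column c gives half of its keys
// to the newly created neighbor, so the structure remains a grid.
public class GridSplitSketch {

    // load[r][c] = number of keys hosted by virtual node (r, c)
    static int[][] split(int[][] load, int r, int c) {
        int rows = load.length, cols = load[0].length;
        int[][] grown = new int[rows + 1][cols + 1];
        for (int i = 0; i < rows; i++)
            System.arraycopy(load[i], 0, grown[i], 0, cols);
        // Cells in row r split downwards into the new row...
        for (int j = 0; j <= cols; j++) {
            grown[rows][j] = grown[r][j] / 2;
            grown[r][j]  -= grown[rows][j];
        }
        // ...and cells in column c split rightwards into the new column.
        for (int i = 0; i <= rows; i++) {
            grown[i][cols] = grown[i][c] / 2;
            grown[i][c]   -= grown[i][cols];
        }
        return grown;
    }

    public static void main(String[] args) {
        int[][] before = { {10, 12}, {11, 40} };   // node (1,1) is overloaded
        System.out.println(Arrays.deepToString(split(before, 1, 1)));
    }
}
```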
Outline
• Introduction
• Data in Social Networks
• Leveraging Locality
• Key-Key-Value Stores
• System Model
• Client API
• Adding a Key-Key-Value
• Load management
• On-line Partitioning Algorithm
• Simulation Parameters
• Results
• Conclusion
On-line Partitioning Algorithm
• Runs periodically in parallel on each virtual node
• Also after a split or merge
For each key stored on a node:
  Determine the number of connections (key-key-values) with keys on other nodes
    (can also be the sum of edge weights)
  Find the node that has the most connections
  If that node is different than the current node:
    If the number of connections to that node is greater than the number of connections to the current node:
      If this margin is greater than some threshold:
        Move the key to the other node
        Update the address table
• Designed to work in a distributed, dynamic setting
• NOT a replacement for off-line algorithms in static settings
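Below is a compact sketch of the per-node step just described. Data-structure and parameter names are assumptions; in the real system this runs periodically and in parallel on every virtual node, and the caller would also update the address table for each move.

```java
import java.util.*;

// Sketch of the per-key rebalancing step. For each key on this node, find the
// node holding most of its connections; if that node beats the current one by
// more than a threshold, propose moving the key there.
public class OnlinePartitionerSketch {

    /**
     * @param keysOnThisNode keys currently hosted by this virtual node
     * @param connections    per key: node id -> number of connections (or sum of
     *                       edge weights) to keys on that node, derived from the
     *                       stored key-key-values
     * @param thisNode       id of the current virtual node
     * @param threshold      minimum improvement required before moving a key
     * @return proposed moves: key -> destination node
     */
    static Map<String, String> rebalance(Set<String> keysOnThisNode,
                                         Map<String, Map<String, Integer>> connections,
                                         String thisNode, int threshold) {
        Map<String, String> moves = new HashMap<>();
        for (String key : keysOnThisNode) {
            Map<String, Integer> byNode = connections.getOrDefault(key, Map.of());
            int local = byNode.getOrDefault(thisNode, 0);
            // Find the node with the most connections to this key.
            String best = thisNode;
            int bestCount = local;
            for (Map.Entry<String, Integer> e : byNode.entrySet()) {
                if (e.getValue() > bestCount) { best = e.getKey(); bestCount = e.getValue(); }
            }
            // Move only if a different node wins by more than the threshold.
            if (!best.equals(thisNode) && bestCount - local > threshold) {
                moves.put(key, best);
            }
        }
        return moves;
    }

    public static void main(String[] args) {
        // Matches the partitioning example below: a key on node 1,1 with edge
        // sums 0 to 1,1, 2 to 2,1 and 1 to 1,2 is moved to node 2,1.
        Map<String, Map<String, Integer>> conn =
                Map.of("alice", Map.of("1,1", 0, "2,1", 2, "1,2", 1));
        System.out.println(rebalance(Set.of("alice"), conn, "1,1", 0)); // {alice=2,1}
    }
}
```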
Partitioning Example
[Figure: a key hosted on virtual node 1,1, with two edges to keys on node 2,1 and one edge to a key on node 1,2]

Node          1,1   2,1   1,2
Sum(Edges)     0     2     1

[Figure: after the partitioning step, the key has been moved to node 2,1, the node it has the most connections to]
Experimental Parameters

Parameter                          Values
No. Vertices (V)                   100-400
Branching Factor (b)               10%-100% of V
Distribution of b                  Zipf
alpha                              1.5
Partitioning Algorithms            On-line, Kernighan-Lin
On-line Workload                   Random, pre-generated history of edge inserts
On-line algorithm run frequency    Every V/10 inserts
On-line threshold                  Improvement > 0
Trials                             3 per graph size
Partitioning Quality Results
[Figure: % of edges within a partition vs. number of vertices in the graph (0-500), comparing On-line and KL]
On-line partitions as well as Kernighan-Lin
Partitioning Performance Results
[Figure: vertices moved vs. number of vertices in the graph (0-500), comparing On-line and KL]
On-line partitions 2x faster than Kernighan-Lin!
Conclusions
• Contributions:
• A novel model for scalable graph data stores that extends the key-value model
• Key-key-value store
• A high-level system design
• A novel on-line partitioning algorithm
• Preliminary experimental results
• Our proposed algorithm shows promise in the distributed, dynamic setting
What’s Ahead?
• Prototype system implementation
• Java, PostgreSQL
• Performance Analysis against MongoDB, Cassandra
• Sensitivity Analysis
• Cloud Deployment
Thank You!
• Acknowledgments
• Daniel Cole, Nick Farnan, Thao Pham, Sean Snyder
• ADMT Lab, CS Department, Pitt GPSA, Pitt A&S GSO, Pitt A&S
PBC