Key-Key-Value Stores for Efficiently Processing Graph Data in the Cloud
Alexander G. Connor, Panos K. Chrysanthis, Alexandros Labrinidis
Advanced Data Management Technologies Laboratory, Department of Computer Science, University of Pittsburgh

Data in Social Networks
• A social network manages user profiles, updates and connections
• How do we manage this data in a scalable way?
• Key-value stores offer performance under high load
• Some observations about social networks:
  • A profile view usually includes data from a user's friends (spatial locality)
  • A friend's profile is often visited next (temporal locality)
  • Requests might ask for updates from several users
  • Web pages might include pieces of several user profiles
  • A single request requires connecting to many machines

Connections in a Social Network
(Figure: an example social graph centered on the user Alice)

Leveraging Locality
• Can we take advantage of the connections?
• What if we stored connected users' profiles and data in the same place?
  • Locality can be leveraged
  • The number of connections is reduced
  • User data can be pre-fetched
• We can think of this as a graph partitioning problem:
  • Partitions = machines
  • Vertices = user profiles, including updates
  • Edges = connections
  • Objective: minimize the number of edges that cross partitions

Example – Graph Partitioning
• With a poor partitioning, many edges cross partitions
  • Accessing a vertex's neighbors requires accessing many partitions
  • In a social network, requesting updates from followed users requires connecting to many machines
• With a good partitioning, far fewer edges cross partitions
  • Accessing a vertex's neighbors requires accessing few partitions
  • In a social network, fewer connections are made and related user data can be pre-fetched

Key-Key-Value Stores
• Our proposed approach: extend the key-value model
• Data can be stored as key-values
  • User profiles
• Data can also be stored as key-key-values
  • User connections, e.g. "Alice follows Bob"
• Use key-key-values to compute locality
  • An on-line graph partitioning algorithm assigns keys to grid locations based on connections
  • Each grid cell represents a data host
  • Keys that are related are kept together

Outline
• Introduction
  • Data in Social Networks
  • Leveraging Locality
  • Key-Key-Value Stores
• System Model
  • Client API
  • Adding a Key-Key-Value
  • Load Management
• On-line Partitioning Algorithm
  • Simulation Parameters
  • Results
• Conclusion

System Model
• Application Layer: the client API
  • Maintains client sessions
  • Stores cached data
• Address Table: maps keys to virtual machines
  • A transactional, distributed hash table
  • Organized as a square grid
• Logical Layer: virtual machines
  • Run the KKV store software
  • Manage replication
  • Can be moved between physical machines as needed
• Physical Layer: physical machines
  • Can be added or removed dynamically as demands change
(Figure: application sessions use the address table to reach virtual hosts, which run on physical hosts)

Client API and Sessions
• Clients use a simple API that includes the get, put and sync commands
• Data is pulled from the logical layer in blocks (groups of related keys)
• The client API keeps data in an in-memory cache
• Data is pushed out asynchronously to virtual nodes in blocks
• Push/pull can be done synchronously if requested by the client
  • Offers stronger consistency at the cost of performance

Adding a Key-Key-Value
• put(alice, bob, follows) — two on-line users: Alice and Bob
• Use the address table to determine the virtual machine (node) that hosts Alice's data and the node that hosts Bob's data
• Write the data to those nodes
• The partitioning algorithm moves Alice's data to Bob's node, because they are connected
(Figure: the address table maps alice from cell 1,1 to 8,8 and bob to 8,8; kkv(alice, bob, follows) is stored alongside both kv(alice, ...) and kv(bob, ...))
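To make the flow above concrete, here is a minimal sketch in Java (the language planned for the prototype) of how a put(alice, bob, follows) call could be routed through the address table to the hosting virtual nodes. The names (KKVPutExample, GridCell, the in-memory maps) are illustrative assumptions, not the authors' implementation; since the figure shows the key-key-value stored with both endpoints, the sketch writes it to both hosting nodes.

```java
// Illustrative sketch only: the address table is a plain in-memory map from
// key to grid cell, and each virtual node is a list of stored entries.
import java.util.*;

public class KKVPutExample {

    record GridCell(int row, int col) {}

    // Address table: maps each key to the grid cell (virtual node) hosting it.
    static final Map<String, GridCell> addressTable = new HashMap<>();

    // Virtual nodes: each grid cell stores the key-values and key-key-values assigned to it.
    static final Map<GridCell, List<String>> virtualNodes = new HashMap<>();

    static void put(String key1, String key2, String value) {
        // 1. Look up the nodes that host each key.
        GridCell node1 = addressTable.get(key1);
        GridCell node2 = addressTable.get(key2);
        // 2. Write the key-key-value to the hosting node(s).
        String kkv = String.format("kkv(%s, %s, %s)", key1, key2, value);
        virtualNodes.get(node1).add(kkv);
        if (!node2.equals(node1)) {
            virtualNodes.get(node2).add(kkv);
        }
        // 3. The on-line partitioning algorithm may later co-locate key1 and key2,
        //    because the new key-key-value links them.
    }

    public static void main(String[] args) {
        GridCell a = new GridCell(1, 1), b = new GridCell(8, 8);
        addressTable.put("alice", a);
        addressTable.put("bob", b);
        virtualNodes.put(a, new ArrayList<>(List.of("kv(alice, ...)")));
        virtualNodes.put(b, new ArrayList<>(List.of("kv(bob, ...)")));

        put("alice", "bob", "follows");   // "Alice follows Bob"
        virtualNodes.forEach((cell, data) -> System.out.println(cell + " -> " + data));
    }
}
```

In a real deployment the address table lookup and the writes would go over the network through the client API's cache and sync path described above; this sketch collapses all of that into local maps.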
Splitting a Node
• If one node becomes overloaded, it can initiate a split
• To maintain the grid structure, nodes in the same row and column must also split
• Once the split is complete, new physical machines can be turned on
• Virtual nodes can be transferred to these new machines
(Figure: the grid of virtual hosts before and after a split)

Outline
• Introduction
  • Data in Social Networks
  • Leveraging Locality
  • Key-Key-Value Stores
• System Model
  • Client API
  • Adding a Key-Key-Value
  • Load Management
• On-line Partitioning Algorithm
  • Simulation Parameters
  • Results
• Conclusion

On-line Partitioning Algorithm
• Runs periodically in parallel on each virtual node
  • Also runs after a split or merge
• For each key stored on a node:
  • Determine the number of connections (key-key-values) with keys on other nodes
    • Can also be the sum of edge weights
  • Find the node that has the most connections
  • If that node is different from the current node, and the number of connections to it exceeds the number of connections to the current node by more than some threshold:
    • Move the key to the other node
    • Update the address table
• Designed to work in a distributed, dynamic setting
• NOT a replacement for off-line algorithms in static settings
(A code sketch of this per-key step appears at the end of the deck)

Partitioning Example
• A key on node 1,1 has the following connection counts:
  Node    Sum(Edges)
  1,1     0
  2,1     2
  1,2     1
• Node 2,1 holds the most connections, so the key is moved from node 1,1 to node 2,1

Experimental Parameters
• No. Vertices (V): 100-400
• Branching Factor (b): 10%-100% of V
• Distribution of b: Zipf, alpha 1.5
• Partitioning Algorithms: On-line, Kernighan-Lin
• Workload: random, pre-generated history of edge inserts
• On-line algorithm run frequency: every V/10 inserts
• On-line threshold: improvement > 0
• Trials: 3 per graph size

Partitioning Quality Results
(Chart: % of edges within a partition vs. number of vertices in the graph, for the On-line algorithm and Kernighan-Lin)
• On-line partitions as well as Kernighan-Lin

Partitioning Performance Results
(Chart: vertices moved vs. number of vertices in the graph, for the On-line algorithm and Kernighan-Lin)
• On-line partitions 2x faster than Kernighan-Lin!

Conclusions
• Contributions:
  • A novel model for scalable graph data stores that extends the key-value model: the key-key-value store
  • A high-level system design
  • A novel on-line partitioning algorithm
  • Preliminary experimental results
• Our proposed algorithm shows promise in the distributed, dynamic setting

What's Ahead?
• Prototype system implementation (Java, PostgreSQL)
• Performance analysis against MongoDB and Cassandra
• Sensitivity analysis
• Cloud deployment

Thank You!
• Acknowledgments: Daniel Cole, Nick Farnan, Thao Pham, Sean Snyder
• ADMT Lab, CS Department, Pitt GPSA, Pitt A&S GSO, Pitt A&S PBC
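As referenced in the On-line Partitioning Algorithm slide, the following is a minimal Java sketch of the per-key placement decision. The method name placeKey, the connection-count map, and the integer threshold are assumptions made for illustration; the deck gives the algorithm only in prose.

```java
// Illustrative sketch of the on-line partitioning step: for one key, compare its
// connections to each node and decide whether it should move.
import java.util.*;

public class OnlinePartitioningSketch {

    /**
     * Decide where a key should live.
     *
     * @param currentNode      the node currently hosting the key
     * @param connectionCounts number of connections (key-key-values) the key has to
     *                         keys on each node, including the current one; could
     *                         also be a sum of edge weights
     * @param threshold        minimum improvement required before moving the key
     * @return the node the key should move to, or the current node if it stays
     */
    static String placeKey(String currentNode, Map<String, Integer> connectionCounts, int threshold) {
        int localConnections = connectionCounts.getOrDefault(currentNode, 0);

        // Find the node holding the most of this key's neighbors.
        String bestNode = currentNode;
        int bestConnections = localConnections;
        for (Map.Entry<String, Integer> e : connectionCounts.entrySet()) {
            if (e.getValue() > bestConnections) {
                bestNode = e.getKey();
                bestConnections = e.getValue();
            }
        }

        // Move only if the best remote node beats the current node by more than the threshold.
        if (!bestNode.equals(currentNode) && bestConnections - localConnections > threshold) {
            return bestNode;   // the caller would move the key and update the address table
        }
        return currentNode;
    }

    public static void main(String[] args) {
        // The Partitioning Example slide: a key on node 1,1 with connection sums
        // 0 (local), 2 to node 2,1 and 1 to node 1,2 moves to node 2,1.
        Map<String, Integer> counts = Map.of("1,1", 0, "2,1", 2, "1,2", 1);
        System.out.println(placeKey("1,1", counts, 0));   // prints 2,1
    }
}
```

With the experimental setting "threshold: improvement > 0", any strictly positive margin triggers a move, which is what the example above reproduces.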