Sharding Social Networks
Quang Duong, Sharad Goel, Jake Hofman, Sergei Vassilvitskii
Presented by Andreas Adolfsson and David Perez

Social Network Data Growing in Complexity

Extra Terms Before We Continue
Shard - a subset of nodes in a graph, stored on an individual machine.
Sharding - distributing data across a collection of shards.
Neighborhood Query - a query that retrieves data from a node and all of its neighbors.
Replication Ratio ρ - characterizes the level of node duplication in a system.
Query Plan - a set of indices that indicate where to access each node when executing a neighborhood query.
Hotspots - shards with much higher-than-average loads.

Network Sharding Problem
Given a graph G, the total number of shards T, and a per-shard capacity constraint M, find a valid query plan Q with minimal cost.

Stochastic Block Model
Random networks with community structure.
Nodes belong to K communities (blocks).
The probability of an edge (relationship) existing between two nodes depends on their block assignments.
For each node i, independently roll a K-sided die with bias π to determine the node's block assignment z_i in {1, …, K}.
For each ordered pair of nodes (i, j), flip a coin with bias Θ+ (resp. Θ-) for nodes in the same (resp. different) blocks to determine whether an edge exists from i to j.

Sharding Techniques
Random Sharding - randomly assign nodes to separate shards.
Geographic Sharding - take the geographic distribution of nodes into account.
Network Sharding - consider network structure that may exist in the data.

Random Sharding
Random sharding distributes nodes to shards at random.
This strategy is very common in industry because of its simplicity.
In the worst case, neighborhood queries degrade so that every accessed neighbor is stored on a different shard.

Geographic Sharding
Geographic sharding distributes users across shards based on available geographic location information.
Caching a small number of “local celebrity” users and creating hotspot shards reduces average load time.
Unfortunately, geographic information is not always available, and random sharding is often used as a fallback when geographic data is missing.

Network Sharding
Attempts to keep blocks of densely connected communities together.
Maps blocks to shards by taking advantage of between-block structure to minimize sharding cost.
Replicating frequently accessed nodes across multiple shards reduces sharding cost, hotspots, and average load across the system.

The Solution
A two-pronged approach for scalable network sharding, plus an extensible replication algorithm:
VBLabelProp - identify densely connected regions in the graph.
BlockShard - greedily assign blocks to suitable shards, minimizing sharding cost.
NodeRep - allocate copies of very densely connected regions to non-full shards.

BlockShard

NodeRep

Empirical Evaluation Strategy
Evaluate Random, Geographic, and Network-Aware sharding strategies.
Two datasets:
LiveJournal - 5.1 million nodes, 150 million directed edges, 1.6 million user profiles.
Twitter - 1.4 billion directed edges, 41 million user profiles.
Roughly 1 MB of in-memory space per user; 40 GB of RAM available for shard storage.
Evaluate methods with and without replication.

Results Overview
Average Load
Load Balance
Replication

Sharding Cost Across Methods

Sharding Cost with Replication

Sharding Cost with Replication (cont.)

Load Dispersion Across Methods

Per-Shard Load

Effects of Replication on Load Balance

Conclusion
Network sharding considers the structure of a dataset and provides significant improvements over random and geographic sharding strategies.
A two-pronged approach, with an optional data replication algorithm, to minimize sharding cost.

Questions?
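
Backup: Sampling from the Stochastic Block Model
The die-roll and coin-flip steps on the Stochastic Block Model slide can be sketched in a few lines of Python. This is only an illustration, not code from the paper: the function name sample_sbm, its parameter names, and the choice to skip self-loops (pairs with i = j) are assumptions made here, and blocks are numbered from 0 rather than 1.

```python
import random

def sample_sbm(n, pi, theta_in, theta_out, seed=0):
    """Sample block labels and directed edges from a simple stochastic block model.

    pi        -- block probabilities (the bias of the K-sided die)
    theta_in  -- edge probability for nodes in the same block (Θ+)
    theta_out -- edge probability for nodes in different blocks (Θ-)
    """
    rng = random.Random(seed)
    k = len(pi)
    # Roll a K-sided die with bias pi for each node to pick its block z_i.
    z = rng.choices(range(k), weights=pi, k=n)
    edges = []
    # Flip a biased coin for every ordered pair (i, j).
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # assumption: no self-loops
            p = theta_in if z[i] == z[j] else theta_out
            if rng.random() < p:
                edges.append((i, j))
    return z, edges

z, edges = sample_sbm(n=200, pi=[0.5, 0.3, 0.2], theta_in=0.2, theta_out=0.01)
within = sum(1 for i, j in edges if z[i] == z[j])
print(f"{len(edges)} edges, {within / len(edges):.0%} within blocks")
```

With Θ+ much larger than Θ-, most sampled edges fall inside a block, which is the community structure that network sharding tries to exploit.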

References
Quang Duong, Sharad Goel, Jake Hofman, and Sergei Vassilvitskii. Sharding Social Networks. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining. New York, New York, USA.
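
Backup: Neighborhood Query Cost on a Toy Graph
To make the sharding-cost comparison concrete, the sketch below counts how many shards a neighborhood query touches under two placements of the same small graph. The slides do not define the cost function, so this assumes the cost of a query is the number of distinct shards read, with no replication. The helper names (query_cost, average_cost) and the toy graph are hypothetical, and the "scattered" placement is a hand-picked stand-in for an unlucky random assignment rather than an actual random draw.

```python
def query_cost(node, neighbors, shard_of):
    """Distinct shards touched when reading a node and all of its neighbors."""
    return len({shard_of[node]} | {shard_of[v] for v in neighbors[node]})

def average_cost(neighbors, shard_of):
    """Mean neighborhood-query cost over all nodes."""
    return sum(query_cost(u, neighbors, shard_of) for u in neighbors) / len(neighbors)

# Toy graph: two 4-node cliques (nodes 0-3 and 4-7) joined by the single edge 3-4.
neighbors = {u: set() for u in range(8)}
for group in (range(0, 4), range(4, 8)):
    for u in group:
        for v in group:
            if u != v:
                neighbors[u].add(v)
neighbors[3].add(4)
neighbors[4].add(3)

# Two shards, four nodes each.
community_plan = {u: 0 if u < 4 else 1 for u in range(8)}  # one clique per shard
scattered_plan = {u: u % 2 for u in range(8)}              # each clique split across shards

print(average_cost(neighbors, community_plan))  # 1.25
print(average_cost(neighbors, scattered_plan))  # 2.0
```

Keeping each clique on one shard means only the two bridge nodes need a second shard, while the scattered placement forces every query to touch both shards.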