Sharding Social Networks
Quang Duong, Sharad Goel, Jake Hofman, Sergei Vassilvitskii
Presented by Andreas Adolfsson and David Perez
Social Network Data Growing in Complexity
Key Terms Before We Continue
Shard - a subset of the graph's nodes stored on an individual machine.
Sharding - distributing data across a collection of shards.
Neighborhood Query - a query that retrieves data for a node and all of its neighbors.
Replication Ratio ρ - characterizes the level of node duplication in the system.
Query Plan - a set of indices that indicate which shard to access for each node when executing a neighborhood query.
Hotspots - shards with much higher-than-average loads.
NetworkSharding Problem
Given a graph G, the total number of shards, T, and a per shard capacity constraint
M, find a valid query plan Q with minimal cost.
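The slide does not spell out the cost function, so the sketch below assumes one plausible reading: the cost of a sharding is the average number of distinct shards a neighborhood query has to touch, and a sharding is valid when no shard holds more than M nodes. The function names and the no-replication setting are illustrative assumptions, not the paper's definitions.

```python
from collections import defaultdict


def neighborhood_cost(graph, shard_of):
    """Average number of distinct shards touched per neighborhood query.

    graph:    dict mapping each node to an iterable of its neighbors
    shard_of: dict mapping each node to its shard index (no replication)
    ASSUMPTION: "cost" means distinct shards touched; the slides do not define it.
    """
    total = 0
    for node, neighbors in graph.items():
        touched = {shard_of[node]}
        touched.update(shard_of[n] for n in neighbors)
        total += len(touched)
    return total / len(graph)


def is_valid(shard_of, num_shards, capacity):
    """Check the constraints: T shards, at most M nodes on each."""
    loads = defaultdict(int)
    for shard in shard_of.values():
        if not 0 <= shard < num_shards:
            return False
        loads[shard] += 1
    return all(load <= capacity for load in loads.values())


# Tiny hypothetical graph: "a" is linked to "b" and "c"; "d" is isolated.
graph = {"a": ["b", "c"], "b": ["a"], "c": ["a"], "d": []}
shard_of = {"a": 0, "b": 0, "c": 1, "d": 1}
print(neighborhood_cost(graph, shard_of))            # average shards per query
print(is_valid(shard_of, num_shards=2, capacity=2))  # capacity check
```

Under this reading, moving "c" onto the same shard as "a" would lower the cost but would break a capacity of M = 2, which is the tension the problem statement captures.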
Stochastic Block Model
Random networks with community structure.
Nodes belong to K communities (blocks). The probability of an edge (relationship) existing
between two nodes depends on their block assignments.
For each node i, independently roll a K-sided die with bias π to determine the node's
block assignment z_i in {1, …, K}.
For each ordered pair of nodes (i, j), flip a coin with bias Θ+ (resp. Θ-) if the nodes
are in the same (resp. different) blocks to determine whether an edge exists from i to j.
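The two sampling steps above can be sketched directly in Python. This is a minimal illustration, assuming 0-based block labels (the slide uses 1..K), no self-loops, and hypothetical parameter names `theta_in` and `theta_out` for Θ+ and Θ-.

```python
import random


def sample_sbm(n, pi, theta_in, theta_out, seed=0):
    """Sample a directed graph from a stochastic block model.

    n:         number of nodes
    pi:        list of K block probabilities (the bias of the K-sided die)
    theta_in:  edge probability for nodes in the same block (Θ+)
    theta_out: edge probability for nodes in different blocks (Θ-)
    Returns (z, edges): z[i] is the 0-based block of node i,
    edges is a list of directed pairs (i, j).
    """
    rng = random.Random(seed)
    # Step 1: roll the K-sided die once per node.
    z = rng.choices(range(len(pi)), weights=pi, k=n)
    # Step 2: flip a biased coin for every ordered pair of distinct nodes.
    edges = []
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # ASSUMPTION: self-loops are skipped
            p = theta_in if z[i] == z[j] else theta_out
            if rng.random() < p:
                edges.append((i, j))
    return z, edges


z, edges = sample_sbm(n=200, pi=[0.5, 0.3, 0.2], theta_in=0.10, theta_out=0.005)
within = sum(1 for i, j in edges if z[i] == z[j])
print(f"{len(edges)} edges, {within} of them inside a block")
```

The double loop is quadratic in n, which is fine for a demonstration but not for graphs at the scale discussed later in the deck.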
Sharding Techniques
Random Sharding - randomly assign nodes to separate shards.
Geographic Sharding - take geographic distribution of nodes into account.
Network Sharding - consider network structure that may exist in the data.
Random Sharding
Random Sharding refers to distributing nodes to shards at random.
This is a very common strategy used in industry due to its simplicity.
In the worst case, neighborhood queries degrade so that every accessed neighbor is
stored on a different shard.
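A minimal sketch of random sharding and of why it scatters neighborhoods, assuming a shuffled round-robin assignment (one simple way to respect the per-shard capacity). The helper `expected_shards_touched` uses the standard occupancy formula for independent uniform placement, which only approximates the round-robin variant; both function names are illustrative.

```python
import random


def random_sharding(nodes, num_shards, capacity, seed=0):
    """Assign nodes to shards at random while keeping shard sizes balanced."""
    nodes = list(nodes)
    if len(nodes) > num_shards * capacity:
        raise ValueError("not enough total capacity for all nodes")
    rng = random.Random(seed)
    rng.shuffle(nodes)
    # Dealing the shuffled nodes round-robin keeps every shard within capacity.
    return {node: idx % num_shards for idx, node in enumerate(nodes)}


def expected_shards_touched(num_shards, degree):
    """Expected distinct shards hit by a node plus `degree` neighbors,
    ASSUMING each of them lands on a shard independently and uniformly."""
    return num_shards * (1 - (1 - 1 / num_shards) ** (degree + 1))


assignment = random_sharding(range(1000), num_shards=100, capacity=10)
print(max(assignment.values()) + 1, "shards in use")
print(round(expected_shards_touched(num_shards=100, degree=20), 2))
```

With many shards and a modest degree, the expected number of shards touched stays close to the number of nodes fetched, which is the near-worst-case behavior described above.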
Geographic sharding
Geographic sharding refers to distributing users across shards based on available
geographical location information.
Caching a small number of “local celebrity” users and creating hotspot shards reduces
average load time.
Unfortunately, geographic information is not always available, and random sharding is
often used as a fallback when geographic data is missing.
Network Sharding
Attempts to keep densely connected communities (blocks) together on the same shard.
Maps blocks to shards by taking advantage of between-block structure to minimize
sharding costs.
Replication of frequently accessed nodes across multiple shards reduces sharding cost,
hotspots, and average load across the system.
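To make the replication idea concrete, here is a hedged sketch, not the authors' NodeRep algorithm: it counts how often each shard needs a node that lives elsewhere, then greedily copies the most-demanded nodes into shards that still have room. The demand measure and the function name are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def replicate_hot_nodes(graph, shard_of, capacity):
    """Greedily copy frequently needed remote nodes into non-full shards.

    graph:    dict node -> iterable of neighbors
    shard_of: dict node -> primary shard index
    capacity: maximum number of nodes (primaries plus replicas) per shard
    Returns dict shard -> set of nodes stored on that shard.
    """
    contents = defaultdict(set)
    for node, shard in shard_of.items():
        contents[shard].add(node)

    # demand[(s, v)]: how many nodes on shard s have v as a remote neighbor.
    demand = Counter()
    for u, neighbors in graph.items():
        s = shard_of[u]
        for v in set(neighbors):
            if shard_of[v] != s:
                demand[(s, v)] += 1

    # Serve the highest-demand (shard, node) pairs first, while space remains.
    for (s, v), _count in demand.most_common():
        if len(contents[s]) < capacity:
            contents[s].add(v)
    return dict(contents)


# Hypothetical example: "h" is a hub followed by every node on shard 1.
graph = {"h": [], "x": [], "a": ["h", "x"], "b": ["h"], "c": ["h"]}
shard_of = {"h": 0, "x": 0, "a": 1, "b": 1, "c": 1}
print(replicate_hot_nodes(graph, shard_of, capacity=4))
```

In this example shard 1 has room for one extra node, so it receives a copy of the hub "h" rather than the less-demanded "x".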
The Solution
A two-pronged approach for scalable network sharding, plus an extensible
replication algorithm:
VBLabelProp - Identify densely connected regions (blocks) in the graph.
BlockShard - Greedily assign blocks to suitable shards, minimizing sharding cost.
NodeRep - Allocate copies of very densely connected regions to non-full shards.
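The greedy block-to-shard step can be illustrated with a simple heuristic. This is a sketch under stated assumptions, not the BlockShard algorithm itself: blocks are taken as already identified, every block fits on a single shard, and "suitable" is interpreted as the shard whose current blocks share the most edges with the block being placed. All names here are hypothetical.

```python
def greedy_block_placement(block_sizes, block_edges, num_shards, capacity):
    """Place blocks on shards, largest first, preferring well-connected shards.

    block_sizes: dict block -> number of nodes in the block
    block_edges: dict (block_a, block_b) -> number of edges between the blocks
    Returns dict block -> shard index.
    """
    free = [capacity] * num_shards
    members = [set() for _ in range(num_shards)]
    placement = {}

    def affinity(block, shard):
        # Edges (in either direction) between `block` and blocks on `shard`.
        return sum(
            block_edges.get((block, other), 0) + block_edges.get((other, block), 0)
            for other in members[shard]
        )

    for block in sorted(block_sizes, key=block_sizes.get, reverse=True):
        size = block_sizes[block]
        candidates = [s for s in range(num_shards) if free[s] >= size]
        if not candidates:
            raise ValueError(f"no shard has room for block {block!r}")
        # Prefer high affinity; break ties by the most remaining space.
        best = max(candidates, key=lambda s: (affinity(block, s), free[s]))
        placement[block] = best
        members[best].add(block)
        free[best] -= size
    return placement


sizes = {"A": 3, "B": 2, "C": 2, "D": 1}
edges = {("A", "B"): 10, ("C", "D"): 8, ("A", "C"): 1}
print(greedy_block_placement(sizes, edges, num_shards=2, capacity=5))
```

In the example, the strongly linked pairs A/B and C/D end up co-located, and only the single A-C link crosses shards.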
BlockShard
NodeRep
Empirical Evaluation Strategy
Evaluate Random, Geographic, and Network-Aware sharding strategies.
Two Datasets
LiveJournal - 5.1 million nodes, 150 million directed edges, 1.6 million user profiles.
Twitter - 1.4 billion directed edges, 41 million user profiles.
Roughly 1MB of in-memory space per user, with 40GB of RAM available for shard storage.
Evaluate methods with and without replication.
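As a back-of-the-envelope check on these figures, the sketch below derives a per-shard capacity and a minimum shard count. It assumes decimal units (1MB = 10^6 bytes, 40GB = 4 x 10^10 bytes), that the 40GB applies to each shard, that one stored user corresponds to one profile, and that the replication ratio acts as a simple multiplier on stored copies. The slides do not state these details, so treat the numbers as illustrative rather than as the configuration used in the evaluation.

```python
import math

MB = 10**6  # ASSUMPTION: decimal megabytes
GB = 10**9  # ASSUMPTION: decimal gigabytes


def users_per_shard(ram_bytes=40 * GB, bytes_per_user=1 * MB):
    """Per-shard capacity M implied by the memory figures."""
    return ram_bytes // bytes_per_user


def min_shards(num_users, capacity, replication_ratio=1.0):
    """Fewest shards that can hold every user, scaled by an assumed
    replication ratio (1.0 means no replicas)."""
    return math.ceil(num_users * replication_ratio / capacity)


M = users_per_shard()
print("capacity per shard:", M)
print("LiveJournal shards:", min_shards(1_600_000, M))
print("Twitter shards:", min_shards(41_000_000, M))
```

Under these assumptions each shard holds 40,000 users, so 1.6 million profiles need at least 40 shards and 41 million need at least 1,025 before any replication.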
Results Overview
Average Load
Load Balance
Replication
Sharding Cost across Methods
Sharding Cost with Replication
Sharding Cost with Replication (cont.)
Load Dispersion across Methods
Per-Shard Load
Effects of Replication on Load Balance
Conclusion
Network Sharding takes the structure of the dataset into account and provides significant
improvements over Random and Geographic Sharding strategies.
The two-pronged approach, with an optional data replication algorithm, minimizes
sharding cost.
Questions?
References
Quang Duong, Sharad Goel, Jake Hofman, and Sergei Vassilvitskii. Sharding Social Networks. Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, New York, New York, USA.