Chonbuk National Univ
Database Laboratory v.2
TurboGraph: A Fast Parallel Graph Engine Handling
Billion-scale Graphs in a Single PC
Wook-Shin Han, Sangyeon Lee, Kyungyeol Park
Jeong-Hoon Lee, Min-Soo Kim, Jinha Kim, Hwanjo Yu
POSTECH, DGIST
SIGKDD 2013, ACM
ACM SIGKDD is a conference on knowledge discovery and data mining
Database laboratory
Regular Seminar
2013-11-11
TaeHoon Kim
1
• Introduction
• Related Work
• Efficient Graph Storage
• Disk-Based Parallel Graph Computation
• Processing Graph Queries*
• Experiments
• Conclusion
2
• Graphs are used to model many real objects
– Web graphs, chemical compounds, biological structures
• Real graphs are very large
– Facebook reached one billion users on Oct. 4, 2012
– The Yahoo Web graph consists of 1.4 billion vertices and 6.6
billion edges
• However, if there are a billion vertices in the graph database,
the size of the mapping table is too large to fit into memory
• For fast graph retrieval on a single commodity PC, graphs
must be stored in fast external memory, such as FlashSSDs
3
• Approaches proposed to handle big graphs efficiently
– GBase is a recent graph engine based on MapReduce
• If the graph is represented as a compressed matrix, matrix-vector
multiplication solves many representative graph queries
• However, distributed systems based on MapReduce are generally
slow unless there is a sufficient number of machines in the cluster
– Distributed systems based on the vertex-centric programming model
(Pregel, GraphLab, PowerGraph) have been proposed
• However, implementing efficient graph operations is very difficult
• Users need to be skilled at managing and tuning a distributed
system in a cluster, which is a nontrivial job for ordinary users
– Recently, a disk-based graph processing engine on a single PC
called GraphChi has been proposed
• Exploits the novel concept of parallel sliding windows (PSW)
4
• Processing with PSW in GraphChi
– 1) Loading a subgraph, 2) updating the vertices and edges, 3) writing
the updated parts of the subgraph to disk
• We observe that PSW incurs four serious problems
– 1) In order to start updating vertices/edges in a shard file, their
in-edges must be fully loaded into memory
– 2) All edges in the shard file whose source and target vertices are in
the same execution interval are processed in sequential order, which
hinders full parallelism
– 3) At each iteration, a significant number of updated edges can be
flushed to disk
• If the size of the graph is very large and/or there are many
iterations, GraphChi incurs a significant amount of disk I/O
– 4) Even if a query needs to access only a small portion of the data, it
reads the whole graph at the first iteration → poor hardware utilization
5
• In this paper, TurboGraph provides
– The first truly parallel graph engine on a single PC
– Full parallelism, including FlashSSD I/O parallelism and multi-core
parallelism
– Full overlap of CPU processing and I/O processing
• We present TurboGraph to process billion-scale graphs very
efficiently by using modern hardware on a single PC
• We present a novel parallel execution model called pin-and-slide
• It implements the column view of matrix-vector multiplication
6
• Distributed synchronous approaches
– PEGASUS and GBase
• Based on MapReduce; they support matrix-vector multiplication using
compressed matrices
– All synchronous approaches above could suffer from costly
performance penalties
• Because the runtime of each step is determined by the slowest
machine in the cluster → hardware variability and network imbalance
• Distributed asynchronous approaches
– GraphLab is also based on the vertex-centric programming model
• The vertex kernel is executed asynchronously in parallel on each vertex
• However, some algorithms based on asynchronous computation
require serializability for correctness
7
• Distributed asynchronous approaches
– PowerGraph
• Basically similar to GraphLab
• It partitions and stores the graph by exploiting a property of
real-world graphs: the highly skewed power-law degree distribution
• However, efficient graph partitioning in a distributed environment for
all types of graph operations is an inherently hard problem
• Single-machine approaches
– GraphChi
• A disk-based single-machine system following the asynchronous
vertex-centric programming model
• Uses PSW
• Although GraphChi is very efficient and thus able to solve large
problems while using only a single machine, the four serious problems
above remain
8
• Disk-based Graph Representation
– For the vertices v0 ~ v5, their adjacency lists are stored as
small records in pages p0~p2, while the adjacency list of v6 is
stored as a large record which spans the two pages p3 and p4
– Since the size of this RID table is very small, we can safely make
it resident in memory
– The slotted page is known to be very good for supporting
efficient updates
– A record means an adjacency list
[Figure: slotted pages in the buffer pool; a large record's entry keeps
its start offset, its LRPL offset, and the number of pages it spans]
9
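To make this storage layout concrete, here is a minimal C++ sketch of the RID table and slotted pages, assuming 4 KB-style pages and one RID per vertex; all type and field names (RID, Slot, SlottedPage, ridTable) are illustrative, not TurboGraph's actual code.

#include <cstdint>
#include <vector>

using VertexID = uint32_t;

struct RID {                      // record id: locates one adjacency list
    uint32_t pageID;              // slotted page holding the record
    uint16_t slotNo;              // slot within that page
};

struct Slot {                     // slot directory entry
    uint16_t offset;              // byte offset of the record in the page
    uint16_t length;              // record length in bytes
};

struct SlottedPage {              // records grow from the front,
    std::vector<uint8_t> data;    // the slot directory from the back
    std::vector<Slot> slots;
};

// One RID per vertex; the slide states this table is small enough to
// stay resident in memory. Large adjacency lists (e.g., v6 spanning
// p3 and p4) are chained across pages via the large record page list.
std::vector<RID> ridTable;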
• In-memory Data Structures and Core Operations
– Buffer pool: an array of frames, resident in memory
– PINPAGE: if the page exists in the buffer, pinCount++
• Otherwise, it obtains an empty frame by LRU replacement and
loads the page from disk into the frame
• Returns the memory address of the frame where the page was loaded
– UNPINPAGE: pinCount--
– PINCOMPUTEUNPIN(PageID pid, list<RID> RIDList, User Object u0)
• Provides asynchronous I/Os to the FlashSSD
• An execution thread issues PinComputeUnpin, and a callback thread
runs the user-defined function for RID processing, e.g.
u0.Compute(v1, Iterator(v1, adj))
[Figure: execution thread, FlashSSD, buffer pool, and callback thread
cooperating on page p0]
10
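Below is a hedged, single-threaded C++ outline of these three operations; the real engine issues asynchronous FlashSSD reads and runs Compute on a callback thread, which this sketch only notes in comments, and all names are illustrative.

#include <cstdint>
#include <list>
#include <unordered_map>
#include <vector>

struct Frame {
    std::vector<uint8_t> page;    // page contents
    int pinCount = 0;             // pinned frames cannot be evicted
};

class BufferPool {
    std::unordered_map<uint32_t, Frame> frames;      // pageID -> frame
    Frame loadFromDisk(uint32_t) { return Frame{}; } // stub: async in reality
public:
    // PinPage: if buffered, just pinCount++; otherwise take a frame via
    // LRU replacement (elided here), load the page from disk, and
    // return the in-memory address of the frame.
    Frame* PinPage(uint32_t pid) {
        auto it = frames.find(pid);
        if (it == frames.end())
            it = frames.emplace(pid, loadFromDisk(pid)).first;
        it->second.pinCount++;
        return &it->second;
    }
    void UnpinPage(uint32_t pid) { frames.at(pid).pinCount--; }

    // PinComputeUnpin: pin the page, let the user object process the
    // requested records, then unpin. In the engine the read is an
    // asynchronous FlashSSD I/O and Compute runs on a callback thread.
    template <typename UserObject>
    void PinComputeUnpin(uint32_t pid, const std::list<uint16_t>& ridList,
                         UserObject& u0) {
        Frame* f = PinPage(pid);
        for (uint16_t slot : ridList) u0.Compute(f->page, slot);
        UnpinPage(pid);
    }
};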
• G = (V, E)
• Adjacency matrix M(G), where vi = the i-th vertex in G
• Let M(G)i be the i-th column vector of M(G)
• When we have a column vector X (|X| = |V|), we can
define the matrix-vector multiplication
– between M(G) and X (Y = M(G) × X)
– As Y = Σi M(G)i × Xi in the column view, summing over all vertices i
[Figure: example graph on vertices 1~7 with a sample vertex value 0.214,
illustrating the column view]
11
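The column view Y = Σi M(G)i × Xi can be read directly off the adjacency lists: assuming column i of M(G) marks vi's out-neighbors, scanning vi's list scatters Xi into Y. A small self-contained C++ sketch with illustrative names, in-memory only:

#include <vector>

// adj[i] holds vi's out-neighbors, i.e., the nonzero rows of column i.
using Graph = std::vector<std::vector<int>>;

std::vector<double> columnViewMultiply(const Graph& adj,
                                       const std::vector<double>& x) {
    std::vector<double> y(x.size(), 0.0);
    for (std::size_t i = 0; i < adj.size(); ++i)  // one column per vertex
        for (int dst : adj[i])                    // nonzeros of M(G)i
            y[dst] += x[i];                       // Y += M(G)i × Xi
    return y;
}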
• X: input bit vector
• Y: output bit vector
– We use bit vectors for the graph
• u0: user object
– Has the user-defined Compute as one of its methods
• Execution thread (parallel asynchronous I/O)
– PinComputeUnpin
• Callback thread (concurrently processes the vertices)
– u0.Compute(v1, Iterator)
12
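As an example of such a user object, a minimal one-level BFS Compute over the input/output bit vectors might look like the sketch below; the Compute signature is an assumption for illustration, not the engine's exact API.

#include <vector>

struct BFSUserObject {
    std::vector<bool> X;   // input bit vector: current frontier
    std::vector<bool> Y;   // output bit vector: next frontier

    // Invoked by the callback thread for a pinned vertex v together with
    // an iterator over its adjacency list (a plain vector here).
    void Compute(int v, const std::vector<int>& adj) {
        if (!X[v]) return;              // v is not on the frontier
        for (int u : adj) Y[u] = true;  // discover v's neighbors
    }
};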
• 1) If the page is fully loaded, pin it
• 2) If the page is partially loaded, select the pages to pin using
0/1 knapsack, then invoke PinComputeUnpin
• 3) Process adjacency lists in order from large to small
• 4) Compute using parallel processing
• 5) For thread safety, we use a latch-free approach
13
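Step 2's page selection can be phrased as a classic 0/1 knapsack: choose the subset of partially loaded pages to pin that maximizes benefit under the remaining frame budget. A generic dynamic-programming sketch, where the per-page benefit measure is an assumption for illustration:

#include <algorithm>
#include <vector>

struct Candidate { int frames; int benefit; };  // weight and value of a page

// Returns the maximum total benefit achievable within `budget` frames;
// the classic O(n * budget) 0/1 knapsack dynamic program.
int choosePages(const std::vector<Candidate>& pages, int budget) {
    std::vector<int> best(budget + 1, 0);
    for (const auto& p : pages)
        for (int b = budget; b >= p.frames; --b)  // reverse scan: 0/1, not unbounded
            best[b] = std::max(best[b], best[b - p.frames] + p.benefit);
    return best[budget];
}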
• 5) Thread-safe latch-free approach
– 5.1) How does the latch-free approach work?
[Figure: two threads Th1 and Th2 updating the same data without latches]
14
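A common way to realize such a latch-free update is a compare-and-swap retry loop, sketched below with std::atomic as a stand-in for whatever primitive the engine actually uses (the slide does not spell it out):

#include <atomic>

// Two threads (Th1, Th2) may call this concurrently on the same cell.
void latchFreeAdd(std::atomic<double>& cell, double delta) {
    double old = cell.load();
    // CAS retry loop: on failure, `old` is refreshed with the current
    // value and we try again, so no latch is ever taken.
    while (!cell.compare_exchange_weak(old, old + delta)) {
    }
}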
• Handling General Vectors
– Example 1. We explain how our pin-and-slide model handles general
vectors by using a PageRank query (see the sketch after this list)
• Step 1
– After first reading pages {p0, p1, p2} from disk into the buffer, we read
the first chunk of each attribute vector into memory
– Then we join block1 with chunk1
• Step 2
– We read chunk2 of each attribute vector into memory, join block1
with chunk2, and update chunk1 of the output vector
• Step 3
– Since we have completed the processing for block1, we read new pages
{p3, p4} from disk, then join block2 with chunk1 and write the results to
chunk2 of the output vector
• Step 4
– We do the final join and update chunk2 of the output vector
15
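The four steps above all reduce to one block-chunk join routine, sketched below in C++. It assumes records store in-neighbors (as the "V0 has V1 and V6" example on the next slides suggests) and that the teleport term 0.15/|V| is added once per vertex after all joins; all names are illustrative.

#include <vector>

// block: (vertex, in-neighbor list) pairs whose pages are pinned.
using Block = std::vector<std::pair<int, std::vector<int>>>;

void joinBlockChunk(const Block& block,
                    const std::vector<double>& prevPR,
                    const std::vector<int>& outDeg,
                    std::vector<double>& output,
                    int chunkBegin, int chunkEnd) {
    for (const auto& [v, inAdj] : block)
        for (int u : inAdj)
            // Only in-neighbors whose chunk is currently in memory
            // contribute; the other chunks are picked up by later joins.
            if (u >= chunkBegin && u < chunkEnd)
                output[v] += 0.85 * prevPR[u] / outDeg[u];
    // The teleport term 0.15/|V| is added once per vertex after all joins.
}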
[Figure: Step 1. Block1 (pages p0~p2) in the buffer pool is joined with
chunk1 of the attribute vectors (prevPR = 0.143 for every vertex;
outDegree = 2 for most vertices and 7 for V6). The output vector entries
for V0~V5 become 0.082, while V6 remains 0. Example: V0 has in-neighbors
V1 and V6.]
Per-vertex total: 0.85 × { (0.143 / 2) + (0 / 2) } + (0.15 / 7) = 0.082
16
[Figure: Step 2. Block1 is joined with chunk2 of the attribute vectors,
and chunk1 of the output vector is updated: the entries for V0~V5 rise
from 0.082 to 0.099, while V6 remains 0. Example: V0 has in-neighbors
V1 and V6.]
Per-vertex total: 0.85 × { (0.143 / 2) + (0.143 / 7) } + (0.15 / 7) = 0.099
17
[Figure: Step 3. Processing of block1 is complete, so the buffer SLIDEs:
new pages p3 and p4 (block2, holding V6's large adjacency list) are read
from disk and joined with chunk1. The output entry for V6 becomes 0.386;
V0~V5 stay at 0.099. Example: V6 has a recursive edge to V6 itself.]
Total for V6: 0.85 × { (0.143 / 2) + (0.143 / 2) + … + (0.143 / 2)
+ (0 / 7) } + (0.15 / 7) = 0.386
(V6's self-contribution is 0 here, since chunk2 has not been joined yet)
18
[Figure: Step 4. The final join: block2 is joined with chunk2, and chunk2
of the output vector is updated. The output entry for V6 becomes 0.403;
V0~V5 stay at 0.099. Example: V6 has a recursive edge to V6 itself.]
Total for V6: 0.85 × { (0.143 / 2) + (0.143 / 2) + … + (0.143 / 2)
+ (0.143 / 7) } + (0.15 / 7) = 0.403
19
• Targeted Queries (BFS(vq))
– BFS operators
• 1-step (out-)neighbors
• K-step neighbors
• Induced subgraph
• 1-step egonet
• K-step egonet
• K-core, Cross-Edges
• Global Queries
– We have already explained briefly how our model processes the
PageRank query in Example 1
20
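For instance, the K-step neighbors operator reduces to K rounds of frontier expansion from vq; a hedged in-memory C++ outline follows (the engine additionally pins only the pages the frontier actually touches, which is omitted here):

#include <vector>

std::vector<bool> kStepNeighbors(const std::vector<std::vector<int>>& adj,
                                 int vq, int k) {
    std::vector<bool> visited(adj.size(), false);
    std::vector<bool> frontier(adj.size(), false);
    visited[vq] = frontier[vq] = true;
    for (int step = 0; step < k; ++step) {            // k rounds of expansion
        std::vector<bool> next(adj.size(), false);
        for (std::size_t v = 0; v < adj.size(); ++v)
            if (frontier[v])
                for (int u : adj[v])
                    if (!visited[u]) visited[u] = next[u] = true;
        frontier = std::move(next);
    }
    return visited;   // vq plus every vertex within k hops
}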
• We use three real datasets for the experiments: LiveJournal,
Twitter, and YahooWeb
• The Twitter dataset contains 42M vertices and 1.5B edges
• The YahooWeb dataset contains a web graph from Yahoo!
with 1.4B vertices and 6.6B edges
• Experimental environment
– Intel i7 6-core 3.2GHz CPU and 12 GB DRAM
– Two 512GB SSDs of the Samsung 840 Series
• TurboGraph is compiled on Windows, while GraphChi is
compiled on Linux
– Note that disk I/O performance in Ubuntu is better than
that in Windows 7
21
• Breadth-First Search
– We additionally perform experiments with a state-of-the-art
in-memory graph BFS engine, Green-Marl
– Varying the buffer size
• Green-Marl failed due to lack of memory
– Varying the number of execution threads
• It is very hard for GraphChi to pre-load the graph
• GraphChi processes all edges serially
22
• Targeted Queries
• Global Queries
23
• In this paper, we presented a fast, parallel graph engine called
TurboGraph for efficiently processing billion-scale graphs on a
single PC
• We proposed the notion of the pin-and-slide model, which
implements the column view of matrix-vector multiplication
– It utilizes two types of threads, execution threads and callback
threads, along with a buffer manager
• We showed that TurboGraph outperforms the state-of-the-art
algorithms by up to four orders of magnitude
24
Discussion
 The drawbacks of PSW in GraphChi (related work) are resolved using the
proposed technique [e.g., pin-and-slide], and the resulting performance
is shown to be superior to prior work
Strengths
 Because execution threads drive the FlashSSD's asynchronous I/O and
callback threads perform the compute, full CPU/IO processing overlap is
achieved, enabling faster processing than prior work
 Proposes the novel pin-and-slide technique
Weaknesses
 Usable only for graph-based data structures
 Requires threads, an OS resource
Thank you for listening to my presentation :)
25
• Note that this PPT is a summary of my own thinking; it highlights the
content that is important for understanding the paper
26