Distributed Parallel Inference on Large Factor Graphs
Joseph E. Gonzalez, Yucheng Low, Carlos Guestrin, David O'Hallaron
Carnegie Mellon
Exponential Parallelism
[Figure: processor speed (GHz, log scale, 0.01-10) vs. release date, 1988-2010. Sequential performance has flattened ("Constant Sequential Performance") while parallel performance keeps growing ("Exponentially Increasing Parallel Performance").]
Distributed Parallel Setting
[Figure: nodes, each with CPUs, caches, a bus, and memory, connected by a fast reliable network.]
Opportunities:
Access to larger systems: 8 CPUs → 1000 CPUs
Linear increase in RAM, cache capacity, and memory bandwidth
Challenges:
Distributed state, communication, and load balancing
Graphical Models and Parallelism
Graphical models provide a common language for general-purpose parallel algorithms in machine learning.
A parallel inference algorithm would improve:
Protein structure prediction
Movie recommendation
Computer vision
Inference is a key step in learning graphical models.
Belief Propagation (BP)
Message passing algorithm
Naturally parallel algorithm
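For reference (standard factor graph BP, not spelled out on the slide), each update combines the other inbound messages at a vertex, so every message can be recomputed independently:

\[
m_{f \to v}(x_v) \;\propto\; \sum_{\mathbf{x}_f \setminus x_v} \psi_f(\mathbf{x}_f) \prod_{u \in N(f) \setminus \{v\}} m_{u \to f}(x_u),
\qquad
m_{v \to f}(x_v) \;\propto\; \prod_{g \in N(v) \setminus \{f\}} m_{g \to v}(x_v).
\]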
Parallel Synchronous BP
Given the old messages, all new messages can be computed in parallel (see the sketch below):
[Figure: CPUs 1 through n each read the old messages and compute the new messages.]
Map-Reduce ready!
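A minimal sketch of one synchronous iteration (the Message, Edge, and compute_message types are hypothetical stand-ins, not the talk's implementation); every new message reads only old messages, so the loop parallelizes trivially:

#include <vector>

// Hypothetical stand-ins for the quantities on the slide.
struct Message { std::vector<double> value; };
struct Edge { int from, to; };                      // one directed message per edge

// Assumed helper: computes a new message from the old messages (declaration only).
Message compute_message(const std::vector<Message>& old_msgs, const Edge& e);

// One synchronous BP iteration: a pure map over edges, hence "Map-Reduce ready".
void synchronous_iteration(const std::vector<Edge>& edges,
                           const std::vector<Message>& old_msgs,
                           std::vector<Message>& new_msgs) {
    #pragma omp parallel for                        // each message is independent
    for (int i = 0; i < static_cast<int>(edges.size()); ++i)
        new_msgs[i] = compute_message(old_msgs, edges[i]);
}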
Hidden Sequential Structure
[Figure: a chain graphical model with evidence at both ends; information must propagate sequentially along the chain.]
Running time = (time for a single parallel iteration) × (number of iterations)
Optimal Sequential Algorithm
[Figure: running time vs. number of processors p on a chain of length n.]

Algorithm                          Running time   Processors
Naturally parallel (synchronous)   2n²/p          p ≤ 2n
Forward-Backward                   2n             p = 1
Optimal parallel                   n              p = 2

Even with many processors, the naturally parallel schedule leaves a gap above the sequential Forward-Backward algorithm.
Parallelism by Approximation
[Figure: the true messages along a chain of 10 vertices compared with a τε-approximation in which messages propagate only a bounded distance.]
τε represents the minimal sequential structure.
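One way to make this precise (my paraphrase of the [AIStats 09] definition; the notation is not on the slide): writing b_v^(d) for the belief at vertex v computed using only messages propagated at most d steps, and b_v for the true belief,

\[
\tau_\epsilon \;=\; \max_{v} \; \min \left\{ d \;:\; \left\lVert b_v^{(d)} - b_v \right\rVert_1 \le \epsilon \right\},
\]

so τε bounds how far information must travel sequentially to reach an ε-accurate answer.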
Optimal Parallel Scheduling
[Figure: the chain is divided among processors 1-3; each runs the sequential Forward-Backward schedule on its own block (the parallel component), then information crosses the block boundaries (the sequential component). The synchronous schedule leaves a gap above this optimal schedule.]
In [AIStats 09] we demonstrated that this algorithm is optimal.
The Splash Operation
Generalizes the optimal chain algorithm to arbitrary cyclic graphs (a code sketch follows):
1) Grow a BFS spanning tree with fixed size
2) Forward pass computing all messages at each vertex
3) Backward pass computing all messages at each vertex
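A minimal sketch of the three steps (the Graph type and its send_all_messages hook are hypothetical stand-ins, not the talk's implementation):

#include <queue>
#include <unordered_set>
#include <vector>

struct Graph {
    std::vector<std::vector<int>> adj;   // vertex adjacency lists
    void send_all_messages(int v);       // recompute every outbound message at v
};

void splash(Graph& g, int root, std::size_t max_size) {
    // 1) Grow a BFS spanning tree of bounded size rooted at `root`.
    std::vector<int> order;              // BFS order: root first, leaves last
    std::unordered_set<int> visited{root};
    std::queue<int> frontier;
    frontier.push(root);
    while (!frontier.empty() && order.size() < max_size) {
        int u = frontier.front(); frontier.pop();
        order.push_back(u);
        for (int w : g.adj[u])
            if (visited.insert(w).second) frontier.push(w);
    }
    // 2) Forward pass: update all messages at each vertex, leaves toward root.
    for (auto it = order.rbegin(); it != order.rend(); ++it)
        g.send_all_messages(*it);
    // 3) Backward pass: update all messages again, root toward leaves.
    for (int v : order)
        g.send_all_messages(v);
}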
Running Parallel Splashes
[Figure: CPUs 1-3, each with its own local state, run Splashes on their own portion of the graph.]
Partition the graph
Schedule Splashes locally
Transmit the messages along the boundary of the partition
Key challenges:
1) How do we schedule Splashes?
2) How do we partition the graph?
Where do we Splash?
Assign priorities and use a scheduling queue to select roots:
[Figure: CPU 1 pops candidate roots from its local scheduling queue and runs a Splash at each.]
How do we assign priorities?
Message Scheduling
Residual Belief Propagation [Elidan et al., UAI 06]: assign priorities based on the change in inbound messages.
Small change in an inbound message → recomputing the outbound message is an expensive no-op.
Large change in an inbound message → recomputing the outbound message is an informative update.
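For reference (notation mine, not from the slide), the priority of a pending message update is its residual, e.g.

\[
r_{u \to v} \;=\; \left\lVert m_{u \to v}^{\text{new}} - m_{u \to v}^{\text{old}} \right\rVert,
\]

so the largest pending change is applied first.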
Problem with Message Scheduling
Small changes in messages do not imply small changes in belief:
[Figure: a vertex whose inbound messages each change only slightly, yet whose belief changes substantially.]
Problem with Message Scheduling
Large changes in a single message do not imply large changes in belief:
[Figure: one inbound message changes substantially while the vertex's belief barely moves.]
Belief Residual Scheduling
Assign priorities based on the cumulative change in belief:
r_v = ‖Δb₁‖₁ + ‖Δb₂‖₁ + ‖Δb₃‖₁ + …, where each term is the L1 change in the belief at v caused by one inbound message update since v was last updated.
A vertex whose belief has changed substantially since last being updated will likely produce informative new messages.
Message vs. Belief Scheduling
Belief scheduling improves accuracy more quickly.
Belief scheduling improves convergence.
[Figure, left: L1 error in beliefs vs. time (seconds); belief scheduling drives the error down faster than message scheduling (lower is better). Right: percentage of runs converged in 4 hours; belief residuals converge on more runs than message residuals (higher is better).]
Splash Pruning
Belief residuals can be used to dynamically reshape and resize Splashes: vertices with low belief residual are pruned from the Splash.
[Figure: running time (seconds) vs. splash size (messages, 0-60), with and without pruning; without pruning, running time is sensitive to the splash size, while with pruning it stays near the best value (lower is better).]
Using Splash pruning our algorithm is able to dynamically select the optimal splash size.
Example
[Figure: a synthetic noisy image, its factor graph, and a map of vertex updates showing regions with many updates and regions with few.]
The algorithm identifies and focuses on hidden sequential structure.
Distributed Belief Residual Splash Algorithm
[Figure: CPUs 1-3, each with its own scheduling queue and local state, run Splashes and communicate over a fast reliable network.]
Partition the factor graph over processors
Schedule Splashes locally using belief residuals
Transmit messages on the boundary
Theorem: Given a uniform partitioning of the chain graphical model, DBRSplash will run in time [bound shown as an equation on the original slide], retaining optimality.
Partitioning Objective
The partitioning of the factor graph determines storage, computation, and communication.
Goal: balance computation and minimize communication.
[Figure: a cut of the graph across CPU 1 and CPU 2; the cut edges determine the communication cost, and the work on each side must be balanced.]
The Partitioning Problem
Objective: minimize communication while ensuring work balance (a formalization follows).
Both the work and the communication terms depend on the update counts, which are not known in advance!
The problem is NP-hard → use the METIS fast partitioning heuristic.
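A plausible formalization of this objective (notation mine: B₁,…,Bp are the blocks, U_v the update count of vertex v, w_v its per-update work, and c_{uv} the cost of sending messages across a cut edge):

\[
\min_{B_1,\dots,B_p} \;\; \sum_{(u,v) \,:\, u \in B_i,\; v \in B_j,\; i \neq j} c_{uv}
\quad\text{s.t.}\quad
\sum_{v \in B_i} U_v\, w_v \;\le\; \frac{1+\epsilon}{p} \sum_{v} U_v\, w_v \;\;\; \forall i .
\]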
Unknown Update Counts
Update counts are determined by belief scheduling and depend on the graph structure, the factors, …
There is little correlation between past and future update counts.
[Figure: a noisy image, an uninformed cut, and the observed update counts.]
Uninformed Cuts
[Figure: an uninformed cut and the optimal cut overlaid on the update counts; the uninformed cut gives one processor too much work and the other too little.]
Uninformed cuts have greater work imbalance but lower communication cost:
[Figure: work imbalance and communication cost on the Denoise and UW-Systems problems for uninformed vs. optimal cuts (lower is better in both).]
Over-Partitioning
Over-cut the graph into k*p partitions and randomly assign them to CPUs (a sketch of the assignment follows):
Increases balance
Increases communication cost (more boundary)
[Figure: left, a two-CPU partition without over-partitioning; right, over-partitioning with k = 6, where the pieces are assigned to CPU 1 and CPU 2 at random.]
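A minimal sketch of the over-partitioning step (metis_partition is a stand-in for a real METIS call, and Graph is the hypothetical type from the Splash sketch):

#include <algorithm>
#include <random>
#include <vector>

struct Graph;                                                     // as in the Splash sketch
std::vector<int> metis_partition(const Graph& g, int num_parts);  // assumed METIS wrapper

// Cut the graph into k*p pieces, then hand pieces to the p CPUs at random.
std::vector<int> over_partition(const Graph& g, int p, int k, std::mt19937& rng) {
    std::vector<int> piece_of = metis_partition(g, k * p);    // piece id per vertex
    std::vector<int> cpu_of_piece(k * p);
    for (int i = 0; i < k * p; ++i) cpu_of_piece[i] = i % p;  // exactly k pieces per CPU
    std::shuffle(cpu_of_piece.begin(), cpu_of_piece.end(), rng);
    std::vector<int> cpu_of(piece_of.size());
    for (std::size_t v = 0; v < piece_of.size(); ++v)
        cpu_of[v] = cpu_of_piece[piece_of[v]];
    return cpu_of;                                            // CPU id per vertex
}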
Over-Partitioning Results
Provides a simple method to trade between work balance and communication cost.
[Figure: work imbalance and communication cost vs. the partition factor k (0-15); as k grows, imbalance falls while communication cost rises (lower is better in both).]
CPU Utilization
Over-partitioning improves CPU utilization:
[Figure: number of active CPUs over time (seconds) on the UW-Systems MLN and Denoise problems; 10x over-partitioning keeps far more CPUs active than no over-partitioning.]
DBRSplash Algorithm
[Figure: CPUs 1-3, each with its own scheduling queue and local state, run Splashes and communicate over a fast reliable network.]
Over-partition the factor graph
Randomly assign pieces to processors
Schedule Splashes locally using belief residuals
Transmit messages on the boundary (a sketch of the per-CPU loop follows)
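A minimal sketch of the per-CPU loop (Network and its boundary exchange, the tolerance test, and the queue layout are all hypothetical stand-ins; splash and Graph are from the sketch above):

#include <queue>
#include <utility>

struct Network {
    // Assumed MPI-style helper: ships boundary messages to neighboring CPUs
    // and folds received updates back into the local scheduling queue.
    void exchange_boundary(Graph& g,
                           std::priority_queue<std::pair<double,int>>& sched);
};

void dbrsplash_worker(Graph& g, Network& net,
                      std::priority_queue<std::pair<double,int>>& sched,
                      double tolerance, std::size_t splash_size) {
    while (!sched.empty()) {
        auto [residual, root] = sched.top();   // highest belief residual first
        sched.pop();
        if (residual < tolerance) break;       // locally converged
        splash(g, root, splash_size);          // bounded forward/backward pass
        net.exchange_boundary(g, sched);       // transmit messages on the boundary
    }
}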
Experiments
Implemented in C++ using MPICH2 as the message passing API
Ran on the Intel OpenCirrus cluster: 120 processors
15 nodes with 2x quad-core Intel Xeon processors
Gigabit Ethernet switch
Tested on Markov Logic Networks obtained from Alchemy [Domingos et al. SSPR 08]
Present results on the largest (UW-Systems) and smallest (UW-Languages) MLNs
Parallel Performance (Large Graph)
UW-Systems: 8K variables, 406K factors
Single-processor running time: 1 hour
Linear to super-linear speedup up to 120 CPUs, thanks to cache efficiency
[Figure: speedup vs. number of CPUs (0-120) with no over-partitioning and with 5x over-partitioning, plotted against the linear-speedup line (higher is better).]
Parallel Performance (Small Graph)
UW-Languages: 1K variables, 27K factors
Single-processor running time: 1.5 minutes
Linear to super-linear speedup up to 30 CPUs; beyond that, network costs quickly dominate the short running time
[Figure: speedup vs. number of CPUs (0-120) with no over-partitioning and with 5x over-partitioning, plotted against the linear-speedup line (higher is better).]
Summary
The Splash operation generalizes the optimal parallel schedule on chain graphs
Belief-based scheduling
Addresses message scheduling issues
Improves accuracy and convergence
DBRSplash, an efficient distributed parallel inference algorithm
Over-partitioning to improve work balance
Experimental results on large factor graphs: linear to super-linear speedup using up to 120 processors
Thank You
Acknowledgements
Intel Research Pittsburgh: OpenCirrus Cluster
AT&T Labs
DARPA
Carnegie Mellon
Exponential Parallelism
From Saman Amarasinghe:
[Figure: number of cores (1-512, log scale) vs. year (1970-20??). Single-core processors from the 4004, 8008, 8080, 8086, 286, 386, 486, and Pentium lines through the Xeon, Opteron, Itanium, Power, and Athlon era hold at one core, while multicore and manycore parts (Niagara, Cell, Raw, Raza XLR, Cavium Octeon, Broadcom 1480, Xbox360, Intel Tflops, Cisco CSR-1, Ambric AM2045, Picochip PC102) climb exponentially.]
AIStats09 Speedup
[Figure: speedup results from [AIStats 09] on 3D video prediction and protein side-chain prediction.]