Parallel Splash
Belief Propagation
Joseph E. Gonzalez
Yucheng Low
Carlos Guestrin
David O’Hallaron
Computers which worked on this project:
BigBro1, BigBro2, BigBro3, BigBro4, BigBro5, BigBro6, BiggerBro, BigBroFS
Tashish01, Tashi02, Tashi03, Tashi04, Tashi05, Tashi06, …, Tashi30,
parallel, gs6167, koobcam (helped with writing)
Carnegie Mellon
Change in the Foundation of ML
Why talk about parallelism now?
[Figure: log CPU speed (GHz) vs. release date, 1988-2010; projected future sequential performance flattens out while projected future parallel performance continues to grow.]
2
Why is this a Problem?
[Figure: parallelism vs. sophistication. Nearest neighbor [Google et al.], basic regression [Cheng et al.], and support vector machines [Graf et al.] sit at high parallelism but low sophistication; graphical models [Mendiburu et al.] sit at high sophistication but low parallelism. We want to be at high parallelism and high sophistication.]
3
Why is it hard?
Algorithmic Efficiency: eliminate wasted computation.
Parallel Efficiency: expose independent computation.
Implementation Efficiency: map computation to real hardware.
4
The Key Insight
Statistical Structure
• Graphical Model Structure
• Graphical Model Parameters
Computational Structure
• Chains of Computational Dependences
• Decay of Influence
Parallel Structure
• Parallel Dynamic Scheduling
• State Partitioning for Distributed Computation
5
The Result
Splash Belief Propagation
[Figure: the parallelism vs. sophistication chart again. Nearest neighbor [Google et al.], basic regression [Cheng et al.], support vector machines [Graf et al.], and graphical models [Mendiburu et al.] as before; Splash belief propagation [Gonzalez et al.] moves graphical models to the high-parallelism goal.]
6
Outline
Overview
Graphical Models: Statistical Structure
Inference: Computational Structure
τε - Approximate Messages: Statistical Structure
Parallel Splash
Dynamic Scheduling
Partitioning
Experimental Results
Conclusions
7
Graphical Models and Parallelism
Graphical models provide a common language for general-purpose parallel algorithms in machine learning.
A parallel inference algorithm would improve: protein structure prediction, movie recommendation, computer vision.
Inference is a key step in learning graphical models.
8
Overview of Graphical Models
Graphical representation of local statistical dependencies.
[Figure: image-denoising model. Observed random variables (the noisy picture) are linked to latent pixel variables (the "true" pixel values) by local dependencies encoding continuity assumptions. Inference: what is the probability that this pixel is black?]
9
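To make the denoising model concrete, here is a minimal sketch (not the authors' code) of the pairwise factor graph the slides describe: one latent variable per pixel tied to its noisy observation, plus smoothing factors between neighboring pixels. All names and the potential functions are illustrative choices.

```python
import numpy as np

ROWS, COLS, STATES = 4, 4, 2          # a tiny 4x4 binary image for illustration

def node_potential(y_i, sigma=1.0):
    """Unary factor: how well each latent state (0 or 1) explains noisy pixel y_i."""
    states = np.arange(STATES)
    return np.exp(-(y_i - states) ** 2 / (2 * sigma ** 2))

def edge_potential(smoothing=2.0):
    """Pairwise factor: neighboring pixels prefer to agree (continuity assumption)."""
    psi = np.ones((STATES, STATES))
    np.fill_diagonal(psi, smoothing)
    return psi

# Factor-graph structure: one latent variable per pixel, one edge per adjacent pair.
variables = [(r, c) for r in range(ROWS) for c in range(COLS)]
edges = [((r, c), (r + dr, c + dc))
         for (r, c) in variables
         for (dr, dc) in ((0, 1), (1, 0))
         if r + dr < ROWS and c + dc < COLS]

noisy_image = np.random.rand(ROWS, COLS)   # stand-in for the observed noisy picture
unaries = {v: node_potential(noisy_image[v]) for v in variables}
print(len(variables), "variables,", len(edges), "pairwise factors")
```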
Synthetic Noisy Image Problem
[Images: noisy input image and predicted image.]
Overlapping Gaussian noise.
Used to assess convergence and accuracy.
Protein Side-Chain Prediction
Model side-chain interactions as a graphical model
Inference: what is the most likely orientation?
11
Protein Side-Chain Prediction
276 Protein Networks:
Approximately 700 variables, 1600 factors, and 70 discrete orientations per network, with strong factors.
[Figure: example degree distribution; degrees range from roughly 6 to 46.]
12
Markov Logic Networks
Represent Logic as a graphical model
A: Alice, B: Bob.
True/False variables: Smokes(A), Smokes(B), Cancer(A), Cancer(B), Friends(A,B).
Factors: Friends(A,B) And Smokes(A) ⇒ Smokes(B); Smokes(A) ⇒ Cancer(A); Smokes(B) ⇒ Cancer(B).
Inference: Pr(Cancer(B) = True | Smokes(A) = True, Friends(A,B) = True) = ?
13
Markov Logic Networks
UW-Systems Model
8K Binary Variables
406K Factors
Irregular degree distribution: some vertices with very high degree.
[Figure: the same Alice/Bob factor graph as on the previous slide.]
14
Outline
Overview
Graphical Models: Statistical Structure
Inference: Computational Structure
τε - Approximate Messages: Statistical Structure
Parallel Splash
Dynamic Scheduling
Partitioning
Experimental Results
Conclusions
15
The Inference Problem
What is the probability that Bob smokes given Alice smokes?
What is the best configuration of the protein side-chains?
What is the probability that each pixel is black?
[Figure: the Alice/Bob Markov logic factor graph from earlier.]
NP-Hard in General
Approximate Inference:
Belief Propagation
16
Belief Propagation (BP)
Iterative message passing algorithm
Naturally Parallel Algorithm
17
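For readers who want the message-passing step spelled out, the following is a minimal, self-contained sketch of sum-product belief propagation on a tiny binary chain. It is illustrative only: the potentials, variable names, and the synchronous update loop are my choices, not the talk's implementation.

```python
import numpy as np

STATES = 2
unary = {v: np.random.rand(STATES) + 0.1 for v in range(3)}   # node potentials
psi = np.array([[2.0, 1.0], [1.0, 2.0]])                      # smoothing pairwise potential
neighbors = {0: [1], 1: [0, 2], 2: [1]}
messages = {(u, v): np.ones(STATES) / STATES                  # uniform initialization
            for u in neighbors for v in neighbors[u]}

def send_message(u, v):
    """Message u -> v: u's unary times its other inbound messages, marginalized over u."""
    incoming = unary[u].copy()
    for w in neighbors[u]:
        if w != v:
            incoming = incoming * messages[(w, u)]
    msg = psi.T @ incoming
    return msg / msg.sum()

def belief(v):
    """Belief at v: unary times all inbound messages, normalized."""
    b = unary[v].copy()
    for w in neighbors[v]:
        b = b * messages[(w, v)]
    return b / b.sum()

for _ in range(10):                      # synchronous rounds: new messages from old
    messages = {(u, v): send_message(u, v) for (u, v) in messages}
print("belief at the middle variable:", belief(1))
```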
Parallel Synchronous BP
Given the old messages all new messages can be
computed in parallel:
[Figure: CPUs 1 through n each read the old messages and write a disjoint subset of the new messages.]
Map-Reduce Ready!
18
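A sketch of why this schedule is "naturally parallel": every new message is a pure function of the old messages, so a whole round is an embarrassingly parallel map. The helper signature send_message(u, v, old_messages) is an assumption, following the earlier chain sketch.

```python
def synchronous_round(old_messages, send_message):
    """One synchronous BP round: every new message depends only on old_messages."""
    # No update reads another new message, so this dict comprehension could be
    # handed to any parallel map (threads, processes, or a map-reduce job).
    return {(u, v): send_message(u, v, old_messages) for (u, v) in old_messages}
```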
Sequential Computational Structure
19
Hidden Sequential Structure
20
Hidden Sequential Structure
[Figure: a chain graphical model with evidence at both ends.]
Running time = (time for a single parallel iteration) × (number of iterations).
21
Optimal Sequential Algorithm
[Figure: running time on an n-variable chain. Naturally parallel (synchronous) schedule: 2n²/p for p ≤ 2n. Forward-backward: 2n with p = 1. Optimal parallel: n with p = 2. A large gap separates the synchronous schedule from the optimal one.]
22
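For comparison, a sketch of the sequential forward-backward schedule on an n-vertex chain: one sweep in each direction computes every message exactly once, for 2n message computations in total. The send_message helper is assumed as before.

```python
def forward_backward(n, send_message, messages):
    """Sequential schedule on a chain 0..n-1: one sweep each way, 2n message updates."""
    for i in range(n - 1):                    # forward pass: left to right
        messages[(i, i + 1)] = send_message(i, i + 1, messages)
    for i in range(n - 1, 0, -1):             # backward pass: right to left
        messages[(i, i - 1)] = send_message(i, i - 1, messages)
    return messages
```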
Key Computational Structure
[Figure: the same running-time chart. The gap between the naturally parallel schedule (2n²/p, p ≤ 2n) and the optimal parallel schedule (n, p = 2) reflects inherent sequential structure and requires efficient scheduling.]
23
Outline
Overview
Graphical Models: Statistical Structure
Inference: Computational Structure
τε - Approximate Messages: Statistical Structure
Parallel Splash
Dynamic Scheduling
Partitioning
Experimental Results
Conclusions
24
Parallelism by Approximation
[Figure: the true messages along a 10-vertex chain compared with a τε-approximation.]
τε represents the minimal sequential structure.
25
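One way to see the "decay of influence" behind τε is to perturb one end of a chain and measure how far the perturbation travels before its effect on the messages falls below ε. The sketch below does this numerically; the operational definition of τε used here is my illustrative reading of the slides, not the paper's formal one.

```python
import numpy as np

def propagate(first_message, n, psi, unary):
    """Run messages left-to-right along an n-vertex chain and return all of them."""
    msgs = [first_message / first_message.sum()]
    for _ in range(n - 1):
        m = psi.T @ (unary * msgs[-1])
        msgs.append(m / m.sum())
    return np.array(msgs)

n, eps = 40, 1e-5
psi = np.array([[2.0, 1.0], [1.0, 2.0]])        # smoothing pairwise potential
unary = np.array([1.0, 1.0])                    # uninformative node potentials
a = propagate(np.array([1.0, 0.0]), n, psi, unary)   # two different boundary
b = propagate(np.array([0.0, 1.0]), n, psi, unary)   # conditions at vertex 0
err = np.abs(a - b).max(axis=1)                 # per-hop effect of the perturbation
tau_eps = int(np.argmax(err < eps)) if (err < eps).any() else n
print("influence decays below eps after", tau_eps, "hops")
```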
Tau-Epsilon Structure
[Figure: message approximation error (log scale); often τε decreases quickly. Shown for protein networks and Markov logic networks.]
26
Running Time Lower Bound
Theorem: using p processors it is not possible to obtain a τε-approximation in time less than the sum of a parallel component and a sequential component:
27
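Reconstructing the bound from the proof sketch on the next slide (p/2 processors in one direction must make n − τε vertices τε-left-aware, while a single processor makes at most k − τε + 1 vertices left-aware in k iterations), it is presumably of the form:

```latex
% Reconstruction from the proof sketch; the talk's exact constants may differ.
\frac{p}{2}\left(k - \tau_\epsilon + 1\right) \;\ge\; n - \tau_\epsilon
\quad\Longrightarrow\quad
k \;\ge\; \underbrace{\frac{2\,(n-\tau_\epsilon)}{p}}_{\text{parallel component}}
 \;+\; \underbrace{\tau_\epsilon - 1}_{\text{sequential component}}
 \;=\; \Omega\!\left(\frac{n}{p} + \tau_\epsilon\right).
```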
Proof: Running Time Lower Bound
Consider one direction using p/2 processors (p ≥ 2):
[Figure: a chain of n vertices, 1 … n, divided into segments of length τε.]
We must make n − τε vertices τε-left-aware.
A single processor can only make k − τε + 1 vertices left-aware in k iterations.
28
Optimal Parallel Scheduling
[Figure: the chain divided evenly among Processors 1, 2, and 3.]
Theorem: using p processors this algorithm achieves a τε-approximation in time:
29
Proof: Optimal Parallel Scheduling
All vertices are left-aware of the leftmost vertex on their processor.
[Figure: left-awareness spreading after exchanging messages and after the next iteration.]
After k parallel iterations each vertex is (k − 1)(n/p) left-aware.
30
Proof: Optimal Parallel Scheduling
After k parallel iterations each vertex is (k − 1)(n/p) left-aware.
Since all vertices must be made τε left-aware:
Each iteration takes O(n/p) time:
31
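Filling in the arithmetic that the two trailing colons point to (this is my reconstruction from the surrounding text, so the talk's exact expressions may differ):

```latex
(k-1)\,\frac{n}{p} \;\ge\; \tau_\epsilon
\;\;\Longrightarrow\;\;
k \;\ge\; \frac{\tau_\epsilon\, p}{n} + 1,
\qquad
\text{total time} \;=\; k \cdot O\!\left(\frac{n}{p}\right)
 \;=\; O\!\left(\frac{n}{p} + \tau_\epsilon\right).
```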
Comparing with Synchronous BP
[Figure: the synchronous schedule vs. the optimal schedule across Processors 1-3, showing the gap between them.]
32
Outline
Overview
Graphical Models: Statistical Structure
Inference: Computational Structure
τε - Approximate Messages: Statistical Structure
Parallel Splash
Dynamic Scheduling
Partitioning
Experimental Results
Conclusions
33
The Splash Operation
Generalize the optimal chain algorithm to arbitrary cyclic graphs (a sketch follows below):
1) Grow a BFS spanning tree with fixed size
2) Forward pass computing all messages at each vertex
3) Backward pass computing all messages at each vertex
34
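A minimal sketch of one Splash, assuming a generic update_vertex(v) that recomputes all of v's outbound messages (illustrative only; the real implementation is the authors' C++ system):

```python
from collections import deque

def splash(root, neighbors, max_size, update_vertex):
    """Run one Splash rooted at `root` over a graph given as an adjacency dict."""
    # 1) Grow a breadth-first spanning tree of at most max_size vertices.
    order, visited, frontier = [], {root}, deque([root])
    while frontier and len(order) < max_size:
        v = frontier.popleft()
        order.append(v)
        for w in neighbors[v]:
            if w not in visited:
                visited.add(w)
                frontier.append(w)
    # 2) Forward pass: from the leaves in toward the root.
    for v in reversed(order):
        update_vertex(v)          # recompute all outbound messages of v
    # 3) Backward pass: from the root back out to the leaves.
    for v in order:
        update_vertex(v)
    return order
```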
Running Parallel Splashes
[Figure: CPUs 1-3, each with its own local state, running Splashes in parallel.]
Partition the graph.
Schedule Splashes locally.
Transmit the messages along the boundary of the partition.
Key challenges:
1) How do we schedule Splashes?
2) How do we partition the graph?
35
Where do we Splash?
Assign priorities and use a scheduling queue to
select roots:
[Figure: CPU 1 with its local state and scheduling queue, choosing which vertex to Splash next.]
How do we assign priorities?
Message Scheduling
Residual Belief Propagation [Elidan et al., UAI 06]:
Assign priorities based on change in inbound messages
[Figure: one vertex whose inbound messages change very little, and one whose inbound messages change a lot.]
Small change: updating the vertex is an expensive no-op.
Large change: updating the vertex yields an informative update.
37
Problem with Message Scheduling
Small changes in messages do not imply small
changes in belief:
[Figure: a small change in all inbound messages can still produce a large change in the belief.]
38
Problem with Message Scheduling
Large changes in a single message do not imply
large changes in belief:
[Figure: a large change in a single inbound message can still produce only a small change in the belief.]
39
Belief Residual Scheduling
Assign priorities based on the cumulative change
in belief:
r_v = ‖Δb_v‖₁ + ‖Δb_v‖₁ + ⋯ , one term for each inbound message change since v was last updated.
A vertex whose belief has changed substantially since last being updated will likely produce informative new messages.
40
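A sketch of how belief-residual scheduling can be implemented with a lazy priority queue: each vertex accumulates the L1 change of its belief caused by new inbound messages, and the scheduler splashes the vertex with the largest accumulated residual. Class and method names are illustrative, not the talk's.

```python
import heapq
import numpy as np

class BeliefResidualQueue:
    def __init__(self):
        self.residual = {}                 # vertex -> accumulated L1 belief change
        self.heap = []                     # lazy max-heap of (-residual, vertex)

    def account(self, vertex, old_belief, new_belief):
        """Call whenever an inbound message changes `vertex`'s belief."""
        delta = float(np.abs(new_belief - old_belief).sum())
        self.residual[vertex] = self.residual.get(vertex, 0.0) + delta
        heapq.heappush(self.heap, (-self.residual[vertex], vertex))

    def pop_root(self):
        """Return the vertex with the highest residual and reset it to zero."""
        while self.heap:
            neg_r, v = heapq.heappop(self.heap)
            if self.residual.get(v, 0.0) == -neg_r and -neg_r > 0:
                self.residual[v] = 0.0     # stale heap entries are simply skipped
                return v
        return None
```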
Message vs. Belief Scheduling
Belief Scheduling improves
accuracy and convergence
[Charts: (left) L1 error in beliefs vs. time (seconds) for message scheduling vs. belief scheduling; (right) percentage of runs converged within 4 hours for belief residuals vs. message residuals. Belief scheduling achieves lower error and a higher fraction converged.]
41
Splash Pruning
Belief residuals can be used to dynamically
reshape and resize Splashes:
[Figure: a Splash whose BFS tree is pruned away from vertices with low belief residuals.]
Using Splash pruning our algorithm is able to dynamically select the optimal Splash size.
[Chart: running time (seconds) vs. Splash size (messages), with and without pruning.]
43
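A sketch of the pruning rule applied while growing the Splash tree: only expand into vertices whose accumulated belief residual is above a threshold. The threshold value and helper names are illustrative assumptions.

```python
from collections import deque

def grow_pruned_splash(root, neighbors, residual, max_size, min_residual=1e-3):
    """Grow the Splash BFS tree, skipping vertices whose belief residual is tiny."""
    order, visited, frontier = [], {root}, deque([root])
    while frontier and len(order) < max_size:
        v = frontier.popleft()
        order.append(v)
        for w in neighbors[v]:
            # Prune: only expand into vertices that still have work worth doing.
            if w not in visited and residual.get(w, 0.0) >= min_residual:
                visited.add(w)
                frontier.append(w)
    return order
```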
Example
[Figure: the synthetic noisy image and its factor graph colored by the number of vertex updates, ranging from many updates to few updates.]
The algorithm identifies and focuses on the hidden sequential structure.
44
Parallel Splash Algorithm
[Figure: CPUs 1-3 connected by a fast reliable network, each with its own scheduling queue, local state, and running Splashes.]
Partition the factor graph over processors.
Schedule Splashes locally using belief residuals.
Transmit messages on the boundary, retaining optimality.
Theorem: given a uniform partitioning of the chain graphical model, Parallel Splash will run in time:
45
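The stated running time is presumably of the same order as the optimal chain schedule established earlier (my assumption, matching the n/p + τε structure of that result):

```latex
% Assumed to match the earlier optimal-scheduling bound.
O\!\left(\frac{n}{p} + \tau_\epsilon\right)
```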
Partitioning Objective
The partitioning of the factor graph determines:
Storage, Computation, and Communication
Goal:
Balance Computation and Minimize Communication
[Figure: the factor graph split between CPU 1 and CPU 2, annotated with the balance requirement and the communication cost across the cut.]
46
The Partitioning Problem
Objective: minimize communication while ensuring balanced work.
This depends on per-vertex update counts, which are not known in advance!
Work: the total number of updates assigned to each processor.
Comm: the messages that cross the partition boundary.
The problem is NP-hard, so we use the fast METIS partitioning heuristic.
47
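A sketch of the two quantities the partitioner has to trade off, given some vertex-to-processor assignment: per-processor work (a sum of per-vertex update counts, which are unknown ahead of time) and communication cost (edges cut). Function names are illustrative; the talk itself uses METIS for the actual partitioning.

```python
def work_imbalance(assignment, update_counts, num_procs):
    """Max per-processor work divided by the mean (1.0 = perfectly balanced)."""
    work = [0.0] * num_procs
    for v, proc in assignment.items():
        work[proc] += update_counts.get(v, 1.0)   # unknown counts default to 1
    return max(work) / (sum(work) / num_procs)

def communication_cost(assignment, edges):
    """Number of factor-graph edges whose endpoints land on different processors."""
    return sum(1 for u, v in edges if assignment[u] != assignment[v])
```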
Unknown Update Counts
Determined by belief scheduling
Depends on: graph structure, factors, …
Little correlation between past & future update
counts
[Figure: the noisy image and its per-vertex update counts.]
Simple solution: an uninformed cut.
48
Uninformed Cuts
[Figure: an uninformed cut vs. the optimal cut overlaid on the update counts; the uninformed cut gives some processors too much work and others too little.]
The uninformed cut has greater work imbalance but lower communication cost.
[Charts: work imbalance and communication cost for uninformed vs. optimal cuts on the Denoise and UW-Systems problems.]
49
Over-Partitioning
Over-cut the graph into k·p partitions and randomly assign them to CPUs.
This increases balance but also increases communication cost (more boundary).
[Figure: the graph assigned directly to CPU 1 and CPU 2 without over-partitioning vs. over-partitioned into k·p pieces (k = 6) randomly assigned to the two CPUs.]
50
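A sketch of the over-partitioning step: cut into k·p pieces with any partitioner (the talk uses METIS), then map pieces to the p processors uniformly at random so the unknown per-piece work averages out. Names and the seed handling are illustrative.

```python
import random

def over_partition(pieces, num_procs, seed=0):
    """pieces: dict piece_id -> list of vertices. Returns vertex -> processor."""
    rng = random.Random(seed)
    assignment = {}
    for piece_id, vertices in pieces.items():
        proc = rng.randrange(num_procs)        # random piece -> processor map
        for v in vertices:
            assignment[v] = proc
    return assignment
```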
Over-Partitioning Results
Provides a simple method to trade between work
balance and communication cost
[Charts: work imbalance and communication cost vs. partition factor k (0-15); as k grows, work imbalance falls while communication cost rises.]
51
CPU Utilization
Over-partitioning improves CPU utilization:
[Charts: number of active CPUs over time (seconds) on the UW-Systems MLN and Denoise problems, comparing no over-partitioning with 10x over-partitioning; over-partitioning keeps more CPUs active.]
52
Parallel Splash Algorithm
[Figure: CPUs 1-3 connected by a fast reliable network, each with its own scheduling queue, local state, and running Splashes.]
Over-Partition factor graph
Randomly assign pieces to processors
Schedule Splashes locally using belief residuals
Transmit messages on boundary
53
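Putting the four steps above together, here is a sketch of one processor's event loop. The callables stand in for the pieces sketched earlier and are assumptions on my part; the actual system described in the talk is C++ over MPICH2.

```python
def splash_worker(pop_root, run_splash, sync_boundary, converged):
    """One processor's loop: splash the highest-residual local root, then sync."""
    while not converged():              # e.g. all local residuals below a threshold
        root = pop_root()               # vertex with the highest belief residual
        if root is not None:
            run_splash(root)            # grow a bounded BFS tree and sweep it twice
        sync_boundary()                 # asynchronously exchange cut-crossing messages
```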
Outline
Overview
Graphical Models: Statistical Structure
Inference: Computational Structure
τε - Approximate Messages: Statistical Structure
Parallel Splash
Dynamic Scheduling
Partitioning
Experimental Results
Conclusions
54
Experiments
Implemented in C++ using MPICH2 as a
message passing API
Ran on Intel OpenCirrus cluster: 120 processors
15 Nodes with 2 x Quad Core Intel Xeon Processors
Gigabit Ethernet Switch
Tested on Markov Logic Networks obtained from
Alchemy [Domingos et al. SSPR 08]
Present results on largest UW-Systems and smallest
UW-Languages MLNs
55
Parallel Performance (Large Graph)
UW-Systems: 8K variables, 406K factors.
Single-processor running time: 1 hour.
Linear to super-linear speedup up to 120 CPUs; the super-linearity comes from cache efficiency.
[Chart: speedup vs. number of CPUs (0-120) for no over-partitioning and 5x over-partitioning, plotted against the linear-speedup line.]
56
Parallel Performance (Small Graph)
UW-Languages: 1K variables, 27K factors.
Single-processor running time: 1.5 minutes.
Linear to super-linear speedup up to 30 CPUs; beyond that, network costs quickly dominate the short running time.
[Chart: speedup vs. number of CPUs (0-120) for no over-partitioning and 5x over-partitioning, plotted against the linear-speedup line.]
57
Outline
Overview
Graphical Models: Statistical Structure
Inference: Computational Structure
τε - Approximate Messages: Statistical Structure
Parallel Splash
Dynamic Scheduling
Partitioning
Experimental Results
Conclusions
58
Summary
Algorithmic Efficiency: Splash structure + belief residual scheduling.
Parallel Efficiency: independent parallel Splashes.
Implementation Efficiency: distributed queues, asynchronous communication, over-partitioning.
Experimental results on large factor graphs: linear to super-linear speed-up using up to 120 processors.
59
Conclusion
[Figure: the parallelism vs. sophistication chart; with Parallel Splash Belief Propagation, "we are here" at the high-parallelism, high-sophistication goal.]
Questions
61
Protein Results
62
3D Video Task
63
Distributed Parallel Setting
[Figure: compute nodes, each with CPUs, caches, buses, and memory, connected by a fast reliable network.]
Opportunities:
Access to larger systems: 8 CPUs → 1000 CPUs
Linear Increase:
RAM, Cache Capacity, and Memory Bandwidth
Challenges:
Distributed state, Communication and Load Balancing
64