Distributed Parallel Inference on Large Factor Graphs
Joseph E. Gonzalez, Yucheng Low, Carlos Guestrin, David O’Hallaron
Carnegie Mellon

Exponential Parallelism
Parallel performance is increasing exponentially while sequential performance stays constant.
[Plot: processor speed (GHz, log scale) vs. release date, 1988-2010.]

Distributed Parallel Setting
Nodes (CPU, cache, bus, memory) connected by a fast, reliable network.
Opportunities: access to larger systems (8 CPUs to 1000 CPUs); linear increase in RAM, cache capacity, and memory bandwidth.
Challenges: distributed state, communication, and load balancing.

Graphical Models and Parallelism
Graphical models provide a common language for general-purpose parallel algorithms in machine learning.
A parallel inference algorithm would improve protein structure prediction, movie recommendation, and computer vision.
Inference is a key step in learning graphical models.

Belief Propagation (BP)
A message-passing algorithm, and a naturally parallel one.

Parallel Synchronous BP
Given the old messages, all new messages can be computed in parallel (one CPU per message): old messages in, new messages out.
Map-Reduce ready!

Hidden Sequential Structure
With evidence at the ends of a chain, information must still propagate sequentially across the chain.
Running time = (time for a single parallel iteration) x (number of iterations).

Optimal Sequential Algorithm
Running time on a chain of length n with p processors:
  Naturally parallel (synchronous): 2n^2/p, for p <= 2n.
  Forward-backward: 2n, with p = 1.
  Optimal parallel: n, with p = 2.

Parallelism by Approximation
Compare the true messages on the chain with a τε-approximation: τε represents the minimal sequential structure.

Optimal Parallel Scheduling
Each processor runs a forward-backward sweep over its own segment of the chain.
In [AIStats 09] we demonstrated that this algorithm is optimal: the synchronous schedule exploits only the parallel component and leaves a gap to the optimal schedule, which also exploits the sequential component.

The Splash Operation
Generalize the optimal chain algorithm to arbitrary cyclic graphs:
  1) Grow a BFS spanning tree of fixed size.
  2) Forward pass, computing all messages at each vertex.
  3) Backward pass, computing all messages at each vertex.

Running Parallel Splashes
Partition the graph, schedule Splashes locally on each CPU, and transmit the messages along the boundary of the partition.
Key challenges: 1) How do we schedule Splashes? 2) How do we partition the graph?

Where do we Splash?
Assign priorities and use a scheduling queue to select Splash roots. How do we assign priorities?

Message Scheduling
Residual belief propagation [Elidan et al., UAI 06]: assign priorities based on the change in inbound messages.
  Small change: the message is an expensive no-op.
  Large change: the message is informative; update it.

Problem with Message Scheduling
Small changes in messages do not imply small changes in belief: a small change in every inbound message can still produce a large change in the belief.
Conversely, a large change in a single message does not imply a large change in the belief.

Belief Residual Scheduling
Assign priorities based on the cumulative change in belief:
  r_v = sum of the changes in the belief at v caused by each inbound message update since v was last updated.
A vertex whose belief has changed substantially since it was last updated will likely produce informative new messages.
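As a concrete illustration of the belief residual rule, the sketch below accumulates the change in a vertex's belief each time an inbound message alters it and clears the residual when the vertex is updated by a Splash. This is a hypothetical reconstruction, not the authors' code: the names Vertex, belief_change, on_message_update, and select_root are illustrative only.

```cpp
#include <cmath>
#include <vector>

// Illustrative vertex record (hypothetical, not from the paper's implementation).
struct Vertex {
  std::vector<double> belief;  // current belief estimate at this vertex
  double residual = 0.0;       // cumulative change in belief since last update
};

// L1 distance between two belief vectors of the same length.
double belief_change(const std::vector<double>& a, const std::vector<double>& b) {
  double d = 0.0;
  for (size_t i = 0; i < a.size(); ++i) d += std::fabs(a[i] - b[i]);
  return d;
}

// Called whenever an inbound message to v changes: accumulate the induced
// change in belief into the residual (r_v = sum of belief changes).
void on_message_update(Vertex& v, const std::vector<double>& new_belief) {
  v.residual += belief_change(v.belief, new_belief);
  v.belief = new_belief;
}

// Called after v has been recomputed inside a Splash: its pending work is done.
void on_vertex_updated(Vertex& v) { v.residual = 0.0; }

// Root selection: splash the vertex with the largest belief residual next.
// (A real scheduler would use a priority queue rather than a linear scan.)
size_t select_root(const std::vector<Vertex>& vertices) {
  size_t best = 0;
  for (size_t i = 1; i < vertices.size(); ++i)
    if (vertices[i].residual > vertices[best].residual) best = i;
  return best;
}
```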
Message vs. Belief Scheduling
Belief scheduling improves accuracy more quickly and improves convergence.
[Plots: L1 error in beliefs vs. time (seconds) for message vs. belief scheduling; % of runs converged within 4 hours for belief residuals vs. message residuals.]

Splash Pruning
Belief residuals can be used to dynamically reshape and resize Splashes: vertices with low belief residual are pruned from the Splash.
With Splash pruning, the algorithm dynamically selects the optimal Splash size.
[Plot: running time (seconds) vs. Splash size (messages), with and without pruning.]

Example
On a synthetic noisy image, the vertex update counts show many updates in some regions and few in others: the algorithm identifies and focuses on the hidden sequential structure of the factor graph.

Distributed Belief Residual Splash (DBRSplash) Algorithm
Partition the factor graph over the processors, schedule Splashes locally using belief residuals, and transmit messages on the partition boundary.
Theorem: given a uniform partitioning of the chain graphical model, DBRSplash retains the optimality of the chain schedule.

Partitioning Objective
The partitioning of the factor graph determines storage, computation, and communication.
Goal: balance computation and minimize communication.

The Partitioning Problem
Objective: minimize communication while ensuring balance.
Both the work and the communication terms depend on per-vertex update counts, which are not known in advance.
The problem is NP-hard; METIS provides a fast partitioning heuristic.

Unknown Update Counts
Update counts are determined by belief scheduling and depend on the graph structure, the factors, and more.
There is little correlation between past and future update counts.
[Figures: noisy image, uninformed cut, and observed update counts.]

Uninformed Cuts
Compared with the optimal cut based on the true update counts, an uninformed cut gives some partitions too much work and others too little.
Uninformed cuts have greater work imbalance but lower communication cost.
[Bar charts: work imbalance and communication cost, uninformed vs. optimal cuts, on the Denoise and UW-Systems problems.]

Over-Partitioning
Over-cut the graph into k*p partitions and randomly assign the pieces to CPUs.
This increases balance but also increases communication cost (more boundary).
[Diagram: assignment to CPU 1 and CPU 2 without over-partitioning vs. with k = 6.]

Over-Partitioning Results
Over-partitioning provides a simple method to trade between work balance and communication cost.
[Plots: work imbalance and communication cost vs. partition factor k, 0-15.]

CPU Utilization
Over-partitioning improves CPU utilization.
[Plots: active CPUs vs. time (seconds) on the UW-Systems MLN and the Denoise problem, no over-partitioning vs. 10x over-partitioning.]

DBRSplash Algorithm
Over-partition the factor graph, randomly assign the pieces to processors, schedule Splashes locally using belief residuals, and transmit messages on the boundary.
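To make the over-partitioning step concrete, here is a minimal sketch under stated assumptions: an external partitioner (for example METIS) has already cut the factor graph into k*p pieces, recorded as a piece id per vertex, and we only show the random but balanced assignment of pieces to the p processors. All names (piece_of, assign_pieces, cpu_of) are hypothetical, not from the authors' code.

```cpp
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

// piece_of[v] in [0, k*p): the piece containing vertex v (from the partitioner).
// Returns cpu_of[v] in [0, p): the processor that owns vertex v.
std::vector<int> assign_pieces(const std::vector<int>& piece_of, int k, int p,
                               unsigned seed = 0) {
  const int num_pieces = k * p;

  // Shuffle the piece ids, then deal them round-robin to the p processors:
  // every processor receives exactly k randomly chosen pieces.
  std::vector<int> piece_to_cpu(num_pieces);
  std::iota(piece_to_cpu.begin(), piece_to_cpu.end(), 0);
  std::mt19937 rng(seed);
  std::shuffle(piece_to_cpu.begin(), piece_to_cpu.end(), rng);
  for (int i = 0; i < num_pieces; ++i) piece_to_cpu[i] %= p;

  // Map each vertex to the processor owning its piece.
  std::vector<int> cpu_of(piece_of.size());
  for (size_t v = 0; v < piece_of.size(); ++v)
    cpu_of[v] = piece_to_cpu[piece_of[v]];
  return cpu_of;
}
```

Dealing a shuffled list of pieces round-robin keeps the piece counts per processor exactly equal while the choice of pieces stays random, which is one simple way to even out the unknown per-piece work at the cost of extra boundary edges.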
Experiments
Implemented in C++ using MPICH2 as the message-passing API.
Ran on an Intel OpenCirrus cluster: 120 processors across 15 nodes, each with 2 x quad-core Intel Xeon processors, connected by a gigabit Ethernet switch.
Tested on Markov Logic Networks obtained from Alchemy [Domingos et al., SSPR 08].
We present results on the largest (UW-Systems) and smallest (UW-Languages) MLNs.

Parallel Performance (Large Graph)
UW-Systems: 8K variables, 406K factors; single-processor running time: 1 hour.
Linear to super-linear speedup up to 120 CPUs, due to cache efficiency.
[Plot: speedup vs. number of CPUs (up to 120), no over-partitioning vs. 5x over-partitioning, against the linear ideal.]

Parallel Performance (Small Graph)
UW-Languages: 1K variables, 27K factors; single-processor running time: 1.5 minutes.
Linear to super-linear speedup up to 30 CPUs; network costs quickly dominate the short running time.
[Plot: speedup vs. number of CPUs (up to 120), no over-partitioning vs. 5x over-partitioning, against the linear ideal.]

Summary
The Splash operation generalizes the optimal parallel schedule on chain graphs.
Belief-based scheduling addresses the problems with message scheduling and improves accuracy and convergence.
DBRSplash is an efficient distributed parallel inference algorithm, using over-partitioning to improve work balance.
Experimental results on large factor graphs show linear to super-linear speedup using up to 120 processors.

Thank You

Acknowledgements
Intel Research Pittsburgh (OpenCirrus cluster), AT&T Labs, DARPA, and Carnegie Mellon.

Exponential Parallelism (backup)
From Saman Amarasinghe: number of cores per chip over time, from single-core processors (4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Athlon, Itanium, Itanium 2, PA-8800, Power4, Power6, Yonah) to multi-core and many-core designs (Xeon MP, 4P Opteron, Xbox360, Tanglewood, Niagara, Cell, Raw, Broadcom 1480, Raza XLR, Cavium Octeon, Intel Tflops, Cisco CSR-1, Ambric AM2045, Picochip PC102).
[Plot: # of cores (log scale, 1 to 512) vs. year, 1970 onward.]

AIStats09 Speedup (backup)
[Plots: speedup on 3D video prediction and protein side-chain prediction.]