Slide 1: OPTIMIZATION WITH PARITY CONSTRAINTS: FROM BINARY CODES TO DISCRETE INTEGRATION
Stefano Ermon*, Carla P. Gomes*, Ashish Sabharwal+, and Bart Selman*
*Cornell University, +IBM Watson Research Center
UAI 2013

Slide 2: High-dimensional integration
• High-dimensional integrals arise throughout statistics, ML, and physics:
  - Expectations / model averaging
  - Marginalization
  - Partition function / ranking models / parameter learning
• Curse of dimensionality: quadrature over an n-dimensional hypercube involves a weighted sum over an exponential number of items (e.g., units of volume), growing as L, L^2, L^3, L^4, ..., L^n with L grid points per dimension.

Slide 3: Discrete Integration
• We are given a set of 2^n items with non-negative weights w. [Figure: items whose size visually represents their weight.]
• Goal: compute the total weight.
• The weight function is compactly specified in factored form (Bayes net, factor graph, CNF, ...).
• Example 1: n = 2 variables, sum over 4 items with weights 5, 0, 2, 1. Goal: compute 5 + 0 + 2 + 1 = 8.
• Example 2: n = 100 variables, sum over 2^100 ≈ 10^30 items (intractable).

Slide 4: Hardness
• 0/1 weights case:
  - Is there at least one "1"? SAT (NP-complete).
  - How many "1"s are there? #SAT (#P-complete). Much harder.
• General weights:
  - Find the heaviest item (combinatorial optimization, MAP).
  - Sum the weights (discrete integration).
• [Figure: complexity hierarchy from easy to hard: P, NP, PH, P^#P, PSPACE, EXP.]
• [ICML-13] WISH: approximate discrete integration via optimization, e.g., the partition function via MAP inference.
• MAP inference is often fast in practice: relaxations/bounds, pruning.

Slide 5: WISH: Integration by Hashing and Optimization
• Outer loop over the n variables (i = 0, 1, ..., n random parity constraints); for each i, repeat log(n) times: MAP inference on the model augmented with i random parity constraints; finally, aggregate the MAP solutions.
• Original graphical model: n binary variables, σ ∈ {0,1}^n.
• Augmented model: parity check nodes enforcing A σ = b (mod 2).
• The algorithm requires only O(n log n) MAP queries to approximate the partition function within a constant factor.

Slide 6: Visual working of the algorithm
• How it works: start from the function to be integrated; add 1 random parity constraint, then 2, then 3, ..., up to n, repeating each setting log(n) times.
• Estimate: mode M0 + median M1 × 1 + median M2 × 2 + median M3 × 4 + ..., where Mi aggregates the MAP values obtained with i parity constraints.

Slide 7: Accuracy Guarantees
• Theorem [ICML-13]: With probability at least 1 - δ (e.g., 99.9%), WISH computes a 16-approximation of the partition function (discrete integral) by solving Θ(n log n) MAP inference queries (optimization).
• Theorem [ICML-13]: The approximation factor can be improved to (1+ε) by adding extra variables and factors. Example: a factor-2 approximation with 4n variables.
• Remark: faster than enumeration only when the combinatorial optimization is efficient.
• (A minimal sketch of the WISH estimator follows below.)
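The estimator described on the last three slides fits in a few lines of code. The following Python sketch is illustrative only and is not the authors' implementation: brute_force_map stands in for the MAP oracle (in practice an ILP or branch-and-bound solver), and the weight function, the number of repetitions T, and the toy 2-variable example are our own assumptions.

```python
# Minimal sketch of the WISH estimator (illustrative, not the authors' code).
import numpy as np
from statistics import median

def brute_force_map(w, n, A, b):
    """Stand-in MAP oracle: max_x w(x) s.t. A x = b (mod 2), by enumeration (tiny n only)."""
    best = 0.0
    for idx in range(2 ** n):
        x = np.array([(idx >> j) & 1 for j in range(n)])
        if A.shape[0] == 0 or np.all((A @ x) % 2 == b):
            best = max(best, w(x))
    return best

def wish(w, n, T=7, rng=np.random.default_rng(0)):
    """Estimate Z = sum_x w(x) using O(n*T) MAP queries (T ~ log n in the paper)."""
    M = []
    for i in range(n + 1):                       # i random parity constraints
        vals = []
        for _ in range(T):
            A = rng.integers(0, 2, size=(i, n))  # uniformly random parity matrix
            b = rng.integers(0, 2, size=i)
            vals.append(brute_force_map(w, n, A, b))
        M.append(median(vals))
    # Aggregate: M0 + sum_i M_{i+1} * 2^i
    return M[0] + sum(M[i + 1] * (2 ** i) for i in range(n))

# Toy check on the n = 2 example with weights 5, 0, 2, 1 (exact total: 8).
weights = {(0, 0): 5.0, (0, 1): 0.0, (1, 0): 2.0, (1, 1): 1.0}
print(wish(lambda x: weights[tuple(int(v) for v in x)], n=2))
```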
Slide 8: Summary of contributions
• Introduction and previous work:
  - WISH: approximate discrete integration via optimization.
  - Partition function / marginalization via MAP inference.
  - Accuracy guarantees.
• MAP inference subject to parity constraints:
  - Tractable cases and approximations.
  - Integer Linear Programming formulation.
  - New family of polynomial-time (probabilistic) upper and lower bounds on the partition function that can be iteratively tightened (eventually reaching a constant-factor approximation).
• Sparsity of the parity constraints:
  - Techniques to improve solution time and bound quality.
  - Experimental improvements over variational techniques.

Slide 9: MAP INFERENCE WITH PARITY CONSTRAINTS
Hardness, approximations, and bounds

Slide 10: Making WISH more scalable
• Would approximations to the optimization (MAP inference with parity constraints) be useful? YES.
• Bounds on MAP (optimization) translate to bounds on the partition function Z (discrete integral):
  - Lower bounds on MAP (local search) give lower bounds on Z.
  - Upper bounds on MAP (LP, SDP relaxations) give upper bounds on Z.
  - Constant-factor approximations of MAP give constant-factor approximations of Z.
• Question: are there classes of problems where we can efficiently approximate the optimization (MAP inference) in the inner loop of WISH?

Slide 11: Error correcting codes
• Communication over a noisy channel: Alice sends x = 0100|1, Bob receives y = 0110|1.
• The redundant parity check bit of the sent message is 0 XOR 1 XOR 0 XOR 0 = 1; the received data bits give 0 XOR 1 XOR 1 XOR 0 = 0 ≠ 1.
• Bob: there has been a transmission error! What was the message actually sent by Alice? It must be a valid codeword, and as close as possible to the received message y.

Slide 12: Decoding a binary code
• Max-likelihood decoding: x = 0100|1 is sent through the noisy channel, y = 0110|1 is received.
• ML-decoding graphical model: a noisy-channel model plus parity check nodes enforcing that the transmitted string is a codeword. Our more general case: a more complex probabilistic model with the same parity check nodes.
• MAP inference here is NP-hard to approximate within any constant factor [Stern, Arora, ...]. Yet LDPC decoding is routinely solved in practice: 10GBase-T Ethernet, Wi-Fi 802.11n, digital TV, ...
• Max w(x) subject to A x = b (mod 2) is equivalent to MAP inference on the augmented model.

Slide 13: Decoding via Integer Programming
• MAP inference subject to parity constraints is encoded as an Integer Linear Program (ILP):
  - standard MAP encoding;
  - compact (polynomial-size) encoding by Yannakakis for the parity constraints (parity polytope).
• LP relaxation: relaxing the integrality constraints gives polynomial-time upper bounds.
• ILP solving strategy: cuts + branching + LP relaxations. Solving a sequence of LP relaxations yields upper and lower bounds that improve over time. (A toy ILP encoding of a parity constraint is sketched in the extra material at the end.)

Slide 14: Iterative bound tightening
• Polynomial-time upper and lower bounds on MAP that are iteratively tightened over time.
• Recall: bounds on the optimization (MAP) give (probabilistic) bounds on the partition function Z. This yields a new family of bounds.
• WISH: when MAP is solved to optimality (lower bound = upper bound), we get a guaranteed constant-factor approximation of Z.

Slide 15: SPARSITY OF THE PARITY CONSTRAINTS
Improving solution time and bounds quality

Slide 16: Inducing sparsity
• Observations:
  - Problems with sparse A x = b (mod 2) are empirically easier to solve (similar to Low-Density Parity Check codes).
  - The quality of the LP relaxation depends on A and b, not just on the solution space: elementary row operations (e.g., summing two equations) do not change the solution space but do affect the LP relaxation.
• Procedure (see the sketch below):
  1) Reduce A x = b (mod 2) to row-echelon form with Gaussian elimination (linear equations over a finite field).
  2) Greedily apply elementary row operations.
• [Figure: parity check nodes for the matrix A in row-echelon form vs. the equivalent but sparser system.]
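The two-step procedure above is easy to prototype. Here is a minimal Python sketch written under our own assumptions: it is not the authors' implementation, and the greedy pass shown is just one possible heuristic (add row j to row i whenever that strictly reduces the number of ones in row i).

```python
# Sparsifying a parity constraint system A x = b (mod 2): illustrative sketch.
import numpy as np

def gf2_row_echelon(A, b):
    """Gaussian elimination of [A | b] over GF(2)."""
    A, b = A.copy() % 2, b.copy() % 2
    m, n = A.shape
    row = 0
    for col in range(n):
        pivots = np.nonzero(A[row:, col])[0]
        if len(pivots) == 0:
            continue
        piv = row + pivots[0]
        A[[row, piv]], b[[row, piv]] = A[[piv, row]], b[[piv, row]]  # swap rows
        for r in range(m):                                           # clear the pivot column
            if r != row and A[r, col]:
                A[r] ^= A[row]
                b[r] ^= b[row]
        row += 1
        if row == m:
            break
    return A, b

def greedy_sparsify(A, b):
    """Elementary row operations accepted only if they reduce a row's weight."""
    A, b = A.copy(), b.copy()
    improved = True
    while improved:
        improved = False
        for i in range(A.shape[0]):
            for j in range(A.shape[0]):
                if i != j and np.sum(A[i] ^ A[j]) < np.sum(A[i]):
                    A[i] ^= A[j]
                    b[i] ^= b[j]
                    improved = True
    return A, b

rng = np.random.default_rng(0)
A = rng.integers(0, 2, size=(5, 12))
b = rng.integers(0, 2, size=5)
A2, b2 = greedy_sparsify(*gf2_row_echelon(A, b))
print(int(A.sum()), "ones in the original A,", int(A2.sum()), "ones after sparsification")
```

Both steps are elementary row operations over GF(2), so they preserve the solution set of A x = b (mod 2) and the sparser system can be handed to the ILP solver unchanged.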
Slide 17: Improvements from sparsity
• The quality of the LP relaxations improves significantly.
• Integer solutions are found faster (better lower bounds); without sparsification, the solver fails to find integer solutions (lower bounds).
• [Figure: upper and lower bound improvements from sparsification, using the IBM CPLEX ILP solver on a 10x10 Ising grid.]

Slide 18: Generating sparse constraints
• WISH is based on universal hashing: randomly generate A in {0,1}^(i x n) and b in {0,1}^i, and optimize over the solutions of A x = b (mod 2) (the parity constraints). Then A x + b (mod 2) is:
  - uniform over {0,1}^i, and
  - pairwise independent: given two variable assignments x and y, the events A x = b (mod 2) and A y = b (mod 2) are independent.
• Suppose instead we generate a sparse matrix A with at most k variables per parity constraint (up to k ones per row of A). Then A x + b (mod 2) is still uniform, but no longer pairwise independent.
  - E.g., for k = 1, A x = b (mod 2) is equivalent to fixing i variables: lots of correlation (knowing A x = b tells us a lot about whether A y = b).

Slide 19: Using sparse parity constraints
• Theorem: with probability at least 1 - δ (e.g., 99.9%), WISH with sparse parity constraints computes an approximate lower bound on the partition function.
• PRO: "easier" MAP inference queries. For example, random parity constraints of length 1 (on a single variable) are equivalent to MAP with some variables fixed.
• CON: we lose the upper bound; the output can underestimate the partition function.
• CON: no constant-factor approximation anymore.
• (A sketch contrasting dense and sparse constraint generation is included in the extra material at the end.)

Slide 20: MAP with sparse parity constraints
• Evaluation of MAP inference with sparse parity constraints. [Figure: results on a 10x10 attractive Ising grid and a 10x10 mixed Ising grid.]
• ILP and branch-and-bound outperform message passing (BP, MP, and MPLP).

Slide 21: Experimental results
• ILP provides probabilistic upper and lower bounds that improve over time and are often tighter than variational methods (BP, MF, TRW).

Slide 22: Experimental results (2)
• Further instances show the same pattern: the ILP upper and lower bounds improve over time and are often tighter than the variational methods (BP, MF, TRW).

Slide 23: Conclusions
• [ICML-13] WISH: discrete integration reduced to a small number of optimization instances (MAP), with strong (probabilistic) accuracy guarantees; MAP inference is still NP-hard.
• Scalability: approximations and bounds.
  - Connection with max-likelihood decoding.
  - ILP formulation + sparsity (Gaussian sparsification and uniform hashing).
  - New family of probabilistic, polynomial-time computable upper and lower bounds on the partition function, which can be iteratively tightened (eventually reaching a constant-factor approximation).
• Future work:
  - Extension to continuous integrals and variables.
  - Sampling from high-dimensional probability distributions.

Slide 24: Extra slides
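As an addendum to the "Generating sparse constraints" and "Using sparse parity constraints" slides: the sketch below, written under our own assumptions rather than taken from the paper, generates dense and k-sparse random parity constraints and checks empirically that the dense family is pairwise independent while the k = 1 family is not.

```python
# Dense vs. k-sparse random parity constraints (illustrative sketch).
import numpy as np

def dense_constraints(i, n, rng):
    """Universal hashing: every entry of A and b is an independent fair coin flip."""
    return rng.integers(0, 2, size=(i, n)), rng.integers(0, 2, size=i)

def sparse_constraints(i, n, k, rng):
    """Exactly k ones per row of A, positions chosen uniformly at random."""
    A = np.zeros((i, n), dtype=int)
    for r in range(i):
        A[r, rng.choice(n, size=k, replace=False)] = 1
    return A, rng.integers(0, 2, size=i)

def prob_both_satisfy(gen, x, y, trials, rng):
    """Monte Carlo estimate of P[A x = b (mod 2) and A y = b (mod 2)]."""
    hits = 0
    for _ in range(trials):
        A, b = gen(rng)
        if np.all((A @ x) % 2 == b) and np.all((A @ y) % 2 == b):
            hits += 1
    return hits / trials

n, i, trials = 8, 3, 20000
rng = np.random.default_rng(0)
x, y = np.zeros(n, dtype=int), np.ones(n, dtype=int)
# Pairwise independence would give P[both satisfied] = (1/2**i)**2 ~ 0.0156.
print("dense:", prob_both_satisfy(lambda r: dense_constraints(i, n, r), x, y, trials, rng))
print("k = 1:", prob_both_satisfy(lambda r: sparse_constraints(i, n, 1, r), x, y, trials, rng))
```

For the dense constraints the estimate is close to 1/64, as pairwise independence predicts; for k = 1 it collapses to 0 for this pair of assignments, which is exactly the correlation the slide warns about.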
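And for the "Decoding via Integer Programming" slide, a toy ILP formulation of MAP inference under a single parity constraint. This is a minimal sketch under our own assumptions: hypothetical unary potentials only, the simple auxiliary-integer-variable encoding of parity (the sum over the constrained variables equals b + 2z) rather than the compact Yannakakis parity-polytope encoding used in the paper, and the open-source PuLP modeling library in place of CPLEX.

```python
# Toy ILP: maximize sum_j theta_j * x_j subject to x0 XOR x2 XOR x3 = 1.
# Requires: pip install pulp
from pulp import LpProblem, LpMaximize, LpVariable, lpSum

n = 4
theta = [0.5, -1.0, 2.0, 0.3]        # hypothetical unary log-potentials
S, parity = [0, 2, 3], 1             # parity constraint over variables x0, x2, x3

prob = LpProblem("map_with_parity", LpMaximize)
x = [LpVariable(f"x{j}", cat="Binary") for j in range(n)]
z = LpVariable("z", lowBound=0, upBound=len(S) // 2, cat="Integer")

prob += lpSum(theta[j] * x[j] for j in range(n))      # objective
prob += lpSum(x[j] for j in S) == parity + 2 * z      # parity as a linear constraint

prob.solve()
print([int(v.value()) for v in x], "objective:", prob.objective.value())
```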