Algorithmic Tools Applied to Some Machine Learning and Inference Problems David Karger Conclusion Algorithms research has produced a huge set of tools that can be applied to efficiently solving computational problems Any time one can precisely define a computational problem, it’s worth looking in the algorithms toolbox to see if it can be solved efficiently So, define some problems and talk to your local algorithmicist! Outline Demonstrate by case studies: applying algorithmic tools to four distinct learning/inference problems Near neighbor searching MDPs with nonrecurring rewards Decoding turbo codes Learning low-treewidth graphical models Each is a well defined computational problem Approach 4 separate talks? Commonality not in tools (too many to enumerate) Rather, in reductionist / analogical approach Common heuristics in search for right tools By analogy to similar problems By properties of problem, e.g. Self-reducibility ---recursion, sampling Use of paths, trees, cliques, etc. Natural integer programming formulations By judicious simplification Tools we’ll use today Random sampling Self-reducibility Amortization Dynamic programming Combinatorial graph theory Constant-factor approximation algorithms for NPhard problems Max flow Approximation-preserving reductions Locally checkable structures Linear programming relaxations Randomized rounding of relaxed solutions Lost in Translation Algorithms designed for “clean” problems Measures of performance sometimes artificial Often require simplifying assumptions from real-world problems “Constant approximation factor” nice in theory, but which constant matters in practice So, algorithm may or may not be useful But even if not, can provide insight into practical ways of attacking the problem A Data Structure for Nearest Neighbors on Manifolds Random Sampling Greedy algorithms Local Search Memoization Motivation Common theme in machine learning: find high-fidelity embedding of high-dimensional data into a lower-dimensional representation Isomap [TdSL] and locally linear embedding [RS] postulate data on low-dimensional manifold in high dimensional space To reconstruct manifold, both find each node’s “near neighbors” by brute force Quadratic time in number of points Improve with fast near-neighbor queries? The Original Motivation Peer to peer systems Nodes need to find “nearby” nodes for fast communication/data accesss Internet is (kind-of) low dimensional This near neighbor data structure works well Near Neighbor Search Many optimal data structures in low dimensions (computational geometry) But high dimensional problem hard Subject of much recent attention in algorithms community [Kleinberg]---Two algorithms for NNS in high dimension: STOC 1997 [DISV] et al workshop on near neighbor search: NIPS 2003 But focus on approximate near neighbors First Try Manifold is low dimensional Use low dimensional NN solutions? E.g. Voronoi diagram Problem: Voronoi diagram relies on geometry of space But we don’t know geometry (coordinate system) until after we construct manifold So to reconstruct, are limited to using distance, not geometry Some Generic Ideas Greedy approach Take query q and closest point p found so far Arrange to find something closer to q than p Repeat till done Local Search Want point closer to q Must be pretty close to p So examine points near p till find one close to q Repeat with new point p’ p d d/2 q Some More Ideas Random Sampling For fast search, look only at a few candidates Good candidates depend on query point For any choice, “adversary” can pick bad q So choose random set---probably good for all Memoization Picking random points near p is a NN search! Pick in advance, as part of building data structure E.g., for every possible radius (!), store (a few) random candidates in that ball around p Random Sampling When will it work? Evenly distributed low-dimensional point sets E.g. a grid of points E.g., random sample from low-dim. manifold On grid, if p and q at distance d, then Ball(q,d/2) is 1/16 of points in Ball(p,2d) So, random point has 1/16 chance of improving And trying 16 random points raises odds to 60-40 A Big Fast Data Structure? At each point p, store 16 random points from balls of every possible radius around p Resolve query by iterating improvements Look in right size ball around current point If lucky, find point closer to q, can look in smaller ball next time If unlucky, end up with bigger ball But luck is more likely---drift in right direction Expected ball size shrinks by const factor So, O(log n) steps get to “tiny” ball---at answer Time to Get Real Storing samples for all radii expensive Worry about dependence OK to be sloppy, and store only for powers of 2 Means always have ball of roughly correct size But only log n balls needed per point So O(n log n) space overall What if search cycles back to point already seen? “random samples” are no longer random And how find random samples in first place? Metric Skip Lists Consider randomly ordering list of points For each power-of-two d, point p records next 16 points in list within distance d These points are a random sample from ball Search inspects points forward in list From current p, check 16 following in current d Search always moves forward So always considering “new” points So no cycles, and samples independent in every step (mostly) Construction Randomized Incremental Construction Common idea from computational geometry Adding points to data structure in random order is often easy, yields good data structure Build list by prepending points in random order To set new pointers, must find next 16 points within (each) distance d of new point But this is like NN query (given the way we search) Find answer by slight variation on NN search List so far constructed makes this easy Summary In any low-dimensional space with evenly distributed points, metric skip lists solve nearest neighbor queries O(n log n) space O(n log n) time to build data structure O(log n) time to query Constant depends on dimensionality of space Application: by building this structure, can find NNs for Isomap/LLE in O(n log n) time Experiments Data structure is simple---no hidden “theoretical” aspects Implementation for P2P search works well Experiments on some simple manifolds [JT] work well too Improvements [KL04] There are better ways to pick “improvement candidates” for local search Assouad’s doubling dimension is the maximum number of diameter-d balls needed to cover any diameter-d/2 ball Pick one candidate from each ball in cover Get same bounds as above, but Applies to unevenly distributed points Deterministic Markov Decision Processes with Nonrecurring Rewards Approximation algorithms Reductions to related problems Dynamic programming A Robot Navigation Problem Robot to deliver packages Goal to deliver as quickly as possible Sounds like traveling salesman problem? Mismatches Robot may not go where it plans to (sensor error, motor control error, battery failure….) Some packages matter more Formulate as Markov Decision Process Graph with rewards rv on states (vertices) v, travel times (lengths) on edges From each node, choice of actions (each a probability distribution on next vertex) Choosing sequence of actions produces a random path through graph If arrive at vertex v at time t, receive discounted reward gt rv where g<1 Motivates getting there quickly Goal: maximize total discounted reward MDP Discounting Reward received each time vertex is visited So plain value of infinite path can be infinite Discounting means total reward is bounded by a geometric series, so bounded Alternative: consider average reward per unit time Other reasons for discounting Inflation (money in future less value than now) Uncertainty (what if something happens before I collect future prize?) Mathematical elegance Solution Fixing action at each state produces a Markov Chain. Transition probabilities pvw Can compute expected discounted reward rv if start at state v: rv = rv + Sw pvw gt(v,w) rw Choosing actions to optimize this recurrence is polynomial time solvable Linear programming Dynamic programming (like shortest paths) Solving the wrong problem Package can only be delivered once One solution: expand state space So, incorrect to get reward each time reach target Vertex represents where I am and where I have been before (what packages already delivered) Reward nonzero only on states where current location not included in list of previously visited Now apply MDP algorithm Problem: expanded problem size exponential in original input This is one Instance of a General Problem [KL] Often, MDP has state space with nice small implicit description but huge explicit description How do we accomplish MDP optimization on such instances? Tackle an easier problem Problem has two novel elements for TOC We will ignore second issue Discounting of reward based on arrival time Probability distribution on outcome of actions In practice, robot can control errors Even first issue by itself is hard and interesting First step towards solving whole problem Frantic Salesman Problem Given rewards, travel times, and discount factor, find a path maximizing total discounted reward Approximation Algorithms FSP is NP-complete (thus, so is more general MDP-type problem) Reduction from minimum latency TSP So intractable to solve exactly Goal: approximation algorithm that is guaranteed to achieve at least some constant (<1) multiple of the best possible discounted reward TOC Toolbox Goal seems to be to find a “short” path that visits “lots” of reward Relates to previously studied k-TSP problem Given a root vertex v, find a path of minimum total length that starts at v and visits vertices with (undiscounted) prize at least k Constant factor approximation algorithm known for undirected graphs (so we assume this too) i.e., can find path of at most a constant (>1) multiple of minimum possible total edge length Mismatch Constant factor approximation doesn’t exponentiate well Suppose optimum solution reaches some reward r at time t for reward gtr Constant factor approximation would reach within time 2t for reward g2tr Result: get only gt fraction of optimum discounted reward, not a constant fraction. Idea: Change Objective Function Modify k-TSP to approximate prize collected instead of length: orienteering problem Assume tour of length l collecting prize p Find tour of length l collecting prize p/2 Avoids changing length, so exponentiation doesn’t hurt Drawback: no constant factor approximation previously known Flipping objective/feasibility transforms problem (Our techniques end up resolving this too) Idea: Upper Bounds General tool for approximation algorithms Show close to something no solution can beat Let dv denote shortest path distance to v Define the prize at v as pv=gdv rv Max discounted reward possibly collectable at v So max conceivable reward S pv Potential greedy algorithm: take shortest path to one max-prize vertex Gets at least 1/n of optimum Compare to Upper Bound If given path reaches v at time tv, define excess at v as ev = tv – dv Difference between shortest path and chosen one Then discounted reward at v is gev pv Idea: good solution need not bother visiting nodes at large excess If excess large and node still worth visiting, prize must be huge So forget current path, just go straight to huge prize without discounting Formalize Fact: excess only increases as traverse path Without loss of generality, assume g = ¼ Excess reflects lost time; can’t make it up Just scale edge lengths Claim: at least ½ of optimum path’s discounted reward R is collected before path’s excess rises to ½ Let w be first vertex with ew > ½ Suppose more than R/2 reward follows w Show contradiction Path Improvement ew > ½ but more than R/2 reward follows w Shortcut directly to w then traverse optimum reduces all excesses after w by ½ so improves discounts by (1/g) ½ = 2 so doubles discounted reward collected but this was more than R/2: contradiction w excess 1/2 0 excess 11/2 Discount Discounting We showed large excess can be ignored But if excess is small, discounting by excess can be ignored! (discounted) reward ~ (undiscounted) prize So, just find path with small excess maximizing amount of (undiscounted) prize Gives path with (discounted) reward ~ prize Of course, min-excess gives min-distance But, may be better off approximating excess Improvement on k-TSP: Approximate Excess Recall discounted reward at v is gev pv Prefix of optimum discounted reward path: Find a path of approximately (3 times) minimum excess and prize R/2 Has discounted reward S gev pv > R/2 So has prize S pv > R/2 And has no vertex with large excess (we can guess R/2) Excesses at most 3/2, so gev pv > pv/8 So discounted reward on found path > R/8 The Downside Min-excess problem more useful But harder to solve Approximating min-distance does not approximate min-excess Opt length 1+e length 2 e excess 1 excess Exactly Solvable Case: monotonic paths Suppose optimum goes through vertices in strictly increasing distance order from root Then can find optimum by dynamic program Just as can solve longest path in an acyclic graph Build table: is there a monotonic path from v with length l and prize p? To answer, look for a u after v with a path of length l – dvu and prize p - pv Works because monotonic path won’t go back through v Dynamic Program (3,1) (4,2) 1 4 (8,2) (7,2) (8,3) (13,3) (15,4) 2 (9,2) (6,2) (11,3) 1 2 2 (4,1) (5,2) (9,2) (2,0) 5 (7,1) Approximable case: wiggly paths Length of path to v is tv = dv + ev If ev > dv then tv > ev > tv / 2 i.e., take twice as long as necessary to reach end So if approximate tv to constant factor, also approximate ev to twice that constant factor But finding approximately optimum tv is k-TSP problem Constant factor approximation known Decompose into easy cases monotone monotone wiggly monotone wiggly Divides into independent problems > 2/3 of each wiggly path is excess Decomposition Analysis 2/3 of each wiggly path is excess That excess accumulates into whole path So, total excess of wiggly paths upper bounded by excess of whole path Conclude total length of wiggly paths upper bounded by 3/2 of path excess Use k-TSP to find approximately shortest wiggles collecting right amount of prize Approximates length, so approximates excess Over all wiggly parts, approximates total excess Dynamic program For each pair of vertices and each (discretized) prize value, find Shortest monotonic path collecting desired prize Approximately shortest wiggly path collecting desired prize Note: polynomially many subproblems Use dynamic programming to find optimum pasting together of subproblems Summary Showed maximum discount prize can be approximated by minimum excess path Showed how to approximate min-excess path Also solves “orienteering problem” Also solves “dual” of k-TSP where length is fixed Also solves “tree versions” of all these problems, e.g. prize-approximate k-MST Open Questions Directed graphs? We used k-TSP, only solved for undirected For directed, even standard TSP has no known constant factor approximation We only use k-TSP/undirectedness in wiggly parts Stochastic actions? Stochastic seems to imply directed Special case: forget rewards. Given choice of actions, choose actions to minimize cover time of graph Decoding Turbo Codes via Linear Programming Linear Programming Relaxations Basic Coding Problem Goal: transmit message across noisy channel that randomly perturbs parts of message Binary symmetric channel: each transmitted bit flipped independently with probability p < ½ Approach also works for AWGN channel Method: introduce redundancy in messages to cope with perturbations Compute encoding function mapping each information word ũ {0,1}k to codeword ỹ {0,1}n Receive and decode randomly perturbed ŷ Definitions Encoder Information word ũ Length k Codeword ỹ Length n Word error rate: is u = ũ ? Noisy Channel Bit error rate p Decoded info word u Decoder Decoded codeword y Corrupt codeword ŷ Decoding Received perturbed codeword usually not a codeword. Need rule for picking “best” decoding possibility Performance measure: probability of giving wrong answer Known as word error rate (WER) Won’t be zero, since with nonzero probability channel replaces codeword with a different, valid codeword (at that point, only natural to answer with wrong information word) Maximum Likelihood Decoding Specific decoding rule Choose info word whose codeword maximizes probability of received word Pr(ŷ | y) Assuming binary symmetric channel errors, this is just codeword at minimum hamming distance from received word More generally, linear cost function on codewords Under general assumptions, this is the “best” way to decode Drawback: generally NP-complete Turbo Codes A particular encoding approach Numerous fast heuristics for decoding Introduced in 1993 [Berrou, Glavieux, Thitimajshima] Simple linear time encoding (state machine) E.g. belief propagation But may not converge Fantastic in practice, but unclear why “distance” of code is bad (multiple codewords with few different bits: easily confused with each other) Some asymptotic analysis of “random turbo codes” Decoding by Linear Programming Describe polynomial time linear-programming approach to decoding (arbitrary) turbo codes Precise analysis for RA(2) codes Generalizes to LDPC codes Certifies when finds ML codeword Repeat accumulate are special type of turbo 2 means rate ½ : two code bits per info bit Based on analysis, explanation of how to build good RA(2) code Gives code with inverse-polynomial error bound ML Decoding as Integer Linear Programming Given corrupted codeword ŷ, want find codeword y of minimum hamming distance Note |y - ŷ| = Σ(1- 2ŷi)y + constant (linear function) Code C {0,1}n Polytope P as convex hull of C ML decoding wants minimum-cost vertex of P Errors in channel change ŷ, perturb objective Good code: small objective perturbations leave optimum vertex unchanged Linear Programming Relaxation Optimizing over P intractable Generally no tractable way to work with constraints (facets) defining P Find a tractable polytope “similar” to P, optimize over that instead New polytope will generally have additional, non-integral vertices P Relax P Q Properties of a Good Relaxation Relaxed LP should be tractable Relaxed polytope Q should contain original P Preferably few constraints Combinatorial structure to aid solution Ensures that true optimum is feasible in relaxation So Q optimum is lower bound on P optimum Relaxation should be “tight” Vertices of Q should be “close” to P Increases hope that optimum in Q will be valid in P Makes it easier to “round” Q solution into P Repeated Accumulate Codes Particular type of turbo code Accumulator: accumulates sum of bits mod 2 RA(m) code Make m copies of info word (total km bits) Apply given fixed permutation to bits Feed resulting sequence through accumulator Output sequence of accumulator values (km outputs) We focus on RA(2) code Trellis 0 0 1 ũ1 0 0 0 0 0 0 0 ỹ1 1 ỹ2 1 ỹ3 1 ỹ41 1 1 1 1 0 ũ3 0 ũ2 0 ũ3 Circles represent state of accumulator Each info bit appears at two transitions Sequence of states is codeword 0 0 0 ỹ5 1 1 0 0 ũ1 ũ2 0 ỹ6 1 Encoding with Trellis 0 0 0 1 0 0 1 1 0 0 1 0 1 0 0 1 0 1 0 1 0 1 0 0 1 1 0 0 ũ1 ũ3 ũ2 ũ3 ũ1 ũ2 0 1 1 1 0 1 Encode 011 Output 010110 0 1 Decoding with Trellis 0 0 0 0 1 0 1 1 ũ1 ỹ2=0 ỹ1=1 0 0 1 0 ũ3 =1 1 0 0 1 0 ũ2 1 0 1 0 ũ3 1 0 0 0 1 1 0 0 ũ1 ũ2 At each step, codeword says which state should transition to Read off ũi label on transition arc 1 Handling Errors Not every path through trellis is a codeword Need both occurrences of each info bit to agree (cause same transition) Call such a path agreeable Want path with fewest transitions not matching received codeword Give cost 0 to transitions matching received word, cost 1 to transitions not matching Shortest agreeable path is ML codeword Relaxation Shortest agreeable path is NP-complete Relax to min-cost agreeable flow Flow is convex combination of paths (exponential number of variables, but tractable) Alternatively, poly-size LP based on balance of incoming and outgoing flow Agreeability is a constraint that certain groups of transition edges carry same amount of flow Properties of Relaxation Any agreeable path is an agreeable flow Conclude: correct answer is feasible in relaxation Conclude: if min-cost agreeable flow is a path, then it is shortest agreeable path Certificate property: min-cost agreeable flow decoder will know it has found ML codeword Relaxation is tractable to solve Can directly solve via LP But actually exists reduction to standard min-cost flow, so can use specialized fast algorithms Performance When is correct codeword (path) the optimum for MCAF? Intuition from residual graphs in flow Build new graph by subtracting from each edge’s capacity amount of flow currently on edge A minimum cost flow is optimal if and only if there is no negative cost cycle in the residual graph (i.e., of positive residual capacity) Can push flow around cycle, reduce cost of flow Says local optimum is global optimum Generalize to MCAF True codeword is optimal if no negative cost agreeable cycle in residual graph What does agreeable cycle look like? Must diverge from correct codeword path at some point (traverses other label) Then traverses same labels as correct for a while Then may return to codeword path (again by traversing “opposite” label Each time traverses opposite label at some bit, agreeability requires it also traverse opposite label at “other copy” of that bit Trellis 0 0 1 ũ1 0 0 0 0 0 0 0 ỹ1 1 ỹ2 1 ỹ3 1 ỹ41 1 1 1 1 0 ũ3 0 ũ2 0 ũ3 Suppose all 0s sent If cycle uses 1-edge at second layer must also use 1-edge at 4th layer 0 0 0 ỹ5 1 1 0 0 ũ1 ũ2 0 ỹ6 1 An Easier Representation: the Tanner Graph Draw path along codeword Add “matching” edge between vertices corresponding to copies of same info bit Edge cost -1 if received bit flipped 1 otherwise All edge costs 0 Cycles in residual graph correspond to cycles in this graph G, and have same cost Hamiltonian edge = use same transition as codeword Matching edge = use opposite transition Picture ỹ6 ũ2 ũ1 1 -1 0 ỹ5 1 ỹ1 ũ3 0 -1 ỹ2 0 ũ1 ũ2 1 ỹ4 1 ũ3 ỹ3 Connection 0 ũ2 Circle paths == simple cycles in ỹ5 1 trellis Matching edges ũ1 for agreement 0 1 ũ1 0 0 0 0 ỹ1 ũ1 ỹ6 1 -1 0 0 ỹ4 1 1 0 ũ3 0 0 ỹ2 1 ỹ3 1 ỹ41 1 1 1 1 0 ũ2 -1 ỹ2 0 ỹ1 1 0 ũ3 ũ3 0 ũ3 ũ2 ỹ3 0 0 0 ỹ5 1 1 0 0 ũ1 ũ2 0 ỹ6 1 Main Theorem The true codeword is the min-cost agreeable flow solution iff there is no negative cost cycle in the Tanner graph G Suggests Good Codes Recall edge cost is -1 if bit flipped, which occurs with probability < ½ Intuition: if cycle is large, unlikely that more than half its edges flip So, good idea to build graph in which all cycle are large Achieve by building graph with larger girth=length of shortest cycle Erdos gives (Path+Matching) graph with girth log n (and this is best possible) Analysis Theorem: Using RA(2) code from Erdos Graph, if p < 2-4(ε+log 24)/2) then WER=Pr[negative cycle] < n-ε Proof: Negative cycle has length at least log n Break into subpaths of length log n One must be negative What is Pr[negative length log n path]? probability at most n-2-ε Degree 3 graph, so only n2 paths of length log n Add up Experiments Summary Combinatorial algorithm for decoding RA(2) codes Analysis of its error probability “Recommended code” based on analysis That code has polynomially small WER Experiments show good performance in practice But slow, since solving LP This approach gives insight, not algorithms Extensions Discovered that Wainright’s “Tree Reweighted Max Product” is solving dual of our LP Thus, TRMP has same performance Gives a belief-propogation flavored decoder with same performance as LP decoder Techniques extend to arbitrary LDPC codes Relaxation by intersecting “parity polytopes” for each parity check LP decoder (but flow decoder doesn’t extend) Breaking Result [FMSSW] Analysis of LP decoder for LDPC codes Proof that when code based on expander graph, LP decoder handles constant fraction of bits corrupted Gives proof of exponentially small error rate for polynomial-time decoder of LDPC codes. Learning Markov Random Fields Randomized Rounding Overview Fundamentals of Markov networks Maximum likelihood markov network structure as a maximum hypertree problem Tree-width and hypertrees An approximation algorithm for maximum hypertrees Reducing maximum hypertrees to/from maximum likelihood markov networks Density Estimation T observations x1,…,xT Each x1= x11,…, x1n a vector of n values Estimate joint probability distribution P(X1,…,Xn) from which samples were taken. (Assume observations i.i.d) Maximum likelihood approach Postulate “best fit” is maximum likelihood distribution Distribution that maximizes likelihood of data To avoid over-fitting, limit choice to within some (parametric) class. Equivalent to projecting onto our class to find “nearest neighbor” of empirical distribution Our approach applies to this general distribution projection problem Markov Random Networks Representations of joint distributions with limited dependence. Variables x1…xn Graph with variables as vertices Edges represent dependencies Conditioned on (any specific values of) its neighbors, xi is independent of all other variables. More generally, two sides of any separator are independent conditioned on separator Problems Markov net inference Markov net learning Given data and specific Markov net, find parameter settings that best fit data Given data, find Markov net structure that best fits data (under best parameter settings) For us, “best fits”=“maximum likelihood” Hammersley Clifford Theorem Cliques in Markov network are “special”: no restriction on their marginal distribution A distribution P is a Markov network over graph G iff P factorizes over the cliques of G: P( x) j ( x ) h hClique(G ) h Note jh assigns a value to each possible setting xh of values of variables in clique h Value of HC HC gives a “concise” representation of any MRF probability distribution function Only need to specify clique potentials If variables take s values, then potential on size-k clique represented by sk values If cliques are small (constant size) then representation is small (linear size) Limits of HC Cannot use HC factorization to compute important quantities Normalization of j (can’t even tell if needed) Marginal probability distributions (even 1 variable) Conditional probability distributions (ditto) Maximum likelihood parameter settings (finding j to fit data) Triangulated Markov Networks No minimal cycles of more then three vertices. X1 X2 X5 X3 X4 X6 Benefits of Triangulation Efficient (linear time) exact calculations Marginals Conditional distributions Just about anything else (via canonical dynamic program) Explicit Hammersley-Clifford factorization Efficient inference: calculation of maximum likelihood j to fit observed data Efficient Calculation Triangulated graph has elimination ordering Canonical dynamic program Order of deleting vertices such that when vertex is deleted, its surviving neighbors form a clique Run backwards, adds each vertex to a clique Memoize values on each clique (eg, distribution) Gives necessary info to add new vertex Memo-table size exponential in clique sizes Happy if max clique size small HC Factorization of Triangulated Graphs PX ( x) j ( x ) h h (as before) Cliques h P( xh ) j h ( xh ) j h' ( xh' ) h ' h (new: explicit j) ML Inference on Triangulated Graphs Fixed network given Choose parameters (potentials j) to maximize likelihood of observations In triangulated network, do so by making marginals correct on cliques i.e., want derived P(xh) equal to empirical distribution P*(xh) for each clique h Achieve by plugging P* into explicit formula for j Inference on Triangulated Graphs PX ( x) * (x ) j h h (as before) Cliques h P*( xh ) j h* ( xh ) j h*' ( xh' ) h ' h (new: explicit j) Triangulation Non-triangulated Markov network can be triangulated by adding edges Adding edges removes independence constraints, so broadens class of models Find large minimal cycles, add a chords So, only increases “fit” to data (maximum likelihood) And, makes computations tractable Treewidth Could just add all edges (complete graph is triangulated) Drawbacks: Dynamic programs exponential in clique size Number of model parameters exponential in clique size: leads to over-fitting Treewidth of a graph: minimum over all triangulations, of the maximum clique size of the triangulation, minus one Markov Net Learning Given data, wish to find MRN of tree-width at most k that maximizes likelihood of observed data Equivalently, since triangulation can only increase likelihood, wish to find maximum likelihood triangulated graph with clique size at most k+1 We call such a graph a k-hyperforest If maximal, k-hypertree Computing (Log-)Likelihood log P r[x1 , , xT ] log P r[x t ] t log t j h hClique( G ) t h (x ) t x ( j log h h) * P T h ( xh ) logjh ( xh ) hClique( G ) t hClique( G ) xh Additive Weights For triangulated G, HC gives explicit formulation for maximum likelihood j : * P ( xh ) j ( xh ) * jh' ( xh' ) * h h ' h Key: each j*h is independent of graph choice! * * Set wh Ph ( xh ) logjh ( xh ) Then log Pr[x1 ,, xT ] T wh hClique( G ) New formulation Max-likelihood value of G is just sum of weights of cliques it contains So, given weights on cliques of size up to k+1, want to find a hypertree (triangulated graph of treewidth at most k) containing maximum weight of cliques This is the maximum hypertree problem Chow Liu (1968): k=1 Treewidth 1 is just a tree Edges are cliques of size 2 Weight wh on edge turns out to be mutual information between the variables on its endpoints. Maximum likelihood tree is the maximum spanning tree with: w(v, u) I ( X v ; X u ) Polynomial time solvable Larger Treewidths Theorem: for k >1, maximum hypertree problem is NP-complete Reduction from SAT [S]: conclude ML MRN NP-complete So, seek approximation algorithm: Given optimum has value w, find (in polynomial time) some solution with value at least w/a We give an algorithm with a=8kk!(k+1)! “Constant” for any fixed k Idea: Locally Testable Structure “Treewidth k” is a global constraint, hard to aim for Define windmill, an object with local characterization Every hypertree contains a windmill with at least 1/(k+1)! of its weight Algorithm to find windmill of approximately (factor 1/8kk!) maximum weight Star graphs Covering 11 of the 15 edges of a tree with disjoint stars: Partitioning a tree into two sets of disjoint stars Conclusion: some set of disjoint stars contains at least 1/2 the edge weight of any given tree. Windmills A k-windmill is defined by a depth-k rooted tree Its hyperedges are all the paths from the root It has treewidth k 1-windmill is star Windmill Theorem Windmill farm: collection of disjoint windmills Theorem: any weighted k-hypertree contains a k-windmill-farm with at least a 1/(k+1)! fraction of the weight k-color the hypertree so no edge has repeated colors All edges that get colors “in same order” form a windmill farm Only k! orders Idea: Randomized Rounding Already saw idea of ILP relaxations Define integer linear program solving problem (not convex) Ignore integrality constraints; solve LP (convex) Now, want integral solution Round fractional solution to integral by setting variable of value 0 < x < 1 equal to 1 with probability x and 0 otherwise Gives integer solution where all constraints satisfied in expectation Special Case: Layered 2-windmills Variable yuv set to 1 if u is root and v is child of u Variable zuvx set to 1 if x is a child of v which is a child of u Objective function S wuvx zuvx Consistency constraint zuvx yuv Single-parent constraint Su yuv 1 Rounding Consider constraint Su yuv 1 Means yuv form a probability distribution on choices of parent for v Choose parents according to this distribution Objective function was S wuvx zuvx We can “keep” wuvx if yuv was set to 1 So expected kept value is S wuvx yuv > S wuvx zuvx i.e., rounded solution matches LP objective value! (white lie) Open problems We have a constant factor approximation for constant k, but it’s a pretty bad constant! Two separate reasons for badness Windmill theorem may be very loose (but we have examples of gap exceeding k) Better windmill approximations? Multilevel facility location? Use idea of “restricted, locally testable class of MRNs” to find other, practical and tractable subsets Hardness of approximation? Conclusion Near Neighbor Structure Concepts Random sampling Memoization Greedy improvement Local search Impact Simple Good constants Good experimental performance So likely useful in practice One-shot MDPs Concepts Approximation Algorithms Reductions to related problems Dynamic Programming Impact Solved an open theory problem Unlikely to be a good practical performer But suggests heuristics And poses good next questions Turbo Decoding Concepts LP relaxations Flows Graph theory Impact Probably don’t want to use LP in DSP chips Strengthens theoretical grounds/explanations for good performance of certain codes Provides insight for improving codes and decoding algorithms Hypertrees Concepts Randomized rounding of LP relaxations Impact Unlikely to be useful in practice Possible application to branch and bound optimization methods Suggests some useful “special” classes of lowtreewidth graphs (windmills) that may be more tractable to construct Working with Algorithms Many NIPS problems would be trivial given infinite computational resources Good algorithms can simulate such resources To apply algorithmic methods, want welldefined algorithmic problem Large collection of available tools Large collection of people who know them Avoid defining problem via algorithmic solution Dialogue can help define true problem Even oversimplified solution may give insight Acknowledgements Near Neighbor Searching MDPs With Avrim Blum, Shuchi Chawla, Terran Lane, Adam Meyerson, Maria Minkoff Turbo Decoding With Matthias Ruhl With Jon Feldman, Martin Wainright Learning Markov Models With Nati Srebro Conculsion Lots of interesting NIPS problems! Techniques from theoretical computer science can be applied Toolbox of prior approximation algorithms Combinatorial structure of problems Wanted: more problems Value in both definition and solution http://theory.lcs.mit.edu/~karger/Talks/NIPS.ppt