Learning Submodular Functions
Nick Harvey, University of Waterloo
Joint work with Nina Balcan, Georgia Tech

Submodular functions
Let V = {1,2,…,n} and f : 2^V → R.
• Submodularity: f(S) + f(T) ≥ f(S ∩ T) + f(S ∪ T)  ∀ S,T ⊆ V.
• Equivalently, decreasing marginal values: f(S ∪ {x}) − f(S) ≥ f(T ∪ {x}) − f(T)  ∀ S ⊆ T ⊆ V, x ∉ T.
• Monotone: f(S) ≤ f(T)  ∀ S ⊆ T.
• Non-negative: f(S) ≥ 0  ∀ S ⊆ V.
Examples:
• Vector spaces: let V = {v_1,…,v_n} with each v_i ∈ R^n; for each S ⊆ V, let f(S) = rank(V[S]).
• Concave functions: let h : R → R be concave; for each S ⊆ V, let f(S) = h(|S|).

Submodular functions
• Strong connection between optimization and submodularity, e.g. minimization [C'85, GLS'87, IFF'01, S'00, …] and maximization [NWF'78, V'07, …].
• Algorithmic game theory: submodular utility functions.
• Much recent interest in the machine learning community: tutorials at major conferences (ICML, NIPS, etc.); www.submodularity.org is a machine learning site.
• So it is interesting to understand the learnability of submodular functions.

Exact Learning with value queries [Goemans, Harvey, Iwata, Mirrokni, SODA 2009]
• The algorithm adaptively queries points x_i ∈ {0,1}^n and receives the values f(x_i), for i = 1,…,q, where q = poly(n).
• The algorithm then produces a "hypothesis" g : {0,1}^n → R. (Hopefully g ≈ f.)
• Goal: g(x) ≤ f(x) ≤ α·g(x)  ∀ x ∈ {0,1}^n, with α as small as possible.
• Theorem (upper bound): there is an algorithm for learning a submodular function with α = Õ(n^{1/2}).
• Theorem (lower bound): any algorithm for learning a submodular function must have α = Ω̃(n^{1/2}).

Problems with this model
• In learning theory, one usually only tries to predict the value of most points.
• The GHIM lower bound fails if the goal is to do well on most of the points.
• To define "most", we need a distribution on {0,1}^n.
• Is there a distributional model for learning submodular functions?

Our Model
• There is a distribution D on {0,1}^n and a target function f : {0,1}^n → R+.
• The algorithm sees examples (x_1, f(x_1)), …, (x_q, f(x_q)), where the x_i's are i.i.d. from D.
• The algorithm produces a hypothesis g : {0,1}^n → R+. (Hopefully g(x) ≈ f(x) on a fresh x ~ D.)
• Guarantee: Pr_{x_1,…,x_q}[ Pr_x[ g(x) ≤ f(x) ≤ α·g(x) ] ≥ 1−ε ] ≥ 1−δ.
• "Probably Mostly Approximately Correct" (PMAC) learning.
• This is impossible if f is arbitrary and the number of training points is ≪ 2^n.
• It is possible if f is a non-negative, monotone, submodular function.

Example: Concave Functions
• Let h : R → R be concave and, for each S ⊆ V, let f(S) = h(|S|).
• Claim: f is submodular.
• We prove a partial converse.
[Figure: a concave curve of f(S) plotted against |S|, from ∅ up to V.]

Theorem (informal): every submodular function looks like this — with lots of caveats: approximately, usually.

Theorem: Let f be a non-negative, monotone, submodular, 1-Lipschitz function (e.g., a matroid rank function). There exists a concave function h : [0,n] → R such that, for any ε > 0, for every k ∈ {0,…,n}, and for a 1−ε fraction of the sets S ⊆ V with |S| = k, we have
    h(k) ≤ f(S) ≤ O(log²(1/ε))·h(k).
In fact, h(k) is just E[f(S)], where S is uniform over sets of size k.
Proof: based on Talagrand's concentration inequality.
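To make the theorem above concrete, here is a small numerical sketch (not from the talk). It uses a partition-matroid rank function — a non-negative, monotone, submodular, 1-Lipschitz function — with illustrative parts and capacities, samples random k-sets, and observes that f(S) concentrates around h(k) = E[f(S)].

```python
# Illustrative only: a partition-matroid rank function (monotone, submodular,
# 1-Lipschitz) on n = 60 elements split into 6 parts of 10, each with capacity 3.
import random

n, num_parts, cap = 60, 6, 3
parts = [set(range(j * 10, (j + 1) * 10)) for j in range(num_parts)]

def f(S):
    """Partition-matroid rank: sum over parts of min(|S ∩ part|, capacity)."""
    S = set(S)
    return sum(min(len(S & P), cap) for P in parts)

# Sample random sets of a fixed size k and look at the spread of f(S).
k, trials = 20, 2000
samples = [f(random.sample(range(n), k)) for _ in range(trials)]
h_k = sum(samples) / trials          # empirical h(k) = E[f(S)]
print(f"h({k}) ~= {h_k:.2f}, observed f(S) range: [{min(samples)}, {max(samples)}]")
# The observed values stay within a small multiplicative factor of h(k),
# in the spirit of the O(log^2(1/eps)) concentration stated above.
```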
Learning Submodular Functions under any product distribution
• Now D is a product distribution on {0,1}^n and f : {0,1}^n → R+ is non-negative, monotone, and submodular; the algorithm sees i.i.d. examples (x_i, f(x_i)) and outputs g : {0,1}^n → R+.
• Algorithm: let μ = Σ_{i=1}^{q} f(x_i) / q, and let g be the constant function with value μ.
• This achieves approximation factor O(log²(1/ε)) on a 1−ε fraction of points, with high probability.
• Proof: essentially follows from the previous theorem.

Learning Submodular Functions under an arbitrary distribution?
• The same argument no longer works: Talagrand's inequality requires a product distribution.
• Intuition: a non-uniform distribution focuses on fewer points, so the function is less concentrated on those points.

A General Upper Bound?
• Theorem (our upper bound): there is an algorithm for learning a submodular function with respect to an arbitrary distribution that has approximation factor O(n^{1/2}).

Computing Linear Separators
• Given {+,−}-labeled points in R^n, find a hyperplane c^T x = b that separates the +'s from the −'s.
• Easily solved by linear programming.

Learning Linear Separators
• Given a random sample of {+,−}-labeled points in R^n, find a hyperplane c^T x = b that separates most of the +'s from the −'s (some error is allowed).
• A classic machine learning problem.
• Classic theorem [Vapnik–Chervonenkis 1971]: Õ(n/ε²) samples suffice to get error ε.

Submodular Functions are Approximately Linear
• Let f be non-negative, monotone, and submodular.
• Claim: f can be approximated to within a factor n by a linear function g.
• Proof sketch: let g(S) = Σ_{s∈S} f({s}). Then f(S) ≤ g(S) ≤ n·f(S): the first inequality follows from decreasing marginal values together with non-negativity, and the second from monotonicity, since each f({s}) ≤ f(S).
[Figure: the three functions f ≤ g ≤ n·f plotted over subsets of V.]
• Randomly sample {S_1,…,S_q} from the distribution.
• For each S_i, create a +-labeled example from f(S_i) and a −-labeled example from n·f(S_i) (each example pairs the indicator vector of S_i with the corresponding value, a point in R^{n+1}).
• Now just learn a linear separator!
• Theorem: the resulting g approximates f to within a factor n on a 1−ε fraction of the distribution.
• This can be improved to factor O(n^{1/2}) by the GHIM lemma: the ellipsoidal approximation of submodular functions.

A Lower Bound?
• A non-uniform distribution focuses on fewer points, so the function is less concentrated on those points.
• Can we create a submodular function with lots of deep "bumps"? Yes!

A General Lower Bound
• Theorem (our general lower bound): no algorithm can PMAC-learn the class of non-negative, monotone, submodular functions with an approximation factor õ(n^{1/3}).
• Plan: use the fact that matroid rank functions are submodular, and construct a hard family of matroids.
• Pick A_1,…,A_m ⊂ V with |A_i| = n^{1/3} and m = n^{log n}.
[Figure: the function takes a high value n^{1/3} on typical sets of that size, but dips to a low value log² n on each of the bumps A_1, A_2, A_3, ….]

Matroids
• Ground set V, family of independent sets I.
• Axioms:
  – ∅ ∈ I  ("nonempty")
  – J ⊂ I ∈ I ⇒ J ∈ I  ("downwards closed")
  – J, I ∈ I and |J| < |I| ⇒ ∃ x ∈ I∖J such that J + x ∈ I  ("maximum-size sets can be found greedily")
• Rank function: r(S) = max { |I| : I ∈ I and I ⊆ S }.
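The exchange axiom is what makes a greedy computation of the rank function valid. Below is a minimal sketch (not from the talk; the `is_independent` oracle and the example vectors are hypothetical) of computing r(S) greedily from an independence oracle, using the vector-space example from the start of the talk.

```python
# A minimal sketch: compute r(S) = max{|I| : I in I, I ⊆ S} greedily from an
# independence oracle.  Greedy is correct exactly because of the exchange axiom.
import numpy as np

def rank(S, is_independent):
    I = set()
    for x in S:                       # scan the elements of S in any order
        if is_independent(I | {x}):   # keep x whenever it preserves independence
            I.add(x)
    return len(I)                     # all maximal independent subsets of S have this size

# Vector-space example: independence = linear independence of the chosen vectors.
vecs = {0: [1, 0, 0], 1: [0, 1, 0], 2: [1, 1, 0], 3: [0, 0, 1]}
def lin_indep(I):
    return len(I) == 0 or np.linalg.matrix_rank([vecs[i] for i in I]) == len(I)

print(rank({0, 1, 2, 3}, lin_indep))  # 3: e.g. {v_0, v_1, v_3} spans R^3
```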
Examples of rank functions:
• Uniform matroid: r(S) = |S| if |S| ≤ k, and r(S) = k otherwise; i.e., r(S) = min{ |S|, k }.
• One bump: fix A ⊆ V with |A| = k and let
    r(S) = |S| if |S| ≤ k and S ≠ A,   r(A) = k−1,   r(S) = k otherwise.
• Many bumps: fix A = {A_1, A_2, …, A_m} with |A_i| = k ∀i and let
    r(S) = |S| if |S| ≤ k and S ∉ A,   r(S) = k−1 if S ∈ A,   r(S) = k otherwise.
  Claim: r is submodular if |A_i ∩ A_j| ≤ k−2 ∀ i ≠ j; in that case r is the rank function of a "paving matroid".
• If the algorithm sees examples only outside A, then f cannot be predicted on the bumps.
• Delete half of the bumps at random:
    r(S) = |S| if |S| ≤ k,   r(S) = k−1 if S ∈ A and S wasn't deleted,   r(S) = k otherwise.
  If m is large, the algorithm cannot learn which bumps were deleted, so any algorithm for learning f has additive error 1.
• Can we force a bigger error with bigger bumps? Yes! But we need to generalize paving matroids, and the family A needs very strong properties.

The Main Question
• Let V = A_1 ∪ … ∪ A_m and b_1,…,b_m ∈ N.
• Is there a matroid such that r(A_i) ≤ b_i ∀i, while r(S) is "as large as possible" for the sets S other than the A_i's? (This is not formal; next we formalize it.)
• If the A_i's are disjoint, the solution is a partition matroid.
• If the A_i's are "almost disjoint", can we find a matroid that is "almost" a partition matroid?

Lossless Expander Graphs
• Definition: a bipartite graph G = (U ∪ V, E) is a (D,K,ε)-lossless expander if
  – every u ∈ U has degree D, and
  – |Γ(S)| ≥ (1−ε)·D·|S|  ∀ S ⊆ U with |S| ≤ K, where Γ(S) = { v ∈ V : ∃ u ∈ S s.t. {u,v} ∈ E }.
• "Every small left-set has a nearly-maximal number of right-neighbors."
• Equivalently: "Neighborhoods of left-vertices are K-wise almost disjoint."
• Trivial case: if the left-vertices have disjoint neighborhoods, this gives an expander with ε = 0 and K = ∞.

Main Theorem: Trivial Case
• Suppose G = (U ∪ V, E) has disjoint left-neighborhoods.
• Let A = {A_1,…,A_m} be defined by A = { Γ(u) : u ∈ U }, and let b_1,…,b_m be non-negative integers.
• Theorem (partition matroid):
    I = { I : |I ∩ A_j| ≤ b_j ∀j } = { I : |I ∩ ∪_{j∈J} A_j| ≤ Σ_{j∈J} b_j ∀J }
  is the family of independent sets of a matroid.

Main Theorem
• Let G = (U ∪ V, E) be a (D,K,ε)-lossless expander.
• Let A = {A_1,…,A_m} be defined by A = { Γ(u) : u ∈ U }.
• Let b_1,…,b_m satisfy b_i ≥ 4εD ∀i.
• "Desired theorem": I is a matroid, where I = { I : |I ∩ ∪_{j∈J} A_j| ≤ Σ_{j∈J} b_j ∀J }.
• Theorem: I is a matroid, where
    I = { I : |I ∩ ∪_{j∈J} A_j| ≤ Σ_{j∈J} b_j − ( Σ_{j∈J} |A_j| − |∪_{j∈J} A_j| )  ∀ J with |J| ≤ K,  and |I| ≤ εDK }.
• Trivial case: G has disjoint neighborhoods, i.e., K = ∞ and ε = 0. Then 4εD = 0, the correction term Σ_{j∈J}|A_j| − |∪_{j∈J} A_j| equals 0, and the side conditions on |J| and |I| become vacuous, so the theorem recovers the partition matroid above.
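For very small instances, the independence condition above can be checked by brute force. The sketch below uses toy sets A_j, bounds b_j, and a `cap` parameter standing in for the |I| ≤ εDK constraint — none of it is the paper's actual construction; with disjoint A_j's and K = 1 it reduces to the partition-matroid test |I ∩ A_j| ≤ b_j.

```python
# Brute-force sketch of the independence test above, for tiny instances only.
# A, b, K, and `cap` (standing in for the |I| <= eps*D*K bound) are toy values.
from itertools import combinations

def independent(I, A, b, K, cap):
    if len(I) > cap:
        return False
    for r in range(1, K + 1):                      # all index sets J with |J| <= K
        for J in combinations(range(len(A)), r):
            union = set().union(*(A[j] for j in J))
            overlap = sum(len(A[j]) for j in J) - len(union)   # sum|A_j| - |union of A_j|
            if len(set(I) & union) > sum(b[j] for j in J) - overlap:
                return False
    return True

# Disjoint A_j's with K = 1: exactly the partition-matroid test |I ∩ A_j| <= b_j.
A = [set(range(0, 4)), set(range(4, 8))]
b = [2, 2]
print(independent({0, 1, 4}, A, b, K=1, cap=8))     # True
print(independent({0, 1, 2, 4}, A, b, K=1, cap=8))  # False: 3 elements of A_0 but b_0 = 2
```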
LB for Learning Submodular Functions
• How deep can we make the "valleys"?
[Figure: bumps A_1, A_2, … of size n^{1/3} on which the function dips from ≈ n^{1/3} down to ≈ log² n.]
• Let G = (U ∪ V, E) be a (D,K,ε)-lossless expander with A_i = Γ(u_i), where
  – |V| = n,
  – D = K = n^{1/3},
  – |U| = n^{log n},
  – ε = log²(n) / n^{1/3}.
• Such graphs exist by the probabilistic method.
• Lower bound proof:
  – Delete each node in U with probability ½, then use the main theorem to get a matroid.
  – If u_i ∈ U was not deleted, then r(A_i) ≤ b_i = 4εD = O(log² n).
  – Claim: if u_i was deleted, then A_i ∈ I (this needs a proof), so r(A_i) = |A_i| = D = n^{1/3}.
  – Since the number of A_i's is |U| = n^{log n}, no algorithm can learn a significant fraction of the r(A_i) values in polynomial time. (A toy numerical illustration of this counting step follows the open questions.)

Summary
• PMAC model for learning real-valued functions.
• Learning under arbitrary distributions:
  – factor O(n^{1/2}) algorithm,
  – factor Ω̃(n^{1/3}) hardness (information-theoretic).
• Learning under product distributions: factor O(log(1/ε)) algorithm.
• A new general family of matroids, generalizing partition matroids to non-disjoint parts.

Open Questions
• Improve the Ω̃(n^{1/3}) lower bound to Ω̃(n^{1/2}).
• Explicit construction of the expanders.
• Non-monotone submodular functions: any algorithm? A lower bound better than Ω̃(n^{1/3})?
• For the algorithm under the uniform distribution, relax the 1-Lipschitz condition.
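As a toy illustration of the counting step in the lower-bound proof above (the numbers below are small stand-ins for the actual n^{log n} bumps and poly(n) samples), a polynomial number of examples reveals the deleted/kept status of only a negligible fraction of the bumps; on every other bump the value is an unbiased coin flip between n^{1/3} and ~log² n, so no hypothesis can be accurate on much more than half of them.

```python
# Toy stand-in numbers for the counting step: m "bumps", q i.i.d. examples.
import random

m = 10**7          # number of bumps (n^{log n} in the real construction)
q = 10**3          # number of training examples (poly(n))
seen = {random.randrange(m) for _ in range(q)}     # bumps the learner could observe
print(f"fraction of bumps whose status is revealed: {len(seen) / m:.6f}")
# On every unseen bump the value is n^{1/3} or ~log^2 n with probability 1/2 each,
# independently of the sample, so any hypothesis errs by a large factor on ~half of them.
```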