Learning Submodular Functions

Nick Harvey
University of Waterloo
Joint work with
Nina Balcan, Georgia Tech
Submodular functions
Submodularity:
V = {1, 2, …, n},  f : 2^V → ℝ
f(S) + f(T) ≥ f(S ∩ T) + f(S ∪ T)   ∀ S, T ⊆ V
Equivalently, decreasing marginal values:
f(S ∪ {x}) − f(S) ≥ f(T ∪ {x}) − f(T)   ∀ S ⊆ T ⊆ V, x ∉ T
Examples:
• Vector spaces: Let V = {v1, …, vn}, each vi ∈ ℝ^n.
  For each S ⊆ V, let f(S) = rank(V[S]).
• Concave functions: Let h : ℝ → ℝ be concave.
  For each S ⊆ V, let f(S) = h(|S|).
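The definitions above are easy to sanity-check numerically. Below is a minimal Python sketch (not from the talk): a brute-force test of the submodular inequality applied to the two examples. The helper name `is_submodular`, the random matrix `M`, and the choice h(t) = √t are illustrative assumptions.

```python
import itertools
import numpy as np

def is_submodular(f, ground_set):
    """Brute-force check of f(S) + f(T) >= f(S | T) + f(S & T) over all pairs.
    Only practical for small ground sets."""
    subsets = [frozenset(c) for r in range(len(ground_set) + 1)
               for c in itertools.combinations(ground_set, r)]
    return all(f(S) + f(T) >= f(S | T) + f(S & T)
               for S in subsets for T in subsets)

# Example 1: matrix rank.  f(S) = rank of the columns of M indexed by S.
rng = np.random.default_rng(0)
M = rng.integers(0, 2, size=(4, 4))
rank_fn = lambda S: np.linalg.matrix_rank(M[:, sorted(S)]) if S else 0

# Example 2: concave function of the cardinality, f(S) = h(|S|) with h(t) = sqrt(t).
concave_fn = lambda S: np.sqrt(len(S))

ground = range(4)
print(is_submodular(rank_fn, ground))     # True
print(is_submodular(concave_fn, ground))  # True
```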
Submodular functions
Submodularity:
V = {1, 2, …, n},  f : 2^V → ℝ
f(S) + f(T) ≥ f(S ∩ T) + f(S ∪ T)   ∀ S, T ⊆ V
Equivalently, decreasing marginal values:
f(S ∪ {x}) − f(S) ≥ f(T ∪ {x}) − f(T)   ∀ S ⊆ T ⊆ V, x ∉ T
Monotone: f(S) ≤ f(T)   ∀ S ⊆ T
Non-negative: f(S) ≥ 0   ∀ S ⊆ V
Submodular functions
• Strong connection between optimization and submodularity
• e.g.: minimization [C’85,GLS’87,IFF’01,S’00,…],
maximization [NWF’78,V’07,…]
• Algorithmic game theory
• Submodular utility functions
• Much recent interest in the machine learning community
• Tutorials at major conferences: ICML, NIPS, etc.
• www.submodularity.org is a Machine Learning site
• Interesting to understand their learnability
Exact Learning with value queries
Goemans, Harvey, Iwata, Mirrokni (SODA 2009)
[Diagram: value oracle for f : {0,1}^n → ℝ; the Algorithm queries x_1, x_2, … and receives f(x_1), f(x_2), …, then outputs g : {0,1}^n → ℝ]
• Algorithm adaptively queries x_i and receives value f(x_i), for i = 1, …, q, where q = poly(n).
• Algorithm produces a “hypothesis” g. (Hopefully g ≈ f.)
• Goal: g(x) ≤ f(x) ≤ α·g(x)   ∀ x ∈ {0,1}^n, with α as small as possible.
Exact Learning with value queries
Goemans, Harvey, Iwata, Mirrokni (SODA 2009)
Theorem (upper bound): There is an algorithm for learning a submodular function with α = Õ(n^{1/2}).
Theorem (lower bound): Any algorithm for learning a submodular function must have α = Ω̃(n^{1/2}).
• Algorithm adaptively queries x_i and receives value f(x_i), for i = 1, …, q.
• Algorithm produces a “hypothesis” g. (Hopefully g ≈ f.)
• Goal: g(x) ≤ f(x) ≤ α·g(x)   ∀ x ∈ {0,1}^n, with α as small as possible.
Problems with this model
• In learning theory, one usually only tries to predict the value of most points
• The GHIM lower bound fails if the goal is to do well on most of the points
• To define “most”, we need a distribution on {0,1}^n
Is there a distributional model for learning submodular functions?
Our Model
[Diagram: distribution D on {0,1}^n; target f : {0,1}^n → ℝ+; the Algorithm receives samples x_i with values f(x_i) and outputs g : {0,1}^n → ℝ+]
• Algorithm sees examples (x1, f(x1)), …, (xq, f(xq)), where the xi’s are i.i.d. from distribution D
• Algorithm produces a “hypothesis” g. (Hopefully g ≈ f.)
Our Model
[Diagram: a fresh test point x is drawn from D; is f(x) ≈ g(x)?]
• Algorithm sees examples (x1, f(x1)), …, (xq, f(xq)), where the xi’s are i.i.d. from distribution D
• Algorithm produces a “hypothesis” g. (Hopefully g ≈ f.)
• Pr_{x1,…,xq}[ Pr_x[ g(x) ≤ f(x) ≤ α·g(x) ] ≥ 1−ε ] ≥ 1−δ
• “Probably Mostly Approximately Correct”
Our Model
[Diagram: distribution D on {0,1}^n; target f : {0,1}^n → ℝ+; hypothesis g : {0,1}^n → ℝ+; is f(x) ≈ g(x)?]
• “Probably Mostly Approximately Correct”
• Impossible if f is arbitrary and the number of training points is ≪ 2^n
• Possible if f is a non-negative, monotone, submodular function
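As a concrete reading of the success criterion, here is a small Monte Carlo sketch that estimates the probability that g(x) ≤ f(x) ≤ α·g(x) fails; the target f, the constant hypothesis g, the value of α, and the sampler are illustrative assumptions, not the talk's.

```python
import numpy as np

def pmac_failure_rate(f, g, alpha, sample_D, trials=10_000, seed=0):
    """Estimate Pr_x[ not (g(x) <= f(x) <= alpha * g(x)) ] for x ~ D.
    In the PMAC model this should be at most eps, with probability
    at least 1 - delta over the training sample that produced g."""
    rng = np.random.default_rng(seed)
    bad = sum(not (g(x) <= f(x) <= alpha * g(x))
              for x in (sample_D(rng) for _ in range(trials)))
    return bad / trials

# Hypothetical example on n = 20 elements.
n = 20
f = lambda S: min(len(S), 5)                                  # a matroid rank function
g = lambda S: 2.5                                             # a constant hypothesis
sample_uniform = lambda rng: {i for i in range(n) if rng.random() < 0.5}
print(pmac_failure_rate(f, g, alpha=2.0, sample_D=sample_uniform))   # close to 0
```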
Example: Concave Functions
[Plot: a concave function h]
• Concave functions: Let h : ℝ → ℝ be concave.
Example: Concave Functions
[Figure: f(S) plotted over the subsets from ∅ up to V; the value depends only on |S| and traces the concave curve h]
• Concave functions: Let h : ℝ → ℝ be concave. For each S ⊆ V, let f(S) = h(|S|).
• Claim: f is submodular.
• We prove a partial converse.
Theorem (informal): Every submodular function looks like this — or rather, lots of them do, approximately, usually.
Theorem:
Let f be a non-negative, monotone, submodular, 1-Lipschitz function (e.g., a matroid rank function).
There exists a concave function h : [0, n] → ℝ such that, for any ε > 0, for every k ∈ {0, …, n},
and for a 1−ε fraction of the sets S ⊆ V with |S| = k, we have:
  h(k) ≤ f(S) ≤ O(log²(1/ε))·h(k).
In fact, h(k) is just E[ f(S) ], where S is uniform on sets of size k.
Proof: based on Talagrand’s concentration inequality.
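Since h(k) is just E[f(S)] over uniformly random k-element sets, it can be estimated by sampling. A small sketch; the random-matrix rank function is an illustrative stand-in for f.

```python
import numpy as np

def estimate_h(f, n, k, samples=300, seed=0):
    """Monte Carlo estimate of h(k) = E[ f(S) ] over uniformly random S with |S| = k."""
    rng = np.random.default_rng(seed)
    vals = [f(frozenset(rng.choice(n, size=k, replace=False))) for _ in range(samples)]
    return float(np.mean(vals))

# Illustrative 1-Lipschitz submodular f: rank of a random 0/1 matrix restricted to columns S.
n = 30
M = np.random.default_rng(1).integers(0, 2, size=(10, n))
f = lambda S: int(np.linalg.matrix_rank(M[:, sorted(S)])) if S else 0

profile = [estimate_h(f, n, k) for k in range(n + 1)]
print(profile)   # increases and flattens out: roughly a concave profile in k
```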
Learning Submodular Functions
under any product distribution
[Diagram: product distribution D on {0,1}^n; target f : {0,1}^n → ℝ+; the Algorithm sees samples (x_i, f(x_i)) and outputs g : {0,1}^n → ℝ+]
• Algorithm: let μ = Σ_{i=1}^{q} f(x_i) / q
• Let g be the constant function with value μ
• This achieves approximation factor O(log²(1/ε)) on a 1−ε fraction of points, with high probability.
• Proof: essentially follows from the previous theorem.
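The learner itself is a one-liner; here is a minimal sketch, where the target f and the product distribution in the toy run are illustrative assumptions.

```python
import numpy as np

def learn_constant(training_values):
    """The learner from the slide: output the constant hypothesis whose value is the
    empirical mean of the observed f-values."""
    mu = sum(training_values) / len(training_values)
    return lambda x: mu

# Toy run: f(S) = min(|S|, 10) under the product distribution with p = 1/2 on n = 100 elements.
n, q = 100, 500
rng = np.random.default_rng(0)
f = lambda S: min(len(S), 10)
train = [f({i for i in range(n) if rng.random() < 0.5}) for _ in range(q)]
g = learn_constant(train)
print(g(set()))   # about 10, the typical value of f under this distribution
```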
Learning Submodular Functions
under an arbitrary distribution?
[Figure: f(S) over the subsets from ∅ to V, now with deep dips at a few sets]
• The same argument no longer works: Talagrand’s inequality requires a product distribution.
• Intuition: a non-uniform distribution focuses on fewer points, so the function is less concentrated on those points.
A General Upper Bound?
• Theorem (our upper bound): There is an algorithm for learning a submodular function
  w.r.t. an arbitrary distribution that has approximation factor O(n^{1/2}).
Computing Linear Separators
[Figure: points in the plane labeled + and −, separated by a hyperplane]
• Given {+,−}-labeled points in ℝ^n, find a hyperplane c^T x = b that separates the +s from the −s.
• Easily solved by linear programming.
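For concreteness, here is a minimal sketch of that LP as a unit-margin feasibility problem with a zero objective; the helper `separating_hyperplane` and the toy points are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import linprog

def separating_hyperplane(pos, neg):
    """Find (c, b) with c.x >= b + 1 on the + points and c.x <= b - 1 on the - points,
    posed as an LP feasibility problem (zero objective).  Returns None if infeasible."""
    pos, neg = np.asarray(pos, float), np.asarray(neg, float)
    n = pos.shape[1]
    # Variables: the n entries of c followed by the scalar b.
    A_ub = np.vstack([np.hstack([-pos, np.ones((len(pos), 1))]),   # -c.x + b <= -1
                      np.hstack([neg, -np.ones((len(neg), 1))])])  #  c.x - b <= -1
    b_ub = -np.ones(len(pos) + len(neg))
    res = linprog(c=np.zeros(n + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (n + 1))
    return (res.x[:n], res.x[n]) if res.success else None

print(separating_hyperplane(pos=[[2, 2], [3, 1]], neg=[[0, 0], [-1, 1]]))
```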
Learning Linear Separators
[Figure: a random labeled sample; one point falls on the wrong side of the hyperplane, marked “Error!”]
• Given a random sample of {+,−}-labeled points in ℝ^n, find a hyperplane c^T x = b that separates most of the +s and −s.
• Classic machine learning problem.
Learning Linear Separators
[Figure: the same labeled sample with one “Error!” point]
• Classic Theorem [Vapnik–Chervonenkis 1971]: Õ(n/ε²) samples suffice to get error ε.
Submodular Functions are
Approximately Linear
• Let f be non-negative, monotone and submodular
• Claim: f can be approximated to within a factor n by a linear function g.
• Proof sketch: Let g(S) = Σ_{s∈S} f({s}). Then f(S) ≤ g(S) ≤ n·f(S).
Submodularity: f(S) + f(T) ≥ f(S ∩ T) + f(S ∪ T)   ∀ S, T ⊆ V
Monotonicity: f(S) ≤ f(T)   ∀ S ⊆ T
Non-negativity: f(S) ≥ 0   ∀ S ⊆ V
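The sandwich f ≤ g ≤ n·f is easy to verify numerically; below is a brute-force sketch using a coverage function as the illustrative f (the helper name and instance are assumptions).

```python
import itertools

def singleton_sum_bound_holds(f, ground):
    """Check f(S) <= g(S) <= n * f(S) for all S, where g(S) = sum of f({s}) over s in S."""
    n = len(ground)
    g = lambda S: sum(f(frozenset({s})) for s in S)
    return all(f(S) <= g(S) <= n * f(S)
               for r in range(1, n + 1)
               for c in itertools.combinations(ground, r)
               for S in [frozenset(c)])

# Illustrative monotone submodular f: a coverage function.
covers = {0: {1, 2}, 1: {2, 3}, 2: {3, 4}, 3: {1, 4, 5}}
f = lambda S: len(set().union(*(covers[s] for s in S))) if S else 0
print(singleton_sum_bound_holds(f, list(covers)))   # True
```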
Submodular Functions are
Approximately Linear
[Figure: the curves f, g, and n·f over the subsets from ∅ to V, with g sandwiched between f and n·f;
+ points are placed at height f(S_i) and − points at height n·f(S_i), and g separates them]
• Randomly sample {S1, …, Sq} from the distribution
• Create a + example for f(S_i) and a − example for n·f(S_i)
• Now just learn a linear separator! (A sketch follows below.)
• Theorem: g approximates f to within a factor n on a 1−ε fraction of the distribution.
• Can improve to factor O(n^{1/2}) using the GHIM lemma: ellipsoidal approximation of submodular functions.
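Below is a rough sketch of this reduction as the slide describes it, under several assumptions: the target f is an illustrative weighted square-root function, sklearn's LinearSVC stands in for the linear-separator learner, and reading the hypothesis g off the learned hyperplane is a simplification of the actual GHIM/Balcan–Harvey procedure, which handles scaling and failure cases more carefully.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n = 12
weights = rng.uniform(1, 3, size=n)
f = lambda S: float(np.sqrt(weights[list(S)].sum())) if S else 0.0   # monotone submodular

def indicator(S):
    chi = np.zeros(n); chi[list(S)] = 1.0
    return chi

# Labeled points in R^{n+1}: a + point at height f(S_i) and a - point at height n * f(S_i).
X, y = [], []
for _ in range(400):
    S = {i for i in range(n) if rng.random() < 0.5}
    X.append(np.append(indicator(S), f(S)));     y.append(+1)
    X.append(np.append(indicator(S), n * f(S))); y.append(-1)

clf = LinearSVC(C=10.0, max_iter=20_000).fit(np.array(X), np.array(y))
w, c, b = clf.coef_[0][:n], clf.coef_[0][n], clf.intercept_[0]

# Hypothesis g(S): the height at which the separator crosses the column above chi(S).
g = lambda S: -(w @ indicator(S) + b) / c
S_test = {0, 3, 7}
print(f(S_test), g(S_test))   # g(S_test) should be within roughly a factor n of f(S_test)
```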
A Lower Bound?
[Figure: f(S) over the subsets from ∅ to V, with many deep dips]
• A non-uniform distribution focuses on fewer points, so the function is less concentrated on those points
• Can we create a submodular function with lots of deep “bumps”?
• Yes!
A General Lower Bound
Theorem (our general lower bound): No algorithm can PMAC-learn the class of non-negative, monotone,
submodular functions with an approximation factor õ(n^{1/3}).
Plan:
• Use the fact that matroid rank functions are submodular.
• Construct a hard family of matroids.
• Pick A1, …, Am ⊂ V with |Ai| = n^{1/3} and m = n^{log n}
[Figure: sets A1, A2, A3, …, AL; the function takes the high value n^{1/3} on most sets
but the low value log² n on the chosen Ai’s]
Matroids
• Ground set V
• Family of independent sets I
• Axioms:
  • ∅ ∈ I   (“nonempty”)
  • J ⊆ I ∈ I ⇒ J ∈ I   (“downward closed”)
  • J, I ∈ I and |J| < |I| ⇒ ∃ x ∈ I∖J s.t. J+x ∈ I   (“maximum-size sets can be found greedily”)
• Rank function: r(S) = max { |I| : I ∈ I and I ⊆ S }
[Figure: the rank over subsets from ∅ to V for the uniform matroid]
r(S) = |S| (if |S| ≤ k), k (otherwise); i.e., r(S) = min{ |S|, k }
[Figure: the same picture with one distinguished set A of size k]
r(S) = |S| (if |S| ≤ k and S ≠ A),  k−1 (if S = A),  k (otherwise)
[Figure: many distinguished sets A1, A2, A3, …, Am inside V]
A = {A1, …, Am}, |Ai| = k ∀ i
r(S) = |S| (if |S| ≤ k and S ∉ A),  k−1 (if S ∈ A),  k (otherwise)
Claim: r is submodular if |Ai ∩ Aj| ≤ k−2 ∀ i ≠ j
r is the rank function of a “paving matroid”
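A small sketch of this rank function, with a brute-force submodularity check on a toy instance; the instance and helper name are illustrative, and note that the pairwise intersections have size ≤ k−2 as the claim requires.

```python
from itertools import combinations

def make_rank(A_list, k):
    """Rank function from the slide: r(S) = min(|S|, k), except r(S) = k - 1
    when S is one of the special k-element sets A_i."""
    A = {frozenset(a) for a in A_list}
    return lambda S: k - 1 if frozenset(S) in A else min(len(S), k)

# Toy instance on V = {0, ..., 7} with k = 4; pairwise intersections have size <= 2 = k - 2.
V, k = range(8), 4
r = make_rank([{0, 1, 2, 3}, {4, 5, 6, 7}, {0, 1, 4, 5}], k)

# Brute-force check of submodularity: r(S) + r(T) >= r(S | T) + r(S & T) for all S, T.
subsets = [frozenset(c) for i in range(len(V) + 1) for c in combinations(V, i)]
print(all(r(S) + r(T) >= r(S | T) + r(S & T) for S in subsets for T in subsets))  # True
```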
[Figure: the sets A1, A2, A3, …, Am inside V]
A = {A1, …, Am}, |Ai| = k ∀ i, |Ai ∩ Aj| ≤ k−2 ∀ i ≠ j
r(S) = |S| (if |S| ≤ k and S ∉ A),  k−1 (if S ∈ A),  k (otherwise)
If the algorithm sees only examples away from the Ai’s, then f can’t be predicted on the Ai’s.
[Figure: the sets A1, A2, A3, …, Am, with half of the “bumps” removed]
r(S) = |S| (if |S| ≤ k),  k−1 (if S ∈ A and wasn’t deleted),  k (otherwise)
• Delete half of the bumps at random.
• If m is large, the algorithm cannot learn which bumps were deleted
  ⇒ any algorithm to learn f has additive error 1
[Figure: the sets A1, A2, A3, …, Am inside V]
• Can we force a bigger error with bigger bumps? Yes!
• Need to generalize paving matroids
• A needs to have very strong properties
The Main Question
• Let V = A1 ∪ … ∪ Am and b1, …, bm ∈ ℕ
• Is there a matroid s.t.
  • r(Ai) ≤ bi ∀ i
  • r(S) is “as large as possible” for the other S (this is not formal)
• Next: formalize this
• If the Ai’s are disjoint, the solution is a partition matroid
• If the Ai’s are “almost disjoint”, can we find a matroid that’s “almost” a partition matroid?
Lossless Expander Graphs
[Figure: a bipartite graph with left vertex set U and right vertex set V]
• Definition: G = (U ∪ V, E) is a (D, K, ε)-lossless expander if
  – Every u ∈ U has degree D
  – |Γ(S)| ≥ (1−ε)·D·|S|   ∀ S ⊆ U with |S| ≤ K,
    where Γ(S) = { v ∈ V : ∃ u ∈ S s.t. {u,v} ∈ E }
“Every small left-set has a nearly-maximal number of right-neighbors”
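A brute-force check of this definition on a tiny bipartite graph; the adjacency data and helper name are illustrative, and real constructions use far larger parameters.

```python
from itertools import combinations

def is_lossless_expander(neighbors, D, K, eps):
    """Check the (D, K, eps)-lossless expansion condition directly: every left-vertex
    has degree D, and every left-set S with |S| <= K has |Gamma(S)| >= (1 - eps) * D * |S|."""
    if any(len(Nu) != D for Nu in neighbors.values()):
        return False
    U = list(neighbors)
    return all(len(set().union(*(neighbors[u] for u in S))) >= (1 - eps) * D * len(S)
               for size in range(1, K + 1)
               for S in combinations(U, size))

# Tiny example: 4 left-vertices of degree 3 whose neighborhoods pairwise overlap in one vertex.
neighbors = {0: {0, 1, 2}, 1: {2, 3, 4}, 2: {4, 5, 6}, 3: {6, 7, 0}}
print(is_lossless_expander(neighbors, D=3, K=2, eps=1/3))   # True
```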
Equivalently: “Neighborhoods of left-vertices are K-wise almost disjoint”
Trivial Case: Disjoint Neighborhoods
[Figure: a bipartite graph where the left-vertices have pairwise disjoint neighborhoods]
• If the left-vertices have disjoint neighborhoods, this gives a lossless expander with ε = 0, K = 1
Main Theorem: Trivial Case
[Figure: left-vertices u1, u2, u3 with disjoint neighborhoods A1, A2, … ⊆ V and budgets ≤ b1, ≤ b2, …]
• Suppose G = (U ∪ V, E) has disjoint left-neighborhoods.
• Let A = {A1, …, Am} be defined by A = { Γ(u) : u ∈ U }.
• Let b1, …, bm be non-negative integers.
• Theorem (partition matroid):
  I = { I : |I ∩ Aj| ≤ bj ∀ j }
  is the family of independent sets of a matroid.
Main Theorem
• Let G = (U ∪ V, E) be a (D, K, ε)-lossless expander
• Let A = {A1, …, Am} be defined by A = { Γ(u) : u ∈ U }
• Let b1, …, bm satisfy bi ≥ 4εD ∀ i
[Figure: overlapping sets A1, A2 with budgets ≤ b1, ≤ b2]
Main Theorem
• Let G = (U ∪ V, E) be a (D, K, ε)-lossless expander
• Let A = {A1, …, Am} be defined by A = { Γ(u) : u ∈ U }
• Let b1, …, bm satisfy bi ≥ 4εD ∀ i
• “Desired Theorem”: I is a matroid, where
  I = { I : |I ∩ ∪_{j∈J} Aj| ≤ Σ_{j∈J} bj   ∀ J }
Main Theorem
• Let G = (U ∪ V, E) be a (D, K, ε)-lossless expander
• Let A = {A1, …, Am} be defined by A = { Γ(u) : u ∈ U }
• Let b1, …, bm satisfy bi ≥ 4εD ∀ i
• Theorem: I is a matroid, where
  I = { I : |I ∩ ∪_{j∈J} Aj| ≤ Σ_{j∈J} bj − ( Σ_{j∈J} |Aj| − |∪_{j∈J} Aj| )   ∀ J s.t. |J| ≤ K,
            and |I| ≤ εDK }
Main Theorem
• Let G = (U ∪ V, E) be a (D, K, ε)-lossless expander
• Let A = {A1, …, Am} be defined by A = { Γ(u) : u ∈ U }
• Let b1, …, bm satisfy bi ≥ 4εD ∀ i
• Theorem: I is a matroid, where
  I = { I : |I ∩ ∪_{j∈J} Aj| ≤ Σ_{j∈J} bj − ( Σ_{j∈J} |Aj| − |∪_{j∈J} Aj| )   ∀ J s.t. |J| ≤ K,
            and |I| ≤ εDK }
• Trivial case: G has disjoint neighborhoods, i.e., K = 1 and ε = 0.
  The correction term Σ_{j∈J} |Aj| − |∪_{j∈J} Aj| is then 0 and only singleton sets J matter,
  so the constraints reduce to the partition-matroid constraints |I ∩ Aj| ≤ bj ∀ j.
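A direct, brute-force membership test for this family; a sketch only, where the helper name and the toy parameters are assumptions, and the interesting regime uses the much larger parameters on the next slide.

```python
from itertools import combinations

def in_family(I, A, b, K, eps, D):
    """Check the constraints from the theorem by enumerating all index sets J with |J| <= K:
    |I & union(A_j)| <= sum(b_j) - (sum|A_j| - |union A_j|), and |I| <= eps * D * K."""
    I = set(I)
    if len(I) > eps * D * K:
        return False
    for size in range(1, K + 1):
        for J in combinations(range(len(A)), size):
            union = set().union(*(A[j] for j in J))
            correction = sum(len(A[j]) for j in J) - len(union)
            if len(I & union) > sum(b[j] for j in J) - correction:
                return False
    return True

# Toy parameters reusing the tiny expander above: D = 3, K = 2, eps = 1/3, so b_i >= 4*eps*D = 4.
A = [{0, 1, 2}, {2, 3, 4}, {4, 5, 6}, {6, 7, 0}]
b = [4, 4, 4, 4]
print(in_family({0, 5}, A, b, K=2, eps=1/3, D=3))      # True
print(in_family({0, 1, 5}, A, b, K=2, eps=1/3, D=3))   # False: |I| exceeds eps*D*K = 2
```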
LB for Learning Submodular Functions
[Figure: the rank over subsets from ∅ to V, with valleys of depth log² n at the sets A1, A2, …
against a high value of n^{1/3} elsewhere]
• How deep can we make the “valleys”?
LB for Learning Submodular Functions
• Let G = (U ∪ V, E) be a (D, K, ε)-lossless expander, where Ai = Γ(ui) and
  – |V| = n
  – D = K = n^{1/3}
  – |U| = n^{log n}
  – ε = log²(n) / n^{1/3}
• Such graphs exist by the probabilistic method
• Lower bound proof:
  – Delete each node in U with probability ½, then use the main theorem to get a matroid
  – If ui ∈ U was not deleted, then r(Ai) ≤ bi = 4εD = O(log² n)
  – Claim: if ui was deleted, then Ai ∈ I (needs a proof) ⇒ r(Ai) = |Ai| = D = n^{1/3}
  – Since the number of Ai’s is |U| = n^{log n}, no algorithm can learn a significant fraction of the r(Ai) values in polynomial time
Summary
• PMAC model for learning real-valued functions
• Learning under arbitrary distributions:
  – Factor O(n^{1/2}) algorithm
  – Factor Ω(n^{1/3}) hardness (information-theoretic)
• Learning under product distributions:
  – Factor O(log(1/ε)) algorithm
• New general family of matroids
  – Generalizes partition matroids to non-disjoint parts
Open Questions
• Improve the Ω(n^{1/3}) lower bound to Ω(n^{1/2})
• Explicit construction of the expanders
• Non-monotone submodular functions
  – Any algorithm?
  – A lower bound better than Ω(n^{1/3})?
• For the algorithm under the uniform distribution, relax the 1-Lipschitz condition