Online Algorithms
Lecturer: Yishay Mansour
Elad Walach
Alex Roitenberg
Introduction
Up until now, our algorithms received the entire input up front and worked with it
Now suppose the input arrives a little at a time, and we need an instant response to each piece
Oranges example
Suppose we are to build a robot that removes bad oranges from a kibbutz packaging line
After each classification, a kibbutz worker looks at the orange and tells our robot whether its classification was correct
And so on, indefinitely
Our model:
Input: an unlabeled orange x
Output: a classification (good or bad) b
The algorithm then gets the correct classification C_t(x)
Introduction
At every step t, the algorithm predicts the classification based on some hypothesis H_t
The algorithm then receives the correct classification C_t(x)
A mistake is an incorrect prediction: H_t(x) ≠ C_t(x)
The goal is to build an algorithm with a bounded number of mistakes
The number of mistakes should be independent of the input size
Linear Separators
Linear separator
The goal: find w_0 and w defining a hyperplane w · x = w_0
All positive examples will be on one side of the hyperplane and all the negative ones on the other
I.e. w · x > w_0 for positive x only
We will now look at several algorithms for finding such a separator
Perceptron
The idea: correct? Do nothing
Wrong? Move the separator towards the mistake
We'll scale all x's so that ||x|| = 1, since this doesn't affect which side of the plane they are on
The perceptron algorithm
1. Initialize w_1 = 0, t = 1
2. Given x_t, predict positive IFF w_t · x_t > 0
3. On a mistake:
   1. Mistake on positive: w_{t+1} ← w_t + x_t
   2. Mistake on negative: w_{t+1} ← w_t − x_t
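A minimal Python sketch of this loop (not part of the original slides; the function name and the (x, label) sample format with labels in {+1, −1} and unit-norm x are illustrative assumptions):

```python
# Online perceptron: predict sign(w . x), update only on mistakes.
import numpy as np

def perceptron(samples):
    """samples: list of (x, label) pairs with label in {+1, -1} and ||x|| = 1."""
    w = np.zeros(len(samples[0][0]))      # w_1 = 0
    mistakes = 0
    for x, label in samples:
        prediction = 1 if np.dot(w, x) > 0 else -1
        if prediction != label:           # mistake
            w = w + label * np.asarray(x, dtype=float)   # +x on positive, -x on negative
            mistakes += 1
    return w, mistakes
```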
The perceptron algorithm
Suppose a positive sample x
If we misclassified x, then after the update we'll get w_{t+1} · x = (w_t + x) · x = w_t · x + 1 (since ||x||² = 1)
x was positive, but since we made a mistake w_t · x was negative, so the correction was made in the right direction
Mistake Bound Theorem
Let S = <x_i> be consistent with w*: w* · x > 0 ⟺ l(x) = 1
M = |{i : l(x_i) ≠ b_i}| is the number of mistakes
Then M ≤ 1/γ², where γ = min_{x_i ∈ S} (w* · x_i)/(||w*|| ||x_i||) is the margin of w*:
the minimal distance of the samples in S from the hyperplane (after normalizing both w* and the samples)
Mistake Bound Proof
WLOG, the algorithm makes a mistake on every step (otherwise nothing happens)
Claim 1: w_{t+1} · w* ≥ w_t · w* + γ
Proof:
Mistake on a positive x: w_{t+1} · w* = (w_t + x) · w* = w_t · w* + x · w* ≥ w_t · w* + γ, by definition of γ
Mistake on a negative x: w_{t+1} · w* = (w_t − x) · w* = w_t · w* − x · w* ≥ w_t · w* + γ, by definition of γ
Proof Cont.
Claim 2: ||w_{t+1}||² ≤ ||w_t||² + 1
Mistake on a positive x: ||w_{t+1}||² = ||w_t + x||² = ||w_t||² + 2 w_t · x + 1 ≤ ||w_t||² + 1
since w_t · x ≤ 0 when the algorithm made a mistake on a positive example
Mistake on a negative x: ||w_{t+1}||² = ||w_t − x||² = ||w_t||² − 2 w_t · x + 1 ≤ ||w_t||² + 1
since w_t · x > 0 when the algorithm made a mistake on a negative example
Proof Cont.
From Claim 1: w_{M+1} · w* ≥ Mγ
From Claim 2: ||w_{M+1}||² ≤ ||w_1||² + M = M, so ||w_{M+1}|| ≤ √M
Also: w_t · w* ≤ ||w_t|| (since ||w*|| = 1)
Combining: Mγ ≤ w_{M+1} · w* ≤ ||w_{M+1}|| ≤ √M ⟹ M ≤ 1/γ²
The world is not perfect
What if there is no perfect separator?
The world is not perfect
Claim 1 (reminder): w_{t+1} · w* ≥ w_t · w* + γ
Previously we made γ progress on each mistake
Now we might be making negative progress
TD_γ = the total distance we would have to move the points to make them separable by a margin γ
So: w_{M+1} · w* ≥ Mγ − TD_γ
With Claim 2: Mγ − TD_γ ≤ w_{M+1} · w* ≤ ||w_{M+1}|| ≤ √M ⟹ M ≤ 1/γ² + (2/γ) TD_γ
The world is not perfect
The total hinge loss of w* is (1/γ) TD_γ
Alt. definition: the hinge loss of w* on x is max(0, 1 − y), where y = l(x)(w* · x)/γ
Hinge loss illustration: [figure]
Perceptron for maximizing margins
The idea: update w_t whenever the correct classification margin is less than γ/2
No. of steps polynomial in 1/γ
Generalization: update margin γ/2 → (1 − ε)γ
No. of steps polynomial in 1/(εγ)
Perceptron Algorithm (maximizing margin)
Assuming ∀x_i ∈ S, ||x_i|| = 1
Init: w_1 ← l(x_1) x_1
Predict:
(w_t · x)/||w_t|| ≥ γ/2 → predict positive
(w_t · x)/||w_t|| ≤ −γ/2 → predict negative
(w_t · x)/||w_t|| ∈ (−γ/2, γ/2) → margin mistake
On a mistake (prediction or margin), update: w_{t+1} ← w_t + l(x) x
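A sketch of this variant in the same style (assumes γ is known in advance and passed in; the names are illustrative, not from the slides):

```python
# Margin perceptron: update on real mistakes and on margin mistakes alike.
import numpy as np

def margin_perceptron(samples, gamma):
    """samples: list of (x, label) with label in {+1, -1} and ||x|| = 1."""
    x0, l0 = samples[0]
    w = l0 * np.asarray(x0, dtype=float)              # w_1 = l(x_1) x_1
    updates = 0
    for x, label in samples[1:]:
        score = np.dot(w, x) / np.linalg.norm(w)      # normalized margin of x
        if label * score < gamma / 2:                 # mistake or margin mistake
            w = w + label * np.asarray(x, dtype=float)
            updates += 1
    return w, updates
```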
Mistake Bound Theorem
Let S = <x_i> be consistent with: w* · x > 0 ⟺ l(x) = 1
M = No. of mistakes + No. of margin mistakes
Then M ≤ 12/γ², where γ = min_{x_i ∈ S} (w* · x_i)/(||w*|| ||x_i||) is the margin of w*
The proof is similar to the perceptron proof
Claim 1 remains the same: w_{t+1} · w* ≥ w_t · w* + γ
We only have to bound ||w_{t+1}||
Mistake bound proof
WLOG, the algorithm makes a mistake (or a margin mistake) on every step
Claim 2: ||w_{t+1}|| ≤ ||w_t|| + γ/2 + 1/(2||w_t||)
Proof: ||w_{t+1}|| = ||w_t + l(x) x|| = ||w_t|| √(1 + 2 l(x)(w_t · x)/||w_t||² + 1/||w_t||²)
And since √(1 + α) ≤ 1 + α/2:
||w_{t+1}|| ≤ ||w_t|| (1 + l(x)(w_t · x)/||w_t||² + 1/(2||w_t||²))
Proof Cont.
Since the algorithm made a mistake (or a margin mistake) on step t: l(x)(w_t · x)/||w_t|| ≤ γ/2
And thus: ||w_{t+1}|| ≤ ||w_t|| + 1/(2||w_t||) + γ/2
Proof Cont.
So: ||w_{t+1}|| ≤ ||w_t|| + 1/(2||w_t||) + γ/2
If ||w_t|| ≥ 2/γ, then ||w_{t+1}|| ≤ ||w_t|| + 3γ/4
⟹ ||w_{M+1}|| ≤ 1 + 2/γ + (3γ/4) M
From Claim 1 as before: Mγ ≤ w_{M+1} · w* ≤ ||w_{M+1}||
Solving, we get: M ≤ 12/γ²
The mistake bound model
Con Algorithm
At step t:
C_t ⊆ C is the set of concepts consistent with x_1, x_2, .., x_{t−1}
Randomly choose a concept c ∈ C_t
Predict b_t = c(x_t)
CON Algorithm
Theorem: For any concept class C, CON makes at most |C| − 1 mistakes
Proof: at first C_1 = C
After each mistake |C_t| decreases by at least 1
|C_t| ≥ 1, since the target concept is in C_t at any t
Therefore the number of mistakes is bounded by |C| − 1
The bounds of CON
This bound is too high!
There are 2^(2^n) different functions on {0,1}^n
We can do better!
HAL – halving algorithm
At step t:
C_t ⊆ C is the set of concepts consistent with x_1, x_2, .., x_{t−1}
Conduct a vote amongst all c ∈ C_t
Predict b_t in accordance with the majority
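One round of HAL can be sketched as follows, representing the (finite) concept class simply as a list of boolean functions (an illustrative assumption, not the slides' notation):

```python
# One online round of the halving algorithm: majority vote, then filtering.
def halving_step(C_t, x, true_label):
    votes = sum(1 if c(x) else -1 for c in C_t)
    prediction = votes >= 0                            # predict with the majority
    C_next = [c for c in C_t if c(x) == true_label]    # keep only consistent concepts
    return prediction, C_next
```

On every mistake the majority was wrong, so C_next contains at most half the concepts of C_t; this is exactly the source of the log_2|C| bound proved on the next slide.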
HAL – halving algorithm
Theorem: For any concept class C, HAL makes at most log_2|C| mistakes
Proof: C_1 = C. After each mistake |C_{t+1}| ≤ (1/2)|C_t|, since a majority of the concepts were wrong
Therefore the number of mistakes is bounded by log_2|C|
Mistake Bound model and PAC
The mistake bound model generates strong online algorithms
In the past we have seen PAC
The restrictions of the mistake bound model are much harsher than those of PAC
If we know that A learns C in the mistake bound model, can A learn C in the PAC model?
Mistake Bound model and PAC
A – a mistake bound algorithm
Our goal: to construct Apac, a PAC algorithm
Assume that after A gets x_i it constructs hypothesis h_i
Definition: a mistake bound algorithm A is conservative iff for every sample x_i, if c_t(x_i) = h_{i−1}(x_i) then in the i-th step the algorithm keeps h_i = h_{i−1}
I.e. it changes its hypothesis only when a mistake is made
Conservative equivalent of Mistake Bound Algorithm
Let A be an algorithm whose number of mistakes is bounded by M
A_k is A's hypothesis after it has seen {x_1, x_2, .., x_k}
Define A':
Initially h_0 = A_0
At x_i update:
Guess h_{i−1}(x_i)
If c_t(x_i) = h_{i−1}(x_i), h_i = h_{i−1}
Else feed x_i to A and set h_i to A's new hypothesis
If we run A on S = {x_i : c_t(x_i) ≠ h_{i−1}(x_i)}, A makes a mistake on every example in S, so |S| ≤ M ⇒
A' makes at most as many mistakes as A
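A sketch of A' as a wrapper around A (the A.hypothesis() / A.update(x, label) interface is an assumption made for illustration, not taken from the slides):

```python
# Conservative wrapper A': forward an example to A only when the current
# hypothesis misclassifies it, so the hypothesis changes only on mistakes.
def run_conservative(A, samples):
    h = A.hypothesis()                     # h_0 = A_0
    for x, label in samples:
        if h(x) != label:                  # mistake: let A see this example
            A.update(x, label)
            h = A.hypothesis()             # h_i = A's new hypothesis
        # otherwise h_i = h_{i-1}: do nothing
    return h
```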
Building Apac
k_i = (i/ε) ln(M/δ), 0 ≤ i ≤ M − 1
Apac algorithm:
Run A' over a sample of size (M/ε) ln(M/δ), divided into M equal blocks
Build hypothesis h_{k_i} for each block
Run the hypothesis on the next block
If there are no mistakes, output h_{k_i}
[Diagram: M blocks of size (1/ε) ln(M/δ) each; h_{k_0}, h_{k_1}, … are tested on successive blocks until one is consistent]
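A sketch of this block-testing loop, using the same assumed A'-interface as above and the slide's block size (1/ε)·ln(M/δ); the stream and names are illustrative:

```python
# A_PAC: test the current hypothesis on a fresh block; output it if it makes
# no mistakes, otherwise train A' on that block and move to the next one.
import math

def a_pac(A_prime, sample_stream, M, eps, delta):
    block_size = math.ceil((1 / eps) * math.log(M / delta))
    h = A_prime.hypothesis()
    for _ in range(M):
        block = [next(sample_stream) for _ in range(block_size)]
        if all(h(x) == label for x, label in block):
            return h                                   # consistent on a whole block
        for x, label in block:                         # A' is conservative: it updates only on its mistakes
            A_prime.update(x, label)
        h = A_prime.hypothesis()
    return h                                           # after M mistakes A' must be perfect
```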
Building Apac
If A' makes at most M mistakes then Apac is guaranteed to finish: if A' reaches M mistakes, Apac outputs a perfect classifier
What happens otherwise?
Theorem: Apac learns C in the PAC model
Proof: Pr(h_{k_i} succeeds on a whole block while being ε-bad) ≤ (1 − ε)^((1/ε) ln(M/δ)) ≤ δ/M
Pr(Apac outputs an ε-bad h) ≤ Pr(∃ 0 ≤ i ≤ M − 1 s.t. h_{k_i} is ε-bad and consistent with its test block) ≤ Σ_{i=0}^{M−1} Pr(h_{k_i} is ε-bad and consistent) ≤ M · (δ/M) = δ
Disjunction of Conjunctions
Disjunction of Conjunctions
We have proven that every algorithm in the mistake bound model can be converted to PAC
Let's look at some algorithms in the mistake bound model
Disjunction Learning
Our goal: learn the class of disjunctions, e.g. x_1 ∨ x_2 ∨ x_6 ∨ x_8
Let L be the literal set {x_1, ¬x_1, x_2, ¬x_2, …, x_n, ¬x_n}, and h = ⋁ {x : x ∈ L}
Given a sample y do (see the sketch after this slide):
1. If our hypothesis makes a mistake (h_t(y) ≠ c_t(y)) then:
   L ← L \ S, where S = {all x_i for which y_i is positive, and all ¬x_i for which y_i is negative}
2. Else do nothing
3. Return to step 1 with the next sample (updating our hypothesis)
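A sketch of the elimination loop above, encoding the literal x_i as (i, True) and ¬x_i as (i, False) (this encoding is an illustration choice, not from the slides):

```python
# Disjunction learning by literal elimination: start from all 2n literals and
# drop every literal that is satisfied by a sample we misclassified.
def learn_disjunction(n, samples):
    L = {(i, s) for i in range(n) for s in (True, False)}   # all 2n literals

    def h(y):                      # hypothesis: OR of the remaining literals
        return any(bool(y[i]) == s for (i, s) in L)

    for y, label in samples:       # y is a 0/1 vector, label is the true value
        if h(y) != bool(label):    # mistake (only happens on negative samples)
            L -= {(i, s) for (i, s) in L if bool(y[i]) == s}
    return L
```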
Example
If we have only 2 variables:
L is {x_1, ¬x_1, x_2, ¬x_2}
h_t = x_1 ∨ ¬x_1 ∨ x_2 ∨ ¬x_2
Assume the first sample is y = (1,0):
h_t(y) = 1
If c_t(y) = 0:
we update L = {¬x_1, x_2}
h_{t+1} = ¬x_1 ∨ x_2
Mistake Bound Analysis
The number of mistakes is bounded by n + 1
n is the number of variables
Proof:
Let R be the set of literals in c_t
Let L_t be the literal set of the hypothesis after the first t samples y_1, …, y_t
Mistake Bound Analysis
We prove by induction that R ⊆ L_t
For t = 0 it is obvious that R ⊆ L_0
Assume that after t − 1 samples R ⊆ L_{t−1}
If c_t(y_t) = 1 then, since R ⊆ L_{t−1}, also h_t(y_t) = 1 = c_t(y_t) and we don't update
If c_t(y_t) = 0 then no literal of R is satisfied by y_t, so S and R don't intersect
Either way R ⊆ L_t
Thus we can only make mistakes when c_t(y) = 0
Mistake analysis proof
At the first mistake we eliminate n literals
At any further mistake we eliminate at least 1 literal
L_0 has 2n literals
So we can have at most n + 1 mistakes
k-DNF
Definition: k-DNF functions are functions that can be represented as a disjunction of conjunctions, each containing at most k literals
E.g. 3-DNF: (x_1 ∧ x_2 ∧ x_6) ∨ (x_1 ∧ x_3 ∧ x_5)
The number of conjunctions of i literals is C(n, i) · 2^i:
we choose i variables (C(n, i) ways), and for each of them we choose a sign (2^i)
k-DNF classification
We can learn this class by changing the previous algorithm to deal with terms (conjunctions) instead of variables
Reducing the space X = {0,1}^n to Y = {0,1}^(O(n^k)): a k-DNF over X becomes a plain disjunction over Y (sketched below)
2 usable algorithms:
ELIM for PAC
The previous algorithm (in the mistake bound model), which makes O(n^k) mistakes
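The reduction can be sketched as an explicit feature map; the function below (an illustrative sketch) evaluates every conjunction of at most k literals on x, producing the vector y on which the disjunction learner above can be run:

```python
# Map x in {0,1}^n to y in {0,1}^(O(n^k)): one coordinate per conjunction of
# at most k literals; a k-DNF over x is a monotone disjunction over y.
from itertools import combinations, product

def expand_to_terms(x, k):
    n = len(x)
    y = []
    for size in range(1, k + 1):
        for idxs in combinations(range(n), size):
            for signs in product((True, False), repeat=size):
                # the term is satisfied iff every chosen literal agrees with x
                y.append(1 if all(bool(x[i]) == s for i, s in zip(idxs, signs)) else 0)
    return y
```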
Winnow
Monotone disjunction: a disjunction containing only positive literals, e.g. x_1 ∨ x_3 ∨ x_5
Purpose: to learn the class of monotone disjunctions in the mistake bound model
We look at Winnow, which is similar to the perceptron
One main difference: it uses multiplicative steps rather than additive ones
Winnow
Same classification scheme as the perceptron, with a threshold θ (the proof below takes θ = n):
h(x): x · w ≥ θ ⇒ positive classification
h(x): x · w < θ ⇒ negative classification
Initialize w_0 = (1, 1, …, 1)
Update scheme:
On a positive misclassification (h(x) = 1, c_t(x) = 0): ∀x_i = 1: w_i ← w_i / 2
On a negative misclassification (h(x) = 0, c_t(x) = 1): ∀x_i = 1: w_i ← 2 w_i
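A sketch of Winnow as specified above, with threshold θ = n (illustrative names; x is a 0/1 vector and labels are 0/1):

```python
# Winnow: multiplicative updates, threshold n, all weights start at 1.
def winnow(n, samples):
    w = [1.0] * n
    mistakes = 0
    for x, label in samples:
        prediction = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= n else 0
        if prediction != label:
            mistakes += 1
            factor = 0.5 if prediction == 1 else 2.0    # demote on a false positive, promote on a false negative
            w = [wi * factor if xi == 1 else wi for wi, xi in zip(w, x)]
    return w, mistakes
```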
Mistake bound analysis
Similar to the perceptron: if the margin is bigger than γ then we can prove the error rate is Θ(1/γ²)
Winnow Proof: Definitions
Let S = {x_{i1}, x_{i2}, .., x_{ir}} be the set of relevant variables in the target concept
I.e. c_t = x_{i1} ∨ x_{i2} ∨ .. ∨ x_{ir}
We define W_r = {w_{i1}, w_{i2}, .., w_{ir}}, the weights of the relevant variables
Let w(t) be the weight w at time t
Let TW(t) be the total weight at time t of all the weights, of both relevant and irrelevant variables
Winnow Proof: Positive Mistakes
Let's look at the mistakes on positive examples
Any mistake on a positive example doubles (at least) 1 of the relevant weights: ∃w ∈ W_r s.t. w(t + 1) = 2w(t)
If ∃w_i s.t. w_i ≥ n, then every example with x_i = 1 gives x · w ≥ n and therefore a positive classification
So ∀w_i: w_i can be doubled at most 1 + log n times
Thus we can bound the number of positive mistakes: M+ ≤ r(1 + log n)
Winnow Proof: Positive Mistakes
For a mistake on a positive example:
h(x) = w_1(t) x_1 + … + w_n(t) x_n < n
TW(t + 1) = TW(t) + (w_1(t) x_1 + … + w_n(t) x_n)
(1) TW(t + 1) < TW(t) + n
Winnow Proof: Negative Mistakes
On negative examples none of the relevant weights change
Thus ∀w ∈ W_r, w(t + 1) ≥ w(t)
For a negative mistake to occur: w_1(t) x_1 + … + w_n(t) x_n ≥ n
TW(t + 1) = TW(t) − (w_1(t) x_1 + … + w_n(t) x_n)/2
⇒ (2) TW(t + 1) ≤ TW(t) − n/2
Winnow Proof: Cont.
Combining the equations (1), (2): (3) 0 < TW(t) ≤ TW(0) + n M+ − (n/2) M−
At the beginning all weights are 1, so (4) TW(0) = n
(3), (4) ⇒ M− < 2 + 2 M+ ≤ 2 + 2r(log n + 1)
⇒ M− + M+ ≤ 2 + 3r(log n + 1)
What should we know? I
Linear Separators
Perceptron algorithm: M ≤ 1/γ²
Margin Perceptron: M ≤ 12/γ²
The mistake bound model
CON algorithm: M ≤ |C| − 1, but C may be very large!
HAL, the halving algorithm: M ≤ log_2|C|
What should you know? II
The relation between PAC and the mistake bound model
The basic algorithm for learning disjunctions of conjunctions
Learning k-DNF functions
Winnow algorithm: M ≤ 2 + 3r(log n + 1)
Questions?