Convex Risk Minimization for Binary Classification
Sam Watson
University of Mississippi
April 28, 2009
The Paper
The material for this presentation was drawn from the paper Statistical Behavior and Consistency of Classification Methods Based on Convex Risk Minimization by Tong Zhang, Annals of Statistics, 2004.
The Binary Classification Problem
We want to identify objects as belonging to one of two classes based on information about the object.
We encode the input information as a vector x ∈ R^p, and the classes are denoted y = −1 and y = 1.
We record N observations {(x_i, y_i)}_{i=1}^N, for which the categories y_i ∈ {−1, 1} are known.
We look to quantify the relationship between input and output with a function f : R^p → R. We classify to 1 if f(x) ≥ 0 and to −1 if f(x) < 0.
The Binary Classification Problem
Given a set C of functions (for example, the set {α^T x + β : α ∈ R^p, β ∈ R} of linear functions), we seek to identify a function f ∈ C for which the empirical risk

  Σ_{i=1}^N ℓ(f(x_i), y_i)

is minimized, where the penalty for a misclassification is

  ℓ(w, y) := 1 if wy < 0;  1 if w = 0 and y = −1;  0 otherwise.

[Figure: the misclassification loss ℓ plotted as a function of wy.]
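As a concrete illustration (not from the paper), here is a minimal numpy sketch of the 0-1 loss ℓ and the empirical risk of a linear candidate f; the names zero_one_loss, empirical_risk, and the toy data are my own.

```python
import numpy as np

def zero_one_loss(w, y):
    # ell(w, y): 1 if wy < 0, 1 if w == 0 and y == -1, 0 otherwise.
    return np.where(w * y < 0, 1.0, np.where((w == 0) & (y == -1), 1.0, 0.0))

def empirical_risk(f, X, y):
    # Sum of ell(f(x_i), y_i) over the observed sample.
    return zero_one_loss(f(X), y).sum()

# Toy sample of N = 5 points in R^2 with known labels.
X = np.array([[1.0, 2.0], [-1.0, 0.5], [2.0, -1.0], [-2.0, -1.0], [0.5, 0.5]])
y = np.array([1, -1, 1, -1, 1])

# A linear candidate f(x) = alpha^T x + beta from the class C of linear functions.
alpha, beta = np.array([1.0, 0.5]), -0.2
f = lambda X: X @ alpha + beta

print(empirical_risk(f, X, y))  # number of misclassified sample points
```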
Convex Minimization
Because this function is not convex, minimizing the risk is typically NP-hard.
One heuristic solution is to replace the true risk function ℓ with a convex function φ ≥ ℓ.
Examples:
  Least Squares: φ(wy) = (1 − wy)^2
  Least Squares Modified: φ(wy) = max(1 − wy, 0)^2
  SVM: φ(wy) = max(1 − wy, 0)
  Logistic Regression: φ(wy) = ln(1 + e^(−wy))
[Figure: each surrogate loss φ plotted as a function of wy.]
Convex Minimization
Efficient algorithms exist for minimizing Σ_{i=1}^N φ(f(x_i) y_i) with φ convex.
In practice, these algorithms achieve greater success than might be expected from a numerically motivated approximation to the true classification error.
We provide a theoretical explanation for this success, proving that (under certain general circumstances) the classification rule obtained by minimizing the empirical risk under a convex loss function produces a nearly optimal error rate.
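To make the minimization step concrete, here is a small sketch (my own, not from the paper) that minimizes the empirical convex risk Σ φ(f(x_i) y_i) over linear functions f(x) = α^T x + β, using the logistic surrogate and scipy's general-purpose optimizer on made-up data.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Toy sample: two Gaussian clusters labeled +1 / -1.
N = 200
X = np.vstack([rng.normal(+1, 1, size=(N // 2, 2)),
               rng.normal(-1, 1, size=(N // 2, 2))])
y = np.concatenate([np.ones(N // 2), -np.ones(N // 2)])

def phi(v):
    # Logistic surrogate, written stably as ln(1 + exp(-v)).
    return np.logaddexp(0.0, -v)

def convex_risk(params):
    # Empirical convex risk of f(x) = alpha^T x + beta.
    alpha, beta = params[:2], params[2]
    return phi((X @ alpha + beta) * y).sum()

res = minimize(convex_risk, x0=np.zeros(3))   # convex problem: any starting point works
alpha, beta = res.x[:2], res.x[2]

# 0-1 error of the resulting classification rule sign(f(x)).
pred = np.where(X @ alpha + beta >= 0, 1, -1)
print("empirical misclassification rate:", np.mean(pred != y))
```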
Analysis of estimation
We assume that the samples {(x_i, y_i)}_{i=1}^N have been independently drawn from an unknown underlying joint distribution on the random variables X and Y.
The risk associated with a predictor f is L(f) := E_{X,Y} ℓ(f(X), Y), which we approximate by the empirical average (1/N) Σ_{i=1}^N ℓ(f(x_i), y_i).
The ideal predictor is called the Bayes classifier, which classifies to 1 if η(x) := P(Y = 1 | X = x) ≥ 0.5 and to −1 otherwise. The Bayes error is L(f) for any Bayes estimator f.
One example of a Bayes estimator is f(x) = 2η(x) − 1.
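The sketch below (my own toy construction, with a made-up conditional probability η) draws samples from a known distribution, applies the Bayes rule, and checks the resulting error against the Bayes error E_X[min(η(X), 1 − η(X))].

```python
import numpy as np

rng = np.random.default_rng(1)

# A made-up model: X ~ Uniform(-3, 3) and eta(x) = P(Y = 1 | X = x) is a logistic curve.
eta = lambda x: 1.0 / (1.0 + np.exp(-2.0 * x))

N = 100_000
X = rng.uniform(-3, 3, size=N)
Y = np.where(rng.uniform(size=N) < eta(X), 1, -1)

# Bayes classifier: classify to 1 when eta(x) >= 0.5, i.e. when 2*eta(x) - 1 >= 0.
f_bayes = 2 * eta(X) - 1
pred = np.where(f_bayes >= 0, 1, -1)

print("Monte Carlo error of Bayes rule:  ", np.mean(pred != Y))
print("Bayes error E[min(eta, 1 - eta)]: ", np.mean(np.minimum(eta(X), 1 - eta(X))))
```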
Some definitions
Let Q(f) := E_{X,Y} φ(f(X) Y) denote the convex risk. We begin by rewriting Q as
  Q(f) = E_X [ η(X) φ(f(X)) + (1 − η(X)) φ(−f(X)) ].
This motivates the definition of a function Q : [0, 1] × R → R defined by
  Q(η, w) := η φ(w) + (1 − η) φ(−w).
We define the functions fφ∗ : [0, 1] → R and Q∗ : [0, 1] → R by
  fφ∗(η) = argmin_{w ∈ R} Q(η, w)
  Q∗(η) = Q(η, fφ∗(η)).
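As a quick sanity check on these definitions (my own sketch, not from the paper), the code below computes fφ∗(η) and Q∗(η) numerically for the logistic surrogate by one-dimensional minimization; the closed form ln(η/(1 − η)) is quoted only for comparison.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def phi(v):
    return np.logaddexp(0.0, -v)     # logistic surrogate ln(1 + e^{-v})

def Q(eta, w):
    return eta * phi(w) + (1 - eta) * phi(-w)

def f_star(eta):
    # f_phi^*(eta) = argmin_w Q(eta, w), found by bounded 1-D minimization.
    return minimize_scalar(lambda w: Q(eta, w), bounds=(-20, 20), method="bounded").x

for eta in (0.2, 0.5, 0.8):
    w_star = f_star(eta)
    # numeric minimizer vs. closed form ln(eta/(1-eta)), and the value Q^*(eta)
    print(eta, w_star, np.log(eta / (1 - eta)), Q(eta, w_star))
```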
The Convex Risk Q
Recall Q(η, w) := η φ(w) + (1 − η) φ(−w).
Observe that fφ∗(η(x)) minimizes Q(f) (among all measurable functions f), since it minimizes Q pointwise.
Q(η, ·) inherits convexity from φ. For λ ∈ [0, 1],
  Q(η, λw_1 + (1 − λ)w_2)
    = η φ(λw_1 + (1 − λ)w_2) + (1 − η) φ(−λw_1 − (1 − λ)w_2)
    ≤ η [λ φ(w_1) + (1 − λ) φ(w_2)] + (1 − η) [λ φ(−w_1) + (1 − λ) φ(−w_2)]
    = λ Q(η, w_1) + (1 − λ) Q(η, w_2).
Examples
For example, let φ(v) = (1 − v)^2. We will compute fφ∗ and Q∗.
We calculate
  Q(η, w) = η φ(w) + (1 − η) φ(−w)
          = η (1 − w)^2 + (1 − η)(1 + w)^2
          = 1 + (−4η + 2) w + w^2.
As a function of w, this is a parabola whose minimum is achieved at fφ∗(η) = −(−4η + 2)/2 = 2η − 1. Plugging back in, Q∗(η) = 4η(1 − η).
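A quick numeric check of this computation (my own sketch): for a grid of η, minimize Q(η, ·) with φ(v) = (1 − v)^2 and compare against the closed forms 2η − 1 and 4η(1 − η).

```python
import numpy as np
from scipy.optimize import minimize_scalar

phi = lambda v: (1 - v) ** 2
Q = lambda eta, w: eta * phi(w) + (1 - eta) * phi(-w)

for eta in np.linspace(0.0, 1.0, 6):
    res = minimize_scalar(lambda w: Q(eta, w))        # parabola in w: unique minimizer
    print(f"eta={eta:.1f}  argmin={res.x:+.4f}  2*eta-1={2*eta-1:+.4f}  "
          f"Q*={res.fun:.4f}  4*eta*(1-eta)={4*eta*(1-eta):.4f}")
```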
More Examples
[Figure: fφ∗(η) plotted as a function of η ∈ [0, 1] for Least Squares, SVM, and Logistic Regression.]
More Examples
These examples suggest why the corresponding classifiers achieve the Bayes error rate: in each case, fφ∗(η) > 0 when η > 1/2.
Some technical issues to work out:
  fφ∗(η(x)) does not necessarily belong to the set C of candidate functions.
  fφ∗(η(x)) is not necessarily the unique minimizer of Q(f). Therefore, if the convex minimization algorithm converges to a different minimizer of Q, the resulting estimator may not achieve the Bayes error rate.
A theorem relating L(f) and Q(f)
To deal with these issues we need a theorem relating the risk L(f) under the true penalty ℓ and the risk Q(f) under the convex penalty φ.
Define ∆Q(f) := Q(f) − Q(fφ∗(η(x))) to be the difference between the convex risk of a function f and the minimal convex risk, and define the pointwise analogue ∆Q(η, w) := Q(η, w) − Q∗(η).
Theorem 1. If fφ∗(η) > 0 for all η > 1/2, and if there exist c > 0 and s ≥ 1 such that for all η ∈ [0, 1] we have |1/2 − η|^s ≤ c^s ∆Q(η, 0), then
  L(f) − L∗ ≤ 2c ∆Q(f)^{1/s},
where L∗ is the Bayes error rate.
The Proof
Note that due to the symmetry of Q under (η, w) ↦ (1 − η, −w), we may always assume that fφ∗(1 − η) = −fφ∗(η).
We will show that under the hypotheses of the theorem,
  (2η(x) − 1) w ≤ 0  ⟹  Q(η(x), 0) ≤ Q(η(x), w).
Three cases:
  η > 1/2. We get w ≤ 0 from (2η(x) − 1) w ≤ 0, and by hypothesis fφ∗(η) > 0. We choose λ ∈ [0, 1] so that λw + (1 − λ) fφ∗(η) = 0 and apply the convexity of Q(η, ·).
  η < 1/2. The antisymmetry of fφ∗ shows that fφ∗(η) < 0, and the rest of the argument proceeds as in the case η > 1/2.
  η = 1/2. Since fφ∗(η) = 0, we have Q(η, 0) ≤ Q(η, w) by the definition of fφ∗(η).
The Proof
We use the notation E_A, where A is a subset of the underlying state space Ω, for the expectation E[f · 1_A]. Here 1_A denotes the indicator function on the set A. Use the formula L(f) = E_{f(X) ≥ 0}[1 − η(X)] + E_{f(X) < 0}[η(X)] to obtain:
  L(f) − L(2η(x) − 1)
    = E_{η(X) ≥ 0.5 & f(X) < 0}(2η(X) − 1) + E_{η(X) < 0.5 & f(X) ≥ 0}(1 − 2η(X))
    ≤ E_{(2η(X)−1) f(X) ≤ 0} |2η(X) − 1|
    ≤ 2 ( E_{(2η(X)−1) f(X) ≤ 0} |η(X) − 1/2|^s )^{1/s},  from ∫ |f| dµ ≤ ( ∫ |f|^s dµ )^{1/s}
    ≤ 2c ( E_{(2η(X)−1) f(X) ≤ 0} ∆Q(η(X), 0) )^{1/s}.
Finally observe that E_{(2η(X)−1) f(X) ≤ 0} ∆Q(η(X), 0) ≤ E_X ∆Q(η(X), f(X)) = ∆Q(f).
An Example
Recall the theorem statement: If fφ∗(η) > 0 for all η > 1/2 and if there exist c > 0 and s ≥ 1 such that for all η ∈ [0, 1] we have |1/2 − η|^s ≤ c^s ∆Q(η, 0), then
  L(f) − L∗ ≤ 2c ∆Q(f)^{1/s},
where L∗ is the Bayes error rate.
Also, recall that for least squares we have fφ∗(η) = 2η − 1 and Q∗(η) = 4η(1 − η). Therefore,
  ∆Q(η, w) = Q(η, w) − Q∗(η) = [1 + (−4η + 2) w + w^2] − 4η(1 − η) = (2η − 1 − w)^2.
This implies (η − 1/2)^2 = (1/4) ∆Q(η, 0), so in the theorem statement we may take c = 1/2 and s = 2. We find L(f) − L∗ ≤ √∆Q(f).
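To illustrate the bound (my own sketch, with a made-up distribution and a deliberately non-optimal predictor), the code below estimates L(f) − L∗ and ∆Q(f) for the least-squares surrogate by Monte Carlo and confirms L(f) − L∗ ≤ √∆Q(f).

```python
import numpy as np

rng = np.random.default_rng(2)

eta = lambda x: 1.0 / (1.0 + np.exp(-2.0 * x))     # made-up conditional probability
N = 200_000
X = rng.uniform(-3, 3, size=N)

f = lambda x: 0.5 * x - 0.3                        # some arbitrary (non-optimal) predictor

# Excess classification error L(f) - L*, written via eta(X) as in the proof.
err_f     = np.mean(np.where(f(X) >= 0, 1 - eta(X), eta(X)))
err_bayes = np.mean(np.minimum(eta(X), 1 - eta(X)))

# Excess convex risk for phi(v) = (1 - v)^2: Delta Q(f) = E[(2*eta(X) - 1 - f(X))^2].
delta_Q = np.mean((2 * eta(X) - 1 - f(X)) ** 2)

print("L(f) - L*       :", err_f - err_bayes)
print("sqrt(Delta Q(f)):", np.sqrt(delta_Q))       # the theorem says the first is <= the second
```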
Implications of the theorem
Ensuring that ∆Q(f) is small guarantees that the classification error is close to the Bayes error rate.
If (f_n)_{n=1}^∞ is a sequence of estimators for which ∆Q(f_n) → 0, then the error rate of the predictor f_n approaches the Bayes rate as n → ∞.
Let f_n be the predictor resulting from minimizing Σ_{i=1}^n φ(f(x_i) y_i). To ensure that ∆Q(f_n) → 0, the function class C needs to be:
  Expressive enough to admit a function close to fφ∗, or some other function for which ∆Q(f) is small (approximation error).
  Not so expressive that the estimator f_n experiences a large variance due to the finite sample size used in the empirical risk (estimation error).
Approximation Error
To ensure that the function class is sufficiently large, it is enough to require that it is dense in the set of continuous functions.
More specifically, let U ⊂ R^p be Borel measurable, and let C(U) be the Banach space of continuous functions on U under the uniform-norm topology.
Theorem 2. If C is dense in C(U), then for any regular probability measure µ with µ(U) = 1 and any measurable conditional probability η(x) = P(Y = 1 | X = x), we have inf_{f ∈ C} ∆Q(f) = 0, where (X, Y) is distributed according to (µ, η).
Estimation error
We use reproducing kernel Hilbert spaces to define a chain C_1 ⊂ C_2 ⊂ C_3 ⊂ · · · of function classes for the estimators f_1, f_2, f_3, . . .
More specifically, we begin with a positive symmetric kernel K(a, b) and define the RKHSs as closures of the linear combinations of the functions K(x_i, ·) (where the x_i are the input values).
We then calculate the estimators f_n using regularization (with the Hilbert space norm):
  f_n = argmin_{f ∈ C_n} [ (1/n) Σ_{i=1}^n φ(f(x_i) y_i) + (λ_n/2) ||f||^2 ],
where λ_n is a regularization parameter.
This infinite-dimensional optimization problem reduces to a finite-dimensional one.
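Here is a rough sketch of that finite-dimensional reduction (my own, not from the paper, assuming a Gaussian kernel and the logistic surrogate): by the representer theorem the minimizer can be written as f(x) = Σ_j α_j K(x_j, x), with ||f||^2 = α^T K α, so the optimization runs over the n coefficients α.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)

# Toy sample of size n.
n = 60
X = np.vstack([rng.normal(+1, 1, size=(n // 2, 2)), rng.normal(-1, 1, size=(n // 2, 2))])
y = np.concatenate([np.ones(n // 2), -np.ones(n // 2)])

# Gaussian (RBF) kernel matrix K_ij = K(x_i, x_j); the kernel choice is an assumption here.
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sq_dists)

phi = lambda v: np.logaddexp(0.0, -v)              # logistic surrogate ln(1 + e^{-v})
lam = 0.1                                          # regularization parameter lambda_n

def objective(alpha):
    f_vals = K @ alpha                             # f(x_i) = sum_j alpha_j K(x_j, x_i)
    return phi(f_vals * y).mean() + 0.5 * lam * alpha @ K @ alpha   # ||f||^2 = alpha^T K alpha

alpha = minimize(objective, x0=np.zeros(n)).x

pred = np.where(K @ alpha >= 0, 1, -1)
print("training misclassification rate:", np.mean(pred != y))
```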
Questions?