Convex Risk Minimization for Binary Classification
Sam Watson, University of Mississippi
April 28, 2009

The Paper

The material for this presentation was drawn from the paper "Statistical Behavior and Consistency of Classification Methods Based on Convex Risk Minimization" by Tong Zhang, Annals of Statistics, 2004.

The Binary Classification Problem

- We want to identify objects as belonging to one of two classes based on information about the object.
- We encode the input information as a vector x ∈ R^p, and the classes are denoted y = −1 and y = 1.
- We record N observations {(x_i, y_i)}_{i=1}^N, for which the categories y_i ∈ {−1, 1} are known.
- We look to quantify the relationship between input and output with a function f : R^p → R. We classify to 1 if f(x) ≥ 0 and to −1 if f(x) < 0.
- Given a set C of functions (for example, the set {α^T x + β : α ∈ R^p, β ∈ R} of linear functions), we seek a function f ∈ C that minimizes the empirical risk

      Σ_{i=1}^N ℓ(f(x_i), y_i),

  where the penalty for a misclassification is

      ℓ(w, y) := 1 if wy < 0;  1 if w = 0 and y = −1;  0 otherwise.

  [Figure: the 0-1 loss ℓ plotted against wy.]

Convex Minimization

- Because this loss function is not convex, minimizing the empirical risk is typically NP-hard.
- One heuristic solution is to replace the true loss function ℓ with a convex function φ ≥ ℓ.
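To fix the notation, here is a minimal Python sketch of the 0-1 loss and the empirical risk it defines (the helper names zero_one_loss and empirical_risk, and the toy data, are illustrative and not from the paper):

```python
import numpy as np

def zero_one_loss(w, y):
    """The 0-1 penalty l(w, y): 1 if w*y < 0, 1 if w == 0 and y == -1
    (since ties are classified to +1), and 0 otherwise."""
    if w * y < 0 or (w == 0 and y == -1):
        return 1.0
    return 0.0

def empirical_risk(f, X, y):
    """Sum of l(f(x_i), y_i) over the observed sample."""
    return sum(zero_one_loss(f(x), yi) for x, yi in zip(X, y))

# Toy usage with a linear classifier f(x) = alpha . x + beta.
X = np.array([[1.0, 2.0], [-1.0, 0.5], [0.3, -2.0]])
y = np.array([1, -1, -1])
f = lambda x: np.dot(np.array([0.5, 1.0]), x) - 0.2
print(empirical_risk(f, X, y))  # 0.0: all three points classified correctly
```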
- Examples (each φ is a function of the margin wy):
  - φ(wy) = (1 − wy)² (least squares)
  - φ(wy) = max(1 − wy, 0)² (modified least squares)
  - φ(wy) = max(1 − wy, 0) (SVM)
  - φ(wy) = ln(1 + e^{−wy}) (logistic regression)
  [Figure: plots of the four surrogate losses as functions of wy.]
- Efficient algorithms exist for minimizing Σ_{i=1}^N φ(f(x_i) y_i) with φ convex.
- In practice, these algorithms achieve greater success than might be expected from a numerically motivated approximation to the true classification error.
- We provide a theoretical explanation for this success, proving that (under certain general circumstances) the classification rule obtained by minimizing the empirical risk under a convex loss function produces a nearly optimal error rate.

Analysis of estimation

- We assume that the samples {(x_i, y_i)}_{i=1}^N have been independently drawn from an unknown underlying joint distribution on the random variables X and Y.
- The risk associated with a predictor f is L(f) := E_{X,Y} ℓ(f(X), Y), which we approximate by the empirical average (1/N) Σ_{i=1}^N ℓ(f(x_i), y_i).
- The ideal predictor is called the Bayes classifier, which classifies to 1 if η(x) := P(Y = 1 | X = x) ≥ 0.5 and to −1 otherwise. The Bayes error is L(f) for any Bayes estimator f.
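The Bayes classifier is easy to exhibit in a toy model where η is known. The following sketch (Python; the two-Gaussian model is an illustration, not from the paper) estimates the Bayes error by Monte Carlo:

```python
import numpy as np

# Toy model: X | Y=1 ~ N(1, 1), X | Y=-1 ~ N(-1, 1), P(Y = 1) = 0.5.
# Then eta(x) = P(Y=1 | X=x) = 1 / (1 + exp(-2x)) by Bayes' rule.
def eta(x):
    return 1.0 / (1.0 + np.exp(-2.0 * x))

def bayes_classify(x):
    return 1 if eta(x) >= 0.5 else -1

rng = np.random.default_rng(0)
ys = rng.choice([-1, 1], size=100_000)
xs = rng.normal(loc=ys.astype(float), scale=1.0)
err = np.mean([bayes_classify(x) != y for x, y in zip(xs, ys)])
print(err)  # about 0.159, the Bayes error P(N(0,1) > 1) for this model
```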
- One example of a Bayes estimator is 2η(x) − 1.

Some definitions

- Define the convex risk of a predictor f by Q(f) := E_{X,Y} φ(f(X)Y). Conditioning on X, we may rewrite Q as

      Q(f) = E_X [ η(X) φ(f(X)) + (1 − η(X)) φ(−f(X)) ].

- This motivates the definition of a function Q : [0, 1] × R → R defined by Q(η, w) := η φ(w) + (1 − η) φ(−w).
- We define the functions f*_φ : [0, 1] → R and Q* : [0, 1] → R by

      f*_φ(η) = argmin_{w ∈ R} Q(η, w)  and  Q*(η) = Q(η, f*_φ(η)).

The Convex Risk Q

- Recall Q(η, w) := η φ(w) + (1 − η) φ(−w).
- Observe that f*_φ(η(x)) minimizes Q(f) (among all measurable functions f), since it minimizes Q pointwise.
- Q(η, ·) inherits convexity from φ. For λ ∈ [0, 1],

      Q(η, λw_1 + (1 − λ)w_2) = η φ(λw_1 + (1 − λ)w_2) + (1 − η) φ(−λw_1 − (1 − λ)w_2)
        ≤ η [λ φ(w_1) + (1 − λ) φ(w_2)] + (1 − η) [λ φ(−w_1) + (1 − λ) φ(−w_2)]
        = λ Q(η, w_1) + (1 − λ) Q(η, w_2).
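These definitions can be checked numerically. For the logistic loss φ(v) = ln(1 + e^{−v}), the closed forms f*_φ(η) = ln(η/(1 − η)) and Q*(η) = −η ln η − (1 − η) ln(1 − η) are known; the sketch below (Python with SciPy; illustrative, not from the paper) recovers them by one-dimensional minimization:

```python
import numpy as np
from scipy.optimize import minimize_scalar

phi = lambda v: np.log1p(np.exp(-v))             # logistic surrogate loss

def Q(eta, w):
    """Pointwise convex risk Q(eta, w) = eta*phi(w) + (1-eta)*phi(-w)."""
    return eta * phi(w) + (1 - eta) * phi(-w)

def f_star(eta):
    """Numerical f*_phi(eta) = argmin over w of Q(eta, w)."""
    return minimize_scalar(lambda w: Q(eta, w), bounds=(-20, 20),
                           method="bounded").x

eta = 0.8
w = f_star(eta)
print(w, np.log(eta / (1 - eta)))                # both ~ 1.386 (the logit)
print(Q(eta, w),                                 # Q*(eta) matches the binary
      -eta*np.log(eta) - (1-eta)*np.log(1-eta))  # entropy of eta
```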
Examples

- For example, let φ(v) = (1 − v)². We will compute f*_φ and Q*.
- We calculate

      Q(η, w) = η φ(w) + (1 − η) φ(−w) = η(1 − w)² + (1 − η)(1 + w)² = 1 + (−4η + 2)w + w².

- As a function of w, this is a parabola whose minimum is achieved at f*_φ(η) = −(−4η + 2)/2 = 2η − 1. Plugging back in, Q*(η) = 4η(1 − η).

More Examples

[Figure: plots of f*_φ(η) for the least squares, SVM, and logistic losses; each is positive for η > 1/2 and negative for η < 1/2.]

- These examples suggest why the corresponding classifiers achieve the Bayes error rate: in each case, f*_φ(η) > 0 when η > 1/2.
- Some technical issues to work out:
  - f*_φ(η(x)) does not necessarily belong to the set C of candidate functions.
  - f*_φ(η(x)) is not necessarily the unique minimizer of Q(f). Therefore, if the convex minimization algorithm converges to a different minimizer of Q, the resulting estimator may not achieve the Bayes error rate.
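The least squares computation above can also be verified symbolically (a sketch using SymPy; not part of the paper):

```python
import sympy as sp

eta, w = sp.symbols("eta w", real=True)
phi = lambda v: (1 - v) ** 2                  # least squares surrogate
Q = sp.expand(eta * phi(w) + (1 - eta) * phi(-w))
print(Q)                                      # equals 1 + (2 - 4*eta)*w + w**2

f_star = sp.solve(sp.diff(Q, w), w)[0]        # vertex of the parabola
print(f_star)                                 # 2*eta - 1
print(sp.expand(Q.subs(w, f_star)))           # 4*eta - 4*eta**2 = 4*eta*(1 - eta)
```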
A theorem relating L(f) and Q(f)

- To deal with these issues, we need a theorem relating the risk L(f) under the true penalty ℓ and the risk Q(f) under the convex penalty φ.
- Define ∆Q(f) := Q(f) − Q(f*_φ(η(x))), the difference between the convex risk of a function f and the minimum convex risk, and define the pointwise analogue ∆Q(η, w) := Q(η, w) − Q*(η).
- Theorem 1. If f*_φ(η) > 0 for all η > 1/2, and if there exist c > 0 and s ≥ 1 such that for all η ∈ [0, 1] we have |1/2 − η|^s ≤ c^s ∆Q(η, 0), then

      L(f) − L* ≤ 2c ∆Q(f)^{1/s},

  where L* is the Bayes error rate.
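The hypothesis of Theorem 1 is easy to probe numerically. For the hinge loss (SVM) one can compute ∆Q(η, 0) = |2η − 1|, so the hypothesis holds with c = 1/2 and s = 1 (these particular constants are our illustration, not quoted from the paper); the grid check below is an illustration, not a proof:

```python
import numpy as np

phi = lambda v: np.maximum(1 - v, 0)               # hinge (SVM) loss
Q = lambda eta, w: eta * phi(w) + (1 - eta) * phi(-w)

etas = np.linspace(0, 1, 1001)
# For the hinge loss f*_phi(eta) = sign(2*eta - 1), so Q*(eta) = Q(eta, sign).
Q_star = Q(etas, np.sign(2 * etas - 1))
dQ0 = Q(etas, 0.0) - Q_star                        # DeltaQ(eta, 0)

# Theorem 1 hypothesis with c = 1/2, s = 1: |1/2 - eta| <= (1/2)*DeltaQ(eta, 0).
print(np.all(np.abs(0.5 - etas) <= 0.5 * dQ0 + 1e-12))  # True
```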
The Proof

- Note that due to the symmetry of Q under (η, w) 7→ (1 − η, −w), we may always assume that f*_φ(1 − η) = −f*_φ(η).
- We will show that under the hypotheses of the theorem,

      (2η(x) − 1)w ≤ 0 =⇒ Q(η(x), 0) ≤ Q(η(x), w).

- Three cases:
  - η > 1/2. We get w ≤ 0 from (2η(x) − 1)w ≤ 0, and by hypothesis f*_φ(η) > 0. We choose λ ∈ [0, 1] so that λw + (1 − λ)f*_φ(η) = 0 and apply the convexity of Q(η, ·): Q(η, 0) ≤ λQ(η, w) + (1 − λ)Q*(η) ≤ Q(η, w).
  - η < 1/2. The antisymmetry of f*_φ shows that f*_φ(η) < 0, and the rest of the argument proceeds as in the case η > 1/2.
  - η = 1/2. Since f*_φ(η) = 0 by antisymmetry, we have Q(η, 0) ≤ Q(η, w) by the definition of f*_φ(η).
- We use the notation E_A[g], where A is a subset of the underlying state space Ω, for the expectation E[g · 1_A]; here 1_A denotes the indicator function of the set A.
- Use the formula L(f) = E_{f(X) ≥ 0}[1 − η(X)] + E_{f(X) < 0}[η(X)] to obtain:

      L(f) − L(2η(X) − 1)
        = E_{η(X) ≥ 0.5 & f(X) < 0}[2η(X) − 1] + E_{η(X) < 0.5 & f(X) ≥ 0}[1 − 2η(X)]
        ≤ E_{(2η(X)−1)f(X) ≤ 0}[ |2η(X) − 1| ]
        ≤ 2 ( E_{(2η(X)−1)f(X) ≤ 0}[ |η(X) − 1/2|^s ] )^{1/s},  using ∫|g| dµ ≤ (∫|g|^s dµ)^{1/s} for a probability measure µ
        ≤ 2c ( E_{(2η(X)−1)f(X) ≤ 0}[ ∆Q(η(X), 0) ] )^{1/s}.

- Finally, observe that E_{(2η(X)−1)f(X) ≤ 0}[∆Q(η(X), 0)] ≤ E_X[∆Q(η(X), f(X))] = ∆Q(f), since Q(η(X), 0) ≤ Q(η(X), f(X)) on the event (2η(X) − 1)f(X) ≤ 0 and ∆Q ≥ 0 everywhere. This yields L(f) − L* ≤ 2c ∆Q(f)^{1/s}.
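The pointwise claim at the heart of the proof can be stress-tested numerically. The following randomized check (Python, logistic loss; an illustration, not a proof) samples many pairs (η, w) satisfying (2η − 1)w ≤ 0 and verifies the inequality:

```python
import numpy as np

phi = lambda v: np.log1p(np.exp(-v))                   # logistic loss
Q = lambda eta, w: eta * phi(w) + (1 - eta) * phi(-w)  # pointwise convex risk

rng = np.random.default_rng(1)
eta = rng.uniform(0, 1, 100_000)
w = rng.uniform(-5, 5, 100_000)
mask = (2 * eta - 1) * w <= 0            # pairs satisfying the hypothesis
ok = Q(eta[mask], 0.0) <= Q(eta[mask], w[mask]) + 1e-12
print(np.all(ok))                        # True
```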
An Example

- Recall the theorem statement: if f*_φ(η) > 0 for all η > 1/2, and if there exist c > 0 and s ≥ 1 such that for all η ∈ [0, 1] we have |1/2 − η|^s ≤ c^s ∆Q(η, 0), then L(f) − L* ≤ 2c ∆Q(f)^{1/s}, where L* is the Bayes error rate.
- Also, recall that for least squares we have f*_φ(η) = 2η − 1 and Q*(η) = 4η(1 − η). Therefore,

      ∆Q(η, w) = [1 + (−4η + 2)w + w²] − 4η(1 − η) = (2η − 1 − w)²,

  where the bracketed term is Q(η, w) and the subtracted term is Q*(η).
- This implies (η − 1/2)² = (1/4) ∆Q(η, 0), so in the theorem statement we may take c = 1/2 and s = 2. We find L(f) − L* ≤ √∆Q(f).
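For this example the bound can be seen in action. In the two-Gaussian toy model from earlier (an illustration, not from the paper), the sketch below estimates both sides of L(f) − L* ≤ √∆Q(f) by Monte Carlo for an arbitrary predictor f:

```python
import numpy as np

rng = np.random.default_rng(2)
ys = rng.choice([-1, 1], size=200_000)
xs = rng.normal(loc=ys.astype(float), scale=1.0)   # X | Y=y ~ N(y, 1)

eta = lambda x: 1.0 / (1.0 + np.exp(-2.0 * x))     # P(Y=1 | X=x), closed form
f = lambda x: np.tanh(0.5 * (x - 0.5))             # an arbitrary predictor

# Left side: excess classification risk L(f) - L*.
L_f = np.mean(np.sign(f(xs)) != ys)
L_star = np.mean(np.sign(2 * eta(xs) - 1) != ys)

# Right side: for least squares, DeltaQ(f) = E[(2*eta(X) - 1 - f(X))^2].
dQ = np.mean((2 * eta(xs) - 1 - f(xs)) ** 2)
print(L_f - L_star, "<=", np.sqrt(dQ))             # the bound holds
```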
Implications of the theorem

- Ensuring that ∆Q(f) is small guarantees that the classification error is close to the Bayes error rate.
- If (f_n)_{n=1}^∞ is a sequence of estimators for which ∆Q(f_n) → 0, then the error rate of the predictor f_n approaches the Bayes rate as n → ∞.
- Let f_n be the predictor resulting from minimizing Σ_{i=1}^n φ(f(x_i) y_i). To ensure that ∆Q(f_n) → 0, the function class C needs to be (the sketch below illustrates the tradeoff):
  - expressive enough to admit a function close to f*_φ, or some other function for which ∆Q(f) is small (approximation error);
  - not so expressive that the estimator f_n experiences a large variance due to the finite sample size used in the empirical risk (estimation error).
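A toy simulation (Python; the data model and the polynomial classes are illustrative, not from the paper) makes the tradeoff concrete: low-degree polynomial classifiers underfit (approximation error), while very high degrees fit the small sample closely yet can generalize poorly (estimation error). Exact numbers depend on the random seed:

```python
import numpy as np

rng = np.random.default_rng(3)

def sample(n):
    x = rng.uniform(-3, 3, n)
    eta = 1 / (1 + np.exp(-3 * np.sin(2 * x)))       # a nonlinear eta(x)
    y = np.where(rng.uniform(size=n) < eta, 1, -1)
    return x, y

x_tr, y_tr = sample(40)
x_te, y_te = sample(10_000)

for degree in (1, 7, 35):
    # Fit by the least squares surrogate: since y_i^2 = 1, minimizing
    # sum (1 - y_i f(x_i))^2 is the same as minimizing sum (y_i - f(x_i))^2.
    A_tr = np.vander(x_tr, degree + 1)
    coef, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    train = np.mean(np.sign(A_tr @ coef) != y_tr)
    test = np.mean(np.sign(np.vander(x_te, degree + 1) @ coef) != y_te)
    print(degree, round(train, 3), round(test, 3))   # train vs held-out error
```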
Approximation Error

- To make sure the function class is sufficiently large, it is enough to ensure that the function class is dense in the set of continuous functions.
- More specifically, let U ⊂ R^p be Borel measurable, and let C(U) be the Banach space of continuous functions on U under the uniform-norm topology.
- Theorem 2. If C is dense in C(U), then for any regular probability measure µ for which µ(U) = 1 and any measurable conditional probability η(x) = P(Y = 1 | X = x), we have inf_{f ∈ C} ∆Q(f) = 0, where (X, Y) is distributed according to (µ, η).

Estimation error

- We use reproducing kernel Hilbert spaces to define a chain C_1 ⊂ C_2 ⊂ C_3 ⊂ ⋯ of function classes for the estimators f_1, f_2, f_3, ….
- More specifically, we begin with a positive symmetric kernel K(a, b) and define the RKHSs as closures of the linear combinations of the functions K(x_i, ·) (where the x_i are the input values).
- We then calculate the estimators f_n using regularization (with the Hilbert space norm):

      f_n = argmin_{f ∈ C_n} [ (1/n) Σ_{i=1}^n φ(f(x_i) y_i) + (λ_n/2) ‖f‖² ],

  where λ_n is a regularization parameter.
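For the least squares surrogate, this regularized estimator has a closed form once the problem is reduced to finitely many variables (the reduction is the subject of the next slide). A sketch, assuming an RBF kernel; all helper names are illustrative:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """K(a, b) = exp(-gamma * ||a - b||^2), a positive symmetric kernel."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def fit(X, y, lam=0.1):
    """Minimize (1/n) sum (1 - y_i f(x_i))^2 + (lam/2) ||f||^2 over the span
    of the K(x_i, .). Writing f = sum_j alpha_j K(x_j, .), the first-order
    condition gives (K + (n * lam / 2) * I) alpha = y."""
    K = rbf_kernel(X, X)
    n = len(y)
    return np.linalg.solve(K + (n * lam / 2) * np.eye(n), y.astype(float))

def predict(alpha, X_train, X_new):
    return rbf_kernel(X_new, X_train) @ alpha        # f(x); classify by sign

# Toy usage: two noisy clusters.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(1, 1, (50, 2)), rng.normal(-1, 1, (50, 2))])
y = np.array([1] * 50 + [-1] * 50)
alpha = fit(X, y)
print(np.mean(np.sign(predict(alpha, X, X)) == y))   # training accuracy
```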
- This infinite-dimensional optimization problem reduces to a finite-dimensional one (by the representer theorem, the minimizer is a linear combination of the functions K(x_i, ·)).

Questions?