Functional Bregman Divergence,
Bayesian Estimation of Distributions
and Completely Lazy Classifiers
Maya R. Gupta, Dept. of Electrical Engineering, University of Washington
Sergey Feldman, Dept. of Electrical Engineering, University of Washington
Santosh Srivastava, Fred Hutch Cancer Research Center
Bela A. Frigyik, Dept. of Mathematics, Purdue University
Eric Garcia, Dept. of Electrical Engineering, University of Washington

1
this talk:
Bregman divergence
Functional Bregman divergence
Bayesian estimation of distributions
Uniform
Gaussian Bayesian QDA
Local BDA
Gaussian mixture
completely lazy learning
2
The mean minimizes average squared error

Let x1, x2, ..., xN ∈ R^n.

A* = arg min_{A ∈ R^n} (1/N) Σ_j (||x_j − A||_2)^2

Then, A* = (1/N) Σ_j x_j

(Figure: sample points x1, ..., x4 and their mean A*.)

3
The mean minimizes average squared error

Let x1, x2, ..., xN ∈ R^n.

A* = arg min_{A ∈ R^n} (1/N) Σ_j (||x_j − A||_2)^2

Then, A* = (1/N) Σ_j x_j

Are there other distortion functions that yield the sample mean?

(Figure: sample points x1, ..., x4 and their mean A*.)

4
The mean minimizes average Bregman divergence
(Banerjee et al. JMLR ’05, IEEE Trans. on Info Theory ’05)

Let x1, x2, ..., xN ∈ R^n.
Let d(x, y) be any Bregman divergence.

A* = arg min_{A ∈ R^n} (1/N) Σ_j d(x_j, A)

Then, A* = (1/N) Σ_j x_j

(Figure: sample points x1, ..., x4 and their mean A*.)

5
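The Banerjee et al. result is easy to check numerically; here is a minimal sketch using the generalized KL divergence (the Bregman divergence generated by φ(x) = x log x) on scalar data. The data values and grid search are illustrative assumptions, not from the talk:

```python
from math import log

def gen_kl(x, y):
    # Bregman divergence of phi(x) = x*log(x) on positive scalars:
    # d(x, y) = phi(x) - phi(y) - phi'(y)(x - y) = x*log(x/y) - x + y
    return x * log(x / y) - x + y

data = [0.5, 1.2, 2.0, 3.3, 4.0]
mean = sum(data) / len(data)

# grid-search the candidate A minimizing the average divergence
grid = [0.01 * k for k in range(10, 1001)]
best = min(grid, key=lambda a: sum(gen_kl(x, a) for x in data))
# best lands on the sample mean (2.2), as the theorem predicts
```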
Bregman divergence between vectors
Class of distortion functions, including:
sum of squared errors
relative entropy
Itakura-Saito distance
etc.
General formula:
dφ (x, y) = φ(x) − φ(y) − ∇φ(y)T (x − y),
x, y ∈ Rn
φ is a convex function.
Total squared error: φ(x) = Σ_i x[i]^2

Relative entropy: φ(x) = Σ_i x[i] log x[i]
6
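Both generators can be dropped into the general formula directly; a small self-contained check (the vectors x and y here are made-up probability vectors):

```python
from math import log

def bregman(phi, grad_phi, x, y):
    # d_phi(x, y) = phi(x) - phi(y) - grad_phi(y)^T (x - y)
    g = grad_phi(y)
    return phi(x) - phi(y) - sum(gi * (xi - yi) for gi, xi, yi in zip(g, x, y))

x = [0.2, 0.5, 0.3]
y = [0.4, 0.4, 0.2]

# phi(x) = sum_i x[i]^2  ->  d_phi is the total squared error ||x - y||^2
d_sq = bregman(lambda v: sum(vi * vi for vi in v),
               lambda v: [2 * vi for vi in v], x, y)

# phi(x) = sum_i x[i] log x[i]  ->  d_phi is the relative entropy
# (the leftover -x + y terms cancel because x and y both sum to 1)
d_kl = bregman(lambda v: sum(vi * log(vi) for vi in v),
               lambda v: [log(vi) + 1 for vi in v], x, y)
```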
Class of distortion functions, including:
sum of squared errors
relative entropy
Itakura-Saito distance
etc.
General formula:
dφ (x, y) = φ(x) − φ(y) − ∇φ(y)T (x − y),
x, y ∈ Rn
φ is a convex function.
dφ(x, y) is the tail of the first-order Taylor expansion of φ around y:
φ(x) = φ(y) + ∇φ(y)T (x − y) + dφ (x, y)
7
Relationship to Bayesian Estimation
Let d(x, y) be any Bregman divergence.
Goal: Estimate a parameter θ̂ ∈ R,
Given candidates θ ∈ R and posterior p(θ).
Consider the Bayesian estimate with d as the risk function:

θ* = arg min_{θ̂ ∈ R} ∫_θ p(θ) d(θ, θ̂) dθ = arg min_{θ̂ ∈ R} EΘ[d(Θ, θ̂)]
8
Relationship to Bayesian Estimation
Let d(x, y) be any Bregman divergence.
Goal: Estimate a parameter θ̂ ∈ R,
Given candidates θ ∈ R and posterior p(θ).
Consider the Bayesian estimate with d as the risk function:

θ* = arg min_{θ̂ ∈ R} ∫_θ p(θ) d(θ, θ̂) dθ = arg min_{θ̂ ∈ R} EΘ[d(Θ, θ̂)]

Say you flip 1 tail and 9 heads. Let θ = P(tails).

(Figure: likelihood of θ given 1 tail and 9 heads, plotted as p(θ) vs. θ.)

9
Relationship to Bayesian Estimation
Let d(x, y) be any Bregman divergence.
Goal: Estimate a parameter θ̂ ∈ R,
Given candidates θ ∈ R and posterior p(θ).
Consider the Bayesian estimate with d as the risk function:

θ* = arg min_{θ̂ ∈ R} ∫_θ p(θ) d(θ, θ̂) dθ = arg min_{θ̂ ∈ R} EΘ[d(Θ, θ̂)]

Then, Banerjee et al.'s theorem says the minimizer is the mean:

θ* = EΘ[Θ]

Say you flip 1 tail and 9 heads. Let θ = P(tails).
θ* = .1666
θ̂_MLE = .1
10
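The θ* = .1666 on this slide is the posterior mean under a uniform prior on θ (an assumption for this sketch; the slide does not state its prior): the posterior is then Beta(2, 10), whose mean is 2/12:

```python
# 1 tail, 9 heads; theta = P(tails)
tails, heads = 1, 9

# uniform prior Beta(1, 1) -> posterior Beta(tails + 1, heads + 1), mean a/(a+b)
post_mean = (tails + 1) / (tails + 1 + heads + 1)
mle = tails / (tails + heads)
# post_mean = 0.1666..., mle = 0.1, matching the slide
```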
Estimation of Distributions
Goal: Estimate a distribution fˆ(x)
Given candidates f : R → R, and posterior p(f ).
Ex: Given samples {2, 3, 7, 8}, estimate the
generating uniform distribution U [0, a].
11
Estimation of Distributions
Goal: Estimate a distribution fˆ(x)
Given candidates f : R → R, and posterior p(f ).
Ex: Given samples {2, 3, 7, 8}, estimate the
generating uniform distribution U [0, a].
fˆ(x)ML = U [0, 8]
12
Estimation of Distributions
Goal: Estimate a distribution fˆ(x)
Given candidates f : R → R, and posterior p(f ).
Ex: Given samples {2, 3, 7, 8}, estimate the
generating uniform distribution U [0, a].
Bayesian parameter estimate:
arg min_{â ∈ R+} EA[(A − â)²]
13
Estimation of Distributions
Goal: Estimate a distribution fˆ(x)
Given candidates f : R → R, and posterior p(f ).
Ex: Given samples {2, 3, 7, 8}, estimate the
generating uniform distribution U [0, a].
Bayesian parameter estimate:
arg min_{â ∈ R+} EA[(A − â)²]
Assume a gamma prior for A:

⇒ U[0, (1/γ0) · P(χ²_{γ1} < 2/(γ2 Xmax)) / P(χ²_{γ3} < 2/(γ4 Xmax))]
14
Bayesian Estimation of Distributions (Frigyik,Gupta, Srivastava ’06)
Goal: Estimate a distribution fˆ(x)
Given candidates f : R → R, and posterior p(f ).
Ex: Given samples {2, 3, 7, 8}, estimate the
generating uniform distribution U [0, a].
f*(x) = arg min_{f̂(x)} EF[d(F, f̂)] ≡ arg min_{f̂(x)} ∫_{f ∈ U} d(f, f̂) p(f) dU
15
Bayesian Estimation of Distributions (Gupta and Srivastava ’06)
Goal: Estimate a distribution fˆ(x)
Given candidates f : R → R, and posterior p(f ).
Ex: Given samples {2, 3, 7, 8}, estimate the
generating uniform distribution U [0, a].
f*(x) = arg min_{f̂(x)} EF[d(F, f̂)]  ≡?  EF[F]  if d is a Bregman divergence?

What is a Bregman divergence between functions?
16
Bregman Divergence Definitions
Bregman Divergence: for vectors x, y ∈ Rn , convex function φ,
dφ (x, y) = φ(x) − φ(y) − ∇φ(y)T (x − y),
Pointwise Bregman Divergence: for functions f(t), g(t)
(Jones and Byrne 1990, Csiszar 1995)

dφ(f, g) = ∫_t dφ(f(t), g(t)) dν(t)

arg min_{f̂(x)} EF[d(F, f̂)]  ≡?  EF[F]
17
Bregman Divergence Definitions
Bregman Divergence: for vectors x, y ∈ Rn , convex function φ,
dφ (x, y) = φ(x) − φ(y) − ∇φ(y)T (x − y),
Pointwise Bregman Divergence: for functions f(t), g(t)
(Jones and Byrne 1990, Csiszar 1995)

dφ(f, g) = ∫_t dφ(f(t), g(t)) dν(t)
Functional Bregman Divergence: (Srivastava, Gupta, Frigyik, JMLR 06 )
f, g : Rn → R and f, g ≥ 0, and f, g ∈ Lp (ν)
φ : Lp (ν) → R, strictly convex functional, φ ∈ C 2
dφ (f, g) = φ[f ] − φ[g] − δφ[g; f − g]
Frechet derivative of φ
at g in the direction of f − g
18
Functional Bregman Divergence
(arXiv: Frigyik, Srivastava, Gupta 2006 )
f, g : Rn → R and f, g ≥ 0, and f, g ∈ Lp (ν)
φ : Lp (ν) → R, strictly convex functional, φ ∈ C 2
dφ (f, g) = φ[f ] − φ[g] − δφ[g; f − g]
Frechet derivative of φ
at g in the direction of f − g
Frechet derivative:

φ[g + a] − φ[g] = δφ[g; a] + ε[g, a] ||a||_{Lp(ν)}

for all a ∈ Lp(ν), with ε[g, a] → 0 as ||a||_{Lp(ν)} → 0.
19
Functional Bregman Divergence
(arXiv: Frigyik, Srivastava, Gupta 2006 )
f, g : Rn → R and f, g ≥ 0, and f, g ∈ Lp (ν)
φ : Lp (ν) → R, strictly convex functional, φ ∈ C 2
dφ (f, g) = φ[f ] − φ[g] − δφ[g; f − g]
Frechet derivative of φ
at g in the direction of f − g
Ex: total squared error φ[g] = ∫ g² dν

Compare with the vector Bregman divergence: φ(x) = Σ_i x[i]²
20
Functional Bregman Divergence
(arXiv: Frigyik, Srivastava, Gupta 2006 )
f, g : R^n → R, f, g ∈ Lp(ν)
φ : Lp (ν) → R, strictly convex functional, φ ∈ C 2
dφ (f, g) = φ[f ] − φ[g] − δφ[g; f − g]
Frechet derivative of φ
at g in the direction of f − g
Ex: total squared error φ[g] = ∫ g² dν

⇒ δφ[g; f − g] = ∫ 2g(f − g) dν

dφ(f, g) = ∫ f² dν − ∫ g² dν − ∫ 2g(f − g) dν
         = ∫ (f − g)² dν = ||f − g||²_{L2(ν)}
21
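Discretizing on a uniform grid turns the L2(ν) integrals into sums, so the derivation above can be checked numerically; the choices f(t) = t² and g(t) = t on [0, 1] are arbitrary illustrations:

```python
n_pts = 1000
dt = 1.0 / n_pts
ts = [(k + 0.5) * dt for k in range(n_pts)]
f = [t * t for t in ts]   # f(t) = t^2
g = [t for t in ts]       # g(t) = t

phi = lambda u: sum(ui * ui for ui in u) * dt                     # phi[u] = int u^2 dnu
delta_phi = sum(2 * gi * (fi - gi) for fi, gi in zip(f, g)) * dt  # delta phi[g; f - g]

d_functional = phi(f) - phi(g) - delta_phi
d_l2 = sum((fi - gi) ** 2 for fi, gi in zip(f, g)) * dt           # ||f - g||^2 in L2
# d_functional equals d_l2, and both approximate int_0^1 (t^2 - t)^2 dt = 1/30
```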
Functional Bregman Divergence
(arXiv: Frigyik, Srivastava, Gupta 2006 )
f, g : R^n → R, f, g ∈ Lp(ν)
φ : Lp (ν) → R, strictly convex functional, φ ∈ C 2
dφ (f, g) = φ[f ] − φ[g] − δφ[g; f − g]
Frechet derivative of φ
at g in the direction of f − g
Functional Bregman divergences include
pointwise Bregman divergences and more!
Ex: squared bias

φ[g] = (∫ g dν)²   ⇒   dφ(f, g) = (∫ (f − g) dν)²
22
Functional Bregman divergence
has same properties
as Bregman divergence
(arXiv: Frigyik, Srivastava, Gupta 2006 )
Non-negativity
Convexity with respect to first function
Linearity with respect to φ-functionals
Equivalence classes with respect to φ-functionals
Dual divergences by Legendre transformation
Generalized Pythagorean Inequality
23
Bayesian Estimation of Distributions
(arXiv: Frigyik, Srivastava, Gupta 2006 )
Theorem: for random function F defined on a
finite-dimensional manifold with posterior pF ,
f* = arg min_{f̂} EF[dφ[F, f̂]] ≡ EF[F]
24
Bayesian Estimation of Distributions
(arXiv: Frigyik, Srivastava, Gupta 2006 )
Theorem: for random function F defined on a
finite-dimensional manifold with posterior pF ,
f* = arg min_{f̂} EF[dφ[F, f̂]] ≡ EF[F]
e.g. parametric distribution or decomposable in terms of finite basis functions
25
Uniform Example (arXiv: Frigyik, Srivastava, Gupta)
Ex: Given samples {2, 3, 7, 8}, estimate
the generating uniform distribution U [0, a].
Let F be a random uniform distribution: U [0, a]
Let pF be the likelihood of F given N data samples.
f* = arg min_{f̂} EF[dφ[F, f̂]] ≡ EF[F]

f*(x) = [ ∫_{a=max(x,Xmax)}^∞ (1/a) (1/a^N) ||df/da||_2 da ] / [ ∫_{a=Xmax}^∞ (1/a^N) ||df/da||_2 da ]

(actually, we use the Fisher information metric for dU)
26
Bayesian Estimation of Distributions (arXiv: Frigyik, Srivastava, Gupta)
Ex: Given samples {2, 3, 7, 8}, estimate
the generating uniform distribution U [0, a].
Let F be a random uniform distribution: U [0, a]
Let pF be the likelihood of F given N data samples.
f* = arg min_{f̂} EF[dφ[F, f̂]] ≡ EF[F]

f*(x) = [ ∫_{max(x,Xmax)}^∞ (1/a) (1/a^N) da/a ] / [ ∫_{Xmax}^∞ (1/a^N) da/a ]
27
Bayesian Estimation of Distributions (arXiv: Frigyik, Srivastava, Gupta)
f*(x)_Bayesian = arg min_{f̂(x)} EF[dφ[F, f̂]] ≡ EF[F]

             = N (Xmax)^N / ( (N+1) [max(x, Xmax)]^{N+1} )
28
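For the slide's example ({2, 3, 7, 8}, so N = 4 and Xmax = 8), a quick numeric check that the closed form is a proper density, i.e. the mean of distributions integrates to 1:

```python
N, x_max = 4, 8.0   # samples {2, 3, 7, 8}

def f_star(x):
    return N * x_max ** N / ((N + 1) * max(x, x_max) ** (N + 1))

# Riemann-sum check that f_star integrates to 1; the tail decays like x^-(N+1)
dx, upper = 0.01, 400.0
total = sum(f_star((k + 0.5) * dx) * dx for k in range(int(upper / dx)))
# f_star is flat at N/((N+1)*Xmax) below Xmax, then decays polynomially
```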
Compare estimates

Let F be a random uniform distribution: U[0, a]
Let pF be the likelihood of F given the data samples.

arg min_{f̂(x)} EF[dφ[F, f̂]] = N (Xmax)^N / ( (N+1) [max(x, Xmax)]^{N+1} )

arg min_{f̂(x) ∈ U} EF[ (||F − f̂||_2)² ] = U[0, Xmax 2^{1/N}]

Bayesian parameter estimate of a, gamma prior p(a):

arg min_{â ∈ R+} EA[(A − â)²]  ⇒  U[0, (1/γ0) · P(χ²_{γ1} < 2/(γ2 Xmax)) / P(χ²_{γ3} < 2/(γ4 Xmax))]

29
Compare estimates

Simulation: Draw n random samples from U[0, 1]
Metric: Squared error between estimated dist. and U[0, 1].

(Figure: log-log plot of squared estimation error vs. number of data samples for the maximum likelihood estimate, arg min_{f̂(x) ∈ U} EF[(||F − f̂||_2)²], arg min_{â ∈ R+} EA[(A − â)²], and arg min_{f̂(x)} EF[dφ[F, f̂]].)

30
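A sketch of this simulation (simplified to ML vs. the mean-distribution estimate only; the seed, grid resolution, and trial count are arbitrary choices, not the talk's exact setup):

```python
import random
random.seed(0)

def err_ml(x_max):
    # squared L2 error of U[0, x_max] against the true U[0, 1] (closed form)
    return x_max * (1.0 / x_max - 1.0) ** 2 + (1.0 - x_max)

def err_bayes(x_max, n, dx=0.002, upper=10.0):
    # squared L2 error of the mean-distribution estimate, by Riemann sum
    c = n * x_max ** n / (n + 1.0)
    err = 0.0
    for k in range(int(upper / dx)):
        x = (k + 0.5) * dx
        f = c / max(x, x_max) ** (n + 1)
        err += (f - (1.0 if x < 1.0 else 0.0)) ** 2 * dx
    return err

n, trials = 10, 200
xmaxs = [max(random.random() for _ in range(n)) for _ in range(trials)]
ml = sum(err_ml(xm) for xm in xmaxs) / trials
bayes = sum(err_bayes(xm, n) for xm in xmaxs) / trials
# the Bayesian mean-distribution estimate has the lower average squared error
```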
this talk:
Bregman divergence
Functional Bregman divergence
Bayesian estimation of distributions
Uniform
Gaussian Bayesian QDA
Local BDA
Gaussian mixture
completely lazy learning
31
Classification set-up
Training Data T = {Xi , Yi }
Feature vectors Xi ∈ Rd for i = 1, . . . n.
Associated labels Yi ∈ G, where G is a finite set
of classes.
Given a test vector X, estimate its associated label Ŷ.
32
QDA: classifying with Gaussian models

(Figures, slides 33–35: scatter plot of class-1 and class-2 training samples in features X[1] and X[2], with an unlabeled test point "?".)

35
Bayesian: Minimizing Expected Misclassification Costs
Y = arg min_g Σ_{h=1}^G C(g, h) p(x|Y = h) P(Y = h)

Ŷ = arg min_g E[ Σ_{h=1}^G C(g, h) N_h(x) Θ_h ]
  ≡ arg min_g Σ_{h=1}^G C(g, h) E[N_h(x)] EΘ[Θ_h]

E_{μh,Σh}[N_h(x)]   (Geisser 1964)
E_{Nh}[N_h(x)]      (Srivastava, Gupta 2006)
36
Distribution-based Bayesian Minimum Expected Misclassification Cost:
(Srivastava and Gupta, IEEE ISIT (2006) )
For a test point x and class h,
E_{Nh}[N_h(x)] = ∫_M N(x) f(N|T_h) dM

Look at all possible Gaussians. N(x): prob. of the test point given some Gaussian. f(N|T_h): prob. of that Gaussian given training data and prior. dM: measure over the space of Gaussians:

dM = dμ dΣ / |Σ|^{(d+2)/2}

(differential element based on the Fisher information matrix, C. R. Rao ’45)

37
Distribution-based Bayesian Minimum Expected Misclassification Cost:
(Srivastava and Gupta, IEEE ISIT (2006) )
For a test point x and class h,
E_{Nh}[N_h(x)] = ∫_M N(x) f(N|T_h) dM

Prob. of a Gaussian given training data:

f(N|T_h) = Π_{j=1}^k N(X_j) p(N)

(likelihood of the iid training samples × prior prob. of that Gaussian)
38
Prior matters with minimum expected risk
Design goals for the prior (over the Gaussian distributions):
p(N )
1) Regularize for ill-posed likelihood to reduce estimation variance
(not a flat prior).
2) Add sensible bias.
3) Allow the estimation to converge as number of training
samples becomes infinite.
4) Lead to closed form solution.
39
Proposed Prior
p(N_h) = γ0 exp(−(1/2) trace(Σ_h^{-1} B_h)) / |Σ_h|^{q/2}   (inverted Wishart)

Σ_{h,max} = B_h / q

We set: B_h = q diag(Σ̂_{h,ML})

(Figure: prior p(N) vs. Σ, peaked at diag(Σ̂_ML).)
40
Proposed Prior
p(N_h) = γ0 exp(−(1/2) trace(Σ_h^{-1} B_h)) / |Σ_h|^{q/2}   (inverted Wishart)

(Figure: prior p(N) vs. Σ for q = 5d and q = d, both peaked at diag(Σ̂_ML).)
41
Distribution-based classifier and closed form solution
Choose the class Ŷ = g ∈ G that minimizes
Σ_{h=1}^G C(g, h) E_{Nh}[N_h(x)] EΘ[Θ_h]

Closed-form solution:

E_{Nh}[N_h(x)] = Γ((n_h+q+1)/2) (1 + (n_h/(n_h+1)) Z_h^T D_h^{-1} Z_h)^{−(n_h+q+1)/2}
                / [ π^{d/2} Γ((n_h+q−d+1)/2) |((n_h+1)/n_h) D_h|^{1/2} ]

B_h = .95 q diag(Σ̂_{ML,h}) + .05 I
D_h = S_h + B_h, and Z_h = x − x̄_h
42
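A sketch of the closed form specialized to d = 1 (so the determinant and inverse are scalars), computed in log space with math.lgamma for stability; the parameter values below are made up. One self-consistency check: as a function of x this predictive is a Student-t density, so it should integrate to 1:

```python
from math import lgamma, log, pi, exp

def bda_predictive_1d(x, xbar_h, D_h, n_h, q):
    # d = 1 case of E_N[N(x)]; D_h plays the role of S_h + B_h, Z_h = x - xbar_h
    d = 1
    quad = 1.0 + (n_h / (n_h + 1.0)) * (x - xbar_h) ** 2 / D_h
    logval = (lgamma((n_h + q + 1) / 2.0)
              - lgamma((n_h + q - d + 1) / 2.0)
              - (d / 2.0) * log(pi)
              - 0.5 * log((n_h + 1.0) / n_h * D_h)
              - ((n_h + q + 1) / 2.0) * log(quad))
    return exp(logval)

n_h, q, D_h, xbar_h = 5, 3, 2.0, 0.0
dx = 0.01
total = sum(bda_predictive_1d(-50.0 + (k + 0.5) * dx, xbar_h, D_h, n_h, q) * dx
            for k in range(10000))
```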
Distribution-based Bayesian discriminant (Srivastava, Gupta, 2006):

E_{Nh}[N_h] = Γ((n_h+q+1)/2) (1 + (n_h/(n_h+1)) Z_h^T D_h^{-1} Z_h)^{−(n_h+q+1)/2}
             / [ π^{d/2} Γ((n_h+q−d+1)/2) |((n_h+1)/n_h) D_h|^{1/2} ]

Parameter-based Bayesian discriminant (Geisser, 1964):

E_{μ,Σ}[N_h] = Γ((n_h+q−d−1)/2) (1 + (n_h/(n_h+1)) Z_h^T D_h^{-1} Z_h)^{−(n_h+q−d−1)/2}
              / [ π^{d/2} |((n_h+1)/n_h) D_h|^{1/2} Γ((n_h+q−2d−1)/2) ]

Difference: For parameter-based you need n_h > 2d − q + 1 samples for each class. If you have few samples, you are forced to use high q = more bias.

43
Bregman divergence
Functional Bregman divergence
Bayesian estimation of distributions
Uniform
Gaussian Bayesian QDA
(reduce model bias)
Local BDA
Gaussian mixture
completely lazy learning
44
Local BDA
1. Find the k samples from
each class nearest to the test sample.
2. Fit a Gaussian to the nearest k samples of each class.
3. Classify as the class that
minimizes expected misclassification costs.
Related Work:
Local Nearest Means (Mitani and Hamamoto, 2000)
SVM‐KNN (Malik et al. 2006)
45
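The three steps above can be sketched as below, on 1-d features for self-containedness; this is pure illustration (scalar features, 0-1 costs, and a small variance regularizer stand in for the paper's Bayesian discriminant):

```python
from math import exp, pi, sqrt

def local_bda_classify(x, train, k=7, costs=None):
    # 1. k nearest samples of each class, 2. Gaussian fit, 3. min expected cost
    classes = sorted(set(y for _, y in train))
    if costs is None:
        costs = {(g, h): float(g != h) for g in classes for h in classes}
    density = {}
    for h in classes:
        pts = sorted((xi for xi, y in train if y == h), key=lambda p: abs(p - x))[:k]
        mu = sum(pts) / len(pts)
        var = sum((p - mu) ** 2 for p in pts) / len(pts) + 1e-3  # regularized fit
        density[h] = exp(-(x - mu) ** 2 / (2 * var)) / sqrt(2 * pi * var)
    # choose g minimizing sum_h C(g, h) * density_h (uniform class priors)
    return min(classes, key=lambda g: sum(costs[g, h] * density[h] for h in classes))

train = [(0.1, 1), (0.2, 1), (0.3, 1), (1.9, 2), (2.1, 2), (2.3, 2)]
```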
Local BDA – 7 Neighbors

(Figures, slides 46–48: the test point "?" in feature space X[1], X[2]; the 7 nearest training samples from each class are selected and a Gaussian is fit to each class's neighbors.)

48
How do we choose the neighborhood size?
Standard: cross‐validate on training. Not truly lazy.
- Theoretically sound only if training and test are iid.
- If the training set is evolving, you must re-train.
49
How do we choose the neighborhood size?
Standard: cross‐validate on training. Not very lazy.
Proposed: average over multiple neighborhood sizes = completely lazy.
Choose the class Ŷ that solves

arg min_{g=1,...,G} Σ_{h=1}^G C(g, h) E_{Nh,K}[N_{h,K}(x)] EΘ[Θ_h]

(average the discriminant with respect to uncertainty in the Gaussian fit to the training samples and in the neighborhood size K)
50
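The per-class averaged discriminant can be sketched with a uniform weighting over a few K values standing in for the expectation over K (the particular K values and Gaussian fit here are illustrative assumptions):

```python
from math import exp, pi, sqrt

def lazy_class_discriminant(x, class_pts, ks=(3, 5, 7)):
    # average a Gaussian-fit discriminant over several neighborhood sizes K
    total = 0.0
    for k in ks:
        pts = sorted(class_pts, key=lambda p: abs(p - x))[:k]
        mu = sum(pts) / len(pts)
        var = sum((p - mu) ** 2 for p in pts) / len(pts) + 1e-3
        total += exp(-(x - mu) ** 2 / (2 * var)) / sqrt(2 * pi * var)
    return total / len(ks)

pts = [0.1, 0.2, 0.3, 0.4, 2.0]
```

No single K must be picked, so nothing is cross-validated: the classifier stays completely lazy.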
Representative Misclassification Rates on UCI datasets

Method                OCR (cv k / E_K[])   Isolet (cv k / E_K[])
Local Nearest Means   3.3 / 3.2            4.3 / 3.9
Local BDA             2.6 / 1.7            3.1 / 3.1
kNN                   3.5 / 3.5            8.7 / 6.9
DANN                  4.0 / 4.3            8.6 / 8.4
SVM-kNN               2.6 / 2.7            3.7 / 3.1
SVM                   2.9                  4.3
GMM                   10.9                 68.4
GMM BQDA              5.5                  68.4

(SVM, GMM, and GMM BQDA are not neighborhood-based, so each has a single rate per dataset.)

51
Summary
1. Proposed a functional Bregman divergence for functions f, g:
dφ (f, g) = φ[f ] − φ[g] − δφ[g; f − g]
2. Showed Bayesian distribution estimation with
functional Bregman yields mean distribution:
f*(x) = arg min_{f̂(x)} EF[dφ(F, f̂)] ≡ EF[F]
3. Demonstrated Bayesian distribution estimation on uniform.
4. Applied Bayesian Gaussian estimates locally for classification.
5. Proposed Bayesian estimate neighborhood for completely lazy classifiers.
52
idl.ee.washington.edu/papers
Bayesian Quadratic Discriminant Analysis, Srivastava, Gupta, Frigyik, Journal of Machine Learning Research, Oct. 2007.
Functional Bregman Divergence and Bayesian Estimation of Distributions, Frigyik, Gupta, Srivastava, in review, available on arXiv.
53
Extra slides
54
Fisher Information Metric (C. R. Rao ’45, Jeffreys ’46)
dM = |I(a)|^{1/2} da

I(a) is the Fisher information matrix.
For the 1-d manifold M formed by the set U,

I(a) = E_X[ (d log(1/a) / da)² ] = ∫_{x=0}^a (1/a²)(1/a) dx = 1/a²

→ dM = (1/a) da
55
Fisher Information Metric (C. R. Rao ’45, Jeffreys ’46)
At each point on the statistical manifold U define a tangent,
which specifies a tangent space.
If an inner product is defined on each tangent space,
the collection of inner products is a Riemannian metric:
< ·, · >= {< ·, · >f |f ∈ U}
Together with < ·, · >, U is a Riemannian manifold.
Riemannian metric → a natural volume element = measure.
56
The mean minimizes average squared error

Let x1, x2, ..., xN ∈ R^n.

A* = arg min_{A ∈ R^n} (1/N) Σ_j (||x_j − A||_2)²   ⇒   A* = (1/N) Σ_j x_j

Not true of ℓ2 error:

C* = arg min_{C ∈ R^n} (1/N) Σ_j ||x_j − C||_2

C* ≠ A*

C* minimizes the length of string needed to connect C* to the points.

(Figure: sample points x1, ..., x4 with the mean A* and the minimizer C*.)
57
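A 1-d illustration of the difference (in one dimension, the ℓ2-error minimizer is the median; the data below are made up):

```python
data = [0.0, 0.0, 0.0, 10.0]
grid = [0.01 * k for k in range(1001)]   # candidates 0.00 .. 10.00

a_star = min(grid, key=lambda a: sum((x - a) ** 2 for x in data))  # squared error
c_star = min(grid, key=lambda c: sum(abs(x - c) for x in data))    # plain l2 error
# a_star is the mean (2.5); c_star is the median (0.0)
```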
Bayesian Estimation of Distributions (arXiv: Frigyik, Srivastava, Gupta)
Ex: Given samples {2, 3, 7, 8}, estimate
the generating uniform distribution U [0, a].
Let F be a random uniform distribution: U [0, a]
Let pF be the likelihood of F given N data samples.
f* = arg min_{f̂} EF[dφ[F, f̂]] ≡ EF[F]

f*(x) = [ ∫_{max(x,Xmax)}^∞ (1/a) (1/a^N) da/a^{3/2} ] / [ ∫_{Xmax}^∞ (1/a^N) da/a^{3/2} ]
58
BDA discriminant acts like a regularized covariance estimate
E_{Nh}[N_h] = Γ((n_h+q+1)/2) (1 + (n_h/(n_h+1)) Z_h^T D_h^{-1} Z_h)^{−(n_h+q+1)/2}
             / [ π^{d/2} Γ((n_h+q−d+1)/2) |((n_h+1)/n_h) D_h|^{1/2} ]

Approximate the quadratic term using 1 + r ≈ e^r:

E_{Nh}[N_h] ≈ Γ((n_h+q+1)/2) exp[ −(1/2) Z_h^T Σ̃_h^{-1} Z_h ]
             / [ π^{d/2} Γ((n_h+q−d+1)/2) |((n_h+1)/n_h) D_h|^{1/2} ]

Σ̃_h = ((n_h+1)/(n_h+q+1)) (D_h / n_h)
    ≈ (1 − q/(n_h+q+1)) (S_h / n_h) + (q/(n_h+q+1)) (B_h / q)
59
Figures show the average distortion between each point A in the space and five black points: (1/5) Σ_{j=1}^5 d(x_j, A)

(Left figure: Bregman divergence with φ(x) = (||x||_2)², i.e. squared error dφ(x_j, A) = (||x_j − A||_2)².)

(Right figure: Bregman divergence with φ(x) = (||x||_2)^4, which results in a complicated divergence function dφ.)

60
Functional Bregman Divergence
(arXiv: Frigyik, Srivastava, Gupta 2006 )
f, g : Rn → R and f, g ≥ 0, and f, g ∈ Lp (ν)
φ : Lp (ν) → R, strictly convex functional, φ ∈ C 2
dφ (f, g) = φ[f ] − φ[g] − δφ[g; f − g]
Frechet derivative of φ
at g in the direction of f − g
Frechet derivative:

φ[g + a] − φ[g] = δφ[g; a] + ε[g, a] ||a||_{Lp(ν)}

for all a ∈ Lp(ν), with ε[g, a] → 0 as ||a||_{Lp(ν)} → 0.
61
Bayesian estimate if forced to be uniform
MER estimate solves an integral of (error if truth is p) × (likelihood of p).

Our example: let p be uniform from zero to a; q is uniform U[0, b].

(The equations for the MER estimate appeared as images on the original slide.)
62
Regularized Discriminant Analysis (RDA)
(Friedman 1989)
Σ̂_h(λ, γ) = (1 − γ) Σ̂_h(λ) + (γ/d) trace(Σ̂_h(λ)) I

γ controls shrinkage towards a multiple of the identity.

Σ̂_h(λ) = ((1−λ) S_h + λ S) / ((1−λ) n_h + λ n)

λ controls the degree of shrinkage of the class covariance matrix towards the pooled covariance.

(Figure: the (λ, γ) plane with QDA, LDA, and nearest-means (identity covariance) at the corners.)
63
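Friedman's two shrinkage steps are simple enough to write down directly; a sketch with plain nested lists (S_h and S are class and pooled scatter matrices, n_h and n the sample counts; the names are ours):

```python
def rda_covariance(S_h, S, n_h, n, lam, gam):
    # step 1: shrink the class scatter toward the pooled scatter (lambda)
    d = len(S_h)
    denom = (1 - lam) * n_h + lam * n
    sig = [[((1 - lam) * S_h[i][j] + lam * S[i][j]) / denom for j in range(d)]
           for i in range(d)]
    # step 2: shrink toward a multiple of the identity (gamma)
    tr = sum(sig[i][i] for i in range(d))
    return [[(1 - gam) * sig[i][j] + (gam * tr / d if i == j else 0.0)
             for j in range(d)] for i in range(d)]
```

lam = 0, gam = 0 recovers the per-class (QDA) estimate; lam = 1 moves toward LDA's pooled estimate; gam = 1 toward the scaled identity of nearest-means.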
Proposed Prior
p(N_h) = p(μ_h) p(Σ_h) = γ0 exp(−(1/2) trace(Σ_h^{-1} B_h)) / |Σ_h|^{q/2}   (inverted Wishart)

Differentiate log p(N_h) with respect to Σ_h to solve for Σ_{h,max}:

(1/2) ∂/∂Σ_h trace(Σ_h^{-1} B_h) + (q/2) ∂/∂Σ_h log |Σ_h| = 0

−Σ_{h,max}^{-1} B_h Σ_{h,max}^{-1} + q Σ_{h,max}^{-1} = 0

Σ_{h,max} = B_h / q
64
% Misclassification error results on UCI and Statlog benchmark datasets

Dataset              Local BDA   Local BDA   Local BDA   Local Nearest   k-NN
                     B = I       B = Trace   B = Diag    Means
Glass                23.44       23.78       27.67       26.17           26.56
Heart Disease        21.78       25.41       20.48       32.63           33.30
Ionosphere           5.53        9.74        35.29       9.50            13.35
Iris                 1.73        1.53        1.93        2.73            2.60
Letter Recognition   3.03        3.05        3.18        4.38            4.73
Pen Digits           2.17        2.17        2.26        2.12            2.66
Pima                 27.41       26.68       26.25       26.80           26.36
Sonar                12.40       12.45       10.10       15.60           15.55
Thyroid              3.19        3.33        3.76        3.52            5.09
Vowel                41.77       36.80       32.72       41.56           43.51

65
Apply to Nearest‐Neighbor Learning
(Gupta et al. IEEE SSP ’05)
Goal: Classify x based on its k nearest neighbors such that the expected misclassification cost is minimized.

Let θ_h be the (unknown) P(class h | neighbors).

Ideal: g* = arg min_g Σ_h Cost(g, h) θ_h

(Figure: test point X with nearby class-1 and class-2 training samples.)
66
Apply to Nearest‐Neighbor Learning
(Gupta et al. IEEE SSP ’05)
Goal: Classify x based on its k nearest neighbors such that the expected misclassification cost is minimized.

Let θ_h be the (unknown) P(class h | neighbors).

Ideal: g* = arg min_g Σ_h Cost(g, h) θ_h

Standard: g* = arg min_g Σ_h Cost(g, h) θ̂_h

(Figure: test point X with nearby class-1 and class-2 training samples.)
67
Apply to Nearest‐Neighbor Learning
(Gupta et al. IEEE SSP ’05)
Goal: Classify x based on its k nearest neighbors such that the expected misclassification cost is minimized.

Let θ_h be the (unknown) P(class h | neighbors).

Ideal: g* = arg min_g Σ_h Cost(g, h) θ_h

Standard: g* = arg min_g Σ_h Cost(g, h) θ̂_h

Minimize Expected Cost: g* = arg min_g EΘ[ Σ_h Cost(g, h) Θ_h ]

(Figure: test point X with nearby class-1 and class-2 training samples.)
68
Apply to Nearest‐Neighbor Learning
(Gupta et al. IEEE SSP ’05)
Goal: Classify x based on its k nearest neighbors such that the expected misclassification cost is minimized.

Let θ_h be the (unknown) P(class h | neighbors).

Ideal: g* = arg min_g Σ_h Cost(g, h) θ_h

Standard: g* = arg min_g Σ_h Cost(g, h) θ̂_h

Minimize Expected Cost: g* = arg min_g EΘ[ Σ_h Cost(g, h) Θ_h ]

  ≡ arg min_g Σ_h Cost(g, h) EΘ[Θ_h]

(Figure: test point X with nearby class-1 and class-2 training samples.)
69
Apply to Nearest‐Neighbor Learning
(Gupta, Cazzanti, Srivastava, IEEE SSP ’05)
Goal: Classify x based on its k nearest neighbors such that the expected misclassification cost is minimized.

Let θ_h be the (unknown) P(class h | neighbors).

Ideal: g* = arg min_g Σ_h Cost(g, h) θ_h

Standard: g* = arg min_g Σ_h Cost(g, h) θ̂_h

Minimize Expected Cost: g* = arg min_g EΘ[ Σ_h Cost(g, h) Θ_h ]

  ≡ arg min_g Σ_h Cost(g, h) EΘ[Θ_h]

  ≡ arg min_g Σ_h Cost(g, h) ( arg min_{θ̂} EΘ[d(Θ, θ̂)] )   ≡ Minimizing Expected Bregman Divergence

70
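A concrete sketch of the difference between the standard plug-in rule and the expected-cost rule, assuming a uniform Dirichlet prior on Θ so the posterior mean is the Laplace-smoothed neighbor fraction (the prior choice and cost values here are illustrative assumptions):

```python
def pick_class(counts, costs, bayes=True):
    # counts: neighbors per class; theta_h = ML fraction or Dirichlet posterior mean
    classes = sorted(counts)
    k, G = sum(counts.values()), len(classes)
    theta = {h: (counts[h] + 1) / (k + G) if bayes else counts[h] / k
             for h in classes}
    return min(classes, key=lambda g: sum(costs[g, h] * theta[h] for h in classes))

# calling a true class-1 point "2" costs 10x more than the reverse
costs = {(1, 1): 0.0, (1, 2): 1.0, (2, 1): 10.0, (2, 2): 0.0}
counts = {1: 0, 2: 3}   # all three neighbors are class 2
ml_pick = pick_class(counts, costs, bayes=False)    # plug-in picks class 2
mer_pick = pick_class(counts, costs, bayes=True)    # expected cost picks class 1
```

The smoothed EΘ[Θ_1] = 1/5 is enough, under the asymmetric costs, to flip the decision away from the plurality vote.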
How much better is MER than ML?

PMF estimate with 100 training/100 test samples on 3D Kohonen simulation.

(Figure: mean-squared error of the class pmf vs. number of neighbors k, for k-NN, tricube, and LIME weightings; shapes show the maximum-likelihood squared error as k is varied, lines show the BMER estimate.)

71
MER classification results

Classification with 1000 training/1000 test and 50,000 validation samples on 4D Kohonen simulation.

(Figure: ratio of ML cost to MER cost vs. cost threshold for k-NN, from equal costs (threshold 0.5) to unequal costs.)

72