
Basis Expansions and Regularization Continued
From Elements of Statistical Learning (Ch. 5, Part 2)
Speaker: Brian Quanz (bquanz@ittc.ku.edu)
7/3/2008
A KTEC Center of Excellence
Overview
• Nonparametric Logistic Regression
• Multidimensional Splines
• Regularization and Reproducing Kernel Hilbert Spaces
• Wavelet Smoothing
Review: Logistic Regression
• Uses the logistic (a.k.a. sigmoid) function: f(x) = 1/(1 + e^(-x))
• Goal: fit the logistic curve to the data, using an iterative procedure to compute the maximum-likelihood parameters for P(Class = 1|X = x) and P(Class = 2|X = x)
• Can be used to associate probabilities with a discriminative classifier (i.e. P(Class = 1|X = x)).
• For example, a sigmoid fit is used with Support Vector Machines (SVMs), where x is the distance from the separating hyperplane, to assign probabilities to a classification.
• H.-T. Lin, C.-J. Lin, and R. C. Weng. A note on Platt's probabilistic outputs for support vector machines. Technical report, Department of Computer Science, National Taiwan University, 2003. URL http://www.csie.ntu.edu.tw/~cjlin/papers/plattprob.ps
(Sigmoid curve figure omitted; original image from wikipedia.org)
Review: Logistic Regression - Fitting
• Maximum likelihood (two-class case):
• Sample data consists of samples x1, x2, …, xn with labels y1, y2, …, yn, where xi has dimension p and yi is 0 or 1.
• Maximize Prob(parameters and data) = P(B; X; Y) = P(Y|B; X) P(B; X)
• L(B) = P(Y|B; X) is called the likelihood function
• Then, assuming IID samples and taking the log to simplify gives the log-likelihood function:
  l(B) = Σ_i [ y_i log p(x_i; B) + (1 − y_i) log(1 − p(x_i; B)) ]
• Goal: find B that maximizes l(B); take the derivative to obtain the score equations:
  ∂l(B)/∂B = Σ_i x_i (y_i − p(x_i; B)) = 0
  – The text uses the Newton-Raphson algorithm to find the zeros
Logistic Regression – Newton-Raphson Method
• To find zeros of an arbitrary function f:
• Approximate the function at a starting point by its tangent; find the x-intercept to obtain a new starting point; repeat:
  (f(xn) − 0)/(xn − xn+1) = f′(xn)  =>  xn+1 = xn − f(xn)/f′(xn)
• Likely to converge here, since the log-likelihood is concave
• Update rule for logistic regression:
  B_new = B_old + (X^T W X)^(-1) X^T (y − p),  where W = diag[ p(x_i; B)(1 − p(x_i; B)) ]
(Newton's method illustration omitted; original image from wikipedia.org)
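A minimal numpy sketch of this Newton-Raphson (IRLS) update, assuming a design matrix X whose first column is all ones (intercept); the function name and fixed iteration count are illustrative:

    import numpy as np

    def logistic_irls(X, y, n_iter=20):
        """Newton-Raphson / IRLS for two-class logistic regression.
        X: (n, p) design matrix (include a column of ones for the intercept).
        y: (n,) vector of 0/1 labels."""
        beta = np.zeros(X.shape[1])
        for _ in range(n_iter):
            p = 1.0 / (1.0 + np.exp(-X @ beta))   # current fitted probabilities
            W = p * (1.0 - p)                     # diagonal of the weight matrix
            # Newton step: beta <- beta + (X^T W X)^{-1} X^T (y - p)
            H = X.T @ (W[:, None] * X)
            beta = beta + np.linalg.solve(H, X.T @ (y - p))
        return beta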
Nonparametric Logistic Regression
• We no longer fix the log-odds to be linear; instead we allow a smoother fit:
  log[ P(Y = 1|X = x) / P(Y = 0|X = x) ] = f(x),  so that  P(Y = 1|X = x) = e^(f(x)) / (1 + e^(f(x)))
• Fitting f(x) smoothly gives a smooth conditional probability function
• As with the smoothing spline, we penalize curvature and maximize the penalized log-likelihood:
  l(f; λ) = Σ_i [ y_i f(x_i) − log(1 + e^(f(x_i))) ] − (λ/2) ∫ {f″(t)}² dt
• As with smoothing splines, the optimal f is a finite-dimensional natural spline with knots at the unique values of x, so we can write:
  f(x) = Σ_j N_j(x) θ_j
Nonparametric Logistic Regression
• Which implies first and second derivatives:
  ∂l(θ)/∂θ = N^T (y − p) − λΩθ
  ∂²l(θ)/∂θ∂θ^T = −N^T W N − λΩ
• Where p is the N-vector with elements p_i = P(Y = 1|X = x_i), N is the basis matrix with elements N_j(x_i), and Ω is the penalty matrix defined previously
• And W is the diagonal matrix with entries P(Y = 1|X = x_i)(1 − P(Y = 1|X = x_i))
• Using the Newton-Raphson update as before gives:
  θ_new = (N^T W N + λΩ)^(-1) N^T W z,  with working response z = Nθ_old + W^(-1)(y − p),  and f_new = N θ_new
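A minimal numpy sketch of one such penalized Newton step, assuming the natural-spline basis matrix N (entries N_j(x_i)) and the curvature penalty matrix Omega have already been constructed:

    import numpy as np

    def penalized_newton_step(N, Omega, y, theta, lam):
        """One Newton-Raphson step for penalized (smoothing-spline) logistic regression."""
        f = N @ theta
        p = 1.0 / (1.0 + np.exp(-f))                   # fitted probabilities
        w = p * (1.0 - p)                              # diagonal of W
        z = f + (y - p) / w                            # working response
        A = N.T @ (w[:, None] * N) + lam * Omega       # N^T W N + lambda * Omega
        theta_new = np.linalg.solve(A, N.T @ (w * z))  # theta_new
        return theta_new, N @ theta_new                # coefficients and f_new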
Multidimensional Splines
• There are many options for how to construct multi-dimensional splines
• The most basic, not even defined in the text, is additive: simply add together the spline bases for the different dimensions
• The tensor product basis combines the bases from different dimensions through all possible products, taking one basis function from each dimension. Example for two dimensions:
  g_jk(X) = h_1j(X1) h_2k(X2),  j = 1, …, M1,  k = 1, …, M2
• Tensor product basis expansion:
  f(X) = Σ_j Σ_k θ_jk g_jk(X)
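A minimal numpy sketch of forming a tensor product basis from two one-dimensional bases; a toy monomial basis stands in for the spline bases purely for illustration:

    import numpy as np

    def basis_1d(x, degree=3):
        """Toy 1-D basis (1, x, x^2, x^3); a B-spline or natural-spline
        basis would be used in practice."""
        return np.vander(x, degree + 1, increasing=True)

    def tensor_product_basis(x1, x2):
        """All pairwise products g_jk(X) = h_1j(X1) * h_2k(X2)."""
        H1 = basis_1d(x1)                 # (n, M1)
        H2 = basis_1d(x2)                 # (n, M2)
        n = H1.shape[0]
        # outer product per observation, flattened to an (n, M1*M2) design matrix
        return (H1[:, :, None] * H2[:, None, :]).reshape(n, -1)

    # The coefficients theta_jk are then fit exactly as before, e.g. by least squares:
    # theta, *_ = np.linalg.lstsq(tensor_product_basis(x1, x2), y, rcond=None)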
Multidimensional Splines – Tensor Product
• Simply expressed as a new basis, so the same fitting applies as before, e.g. least squares
• With increasing dimension, the size of the resulting basis grows exponentially
• Selecting only the important basis functions, to deal with this problem, is discussed in ch. 9
(Figure omitted; image taken from the book)
Example Comparison: Additive and Tensor Natural Splines
(Figure omitted: comparison of additive (left) and tensor product (right) natural splines; image taken from the book)
Smoothing Splines of Higher Dimension
• Same problem as before, except x now has d dimensions:
  min_f Σ_i { y_i − f(x_i) }² + λ J[f]
• J is an appropriate penalty. The text gives an example of a two-dimensional penalty extending the one-dimensional penalty presented previously:
  J[f] = ∫∫ [ (∂²f/∂x1²)² + 2 (∂²f/∂x1∂x2)² + (∂²f/∂x2²)² ] dx1 dx2
• This optimization results in a thin-plate spline, which shares many properties with the smoothing splines presented previously
• Thin-plate splines can be generalized to higher dimensions by using the appropriate penalty J
Thin-Plate Splines
• Properties similar to the 1-D smoothing spline:
• As λ -> 0 the solution approaches an interpolating function
• As λ -> ∞ the solution approaches the least-squares linear fit
• For intermediate λ the solution is expressed as a linear expansion of basis functions, with coefficients obtained by a generalized ridge regression:
  f(x) = β0 + β^T x + Σ_j α_j h_j(x),  where h_j(x) = η(||x − x_j||) and η(z) = z² log z²
• The h_j are in fact radial basis functions, as discussed in the previous presentation
• The O(N³) computational complexity can be reduced by choosing a subset of K < N knots, giving O(NK² + K³)
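A minimal sketch, assuming SciPy is available, of fitting a regularized thin-plate spline in two dimensions with scipy.interpolate.RBFInterpolator; the synthetic data and smoothing value are illustrative, and the smoothing argument plays the role of λ:

    import numpy as np
    from scipy.interpolate import RBFInterpolator

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 1, size=(200, 2))              # 200 points in 2-D
    y = np.sin(4 * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.standard_normal(200)

    # kernel='thin_plate_spline' uses eta(r) = r^2 log(r); smoothing > 0 regularizes,
    # smoothing = 0 gives the interpolating thin-plate spline.
    tps = RBFInterpolator(X, y, kernel='thin_plate_spline', smoothing=1e-3)
    y_hat = tps(X)                                    # fitted values at the data points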
Thin-Plate Spline Example:
(Figure omitted; image taken from the book)
Additional Multidimensional Splines
• In general, there are many possibilities for multi-dimensional splines; we can use any suitably large basis expansion of different basis types together with a suitable regularizer
• E.g. tensor products of B-splines
• Additive splines are just one class, arising from an additive penalty (the f_j are univariate splines):
  J[f] = J[f1 + f2 + … + fd] = Σ_j ∫ f_j″(t_j)² dt_j
• This can be extended to bases of functions with higher-order interactions, e.g.:
  f(X) = Σ_j f_j(X_j) + Σ_{j<k} f_jk(X_j, X_k) + …
• Many choices: maximum order, which terms to include, basis type, etc. Automatic selection may be preferred (ch. 9 and 10).
Overview
• Nonparametric Logistic Regression
• Multidimensional Splines
• Regularization and Reproducing Kernel Hilbert Spaces
• Wavelet Smoothing
Regularization and Reproducing Kernel Hilbert Spaces
"This section is quite technical and can be skipped by the disinterested or intimidated reader"
• The idea is to generalize the fitting/regression problem as much as possible:
  min_{f ∈ H} [ Σ_i L(y_i, f(x_i)) + λ J(f) ]
  where L is a loss function, J is a penalty functional, and H is the space of functions on which J(f) is defined
• Motivation: a truly general penalty
• Start by considering abstract vector spaces, where the vectors can represent any number of objects: points in Euclidean space, functions, graphs, etc. As long as certain conditions are met, the same rules apply to all of them.
A General Penalty
• One set of general penalties that has been proposed is of the form:
  J(f) = ∫ |f̃(s)|² / G̃(s) ds
• Here f̃ denotes the Fourier transform of f, and G̃ is positive and approaches 0 for large s, so that high-frequency components are more heavily penalized. Under additional assumptions this has the solution:
  f(X) = Σ_{j=1}^K α_j φ_j(X) + Σ_{k=1}^N θ_k G(X − x_k)
• The φ_j span the null space of J (the set of functions on which the penalty is zero), and G is the inverse Fourier transform of G̃
Hilbert Spaces: Introduction
• An example to introduce Hilbert spaces, which is also closely related to wavelets
• Recall Fourier series: every continuous function f(x) defined on an interval of length L, 0 < x < L (let D denote this set of functions), can be expanded as a sine series:
  f(x) = Σ_{n=1}^∞ b_n sin(nπx/L)
• This defines a vector space since, given f, g in D and real constants a, b, h = af + bg can be defined pointwise as
  h(x) = a f(x) + b g(x),
  which is again a continuous function (an element of D), fulfilling the axioms of a vector space.
• The Fourier series representation can also be expressed as:
  f = Σ_{n=1}^∞ b_n u_n,  with u_n(x) = sin(nπx/L)
(Introductory material based on course notes by Professor Edwin Langmann, "Notes on Hilbert space theory", 2006. http://courses.theophys.kth.se/5A1305/hil1.pdf)
Hilbert Spaces: Introduction
• Thus the u_n represent a set of special functions in D, and from Fourier series theory every element of D can be written as a linear combination of these special functions
• There is an immediate analogy with R^N, where every vector can be written in terms of the standard basis:
  v = Σ_{n=1}^N v_n e_n
• In R^N we can also compute the components with a scalar product:
  v_n = (e_n, v)
• By analogy, a component of f has the formula:
  b_n = (u_n, f)
• This can be easily shown.
Hilbert Spaces: Introduction
• This is in fact the same as the usual expression for the Fourier series coefficients, and can be shown in the same way. We can define the scalar product of two functions in D as:
  (f, g) = (2/L) ∫_0^L f(x) g(x) dx
• With this scalar product the u_n are orthonormal in the same sense as the e_n in R^N:
  (u_m, u_n) = 1 if m = n, and 0 otherwise
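A minimal numerical sketch of this scalar product, checking that the u_n are orthonormal and recovering the coefficients b_n = (u_n, f) for an illustrative test function:

    import numpy as np

    L = 1.0
    x = np.linspace(0.0, L, 20001)

    def u(n, x):
        return np.sin(n * np.pi * x / L)

    def inner(f, g):
        """Scalar product (f, g) = (2/L) * integral_0^L f(x) g(x) dx."""
        return (2.0 / L) * np.trapz(f * g, x)

    print(inner(u(2, x), u(2, x)))   # ~1.0 (orthonormal)
    print(inner(u(2, x), u(3, x)))   # ~0.0 (orthogonal)

    f = x * (L - x)                                # continuous test function on (0, L)
    b = [inner(u(n, x), f) for n in range(1, 6)]   # Fourier sine coefficients
    f_approx = sum(bn * u(n, x) for n, bn in enumerate(b, start=1))
    print(np.max(np.abs(f - f_approx)))            # small approximation error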
Hilbert Spaces: Introduction
• Question of completeness: can every function in D be represented as a combination of the u_n? In general, orthogonality of a system of functions v_n gives the best approximation of f by the first M of them as:
  f ≈ Σ_{n=1}^M (v_n, f) v_n
• We say such a system is complete if the error
  || f − Σ_{n=1}^M (v_n, f) v_n ||
  goes to 0 as M goes to infinity.
• Thus the u_n form an orthonormal basis of D, which is of infinite dimension.
• D is therefore an infinite-dimensional vector space. Both this space and the Euclidean space R^N are special cases of the theory of Hilbert spaces.
• The Hilbert space framework gives a unified way of treating many different types of vector spaces.
Hilbert Spaces: Definition
• A Hilbert space is an inner product space that is complete under the norm defined by the inner product < ∙, ∙ >:
  ||v|| = sqrt(< v, v >)
• "Complete" means that if a sequence of vectors approaches a limit, then that limit is itself in the space.
• For example, the real numbers are complete; the rational numbers are not, since some sequences approach irrational numbers like sqrt(2).
• An inner product space is a vector space of arbitrary dimension with an inner product, which associates a scalar quantity with each pair of vectors.
• A vector space is a collection of objects having operations of vector addition and scalar multiplication and satisfying 8 axioms, such as the operations being associative, commutative, and distributive, containing an identity element, etc.
Reproducing Kernel Hilbert Space
(Slide content omitted; taken from slides by Dr. Christian Igel: http://www.neuroinformatik.ruhr-uni-bochum.de/PEOPLE/igel/LT/LT2.pdf)
Reproducing Kernel Hilbert Space
• Moore-Aronszajn Theorem: for every positive definite function K(∙, ∙) on X x X there exists a unique RKHS, and vice versa.
• This allows us to apply ideas from Euclidean geometry to non-geometric problems, so long as we can define a suitable kernel K(∙, ∙).
(Based on slides by Dr. Christian Igel: http://www.neuroinformatik.ruhr-uni-bochum.de/PEOPLE/igel/LT/LT2.pdf)
Results Presented in the Text about RKHS
• The text considers an important subclass of such problems:
  min_{f ∈ H_K} [ Σ_i L(y_i, f(x_i)) + λ ||f||²_{H_K} ]
  for which H is the space of functions H_K defined by a positive definite kernel K(x, y), a RKHS
• Suppose K has the eigen-expansion:
  K(x, y) = Σ_{i=1}^∞ γ_i φ_i(x) φ_i(y),  with γ_i ≥ 0 and Σ_i γ_i² < ∞
• Elements of H_K can then be expanded as:
  f(x) = Σ_{i=1}^∞ c_i φ_i(x)
Results Presented in the Text about RKHS
• We can define the penalty as:
  J(f) = ||f||²_{H_K} = Σ_{i=1}^∞ c_i² / γ_i
  which can be interpreted as a generalized ridge penalty, where functions associated with large eigenvalues are penalized less
• It can be shown that the solution has the finite-dimensional form:
  f(x) = Σ_{i=1}^N α_i K(x, x_i)
• It consists of the basis functions K(∙, x_i), where K(∙, x_i) is known as the representer of evaluation at x_i
Results Presented in the Text about RKHS
• Then, by the reproducing property:
  < K(∙, x_i), f >_{H_K} = f(x_i)  and  < K(∙, x_i), K(∙, x_j) >_{H_K} = K(x_i, x_j)
• And the objective function reduces to the following finite-dimensional problem, known as the kernel property in the support vector machine literature:
  min_α L(y, Kα) + λ α^T K α
• We can have the penalty apply to only a subspace of the functions in H, by penalizing only the projection of f onto that subspace
• The solution then has the form:
  f(x) = Σ_j β_j h_j(x) + Σ_{i=1}^N α_i K(x, x_i)
  (the first term represents an expansion of the unpenalized null space H_0)
RKHS Examples
• Squared-error loss:
  min_α (y − Kα)^T (y − Kα) + λ α^T K α
• This is a generalized ridge regression; the solution is obtained as:
  α = (K + λI)^(-1) y,  giving  f(x) = Σ_j α_j K(x, x_j)
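A minimal sketch of this generalized ridge solution, here with a Gaussian kernel and synthetic one-dimensional data (both choices are illustrative):

    import numpy as np

    def gaussian_kernel(A, B, nu=1.0):
        """K(x, y) = exp(-nu * ||x - y||^2), evaluated for all pairs of rows."""
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-nu * sq)

    rng = np.random.default_rng(0)
    X = rng.uniform(-2, 2, size=(100, 1))
    y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(100)

    lam = 0.1
    K = gaussian_kernel(X, X)
    alpha = np.linalg.solve(K + lam * np.eye(len(y)), y)   # alpha = (K + lambda*I)^{-1} y

    X_new = np.linspace(-2, 2, 50)[:, None]
    f_hat = gaussian_kernel(X_new, X) @ alpha              # f(x) = sum_j alpha_j K(x, x_j)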
RKHS Examples
• Penalized Polynomial Regression
• Kernel: K(x, y) = (< x, y > + 1)^d, which has M = C(p + d, d) eigen-functions spanning the polynomials in R^p of total degree d
• Example (p = 2, d = 2, M = 6):
  K(x, y) = 1 + 2 x1 y1 + 2 x2 y2 + x1² y1² + x2² y2² + 2 x1 x2 y1 y2
• Objective function (a ridge problem in the expanded feature space h(x)):
  min_β Σ_i ( y_i − h(x_i)^T β )² + λ Σ_m β_m²
• By substitution this can be expressed as the squared-error loss problem on the previous slide
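As a quick sanity check with illustrative values: for p = 1 and d = 2 the kernel K(x, y) = (xy + 1)² equals the ordinary inner product of the explicit feature vectors h(x) = (1, sqrt(2)·x, x²):

    import numpy as np

    def h(x):
        """Explicit features whose inner product reproduces K(x, y) = (x*y + 1)^2."""
        return np.array([1.0, np.sqrt(2.0) * x, x ** 2])

    x, y = 0.7, -1.3
    print((x * y + 1.0) ** 2)        # kernel evaluation
    print(h(x) @ h(y))               # identical value from the explicit expansion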
RKHS Examples
• Gaussian Radial Basis Functions
• Kernel: K(x, y) = exp(−ν ||x − y||²)
• With squared-error loss, the solution is an expansion in Gaussian radial basis functions k_m(x) = exp(−ν ||x − x_m||²) centered at the training points. As we saw earlier, a thin-plate spline is also an expansion in radial basis functions.
RKHS Examples
• Support Vector Classifiers (ch. 12): with labels y_i ∈ {−1, +1}, these arise from criteria of the form
  min_f Σ_i [ 1 − y_i f(x_i) ]_+ + (λ/2) ||f||²_{H_K}
Wavelet Smoothing
• Similar idea to the Fourier series representation, except wavelets are localized in both time and frequency
• We have a complete dictionary of orthonormal basis functions with which to represent functions
• A sparse representation is obtained by shrinking and selecting the coefficients of the basis functions, as we have seen before
Wavelet example
• Fits the basis coefficients by least squares, then thresholds the smaller coefficients, like the lasso
Wavelet Derivation
• We define father and mother wavelets; the rest of the basis is then generated from them by translations and dilations, increasing the frequency as with a Fourier series:
  ψ_{j,k}(x) = 2^(j/2) ψ(2^j x − k)
Wavelet Derivation
• For example, for the Haar wavelet:
• Father wavelet (scaling function):
  φ(x) = 1 for 0 ≤ x ≤ 1, and 0 otherwise
• Build the orthogonal mother wavelet:
  ψ(x) = +1 for 0 ≤ x ≤ 1/2, −1 for 1/2 < x ≤ 1, and 0 otherwise
• All of these basis functions are orthonormal
• The father wavelets form the basis for the coarse components of a function, and the orthogonal mother wavelets build up the detail:
  f(x) = Σ_k c_k φ(x − k) + Σ_j Σ_k d_{j,k} ψ_{j,k}(x)
• The Haar wavelet is often too coarse; many other wavelets have been invented that are smoother, such as the symmlet
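A minimal numpy sketch of the Haar father and mother wavelets and of the translated/dilated detail functions ψ_{j,k}; the function names are illustrative:

    import numpy as np

    def father(x):
        """Haar father wavelet (scaling function): 1 on [0, 1), 0 elsewhere."""
        return np.where((x >= 0) & (x < 1), 1.0, 0.0)

    def mother(x):
        """Haar mother wavelet: +1 on [0, 1/2), -1 on [1/2, 1), 0 elsewhere."""
        return (np.where((x >= 0) & (x < 0.5), 1.0, 0.0)
                - np.where((x >= 0.5) & (x < 1), 1.0, 0.0))

    def psi_jk(x, j, k):
        """Translations and dilations: psi_{j,k}(x) = 2^{j/2} * psi(2^j x - k)."""
        return 2.0 ** (j / 2.0) * mother(2.0 ** j * x - k)

    x = np.linspace(0, 1, 9)
    print(psi_jk(x, j=1, k=0))   # finer-scale detail function supported on [0, 1/2)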
Wavelet Smoothing Example
FIGURE 5.14. The top panel shows an NMR signal, with the wavelet-shrunk version superimposed in green. The lower left panel represents the wavelet transform of the original signal, down to V4, using the symmlet-8 basis. Each coefficient is represented by the height (positive or negative) of the vertical bar. The lower right panel represents the wavelet coefficients after being shrunken using the waveshrink function in S-PLUS, which implements the SureShrink method of wavelet adaptation of Donoho and Johnstone.
Adaptive Wavelet Filtering
• Suppose we have a lattice of N uniformly spaced points, y is the response vector, and W is the N x N orthonormal wavelet basis matrix evaluated at the N observations. The following is called the wavelet transform of y:
  y* = W^T y
• A popular method for adaptive wavelet fitting is known as SURE shrinkage (Stein Unbiased Risk Estimation):
  min_θ ||y − Wθ||²_2 + 2λ ||θ||_1
• This is the same as the previously seen lasso criterion
• Because W is orthonormal, it has the simple solution:
  θ_j = sign(y*_j) ( |y*_j| − λ )_+
• The fitted function is obtained from the inverse wavelet transform:
  f = Wθ
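A minimal sketch of this shrink-and-invert recipe using the PyWavelets package; it uses the simple universal threshold λ = σ·sqrt(2 log N) with a median-based noise estimate rather than the book's SureShrink choice of λ, and the test signal is illustrative:

    import numpy as np
    import pywt

    rng = np.random.default_rng(0)
    N = 1024
    t = np.linspace(0, 1, N)
    y = np.piecewise(t, [t < 0.4, t >= 0.4], [0.0, 1.0]) + 0.2 * rng.standard_normal(N)

    coeffs = pywt.wavedec(y, 'sym8')                  # y* = W^T y (symmlet-8 basis)
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745    # robust noise estimate
    lam = sigma * np.sqrt(2 * np.log(N))              # universal threshold

    # soft-threshold the detail coefficients, leave the coarse approximation alone
    coeffs[1:] = [pywt.threshold(c, lam, mode='soft') for c in coeffs[1:]]
    f_hat = pywt.waverec(coeffs, 'sym8')              # inverse wavelet transform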
Wavelets: Key Idea
• In general, any basis could be used, such as the smoothing splines we have seen before.
• The key difference is that wavelets allow localization in time as well as in frequency (roughness), and together with the L1 penalty they allow sparse solutions.
• Smoothing splines compress by imposing smoothness; wavelets compress by imposing sparsity.
Wavelet Compared to Smoothing Spline
(Comparison figures omitted)
The End
• Questions?