ICANN02 tutorial - Electrical and Computer Engineering

advertisement
SIGNAL and IMAGE DENOISING
USING VC-LEARNING THEORY
Vladimir Cherkassky
Dept. Electrical & Computer Eng.
University of Minnesota
Minneapolis MN 55455 USA
Email cherkass@ece.umn.edu
Tutorial at ICANN-2002
Madrid, Spain, August 27, 2002
2
OUTLINE
1. BACKGROUND
Predictive Learning
VC Learning Theory
VC-based model selection
2. SIGNAL DENOISING as FUNCTION ESTIMATION
Principal Issues for Signal Denoising
VC framework for Signal Denoising
Empirical comparisons:
synthetic signals
ECG denoising
3. IMPROVED VC-BASED SIGNAL DENOISING
Estimating VC-dimension implicitly
FMRI Signal Denoising
2D Image Denoising
3
4. SUMMARY and DISCUSSION
1. PREDICTIVE LEARNING and VC THEORY
• The problem of predictive learning:
GIVEN past data + reasonable assumptions
ESTIMATE unknown dependency for future predictions
• Driven by applications (NOT theory):
medicine
biology: genomics
financial engineering (i.e., program trading, hedging)
signal/ image processing
data mining (for marketing)
........
• Math disciplines:
function approximation
pattern recognition
statistics
optimization …….
4
5
MANY ASPECTS of PREDICTIVE LEARNING
• MATHEMATICAL / STATISTICAL
foundations of probability/statistics and function
approximation
• PHILOSOPHICAL
• BIOLOGICAL
• COMPUTATIONAL
• PRACTICAL APPLICATIONS
• Many competing methodologies
BUT lack of consensus (even on basic issues)
6
STANDARD FORMULATION for PREDICTIVE LEARNING
X
Generator
of X-values
Learning
f*(X)
machine
X
Z
System
y
• LEARNING as function estimation = choosing the ‘best’ function
from a given set of functions [Vapnik, 1998]
• Formal specs:
Generator of random X-samples from unknown probability distribution
System (or teacher) that provides output y for every X-value according to
(unknown) conditional distribution P(y/X)
Learning machine that implements a set of approximating functions
f(X, w) chosen a priori where w is a set of parameters
7
• The problem of learning/estimation:
Given finite number of samples (Xi, yi), choose from a given set of
functions f(X, w) the one that approximates best the true output
Loss function L(y, f(X,w)) a measure of discrepancy (error)
Expected Loss (Risk)
R(w) =  L(y, f(X, w)) dP(X, y)
The goal is to find the function f(X, wo) that minimizes R(w) when the
distribution P(X,y) is unknown.
Important special cases
(a) Classification (Pattern Recognition)
output y is categorical (class label)
approximating functions f(X,w) are indicator functions
For two-class problems common loss
L(y, f(X,w)) = 0
if y = f(X,w)
L(y, f(X,w)) = 1
if y  f(X,w)
(b) Function estimation (regression)
output y and f(X,w) are real-valued functions
Commonly used squared loss measure:
L(y, f(X,w)) = [y - f(X,w)]2
8
 relevant for signal processing applications (denoising)
WHY VC-THEORY?
• USEFUL THEORY (!?)
- developed in 1970’s
-
constructive methodology (SVM) in mid 90’s
-
wide acceptance of SVM in late 90’s
-
wide misunderstanding of VC-theory
• TRADITIONAL APPROACH to LEARNING
GIVEN past data + reasonable assumptions
ESTIMATE unknown dependency for future predictions
(a) Develop practical methodology (CART, MLP etc.)
(b) Justify it using theoretical framework (Bayesian,
ML, etc.)
(c) Report results; suggest heuristic improvements
9
(d) Publish paper (optional)
VC LEARNING THEORY
•
Statistical theory for finite-sample estimation
•
Focus on predictive non-parametric formulation
•
Empirical Risk Minimization approach
•
Methodology for model complexity control (SRM)
•
VC-theory unknown/misunderstood in statistics
References on VC-theory:
V. Vapnik, The Nature of Statistical Learning Theory, Springer 1995
10
V. Vapnik, Statistical Learning Theory, Wiley, 1998
V. Cherkassky and F. Mulier, Learning From Data: Concepts, Theory and
Methods, Wiley, 1998
11
CONCEPTUAL CONTRIBUTIONS of VC-THEORY
• Clear separation between
problem statement
solution approach (i.e. inductive principle)
constructive implementation (i.e. learning algorithm)
- all 3 are usually mixed up in application studies.
• Main principle for solving finite-sample problems
Do not solve a given problem by indirectly solving a
more general (harder) problem as an intermediate step
- usually not followed in statistics, neural networks and
applications.
Example: maximum likelihood methods
• Worst-case analysis for learning problems
Theoretical analysis of learning should be based on the
worst-case scenario (rather than average-case).
12
13
STATISTICAL vs VC-THEORY APPROACH
GENERIC ISSUES in Predictive Learning:
• Problem Formulation
Statistics: density estimation
VC-theory: application-dependent
Signal Denoising ~ real-valued function estimation
• Possible/ admissible models
Statistics: linear expansion of basis functions
VC-theory: structure
Signal Denoising ~ orthogonal bases
• Model selection (complexity control)
Statistics: resampling, analytical (AIC, BIC etc)
VC-theory: resampling, analytical (VC-bounds)
Signal Denoising ~ thresholding using statistical models
14
Structural Risk Minimization (SRM) [Vapnik, 1982]
• SRM Methodology (for model selection)
GIVEN training data and a set of possible models (aka approximating
functions)
(a) introduce nested structure on approximating functions
Note: elements Sk are ordered according to increasing complexity
So
 S1  S2  ...
Examples (of structures):
- basis function expansion (i.e., algebraic polynomials)
- penalized structure (penalized polynomial)
- feature selection (sparse polynomials)
(b) minimize Remp on each element Sk (k=1,2...) producing increasingly
complex solutions (models).
(c) Select optimal model (using VC-bounds on prediction risk), where
optimal model complexity ~ minimum of VC upper bound.
15
MODEL SELECTION for REGRESSION
• COMMON OPINION
VC-bounds are not useful for practical model selection
References:
C. Bishop, Neural Networks for Pattern Recognition.
B. D. Ripley, Pattern Recognition and Neural Networks
T. Hastie et al, The Elements of Statistical Learning
• SECOND OPINION
VC-bounds work well for practical model selections
when they can be rigorously applied.
References:
V. Cherkassky and F. Mulier, Learning from Data
V. Vapnik, Statistical Learning Theory
V. Cherkassky et al.(1999), Model selection for
regression using VC generalization bounds, IEEE Trans.
Neural Networks 10, 5, 1075-1089
• EXPLANATION (of contradiction)
NEED: understanding of VC theory + common sense
16
ANALYTICAL MODEL SELECTION for regression
• Standard Regression Formulation
y  g (x)  noise
• STATISTICAL CRITERIA
- Akaike Information Criterion (AIC)
2d ^ 2
AIC (d )  Remp (d ) 

n
where d = (effective) DoF
- Bayesian Information Criterion (BIC)
d ^2
BIC (d )  Remp (d )  (ln n) 
n
AIC and BIC require noise estimation (from data):
^ 2
^
n
1 n
 
  ( yi  yi ) 2
n  d n i 1
17
• VC-bound on prediction risk
General form (for regression)

  an  

h  ln    1  ln 

  h  
R (h)  Remp (h)1  c
n



1







Practical form
1

h h h ln n 

R(h)  Remp (h)1 
 ln 

n
n
n
2
n


where h is VC-dimension (h = effective DoF in practice)
NOTE: the goal is NOT accurate estimation of RISK
• Common sense application of VC-bounds requires
- minimization of empirical risk
- accurate estimation of VC-dimension
 first, model selection for linear estimators;
second, comparisons for nonlinear estimators.
18
EMPIRICAL COMPARISON
• COMPARISON METHODOLOGY
- specify an estimator for regression (i.e.,
polynomials, k-nn, subset selection…)
- generate noisy training data
- select optimal model complexity for this data
- record prediction error as
MSE (model, true target function)
REPEAT model selection experiments many times for
different random realizations of training data
DISPLAY empirical distrib. of prediction error (MSE) using
standard box plot notation
NOTE: this methodology works only with synthetic data
19
• COMPARISONS for linear methods/ low-dimensional
Univariate target functions
(a) Sine squared function sin 2  x 
2
1
0.8
0.6
0.4
0.2
0
0
0.2
0.4
0.6
0.8
1
(b) Piecewise polynomial
1
0.8
0.6
0.4
0.2
0
0
0.2
0.4
0.6
0.8
1
20
• COMPARISONS for sine-squared target function,
polynomial regression
0.2
Risk
(MSE)
0.1
0
AIC
BIC
SRM
BIC
SRM
15
Dof
FFf
FF
10
5
AIC
(a) small size n=30,  =0.2
0.04
Risk
(MSE)
0.02
0
AIC
BIC
SRM
AIC
BIC
SRM
14
DoF
8
(b) large size n=100,  =0.2
21
• COMPARISONS for piecewise polynomial function,
Fourier basis regression
(a) n=30,  =0.2
0.2
Risk
0.1
(MSE)
0
AIC
BIC
SRM
BIC
SRM
15
DoF
5
AIC
(b) n=30,  =0.4
0.4
Risk
(MSE)
))
0.2
0
AIC
BIC
SRM
AIC
BIC
SRM
12
Dof
F
8
4
22
2. SIGNAL DENOISING as FUNCTION ESTIMATION
6
5
4
3
2
1
0
-1
-2
0
100
200
300
400
500
600
700
800
900
1000
10
20
30
40
50
60
70
80
90
100
6
5
4
3
2
1
0
-1
-2
-3
0
23
SIGNAL DENOISING PROBLEM STATEMENT
• REGRESSION FORMULATION
Real-valued function estimation (with squared loss)
Signal Representation:
y   wi g i ( x)
i
linear combination of orthogonal basis functions
• DIFFERENCES (from standard formulation)
- fixed sampling rate (no randomness in X-space)
- training data X-values = test data X-values
• SPECIFICS of this formulation
 computationally efficient orthogonal estimators, i.e.
- Discrete Fourier Transform (DFT)
- Discrete Wavelet Transform (DWT)
24
ISSUES FOR SIGNAL DENOISING
• MAIN FACTORS for SIGNAL DENOISING
- Representation (choice of basis functions)
- Ordering (of basis functions)
- Thresholding (model selection, complexity control)
Ref: V. Cherkassky and X. Shao (2001), Signal estimation and
denoising using VC-theory, Neural Networks, 14, 37-52
•
For finite-sample settings:
Thresholding and ordering are most important
•
For large-sample settings:
representation is most important
•
CONCEPTUAL SIGNIFICANCE
- wavelet thresholding = sparse feature selection
- nonlinear estimator suitable for ERM
25
VC FRAMEWORK for SIGNAL DENOISING
• ORDERING of (wavelet ) coefficients =
= STRUCTURE on orthogonal basis functions in
f ( x, w)   wi gi ( x)
i
Traditional ordering
w 0  w1    w n 1
Better ordering
| w 0 | | w1 |
| w n 1 |
 1    n 1
0
f
f
f
Issues: other (good) orderings, i.e. for 2D signals
• VC THRESHOLDING (MODEL SELECTION)
optimal number of basis functions ~ minimum of VC-bound
1

h h h ln n 


R(h)  Remp (h)1 
 ln 
n n n 2n  

where h is VC-dimension (h = effective DoF in practice)
usually h = m (the number of chosen wavelets or DoF)
Issues: estimating VC-dimension as a function of DoF
26
WAVELET REPRESENTATION
• Wavelet basis functions : symmlet wavelet
0.1
0.05
0
-0.05
-0.1
0
0.2
0.4
0.6
0.8
1
In decomposition f ( x, w)   wi gi ( x)
i
1
basis functions are g s ,c ( x) 
s
 xc

 s 

where s is a scale parameter and c is a translation parameter.
 x  ci 
  w0
f m x, w    wi 
i 1
 si 
m1
Commonly use wavelets with fixed scale and translation parameters:
sj = 2-j, where j = 0, 1, 2, …, J-1
ck(j) = k2j, where k = 0, 1, 2, …, 2j-1
So the basis function representation has the form:

f x, w    w jk 2 j x  k
j
k

27
DISCRETE WAVELET TRANSFORM
 Given n = 2J samples (xi, yi) on [0,1] interval
 Calculate n wavelet coefficients in
n
y   wi i
i 1
 Specify ordering of coefficients {wi}, so that an estimate
m
yˆ m ( x)   w i i ,
i 1
specifies an element of VC structure.
Empirical risk (fitting error) for this estimate:
|| y ( x)  yˆ m ( x) || 2
 Determine optimal element
m̂opt
 Obtain denoised signal from
DWT.
(model complexity)
m̂opt
coefficients via inverse
28
Example: Hard and Soft Wavelet Thresholding
Wavelet Thresholding Procedure:
1) Given noisy signal obtain its wavelet coefficients wi.
2) Order coefficients according to magnitude
w 0  w1    w n 1
3) Perform thresholding on coefficients wi to obtain
4) Obtain denoised signal from
ŵi
using inverse transform
Details of Thresholding:
 For Hard Thresholding:
wi
ˆ
wi  
0
if wi  t
if wi  t
 For Soft Thresholding:
wˆ i  sgn( wi )(| wi | t )
Threshold value t is taken as:
t  2 ln n
ŵi .
(a.k.a. VISU method)
where  is estimated noise variance.
Another method for selecting threshold is SURE.
29
EMPIRICAL COMPARISONS for denoising
• SYNTHETIC SIGNALS
true (target) signal is known
easy to make comparisons between methods
6
Blocks
4
2
y
0
-2
-4
-6
Heavisine
0
0.2
0.4
0.6
0.8
1
t
• REAL – LIFE APPLICATIONS
true signal is unknown
methods’ comparison becomes tricky
30
COMPARISONS for Blocks and Heavisine signals
• Experimental setup
Data set: 128 noisy samples with SNR = 2.5
Prediction accuracy: NRMS error (truth, estimate)
Estimation methods:
VISU(soft thresh), SURE(hard thresh) and SRM
Note: SRM = VC Thresholding = Vapnik's Method (VM)
All methods use symmlet wavelets.
0.1
0.05
0
-0.05
-0.1
0
0.2
Symmlet 'mother' wavelet
0.4
0.6
0.8
1
31
• Comparison results
0.8
NRMS for Blocks Function
100
DOF for Blocks Function
0.6
50
0.4
0.2
0
visu
sure
vm
visu
NRMS for Heavisine Function
80
0.6
60
0.4
40
0.2
20
0
visu
sure
(c)
vm
(b)
(a)
0.8
sure
vm
0
DOF for Heavisine Function
visu
sure
(d)
vm
32
VISU method
33
SURE method
34
VC-based denoising
35
ELECTRO-CARDIOGRAM (ECG) DENOISING
Ref: V. Cherkassky and S. Kilts (2001), Myopotential denoising of ECG
signals using wavelet thresholding methods, Neural Networks, 14, 11291137
Observed ECG = ECG + myopotential_noise
36
CLEAN PORTION of ECG:
37
NOISY PORTION of ECG
38
RESULT of VC-BASED WAVELET DENOISING
Sample size 1024, chosen DOF = 76, MSE = 1.78e5
Reduced noise, better identification of P, R, and T waves
vs SOFT THRESHOLDING (best wavelet method)
DOF = 325, MSE = 1.79e5
NOTE: MSE values refer to the noisy section only
39
4. IMPROVED VC-BASED SIGNAL DENOISING
Need correct VC-dim when using VC-bounds
 VC-dimension for nonlinear estimators
 VC-dimension = DoF only hold true for linear estimator
 Wavelet thresholding = nonlinear estimator
 Estimate ‘correct’ VC-dimension as:
h
1

m
 How to find good δ value ?
Standard VC-denoising: VC-dimension = DoF (h = m)
Improved VC-denoising: using ‘good’ δ
estimating VC-dimension in VC-bounds
value for
40
EMPIRICAL APPROACH for ESTIMATING VC-DIM.
 An IDEA: use known form of VC-bound to estimate
optimal dependency δ opt (m,n) for several synthetic
signals.
 HYPOTHESIS: if VC-bounds are good, then using
estimated dependency δ opt (m,n) will provide good
denoising accuracy for all other signals as well.
 IMPLEMENTATION: For a given known noisy signal
- using specific δ in VC-bound yields DoF value m*(δ);
- we can find opt. DoF value mopt (since the target is known)
- then opt. δ can be found from equality mopt ~ m*(δ)
- the problem is dependence on random noisy signal.
 STABLE DEPENDENCY between δopt and mopt
 stable relation exists independent of noisy signal
 opt   n (mopt )
 If mopt/n is less than 0.2, the relation is linear:
41
 opt   n (mopt )  a0 (n)  a1 (n)
mopt
n
,
if
mopt
n
 0.2
42
EMPIRICAL RESULTS
 Estimation of the relation between δopt and mopt.
Target Signals Used
0.4
3
0.3
2
0.2
1
0.1
0
f(x)
4
f(x)
0.5
0
-1
-0.1
-2
-0.2
-3
-0.3
-4
-0.4
-5
-0.5
0
0.2
0.4
0.6
0.8
-6
0
1
0.2
0.4
0.6
0.8
1
0.8
1
x
x
(b) heavisine
(a) doppler
2
6
1.5
5
1
0.5
f(x)
f(x)
4
3
0
-0.5
2
-1
1
0
0
-1.5
0.2
0.4
0.6
0.8
1
-2
0
0.2
x
(c) spires
0.4
0.6
x
(d) dbsin
 Samples Size: 512 pts and 2048 pts
 SNR Range: 2-20dB (for 512pts), 2-40dB (for 2048 pts)
43
 Linear Approximation
1
Empirical data
Linear approximation
0.95
0.9
 opt
0.85
0.8
0.75
0.7
0.65
0.6
0.04
0.06
0.08
0.1
m
0.12
/n
0.14
0.16
0.18
0.2
opt
(a) n = 512
1
Empirical data
Linear approximation
0.95
0.9
0.85
 opt
0.8
0.75
0.7
0.65
0.6
0.55
0.5
0.05
0.1
m /n
opt
(b) n = 2048
0.15
0.2
44
45
VC-dimension Estimation
VC-dimension as a function of DoF(m) for ordering:
| w 0 | | w1 |
| w n 1 |
 1    n 1
0
f
f
f
for n = 2048 samples
1400
1200
1000
h
800
600
400
200
0
0
100
200
300
400
500
m
n=2048pts
NOTE: when m/n is small (< 5%) then h  m.
600
46
PROCEDURE for ESTIMATING δopt
 For a given noisy signal, find an intersection between
- stable dependency and
- signal-dependent curve
0.5
Estimated dependency
Signal-dependent Curve
0.4
m/n
0.3
0.2
0.1
0
0.6
0.7
0.8

0.9
1
47
COMPARISON RESULTS
Comparisons between IVCD, SVCD, HardTh, SoftTh.
NOTE: for HardTh, SoftTh the value of threshold is taken as:
t  2 ln n
Target function used:
6
8
7
5
6
5
4
f(x)
f(x)
4
3
3
2
2
1
0
1
-1
0
0
0.2
0.4
0.6
0.8
-2
0
1
x
0.2
0.4
0.6
0.8
1
0.8
1
x
(a) spires
(b) blocks
1
3
0.8
2
0.6
0.4
1
f(x)
f(x)
0.2
0
0
-0.2
-1
-0.4
-0.6
-2
-0.8
-1
0
0.2
0.4
0.6
x
(c) winsin
0.8
1
-3
0
0.2
0.4
0.6
x
(d) mishmash
48
Sample Size = 512pts, High Noise (SNR = 3dB)
prediction risk
prediction risk
0.5
0.5
0.45
0.45
0.4
0.4
0.35
risk
risk
0.35
0.3
0.3
0.25
0.25
0.2
0.2
0.15
0.15
0.1
0.1
IVCD(0.78)
SVCD
HardTh
SoftTh
IVCD(0.78)
(a) spires
SVCD
HardTh
SoftTh
(b) blocks
prediction risk
prediction risk
0.5
1
0.45
0.4
0.9
0.8
0.3
risk
risk
0.35
0.25
0.7
0.2
0.6
0.15
0.1
0.5
0.05
IVCD(0.81)
SVCD
HardTh
(c) winsin
SoftTh
IVCD(0.5)
SVCD
HardTh
(d) mishmash
SoftTh
49
 Sample Size = 512pts, Low Noise (SNR = 20dB)
prediction risk
0.04
0.035
0.035
0.03
0.03
0.025
risk
0.025
0.02
0.02
0.015
0.015
0.01
0.01
0.005
0.005
IVCD(0.63)
SVCD
HardTh
SoftTh
IVCD(0.64)
(a) spires
16
x 10
-3
SVCD
HardTh
SoftTh
(b) blocks
prediction risk
prediction risk
1
0.9
14
0.8
12
0.7
10
0.6
risk
risk
risk
prediction risk
8
0.5
0.4
6
0.3
4
0.2
2
0.1
IVCD(0.74)
SVCD
HardTh
(c) winsin
SoftTh
IVCD(0.5)
SVCD
HardTh
(d) mishmash
SoftTh
50
 Sample Size = 2048pts, High Noise (SNR = 3dB)
prediction risk
prediction risk
0.5
0.5
0.45
0.45
0.4
0.4
0.35
0.35
0.3
risk
risk
0.3
0.25
0.25
0.2
0.2
0.15
0.15
0.1
0.1
0.05
0.05
IVCD(0.87)
SVCD
HardTh
SoftTh
IVCD(0.86)
(a) spires
HardTh
SoftTh
(b) blocks
prediction risk
prediction risk
0.5
1
0.45
0.95
0.4
0.9
0.35
0.85
0.3
0.8
0.25
0.75
risk
risk
SVCD
0.7
0.2
0.65
0.15
0.6
0.1
0.55
0.05
0.5
0
IVCD(0.88)
SVCD
HardTh
(c) winsin
SoftTh
IVCD(0.5)
SVCD
HardTh
(d) mishmash
SoftTh
51
 Sample Size = 2048pts, Low Noise (SNR = 20dB)
x 10
-3
prediction risk
12
x 10
-3
prediction risk
11
10
10
9
8
8
risk
risk
7
6
6
5
4
4
3
2
2
IVCD(0.79)
SVCD
HardTh
SoftTh
IVCD(0.79)
(a) spires
-3
prediction risk
1
9
0.9
8
0.8
7
0.7
6
0.6
5
0.5
4
0.4
3
0.3
2
0.2
1
0.1
SVCD
HardTh
(c) winsin
SoftTh
prediction risk
10
IVCD(0.85)
HardTh
(b) blocks
risk
risk
x 10
SVCD
SoftTh
IVCD(0.5)
SVCD
HardTh
(d) mishmash
SoftTh
52
FMRI SIGNAL DENOISING
 MOTIVATION: understanding brain function
 APPROACH: functional MRI
 DATA SET: provided by CMRR at University of Minnesota
- single-trial experiments
- signals (waveforms) recorded at a certain brain location
show response to external stimulus applied 32 times
- Data Set 1 ~ signals at the visual cortex in response to
visual input light blinking 32 times
- Data Set 2 ~ signals at the motor cortex recorded when the
subject moves his finger 32 times
FMRI denoising = obtaining a better version of noisy signal
53
54
 Denoising comparisons:
Visual Cortex Data
1
0.5
0
-0.5
org sig
-1
0
200
400
600
800
1000
1200
1400
1600
1800
2000
1
0.5
0
-0.5
SoftTh
-1
0
200
400
600
800
1000
1200
1400
1600
1800
2000
1
0.5
0
-0.5
IVCD
-1
0
200
400
600
800
1000
1200
1400
1600
1800
SoftTh
IVCD(0.8)
DoF
77
101
MSE
0.0355
0.0247
MSE
Signal Power
0.5640
0.3371
2000
55
Motor Cortex Data
1
0.5
0
-0.5
org sig
-1
0
200
400
600
800
1000
1200
1400
1600
1800
2000
1
0.5
0
-0.5
SoftTh
-1
0
200
400
600
800
1000
1200
1400
1600
1800
2000
1
0.5
0
-0.5
IVCD
-1
0
200
400
600
800
1000
1200
1400
1600
1800
SoftTh
IVCD(0.9)
DoF
104
148
MSE
0.0396
0.0235
MSE
Signal Power
0.5140
0.3053
2000
56
IMAGE DENOISING
• Same Denoising Procedure:
- Perform DWT for a given noisy image
- Order (wavelet) coefficients
- Perform thresholding (using VC-bound)
- Obtain denoised image
• Main issue: what is a good ordering (structure) ?
- Ordering 1 ('Global' Thresholding)
Coefficients are ordered by their magnitude:
w 0  w1    w n 1
- Ordering 2 (same as for 1D signals in VC-denoising)
Coefficients are ordered by magnitude penalized by scale S:
| w0 | | w1 |
| wn 1 |
 1    n 1
S0
S
S
- Ordering 3 ('level-dependent' thresholding):
| w0 |
S
-
0

| w1 |
S
1

| wn 1 |
S n 1
Ordering 4 (tree-based): see references
J. M. Shapiro, Embedded Image Coding using Zerotrees of Wavelet coefficients,
IEEE Trans. Signal Processing, vol. 41, pp. 3445-3462, Dec. 1993
57
Zhong & Cherkassky, Image denoising using wavelet thresholding and model
selection, Proc. ICIP, Vancouver BC, 2000
Example of Artificial Image
Image Size: 256256, SNR(Noisy Image) = 3dB
Original Image
Noisy Image
Denoised by HardTh
(DoF = 241, MSE = 0.01858)
Denoised by VC method
(DoF = 104, MSE = 0.02058)
58
VC-based Model Selection
1.4
empirical risk
VC bound
prediction risk
1.2
Risk(MSE)
1
0.8
0.6
0.4
0.2
0
0
200
400
600
800
1000
59
4. SUMMARY and DISCUSSION
• Application of VC-theory to signal denoising
• Using VC-bounds for nonlinear estimators
• Using VC-bounds for sparse feature selection
• Extensions and future work:
- image denoising;
- other orthogonal estimators;
- SVM
• Importance of VC methodology: structure, VC-dimension
• QUESTIONS
ACKNOWLEDGENT:
This work was supported in part by NSF grant ECS-0099906.
Many thanks to Jun Tang from U of M for assistance in
preparation of tutorial slides.
60
Download