Introduction to Predictive Learning
LECTURE SET 5: Statistical Methods
Electrical and Computer Engineering
OUTLINE
• Objectives
  - introduce statistical terminology/methodology/motivation
  - taxonomy of methods
  - describe several representative statistical methods
  - interpretation of statistical methods under predictive learning
• Statistical Methodology and Basic Methods
• Taxonomy of Nonlinear Methods
• Decision Trees
• Additive Modeling and Projection Pursuit
• Greedy Feature Selection
• Signal Denoising
• Summary and discussion
Methodology and Motivation
• Original motivation: understand how the inputs affect the output
  → simple model involving a few variables
• Regression modeling: Response = model + error
  y = f(x) + noise, where f(x) = E(y|x)
• Linear regression: f(x) = wx + b
• Model parameters estimated via least squares:
  $\mathrm{MSE}(w,b) = \frac{1}{n}\sum_{i=1}^{n}\big(y_i - (w x_i + b)\big)^2 \to \min$
OLS Linear Regression
• OLS solution:
  $\hat{w} = \frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}{\sum_i (x_i-\bar{x})^2}, \qquad \hat{b} = \bar{y} - \hat{w}\bar{x}$
  - first, center x and y-values
  - then calculate the slope and bias
• Example: SBP vs. Age
  $\mathrm{SBP} = E(y|x) = 0.44\cdot\mathrm{Age} + 105.7$
  The meaning of the bias term?
  [Figure: scatter plot of Systolic Blood Pressure (80–200) vs. Age in Years (40–80) with the fitted regression line]
Statistical Assumptions
• Gaussian noise: zero mean, constant variance
• Known (linear) dependency
• i.i.d. data samples (ensured by the protocol for data collection) – may not hold for observational data
Do these assumptions hold for the SBP vs. Age data?
[Figure: Systolic Blood Pressure vs. Age in Years scatter plot]
Multivariate Linear Regression
• Parameterization: $f(\mathbf{x},\mathbf{w}) = w_1 x_1 + w_2 x_2 + \ldots + w_d x_d + b = (\mathbf{w}\cdot\mathbf{x}) + b$
• Matrix form (for centered variables):
  $X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1n}\\ x_{21} & x_{22} & \cdots & x_{2n}\\ \vdots & \vdots & & \vdots\\ x_{d1} & x_{d2} & \cdots & x_{dn} \end{bmatrix}, \qquad X\mathbf{w} \approx \mathbf{y}$
• ERM solution: $R_{emp}(\mathbf{w}) = \frac{1}{n}\,\|X\mathbf{w}-\mathbf{y}\|^2 \to \min$
• Analytic solution (when d < n): $\hat{\mathbf{w}} = (X^T X)^{-1} X^T \mathbf{y}$
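A minimal MATLAB sketch of this analytic solution (not from the slides; the data sizes and coefficients below are illustrative, and rows of X are training samples):

% least-squares fit of f(x) = (w . x) + b via the analytic solution (illustrative data)
n = 50; d = 3;
X = rand(n, d);                         % n training samples (one per row), d inputs
y = X*[2; -1; 0.5] + 0.3*randn(n, 1);   % synthetic outputs with additive Gaussian noise
xm = mean(X);  ym = mean(y);            % center the x- and y-values
Xc = X - repmat(xm, n, 1);
yc = y - ym;
w_hat = (Xc'*Xc) \ (Xc'*yc);            % analytic solution (X'X)^(-1) X'y on centered data
b_hat = ym - xm*w_hat;                  % recover the bias term
y_pred = X*w_hat + b_hat;               % predictions on the training inputs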
Linear Ridge Regression
• When d > n, penalize large parameter values:
  $R_{ridge}(\mathbf{w}) = \|X\mathbf{w}-\mathbf{y}\|^2 + \lambda\|\mathbf{w}\|^2$
• Regularization parameter λ estimated via resampling
• Example: $y = t(\mathbf{x}) + \xi$, where $t(\mathbf{x}) = 3x_1 + x_2 + 2x_3 + 0\cdot x_4 + 0\cdot x_5$
  - 10 training samples, inputs sampled uniformly in the [0,1] range
  - additive Gaussian noise with st. deviation 0.5
• Apply standard linear least squares:
  $\hat{y} = 3.3422x_1 + 1.4668x_2 + 2.3999x_3 + 0.3133x_4 + 0.0346x_5 + 0.0675$
• Apply ridge regression using the optimal $\log(\lambda) = 3$:
  $\hat{y} = 2.9847x_1 + 1.0338x_2 + 2.0161x_3 + 0.0889x_4 + 0.3891x_5 + 0.018$
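A minimal MATLAB sketch comparing ordinary least squares with ridge regression on data generated as in this example (illustrative; the λ value is set by hand here, whereas the slides estimate it via resampling, so the numbers will differ):

% ordinary least squares vs. ridge regression on a 5-input example (illustrative)
n = 10; d = 5;
X = rand(n, d);                            % 10 samples, inputs uniform in [0,1]
w_true = [3; 1; 2; 0; 0];                  % target coefficients as in the slide example
y = X*w_true + 0.5*randn(n, 1);            % additive Gaussian noise, st. deviation 0.5
Xa = [X ones(n,1)];                        % append a column of ones for the bias term
w_ols = (Xa'*Xa) \ (Xa'*y);                % ordinary least squares
lambda = 0.05;                             % penalty value chosen here by hand;
P = eye(d+1); P(end,end) = 0;              %   in practice it is estimated via resampling
w_ridge = (Xa'*Xa + lambda*P) \ (Xa'*y);   % ridge solution (bias term not penalized)
disp([w_ols w_ridge]);                     % compare the estimated coefficients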
Example cont'd
• Target function: $t(\mathbf{x}) = 3x_1 + x_2 + 2x_3 + 0\cdot x_4 + 0\cdot x_5$
• Coefficient shrinkage: how do the estimated w's depend on lambda?
  Can ridge regression be used for feature selection?
Statistical Methodology for classification
• For classification: output y ~ (binary) class label (0 or 1)
• Probabilistic modeling starts with known distributions
  $P(y=1|\mathbf{x}),\; P(y=0|\mathbf{x}),\; P(y=0),\; P(y=1)$
• Bayes-optimal decision rule for known distributions:
  $D(\mathbf{x}) = \begin{cases} 1 & \text{if } \dfrac{P(y=1|\mathbf{x})}{P(y=0|\mathbf{x})} > \dfrac{P(y=0)}{P(y=1)} \\ 0 & \text{otherwise} \end{cases}$
• Statistical approach ~ ERM:
  - parametric form of class distributions is known/assumed
  → analytic form of D(x) is known, and its parameters are estimated from available training data $(\mathbf{x}_i, y_i),\; i = 1, 2, \ldots, n$
• Issues: loss function (used for statistical modeling)?
Gaussian class distributions
[Figure: Gaussian class distributions]
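A short MATLAB sketch of the Bayes-optimal rule for two known univariate Gaussian class distributions, in the standard weighted-density form; the means, variance and priors below are illustrative assumptions:

% Bayes-optimal decision rule for two known (univariate) Gaussian class distributions;
% the means, variance and priors are illustrative assumptions
mu0 = 0; mu1 = 2; sigma = 1;               % equal-variance Gaussian class densities
p0 = 0.5; p1 = 0.5;                        % prior probabilities P(y=0), P(y=1)
x = linspace(-4, 6, 200);
f0 = exp(-(x-mu0).^2/(2*sigma^2))/(sqrt(2*pi)*sigma);   % class-conditional densities
f1 = exp(-(x-mu1).^2/(2*sigma^2))/(sqrt(2*pi)*sigma);
D = double(f1*p1 > f0*p0);                 % decide class 1 where its weighted density wins
plot(x, f0*p0, x, f1*p1, x, D, 'k--');     % densities (scaled by priors) and the 0/1 rule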
Logistic Regression
• Terminology: may be confusing (for non-statisticians)
• For Gaussian class distributions (with equal covariances),
  $\ln\frac{P(y=1|\mathbf{x})}{1 - P(y=1|\mathbf{x})}$ is a linear function in x
• Logistic regression estimates the probabilistic model
  $\mathrm{logit}\,P(y=1|\mathbf{x}) = \ln\frac{P(y=1|\mathbf{x})}{1 - P(y=1|\mathbf{x})} = (\mathbf{w}\cdot\mathbf{x}) + b$
• Equivalently, logistic regression estimates
  $P(y=1|\mathbf{x}) = s\big((\mathbf{w}\cdot\mathbf{x}) + b\big) = \frac{\exp(b + (\mathbf{w}\cdot\mathbf{x}))}{1 + \exp(b + (\mathbf{w}\cdot\mathbf{x}))}$
  where the sigmoid function is $s(t) = \frac{1}{1 + \exp(-t)}$
Logistic Regression
• Example: interpretation of a logistic regression model for the probability of death from heart disease during a 10-year period, for middle-aged patients, as a function of
  - Age (years, less 50) ~ x1
  - Gender male/female (0/1) ~ x2
  - cholesterol level, in mmol/L (less 5) ~ x3
  $P(y=1|\mathbf{x}) = \frac{1}{1+\exp(-t)}$, where $t = -5 + 2x_1 - x_2 + 1.2x_3$
• The probability of the binary outcome ~ the risk (of death)
• Logistic regression model interpretation:
  - increasing Age is associated with increased risk of death
  - females have lower risk of death (than males)
  - increasing cholesterol level → increased risk of death
Estimating Logistic Regression
• Given: training data $(\mathbf{x}_i, y_i),\; i = 1, 2, \ldots, n$
• How to estimate the model parameters (w, b)?
  $\hat{P}(y=1|\mathbf{x}) = f(\mathbf{x},\mathbf{w},b), \qquad \hat{P}(y=0|\mathbf{x}) = 1 - f(\mathbf{x},\mathbf{w},b)$
  where $f(\mathbf{x},\mathbf{w},b) = \frac{\exp(b + (\mathbf{w}\cdot\mathbf{x}))}{1 + \exp(b + (\mathbf{w}\cdot\mathbf{x}))}$
• Maximum likelihood ~ minimize the negative log-likelihood:
  $R_{emp}(\mathbf{w},b) = -\frac{1}{n}\sum_{i=1}^{n}\big[y_i \ln f(\mathbf{x}_i,\mathbf{w},b) + (1-y_i)\ln(1 - f(\mathbf{x}_i,\mathbf{w},b))\big]$
  → non-linear optimization
• Solution (w*, b*) → estimated model $\hat{P}(y=1|\mathbf{x}) = f(\mathbf{x},\mathbf{w}^*,b^*)$
  - which can be used for prediction and interpretation
  (for prediction, the model should be combined with costs)
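A minimal MATLAB sketch of this estimation step, minimizing the negative log-likelihood by plain gradient descent on synthetic data (an illustrative choice; statistical software typically uses more refined nonlinear optimizers such as iteratively reweighted least squares):

% gradient-descent sketch for logistic regression (illustrative optimizer choice)
n = 200; d = 2;
X = randn(n, d);                               % synthetic inputs
w_true = [1.5; -2]; b_true = 0.5;
y = double(rand(n,1) < 1./(1 + exp(-(X*w_true + b_true))));   % 0/1 class labels
w = zeros(d,1); b = 0; eta = 0.1;              % initialization and step size
for iter = 1:2000
    f = 1 ./ (1 + exp(-(X*w + b)));            % current estimate of P(y=1|x)
    g = f - y;                                 % gradient of neg. log-likelihood wrt the logit
    w = w - eta*(X'*g)/n;                      % gradient steps for w and b
    b = b - eta*mean(g);
end
Remp = -mean(y.*log(f) + (1-y).*log(1-f));     % negative log-likelihood at the solution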
Statistical Modeling Strategy
• Data-analytic models are used for understanding the importance of inputs in explaining the output
• ERM approach:
  - a statistician selects (manually) a few 'good' variables and several models are estimated
  - the final model selected manually
  ~ heuristic implementation of Occam's razor
• Linear regression and logistic regression both estimate E(y|x), since for classification:
  $E(y|\mathbf{x}) = 0\cdot P(y=0|\mathbf{x}) + 1\cdot P(y=1|\mathbf{x}) = P(y=1|\mathbf{x})$
Classification via multiple-response regression
How to use nonlinear regression software for classification?
- classification methods estimate model parameters via minimization of squared error → can use regression software with minor modifications:
(1) for J class labels, use 1-of-J encoding, i.e. for J = 4 classes:
    ~ 1000 0100 0010 0001 (4 outputs in regression)
(2) estimate 4 regression models from the training data
    (usually all regression models use the same parameterization)
[Diagram: inputs x1, ..., xd feed the estimation of a multiple-response regression with outputs y1, ..., yJ]
Classification via Regression
• Training ~ regression estimation using 1-of-J encoding
  [Diagram: inputs x1, ..., xd → estimation of multiple-response regression → outputs y1, ..., yJ]
• Prediction (classification) ~ based on the max response value of the estimated outputs
  [Diagram: inputs x1, ..., xd → multiple-response discriminant functions ŷ1, ..., ŷJ → MAX → ŷ]
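A minimal MATLAB sketch of this scheme with a linear parameterization, using the Fisher iris data that appears later in these slides (the choice of data and model is illustrative):

% classification via multiple-response linear regression with 1-of-J encoding,
% illustrated on the Fisher iris data used later in the slides
load fisheriris;                           % meas: 150 x 4 inputs, species: 3 class labels
[labels, ~, yIdx] = unique(species);       % map class labels to integer indices 1..J
n = size(meas,1);  J = numel(labels);
Y = zeros(n, J);
Y(sub2ind([n J], (1:n)', yIdx)) = 1;       % 1-of-J (one-hot) target encoding
Xa = [meas ones(n,1)];                     % linear parameterization with a bias column
W = (Xa'*Xa) \ (Xa'*Y);                    % one least-squares regression model per output
[~, pred] = max(Xa*W, [], 2);              % predict the class with the maximum response
trainErr = mean(pred ~= yIdx);             % resubstitution error rate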
OUTLINE
• Objectives
• Statistical Methodology and Basic Methods
• Taxonomy of Nonlinear Methods
  - model parameterization (representation)
  - nonlinear optimization strategies
• Decision Trees
• Additive Modeling and Projection Pursuit
• Greedy Feature Selection
• Signal Denoising
• Summary and discussion
Taxonomy of Nonlinear Methods
• Main idea: improve flexibility of classical linear methods
  ~ use flexible (nonlinear) parameterization
• Dictionary parameterization ~ SRM structure:
  $f_m(\mathbf{x},\mathbf{w},V) = \sum_{i=0}^{m} w_i\, g(\mathbf{x},\mathbf{v}_i)$
• Two interrelated issues:
  - parameterization (of nonlinear basis functions)
  - optimization method used
• These two factors define the taxonomy of methods
Taxonomy of nonlinear methods
• Decision tree methods:
  - piecewise-constant model
  - greedy optimization
• Additive methods:
  - backfitting method for model estimation
• Gradient-descent methods:
  - popular in neural network learning
• Penalization methods
Note: all methods implement SRM structures
• Dictionary representation: $f_m(\mathbf{x},\mathbf{w},V) = \sum_{i=0}^{m} w_i\, g(\mathbf{x},\mathbf{v}_i)$
  Two possibilities:
• Linear (non-adaptive) methods
  ~ predetermined (fixed) basis functions $g_i(\mathbf{x})$
  → only parameters $w_i$ have to be estimated, via standard optimization methods (linear least squares)
  Examples: linear regression, polynomial regression, linear classifiers, quadratic classifiers
• Nonlinear (adaptive) methods
  ~ basis functions $g(\mathbf{x},\mathbf{v}_i)$ depend on the training data
  Possibilities: nonlinear b.f. (in parameters $\mathbf{v}_i$), feature selection (e.g. wavelet denoising)
Example of Nonlinear Parameterization
• Basis functions of the form $g_i(\mathbf{x}) = g(\mathbf{x}\cdot\mathbf{v}_i + b_i)$,
  i.e. the sigmoid (aka logistic) function $s(t) = \frac{1}{1+\exp(-t)}$
  - commonly used in artificial neural networks
  - combination of sigmoids ~ universal approximator
Example of Nonlinear Parameterization
• Basis functions of the form $g_i(\mathbf{x}) = g(\|\mathbf{x}-\mathbf{v}_i\|)$,
  i.e. Radial Basis Functions (RBF), e.g. the Gaussian
  $g(t) = \exp\left(-\frac{(t-c)^2}{2\sigma^2}\right)$
  - RBF adaptive parameters: center, width
  - commonly used in artificial neural networks
  - combination of RBFs ~ universal approximator
Neural Network Representation
• MLP or RBF networks implement the dictionary parameterization
  $f_m(\mathbf{x},\mathbf{w},V) = \sum_{i=0}^{m} w_i\, g(\mathbf{x},\mathbf{v}_i)$
  [Diagram: inputs x1, x2, ..., xd feed hidden units $z_j = g(\mathbf{x},\mathbf{v}_j)$, j = 1, ..., m (V is d × m); the output is $\hat{y} = \sum_{j=1}^{m} w_j z_j$ (W is m × 1)]
  - dimensionality reduction
  - universal approximation property – see example at
  http://www.mathworks.com/products/demos/nnettlbx/radial/index.html
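A minimal MATLAB sketch of the dictionary parameterization with Gaussian RBF basis functions; for simplicity the centers and width are fixed on a grid (an assumption), so only the output weights are estimated, by linear least squares:

% dictionary model f(x) = sum_j w_j g(x, v_j) with fixed Gaussian RBF basis functions
n = 100; m = 10;
x = linspace(0, 1, n)';                         % 1-D inputs, for illustration
y = sin(2*pi*x) + 0.1*randn(n, 1);              % noisy target
centers = linspace(0, 1, m);                    % fixed RBF centers v_j (adaptive in general)
width = 0.1;                                    % fixed RBF width
G = exp(-(repmat(x,1,m) - repmat(centers,n,1)).^2 / (2*width^2));  % n x m basis matrix
G = [ones(n,1) G];                              % constant basis function (w_0 term)
w = (G'*G) \ (G'*y);                            % output weights via linear least squares
y_hat = G*w;                                    % fitted values
plot(x, y, '.', x, y_hat, '-');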
Example of Nonlinear Parameterization
• Adaptive partitioning (CART): $f(\mathbf{x}) = \sum_{j=1}^{m} w_j\, I(\mathbf{x}\in R_j)$
  where each basis function is a rectangular region in x-space:
  $I(\mathbf{x}\in R_j) = \prod_{l=1}^{d} I(a_{jl} \le x_l \le b_{jl})$
• Each basis function depends on 2d parameters $(\mathbf{a}_j, \mathbf{b}_j)$
• Since the regions $R_j$ are disjoint, the parameters $w_j$ can be easily estimated (for regression) as
  $w_j = \frac{1}{n_j}\sum_{\mathbf{x}_i\in R_j} y_i$
• Estimating the basis functions ~ adaptive partitioning
Example of CART Partitioning
• CART partitioning in 2D space
  - each region ~ basis function
  - piecewise-constant estimate of y (in each region)
  - number of regions ~ model complexity
  [Figure: partition of the (x1, x2) plane into regions R1–R5 by split points s1–s4]
OUTLINE
• Objectives
• Statistical Methodology and Basic Methods
• Taxonomy of Nonlinear Methods
• Decision Trees
  - Regression trees (CART)
  - Boston Housing example
  - Classification trees (CART)
• Additive Modeling and Projection Pursuit
• Greedy Feature Selection
• Signal Denoising
• Summary and discussion
Greedy Optimization Strategy
• Minimization of empirical risk for regression problems:
  $R_{emp}(V,W) = \frac{1}{n}\sum_{i=1}^{n} L(\mathbf{x}_i, y_i, V, W) = \frac{1}{n}\sum_{i=1}^{n}\big(y_i - f(\mathbf{x}_i, V, W)\big)^2$
  where the model is $f(\mathbf{x},V,W) = \sum_{j=1}^{m} w_j\, g_j(\mathbf{x},\mathbf{v}_j)$
• Greedy optimization strategy:
  basis functions are estimated sequentially, one at a time, i.e. the training data is represented as structure (model fit) + noise (residual):
  (1) DATA = (model) FIT 1 + RESIDUAL 1
  (2) RESIDUAL 1 = FIT 2 + RESIDUAL 2
  and so on. The final model for the data will be
  MODEL = FIT 1 + FIT 2 + ...
• Advantages: computational speed, interpretability
Regression Trees (CART)
• Minimization of empirical risk (squared error) via partitioning of the input space into regions:
  $f(\mathbf{x}) = \sum_{j=1}^{m} w_j\, I(\mathbf{x}\in R_j)$, where $w_j = \frac{1}{n_j}\sum_{\mathbf{x}_i\in R_j} y_i$
• Example of CART partitioning for a function of 2 inputs
  [Figure: partition of the (x1, x2) plane into regions R1–R5 by split points s1–s4, and the equivalent binary tree with splits on (x1, s1), (x2, s2), (x2, s3), (x1, s4) and leaves R1–R5]
Growing CART tree
• Recursive partitioning for estimating regions (via binary splitting)
• Initial model ~ region R0 (the whole input domain) is divided into two regions R1 and R2
• A split is defined by one of the inputs (k) and a split point s
• Optimal values of (k, s) are chosen so that splitting a region into two daughter regions minimizes empirical risk
• Issues:
  - efficient implementation (selection of the optimal split point)
  - optimal tree size ~ model selection (complexity control)
• Advantages and limitations
Valid Split Points for CART
• How to choose valid points (for binary splitting)?
  Valid points ~ combinations of the coordinate values of the training samples, i.e. for 4 bivariate samples → 16 points used as candidates for splitting
  [Figure: 4 bivariate training samples and the 4 × 4 grid of candidate split points]
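A minimal MATLAB sketch of one greedy splitting step: it scans each input variable and each candidate split point (coordinate values of the training samples) and keeps the split with the smallest squared error; growing the full tree amounts to recursing this step subject to a Splitmin-type stopping rule (the data below are illustrative):

% greedy search for the best split (k, s) of one regression-tree node
n = 30; d = 2;
X = rand(n, d);
y = double(X(:,1) > 0.5) + 0.1*randn(n, 1);     % synthetic piecewise target
bestErr = inf; bestK = 0; bestS = 0;
for k = 1:d                                     % candidate split variable
    cand = unique(X(:,k))';                     % candidate split points ~ sample coordinates
    for s = cand
        left = X(:,k) <= s;  right = ~left;
        if ~any(left) || ~any(right), continue; end
        err = sum((y(left)  - mean(y(left))).^2) + ...   % squared error of the
              sum((y(right) - mean(y(right))).^2);       % piecewise-constant fit
        if err < bestErr, bestErr = err; bestK = k; bestS = s; end
    end
end
fprintf('best split: x%d <= %.3f\n', bestK, bestS);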
CART Modeling Strategy
• Growing the CART tree ~ reducing MSE (for regression).
  Splitting a parent region is allowed only if its number of samples exceeds a certain threshold (~ Splitmin, user-defined).
• Tree pruning ~ reducing tree size by selectively combining adjacent leaf nodes (regions). This pruning implements minimization of the penalized MSE:
  $R_{pen} = R_{emp} + \lambda\,|T|$
  where $R_{emp}$ ~ MSE, $|T|$ ~ number of leaf nodes (regions), and the parameter $\lambda$ is estimated via resampling
Example: Boston Housing data set
• Objective: to predict the value of homes in the Boston area
• Data set ~ 506 samples total
  Output: value of owner-occupied homes (in $1,000's)
  Inputs: 13 variables
  1. CRIM     per capita crime rate by town
  2. ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
  3. INDUS    proportion of non-retail business acres per town
  4. CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
  5. NOX      nitric oxides concentration (parts per 10 million)
  6. RM       average number of rooms per dwelling
  7. AGE      proportion of owner-occupied units built prior to 1940
  8. DIS      weighted distances to five Boston employment centres
  9. RAD      index of accessibility to radial highways
  10. TAX     full-value property-tax rate per $10,000
  11. PTRATIO pupil-teacher ratio by town
  12. B       1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
  13. LSTAT   % lower status of the population
Example CART trees for Boston Housing
1. Training set: 450 samples, Splitmin = 100 (user-defined)
   [Figure: resulting CART tree with regions R0, R1, R2]
Example CART trees for Boston Housing
2. Training set: 450 samples, Splitmin = 50 (user-defined)
   [Figure: resulting CART tree with regions R0, R1, R2]
Example CART trees for Boston Housing
3. Training set: 455 samples, Splitmin = 100 (user-defined)
   Note: the CART model is sensitive to the training samples (vs. model 1)
   [Figure: resulting CART tree]
Classification Trees (CART)
• Binary classification example (2D input space)
• Algorithm similar to regression trees (tree growth via binary splitting + model selection), BUT using a different empirical loss function
  [Figure: two-class data in 2D and the resulting tree with splits x1 < -0.409, x2 < -0.067, x1 < -0.148]
Loss functions for Classification Trees
• Misclassification loss: poor practical choice
• Other loss (cost) functions for splitting nodes:
  For a J-class problem, a cost function is a measure of node impurity $Q(t) = Q\big(p(1|t), p(2|t), \ldots, p(J|t)\big)$,
  where $p(i|t)$ denotes the probability of class i samples at node t.
• Possible cost functions:
  Misclassification: $Q(t) = 1 - \max_j p(j|t)$
  Gini function: $Q(t) = \sum_{i\ne j} p(i|t)\,p(j|t) = 1 - \sum_j p^2(j|t)$
  Entropy function: $Q(t) = -\sum_j p(j|t)\,\ln p(j|t)$
Classification Trees: node splitting
• Minimizing the cost function = maximizing the decrease in node impurity.
  Assume node t is split into two regions (Left & Right) on variable k at split point s. Then the decrease in impurity caused by this split is:
  $\Delta Q(s,k,t) = Q(t) - \big[Q(t_L)\,p_L(t) + Q(t_R)\,p_R(t)\big]$
  where $p_L(t) = p(t_L)/p(t)$ and $p_R(t) = p(t_R)/p(t)$
• Misclassification cost ~ discontinuous (due to max)
  - may give sub-optimal solutions (poor local min)
  - does not work well with greedy optimization
Using different cost functions for node splitting
[Figure: two candidate splits, (a) and (b), of the same parent node]
(a) Decrease in impurity: misclassification = 0.25, gini = 0.13, entropy = 0.13
(b) Decrease in impurity: misclassification = 0.25, gini = 0.17, entropy = 0.22
Split (b) is better, as it leads to a smaller final tree
Details of calculating decrease in impurity
Consider split (a):
• Misclassification cost
  $Q(t) = 1 - 0.5 = 0.5, \qquad p_L(t) = 4/8 = 0.5, \quad p_R(t) = 0.5$
  $Q(t_L) = 1 - 3/4 = 0.25, \qquad Q(t_R) = 1 - 3/4 = 0.25$
  $\Delta Q(t) = 0.5 - 0.5\cdot 0.25 - 0.5\cdot 0.25 = 0.25$
• Gini cost
  $Q(t) = 1 - 0.5^2 - 0.5^2 = 0.5, \qquad p_L(t) = 0.5, \quad p_R(t) = 0.5$
  $Q(t_L) = 1 - (3/4)^2 - (1/4)^2 = 3/8, \qquad Q(t_R) = 3/8$
  $\Delta Q(t) = 0.5 - 0.5\cdot(3/8) - 0.5\cdot(3/8) = 1/8$
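A short MATLAB sketch that reproduces these impurity decreases (up to rounding). The class counts for split (a) follow the worked example above; the counts assumed for split (b), left node (2, 4) and right node (2, 0), are inferred from the reported values and should be read as an assumption:

% decrease in node impurity for a binary split, under three impurity measures
impurity = @(p, kind) ...
    (strcmp(kind,'misclass')) * (1 - max(p)) + ...
    (strcmp(kind,'gini'))     * (1 - sum(p.^2)) + ...
    (strcmp(kind,'entropy'))  * (-sum(p(p>0).*log(p(p>0))));
splits = {[3 1; 1 3], [2 4; 2 0]};            % rows = class counts in left/right node
for a = 1:2
    counts = splits{a};
    parent = sum(counts, 1);                  % class counts at the parent node
    pL = sum(counts(1,:))/sum(parent);  pR = 1 - pL;
    for kind = {'misclass','gini','entropy'}
        dQ = impurity(parent/sum(parent), kind{1}) ...
             - pL*impurity(counts(1,:)/sum(counts(1,:)), kind{1}) ...
             - pR*impurity(counts(2,:)/sum(counts(2,:)), kind{1});
        fprintf('split (%c), %-9s: %.3f\n', 'a'+a-1, kind{1}, dQ);
    end
end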
IRIS Data Set: A data set with 150 random samples of flowers from the
iris species setosa, versicolor, and virginica (3 classes). From each
species there are 50 observations for sepal length, sepal width, petal
length, and petal width in cm. This dataset is from classical statistics
MATLAB code (splitmin =10)
load fisheriris;
t = treefit(meas, species);
treedisp(t,'names',{'SL' 'SW' 'PL' 'PW'});
Sensitivity to random training data:
Consider the IRIS data set where every other sample is used (total 75 samples, 25 per class). Then the CART tree formed using the same Matlab software (splitmin = 10, Gini loss function) is:
[Figure: resulting classification tree]
Decision Trees: summary
• Advantages
  - speed
  - interpretability
  - different types of input variables
• Limitations: sensitivity to
  - correlated inputs
  - affine transformations (of input variables)
  - general instability of trees
• Variations: ID3 (in machine learning), linear CART
OUTLINE
• Objectives
• Statistical Methodology and Basic Methods
• Taxonomy of Nonlinear Methods
• Decision Trees
• Additive Modeling and Projection Pursuit
• Greedy Feature Selection
• Signal Denoising
• Summary and discussion
Additive Modeling
• Additive model parameterization for regression:
  $E(y|\mathbf{x}) = f(\mathbf{x}) = b + g_1(x_1) + g_2(x_2) + \ldots + g_d(x_d)$
  where $g_j(x_j)$ is an unknown (smooth) function.
  Each univariate component is estimated separately.
• Additive model for classification:
  $\mathrm{logit}\,P(y=1|\mathbf{x}) = \ln\frac{P(y=1|\mathbf{x})}{1 - P(y=1|\mathbf{x})} = b + g_1(x_1) + g_2(x_2) + \ldots + g_d(x_d)$
• Backfitting is a greedy optimization approach for estimating the basis functions sequentially
• By fixing all basis functions $g_j$, $j \ne k$, the empirical risk (MSE) can be decomposed as
  $R_{emp}(V) = \frac{1}{n}\sum_{i=1}^{n}\big(y_i - f(\mathbf{x}_i, V)\big)^2 = \frac{1}{n}\sum_{i=1}^{n}\Big(y_i - \sum_{j\ne k} g_j(\mathbf{x}_i,\mathbf{v}_j) - w_0 - g_k(\mathbf{x}_i,\mathbf{v}_k)\Big)^2 = \frac{1}{n}\sum_{i=1}^{n}\big(r_i - g_k(\mathbf{x}_i,\mathbf{v}_k)\big)^2$
→ Each basis function $g_k(\mathbf{x},\mathbf{v}_k)$ is estimated via an iterative backfitting algorithm (until some stopping criterion is met)
Note: $r_i$ can be interpreted as the response variable for the adaptive method $g_k(\mathbf{x},\mathbf{v}_k)$
Backfitting Algorithm: Example
• Consider regression estimation of a function of two variables of the form $y = g_1(x_1) + g_2(x_2) + \text{noise}$ from training data $(x_{1i}, x_{2i}, y_i),\; i = 1, 2, \ldots, n$.
  For example, $t(x_1, x_2) = x_1^2 + \sin(2\pi x_2)$, $\mathbf{x}\in[0,1]^2$
• Backfitting method:
  (1) estimate $g_1(x_1)$ for fixed $g_2$
  (2) estimate $g_2(x_2)$ for fixed $g_1$
  iterate the above two steps
• Estimation via minimization of empirical risk:
  $R_{emp}(g_1(x_1), g_2(x_2)) = \frac{1}{n}\sum_{i=1}^{n}\big(y_i - g_1(x_{1i}) - g_2(x_{2i})\big)^2$
  (first iteration) $= \frac{1}{n}\sum_{i=1}^{n}\big((y_i - g_2(x_{2i})) - g_1(x_{1i})\big)^2 = \frac{1}{n}\sum_{i=1}^{n}\big(r_i - g_1(x_{1i})\big)^2$
Backfitting Algorithm (cont'd)
• Estimation of $g_1(x_1)$ via minimization of MSE:
  $R_{emp}(g_1(x_1)) = \frac{1}{n}\sum_{i=1}^{n}\big(r_i - g_1(x_{1i})\big)^2 \to \min$
• This is a univariate regression problem of estimating $g_1(x_1)$ from n data points $(x_{1i}, r_i)$, where $r_i = y_i - g_2(x_{2i})$
• Can be estimated by smoothing (kNN regression)
• Estimation of $g_2(x_2)$ (second iteration) proceeds in a similar manner, via minimization of
  $R_{emp}(g_2(x_2)) = \frac{1}{n}\sum_{i=1}^{n}\big(r_i - g_2(x_{2i})\big)^2$, where $r_i = y_i - g_1(x_{1i})$
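A minimal MATLAB sketch of backfitting for this two-variable example, using a simple kNN smoother for each univariate component (the number of neighbors and the number of passes are illustrative choices):

% backfitting for y = g1(x1) + g2(x2) + noise, with a simple kNN smoother
n = 100; k = 7;                                 % sample size and number of neighbors
x1 = rand(n,1); x2 = rand(n,1);
y = x1.^2 + sin(2*pi*x2) + 0.1*randn(n,1);      % target from the example above
g1 = zeros(n,1); g2 = zeros(n,1); b = mean(y);
for pass = 1:10                                 % a few backfitting passes
    r = y - b - g2;                             % partial residuals ~ response for g1
    for i = 1:n
        [~, idx] = sort(abs(x1 - x1(i)));       % k nearest neighbors along x1
        g1(i) = mean(r(idx(1:k)));              % kNN smoothing estimate
    end
    g1 = g1 - mean(g1);                         % keep the components centered
    r = y - b - g1;                             % partial residuals ~ response for g2
    for i = 1:n
        [~, idx] = sort(abs(x2 - x2(i)));
        g2(i) = mean(r(idx(1:k)));
    end
    g2 = g2 - mean(g2);
end
plot(x1, g1, '.');                              % estimated g1 component (compare with x1^2)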
Projection Pursuit regression
• Projection pursuit is an additive model:
  $f(\mathbf{x},V,W) = \sum_{j=1}^{m} g_j\big((\mathbf{w}_j\cdot\mathbf{x}),\mathbf{v}_j\big) + w_0$
  where the basis functions $g_j(z,\mathbf{v}_j)$ are univariate functions (of projections)
• Features $z_j = (\mathbf{w}_j\cdot\mathbf{x})$ specify the projection of x onto $\mathbf{w}_j$
• A sum of nonlinear functions $g_j((\mathbf{w}_j\cdot\mathbf{x}),\mathbf{v}_j)$ can approximate any nonlinear function. See the example below.
[Figure: example of a projection pursuit model with two univariate functions g1(z1) and g2(z2) of linear projections z1, z2 of the inputs (x1, x2); surface plots show g1(x1, x2), g2(x1, x2), and their sum g1(x1, x2) + g2(x1, x2)]
Projection Pursuit regression
• Projection pursuit is an additive model:
  $f(\mathbf{x},V,W) = \sum_{j=1}^{m} g_j\big((\mathbf{w}_j\cdot\mathbf{x}),\mathbf{v}_j\big) + w_0$
  where the basis functions $g_j(z,\mathbf{v}_j)$ are univariate functions (of projections)
• The backfitting algorithm is used to estimate iteratively:
  (a) basis functions (parameters $\mathbf{v}_j$) via scatterplot smoothing
  (b) projection parameters $\mathbf{w}_j$ (via gradient descent)
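A much-simplified MATLAB sketch of a single projection-pursuit term in two dimensions: instead of gradient descent, the projection direction is found by a coarse grid search over angles, and the ridge function is estimated by kNN scatterplot smoothing along the projection (illustrative simplifications, not the algorithm used for the slides' figures):

% single-term projection pursuit (m = 1), much simplified
n = 200; k = 10;
X = rand(n, 2);
y = sin(2*pi*(X*[0.8; 0.6])) + 0.1*randn(n,1);  % target that depends on one projection
bestErr = inf;
for theta = 0:pi/36:pi                          % grid of candidate projection directions
    w = [cos(theta); sin(theta)];
    z = X*w;                                    % projected feature z = (w . x)
    g = zeros(n,1);
    for i = 1:n                                 % kNN scatterplot smoothing along z
        [~, idx] = sort(abs(z - z(i)));
        g(i) = mean(y(idx(1:k)));
    end
    err = mean((y - g).^2);                     % unexplained variance for this direction
    if err < bestErr, bestErr = err; bestW = w; end
end
disp(bestW');                                   % estimated projection (cf. direction [0.8 0.6])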
EXAMPLE: estimation of a two-dimensional function via projection pursuit
(a) Projections are found that minimize unexplained variance. Smoothing is performed to create adaptive basis functions.
(b) The final model is a sum of two univariate adaptive basis functions.
[Figures (a) and (b)]
OUTLINE
• Objectives
• Statistical Methodology and Basic Methods
• Taxonomy of Nonlinear Methods
• Decision Trees
• Additive Modeling and Projection Pursuit
• Greedy Feature Selection
• Signal Denoising
• Summary and discussion
Greedy feature selection
• Recall the feature selection structure in SRM:
  - difficult (nonlinear) optimization problem
  - simple with orthogonal basis functions
  - why not use orthogonal b.f.'s for all applications?
• Consider sparse polynomial estimation (aka best subset regression) as an example of feature selection, i.e. features ~ $\{x^k\},\; k = 1, 2, 3, \ldots$
• Compare two approaches:
  - exhaustive search through all subsets
  - forward stepwise selection (in statistics)
Data set used for comparisons
• 30 noisy training samples generated from
  $y = t(x) + N(0, 0.05)$, where $t(x) = -x^3 + x + 0.5$
  and inputs are uniform in [0,1]
Feature selection via exhaustive search
• Exhaustive search for best subset selection:
  - estimate prediction risk (MSE) via leave-one-out cross-validation
  - minimize empirical risk via least squares for all possible subsets of m variables (features)
  - select the best subset (~ min prediction risk)
• Based on min prediction risk (via cross-validation), the following model was selected: $w_0 + w_1 x + w_2 x^3$
• Final model estimated via linear regression using features $(x, x^3)$ with all data:
  $\hat{y} = -0.7930x^3 + 0.7709x + 0.5562$
Forward subset selection (greedy method)
• Forward subset selection:
  - first estimate the model using one feature
  - then add the second feature if it results in a sufficiently large decrease in RSS, otherwise stop
  - etc. (sequentially adding one more feature)
• Step 1: select the first feature (m = 1) from the set of candidate models
  $w_0 + w_1 x$, $\; w_0 + w_1 x^2$, $\; w_0 + w_1 x^3$, $\; w_0 + w_1 x^4$
  via $RSS = \sum_{i=1}^{n}(y_i - f(\mathbf{x}_i))^2$, giving 0.249, 0.270, 0.274, 0.271;
  so the selected model is $0.677 + 0.09x$ with RSS(1) = 0.249
• Step 2: select the second feature (m = 2) from the set of candidate models
  $w_0 + w_1 x + w_2 x^2$, $\; w_0 + w_1 x + w_2 x^3$, $\; w_0 + w_1 x + w_2 x^4$
  via RSS = 0.0615, 0.05424, 0.05422;
  selected model: $0.5769 + 0.6009x - 0.6814x^4$ with RSS(2) = 0.05422
Forward subset selection (greedy method)
• Step 2 (cont'd): check whether including the second feature in the model is justified, using some statistical criterion, usually the F-test:
  $F = \frac{RSS(m) - RSS(m+1)}{RSS(m+1)/(n-m-2)}$, so the (m+1)-st feature is included only if F > 90
  For adding the second feature: $F = \frac{0.2493 - 0.05422}{0.05422/(30-2-2)} = 93.59$, so we keep it in the model
• Step 3: select the third feature from the set of candidate models
  $w_0 + w_1 x + w_2 x^4 + w_3 x^2$ with RSS = 0.05362, $\; w_0 + w_1 x + w_2 x^4 + w_3 x^3$ with RSS = 0.05363
  Test whether adding the third feature is justified via the F-test:
  $F = \frac{0.05422 - 0.05362}{0.05362/(30-3-2)} = 0.2799$ → not justified, so
  the final model is $0.5769 + 0.6009x - 0.6814x^4$
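A compact MATLAB sketch of this forward stepwise procedure for the polynomial features x, x^2, x^3, x^4; the F > 90 threshold and the degrees of freedom follow the examples above, while the data are regenerated here, so the numerical values will differ:

% forward stepwise selection over the polynomial features x, x^2, x^3, x^4
n = 30;
x = rand(n,1);
y = -x.^3 + x + 0.5 + 0.05*randn(n,1);          % data as on the earlier slide (noise scale assumed)
Feat = [x x.^2 x.^3 x.^4];                      % candidate features
rssFit = @(A) sum((y - A*((A'*A)\(A'*y))).^2);  % RSS of a least-squares fit
selected = [];  remaining = 1:4;
while ~isempty(remaining)
    rssBest = inf;                              % best RSS over candidate additions
    for j = remaining
        rssJ = rssFit([ones(n,1) Feat(:, [selected j])]);
        if rssJ < rssBest, rssBest = rssJ; jBest = j; end
    end
    m = numel(selected);
    if m > 0                                    % F-test before adding the (m+1)-st feature
        F = (rssOld - rssBest) / (rssBest/(n - (m+1) - 2));
        if F <= 90, break; end                  % improvement not justified -> stop
    end
    selected = [selected jBest];  remaining = setdiff(remaining, jBest);
    rssOld = rssBest;
end
disp(selected);                                 % indices (powers of x) of the selected features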
OUTLINE
• Objectives
• Statistical Methodology and Basic Methods
• Taxonomy of Nonlinear Methods
• Decision Trees
• Additive Modeling and Projection Pursuit
• Greedy Feature Selection
• Signal Denoising
  Refs: V. Cherkassky and X. Shao, Signal estimation and denoising using VC-theory, Neural Networks, 14, 37-52, 2001
  V. Cherkassky and S. Kilts, Myopotential denoising of ECG signals using wavelet thresholding methods, Neural Networks, 14, 1129-1137, 2001
• Summary and discussion
Signal Denoising Problem
[Figure: examples of noisy signals, plotted vs. sample index (0–1000 and 0–100)]
Signal denoising problem statement
• Regression formulation ~ real-valued function estimation (with squared loss)
• Signal representation: linear combination of orthogonal basis functions (harmonic, wavelets)
  $y = \sum_i w_i\, g_i(x)$
• Differences (from the standard formulation):
  - fixed sampling rate
  - training data X-values = test data X-values
  → computationally efficient orthogonal estimators: Discrete Fourier / Wavelet Transform (DFT / DWT)
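A minimal MATLAB sketch of denoising in an orthogonal (harmonic) basis: transform with the DFT, keep only the m largest-magnitude coefficients, and reconstruct. Here m is fixed by hand; the following slides address how to order the coefficients and how to select m (thresholding):

% denoising by keeping the m largest coefficients in an orthogonal (DFT) basis
n = 128;
t = (0:n-1)'/n;
signal = sin(2*pi*4*t) + 0.5*sign(sin(2*pi*2*t));   % illustrative target signal
y = signal + 0.4*randn(n,1);                        % noisy observations (fixed sampling rate)
W = fft(y)/n;                                       % coefficients in the harmonic basis
m = 10;                                             % number of coefficients kept (fixed by hand)
[~, order] = sort(abs(W), 'descend');               % order coefficients by magnitude
Wt = zeros(n,1);
Wt(order(1:m)) = W(order(1:m));                     % hard thresholding: keep the m largest
y_denoised = real(ifft(Wt)*n);                      % reconstruct the denoised signal
plot(t, y, '.', t, y_denoised, '-');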
Examples of wavelets
see http://en.wikipedia.org/wiki/Wavelet
[Figure: Haar wavelet and Symmlet wavelet]
[Figure: Meyer wavelet and Mexican Hat wavelet]
Wavelets (cont'd)
Example of translated and dilated wavelet basis functions:
[Figure: a mother wavelet and several of its translations and dilations]
Issues for signal denoising
• Denoising via (wavelet) thresholding
  - wavelet thresholding = sparse feature selection
  - nonlinear estimator suitable for ERM
• Main factors for signal denoising $y = \sum_i w_i\, g_i(x)$:
  - Representation (choice of basis functions)
  - Ordering (of basis functions) ~ SRM structure
  - Thresholding (model selection)
• Large-sample setting: representation
• Finite-sample setting: thresholding + ordering
Framework for signal denoising
• Ordering of the (wavelet) coefficients for thresholding = structure on orthogonal basis functions
  Traditional ordering: $|w_{k_1}| \ge |w_{k_2}| \ge \ldots \ge |w_{k_m}| \ge \ldots$
  Better ordering: $\frac{|w_{k_1}|}{\mathrm{freq}_{k_1}} \ge \frac{|w_{k_2}|}{\mathrm{freq}_{k_2}} \ge \ldots \ge \frac{|w_{k_m}|}{\mathrm{freq}_{k_m}} \ge \ldots$
• VC thresholding:
  the optimal number of wavelets ~ via the minimum of the VC bound for regression, where the VC dimension is h = m (number of wavelets, or DoF)
Empirical Results: signal denoising
• Two target functions: Blocks and Heavisine
• Symmlet wavelet
• Data set: 128 noisy samples, SNR = 2.5
[Figure: the Blocks and Heavisine target functions]
Empirical Results: Blocks signal estimated by VC-based denoising
[Figure]
Empirical Results: Heavisine signal estimated by VC-based denoising
[Figure]
Application Study: ECG Denoising
[Figure: noisy ECG signal]
A closer look at a noisy segment
[Figure: close-up of a noisy ECG segment]
Denoised ECG signal
VC denoising applied to 4,096 noisy samples. The final model (below) has 76 wavelets.
[Figure: denoised ECG signal]
OUTLINE
• Objectives
• Statistical Methodology and Basic Methods
• Taxonomy of Nonlinear Methods
• Decision Trees
• Additive Modeling and Projection Pursuit
• Greedy Feature Selection
• Signal Denoising
• Summary and discussion
Summary and Discussion
• Evolution of statistical methods:
  - parametric → flexible (adaptive)
  - fast optimization (favor greedy methods – why?)
  - interpretable
  - model complexity ~ number of parameters (basis functions, regions, features, ...)
  - batch mode (for training)
• Probabilistic framework:
  - classical methods assume probabilistic models of observed data
  - adaptive statistical methods lack probabilistic derivation, but use clever heuristics for controlling model complexity