Overview of Predictive Learning

Vladimir Cherkassky
University of Minnesota
cherk001@umn.edu
Presented at the University of Cyprus, 2009
Electrical and Computer Engineering
OUTLINE
• Background and motivation
• Application study: real-time pricing of mutual funds
• Inductive Learning and Philosophy
• Two methodologies: classical statistics and predictive learning
• Statistical Learning Theory and SVM
• Summary and discussion
Recall:
Learning ~ function estimation
Math terminology
• Past observations ~ data points
• Explanation (model) ~ function
→ Learning ~ function estimation (from data points)
→ Prediction ~ using the estimated model to make predictions
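The mapping above (observations ~ data points, model ~ function) can be sketched in a few lines. This is a minimal illustration, not taken from the talk: the straight-line model, the helper names `fit_line` and `predict`, and the data values are all assumptions.

```python
def fit_line(xs, ys):
    """Learning: least-squares estimate of y = a*x + b from data points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def predict(model, x):
    """Prediction: apply the estimated function to a new input."""
    a, b = model
    return a * x + b


# Hypothetical past observations (made up for this sketch).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]

model = fit_line(xs, ys)      # explanation (model) ~ function
print(predict(model, 5.0))    # prediction for a yet unobserved input
```

The two steps are kept as separate functions on purpose: fitting uses only past data, prediction uses only the estimated model.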
Statistical vs Predictive Approach
• Binary Classification problem
estimate decision boundary from training data (x_i, y_i)
Assuming distribution P(x,y) were known:
[Figure: (x1, x2) space, with axes x1 and x2]
Classical Statistical Approach
(1) parametric form of unknown distribution P(x,y) is known
(2) estimate parameters of P(x,y) from training data
(3) Construct decision boundary using estimated distribution
and given misclassification costs
Modeling assumption:
Unknown P(x,y) can be accurately estimated from available data
[Figure: estimated boundary, with axes x1 and x2]
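Steps (1)-(3) of the classical approach can be sketched as follows. Everything specific here is an assumption made for illustration, not something the slides state: the parametric form (Gaussian classes with a shared isotropic variance), the function names, the toy data, and the cost dictionary `miss_cost` (the cost of misclassifying a sample that truly belongs to each class).

```python
import math


def estimate_parameters(points, labels):
    """Step (2): estimate priors, class means and a shared isotropic variance."""
    classes = sorted(set(labels))
    n, dim = len(points), len(points[0])
    priors, means = {}, {}
    for c in classes:
        members = [p for p, y in zip(points, labels) if y == c]
        priors[c] = len(members) / n
        means[c] = [sum(p[j] for p in members) / len(members) for j in range(dim)]
    sq_dev = sum((p[j] - means[y][j]) ** 2
                 for p, y in zip(points, labels) for j in range(dim))
    return priors, means, sq_dev / (n * dim)


def classify(x, priors, means, variance, miss_cost):
    """Step (3): choose the class with the largest cost-weighted posterior."""
    def score(c):
        sq_dist = sum((xj - mj) ** 2 for xj, mj in zip(x, means[c]))
        # log(cost) + log(prior) + Gaussian log-likelihood (shared constants dropped)
        return math.log(miss_cost[c]) + math.log(priors[c]) - sq_dist / (2 * variance)
    return max(priors, key=score)


# Hypothetical training data: two well-separated classes in the (x1, x2) plane.
points = [(0, 0), (1, 0), (0, 1), (1, 1), (4, 4), (5, 4), (4, 5), (5, 5)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

priors, means, variance = estimate_parameters(points, labels)
equal_costs = {0: 1.0, 1: 1.0}
print(classify((1.0, 1.5), priors, means, variance, equal_costs))
print(classify((4.0, 3.5), priors, means, variance, equal_costs))
```

The decision rule is only as good as the assumed distribution: if the Gaussian form is wrong, the boundary built from it inherits that error.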
Predictive Modeling Approach
(1) parametric form of decision boundary f(x,w) is given
(2) Explain available data by fitting f(x,w), i.e., by minimizing
some loss function (e.g., squared error)
(3) The function f(x,w*) with the smallest fitting error is then
used for prediction
Modeling assumption:
- Need to specify f(x,w) and loss function a priori.
- No need to estimate P(x,y)
[Figure: estimated boundary, with axes x1 and x2]
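For contrast, the predictive approach can be sketched without any density estimate. The choices below are assumptions made for this illustration: a linear f(x,w) = w0 + w1*x1 + w2*x2, squared-error loss, labels coded as -1/+1, plain gradient descent with a hand-picked step size, and made-up data.

```python
def f(x, w):
    """Assumed parametric form of the decision function: linear in x."""
    return w[0] + w[1] * x[0] + w[2] * x[1]


def fit_squared_error(points, labels, lr=0.01, n_iter=2000):
    """Minimize the mean squared error between f(x, w) and y by gradient descent."""
    w = [0.0, 0.0, 0.0]
    n = len(points)
    for _ in range(n_iter):
        grad = [0.0, 0.0, 0.0]
        for x, y in zip(points, labels):
            err = f(x, w) - y
            grad[0] += 2 * err / n
            grad[1] += 2 * err * x[0] / n
            grad[2] += 2 * err * x[1] / n
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w


def predict(x, w):
    """Decision rule: the sign of the fitted function."""
    return 1 if f(x, w) >= 0 else -1


# Hypothetical training data with labels -1 / +1.
points = [(0, 0), (1, 0), (0, 1), (1, 1), (4, 4), (5, 4), (4, 5), (5, 5)]
labels = [-1, -1, -1, -1, 1, 1, 1, 1]

w_star = fit_squared_error(points, labels)
print(predict((0.5, 0.5), w_star), predict((4.5, 4.5), w_star))
```

Note that P(x,y) never appears: only the form of f(x,w) and the loss are fixed in advance.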
Philosophical Interpretation
Unknown system, observed data (input x, output y)
Unknown P(x,y)
Goal is to estimate
a function: y = f (x)
Probabilistic Approach ~
Goal is to estimate the true model for data (x,y)
i.e., System Identification → REALISM
Predictive Approach ~
Goal is to imitate (predict) System output y
i.e., System Imitation → INSTRUMENTALISM
Classification with High-Dimensional Data
• Digit recognition 5 vs 8:
each example ~ 16 x 16 pixel image
→ 256-dimensional vector x
• Given finite number of labeled examples,
estimate decision rule y = f(x) for classifying new images
Note: x ~ 256-dimensional vector, y ~ binary class label 0/1
• Estimation of P(x,y) with finite data is not possible
• Accurate estimation of decision boundary in 256-dim.
space is possible, using just a few hundred samples
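A rough count shows why estimating P(x,y) is out of reach here. The sketch rests on assumptions that are not in the slides: each pixel is quantized to just two levels, the density is estimated by a histogram with one cell per possible image, and 500 stands in for "a few hundred" samples.

```python
# Assumption: binary pixels, one histogram cell per possible image.
n_pixels = 16 * 16
n_cells = 2 ** n_pixels      # number of distinct binary 16 x 16 images
n_samples = 500              # hypothetical "few hundred" labeled examples

# Upper bound on the fraction of cells that can contain at least one sample.
max_fraction_covered = n_samples / n_cells
print(n_pixels, n_cells, max_fraction_covered)
```

Even under this crude quantization almost every cell stays empty, whereas a decision rule only has to separate the two classes.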
Statistical vs Predictive
Predictive approach
- estimates certain properties of unknown P(x,y)
that are useful for predicting y
- has solid theoretical foundations (VC-theory)
- successfully used in many apps
BUT its methodology + concepts are different from
classical statistical estimation:
- understanding of application
- a priori specification of a loss function (necessary for
imitation)
- interpretation of predictive models is hard
- possibility of several good models estimated from the
same data
Quick Tour of VC-theory -1
Goals of Predictive Learning
- explain (or fit) available training data
- predict well future (yet unobserved) data
- ample empirical evidence in many apps
Similar to biological learning
Example: given 1, 3, 7, …
predict the rest of the sequence.
Rule 1: x_{k+1} = x_k + 2^k (with x_1 = 1)
Rule 2: randomly chosen odd numbers
Rule 3: x_k = k^2 - k + 1
BUT for sequence 1, 3, 7, 15, 31, 63, …,
Rule 1 seems very reliable (why?)
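The two deterministic rules can be compared directly. The sketch reads Rule 1 as the recurrence x_{k+1} = x_k + 2^k with x_1 = 1 and Rule 3 as x_k = k^2 - k + 1, both indexed from k = 1; that reading and the function names are assumptions.

```python
def rule1(n):
    """First n terms of x_{k+1} = x_k + 2^k, starting from x_1 = 1."""
    seq = [1]
    for k in range(1, n):
        seq.append(seq[-1] + 2 ** k)
    return seq


def rule3(n):
    """First n terms of x_k = k^2 - k + 1 for k = 1, ..., n."""
    return [k * k - k + 1 for k in range(1, n + 1)]


print(rule1(3), rule3(3))   # both rules explain the observed 1, 3, 7
print(rule1(6))             # [1, 3, 7, 15, 31, 63]
print(rule3(6))             # [1, 3, 7, 13, 21, 31]
```

Both rules fit the first three terms, so three observations cannot separate them. Each additional term that agrees with Rule 1 is a prediction that could have failed, which is what makes the longer sequence convincing.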
Quick Tour of VC-theory - 2
Main Practical Result of VC-theory:
If a model explains well past data AND
is simple, then it can predict well
• This explains why Rule 1 is a good model for
sequence 1, 3, 7, 15, 31, 63, …,
• Measure of model complexity ~ VC-dimension
~ Ability to explain past data 1, 3, 7, 15, 31, 63
BUT cannot explain all other possible sequences
→ Low VC-dimension (~ large falsifiability)
• For linear models, VC-dim = DoF (as in statistics)
• But for nonlinear models they are different
Quick Tour of VC-theory - 3
Strategy for modeling high-dimensional data:
Find a model f(x) that explains past data AND
has low VC-dimension, even when dim. is large
SVM methods
for high-dim data:
Large margin =
Low VC-dimension
~ easy to falsify
Non-separable data: classification
Margin = 2Δ
L(y, f(x,w)) = max(Δ - y f(x,w), 0)
Support Vectors
• SV’s ~ training samples with non-zero loss
• SV’s are samples that falsify the model
• The model depends only on SVs
→ SVs ~ robust characterization of the data
WSJ Feb 27, 2004:
About 40% of us (Americans) will vote for a Democrat, even if the
candidate is Genghis Khan. About 40% will vote for a Republican,
even if the candidate is Attila the Hun. This means that the election
is left in the hands of one-fifth of the voters.
• SVM Generalization ~ data compression
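The characterization of SVs as samples with non-zero loss can be checked for a given linear model. This sketch assumes labels in {-1, +1}, a margin parameter of 1, and a hand-picked (not trained) w and b; the function names and data are hypothetical. It follows the slide's approximate definition, so samples lying exactly on a margin border, which an SVM solver may also report as support vectors, are not flagged.

```python
def decision(x, w, b):
    """Linear decision function D(x) = (w . x) + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def support_vector_indices(points, labels, w, b, delta=1.0):
    """Indices of samples whose loss max(delta - y*D(x), 0) is non-zero."""
    return [i for i, (x, y) in enumerate(zip(points, labels))
            if max(delta - y * decision(x, w, b), 0.0) > 0.0]


# Hypothetical data and a hand-picked linear model.
points = [(0, 0), (2, 2), (3, 3), (5, 5)]
labels = [-1, -1, 1, 1]
w, b = (0.5, 0.5), -2.5

print(support_vector_indices(points, labels, w, b))   # [1, 2]
```

Only the two samples near the boundary are flagged; moving the outer samples further away would change neither the list nor the loss.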
Nonlinear Decision Boundary
• Fixed (linear) parameterization is too rigid
• A nonlinear (curved) decision boundary may yield a larger margin
(falsifiability) and lower error → nonlinear kernel SVM
Handwritten Digit Recognition (mid-90’s)
• Data set:
postal images (zip-code), segmented, cropped;
~ 7K training samples, and 2K test samples
• Data encoding:
16x16 pixel image → 256-dim. vector
• Summary: test error rate ~ 3-4%
- prediction accuracy better than custom NN’s
- accuracy does not depend on the kernel type
- 100 – 400 support vectors per class (digit)
Interpretation of SVM models
Humans cannot provide an interpretation of
high-dimensional data, even when they
make good decisions (predictions) using
such data
e.g., digit recognition
How to interpret high-dimensional models?
- Project data samples onto the normal direction w of the
SVM decision boundary D(x) = (w · x) + b
- Interpret univariate histograms of projections
Univariate histogram (of projections)
• Project training data onto normal vector w of trained SVM
(w · x) + b
[Figure: projections onto w, with marks at -1, 0 and +1]
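The histogram of projections can be sketched in plain Python. The sketch assumes a linear model D(x) = (w · x) + b whose w and b are already known (they are hand-picked here, not trained); the function names, bin layout and data are illustrative assumptions.

```python
def projections(points, w, b):
    """Project each sample onto the normal direction: D(x) = (w . x) + b."""
    return [sum(wi * xi for wi, xi in zip(w, x)) + b for x in points]


def histogram(values, lo, hi, n_bins):
    """Counts in n_bins equal-width bins over [lo, hi]; values outside are ignored."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        if lo <= v <= hi:
            idx = min(int((v - lo) / width), n_bins - 1)
            counts[idx] += 1
    return counts


# Hypothetical data and a hand-picked linear model.
points = [(0, 0), (2, 2), (3, 3), (5, 5)]
w, b = (0.5, 0.5), -2.5

proj = projections(points, w, b)
print(proj)
print(histogram(proj, -3.0, 3.0, 6))
```

Whatever the input dimension, the projection is a single number per sample, so the histogram can be read off in one dimension. Values between -1 and +1 fall inside the margin.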
Projections for high-dimensional data -1
• Most training samples cluster on margin borders
• For 5 vs 8 recognition data, 100 training samples:
→ Explanation (~ fitting of training data) is easy
[Figure: histogram of projections of the training data]
Continued..
• BUT test data projections (for this SVM model) have
completely different distribution:
• For 5 vs 8 recognition data, 1000 test samples:
test error ~ 6% → prediction is more difficult
[Figure: histogram of projections of the test data]
Projections for high-dimensional data-2
• For 5 vs 8 recognition data, 1000 training samples
Projections of training data:
[Figure: histogram of projections of the training data]
Continued..
For this SVM model, test error is ~ 1.35%
And histogram of projections for 1000 test samples:
[Figure: histogram of projections of the test data]
Summary
In many real-life applications:
1. Estimation of models that can explain
available data is easy
2. Estimation of models that can make useful
predictions is very difficult
3. It is important to make clear distinction
between (1) and (2)
Usually this constitutes the difference between
beliefs (opinions) and predictive models
Current Challenges
• Non-technical:
- lack of agreement on understanding of uncertainty and risk
• Technical:
- many different fragmented disciplines dealing with predictive learning
• VC-theory gives a consistent practical approach for handling
uncertainty and risk, but it is often misinterpreted by scientists
Acknowledgements
• Parts of this presentation are taken
- from the forthcoming book Introduction to Predictive Learning
by V. Cherkassky and Y. Ma, Springer 2010
- and from the course EE 4389 at
www.ece.umn.edu/users/cherkass/ee4389