Introduction to Predictive Learning
LECTURE SET 2: Basic Learning Approaches and Complexity Control
Electrical and Computer Engineering
1
OUTLINE
2.0 Objectives
2.1 Terminology and Basic Learning Problems
2.2 Basic Learning Approaches
2.3 Generalization and Complexity Control
2.4 Application Example
2.5 Summary
2
2.0 Objectives
1. To quantify the notions of explanation, prediction and model
2. Introduce terminology
3. Describe basic learning methods
• Past observations ~ data points
• Explanation (model) ~ function
• Learning ~ function estimation
• Prediction ~ using the estimated model to make predictions
3
2.0 Objectives (cont’d)
• Example: classification (training samples, model)
  - Goal 1: explanation of training data
  - Goal 2: generalization (for future data)
• Learning is ill-posed
4
Learning as Induction
Induction ~ function estimation from data:
Deduction ~ prediction for new inputs:
5
2.1 Terminology and Learning Problems
• Input and output variables
  [Block diagram: inputs x, z → System → output y]
• Learning ~ estimation of F(X): X → y
• Statistical dependency vs causality
  [Scatter plot of y vs x illustrating statistical dependency]
6
2.1.1 Types of Input and Output Variables
• Real-valued
• Categorical (class labels)
• Ordinal (or fuzzy) variables
  [Figure: membership value vs Weight (lbs), 75-225, for fuzzy sets LIGHT, MEDIUM, HEAVY]
• Aside: fuzzy sets and fuzzy logic
7
Data Preprocessing and Scaling
• Preprocessing is required with observational data (step 4 in the general experimental procedure)
  Examples: ….
• Basic preprocessing includes
  - summary univariate statistics: mean, standard deviation, min + max value, range, boxplot, computed independently for each input/output variable
  - detection (removal) of outliers
  - scaling of input/output variables (may be required for some learning algorithms; see the scaling sketch below)
• Visual inspection of data is tedious but useful
8
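A minimal sketch of the scaling step, assuming NumPy; the function name and the small data array are illustrative, not part of the original slides:

import numpy as np

def minmax_scale(X):
    """Scale each column of X to the [0, 1] range (min-max scaling)."""
    X = np.asarray(X, dtype=float)
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    return (X - x_min) / (x_max - x_min)

# Illustrative example: body weight (kg) and brain weight (g) for a few animals
data = [[1.35, 8.1],
        [465.0, 423.0],
        [62.0, 1320.0]]
print(minmax_scale(data))   # each column now lies in [0, 1]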
Example Data Set: animal body & brain weight

                     Body weight (kg)   Brain weight (g)
 1 Mountain beaver         1.350              8.100
 2 Cow                   465.000            423.000
 3 Gray wolf              36.330            119.500
 4 Goat                   27.660            115.000
 5 Guinea pig              1.040              5.500
 6 Diplodocus          11700.000             50.000
 7 Asian elephant       2547.000           4603.000
 8 Donkey                187.100            419.000
 9 Horse                 521.000            655.000
10 Potar monkey           10.000            115.000
11 Cat                     3.300             25.600
12 Giraffe               529.000            680.000
13 Gorilla               207.000            406.000
14 Human                  62.000           1320.000
9
Example Data Set: cont’d

                     Body weight (kg)   Brain weight (g)
15 African elephant     6654.000           5712.000
16 Triceratops          9400.000             70.000
17 Rhesus monkey           6.800            179.000
18 Kangaroo               35.000             56.000
19 Hamster                 0.120              1.000
20 Mouse                   0.023              0.400
21 Rabbit                  2.500             12.100
22 Sheep                  55.500            175.000
23 Jaguar                100.000            157.000
24 Chimpanzee             52.160            440.000
25 Brachiosaurus       87000.000            154.500
26 Rat                     0.280              1.900
27 Mole                    0.122              3.000
28 Pig                   192.000            180.000
10
Original Unscaled Animal Data:
what points are outliers?
11
Animal Data: with outliers removed and scaled to [0,1] range; humans in the top left corner
[Scatter plot: Brain weight vs Body weight, both scaled to [0, 1]]
12
2.1.2 Supervised Learning: Regression
• Data in the form (x, y), where
  - x is a multivariate input (i.e., a vector)
  - y is a univariate output (‘response’)
• Regression: y is real-valued
  → Estimation of a real-valued function x → y
13
2.1.2 Supervised Learning: Classification
• Data in the form (x, y), where
  - x is a multivariate input (i.e., a vector)
  - y is a univariate output (‘response’)
• Classification: y is categorical (class label)
  → Estimation of an indicator function x → y
14
2.1.2 Unsupervised Learning
• Data in the form (x), where
  - x is a multivariate input (i.e., a vector)
• Goal 1: data reduction or clustering
  → Clustering = estimation of a mapping X → c
15
Unsupervised Learning (cont’d)
• Goal 2: dimensionality reduction
  Finding a low-dimensional model of the data
16
2.1.3 Other (nonstandard) learning problems
• Multiple model estimation:
17
OUTLINE
2.0 Objectives
2.1 Terminology and Learning Problems
2.2 Basic Learning Approaches
- Parametric Modeling
- Non-parametric Modeling
- Data Reduction
2.3 Generalization and Complexity Control
2.4 Application Example
2.5 Summary
18
2.2.1 Parametric Modeling
Given training data $(x_i, y_i),\; i = 1, 2, \ldots, n$
(1) Specify a parametric model
(2) Estimate its parameters (via fitting to data)
• Example: Linear regression $F(\mathbf{x}) = (\mathbf{w} \cdot \mathbf{x}) + b$

$$\sum_{i=1}^{n} \big( y_i - (\mathbf{w} \cdot \mathbf{x}_i) - b \big)^2 \to \min$$
19
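A minimal sketch of step (2) for the linear regression example above, fitting w and b by ordinary least squares with NumPy; the synthetic training data and noise level are illustrative assumptions, not from the slides:

import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 2
X = rng.uniform(0, 1, size=(n, d))                          # training inputs x_i
w_true, b_true = np.array([2.0, -1.0]), 0.5
y = X @ w_true + b_true + 0.1 * rng.standard_normal(n)      # noisy outputs y_i

# Least-squares fit: append a column of ones so b is estimated jointly with w
X1 = np.hstack([X, np.ones((n, 1))])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
w_hat, b_hat = coef[:-1], coef[-1]
print(w_hat, b_hat)      # parameters minimizing the sum of squared errors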
Parametric Modeling
Given training data $(x_i, y_i),\; i = 1, 2, \ldots, n$
(1) Specify parametric model
(2) Estimate its parameters (via fitting to data)
Univariate classification:
20
2.2.2 Non-Parametric Modeling
Given training data $(x_i, y_i),\; i = 1, 2, \ldots, n$,
estimate the model (for a given $x_0$) as a ‘local average’ of the data.
Note: need to define ‘local’, ‘average’
• Example: k-nearest neighbors regression

$$f(x_0) = \frac{1}{k} \sum_{j=1}^{k} y_j$$

where the sum is over the k nearest neighbors of $x_0$.
21
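A minimal sketch of k-nearest-neighbors regression as defined above, assuming Euclidean distance (a common but here unstated choice); the toy data are illustrative:

import numpy as np

def knn_regression(x0, X_train, y_train, k):
    """Predict f(x0) as the average response of the k nearest training points."""
    dists = np.linalg.norm(X_train - x0, axis=1)   # distances from x0 to all x_i
    nearest = np.argsort(dists)[:k]                # indices of the k nearest neighbors
    return y_train[nearest].mean()                 # local average of their y-values

# Illustrative 1-D example
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(20, 1))
y = X[:, 0] ** 2 + 0.1 * rng.standard_normal(20)
print(knn_regression(np.array([0.5]), X, y, k=4))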
2.2.3 Data Reduction Approach
Given training data, estimate the model as a ‘compact encoding’ of the data.
Note: ‘compact’ ~ # of bits needed to encode the model
• Example: piece-wise linear regression
  How many parameters are needed for a two-linear-component model?
22
Example: piece-wise linear regression vs linear regression
[Figure: y vs x on [0, 1], y roughly in [-1, 1.5], comparing a piece-wise linear fit with a single linear fit]
23
Data Reduction Approach (cont’d)
Data reduction approaches are commonly used for unsupervised learning tasks.
• Example: clustering.
  Training data encoded by 3 points (cluster centers); see the k-means sketch below.
  Issues:
  - How to find the centers?
  - How to select the number of clusters?
24
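A minimal k-means sketch of the clustering example above (encoding the data by 3 cluster centers); Lloyd's algorithm is one standard way to find the centers, and the synthetic data are illustrative:

import numpy as np

def kmeans(X, k=3, n_iter=50, seed=0):
    """Plain k-means: encode the data set by k cluster centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]      # initial centers
    for _ in range(n_iter):
        # assign each point to its nearest center, then recompute the centers
        labels = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=2), axis=1)
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return centers, labels

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 2))
               for c in [(0, 0), (2, 2), (0, 2)]])
centers, labels = kmeans(X, k=3)
print(centers)       # three points that encode the whole training set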
Inductive Learning Setting
Induction and Deduction in Philosophy:
All observed swans are white (data samples).
Therefore, all swans are white.
• Model estimation ~ inductive step, i.e., estimating a function from data samples.
• Prediction ~ deductive step
→ Inductive Learning Setting
• Discussion: which of the 3 modeling approaches follow inductive learning?
• Do humans implement inductive inference?
25
OUTLINE
2.0 Objectives
2.1 Terminology and Learning Problems
2.2 Modeling Approaches & Learning Methods
2.3 Generalization and Complexity Control
- Prediction Accuracy (generalization)
- Complexity Control: examples
- Resampling
2.4 Application Example
2.5 Summary
26
2.3.1 Prediction Accuracy
Inductive Learning ~ function estimation
• All modeling approaches implement ‘data
fitting’ ~ explaining the data
• BUT True goal ~ prediction
• Two possible goals of learning:
- estimation of ‘true function’
- good generalization for future data
• Are these two goals equivalent?
• If not, which one is more practical?
27
Explanation vs Prediction
[Figure: (a) Classification, (b) Regression]
28
Inductive Learning Setting
• The learning machine observes samples (x, y) and returns an estimated response $\hat{y} = f(\mathbf{x}, w)$
• Recall ‘first-principles’ vs ‘empirical’ knowledge
  Two modes of inference: identification vs imitation
• Risk: $R(w) = \int \mathrm{Loss}\big(y, f(\mathbf{x}, w)\big)\, dP(\mathbf{x}, y) \to \min$
29
Discussion
• Math formulation is useful for quantifying
  - explanation ~ fitting error (on training data)
  - generalization ~ prediction error
• Natural assumptions
  - future similar to past: stationary P(x, y), i.i.d. data
  - a discrepancy measure or loss function, e.g., MSE
• What if these assumptions do not hold?
30
Example: Regression
Given: training data $(x_i, y_i),\; i = 1, 2, \ldots, n$.
Find a function $f(\mathbf{x}, w)$ that minimizes the squared error for a large number (N) of future samples:

$$\sum_{k=1}^{N} \big[ y_k - f(\mathbf{x}_k, w) \big]^2 \to \min$$

$$\Longrightarrow \int \big( y - f(\mathbf{x}, w) \big)^2 \, dP(\mathbf{x}, y) \to \min$$

BUT future data is unknown ~ P(x, y) is unknown
31
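A small Monte Carlo sketch of the point above: the fitting error on training data typically understates the squared error on future samples. Here the distribution P(x, y) and the linear model are illustrative assumptions; in practice P(x, y) is unknown, which is the whole difficulty:

import numpy as np

rng = np.random.default_rng(3)

def sample(n):
    """Draw n samples from the (here assumed, in practice unknown) P(x, y)."""
    x = rng.uniform(0, 1, n)
    y = x ** 2 + rng.normal(0, 0.5, n)
    return x, y

# Fit a linear model f(x, w) = w1*x + w0 on a small training set
x_tr, y_tr = sample(10)
w = np.polyfit(x_tr, y_tr, deg=1)

train_mse = np.mean((y_tr - np.polyval(w, x_tr)) ** 2)      # fitting error
x_new, y_new = sample(100000)                               # N future samples
pred_mse = np.mean((y_new - np.polyval(w, x_new)) ** 2)     # prediction error
print(train_mse, pred_mse)   # prediction error is typically the larger of the two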
2.3.2 Complexity Control: parametric modeling
Consider regression estimation
• Ten training samples from $y = x^2 + N(0, \sigma^2)$, where $\sigma^2 = 0.25$
• Fitting linear and 2nd-order polynomial models:
32
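A minimal sketch of this experiment, assuming x is sampled uniformly on [0, 1] (the slides do not state the input distribution); it fits 1st- and 2nd-order polynomials to ten noisy samples of y = x^2:

import numpy as np

rng = np.random.default_rng(4)

# Ten training samples from y = x^2 + N(0, 0.25), i.e. sigma = 0.5
x = rng.uniform(0, 1, 10)
y = x ** 2 + rng.normal(0, 0.5, 10)

for degree in (1, 2):
    w = np.polyfit(x, y, deg=degree)                 # fit polynomial of given degree
    fit_mse = np.mean((y - np.polyval(w, x)) ** 2)
    print(f"degree {degree}: coefficients {np.round(w, 3)}, training MSE {fit_mse:.3f}")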
Complexity Control: local estimation
Consider regression estimation
• Ten training samples from $y = x^2 + N(0, \sigma^2)$, where $\sigma^2 = 0.25$
• Using k-nn regression with k = 1 and k = 4:
33
Complexity Control (cont’d)
• Complexity (of admissible models)
affects generalization (for future data)
• Specific complexity indices for
– Parametric models: ~ # of parameters
– Local modeling: size of local region
– Data reduction: # of clusters
• Complexity control = choosing good complexity (~ good generalization) for the given (training) data
34
How to Control Complexity?
• Two approaches: analytic and resampling
• Analytic criteria estimate prediction error as a function of fitting error and model complexity.
  For regression problems:

$$R = r\!\left(\frac{\mathrm{DoF}}{n}\right) \cdot R_{emp}$$

Representative analytic criteria for regression:
• Schwartz Criterion: $r(p, n) = 1 + p\,(1 - p)^{-1} \ln n$
• Akaike’s FPE: $r(p) = (1 + p)(1 - p)^{-1}$
where p = DoF/n, n ~ sample size, DoF ~ degrees of freedom
35
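A short sketch evaluating these analytic criteria, using the penalization factors as reconstructed above; the sample size, DoF and empirical risk values are purely illustrative:

import numpy as np

def schwartz(p, n):
    """Schwartz criterion factor r(p, n) = 1 + p (1 - p)^(-1) ln n."""
    return 1 + p / (1 - p) * np.log(n)

def fpe(p):
    """Akaike's final prediction error factor r(p) = (1 + p)(1 - p)^(-1)."""
    return (1 + p) / (1 - p)

# Estimated prediction risk R = r(p) * R_emp for a model with DoF parameters
n, DoF, R_emp = 25, 6, 0.08
p = DoF / n
print("FPE estimate:", fpe(p) * R_emp)
print("Schwartz estimate:", schwartz(p, n) * R_emp)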
2.3.3 Resampling
• Split the available data into 2 sets: Training + Validation
  (1) Use the training set for model estimation (via data fitting)
  (2) Use the validation data to estimate the prediction error of the model
• Change the model complexity index and repeat (1) and (2)
• Select the final model providing the lowest (estimated) prediction error
BUT results are sensitive to the data splitting
36
K-fold cross-validation
1. Divide the training data Z into k randomly selected disjoint subsets {Z1, Z2, …, Zk} of size n/k
2. For each ‘left-out’ validation set Zi:
   - use the remaining data to estimate the model $\hat{y} = f_i(\mathbf{x})$
   - estimate the prediction error on Zi:  $r_i = \frac{k}{n} \sum_{(\mathbf{x},\, y) \in Z_i} \big( f_i(\mathbf{x}) - y \big)^2$
3. Estimate the average prediction risk as  $R_{cv} = \frac{1}{k} \sum_{i=1}^{k} r_i$
37
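A minimal sketch of the k-fold procedure above, written generically over a fit/predict pair; the helper name and the simple polynomial usage are illustrative assumptions:

import numpy as np

def kfold_cv_risk(X, y, fit, predict, k=5, seed=0):
    """Estimate the prediction risk R_cv of a modeling procedure via k-fold CV."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)                   # k disjoint subsets of size ~n/k
    risks = []
    for i in range(k):
        val = folds[i]                               # left-out validation set Z_i
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[tr], y[tr])                    # estimate f_i on the remaining data
        risks.append(np.mean((predict(model, X[val]) - y[val]) ** 2))
    return np.mean(risks)                            # R_cv = (1/k) * sum of r_i

# Illustrative use: linear fit on noisy 1-D data
rng = np.random.default_rng(5)
x = rng.uniform(0, 1, 25)
y = x ** 2 + rng.normal(0, 0.5, 25)
print(kfold_cv_risk(x, y,
                    fit=lambda xs, ys: np.polyfit(xs, ys, deg=1),
                    predict=lambda w, xs: np.polyval(w, xs), k=5))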
Example of model selection (1)
• 25 samples are generated as $y = \sin^2(2\pi x) + \xi$, with x uniformly sampled in [0, 1] and noise $\xi \sim N(0, 1)$
• Regression is estimated using polynomials of degree m = 1, 2, …, 10
• Polynomial degree m = 5 is chosen via 5-fold cross-validation.
  The curve shows the polynomial model, along with training (*) and validation (*) data points, for one partitioning.

m    Estimated R via cross-validation
1    0.1340
2    0.1356
3    0.1452
4    0.1286
5    0.0699
6    0.1130
7    0.1892
8    0.3528
9    0.3596
10   0.4006
38
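A sketch of the degree-selection loop described above; since the random seed and data realization differ from the slide, the selected degree need not come out as m = 5:

import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0, 1, 25)
y = np.sin(2 * np.pi * x) ** 2 + rng.normal(0, 1, 25)   # noise level as stated above

k = 5
folds = np.array_split(rng.permutation(25), k)

best_m, best_risk = None, np.inf
for m in range(1, 11):                                   # candidate polynomial degrees
    risks = []
    for i in range(k):
        val = folds[i]
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        w = np.polyfit(x[tr], y[tr], deg=m)
        risks.append(np.mean((np.polyval(w, x[val]) - y[val]) ** 2))
    if np.mean(risks) < best_risk:
        best_m, best_risk = m, np.mean(risks)
print("selected degree:", best_m, "estimated risk:", round(best_risk, 4))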
Example of model selection (2)
• Same data set, but estimated using k-nn regression.
• Optimal value k = 7 chosen via 5-fold cross-validation model selection. The curve shows the k-nn model, along with training (*) and validation (*) data points, for one partitioning.

k    Estimated R via cross-validation
1    0.1109
2    0.0926
3    0.0950
4    0.1035
5    0.1049
6    0.0874
7    0.0831
8    0.0954
9    0.1120
10   0.1227
39
More on Resampling
• Leave-one-out (LOO) cross-validation
  - extreme case of k-fold when k = n (# samples)
  - efficient use of data, but requires n estimates
• The final (selected) model depends on:
  - the random data
  - the random partitioning of the data into K subsets (folds)
  → the same resampling procedure may yield different model selection results
• Some applications may use non-random splitting of the data into (training + validation)
• Model selection via resampling is based on estimated prediction risk (error).
• Does this estimated error measure reflect the true prediction accuracy of the final model?
40
Resampling for estimating true risk
• The prediction risk (test error) of a method can also be estimated via resampling
• Partition the data into: training / validation / test
• Test data should never be used for model estimation
• Double resampling method (sketched below):
  - resampling for complexity control
  - resampling for estimating the prediction performance of a method
• Estimation of prediction risk (test error) is critical for comparison of different learning methods
41
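A minimal sketch of double resampling for a k-NN classifier: outer 5-fold cross-validation estimates the test error of the method, while inner 6-fold cross-validation selects k on each outer training fold. The synthetic two-class data stand in for Ripley's data set, and the simple majority-vote k-NN is an assumption:

import numpy as np

def knn_predict(X_tr, y_tr, X_te, k):
    """k-NN classification: majority vote (labels 0/1) among the k nearest points."""
    preds = []
    for x0 in X_te:
        nearest = np.argsort(np.linalg.norm(X_tr - x0, axis=1))[:k]
        preds.append(round(y_tr[nearest].mean()))
    return np.array(preds)

def cv_error(X, y, k, folds):
    """Average classification error over the given cross-validation folds."""
    errs = []
    for i in range(len(folds)):
        val = folds[i]
        tr = np.concatenate([folds[j] for j in range(len(folds)) if j != i])
        errs.append(np.mean(knn_predict(X[tr], y[tr], X[val], k) != y[val]))
    return np.mean(errs)

rng = np.random.default_rng(7)
X = rng.normal(size=(170, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(0, 1, 170) > 0).astype(int)

outer = np.array_split(rng.permutation(len(y)), 5)            # outer 5-fold split
test_errs = []
for i in range(5):
    te = outer[i]
    tr = np.concatenate([outer[j] for j in range(5) if j != i])
    inner = np.array_split(rng.permutation(len(tr)), 6)        # inner 6-fold split
    best_k = min(range(1, 31), key=lambda k: cv_error(X[tr], y[tr], k, inner))
    test_errs.append(np.mean(knn_predict(X[tr], y[tr], X[te], best_k) != y[te]))
print("estimated test error of the k-NN method:", np.mean(test_errs))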
Example of model selection for k-NN classifier via 6-fold cross-validation: Ripley’s data.
Optimal decision boundary for k = 14
[Figure: decision boundary for k = 14 on Ripley’s data]
42
Example of model selection for k-NN classifier via 6-fold cross-validation: Ripley’s data.
Optimal decision boundary for k = 50
Which one is better, k = 14 or k = 50?
[Figure: decision boundary for k = 50 on Ripley’s data]
43
Estimating test error of a method
• For the same example (Ripley’s data), what is the true test error of the k-NN method?
• Use double resampling, i.e., 5-fold cross-validation to estimate the test error, and 6-fold cross-validation to estimate the optimal k for each training fold:

Fold #   k    Validation error   Test error
1        20   11.76%             14%
2         9    0%                 8%
3         1   17.65%             10%
4        12    5.88%             18%
5         7   17.65%             14%
mean          10.59%             12.8%

Note: optimal k-values are different and errors vary for each fold, due to the high variability of the random partitioning of the data
44
Estimating test error of a method
• Another realization of double resampling, i.e., 5-fold cross-validation to estimate the test error, and 6-fold cross-validation to estimate the optimal k for each training fold:

Fold #   k    Validation error   Test error
1         7   14.71%             14%
2        31    8.82%             14%
3        25   11.76%             10%
4         1   14.71%             18%
5        62   11.76%              4%
mean          12.35%             12%

• Note: the predicted average test error (12%) is usually higher than the minimized validation error (11%) used for model selection
45
2.4 Application Example
• Why financial applications?
- “market is always right” ~ loss function
- lots of historical data
- modeling results easy to understand
• Background on mutual funds
• Problem specification + experimental setup
• Modeling results
• Discussion
46
OUTLINE
2.0 Objectives
2.1 Terminology and Basic Learning Problems
2.2 Basic Learning Approaches
2.3 Generalization and Complexity Control
2.4 Application Example
2.5 Summary
47
2.4.1 Background: pricing mutual funds
• Mutual funds trivia and recent scandals
• Mutual fund pricing:
  - priced once a day (after market close)
  → NAV is unknown when an order is placed
• How to estimate NAV accurately?
  Approach 1: Estimate the holdings of a fund (~200-400 stocks), then find NAV
  Approach 2: Estimate NAV via correlations between NAV and major market indices (learning)
48
2.4.2 Problem specs and experimental setup
• Domestic fund: Fidelity OTC (FOCPX)
• Possible Inputs:
SP500, DJIA, NASDAQ, ENERGY SPDR
• Data Encoding:
Output ~ % daily price change in NAV
Inputs ~ % daily price changes of market indices
• Modeling period: 2003.
• Issues: modeling method? Selection of input
variables? Experimental setup?
49
Experimental Design and Modeling Setup
Possible variable selection:

Mutual Fund (Y)   X1       X2       X3
FOCPX             ^IXIC    -        -
FOCPX             ^GSPC    ^IXIC    -
FOCPX             ^GSPC    ^IXIC    XLE

• All variables represent % daily price changes.
• Modeling method: linear regression
• Data obtained from Yahoo Finance.
• Time period for modeling: 2003.
50
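A minimal sketch of the modeling step: regress the fund's % daily NAV change on the % daily changes of the chosen indices by ordinary least squares. The short price arrays below are placeholders, not actual Yahoo Finance data:

import numpy as np

fund_close = np.array([50.0, 50.5, 50.2, 51.0, 51.3])            # hypothetical FOCPX closes
gspc_close = np.array([900.0, 905.0, 901.0, 910.0, 912.0])       # hypothetical ^GSPC closes
ixic_close = np.array([1350.0, 1362.0, 1355.0, 1375.0, 1380.0])  # hypothetical ^IXIC closes

def pct_change(p):
    """% daily price change."""
    return 100.0 * (p[1:] - p[:-1]) / p[:-1]

Y = pct_change(fund_close)
X = np.column_stack([np.ones(len(Y)), pct_change(gspc_close), pct_change(ixic_close)])
w, *_ = np.linalg.lstsq(X, Y, rcond=None)
print("w0, w1(^GSPC), w2(^IXIC):", np.round(w, 3))
print("in-sample MSE:", round(float(np.mean((Y - X @ w) ** 2)), 4))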
Specification of Training and Test Data
Year 2003, divided into two-month periods: 1-2, 3-4, 5-6, 7-8, 9-10, 11-12
[Table: alternating Training / Test assignment of these two-month periods - Two-Month Training/Test Set-up]
→ Total: 6 regression models for 2003
51
Results for Fidelity OTC Fund (GSPC+IXIC)

Coefficients               w0       w1 (^GSPC)   w2 (^IXIC)
Average                   -0.027     0.173        0.771
Standard Deviation (SD)    0.043     0.150        0.165

Average model: Y = -0.027 + 0.173 ^GSPC + 0.771 ^IXIC
→ ^IXIC is the main factor affecting FOCPX’s daily price change
→ Prediction error: MSE (GSPC+IXIC) = 5.95%
52
Results for Fidelity OTC Fund (GSPC+IXIC)
[Figure: Daily Account Value (80-140) vs Date (Jan-Dec 2003), FOCPX vs Model(GSPC+IXIC)]
Daily closing prices for 2003: NAV vs synthetic model
53
Results for Fidelity OTC Fund (GSPC+IXIC+XLE)

Coefficients               w0       w1 (^GSPC)   w2 (^IXIC)   w3 (XLE)
Average                   -0.029     0.147        0.784        0.029
Standard Deviation (SD)    0.044     0.215        0.191        0.061

Average model: Y = -0.029 + 0.147 ^GSPC + 0.784 ^IXIC + 0.029 XLE
→ ^IXIC is the main factor affecting FOCPX’s daily price change
→ Prediction error: MSE (GSPC+IXIC+XLE) = 6.14%
54
Results for Fidelity OTC Fund (GSPC+IXIC+XLE)
[Figure: Daily Account Value (80-140) vs Date (Jan-Dec 2003), FOCPX vs Model(GSPC+IXIC+XLE)]
Daily closing prices for 2003: NAV vs synthetic model
55
Effect of Variable Selection
Different linear regression models for FOCPX:
• Y = -0.035 + 0.897 ^IXIC
• Y = -0.027 + 0.173 ^GSPC + 0.771 ^IXIC
• Y = -0.029 + 0.147 ^GSPC + 0.784 ^IXIC + 0.029 XLE
• Y = -0.026 + 0.226 ^GSPC + 0.764 ^IXIC + 0.032 XLE - 0.06 ^DJI
have different prediction errors (MSE):
• MSE (IXIC) = 6.44%
• MSE (GSPC + IXIC) = 5.95%
• MSE (GSPC + IXIC + XLE) = 6.14%
• MSE (GSPC + IXIC + XLE + DJIA) = 6.43%
(1) Variable selection is a form of complexity control
(2) Good selection can be performed by domain experts
56
Discussion
• Many funds simply mimic major indices
  → statistical NAV models can be used for ranking/evaluating mutual funds
• Statistical models can be used for
  - hedging risk, and
  - overcoming restrictions on trading (market timing) of domestic funds
• Since 70% of funds under-perform their benchmark indices, it may be better to use index funds
57
Summary
• Inductive Learning ~ function estimation
• Goal of learning (empirical inference):
to act/perform well, not system identification
• Important concepts:
- training data, test data
- loss function, prediction error (aka risk)
- basic learning problems
- basic learning methods
• Complexity control and resampling
• Estimating prediction error via resampling
58