Introduction to Predictive Learning
LECTURE SET 2: Basic Learning Approaches and Complexity Control
Electrical and Computer Engineering

OUTLINE
2.0 Objectives
2.1 Terminology and Basic Learning Problems
2.2 Basic Learning Approaches
2.3 Generalization and Complexity Control
2.4 Application Example
2.5 Summary

2.0 Objectives
1. To quantify the notions of explanation, prediction, and model
2. To introduce terminology
3. To describe basic learning methods
• Past observations ~ data points
• Explanation (model) ~ function
• Learning ~ function estimation
• Prediction ~ using the estimated model to make predictions

2.0 Objectives (cont'd)
• Example: classification - training samples and an estimated model
  Goal 1: explanation of the training data
  Goal 2: generalization (for future data)
• Learning is ill-posed

Learning as Induction
• Induction ~ function estimation from data
• Deduction ~ prediction for new inputs

2.1 Terminology and Learning Problems
• Input and output variables: the System maps observed inputs x (and unobserved factors z) to the output y
• Learning ~ estimation of the mapping F: x → y
• Statistical dependency vs. causality
[Figure: block diagram of the System with inputs x, z and output y; scatter plot of noisy (x, y) samples]

2.1.1 Types of Input and Output Variables
• Real-valued
• Categorical (class labels)
• Ordinal (or fuzzy) variables, described by membership functions
[Figure: membership value vs. weight (75-225 lbs) for the fuzzy sets LIGHT, MEDIUM, and HEAVY]
• Aside: fuzzy sets and fuzzy logic

Data Preprocessing and Scaling
• Preprocessing is required with observational data (step 4 in the general experimental procedure)
  Examples: ....
• Basic preprocessing includes:
  - summary univariate statistics: mean, standard deviation, min and max values, range, boxplot - computed independently for each input/output variable
  - detection (and removal) of outliers
  - scaling of input/output variables (may be required by some learning algorithms)
• Visual inspection of the data is tedious but useful

Example Data Set: animal body and brain weight
  #   Animal             Body weight (kg)   Brain weight (g)
  1   Mountain beaver          1.350               8.100
  2   Cow                    465.000             423.000
  3   Gray wolf               36.330             119.500
  4   Goat                    27.660             115.000
  5   Guinea pig               1.040               5.500
  6   Diplodocus           11700.000              50.000
  7   Asian elephant        2547.000            4603.000
  8   Donkey                 187.100             419.000
  9   Horse                  521.000             655.000
 10   Potar monkey            10.000             115.000
 11   Cat                      3.300              25.600
 12   Giraffe                529.000             680.000
 13   Gorilla                207.000             406.000
 14   Human                   62.000            1320.000

Example Data Set (cont'd)
  #   Animal             Body weight (kg)   Brain weight (g)
 15   African elephant      6654.000            5712.000
 16   Triceratops           9400.000              70.000
 17   Rhesus monkey            6.800             179.000
 18   Kangaroo                35.000              56.000
 19   Hamster                  0.120               1.000
 20   Mouse                    0.023               0.400
 21   Rabbit                   2.500              12.100
 22   Sheep                   55.500             175.000
 23   Jaguar                 100.000             157.000
 24   Chimpanzee              52.160             440.000
 25   Brachiosaurus        87000.000             154.500
 26   Rat                      0.280               1.900
 27   Mole                     0.122               3.000
 28   Pig                    192.000             180.000

Original Unscaled Animal Data
• Which points are outliers?
[Figure: scatter plot of unscaled body weight vs. brain weight]

Animal Data with Outliers Removed, Scaled to the [0, 1] Range
• Humans appear in the top-left corner
[Figure: scatter plot of scaled body weight (x-axis) vs. scaled brain weight (y-axis), both in [0, 1]]
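The preprocessing steps above (summary statistics, outlier screening, scaling to [0, 1]) can be illustrated with a short sketch. This is a supplementary example, not part of the original slides: the numeric rows are a subset of the animal data table above, and the simple z-score outlier rule with threshold 2 is an assumption chosen only for illustration (the lecture identifies outliers visually).

```python
import numpy as np

# A few (body weight kg, brain weight g) pairs from the animal data table above.
data = np.array([
    [1.35,      8.1],    # Mountain beaver
    [465.0,   423.0],    # Cow
    [36.33,   119.5],    # Gray wolf
    [62.0,   1320.0],    # Human
    [87000.0, 154.5],    # Brachiosaurus
    [11700.0,  50.0],    # Diplodocus
])

def summary(x):
    """Basic univariate statistics, computed independently for each column."""
    return {"mean": x.mean(0), "std": x.std(0), "min": x.min(0), "max": x.max(0)}

def screen_outliers(x, z_thresh=2.0):
    """Keep rows whose per-column z-score stays below z_thresh (illustrative rule)."""
    z = np.abs((x - x.mean(0)) / x.std(0))
    return x[(z < z_thresh).all(axis=1)]

def minmax_scale(x):
    """Scale each column to the [0, 1] range."""
    lo, hi = x.min(0), x.max(0)
    return (x - lo) / (hi - lo)

print(summary(data))
print(minmax_scale(screen_outliers(data)))
```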
2.1.2 Supervised Learning: Regression
• Data in the form (x, y), where
  - x is a multivariate input (i.e., a vector)
  - y is a univariate output ('response')
• Regression: y is real-valued
• Estimation of a real-valued function x → y

2.1.2 Supervised Learning: Classification
• Data in the form (x, y), where
  - x is a multivariate input (i.e., a vector)
  - y is a univariate output ('response')
• Classification: y is categorical (a class label)
• Estimation of an indicator function x → y

2.1.2 Unsupervised Learning
• Data in the form (x), where x is a multivariate input (i.e., a vector)
• Goal 1: data reduction or clustering
  Clustering = estimation of a mapping x → c (cluster index)

Unsupervised Learning (cont'd)
• Goal 2: dimensionality reduction
  Finding a low-dimensional model of the data

2.1.3 Other (Nonstandard) Learning Problems
• Multiple model estimation

OUTLINE
2.0 Objectives
2.1 Terminology and Learning Problems
2.2 Basic Learning Approaches
  - Parametric Modeling
  - Non-parametric Modeling
  - Data Reduction
2.3 Generalization and Complexity Control
2.4 Application Example
2.5 Summary

2.2.1 Parametric Modeling
Given training data (x_i, y_i), i = 1, 2, ..., n:
(1) Specify a parametric model
(2) Estimate its parameters (via fitting to the data)
• Example: linear regression f(x) = (w · x) + b, with parameters chosen so that
  Σ_{i=1..n} [y_i − (w · x_i) − b]² → min

Parametric Modeling (cont'd)
Given training data (x_i, y_i), i = 1, 2, ..., n:
(1) Specify a parametric model
(2) Estimate its parameters (via fitting to the data)
• Example: univariate classification

2.2.2 Non-Parametric Modeling
Given training data (x_i, y_i), i = 1, 2, ..., n, estimate the model (for a given x_0) as a 'local average' of the data.
Note: need to define 'local' and 'average'
• Example: k-nearest-neighbors regression
  f(x_0) = (1/k) Σ_{j=1..k} y_j,  where the sum runs over the k training samples nearest to x_0

2.2.3 Data Reduction Approach
Given training data, estimate the model as a 'compact encoding' of the data.
Note: 'compact' ~ number of bits needed to encode the model
• Example: piecewise-linear regression
  How many parameters are needed for a two-linear-component model?

Example: Piecewise-Linear Regression vs. Linear Regression
[Figure: training samples with piecewise-linear and linear fits, y vs. x on [0, 1]]

Data Reduction Approach (cont'd)
Data reduction approaches are commonly used for unsupervised learning tasks.
• Example: clustering - the training data are encoded by 3 points (cluster centers)
• Issues:
  - How to find the centers?
  - How to select the number of clusters?
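As a supplement to Section 2.2 (not part of the original slides), here is a minimal sketch contrasting the parametric and non-parametric approaches just described: a least-squares fit of f(x) = w·x + b and a k-nearest-neighbors local average. The one-dimensional synthetic data, the noise level, and the choice k = 4 are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=20)            # 1-D training inputs
y = x**2 + rng.normal(0, 0.5, size=20)    # noisy outputs (illustrative target)

# Parametric modeling: fit f(x) = w*x + b by least squares.
A = np.column_stack([x, np.ones_like(x)])
w, b = np.linalg.lstsq(A, y, rcond=None)[0]

def f_linear(x0):
    """Prediction of the fitted linear model at x0."""
    return w * x0 + b

# Non-parametric modeling: k-nearest-neighbors local average.
def f_knn(x0, k=4):
    idx = np.argsort(np.abs(x - x0))[:k]   # indices of the k nearest training inputs
    return y[idx].mean()                   # local average of their outputs

x0 = 0.5
print("linear estimate:", f_linear(x0), " k-NN estimate:", f_knn(x0))
```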
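The data reduction view of clustering (encode the training data by a few cluster centers) can likewise be sketched with a basic k-means-style loop. This is an illustrative implementation under assumed synthetic data, not an algorithm prescribed by the lecture; the choice of 3 centers mirrors the clustering example above.

```python
import numpy as np

def kmeans(X, n_clusters=3, n_iter=50, seed=0):
    """Basic k-means: alternate nearest-center assignment and center update."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), n_clusters, replace=False)]
    for _ in range(n_iter):
        # Assign each sample to its nearest center.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each center as the mean of its assigned samples.
        for j in range(n_clusters):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.1, size=(20, 2)) for m in (0.2, 0.5, 0.8)])
centers, labels = kmeans(X)
print(centers)   # the 3 points that 'compactly encode' the training data
```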
Inductive Learning Setting
• Induction and deduction in philosophy: "All observed swans are white (data samples). Therefore, all swans are white."
• Model estimation ~ inductive step, i.e., estimating a function from data samples
• Prediction ~ deductive step

Inductive Learning Setting (cont'd)
• Discussion: which of the three modeling approaches follow inductive learning?
• Do humans implement inductive inference?

OUTLINE
2.0 Objectives
2.1 Terminology and Learning Problems
2.2 Modeling Approaches & Learning Methods
2.3 Generalization and Complexity Control
  - Prediction Accuracy (Generalization)
  - Complexity Control: Examples
  - Resampling
2.4 Application Example
2.5 Summary

2.3.1 Prediction Accuracy
• Inductive learning ~ function estimation
• All modeling approaches implement 'data fitting' ~ explaining the data
• BUT the true goal ~ prediction
• Two possible goals of learning:
  - estimation of the 'true function'
  - good generalization for future data
• Are these two goals equivalent?
• If not, which one is more practical?

Explanation vs. Prediction
[Figure: (a) classification and (b) regression examples contrasting explanation of the training data with prediction for future data]

Inductive Learning Setting
• The learning machine observes samples (x, y) and returns an estimated response ŷ = f(x, w)
• Recall 'first-principles' vs. 'empirical' knowledge
• Two modes of inference: identification vs. imitation
• Risk functional: R(w) = ∫ Loss(y, f(x, w)) dP(x, y) → min

Discussion
• The mathematical formulation is useful for quantifying
  - explanation ~ fitting error (on training data)
  - generalization ~ prediction error
• Natural assumptions:
  - the future is similar to the past: stationary P(x, y), i.i.d. data
  - a given discrepancy measure, or loss function, e.g., MSE
• What if these assumptions do not hold?

Example: Regression
Given: training data (x_i, y_i), i = 1, 2, ..., n
Find a function f(x, w) that minimizes the squared error over a large number N of future samples:
  (1/N) Σ_{k=1..N} [y_k − f(x_k, w)]² → min,  which as N → ∞ becomes  ∫ (y − f(x, w))² dP(x, y) → min
BUT future data are unknown, i.e., P(x, y) is unknown.

2.3.2 Complexity Control: Parametric Modeling
Consider regression estimation:
• Ten training samples from y = x² + ξ, where ξ ~ N(0, σ²) and σ² = 0.25
• Fitting linear and second-order polynomial models
[Figure: the two polynomial fits to the ten training samples]

Complexity Control: Local Estimation
Consider regression estimation:
• Ten training samples from y = x² + ξ, where ξ ~ N(0, σ²) and σ² = 0.25
• Using k-nn regression with k = 1 and k = 4
[Figure: the two k-nn fits to the ten training samples]

Complexity Control (cont'd)
• The complexity of the admissible models affects generalization (for future data)
• Specific complexity indices:
  - parametric models: ~ number of parameters
  - local modeling: size of the local region
  - data reduction: number of clusters
• Complexity control = choosing good complexity (~ good generalization) for the given (training) data

How to Control Complexity?
• Two approaches: analytic and resampling
• Analytic criteria estimate the prediction error as a function of the fitting error and the model complexity.
  For regression problems: R_est = r(p, n) · R_emp,  where p = DoF/n, n ~ sample size, DoF ~ degrees of freedom
• Representative analytic criteria for regression:
  - Schwartz criterion: r(p, n) = 1 + (p / (1 − p)) · ln n
  - Akaike's FPE: r(p) = (1 + p) / (1 − p)

2.3.3 Resampling
• Split the available data into two sets: training + validation
  (1) Use the training set for model estimation (via data fitting)
  (2) Use the validation data to estimate the prediction error of the model
• Change the model complexity index and repeat (1) and (2)
• Select the final model providing the lowest (estimated) prediction error
• BUT the results are sensitive to the data splitting

K-fold Cross-Validation
1. Divide the training data Z into k randomly selected disjoint subsets {Z_1, Z_2, ..., Z_k} of size n/k
2. For each 'left-out' validation set Z_i:
   - use the remaining data to estimate the model ŷ = f_i(x)
   - estimate the prediction error on Z_i:  r_i = (k/n) Σ_{(x, y) ∈ Z_i} (f_i(x) − y)²
3. Estimate the average prediction risk as  R_cv = (1/k) Σ_{i=1..k} r_i

Example of Model Selection (1)
• 25 samples are generated as y = sin²(2πx) + ξ, with x uniformly sampled in [0, 1] and noise ξ ~ N(0, 1)
• Regression is estimated using polynomials of degree m = 1, 2, ..., 10
• Polynomial degree m = 5 is chosen via 5-fold cross-validation
[Figure: the selected polynomial model along with training (*) and validation (*) data points, for one partitioning]

   m    Estimated R via cross-validation
   1    0.1340
   2    0.1356
   3    0.1452
   4    0.1286
   5    0.0699
   6    0.1130
   7    0.1892
   8    0.3528
   9    0.3596
  10    0.4006

Example of Model Selection (2)
• Same data set, but estimated using k-nn regression
• The optimal value k = 7 is chosen via 5-fold cross-validation model selection
[Figure: the selected k-nn model along with training (*) and validation (*) data points, for one partitioning]

   k    Estimated R via cross-validation
   1    0.1109
   2    0.0926
   3    0.0950
   4    0.1035
   5    0.1049
   6    0.0874
   7    0.0831
   8    0.0954
   9    0.1120
  10    0.1227

More on Resampling
• Leave-one-out (LOO) cross-validation:
  - extreme case of k-fold with k = n (the number of samples)
  - efficient use of the data, but requires n model estimates
• The final (selected) model depends on:
  - the random data
  - the random partitioning of the data into k subsets (folds)
  → the same resampling procedure may yield different model selection results
• Some applications may use non-random splitting of the data into (training + validation)
• Model selection via resampling is based on the estimated prediction risk (error).
• Does this estimated error reflect the true prediction accuracy of the final model?
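As a supplement, the analytic criteria above can be applied mechanically once the empirical risk and the degrees of freedom are known. A minimal sketch follows; the numeric (DoF, R_emp) pairs are made-up placeholders, and the penalization forms simply transcribe the Schwartz and FPE expressions given on the slide above.

```python
import numpy as np

def schwartz(p, n):
    """Schwartz criterion penalization factor, as written on the slide above."""
    return 1 + (p / (1 - p)) * np.log(n)

def fpe(p):
    """Akaike's Final Prediction Error penalization factor."""
    return (1 + p) / (1 - p)

# Illustrative numbers: n samples; candidate models with different DoF and fitting risk.
n = 25
candidates = [(2, 0.20), (5, 0.12), (9, 0.05)]   # (DoF, R_emp) - placeholder values

for dof, r_emp in candidates:
    p = dof / n
    print(f"DoF={dof:2d}  R_emp={r_emp:.3f}  "
          f"FPE estimate={fpe(p) * r_emp:.3f}  Schwartz estimate={schwartz(p, n) * r_emp:.3f}")
```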
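A minimal sketch of k-fold cross-validation for selecting a polynomial degree, in the spirit of the model selection example above, is given below. The fold count, sample size, and degree range mirror the slides, but this is an illustrative reimplementation on freshly generated data, not the code behind the reported numbers.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k_folds = 25, 5
x = rng.uniform(0, 1, n)
y = np.sin(2 * np.pi * x) ** 2 + rng.normal(0, 1, n)     # y = sin^2(2*pi*x) + noise

def cv_risk(degree):
    """Average validation MSE over k folds for a polynomial of the given degree."""
    idx = rng.permutation(n)
    folds = np.array_split(idx, k_folds)
    risks = []
    for i in range(k_folds):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k_folds) if j != i])
        coeffs = np.polyfit(x[train], y[train], degree)   # fit on the training folds
        pred = np.polyval(coeffs, x[val])                  # predict on the left-out fold
        risks.append(np.mean((pred - y[val]) ** 2))
    return np.mean(risks)

risks = {m: cv_risk(m) for m in range(1, 11)}
best_m = min(risks, key=risks.get)
print("estimated CV risk per degree:", risks)
print("selected degree:", best_m)
```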
Resampling for Estimating True Risk
• The prediction risk (test error) of a method can also be estimated via resampling
• Partition the data into training / validation / test sets
• The test data should never be used for model estimation
• Double resampling method:
  - resampling for complexity control
  - resampling for estimating the prediction performance of a method
• Estimation of the prediction risk (test error) is critical for comparing different learning methods

Example: Model Selection for a k-NN Classifier via 6-fold Cross-Validation (Ripley's Data)
[Figure: Ripley's data with the estimated decision boundary for k = 14]

Example: Model Selection for a k-NN Classifier via 6-fold Cross-Validation (Ripley's Data)
[Figure: Ripley's data with the estimated decision boundary for k = 50]
• Which one is better, k = 14 or k = 50?

Estimating the Test Error of a Method
• For the same example (Ripley's data), what is the true test error of the k-NN method?
• Use double resampling, i.e., 5-fold cross-validation to estimate the test error and 6-fold cross-validation to estimate the optimal k for each training fold:

  Fold #    k    Validation error   Test error
    1      20        11.76%            14%
    2       9         0%                8%
    3       1        17.65%            10%
    4      12         5.88%            18%
    5       7        17.65%            14%
  mean               10.59%            12.8%

• Note: the optimal k-values are different, and the errors vary across folds, due to the high variability of the random partitioning of the data

Estimating the Test Error of a Method (cont'd)
• Another realization of double resampling, i.e., 5-fold cross-validation to estimate the test error and 6-fold cross-validation to estimate the optimal k for each training fold:

  Fold #    k    Validation error   Test error
    1       7        14.71%            14%
    2      31         8.82%            14%
    3      25        11.76%            10%
    4       1        14.71%            18%
    5      62        11.76%             4%
  mean               12.35%            12%

• Note: the predicted average test error (12%) is usually higher than the minimized validation error (11%) used for model selection
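The double resampling procedure just described (an outer loop to estimate the test error of the method, an inner cross-validation loop to select k on each outer training fold) can be sketched as follows. This is an illustrative, self-contained example on synthetic two-class 2-D data, not the Ripley's-data experiment itself; the outer/inner fold counts (5 and 6) follow the slides, while the data generator and the k grid are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic two-class 2-D data (stand-in for Ripley's data set).
n_per_class = 125
X = np.vstack([rng.normal([-0.3, 0.3], 0.3, (n_per_class, 2)),
               rng.normal([0.4, 0.7], 0.3, (n_per_class, 2))])
y = np.array([0] * n_per_class + [1] * n_per_class)

def knn_error(X_tr, y_tr, X_te, y_te, k):
    """Classification error of a k-NN classifier (majority vote among k nearest)."""
    errs = 0
    for xi, yi in zip(X_te, y_te):
        idx = np.argsort(np.linalg.norm(X_tr - xi, axis=1))[:k]
        pred = int(y_tr[idx].mean() > 0.5)
        errs += (pred != yi)
    return errs / len(y_te)

def cv_select_k(X_tr, y_tr, k_values, n_folds=6):
    """Inner cross-validation: pick k with the lowest average validation error."""
    folds = np.array_split(rng.permutation(len(y_tr)), n_folds)
    def cv_err(k):
        return np.mean([knn_error(np.delete(X_tr, f, 0), np.delete(y_tr, f),
                                  X_tr[f], y_tr[f], k) for f in folds])
    return min(k_values, key=cv_err)

# Outer 5-fold loop: estimate the test error of the 'k-NN with inner CV' method.
outer_folds = np.array_split(rng.permutation(len(y)), 5)
test_errors = []
for f in outer_folds:
    X_tr, y_tr = np.delete(X, f, 0), np.delete(y, f)
    best_k = cv_select_k(X_tr, y_tr, k_values=range(1, 31, 2))
    test_errors.append(knn_error(X_tr, y_tr, X[f], y[f], best_k))

print("estimated test error of the method:", np.mean(test_errors))
```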
2.4 Application Example
• Why financial applications?
  - "the market is always right" ~ loss function
  - lots of historical data
  - modeling results are easy to understand
• Background on mutual funds
• Problem specification + experimental setup
• Modeling results
• Discussion

OUTLINE
2.0 Objectives
2.1 Terminology and Basic Learning Problems
2.2 Basic Learning Approaches
2.3 Generalization and Complexity Control
2.4 Application Example
2.5 Summary

2.4.1 Background: Pricing Mutual Funds
• Mutual fund trivia and recent scandals
• Mutual fund pricing:
  - priced once a day (after market close)
  - the NAV is unknown when an order is placed
• How to estimate the NAV accurately?
  Approach 1: estimate the holdings of a fund (~200-400 stocks), then compute the NAV
  Approach 2: estimate the NAV via correlations between the NAV and major market indices (learning)

2.4.2 Problem Specs and Experimental Setup
• Domestic fund: Fidelity OTC (FOCPX)
• Possible inputs: SP500, DJIA, NASDAQ, Energy SPDR
• Data encoding:
  - output ~ % daily price change of the NAV
  - inputs ~ % daily price changes of the market indices
• Modeling period: 2003
• Issues: modeling method? selection of input variables? experimental setup?

Experimental Design and Modeling Setup
Possible variable selection (fund output Y and input variables X1-X3):

  Y       X1      X2      X3
  FOCPX   ^IXIC   -       -
  FOCPX   ^GSPC   ^IXIC   -
  FOCPX   ^GSPC   ^IXIC   XLE

• All variables represent % daily price changes
• Modeling method: linear regression
• Data obtained from Yahoo Finance
• Time period for modeling: 2003

Specification of Training and Test Data
• Two-month training/test setup over the year 2003: each regression model is estimated on a two-month period (months 1-2, 3-4, ...) and tested on the following two-month period
• Total: 6 regression models for 2003

Results for Fidelity OTC Fund (GSPC + IXIC)

  Coefficient               w0       w1 (^GSPC)   w2 (^IXIC)
  Average                  -0.027      0.173        0.771
  Standard deviation (SD)   0.043      0.150        0.165

• Average model: Y = -0.027 + 0.173 ^GSPC + 0.771 ^IXIC
• ^IXIC is the main factor affecting FOCPX's daily price change
• Prediction error: MSE (GSPC + IXIC) = 5.95%

Results for Fidelity OTC Fund (GSPC + IXIC), cont'd
[Figure: daily account value over 2003 (1-Jan-03 through 17-Dec-03), FOCPX NAV vs. the synthetic model (GSPC + IXIC)]

Results for Fidelity OTC Fund (GSPC + IXIC + XLE)

  Coefficient               w0       w1 (^GSPC)   w2 (^IXIC)   w3 (XLE)
  Average                  -0.029      0.147        0.784        0.029
  Standard deviation (SD)   0.044      0.215        0.191        0.061

• Average model: Y = -0.029 + 0.147 ^GSPC + 0.784 ^IXIC + 0.029 XLE
• ^IXIC is the main factor affecting FOCPX's daily price change
• Prediction error: MSE (GSPC + IXIC + XLE) = 6.14%

Results for Fidelity OTC Fund (GSPC + IXIC + XLE), cont'd
[Figure: daily account value over 2003 (1-Jan-03 through 17-Dec-03), FOCPX NAV vs. the synthetic model (GSPC + IXIC + XLE)]

Effect of Variable Selection
Different linear regression models for FOCPX:
• Y = -0.035 + 0.897 ^IXIC
• Y = -0.027 + 0.173 ^GSPC + 0.771 ^IXIC
• Y = -0.029 + 0.147 ^GSPC + 0.784 ^IXIC + 0.029 XLE
• Y = -0.026 + 0.226 ^GSPC + 0.764 ^IXIC + 0.032 XLE - 0.06 ^DJI
These models have different prediction errors (MSE):
• MSE (IXIC) = 6.44%
• MSE (GSPC + IXIC) = 5.95%
• MSE (GSPC + IXIC + XLE) = 6.14%
• MSE (GSPC + IXIC + XLE + DJIA) = 6.43%
(1) Variable selection is a form of complexity control
(2) Good selection can be performed by domain experts

Discussion
• Many funds simply mimic major indices → statistical NAV models can be used for ranking and evaluating mutual funds
• Statistical models can be used for
  - hedging risk, and
  - overcoming restrictions on trading (market timing) of domestic funds
• Since 70% of funds under-perform their benchmark indices, index funds are often the better choice

Summary
• Inductive learning ~ function estimation
• Goal of learning (empirical inference): to act/perform well, not system identification
• Important concepts:
  - training data, test data
  - loss function, prediction error (aka risk)
  - basic learning problems
  - basic learning methods
• Complexity control and resampling
• Estimating prediction error via resampling
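To close the lecture set, here is a minimal sketch of the Section 2.4 experimental setup: linear regression of a fund's % daily price change on index % daily changes, evaluated with sliding two-month training/test windows. Everything below is an assumption made for illustration: the data are synthetic placeholders (no Yahoo Finance download), the variable names only mirror the slides, and the simple sliding scheme yields 5 train/test pairs, whereas the exact windowing behind the "6 regression models" on the slide is not fully specified here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for one year of % daily price changes (~21 trading days/month).
n_days = 12 * 21
gspc = rng.normal(0, 1.0, n_days)                                 # ^GSPC % daily change (placeholder)
ixic = rng.normal(0, 1.2, n_days)                                 # ^IXIC % daily change (placeholder)
focpx = 0.17 * gspc + 0.77 * ixic + rng.normal(0, 0.3, n_days)    # fund % daily change (placeholder)

X = np.column_stack([np.ones(n_days), gspc, ixic])    # design matrix [1, ^GSPC, ^IXIC]
month = np.arange(n_days) // 21                       # month index 0..11

# Sliding two-month windows: train on months (2i, 2i+1), test on the next two months.
for i in range(5):
    tr = np.isin(month, [2 * i, 2 * i + 1])
    te = np.isin(month, [2 * i + 2, 2 * i + 3])
    w = np.linalg.lstsq(X[tr], focpx[tr], rcond=None)[0]    # fit linear regression coefficients
    mse = np.mean((X[te] @ w - focpx[te]) ** 2)             # prediction MSE on the next period
    print(f"train months {2*i+1}-{2*i+2}, test months {2*i+3}-{2*i+4}: "
          f"w0={w[0]:+.3f}, w_GSPC={w[1]:.3f}, w_IXIC={w[2]:.3f}, MSE={mse:.3f}")
```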