Introduction to Predictive Learning
LECTURE SET 5: Statistical Methods
Electrical and Computer Engineering

OUTLINE
• Objectives
  - introduce statistical terminology, methodology and motivation
  - taxonomy of methods
  - describe several representative statistical methods
  - interpretation of statistical methods under predictive learning
• Statistical Methodology and Basic Methods
• Taxonomy of Nonlinear Methods
• Decision Trees
• Additive Modeling and Projection Pursuit
• Greedy Feature Selection
• Signal Denoising
• Summary and discussion

Methodology and Motivation
• Original motivation: understand how the inputs affect the output, using a simple model involving a few variables
• Regression modeling: Response = model + error, i.e. y = f(x) + noise, where f(x) = E(y|x)
• Linear regression: f(x) = w·x + b
• Model parameters estimated via least squares: minimize MSE(w, b) = (1/n) Σ_{i=1}^{n} (y_i − (w·x_i + b))²

OLS Linear Regression
• OLS solution: ŵ = Σ_i (x_i − x̄)(y_i − ȳ) / Σ_i (x_i − x̄)²,  b̂ = ȳ − ŵ·x̄
  - first, center the x- and y-values
  - then calculate the slope and the bias
• Example: SBP vs. Age. [Figure: scatter plot of Systolic Blood Pressure vs. Age in Years, with the fitted line E(y|x): SBP = 0.44·Age + 105.7]
• The meaning of the bias term?

Statistical Assumptions
• Gaussian noise: zero mean, constant variance
• Known (linear) dependency
• i.i.d. data samples (ensured by the protocol for data collection) – may not hold for observational data
• Do these assumptions hold for the SBP vs. Age data shown above?

Multivariate Linear Regression
• Parameterization: f(x, w) = w1·x1 + w2·x2 + ... + wd·xd + b = (w·x) + b
• Matrix form (for centered variables): collect the training inputs into a data matrix X (one sample per row); then the model is Xw ≈ y, and the empirical risk is R_emp(w) = (1/n) ||Xw − y||² → min
• ERM solution via linear least squares
• Analytic solution (when d < n): ŵ = (XᵀX)⁻¹ Xᵀ y

Linear Ridge Regression
• When d > n, penalize large parameter values: R_ridge(w) = ||Xw − y||² + λ||w||²
• The regularization parameter λ is estimated via resampling
• Example: target function t(x) = 3x1 + x2 + 2x3 + 0·x4 + 0·x5, with y = t(x) + noise
  - 10 training samples, with inputs uniformly sampled in the [0, 1] range
  - additive Gaussian noise with standard deviation 0.5
• Standard linear least squares gives
  ŷ = 3.3422x1 + 1.4668x2 + 2.3999x3 + 0.3133x4 + 0.0346x5 + 0.0675
• Ridge regression using the optimal value log(λ) = −3 gives
  ŷ = 2.9847x1 + 1.0338x2 + 2.0161x3 + 0.0889x4 + 0.3891x5 + 0.018
  (a code sketch of this example is given at the end of this subsection)

Example cont'd
• Target function t(x) = 3x1 + x2 + 2x3 + 0·x4 + 0·x5
• Coefficient shrinkage: how do the estimated w's depend on lambda?
• Can ridge regression be used for feature selection?

Statistical Methodology for Classification
• For classification, the output y is a (binary) class label (0 or 1)
• Probabilistic modeling starts with known distributions P(y=1|x), P(y=0|x), P(y=0), P(y=1)
• Bayes-optimal decision rule for known distributions:
  D(x) = 1 if P(y=1|x)/P(y=0|x) ≥ P(y=0)/P(y=1), and D(x) = 0 otherwise
• Statistical approach ~ ERM: a parametric form of the class distributions is known (or assumed), so the analytic form of D(x) is known, and its parameters are estimated from the available training data (x_i, y_i), i = 1, 2, ..., n
• Issues: what loss function is (implicitly) used for statistical modeling?
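As an aside, a minimal MATLAB sketch of the OLS vs. ridge comparison in the ridge example above. This is illustrative code, not the code used to produce the numbers on the slide; the random seed and the value of λ are assumptions.

    % Illustrative setup: t(x) = 3*x1 + x2 + 2*x3, 10 samples, noise st. dev. 0.5
    rng(0);                                  % for reproducibility (assumed seed)
    n = 10; d = 5;
    X = rand(n, d);                          % inputs uniform in [0,1]
    y = X*[3; 1; 2; 0; 0] + 0.5*randn(n, 1);

    % center x- and y-values so that the bias term is handled separately
    xm = mean(X, 1);  ym = mean(y);
    Xc = X - repmat(xm, n, 1);
    yc = y - ym;

    % ordinary least squares (can be unstable when d is comparable to n)
    w_ols = (Xc'*Xc) \ (Xc'*yc);
    b_ols = ym - xm*w_ols;

    % ridge regression: penalize ||w||^2 via the regularization parameter lambda
    lambda = exp(-3);                        % assumed value; chosen by resampling in practice
    w_ridge = (Xc'*Xc + lambda*eye(d)) \ (Xc'*yc);
    b_ridge = ym - xm*w_ridge;

In practice λ would be selected by resampling, as noted above; with larger λ the estimated coefficients shrink toward zero.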
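For the Bayes-optimal decision rule, a minimal sketch for two Gaussian class distributions with equal covariance. All distribution parameters and the test input below are made up for illustration only.

    % assumed class-conditional densities p(x|y) and prior probabilities
    mu0 = [0; 0];  mu1 = [2; 1];  S = eye(2);
    P1 = 0.3;  P0 = 1 - P1;

    % multivariate normal density, written out explicitly
    gauss = @(x, mu, S) exp(-0.5*(x - mu)'*(S\(x - mu))) / ...
            sqrt((2*pi)^numel(mu) * det(S));

    % decide class 1 if p(x|y=1)*P(y=1) > p(x|y=0)*P(y=0),
    % i.e. if the likelihood ratio exceeds the ratio of prior probabilities
    x = [1.5; 0.2];                          % a test input
    D = double( gauss(x, mu1, S)*P1 > gauss(x, mu0, S)*P0 );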
Gaussian Class Distributions
[Figure: example of two Gaussian class distributions]

Logistic Regression
• Terminology may be confusing (for non-statisticians)
• For Gaussian class distributions (with equal covariances), ln[ P(y=1|x) / (1 − P(y=1|x)) ] is a linear function of x
• Logistic regression estimates the probabilistic model
  logit P(y=1|x) = ln[ P(y=1|x) / (1 − P(y=1|x)) ] = w·x + b
• Equivalently, logistic regression estimates
  P(y=1|x) = s(w·x + b) = exp(b + w·x) / (1 + exp(b + w·x)),
  where the sigmoid function is s(t) = 1/(1 + exp(−t))

Logistic Regression (cont'd)
• Example: interpretation of a logistic regression model for the probability of death from heart disease during a 10-year period, for middle-aged patients, as a function of
  - Age (years, less 50) ~ x1
  - Gender, male/female (0/1) ~ x2
  - Cholesterol level, in mmol/L (less 5) ~ x3
  P(y=1|x) = 1/(1 + exp(−t)), where t = −5 + 2x1 − x2 + 1.2x3
• The probability of the binary outcome ~ the risk (of death)
• Model interpretation:
  - increasing Age is associated with increased risk of death
  - females have a lower risk of death (than males)
  - increasing Cholesterol level is associated with increased risk of death

Estimating Logistic Regression
• Given: training data (x_i, y_i), i = 1, 2, ..., n. How to estimate the model parameters (w, b)?
  P̂(y=1|x) = f(x, w, b),  P̂(y=0|x) = 1 − f(x, w, b)
• Maximum likelihood ~ minimize the negative log-likelihood
  R_emp(w, b) = −(1/n) Σ_{i=1}^{n} [ y_i ln f(x_i, w, b) + (1 − y_i) ln(1 − f(x_i, w, b)) ],
  where f(x, w, b) = exp(b + w·x) / (1 + exp(b + w·x))
  → nonlinear optimization (see the gradient-descent sketch at the end of this subsection)
• The solution (w*, b*) gives the estimated model P̂(y=1|x) = f(x, w*, b*), which can be used for prediction and interpretation (for prediction, the model should be combined with misclassification costs)

Statistical Modeling Strategy
• Data-analytic models are used for understanding the importance of inputs in explaining the output
• ERM approach: a statistician selects (manually) a few 'good' variables, several models are estimated, and the final model is selected manually ~ a heuristic implementation of Occam's razor
• Linear regression and logistic regression both estimate E(y|x), since for classification
  E(y|x) = 0·P(y=0|x) + 1·P(y=1|x) = P(y=1|x)

Classification via Multiple-Response Regression
• How to use (nonlinear) regression software for classification? Classification methods estimate model parameters via minimization of squared error, so regression software can be used with minor modifications:
  (1) for J class labels, use 1-of-J encoding; e.g. for J = 4 classes: 1000, 0100, 0010, 0001 (4 outputs in regression)
  (2) estimate J regression models from the training data (usually all regression models use the same parameterization)
  [Diagram: inputs x1, ..., xd → estimation of multiple-response regression → outputs y1, ..., yJ]

Classification via Regression
• Training ~ regression estimation using 1-of-J encoding
• Prediction (classification) ~ based on the maximum response value among the estimated outputs
  [Diagram: inputs x1, ..., xd → multiple-response discriminant functions ŷ1, ..., ŷJ → MAX → predicted class ŷ]
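A minimal sketch of classification via multiple-response linear regression with 1-of-J encoding. The three-class synthetic data, class means and sample sizes below are assumptions made only for illustration.

    rng(1);
    n = 90;  J = 3;  d = 2;  nj = n / J;
    mu = [0 0; 3 0; 0 3];                    % assumed class means
    X = zeros(n, d);  labels = zeros(n, 1);
    for j = 1:J
        idx = (j-1)*nj + (1:nj);
        X(idx, :) = randn(nj, d) + repmat(mu(j, :), nj, 1);
        labels(idx) = j;
    end

    % 1-of-J encoding of the class labels (one regression output per class)
    Y = zeros(n, J);
    Y(sub2ind([n J], (1:n)', labels)) = 1;

    % estimate J linear regression models at once (column of ones ~ bias terms)
    Xa = [X ones(n, 1)];
    W = (Xa'*Xa) \ (Xa'*Y);

    % prediction: the class with the largest estimated output
    [~, predicted] = max(Xa*W, [], 2);
    training_error = mean(predicted ~= labels);

With a nonlinear multiple-response regression method only the estimation of W would change; the 1-of-J encoding and the max-response prediction stay the same.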
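Similarly, for the maximum-likelihood estimation of logistic regression described earlier in this subsection, a minimal gradient-descent sketch; the synthetic data, step size and number of iterations are arbitrary choices.

    rng(2);
    n = 200;  d = 3;
    X = randn(n, d);
    w_true = [1; -2; 0.5];  b_true = 0.3;    % assumed 'true' parameters
    sigm = @(t) 1./(1 + exp(-t));
    y = double(rand(n, 1) < sigm(X*w_true + b_true));

    % gradient descent on the (averaged) negative log-likelihood
    w = zeros(d, 1);  b = 0;  eta = 0.5;
    for iter = 1:2000
        p = sigm(X*w + b);                   % current estimates of P(y=1|x)
        w = w - eta * (X'*(p - y)) / n;
        b = b - eta * mean(p - y);
    end

    % empirical risk (negative log-likelihood) of the estimated model
    p = sigm(X*w + b);
    Remp = -mean( y.*log(p) + (1 - y).*log(1 - p) );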
OUTLINE
• Objectives
• Statistical Methodology and Basic Methods
• Taxonomy of Nonlinear Methods
  - model parameterization (representation)
  - nonlinear optimization strategies
• Decision Trees
• Additive Modeling and Projection Pursuit
• Greedy Feature Selection
• Signal Denoising
• Summary and discussion

Taxonomy of Nonlinear Methods
• Main idea: improve the flexibility of classical linear methods by using a flexible (nonlinear) parameterization
• Dictionary parameterization: f_m(x, w, V) = Σ_{i=0}^{m} w_i g(x, v_i) ~ an SRM structure
• Two interrelated issues:
  - parameterization (of the nonlinear basis functions)
  - the optimization method used
• These two factors define the taxonomy of methods

Taxonomy of Nonlinear Methods (cont'd)
• Decision tree methods: piecewise-constant model, greedy optimization
• Additive methods: backfitting method for model estimation
• Gradient-descent methods: popular in neural network learning
• Penalization methods
• Note: all these methods implement SRM structures

Dictionary Representation
• f_m(x, w, V) = Σ_{i=0}^{m} w_i g(x, v_i); two possibilities:
• Linear (non-adaptive) methods ~ predetermined (fixed) basis functions g_i(x); only the parameters w_i have to be estimated, via standard optimization methods (linear least squares). Examples: linear regression, polynomial regression, linear classifiers, quadratic classifiers
• Nonlinear (adaptive) methods ~ basis functions g(x, v_i) depend on the training data. Possibilities: basis functions nonlinear in the parameters v_i; feature selection (e.g. wavelet denoising)

Example of Nonlinear Parameterization: sigmoid basis functions
• Basis functions of the form g_i(x) = s(x·v_i + b_i), where s is the sigmoid (aka logistic) function s(t) = 1/(1 + exp(−t))
  - commonly used in artificial neural networks
  - a combination of sigmoids is a universal approximator

Example of Nonlinear Parameterization: radial basis functions (RBF)
• Basis functions of the form g_i(x) = g(||x − v_i||), e.g. the Gaussian RBF g(t) = exp(−t²/(2σ²)); other radial functions are also used
  - RBF adaptive parameters: center and width
  - commonly used in artificial neural networks
  - a combination of RBFs is a universal approximator

Neural Network Representation
• MLP or RBF networks: ŷ = Σ_{j=1}^{m} w_j z_j, where z_j = g(x, v_j), i.e. f_m(x, w, V) = Σ_j w_j g(x, v_j)
  [Diagram: inputs x1, ..., xd → hidden units z1, ..., zm (parameters V, d×m) → output ŷ (weights W, m×1)]
  - dimensionality reduction
  - universal approximation property – see the example at http://www.mathworks.com/products/demos/nnettlbx/radial/index.html
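A minimal sketch of the dictionary parameterization with sigmoid basis functions, f(x) = w0 + Σ_j w_j·s(x·v_j + b_j). All parameter values below are arbitrary and serve only to illustrate the representation; estimating them from data is the subject of neural network learning.

    d = 2;  m = 4;                           % input dimension, number of basis functions
    rng(3);
    V = randn(d, m);                         % projection vectors v_j (columns of V)
    bias = randn(1, m);                      % basis-function offsets b_j
    w = randn(m, 1);  w0 = 0.1;              % linear output weights

    sigm = @(t) 1./(1 + exp(-t));

    % evaluate the model at a set of input points (rows of Xq)
    Xq = rand(5, d);
    Z = sigm(Xq*V + repmat(bias, size(Xq, 1), 1));   % hidden-layer outputs z_j
    yhat = w0 + Z*w;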
Example of Nonlinear Parameterization: adaptive partitioning (CART)
• f(x, w) = Σ_{j=1}^{m} w_j I(x ∈ R_j), where each basis function is a rectangular region in x-space: I(x ∈ R_j) = Π_{l=1}^{d} I(a_jl ≤ x_l ≤ b_jl)
• Each basis function depends on 2d parameters (a_j, b_j)
• Since the regions R_j are disjoint, the parameters w_j can be easily estimated (for regression) as w_j = (1/n_j) Σ_{x_i ∈ R_j} y_i
• Estimating the basis functions ~ adaptive partitioning

Example of CART Partitioning
• CART partitioning in a 2D input space:
  - each region ~ a basis function
  - piecewise-constant estimate of y (in each region)
  - number of regions ~ model complexity
  [Figure: partition of the (x1, x2) plane into regions R1–R5 by splits s1–s4]

OUTLINE
• Objectives
• Statistical Methodology and Basic Methods
• Taxonomy of Nonlinear Methods
• Decision Trees
  - Regression trees (CART)
  - Boston Housing example
  - Classification trees (CART)
• Additive Modeling and Projection Pursuit
• Greedy Feature Selection
• Signal Denoising
• Summary and discussion

Greedy Optimization Strategy
• Minimization of the empirical risk for regression problems
  R_emp(V, W) = (1/n) Σ_{i=1}^{n} L(x_i, y_i, V, W) = (1/n) Σ_{i=1}^{n} (y_i − f(x_i, V, W))²,
  where the model is f(x, V, W) = Σ_{j=1}^{m} w_j g_j(x, v_j)
• Greedy optimization strategy: the basis functions are estimated sequentially, one at a time; i.e., the training data is represented as structure (model fit) + noise (residual):
  (1) DATA = FIT 1 + RESIDUAL 1
  (2) RESIDUAL 1 = FIT 2 + RESIDUAL 2
  and so on. The final model for the data is MODEL = FIT 1 + FIT 2 + ...
• Advantages: computational speed, interpretability

Regression Trees (CART)
• Minimization of the empirical risk (squared error) via partitioning of the input space into regions:
  f(x) = Σ_{j=1}^{m} w_j I(x ∈ R_j), where w_j = (1/n_j) Σ_{x_i ∈ R_j} y_i
• Example of CART partitioning for a function of two inputs
  [Figure: partition of the (x1, x2) plane into regions R1–R5, and the corresponding binary tree with splits (x1, s1), (x2, s2), (x2, s3), (x1, s4)]

Growing a CART Tree
• Recursive partitioning for estimating the regions (via binary splitting)
• Initial model ~ region R0 (the whole input domain) is divided into two regions R1 and R2
• A split is defined by one of the inputs (k) and a split point s
• Optimal values of (k, s) are chosen so that splitting a region into two daughter regions minimizes the empirical risk (see the split-search sketch at the end of this subsection)
• Issues:
  - efficient implementation (selection of the optimal split point)
  - optimal tree size ~ model selection (complexity control)
• Advantages and limitations

Valid Split Points for CART
• How to choose valid points (for binary splitting)? Valid points ~ combinations of the coordinate values of the training samples; e.g., for 4 bivariate samples, 16 points are used as candidates for splitting [figure]

CART Modeling Strategy
• Growing a CART tree ~ reducing MSE (for regression). Splitting a parent region is allowed only if the number of samples in it exceeds a certain threshold (Splitmin, user-defined).
• Tree pruning ~ reducing the tree size by selectively combining adjacent leaf nodes (regions). Pruning implements minimization of the penalized MSE
  R_pen = R_emp + λ·|T|,
  where R_emp ~ MSE, |T| ~ the number of leaf nodes (regions), and the parameter λ is estimated via resampling.
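A sketch of a single step of greedy tree growing for regression: search over all inputs k and candidate split points s for the split that minimizes the total squared error of the two daughter regions. The function and variable names are illustrative, not part of any particular CART implementation.

    function [best_k, best_s, best_err] = best_split(X, y)
    % X: n-by-d inputs, y: n-by-1 responses of the samples in the parent region
    % (save as best_split.m)
        [~, d] = size(X);
        best_err = inf;  best_k = 0;  best_s = 0;
        for k = 1:d
            vals = unique(X(:, k));              % sorted candidate split locations
            for i = 1:numel(vals) - 1
                s = (vals(i) + vals(i+1)) / 2;   % midpoint between adjacent values
                left = X(:, k) <= s;
                err = sum( (y(left)  - mean(y(left))).^2 ) + ...
                      sum( (y(~left) - mean(y(~left))).^2 );
                if err < best_err
                    best_err = err;  best_k = k;  best_s = s;
                end
            end
        end
    end

Growing the tree amounts to applying best_split recursively to each daughter region, stopping when a region contains fewer than Splitmin samples; pruning then removes splits whose contribution does not justify the penalty λ per leaf.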
Example: Boston Housing Data Set
• Objective: predict the value of homes in the Boston area
• Data set ~ 506 samples total
• Output: value of owner-occupied homes (in $1,000's)
• Inputs: 13 variables
  1. CRIM: per capita crime rate by town
  2. ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
  3. INDUS: proportion of non-retail business acres per town
  4. CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
  5. NOX: nitric oxides concentration (parts per 10 million)
  6. RM: average number of rooms per dwelling
  7. AGE: proportion of owner-occupied units built prior to 1940
  8. DIS: weighted distances to five Boston employment centres
  9. RAD: index of accessibility to radial highways
  10. TAX: full-value property-tax rate per $10,000
  11. PTRATIO: pupil-teacher ratio by town
  12. B: 1000(Bk − 0.63)², where Bk is the proportion of blacks by town
  13. LSTAT: % lower status of the population

Example CART Trees for Boston Housing
1. Training set: 450 samples, Splitmin = 100 (user-defined) [tree figure: root region R0 split into R1 and R2]
2. Training set: 450 samples, Splitmin = 50 (user-defined) [tree figure]
3. Training set: 455 samples, Splitmin = 100 (user-defined) [tree figure]
   Note: the CART model is sensitive to the training samples (compare with model 1)

Classification Trees (CART)
• Binary classification example (2D input space)
• The algorithm is similar to regression trees (tree growth via binary splitting + model selection), BUT uses a different empirical loss function
  [Figure: example classification tree with splits x1 < −0.409, x2 < −0.067 and x1 < −0.148]

Loss Functions for Classification Trees
• Misclassification loss: a poor practical choice
• Other loss (cost) functions for splitting nodes: for a J-class problem, a cost function is a measure of node impurity Q(t) = Q(p(1|t), p(2|t), ..., p(J|t)), where p(j|t) denotes the probability (proportion) of class-j samples at node t
• Possible cost functions:
  - Misclassification: Q(t) = 1 − max_j p(j|t)
  - Gini: Q(t) = Σ_{i≠j} p(i|t)·p(j|t) = Σ_j p(j|t)(1 − p(j|t))
  - Entropy: Q(t) = −Σ_j p(j|t)·ln p(j|t)

Classification Trees: node splitting
• Minimizing the cost function = maximizing the decrease in node impurity. Assume node t is split into two regions (Left and Right) on variable k at split point s. Then the decrease in impurity caused by this split is
  ΔQ(s, k, t) = Q(t) − [ p_L·Q(t_L) + p_R·Q(t_R) ],  where p_L = p(t_L)/p(t) and p_R = p(t_R)/p(t)
• Misclassification cost ~ discontinuous (due to the max):
  - may give sub-optimal solutions (poor local minima)
  - does not work well with greedy optimization

Using Different Cost Functions for Node Splitting
(a) Decrease in impurity: misclassification = 0.25, Gini = 0.13, entropy = 0.13
(b) Decrease in impurity: misclassification = 0.25, Gini = 0.17, entropy = 0.22
Split (b) is better, as it leads to a smaller final tree.

Details of Calculating the Decrease in Impurity
Consider split (a), where the parent node contains 4 + 4 samples and each daughter node contains 3 samples of one class and 1 of the other (these calculations are reproduced in the code sketch below):
• Misclassification cost: Q(t) = 1 − 0.5 = 0.5; p_L = 4/8 = 0.5, p_R = 0.5; Q(t_L) = 1 − 3/4 = 0.25, Q(t_R) = 1 − 3/4 = 0.25;
  ΔQ = 0.5 − 0.5·0.25 − 0.5·0.25 = 0.25
• Gini cost: Q(t) = 1 − 0.5² − 0.5² = 0.5; p_L = 0.5, p_R = 0.5; Q(t_L) = 1 − (3/4)² − (1/4)² = 3/8, Q(t_R) = 3/8;
  ΔQ = 0.5 − 0.5·(3/8) − 0.5·(3/8) = 1/8

IRIS Data Set
• A data set with 150 random samples of flowers from the iris species setosa, versicolor, and virginica (3 classes). From each species there are 50 observations of sepal length, sepal width, petal length, and petal width, in cm. This data set is a classic of statistics.
• MATLAB code (splitmin = 10):
    load fisheriris;
    t = treefit(meas, species);
    treedisp(t, 'names', {'SL' 'SW' 'PL' 'PW'});
  [Figure: resulting classification tree]
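The decrease-in-impurity calculations for split (a) above can be reproduced with a short script; the class counts are taken directly from the example.

    % node impurity as a function of the vector of class proportions p
    gini     = @(p) 1 - sum(p.^2);
    entrop   = @(p) -sum(p(p > 0) .* log(p(p > 0)));
    misclass = @(p) 1 - max(p);

    parent = [4 4];  left = [3 1];  right = [1 3];   % class counts at the nodes
    pL = sum(left)/sum(parent);  pR = sum(right)/sum(parent);

    % decrease in impurity for a given impurity function Q
    decrease = @(Q) Q(parent/sum(parent)) - pL*Q(left/sum(left)) - pR*Q(right/sum(right));

    dQ_misclass = decrease(misclass);   % 0.25
    dQ_gini     = decrease(gini);       % 0.125 (~0.13)
    dQ_entropy  = decrease(entrop);     % about 0.13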
Sensitivity to Random Training Data
• Consider the IRIS data set where every other sample is used (75 samples total, 25 per class). The CART tree formed using the same MATLAB software (splitmin = 10, Gini loss function) differs from the tree obtained for the full data set. [tree figure]

Decision Trees: summary
• Advantages
  - speed
  - interpretability
  - handles different types of input variables
• Limitations: sensitivity to
  - correlated inputs
  - affine transformations (of the input variables)
  - general instability of trees
• Variations: ID3 (in machine learning), linear CART

OUTLINE
• Objectives
• Statistical Methodology and Basic Methods
• Taxonomy of Nonlinear Methods
• Decision Trees
• Additive Modeling and Projection Pursuit
• Greedy Feature Selection
• Signal Denoising
• Summary and discussion

Additive Modeling
• Additive model parameterization for regression:
  E(y|x) = f(x) = b + g1(x1) + g2(x2) + ... + gd(xd),
  where each g_j(x_j) is an unknown (smooth) function. Each univariate component is estimated separately.
• Additive model for classification:
  logit P(y=1|x) = ln[ P(y=1|x) / (1 − P(y=1|x)) ] = b + g1(x1) + g2(x2) + ... + gd(xd)
• Backfitting is a greedy optimization approach for estimating the basis functions sequentially

Backfitting
• By fixing all basis functions j ≠ k, the empirical risk (MSE) can be decomposed as
  R_emp(V) = (1/n) Σ_{i=1}^{n} (y_i − f(x_i, V))²
           = (1/n) Σ_{i=1}^{n} (y_i − Σ_{j≠k} g_j(x_i, v_j) − w_0 − g_k(x_i, v_k))²
           = (1/n) Σ_{i=1}^{n} (r_i − g_k(x_i, v_k))²
• Each basis function g_k(x, v_k) is estimated via the iterative backfitting algorithm (until some stopping criterion is met)
• Note: r_i can be interpreted as the response variable for the adaptive method estimating g_k(x, v_k)

Backfitting Algorithm: example
• Consider regression estimation of a function of two variables of the form y = g1(x1) + g2(x2) + noise, from training data (x1i, x2i, yi), i = 1, 2, ..., n. For example, t(x1, x2) = x1² + sin(2πx2), x ∈ [0, 1]²
• Backfitting method:
  (1) estimate g1(x1) for fixed g2
  (2) estimate g2(x2) for fixed g1
  and iterate the above two steps
• Estimation via minimization of the empirical risk
  R_emp(g1, g2) = (1/n) Σ_{i=1}^{n} (y_i − g1(x1i) − g2(x2i))²;
  at the first iteration this becomes (1/n) Σ_{i=1}^{n} (r_i − g1(x1i))², with r_i = y_i − g2(x2i)

Backfitting Algorithm (cont'd)
• Estimation of g1(x1) via minimization of the MSE: R_emp(g1) = (1/n) Σ_{i=1}^{n} (r_i − g1(x1i))² → min
• This is a univariate regression problem of estimating g1(x1) from the n data points (x1i, r_i), where r_i = y_i − g2(x2i)
• It can be solved by smoothing (e.g. kNN regression)
• Estimation of g2(x2) (second iteration) proceeds in a similar manner, via minimization of R_emp(g2) = (1/n) Σ_{i=1}^{n} (r_i − g2(x2i))², where r_i = y_i − g1(x1i)
  (see the backfitting code sketch at the end of this subsection)

Projection Pursuit Regression
• Projection pursuit is an additive model
  f(x, V, W) = w_0 + Σ_{j=1}^{m} g_j(w_j·x, v_j),
  where the basis functions g_j(z, v_j) are univariate functions (of projections)
• The features z_j = (w_j·x) specify the projection of x onto the direction w_j
• A sum of nonlinear functions of projections g_j(w_j·x, v_j) can approximate any nonlinear function. See the example below.
  [Figure: two univariate functions g1(z1) and g2(z2) of projections z1, z2 of the inputs (x1, x2), and their sum as a function of (x1, x2)]

Projection Pursuit Regression (cont'd)
• The backfitting algorithm is used to estimate iteratively
  (a) the basis functions (parameters v_j), via scatterplot smoothing
  (b) the projection parameters w_j, via gradient descent
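A minimal backfitting sketch for the additive example above, y = g1(x1) + g2(x2) + noise, using a simple k-nearest-neighbor smoother as the univariate estimator. The sample size, noise level, number of neighbors and number of passes are arbitrary choices.

    rng(4);
    n = 100;  k = 7;                         % sample size, neighbors used for smoothing
    x1 = rand(n, 1);  x2 = rand(n, 1);
    y = x1.^2 + sin(2*pi*x2) + 0.1*randn(n, 1);

    g1 = zeros(n, 1);  g2 = zeros(n, 1);  b = mean(y);
    for iter = 1:10                          % backfitting passes
        % (1) re-estimate g1 by smoothing the partial residuals against x1
        r = y - b - g2;
        for i = 1:n
            [~, order] = sort(abs(x1 - x1(i)));
            g1(i) = mean(r(order(1:k)));
        end
        g1 = g1 - mean(g1);                  % keep each component centered
        % (2) re-estimate g2 by smoothing the partial residuals against x2
        r = y - b - g1;
        for i = 1:n
            [~, order] = sort(abs(x2 - x2(i)));
            g2(i) = mean(r(order(1:k)));
        end
        g2 = g2 - mean(g2);
    end
    yhat = b + g1 + g2;                      % fitted additive model at the training points

Any univariate smoother can play the same role; projection pursuit additionally updates the projection directions w_j between such smoothing steps.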
Example: estimation of a two-dimensional function via projection pursuit
(a) Projections are found that minimize the unexplained variance; smoothing is then performed to create the adaptive basis functions.
(b) The final model is a sum of two univariate adaptive basis functions.
[Figure]

OUTLINE
• Objectives
• Statistical Methodology and Basic Methods
• Taxonomy of Nonlinear Methods
• Decision Trees
• Additive Modeling and Projection Pursuit
• Greedy Feature Selection
• Signal Denoising
• Summary and discussion

Greedy Feature Selection
• Recall the feature selection structure in SRM:
  - a difficult (nonlinear) optimization problem in general
  - simple with orthogonal basis functions
  - why not use orthogonal basis functions for all applications?
• Consider sparse polynomial estimation (aka best subset regression) as an example of feature selection, i.e. features ~ {x^k}, k = 1, 2, 3, ...
• Compare two approaches:
  - exhaustive search through all subsets
  - forward stepwise selection (standard in statistics)

Data Set Used for Comparisons
• 30 noisy training samples generated as y = t(x) + noise, where t(x) is a fixed nonlinear target function, the noise is Gaussian N(0, 0.05), and the inputs are uniformly distributed in [0, 1] [figure of the training data and the target function]

Feature Selection via Exhaustive Search
• Exhaustive search for best subset selection:
  - estimate the prediction risk (MSE) via leave-one-out cross-validation
  - minimize the empirical risk via least squares for all possible subsets of m variables (features)
  - select the best subset (~ minimum prediction risk)
• Based on the minimum prediction risk (via cross-validation), the model w0 + w1·x + w2·x³ was selected
• The final model, estimated via linear regression with all the data using the features (x, x³), is
  ŷ = 0.7930x³ + 0.7709x + 0.5562

Forward Subset Selection (greedy method)
• Forward subset selection:
  - first estimate the model using one feature
  - then add a second feature if it results in a sufficiently large decrease in RSS, otherwise stop
  - etc. (sequentially adding one more feature at a time)
• Step 1: select the first feature (m = 1) from the candidate models w0 + w1x, w0 + w1x², w0 + w1x³, w0 + w1x⁴, via RSS = Σ_{i=1}^{n} (y_i − f(x_i))². The RSS values are 0.249, 0.270, 0.274 and 0.271, so the selected model is ŷ = 0.677 + 0.09x, with RSS(1) = 0.249
• Step 2: select the second feature (m = 2) from the candidate models w0 + w1x + w2x², w0 + w1x + w2x³, w0 + w1x + w2x⁴, with RSS = 0.0615, 0.05424 and 0.05422 respectively. The selected model is ŷ = 0.5769 + 0.6009x + 0.6814x⁴, with RSS(2) = 0.05422

Forward Subset Selection (cont'd)
• Step 2 (cont'd): check whether including the second feature is justified, using a statistical criterion, usually the F-test
  F = [RSS(m) − RSS(m+1)] / [RSS(m+1)/(n − m − 2)],
  so the (m+1)-st feature is included only if F > 90.
  For adding the second feature: F = (0.2493 − 0.05422) / (0.05422/(30 − 2 − 2)) = 93.59, so we keep it in the model.
• Step 3: select the third feature from the candidate models w0 + w1x + w2x⁴ + w3x² and w0 + w1x + w2x⁴ + w3x³, with RSS = 0.05362 and 0.05363. Test whether adding the third feature is justified via the F-test:
  F = (0.05422 − 0.05362) / (0.05362/(30 − 3 − 2)) = 0.2799,
  so the third feature is not justified, and the final model is ŷ = 0.5769 + 0.6009x + 0.6814x⁴
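A sketch of forward stepwise selection with the F-ratio stopping rule over the polynomial features x, x², x³, x⁴. The synthetic target below is an assumption made for illustration (not necessarily the target used in the slides); the threshold F > 90 follows the example above.

    rng(5);
    n = 30;
    x = rand(n, 1);
    y = x.^3 + x + 0.05*randn(n, 1);         % assumed illustrative target
    Feat = [x x.^2 x.^3 x.^4];               % candidate features
    Fthreshold = 90;

    selected = [];  remaining = 1:4;
    Xcur = ones(n, 1);                       % start from the intercept-only model
    rss_prev = sum((y - mean(y)).^2);

    while ~isempty(remaining)
        % try each remaining feature; keep the one giving the smallest RSS
        best_rss = inf;  best_j = 0;
        for j = remaining
            Xtry = [Xcur Feat(:, j)];
            res = y - Xtry*((Xtry'*Xtry)\(Xtry'*y));
            if sum(res.^2) < best_rss, best_rss = sum(res.^2); best_j = j; end
        end
        % F-ratio for the improvement over the current model
        p = numel(selected) + 1;             % number of features if best_j is added
        Fstat = (rss_prev - best_rss) / (best_rss/(n - p - 2));
        if Fstat < Fthreshold, break; end    % improvement not justified: stop
        selected = [selected best_j];
        remaining = setdiff(remaining, best_j);
        Xcur = [Xcur Feat(:, best_j)];
        rss_prev = best_rss;
    end

Unlike the slide example, this sketch applies the F-test already to the first feature; with a clear signal the first feature passes the test easily, so the selected subset is the same.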
OUTLINE
• Objectives
• Statistical Methodology and Basic Methods
• Taxonomy of Nonlinear Methods
• Decision Trees
• Additive Modeling and Projection Pursuit
• Greedy Feature Selection
• Signal Denoising
  Refs:
  V. Cherkassky and X. Shao, Signal estimation and denoising using VC-theory, Neural Networks, 14, 37-52, 2001
  V. Cherkassky and S. Kilts, Myopotential denoising of ECG signals using wavelet thresholding methods, Neural Networks, 14, 1129-1137, 2001
• Summary and discussion

Signal Denoising Problem
[Figure: a noisy signal and the underlying noise-free signal]

Signal Denoising Problem Statement
• Regression formulation ~ real-valued function estimation (with squared loss)
• Signal representation: a linear combination of orthogonal basis functions (harmonic, wavelets): y = Σ_i w_i g_i(x)
• Differences from the standard regression formulation:
  - fixed sampling rate
  - training-data x-values = test-data x-values
• Computationally efficient orthogonal estimators: Discrete Fourier / Wavelet Transform (DFT / DWT)

Examples of Wavelets
• See http://en.wikipedia.org/wiki/Wavelet
• [Figures: Haar wavelet, Symmlet, Meyer, Mexican Hat]
• [Figure: example of translated and dilated wavelet basis functions generated from a mother wavelet]

Issues for Signal Denoising
• Denoising via (wavelet) thresholding:
  - wavelet thresholding = sparse feature selection
  - a nonlinear estimator suitable for ERM
• Main factors for signal denoising y = Σ_i w_i g_i(x):
  - Representation (choice of basis functions)
  - Ordering (of basis functions) ~ SRM structure
  - Thresholding (model selection)
• Large-sample setting: representation is the dominant factor
• Finite-sample setting: thresholding + ordering

Framework for Signal Denoising
• An ordering of the (wavelet) coefficients for thresholding = a structure on the orthogonal basis functions
• Traditional ordering: |w_k1| ≥ |w_k2| ≥ ... ≥ |w_km| ≥ ...
• Better ordering: |w_k1|/freq_k1 ≥ |w_k2|/freq_k2 ≥ ... ≥ |w_km|/freq_km ≥ ...
• VC thresholding: the optimal number of wavelets is chosen by minimizing the VC bound for regression, with VC dimension h = m (the number of wavelets, or DoF)
  (a generic thresholding code sketch is given at the end of this lecture set)

Empirical Results: signal denoising
• Two target functions: Blocks and Heavisine
• Symmlet wavelet
• Data set: 128 noisy samples, SNR = 2.5
• [Figures: the Blocks and Heavisine signals, and the Blocks and Heavisine estimates obtained by VC-based denoising]

Application Study: ECG Denoising
• [Figures: a noisy ECG recording and a closer look at a noisy segment]
• VC denoising applied to 4,096 noisy samples; the final model (shown in the figure) uses 76 wavelets

OUTLINE
• Objectives
• Statistical Methodology and Basic Methods
• Taxonomy of Nonlinear Methods
• Decision Trees
• Additive Modeling and Projection Pursuit
• Greedy Feature Selection
• Signal Denoising
• Summary and discussion

Summary and Discussion
• Evolution of statistical methods:
  - from parametric to flexible (adaptive) parameterizations
  - fast optimization (greedy methods are favored – why?)
  - interpretable models
  - model complexity ~ the number of parameters (basis functions, regions, features, ...)
  - batch mode (for training)
• Probabilistic framework:
  - classical methods assume probabilistic models of the observed data
  - adaptive statistical methods lack a probabilistic derivation, but use clever heuristics for controlling model complexity
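As a closing illustration of the thresholding idea from the signal-denoising part, here is a generic sketch that keeps the m largest-magnitude coefficients in an orthogonal basis. For simplicity it uses the DFT rather than a wavelet transform, the traditional ordering by coefficient magnitude, and a fixed m; it is not the VC-based ordering and thresholding described above, and the signal and the value of m are arbitrary.

    n = 1024;
    t = (0:n-1)'/n;
    signal = sin(2*pi*4*t) + 0.5*sign(sin(2*pi*9*t));   % a piecewise-smooth target
    y = signal + 0.3*randn(n, 1);                       % noisy samples (fixed sampling rate)

    c = fft(y);                              % coefficients in the orthogonal basis
    [~, order] = sort(abs(c), 'descend');    % traditional ordering: by |coefficient|
    m = 20;                                  % number of basis functions retained (DoF)
    c_kept = zeros(n, 1);
    c_kept(order(1:m)) = c(order(1:m));      % hard thresholding: keep the m largest
    yhat = real(ifft(c_kept));               % denoised signal estimate

In the VC-based approach the ordering would also account for frequency, and m would be selected by minimizing the VC bound rather than fixed in advance.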