SRM
Updated 03/13/23

STATISTICAL LEARNING

Types of Variables
Response      A variable of primary interest
Explanatory   A variable used to study the response variable
Count         A quantitative variable usually valid on non-negative integers
Continuous    A real-valued quantitative variable
Nominal       A categorical/qualitative variable having categories without a meaningful or logical order
Ordinal       A categorical/qualitative variable having categories with a meaningful or logical order

Notation
y, Y                         Response variable
x, X                         Explanatory variable
Subscript i                  Index for observations
n                            No. of observations
Subscript j                  Index for variables except response
p                            No. of variables except response
\mathbf{X}^{\mathsf{T}}      Transpose of matrix \mathbf{X}
\mathbf{X}^{-1}              Inverse of matrix \mathbf{X}
\varepsilon                  Error term
\hat{y}, \hat{Y}, \hat{f}(x) Estimate/estimator of f(x)

Statistical Learning Problems
Data modeling problems are either supervised (has a response variable) or unsupervised (no response variable); supervised problems are either regression (quantitative response variable) or classification (categorical response variable).

Regression Problems
Y = f(x_1, \ldots, x_p) + \varepsilon where E[\varepsilon] = 0, so E[Y] = f(x_1, \ldots, x_p)

Test MSE = E[(Y - \hat{Y})^2], which can be estimated using \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2

For fixed inputs x_1, \ldots, x_p, the test MSE is
\underbrace{Var[\hat{f}(x_1, \ldots, x_p)] + (Bias[\hat{f}(x_1, \ldots, x_p)])^2}_{reducible\ error} + \underbrace{Var[\varepsilon]}_{irreducible\ error}

Classification Problems
Test Error Rate = E[I(Y \ne \hat{Y})], which can be estimated using \frac{1}{n}\sum_{i=1}^{n} I(y_i \ne \hat{y}_i)

Bayes Classifier:
f(x_1, \ldots, x_p) = \arg\max_c \Pr(Y = c \mid X_1 = x_1, \ldots, X_p = x_p)

Contrasting Statistical Learning Elements
Supervised (has response variable)                    vs. Unsupervised (no response variable)
Regression (quantitative response variable)           vs. Classification (categorical response variable)
Parametric (functional form of f specified)           vs. Non-parametric (functional form of f not specified)
Prediction (output of \hat{f})                        vs. Inference (comprehension of f)
Flexibility (\hat{f}'s ability to follow the data)    vs. Interpretability (\hat{f}'s ability to be understood)
Training (observations used to train/obtain \hat{f})  vs. Test (observations not used to train/obtain \hat{f})

Key Ideas
• The disadvantage of parametric methods is the danger of choosing a form for f that is not close to the truth.
• The disadvantage of non-parametric methods is the need for an abundance of observations.
• Flexibility and interpretability are typically at odds.
• As flexibility increases, the training MSE (or error rate) decreases, but the test MSE (or error rate) follows a U-shaped pattern.
• Low flexibility leads to a method with low variance and high bias; high flexibility leads to a method with high variance and low bias.
LINEAR MODELS

Simple Linear Regression (SLR)
Special case of MLR where p = 1

Estimation
b_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}
b_0 = \bar{y} - b_1\bar{x}
SLR Inferences

Standard Errors
se_{b_0} = \sqrt{MSE\left(\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right)}
se_{b_1} = \sqrt{\frac{MSE}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}
se_{\hat{y}} = \sqrt{MSE\left(\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right)}
se_{\hat{y}_{n+1}} = \sqrt{MSE\left(1 + \frac{1}{n} + \frac{(x_{n+1} - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right)}
Multiple Linear Regression (MLR)
Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \varepsilon

Notation
\beta_j      The jth regression coefficient
b_j          Estimate of \beta_j
\sigma^2     Variance of response / irreducible error
MSE          Estimate of \sigma^2
\mathbf{X}   Design matrix
\mathbf{H}   Hat matrix
e            Residual
SST          Total sum of squares
SSR          Regression sum of squares
SSE          Error sum of squares
Assumptions
1. Y_i = \beta_0 + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p} + \varepsilon_i
2. x_{i,j}'s are non-random
3. E[\varepsilon_i] = 0
4. Var[\varepsilon_i] = \sigma^2
5. \varepsilon_i's are independent
6. \varepsilon_i's are normally distributed
7. The predictor x_j is not a linear combination of the other p predictors, for j = 0, 1, \ldots, p
Estimation – Ordinary Least Squares (OLS)
\hat{y} = b_0 + b_1 x_1 + \cdots + b_p x_p
\mathbf{b} = \begin{pmatrix} b_0 \\ \vdots \\ b_p \end{pmatrix} = (\mathbf{X}^{\mathsf{T}}\mathbf{X})^{-1}\mathbf{X}^{\mathsf{T}}\mathbf{y}
MSE = SSE/(n - p - 1)
residual standard error = \sqrt{MSE}
Other Numerical Results
\mathbf{H} = \mathbf{X}(\mathbf{X}^{\mathsf{T}}\mathbf{X})^{-1}\mathbf{X}^{\mathsf{T}}
\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}
e = y - \hat{y}
SST = \sum_{i=1}^{n}(y_i - \bar{y})^2 = total variability
SSR = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 = explained
SSE = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = unexplained
SST = SSR + SSE
R^2 = SSR/SST
R^2_{adj} = 1 - \frac{MSE}{s_y^2} = 1 - (1 - R^2)\left(\frac{n-1}{n-p-1}\right)
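As a concrete check of the OLS quantities above, here is a minimal NumPy sketch (the simulated data and variable names are ours, chosen for illustration):

```python
# Minimal OLS sketch: b = (X'X)^{-1} X'y, MSE, R^2, and adjusted R^2.
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # design matrix with intercept
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)        # OLS coefficient estimates
e = y - X @ b                                # residuals
sse = e @ e
sst = np.sum((y - y.mean()) ** 2)
mse = sse / (n - p - 1)                      # estimate of sigma^2
r2 = 1 - sse / sst
r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
```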
Key Ideas
• R^2 is a poor measure for model comparison because it will increase simply by adding more predictors to a model.
• A polynomial does not change consistently with unit increases of its variable, i.e. there is no constant slope.
• Only w - 1 dummy variables are needed to represent w classes of a categorical predictor; one of the classes acts as a baseline.
• In effect, dummy variables define a distinct intercept for each class. Without the interaction between a dummy variable and a predictor, the dummy variable cannot additionally affect that predictor's regression coefficient.
MLR Inferences

Notation
\hat{\beta}_j        Estimator for \beta_j
\hat{y}              Estimator for E[Y]
se                   Estimated standard error
H_0                  Null hypothesis
H_1                  Alternative hypothesis
df                   Degrees of freedom
t_{1-q, df}          q quantile of a t-distribution
\alpha               Significance level
c                    Confidence level
ndf                  Numerator degrees of freedom
ddf                  Denominator degrees of freedom
F_{1-q, ndf, ddf}    q quantile of an F-distribution
Y_{n+1}              Response of new observation
Subscript R          Reduced model
Subscript F          Full model

Standard Errors
se_{b_j} = \sqrt{\widehat{Var}[\hat{\beta}_j]}

Variance-Covariance Matrix
\widehat{Var}[\mathbf{b}] = MSE\,(\mathbf{X}^{\mathsf{T}}\mathbf{X})^{-1} =
\begin{bmatrix}
\widehat{Var}[\hat{\beta}_0] & \widehat{Cov}[\hat{\beta}_0, \hat{\beta}_1] & \cdots & \widehat{Cov}[\hat{\beta}_0, \hat{\beta}_p] \\
\widehat{Cov}[\hat{\beta}_0, \hat{\beta}_1] & \widehat{Var}[\hat{\beta}_1] & \cdots & \widehat{Cov}[\hat{\beta}_1, \hat{\beta}_p] \\
\vdots & \vdots & \ddots & \vdots \\
\widehat{Cov}[\hat{\beta}_0, \hat{\beta}_p] & \widehat{Cov}[\hat{\beta}_1, \hat{\beta}_p] & \cdots & \widehat{Var}[\hat{\beta}_p]
\end{bmatrix}

t Tests
t statistic = \frac{estimate - hypothesized\ value}{standard\ error}
H_0: \beta_j = hypothesized value

Test Type      Rejection Region
Two-tailed     |t statistic| \ge t_{\alpha/2,\, n-p-1}
Right-tailed   t statistic \ge t_{\alpha,\, n-p-1}
Left-tailed    t statistic \le -t_{\alpha,\, n-p-1}

F Tests
F statistic = \frac{MSR}{MSE} = \frac{SSR/p}{SSE/(n-p-1)}
H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0
Reject H_0 if F statistic \ge F_{\alpha, ndf, ddf}
• ndf = p
• ddf = n - p - 1
Partial F Tests
F statistic = \frac{(SSE_R - SSE_F)/(p_F - p_R)}{SSE_F/(n - p_F - 1)}
H_0: Some \beta_j's = 0
Reject H_0 if F statistic \ge F_{\alpha, ndf, ddf}
• ndf = p_F - p_R
• ddf = n - p_F - 1

For all hypothesis tests, reject H_0 if p-value \le \alpha.
Confidence and Prediction Intervals
estimate ± (t quantile)(standard error)

Quantity    Interval Expression
\beta_j     b_j \pm t_{(1-c)/2,\, n-p-1} \cdot se_{b_j}
E[Y]        \hat{y} \pm t_{(1-c)/2,\, n-p-1} \cdot se_{\hat{y}}
Y_{n+1}     \hat{y}_{n+1} \pm t_{(1-c)/2,\, n-p-1} \cdot se_{\hat{y}_{n+1}}
Linear Model Assumptions
Leverage
h_i = \mathbf{x}_i^{\mathsf{T}}(\mathbf{X}^{\mathsf{T}}\mathbf{X})^{-1}\mathbf{x}_i = \frac{se_{\hat{y}_i}^2}{MSE}
h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{k=1}^{n}(x_k - \bar{x})^2} for SLR
• 1/n \le h_i \le 1
• \sum_{i=1}^{n} h_i = p + 1
• Frees rule of thumb: h_i > 3\left(\frac{p+1}{n}\right)
Studentized and Standardized Residuals
e_{stu,i} = \frac{e_i}{\sqrt{MSE_{(i)}(1 - h_i)}}
e_{sta,i} = \frac{e_i}{\sqrt{MSE(1 - h_i)}}
• Frees rule of thumb: |e_{sta,i}| > 2

Cook's Distance
D_i = \frac{\sum_{k=1}^{n}(\hat{y}_k - \hat{y}_{(i)k})^2}{MSE(p+1)} = \frac{e_i^2 h_i}{MSE(p+1)(1 - h_i)^2}
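A short sketch of these diagnostics computed from the hat matrix (illustrative; the helper name is ours):

```python
# Leverages, standardized residuals, and Cook's distance for an OLS fit.
import numpy as np

def ols_diagnostics(X, y):
    n, q = X.shape                           # q = p + 1 columns, including intercept
    H = X @ np.linalg.inv(X.T @ X) @ X.T     # hat matrix
    h = np.diag(H)                           # leverages; sum(h) equals p + 1
    e = y - H @ y                            # residuals
    mse = e @ e / (n - q)
    e_sta = e / np.sqrt(mse * (1 - h))       # standardized residuals
    D = e**2 * h / (mse * q * (1 - h)**2)    # Cook's distance
    return h, e_sta, D
```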
Plots of Residuals
• e versus \hat{y}
  Residuals are well-behaved if
  o Points appear to be randomly scattered
  o Residuals seem to average to 0
  o Spread of residuals does not change
• e versus i
  Detects dependence of error terms
• qq plot of e
  Assesses normality of error terms
Variance Inflation Factor
VIF_j = \frac{s_{x_j}^2 (n-1)\, se_{b_j}^2}{MSE} = \frac{1}{1 - R_j^2}
Tolerance is the reciprocal of VIF.
• Frees rule of thumb: any VIF_j \ge 10
Key Ideas
• As realizations of a ๐ก๐ก-distribution,
studentized residuals can help
identify outliers.
• When residuals have a larger spread for
larger predictions, one solution is to
transform the response variable with a
concave function.
• There is no universal approach to
handling multicollinearity; it is even
possible to accept it, such as when there
is a suppressor variable. On the other
hand, it can be eliminated by using a set
of orthogonal predictors.
Model Selection

Notation
k        Total no. of predictors in consideration
p        No. of predictors for a specific model
MSE_k    MSE of the model that uses all k predictors
M_p      The "best" model with p predictors

Best Subset Selection
1. For p = 0, 1, \ldots, k, fit all \binom{k}{p} models with p predictors. The model with the largest R^2 is M_p.
2. Choose the best model among M_0, \ldots, M_k using a selection criterion of choice.
Forward Stepwise Selection
1. Fit all k simple linear regression models. The model with the largest R^2 is M_1.
2. For p = 2, \ldots, k, fit the models that add one of the remaining predictors to M_{p-1}. The model with the largest R^2 is M_p.
3. Choose the best model among M_0, \ldots, M_k using a selection criterion of choice.
Backward Stepwise Selection
1. Fit the model with all k predictors, M_k.
2. For p = k-1, \ldots, 1, fit the models that drop one of the predictors from M_{p+1}. The model with the largest R^2 is M_p.
3. Choose the best model among M_0, \ldots, M_k using a selection criterion of choice.
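A sketch of forward stepwise selection by training R^2 (illustrative helper names; in practice the resulting sequence M_1, ..., M_k is then compared with C_p, AIC, BIC, or CV error):

```python
# Forward stepwise: at each step, add the predictor that maximizes R^2.
import numpy as np

def ols_r2(X, y):
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    return 1 - (e @ e) / np.sum((y - y.mean()) ** 2)

def forward_stepwise(X, y):
    n, k = X.shape
    chosen, remaining, path = [], list(range(k)), []
    for _ in range(k):
        best = max(remaining, key=lambda j: ols_r2(
            np.column_stack([np.ones(n), X[:, chosen + [j]]]), y))
        chosen.append(best)
        remaining.remove(best)
        path.append(list(chosen))     # predictor indices of M_1, ..., M_k
    return path
```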
Selection Criteria
• Mallows' C_p
  C_p = \frac{SSE + 2p \cdot MSE_k}{n}
  C_p = \frac{SSE}{MSE_k} - n + 2(p + 1)
• Akaike information criterion
  AIC = \frac{SSE + 2p \cdot MSE_k}{n \cdot MSE_k}
• Bayesian information criterion
  BIC = \frac{SSE + \ln n \cdot p \cdot MSE_k}{n \cdot MSE_k}
• Adjusted R^2
• Cross-validation error
Validation Set
• Randomly splits all available
observations into two groups: the
training set and the validation set.
• Only the observations in the training set
are used to obtain the fitted model, and those in the validation set are used to estimate the test MSE.
k-fold Cross-Validation
1. Randomly divide all available observations into k folds.
2. For v = 1, \ldots, k, obtain the vth fit by training with all observations except those in the vth fold.
3. For v = 1, \ldots, k, use \hat{y} from the vth fit to calculate a test MSE estimate with the observations in the vth fold.
4. To calculate CV error, average the k test MSE estimates in the previous step.
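A minimal sketch of those four steps for an OLS fit (fold logic only; names are ours):

```python
# k-fold CV: average the test-MSE estimates across the k held-out folds.
import numpy as np

def cv_error(X, y, k=5, seed=0):
    idx = np.random.default_rng(seed).permutation(len(y))
    fold_mses = []
    for test in np.array_split(idx, k):          # step 1: k folds
        train = np.setdiff1d(idx, test)
        b, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)  # step 2
        e = y[test] - X[test] @ b                # step 3: held-out errors
        fold_mses.append(np.mean(e ** 2))
    return np.mean(fold_mses)                    # step 4: CV error
```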
Leave-one-out Cross-Validation (LOOCV)
• Calculate LOOCV error as a special case of k-fold cross-validation where k = n.
• For MLR:
  LOOCV Error = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{y}_i}{1 - h_i}\right)^2
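For MLR the shortcut above avoids refitting the model n times; a sketch:

```python
# LOOCV error from a single fit, using the leverage shortcut.
import numpy as np

def loocv_error(X, y):
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)                      # leverages
    e = y - H @ y                       # residuals from the full fit
    return np.mean((e / (1 - h)) ** 2)
```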
Key Ideas on Cross-Validation
• The validation set approach has unstable
results and will tend to overestimate the
test MSE. The two other approaches
mitigate these issues.
• With respect to bias, LOOCV < k-fold CV < Validation Set.
• With respect to variance, LOOCV > k-fold CV > Validation Set.
Other Regression Approaches
Standardizing Variables
• A centered variable is the result of subtracting the sample mean from a variable.
• A scaled variable is the result of dividing a variable by its sample standard deviation.
• A standardized variable is the result of first centering a variable, then scaling it.

Weighted Least Squares
• Var[\varepsilon_i] = \sigma^2/w_i
• Equivalent to running OLS with \sqrt{w}\,y as the response and \sqrt{w}\,\mathbf{x} as the predictors, hence minimizing \sum_{i=1}^{n} w_i(y_i - \hat{y}_i)^2.
• \mathbf{b} = (\mathbf{X}^{\mathsf{T}}\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^{\mathsf{T}}\mathbf{W}\mathbf{y} where \mathbf{W} is the diagonal matrix of the weights.

Partial Least Squares
• The first partial least squares direction, z_1, is a linear combination of standardized predictors x_1, \ldots, x_p, with coefficients based on the relation between x_j and y.
• Every subsequent partial least squares direction is calculated iteratively as a linear combination of "updated predictors", which are the residuals of fits with the "previous predictors" explained by the previous direction.
• The directions z_1, \ldots, z_k are used as predictors in a multiple linear regression. The number of directions, k, is a measure of flexibility.

k-Nearest Neighbors (KNN)
1. Identify the "center of the neighborhood", i.e. the location of an observation with inputs x_1, \ldots, x_p.
2. Starting from the "center of the neighborhood", identify the k nearest training observations.
3. For classification, \hat{y} is the most frequent category among the k observations; for regression, \hat{y} is the average of the response among the k observations.
k is inversely related to flexibility.

Ridge Regression
Coefficients are estimated by minimizing the SSE while constrained by \sum_{j=1}^{p} b_j^2 \le c, or equivalently, by minimizing the expression SSE + \lambda\sum_{j=1}^{p} b_j^2.

Lasso Regression
Coefficients are estimated by minimizing the SSE while constrained by \sum_{j=1}^{p}|b_j| \le c, or equivalently, by minimizing the expression SSE + \lambda\sum_{j=1}^{p}|b_j|.

Key Ideas on Ridge and Lasso
• x_1, \ldots, x_p are scaled predictors.
• \lambda is inversely related to flexibility.
• With a finite \lambda, none of the ridge estimates will equal 0, but the lasso estimates could equal 0.
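A short scikit-learn sketch contrasting the two penalties (assumes scikit-learn is available; alpha plays the role of \lambda, and the data are made up):

```python
# Ridge vs. lasso on scaled predictors: only lasso can zero out coefficients.
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)

Xs = StandardScaler().fit_transform(X)      # ridge and lasso use scaled predictors
ridge = Ridge(alpha=1.0).fit(Xs, y)         # shrinks coefficients, none exactly 0
lasso = Lasso(alpha=0.1).fit(Xs, y)         # some estimates can be exactly 0
print(ridge.coef_)
print(lasso.coef_)
```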
Key Results for Distributions in the Linear Exponential Family

Normal
  Probability function: \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(y-\mu)^2}{2\sigma^2}\right)
  \theta = \mu, \quad b(\theta) = \frac{\theta^2}{2}, \quad Canonical link (b')^{-1}(\mu) = \mu

Binomial (fixed n)
  Probability function: \binom{n}{y}\pi^{y}(1-\pi)^{n-y}
  \theta = \ln\left(\frac{\pi}{1-\pi}\right), \quad b(\theta) = n\ln(1+e^{\theta}), \quad Canonical link (b')^{-1}(\mu) = \ln\left(\frac{\mu}{n-\mu}\right)

Poisson
  Probability function: \frac{\mu^{y}e^{-\mu}}{y!}
  \theta = \ln\mu, \quad b(\theta) = e^{\theta}, \quad Canonical link (b')^{-1}(\mu) = \ln\mu

Negative binomial (fixed r)
  Probability function: \frac{\Gamma(y+r)}{y!\,\Gamma(r)}p^{r}(1-p)^{y}
  \theta = \ln(1-p), \quad b(\theta) = -r\ln(1-e^{\theta}), \quad Canonical link (b')^{-1}(\mu) = \ln\left(\frac{\mu}{\mu+r}\right)

Gamma
  Probability function: \frac{\gamma^{\alpha}}{\Gamma(\alpha)}y^{\alpha-1}e^{-y\gamma}
  \theta = -\frac{\gamma}{\alpha}, \quad b(\theta) = -\ln(-\theta), \quad Canonical link (b')^{-1}(\mu) = -\frac{1}{\mu}

Inverse Gaussian
  Probability function: \sqrt{\frac{\lambda}{2\pi y^{3}}}\exp\left(-\frac{\lambda(y-\mu)^2}{2\mu^{2}y}\right)
  \theta = -\frac{1}{2\mu^2}, \quad b(\theta) = -\sqrt{-2\theta}, \quad Canonical link (b')^{-1}(\mu) = -\frac{1}{2\mu^2}
NON-LINEAR MODELS
Generalized Linear Models

Notation
\theta, \phi           Linear exponential family parameters
E[Y], \mu              Mean response
b'(\theta)             Mean function
v(\mu)                 Variance function
h(\mu)                 Link function
\mathbf{b}             Maximum likelihood estimate of \boldsymbol{\beta}
l(\mathbf{b})          Maximized log-likelihood
l_0                    Maximized log-likelihood for null model
l_{sat}                Maximized log-likelihood for saturated model
e                      Residual
\mathbf{I}             Information matrix
\chi^2_{1-q, df}       q quantile of a chi-square distribution
D^*                    Scaled deviance
D                      Deviance statistic
Linear Exponential Family
Prob. fn. of Y = \exp\left(\frac{y\theta - b(\theta)}{\phi} + a(y, \phi)\right)
E[Y] = b'(\theta)
Var[Y] = \phi \cdot b''(\theta) = \phi \cdot v(\mu)

Model Framework
• h(\mu) = \mathbf{x}^{\mathsf{T}}\boldsymbol{\beta}
• The canonical link is the link function satisfying h(\mu) = (b')^{-1}(\mu), so that \theta = \mathbf{x}^{\mathsf{T}}\boldsymbol{\beta}
Numerical Results
D^* = 2[l_{sat} - l(\mathbf{b})]
D = \phi D^*
For MLR, D = SSE

Max-scaled R^2: R^2_{ms} = \frac{1 - \exp\{2[l_0 - l(\mathbf{b})]/n\}}{1 - \exp\{2 l_0/n\}}
Pseudo-R^2: R^2_{ps} = \frac{l(\mathbf{b}) - l_0}{l_{sat} - l_0}

AIC = -2 \cdot l(\mathbf{b}) + 2 \cdot (p + 1)*
BIC = -2 \cdot l(\mathbf{b}) + \ln n \cdot (p + 1)*
*Assumes only \boldsymbol{\beta} needs to be estimated. If estimating \phi is required, replace p + 1 with p + 2.
Residuals
Raw residual: e_i = y_i - \hat{\mu}_i
Pearson residual: e_i = \frac{y_i - \hat{\mu}_i}{\sqrt{\phi \cdot v(\hat{\mu}_i)}}
  The Pearson chi-square statistic is \sum_{i=1}^{n} e_i^2.
Deviance residual: e_i = \pm\sqrt{D_i^*}, whose sign follows the ith raw residual
Anscombe residual: e_i = \frac{t(y_i) - \widehat{E}[t(Y_i)]}{\sqrt{\widehat{Var}[t(Y_i)]}}
Inference
• Maximum likelihood estimators \hat{\boldsymbol{\beta}} asymptotically have a multivariate normal distribution with mean \boldsymbol{\beta} and asymptotic variance-covariance matrix \mathbf{I}^{-1}.
• To address overdispersion, change the variance to Var[Y_i] = \delta \cdot \phi \cdot b''(\theta_i) and estimate \delta as the Pearson chi-square statistic divided by n - p - 1.
Likelihood Ratio Tests
\chi^2 statistic = 2[l(\mathbf{b}_F) - l(\mathbf{b}_R)]
H_0: Some \beta_j's = 0
Reject H_0 if \chi^2 statistic \ge \chi^2_{\alpha,\, p_F - p_R}
Goodness-of-Fit Tests
Y follows a distribution of choice with m free parameters, whose domain is split into w mutually exclusive intervals; n_c is the observed count for interval c and p_c its probability under the hypothesized distribution.
\chi^2 statistic = \sum_{c=1}^{w}\frac{(n_c - n p_c)^2}{n p_c}
H_0: \Pr(Y \in interval\ c) = p_c for all c = 1, \ldots, w
Reject H_0 if \chi^2 statistic \ge \chi^2_{\alpha,\, w-m-1}
Tweedie Distribution
E[Y] = \mu, \quad Var[Y] = \phi \cdot \mu^{d}

Distribution                          d
Normal                                0
Poisson                               1
Tweedie (compound Poisson–gamma)      (1, 2)
Gamma                                 2
Inverse Gaussian                      3

Parameter Estimation
l(\boldsymbol{\beta}) = \sum_{i=1}^{n}\left[\frac{y_i\theta_i - b(\theta_i)}{\phi} + a(y_i, \phi)\right]
where \theta_i = (b')^{-1}\left(h^{-1}(\mathbf{x}_i^{\mathsf{T}}\boldsymbol{\beta})\right)

The score equations are the partial derivatives of l(\boldsymbol{\beta}) with respect to each \beta_j, all set equal to 0. The solution to the score equations is \mathbf{b}. Then, \hat{\mu} = h^{-1}(\mathbf{x}^{\mathsf{T}}\mathbf{b}).
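A sketch of maximum likelihood GLM fitting with statsmodels (assumed installed; the Poisson data with a log link are simulated for illustration):

```python
# Fit a Poisson GLM; .fit() solves the score equations numerically.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(200, 2)))   # design matrix with intercept
mu = np.exp(X @ np.array([0.5, 0.3, -0.2]))      # ln(mu) = x'beta
y = rng.poisson(mu)

fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(fit.params)        # b, the MLE of beta
print(fit.deviance)      # D
print(fit.aic)           # -2*l(b) + 2*(p + 1)
```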
Logistic and Probit Regression
• The odds of an event are the ratio of the probability that the event will occur to the probability that the event will not occur.
• The odds ratio is the ratio of the odds of an event with the presence of a characteristic to the odds of the same event without the presence of that characteristic.

Binary Response
Function Name            h(\pi)
Logit                    \ln\left(\frac{\pi}{1-\pi}\right)
Probit                   \Phi^{-1}(\pi)
Complementary log-log    \ln(-\ln(1-\pi))

l(\boldsymbol{\beta}) = \sum_{i=1}^{n}[y_i\ln\pi_i + (1-y_i)\ln(1-\pi_i)]
\frac{\partial l(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} = \sum_{i=1}^{n}\mathbf{x}_i(y_i - \pi_i)\frac{\pi_i'}{\pi_i(1-\pi_i)} = \mathbf{0}
D = 2\sum_{i=1}^{n}\left[y_i\ln\left(\frac{y_i}{\hat{\pi}_i}\right) + (1-y_i)\ln\left(\frac{1-y_i}{1-\hat{\pi}_i}\right)\right]
Pearson residual, e_i = \frac{y_i - \hat{\pi}_i}{\sqrt{\hat{\pi}_i(1-\hat{\pi}_i)}}
Pearson chi-square statistic = \sum_{i=1}^{n}\frac{(y_i - \hat{\pi}_i)^2}{\hat{\pi}_i(1-\hat{\pi}_i)}

Nominal Response – Generalized Logit
Let \pi_{i,c} be the probability that the ith observation is classified as category c. The reference category is r.
\ln\left(\frac{\pi_{i,c}}{\pi_{i,r}}\right) = \mathbf{x}_i^{\mathsf{T}}\boldsymbol{\beta}_c
\pi_{i,c} = \frac{\exp(\mathbf{x}_i^{\mathsf{T}}\boldsymbol{\beta}_c)}{1 + \sum_{u \ne r}\exp(\mathbf{x}_i^{\mathsf{T}}\boldsymbol{\beta}_u)}, \quad c \ne r
\pi_{i,r} = \frac{1}{1 + \sum_{u \ne r}\exp(\mathbf{x}_i^{\mathsf{T}}\boldsymbol{\beta}_u)}
l(\boldsymbol{\beta}) = \sum_{i=1}^{n}\sum_{c} I(y_i = c)\ln\pi_{i,c}

Ordinal Response – Proportional Odds Cumulative
h(\Pi_c) = \alpha_c + \mathbf{x}_i^{\mathsf{T}}\boldsymbol{\beta} where
• \Pi_c = \pi_1 + \cdots + \pi_c
• \mathbf{x}_i = (x_{i,1}, \ldots, x_{i,p})^{\mathsf{T}} and \boldsymbol{\beta} = (\beta_1, \ldots, \beta_p)^{\mathsf{T}} (no intercept term)
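A sketch tying the logit link to the odds-ratio interpretation above (statsmodels assumed; data simulated):

```python
# Binary logistic regression; exp(b1) is the fitted odds ratio for a
# one-unit increase in x.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=300)
pi = 1 / (1 + np.exp(-(-0.5 + 1.2 * x)))    # inverse logit
y = rng.binomial(1, pi)

fit = sm.GLM(y, sm.add_constant(x), family=sm.families.Binomial()).fit()
print(fit.params)                 # b0, b1
print(np.exp(fit.params[1]))      # odds ratio
```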
Poisson Count Regression
\ln\mu = \mathbf{x}^{\mathsf{T}}\boldsymbol{\beta}
l(\boldsymbol{\beta}) = \sum_{i=1}^{n}[y_i\ln\mu_i - \mu_i - \ln(y_i!)]
\frac{\partial l(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} = \sum_{i=1}^{n}\mathbf{x}_i(y_i - \mu_i) = \mathbf{0}
\mathbf{I} = \sum_{i=1}^{n}\mu_i\mathbf{x}_i\mathbf{x}_i^{\mathsf{T}}
D = 2\sum_{i=1}^{n}\left[y_i\ln\left(\frac{y_i}{\hat{\mu}_i}\right) - y_i + \hat{\mu}_i\right]
Pearson residual, e_i = \frac{y_i - \hat{\mu}_i}{\sqrt{\hat{\mu}_i}}
Pearson chi-square statistic = \sum_{i=1}^{n}\frac{(y_i - \hat{\mu}_i)^2}{\hat{\mu}_i}

Poisson Regression with Exposures Model
\ln\mu = \ln w + \mathbf{x}^{\mathsf{T}}\boldsymbol{\beta}

Alternative Count Models
These models can incorporate a Poisson distribution while letting the mean of the response differ from the variance of the response:

Model               Mean < Variance    Mean > Variance
Negative binomial   Yes                No
Zero-inflated       Yes                No
Hurdle              Yes                Yes
Heterogeneity       Yes                No
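A sketch of the exposures model via an offset (statsmodels assumed; the exposures w and coefficients are made up):

```python
# Poisson regression with exposures: ln(mu) = ln(w) + x'beta.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
w = rng.uniform(0.5, 2.0, size=200)              # exposures
X = sm.add_constant(rng.normal(size=200))
y = rng.poisson(w * np.exp(X @ np.array([0.2, 0.7])))

fit = sm.GLM(y, X, family=sm.families.Poisson(), offset=np.log(w)).fit()
print(fit.params)
```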
TIME SERIES

Trend Models

Notation
Subscript t       Index for observations
T_t               Trends in time
S_t               Seasonal trends
\varepsilon_t     Random patterns
\hat{y}_{n+l}     l-step ahead forecast
se                Estimated standard error
t_{1-q, df}       q quantile of a t-distribution
n_1               Training sample size
n_2               Test sample size
Trends
Additive: Y_t = T_t + S_t + \varepsilon_t
Multiplicative: Y_t = T_t \times S_t + \varepsilon_t

Stationarity
A process is stationary when its statistical properties do not vary with respect to time. Control charts can be used to identify stationarity.
White Noise
\hat{y}_{n+l} = \bar{y}
se_{\hat{y}_{n+l}} = s_y\sqrt{1 + 1/n}
100c% prediction interval for y_{n+l} is \hat{y}_{n+l} \pm t_{(1-c)/2,\, n-1} \cdot se_{\hat{y}_{n+l}}

Random Walk
w_t = y_t - y_{t-1}
\hat{y}_{n+l} = y_n + l\bar{w}
se_{\hat{y}_{n+l}} = s_w\sqrt{l}
Approximate 95% prediction interval for y_{n+l} is \hat{y}_{n+l} \pm 2 \cdot se_{\hat{y}_{n+l}}
Model Comparison
ME = \frac{1}{n_2}\sum_{t=n_1+1}^{n} e_t
MPE = \frac{100}{n_2}\sum_{t=n_1+1}^{n}\frac{e_t}{y_t}
MSE = \frac{1}{n_2}\sum_{t=n_1+1}^{n} e_t^2
MAE = \frac{1}{n_2}\sum_{t=n_1+1}^{n}|e_t|
MAPE = \frac{100}{n_2}\sum_{t=n_1+1}^{n}\left|\frac{e_t}{y_t}\right|
Autoregressive Models

Notation
\rho_k      Lag k autocorrelation
r_k         Lag k sample autocorrelation
\sigma^2    Variance of white noise
s^2         Estimate of \sigma^2
b_0         Estimate of \beta_0
b_1         Estimate of \beta_1
\bar{y}_-   Sample mean of the first n-1 observations
\bar{y}_+   Sample mean of the last n-1 observations

Autocorrelation
r_k = \frac{\sum_{t=k+1}^{n}(y_{t-k} - \bar{y})(y_t - \bar{y})}{\sum_{t=1}^{n}(y_t - \bar{y})^2}

Testing Autocorrelation
test statistic = r_k / se_{r_k} where se_{r_k} = 1/\sqrt{n}
H_0: \rho_k = 0 against H_1: \rho_k \ne 0
Reject H_0 if |test statistic| \ge z_{1-\alpha/2}
AR(1) Model
Y_t = \beta_0 + \beta_1 Y_{t-1} + \varepsilon_t

Assumptions
1. E[\varepsilon_t] = 0
2. Var[\varepsilon_t] = \sigma^2
3. Cov[\varepsilon_{t+k}, \varepsilon_t] = 0 for k > 0

• If \beta_1 = 0, Y_t follows a white noise process.
• If \beta_1 = 1, Y_t follows a random walk process.
• If -1 < \beta_1 < 1, Y_t is stationary.

Properties of Stationary AR(1) Model
E[Y_t] = \frac{\beta_0}{1 - \beta_1}
Var[Y_t] = \frac{\sigma^2}{1 - \beta_1^2}
\rho_k = \beta_1^{k}
Estimation
b_1 = \frac{\sum_{t=2}^{n}(y_{t-1} - \bar{y}_-)(y_t - \bar{y}_+)}{\sum_{t=2}^{n}(y_{t-1} - \bar{y}_-)^2} \approx r_1
b_0 = \bar{y}_+ - b_1\bar{y}_- \approx \bar{y}(1 - b_1)
s^2 = \frac{\sum_{t=2}^{n}(e_t - \bar{e})^2}{n - 3}
\widehat{Var}[Y_t] = \frac{s^2}{1 - b_1^2}

Smoothing and Predictions
\hat{y}_t = b_0 + b_1 y_{t-1}, \quad 2 \le t \le n
\hat{y}_{n+l} = b_0 + b_1 y_n for l = 1; \quad \hat{y}_{n+l} = b_0 + b_1\hat{y}_{n+l-1} for l > 1
se_{\hat{y}_{n+l}} = s\sqrt{1 + b_1^2 + b_1^4 + \cdots + b_1^{2(l-1)}}
100c% prediction interval for y_{n+l} is \hat{y}_{n+l} \pm t_{(1-c)/2,\, n-3} \cdot se_{\hat{y}_{n+l}}
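A sketch of AR(1) estimation on lagged pairs and chained forecasting (names are ours):

```python
# Fit y_t = b0 + b1*y_{t-1} and forecast l steps ahead by chaining.
import numpy as np

def ar1_forecast(y, l):
    y = np.asarray(y, dtype=float)
    x, z = y[:-1], y[1:]                   # (y_{t-1}, y_t) pairs
    b1 = np.sum((x - x.mean()) * (z - z.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = z.mean() - b1 * x.mean()
    forecasts, cur = [], y[-1]
    for _ in range(l):
        cur = b0 + b1 * cur                # yhat_{n+l} = b0 + b1 * yhat_{n+l-1}
        forecasts.append(cur)
    return forecasts
```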
Other Time Series Models

Notation
k    Moving average length
w    Smoothing parameter
b    Seasonal base
m    No. of trigonometric functions

Smoothing with Moving Averages
Y_t = \beta_0 + \varepsilon_t

Smoothing
\hat{s}_t = \frac{y_t + y_{t-1} + \cdots + y_{t-k+1}}{k}
\hat{s}_t = \hat{s}_{t-1} + \frac{y_t - y_{t-k}}{k}

Predictions
b_0 = \hat{s}_n
\hat{y}_{n+l} = b_0
Double Smoothing with Moving Averages
Y_t = \beta_0 + \beta_1 t + \varepsilon_t

Smoothing
\hat{s}_t^{(2)} = \frac{\hat{s}_t + \hat{s}_{t-1} + \cdots + \hat{s}_{t-k+1}}{k}
\hat{s}_t^{(2)} = \hat{s}_{t-1}^{(2)} + \frac{\hat{s}_t - \hat{s}_{t-k}}{k}

Predictions
b_0 = 2\hat{s}_n - \hat{s}_n^{(2)}
b_1 = \frac{2\left(\hat{s}_n - \hat{s}_n^{(2)}\right)}{k - 1}
\hat{y}_{n+l} = b_0 + b_1 \cdot l
Exponential Smoothing
Y_t = \beta_0 + \varepsilon_t

Smoothing
\hat{s}_t = (1-w)(y_t + w y_{t-1} + \cdots + w^{t} y_0)
\hat{s}_t = (1-w)y_t + w\hat{s}_{t-1}, \quad 0 \le w < 1
The value of w is determined by minimizing SS(w) = \sum_{t=1}^{n}(y_t - \hat{s}_{t-1})^2.

Predictions
b_0 = \hat{s}_n
\hat{y}_{n+l} = b_0
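A sketch of the smoothing recursion and the criterion for choosing w (initializing at y_0 is one common convention, an assumption here):

```python
# Exponential smoothing: s_t = (1-w)*y_t + w*s_{t-1}; choose w to minimize SS(w).
import numpy as np

def exp_smooth(y, w):
    s = np.empty(len(y))
    s[0] = y[0]                              # assumed initialization
    for t in range(1, len(y)):
        s[t] = (1 - w) * y[t] + w * s[t - 1]
    return s                                 # forecast: yhat_{n+l} = s[-1]

def ss(y, w):
    s = exp_smooth(y, w)
    return np.sum((y[1:] - s[:-1]) ** 2)     # SS(w), one-step-ahead errors
```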
Double Exponential Smoothing
Y_t = \beta_0 + \beta_1 t + \varepsilon_t

Smoothing
\hat{s}_t^{(2)} = (1-w)(\hat{s}_t + w\hat{s}_{t-1} + \cdots + w^{t}\hat{s}_0)
\hat{s}_t^{(2)} = (1-w)\hat{s}_t + w\hat{s}_{t-1}^{(2)}, \quad 0 \le w < 1

Predictions
b_0 = 2\hat{s}_n - \hat{s}_n^{(2)}
b_1 = \frac{1-w}{w}\left(\hat{s}_n - \hat{s}_n^{(2)}\right)
\hat{y}_{n+l} = b_0 + b_1 \cdot l

Seasonal Time Series Models

Fixed Seasonal Effects – Trigonometric Functions
S_t = \sum_{i=1}^{m}[\beta_{1,i}\sin(f_i t) + \beta_{2,i}\cos(f_i t)]
• f_i = 2\pi i/b
• m \le b/2

Seasonal Autoregressive Models, SAR(p)
Y_t = \beta_0 + \beta_1 Y_{t-b} + \cdots + \beta_p Y_{t-pb} + \varepsilon_t

Holt-Winter Seasonal Additive Model
Y_t = \beta_0 + \beta_1 t + S_t + \varepsilon_t
• S_t = S_{t-b}
• \sum_{t=1}^{b} S_t = 0
Unit Root Test
• A unit root test is used to evaluate the fit
of a random walk model.
• A random walk model is a good fit if the
time series possesses a unit root.
• The Dickey-Fuller test and augmented
Dickey-Fuller test are two examples of
unit root tests.
Volatility Models

ARCH(p) Model
\sigma_t^2 = \omega + \gamma_1\varepsilon_{t-1}^2 + \cdots + \gamma_p\varepsilon_{t-p}^2

GARCH(p, q) Model
\sigma_t^2 = \omega + \gamma_1\varepsilon_{t-1}^2 + \cdots + \gamma_p\varepsilon_{t-p}^2 + \delta_1\sigma_{t-1}^2 + \cdots + \delta_q\sigma_{t-q}^2
Var[\varepsilon_t] = \frac{\omega}{1 - \sum_{j=1}^{p}\gamma_j - \sum_{j=1}^{q}\delta_j}

Assumptions
• \omega > 0
• \gamma_j \ge 0
• \delta_j \ge 0
• \sum_{j=1}^{p}\gamma_j + \sum_{j=1}^{q}\delta_j < 1
Key Ideas for Smoothing
• Smoothing is only appropriate for time series data without a linear trend.
• Exponential smoothing is related to weighted least squares.
• A double smoothing procedure can be used to forecast time series data with a linear trend.
• Holt-Winter double exponential smoothing is a generalization of double exponential smoothing.
DECISION TREES

Regression and Classification Trees

Notation
R          Region of predictor space
n_m        No. of observations in node m
n_{m,c}    No. of category c observations in node m
I          Impurity
E          Classification error rate
G          Gini index
D          Cross entropy
T          Subtree
|T|        No. of terminal nodes in T
\lambda    Tuning parameter
Algorithm
1. Construct a large tree with M terminal nodes using recursive binary splitting.
2. Obtain a sequence of best subtrees, as a function of \lambda, using cost complexity pruning.
3. Choose \lambda by applying k-fold cross-validation. Select the \lambda that results in the lowest cross-validation error.
4. The best subtree is the subtree created in step 2 with the selected \lambda value.
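A scikit-learn sketch of this algorithm (assumed available; sklearn's ccp_alpha plays the role of the tuning parameter \lambda):

```python
# Grow a large tree, get the pruning path, and pick lambda by 5-fold CV.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.where(X[:, 0] > 0, 2.0, -1.0) + rng.normal(scale=0.5, size=200)

alphas = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y).ccp_alphas
scores = [cross_val_score(DecisionTreeRegressor(random_state=0, ccp_alpha=a),
                          X, y, cv=5, scoring="neg_mean_squared_error").mean()
          for a in alphas]
best = alphas[int(np.argmax(scores))]        # lambda with the lowest CV error
final_tree = DecisionTreeRegressor(random_state=0, ccp_alpha=best).fit(X, y)
```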
Recursive Binary Splitting
Regression:
Minimize \sum_{m=1}^{M}\sum_{i:\,\mathbf{x}_i\in R_m}(y_i - \bar{y}_{R_m})^2
Classification:
Minimize \frac{1}{n}\sum_{m=1}^{M} n_m \cdot I_m

More Under Classification:
\hat{p}_{m,c} = n_{m,c}/n_m
E_m = 1 - \max_c \hat{p}_{m,c}
G_m = \sum_{c=1}^{w}\hat{p}_{m,c}(1 - \hat{p}_{m,c})
D_m = -\sum_{c=1}^{w}\hat{p}_{m,c}\ln\hat{p}_{m,c}
deviance = -2\sum_{m=1}^{M}\sum_{c=1}^{w} n_{m,c}\ln\hat{p}_{m,c}
residual mean deviance = \frac{deviance}{n - M}
Cost Complexity Pruning
Regression:
Minimize \sum_{m=1}^{|T|}\sum_{i:\,\mathbf{x}_i\in R_m}(y_i - \bar{y}_{R_m})^2 + \lambda|T|
Classification:
Minimize \frac{1}{n}\sum_{m=1}^{|T|} n_m \cdot I_m + \lambda|T|

Key Ideas
• Terminal nodes or leaves represent the
partitions of the predictor space.
• Internal nodes are points along the tree
where splits occur.
• Terminal nodes do not have child nodes,
but internal nodes do.
• Branches are lines that connect any
two nodes.
• A decision tree with only one internal
node is called a stump.
Advantages of Trees
• Easy to interpret and explain
• Can be presented visually
• Manage categorical variables without the need for dummy variables
• Mimic human decision-making
Disadvantages of Trees
• Not robust
• Do not have the same degree of predictive
accuracy as other statistical methods
Multiple Trees

Bagging
1. Create b bootstrap samples from the original training dataset.
2. Construct a decision tree for each bootstrap sample using recursive binary splitting.
3. Predict the response of a new observation by averaging the predictions (regression trees) or by using the most frequent category (classification trees) across all b trees.

Properties
• Increasing b does not cause overfitting.
• Bagging reduces variance.
• Out-of-bag error is a valid estimate of test error.

Random Forests
1. Create b bootstrap samples from the original training dataset.
2. Construct a decision tree for each bootstrap sample using recursive binary splitting. At each split, a random subset of m variables is considered.
3. Predict the response of a new observation by averaging the predictions (regression trees) or by using the most frequent category (classification trees) across all b trees.

Properties
• Bagging is a special case of random forests.
• Increasing b does not cause overfitting.
• Decreasing m reduces the correlation between predictions.
Boosting
Let z_1 be the actual response variable, y.
1. For k = 1, 2, \ldots, B:
   • Use recursive binary splitting to fit a tree with d splits to the data with z_k as the response.
   • Update z_k by subtracting \lambda \cdot \hat{f}_k(\mathbf{x}), i.e. let z_{k+1} = z_k - \lambda \cdot \hat{f}_k(\mathbf{x}).
2. Calculate the boosted model prediction as \hat{f}(\mathbf{x}) = \sum_{k=1}^{B}\lambda \cdot \hat{f}_k(\mathbf{x}).
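A sketch of this residual-fitting loop (illustrative; max_depth stands in for the number of splits d, and sklearn trees are assumed):

```python
# Boosting: repeatedly fit small trees to the current residuals z_k.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(X, y, B=100, d=1, lam=0.1):
    z = np.asarray(y, dtype=float).copy()        # z_1 = y
    trees = []
    for _ in range(B):
        t = DecisionTreeRegressor(max_depth=d).fit(X, z)
        z -= lam * t.predict(X)                  # z_{k+1} = z_k - lam * fhat_k(x)
        trees.append(t)
    return lambda X0: lam * sum(t.predict(X0) for t in trees)  # fhat(x)
```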
Properties
• Increasing B can cause overfitting.
• Boosting reduces bias.
• d controls the complexity of the boosted model.
• \lambda controls the rate at which boosting learns.
UNSUPERVISED LEARNING

Principal Components Analysis

Notation
z, Z           Principal component (score)
Subscript m    Index for principal components
\phi           Principal component loading
x, X           Centered explanatory variable
Principal Components
z_m = \sum_{j=1}^{p}\phi_{j,m}x_j, \quad z_{i,m} = \sum_{j=1}^{p}\phi_{j,m}x_{i,j}
• \sum_{j=1}^{p}\phi_{j,m}^2 = 1
• \sum_{j=1}^{p}\phi_{j,m}\cdot\phi_{j,u} = 0, \quad m \ne u

Proportion of Variance Explained (PVE)
\sum_{j=1}^{p}s_{x_j}^2 = \sum_{j=1}^{p}\frac{1}{n-1}\sum_{i=1}^{n}x_{i,j}^2
s_{z_m}^2 = \frac{1}{n-1}\sum_{i=1}^{n}z_{i,m}^2
PVE_m = \frac{s_{z_m}^2}{\sum_{j=1}^{p}s_{x_j}^2}
Cluster Analysis

Notation
C       Cluster containing indices
W(C)    Within-cluster variation of cluster C
|C|     No. of observations in cluster C

Euclidean Distance = \sqrt{\sum_{j=1}^{p}(x_{i,j} - x_{m,j})^2}

k-Means Clustering
1. Randomly assign a cluster to each observation. This serves as the initial cluster assignments.
2. Calculate the centroid of each cluster.
3. For each observation, identify the closest centroid and reassign to that cluster.
4. Repeat steps 2 and 3 until the cluster assignments stop changing.

W(C_v) = \frac{1}{|C_v|}\sum_{i,m\in C_v}\sum_{j=1}^{p}(x_{i,j} - x_{m,j})^2 = 2\sum_{i\in C_v}\sum_{j=1}^{p}(x_{i,j} - \bar{x}_{v,j})^2
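A sketch of the algorithm above (illustrative; it ignores the empty-cluster edge case):

```python
# k-means by alternating centroid updates and reassignment.
import numpy as np

def k_means(X, k, iters=100, seed=0):
    labels = np.random.default_rng(seed).integers(k, size=len(X))   # step 1
    for _ in range(iters):
        centroids = np.array([X[labels == v].mean(axis=0) for v in range(k)])  # step 2
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new = np.argmin(dists, axis=1)                               # step 3
        if np.array_equal(new, labels):                              # step 4
            break
        labels = new
    return labels, centroids
```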
Key Ideas
• The variance explained by each subsequent principal component is always less than the variance explained by the previous principal component.
• All principal components are uncorrelated with one another.
• A dataset has \min(n-1, p) distinct principal components.
• The first k principal component scores and loadings approximate the original dataset, x_{i,j} \approx \sum_{m=1}^{k} z_{i,m}\phi_{j,m}.
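A NumPy sketch of scores, loadings, and PVE via the SVD of the centered data (names are ours):

```python
# PCA from centered data: loadings, scores z_{i,m}, and PVE per component.
import numpy as np

X = np.random.default_rng(0).normal(size=(100, 4))
Xc = X - X.mean(axis=0)                 # center each explanatory variable
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
loadings = Vt.T                         # columns are the loading vectors phi
scores = Xc @ loadings                  # z_{i,m}
pve = s**2 / np.sum(s**2)               # s_{z_m}^2 / sum_j s_{x_j}^2
```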
Hierarchical Clustering
1. Select the dissimilarity measure and linkage to be used. Treat each observation as its own cluster.
2. For k = n, n-1, \ldots, 2:
   • Compute the inter-cluster dissimilarity between all k clusters.
   • Examine all \binom{k}{2} pairwise dissimilarities. The two clusters with the lowest inter-cluster dissimilarity are fused. The dissimilarity indicates the height in the dendrogram at which these two clusters join.
Linkage     Inter-cluster dissimilarity
Complete    The largest dissimilarity
Average     The arithmetic mean of the dissimilarities
Single      The smallest dissimilarity
Centroid    The dissimilarity between the cluster centroids
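A SciPy sketch of the algorithm with a chosen linkage (SciPy assumed available):

```python
# Agglomerative clustering: build the dendrogram, then cut it.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(10, 2))
Z = linkage(X, method="complete", metric="euclidean")   # fuse least-dissimilar pair
labels = fcluster(Z, t=3, criterion="maxclust")         # cut into 3 clusters
```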
Key Ideas
• For k-means clustering, the algorithm needs to be repeated for each k.
• For hierarchical clustering, the algorithm only needs to be performed once for any number of clusters.
• The result of clustering depends on many parameters, such as:
  o Choice of k in k-means clustering
  o Choice of number of clusters, linkage, and dissimilarity measure in hierarchical clustering
  o Choice to standardize variables
Principal Components Regression
Y = \theta_0 + \theta_1 z_1 + \cdots + \theta_k z_k + \varepsilon
• If k = p, then \beta_j = \sum_{m=1}^{k}\theta_m\phi_{j,m}.
© 2023 Coaching Actuaries. All Rights Reserved | www.coachingactuaries.com