SRM Formula Sheet (Updated 03/13/23)
© 2023 Coaching Actuaries. All Rights Reserved. www.coachingactuaries.com

STATISTICAL LEARNING

Data Modeling Problems

Types of Variables
• Response: a variable of primary interest.
• Explanatory: a variable used to study the response variable.
• Count: a quantitative variable, usually valid on the non-negative integers.
• Continuous: a real-valued quantitative variable.
• Nominal: a categorical (qualitative) variable whose categories have no meaningful or logical order.
• Ordinal: a categorical (qualitative) variable whose categories have a meaningful or logical order.

Notation
• $y$: response variable; $x$: explanatory variable.
• Subscript $i$: index for observations; $n$: no. of observations.
• Subscript $j$: index for variables other than the response; $p$: no. of variables other than the response.
• $\mathbf{A}^{T}$: transpose of a matrix; $\mathbf{A}^{-1}$: inverse of a matrix.
• $\varepsilon$: error term.
• A hat denotes an estimate or estimator; $\hat{f}(x)$ is the estimator of $f(x)$.
• Training observations: used to train/obtain $\hat{f}$. Test observations: not used to train/obtain $\hat{f}$.

Contrasting Statistical Learning Elements
• Supervised (has a response variable) vs. unsupervised (no response variable).
• Regression (quantitative response) vs. classification (categorical response).
• Parametric (functional form of $f$ specified) vs. non-parametric (functional form of $f$ not specified).
• Prediction (output of $\hat{f}$) vs. inference (comprehension of $f$).
• Flexibility ($\hat{f}$'s ability to follow the data) vs. interpretability ($\hat{f}$'s ability to be understood).

Regression Problems
$y = f(x_1, \dots, x_p) + \varepsilon$, where $\mathrm{E}[\varepsilon] = 0$.
For fixed inputs $x_1, \dots, x_p$, the expected test MSE is
$\mathrm{E}\big[(y - \hat{f}(x_1, \dots, x_p))^2\big] = \mathrm{Var}\big[\hat{f}(x_1, \dots, x_p)\big] + \big(\mathrm{Bias}\big[\hat{f}(x_1, \dots, x_p)\big]\big)^2 + \mathrm{Var}[\varepsilon]$,
which can be estimated using $\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$.

Classification Problems
The test error rate is $\mathrm{E}[I(y \ne \hat{y})]$, which can be estimated using $\frac{1}{n}\sum_{i=1}^{n} I(y_i \ne \hat{y}_i)$.
Bayes classifier: $\hat{y} = \arg\max_c \Pr(y = c \mid x_1, \dots, x_p)$.

Key Ideas
• The disadvantage of parametric methods is the danger of choosing a form for $f$ that is not close to the truth.
• The disadvantage of non-parametric methods is the need for an abundance of observations.
• Flexibility and interpretability are typically at odds.
• As flexibility increases, the training MSE (or error rate) decreases, but the test MSE (or error rate) follows a U-shaped pattern; see the simulation sketch below.
• Low flexibility leads to a method with low variance and high bias; high flexibility leads to a method with high variance and low bias.
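The U-shaped test MSE can be seen in a short simulation. The following is a minimal Python sketch, not part of the original sheet; the true function, noise level, and use of polynomial degree as a flexibility proxy are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)             # assumed true regression function
n = 100
x_tr, x_te = rng.uniform(0, 1, n), rng.uniform(0, 1, n)
y_tr = f(x_tr) + rng.normal(0, 0.3, n)          # Var[eps] = 0.09 is the irreducible error
y_te = f(x_te) + rng.normal(0, 0.3, n)

for degree in (1, 3, 5, 9, 12):                 # polynomial degree as a flexibility proxy
    coefs = np.polyfit(x_tr, y_tr, degree)      # least-squares polynomial fit
    mse_tr = np.mean((y_tr - np.polyval(coefs, x_tr)) ** 2)
    mse_te = np.mean((y_te - np.polyval(coefs, x_te)) ** 2)
    print(f"degree {degree:2d}: training MSE {mse_tr:.3f}, test MSE {mse_te:.3f}")
```

Training MSE falls as the degree grows, while test MSE bottoms out near the irreducible error and then rises again.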
LINEAR MODELS: SIMPLE AND MULTIPLE LINEAR REGRESSION

Multiple linear regression (MLR): $y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \varepsilon$.
Simple linear regression (SLR): the special case of MLR where $p = 1$.

Notation
• $\beta_j$: the $j$th regression coefficient; $b_j$: estimate of $\beta_j$.
• $\sigma^2$: variance of the response (irreducible error); $s^2$ (the MSE): estimate of $\sigma^2$.
• $\mathbf{X}$: design matrix; $\mathbf{H}$: hat matrix; $e$: residual; $\hat{y}$: estimator for $\mathrm{E}[y]$.
• $\mathrm{SS_{TOT}}$: total sum of squares; $\mathrm{SS_{REG}}$: regression sum of squares; $\mathrm{SS_{ERR}}$: error sum of squares.

Estimation: Ordinary Least Squares (OLS)
SLR: $b_1 = \dfrac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$, $\quad b_0 = \bar{y} - b_1 \bar{x}$
MLR: $\mathbf{b} = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}$

Assumptions
1. $\mathrm{E}[y_i] = \beta_0 + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p}$, i.e. $\mathrm{E}[\varepsilon_i] = 0$.
2. $\mathrm{Var}[\varepsilon_i] = \sigma^2$.
3. The $\varepsilon_i$'s are independent.
4. The $\varepsilon_i$'s are normally distributed.
5. The $x_{i,j}$'s are non-random.
6. No predictor $x_j$ is a linear combination of the other predictors, for $j = 0, 1, \dots, p$.

Other Numerical Results
• $\mathbf{H} = \mathbf{X}(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}$, so $\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}$; $e_i = y_i - \hat{y}_i$.
• $\mathrm{SS_{TOT}} = \sum_{i=1}^{n}(y_i - \bar{y})^2$ = total variability.
• $\mathrm{SS_{REG}} = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2$ = explained variability.
• $\mathrm{SS_{ERR}} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ = unexplained variability; $\mathrm{SS_{TOT}} = \mathrm{SS_{REG}} + \mathrm{SS_{ERR}}$.
• $s^2 = \mathrm{SS_{ERR}}/(n - p - 1)$; $s$ is the residual standard error.
• $R^2 = \mathrm{SS_{REG}}/\mathrm{SS_{TOT}} = 1 - \mathrm{SS_{ERR}}/\mathrm{SS_{TOT}}$.
• $R^2_{adj} = 1 - (1 - R^2)\dfrac{n - 1}{n - p - 1}$.

Key Ideas
• $R^2$ is a poor measure for model comparison because it increases simply by adding more predictors to a model.
• Polynomial terms do not change by a constant amount per unit increase of their variable; there is no constant slope.
• Only $w - 1$ dummy variables are needed to represent $w$ classes of a categorical predictor; one of the classes acts as the baseline.
• In effect, dummy variables define a distinct intercept for each class. Without an interaction between a dummy variable and a predictor, the dummy variable cannot additionally affect that predictor's regression coefficient.
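A minimal Python sketch of the OLS quantities above, on simulated data; the design matrix and coefficients are illustrative assumptions, not from the sheet.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # design matrix with intercept
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(0, 0.5, n)

b = np.linalg.solve(X.T @ X, X.T @ y)        # b = (X'X)^(-1) X'y
y_hat = X @ b
ss_tot = np.sum((y - y.mean()) ** 2)         # total variability
ss_err = np.sum((y - y_hat) ** 2)            # unexplained variability
s2 = ss_err / (n - p - 1)                    # MSE, estimates sigma^2
r2 = 1 - ss_err / ss_tot
r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(b, s2, r2, r2_adj)
```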
SLR/MLR INFERENCE

Notation
• $se$: estimated standard error; $H_0$: null hypothesis; $H_1$: alternative hypothesis.
• df: degrees of freedom; $t_{1-\alpha/2,\,df}$: quantile of a $t$-distribution.
• $\alpha$: significance level; $1 - \alpha$: confidence level.
• ndf, ddf: numerator and denominator degrees of freedom; $F_{1-\alpha,\,ndf,\,ddf}$: quantile of an $F$-distribution.
• $y_{n+1}$: response of a new observation.
• Reduced model: drops the tested predictors; full model: keeps them.

Standard Errors (SLR)
• $se(b_1) = \sqrt{\dfrac{s^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}$
• $se(b_0) = \sqrt{s^2\left[\dfrac{1}{n} + \dfrac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right]}$
• Mean response at $x^*$: $se(\hat{y}) = \sqrt{s^2\left[\dfrac{1}{n} + \dfrac{(x^* - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right]}$
• New response: $se(\hat{y}_{n+1}) = \sqrt{s^2\left[1 + \dfrac{1}{n} + \dfrac{(x_{n+1} - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right]}$

Variance-Covariance Matrix (MLR)
$\widehat{\mathrm{Var}}[\mathbf{b}] = s^2(\mathbf{X}^{T}\mathbf{X})^{-1}$; its diagonal entries are $\widehat{\mathrm{Var}}[b_j] = se(b_j)^2$ and its off-diagonal entries are $\widehat{\mathrm{Cov}}[b_j, b_k]$.

t-Tests
$t$ statistic $= \dfrac{\text{estimate} - \text{hypothesized value}}{\text{standard error}}$, with $n - p - 1$ df for a test of $\beta_j$.
• Two-tailed: reject $H_0$ if $|t| \ge t_{1-\alpha/2,\,n-p-1}$.
• Left-tailed: reject $H_0$ if $t \le -t_{1-\alpha,\,n-p-1}$.
• Right-tailed: reject $H_0$ if $t \ge t_{1-\alpha,\,n-p-1}$.

F-Tests
Overall test of $H_0\!: \beta_1 = \beta_2 = \cdots = \beta_p = 0$:
$F$ statistic $= \dfrac{\mathrm{SS_{REG}}/p}{\mathrm{SS_{ERR}}/(n - p - 1)}$; reject $H_0$ if $F \ge F_{1-\alpha,\,p,\,n-p-1}$.
Partial F-test of $H_0$: the $q$ dropped coefficients are all 0:
$F$ statistic $= \dfrac{(\mathrm{SS_{ERR}^{reduced}} - \mathrm{SS_{ERR}^{full}})/q}{\mathrm{SS_{ERR}^{full}}/(n - p - 1)}$; reject $H_0$ if $F \ge F_{1-\alpha,\,q,\,n-p-1}$.
For all hypothesis tests, reject $H_0$ if the p-value $\le \alpha$.

Variance Inflation Factor
$\mathrm{VIF}_j = \dfrac{1}{1 - R_j^2} = \dfrac{se(b_j)^2 (n - 1) s_{x_j}^2}{s^2}$, where $R_j^2$ comes from regressing $x_j$ on the other predictors.
• Tolerance is the reciprocal of VIF.
• Frees' rule of thumb: any $\mathrm{VIF}_j \ge 10$ signals severe multicollinearity.

Confidence and Prediction Intervals
estimate $\pm$ ($t$ quantile)(standard error):
• $\beta_j$: $b_j \pm t_{1-\alpha/2,\,n-p-1}\, se(b_j)$
• $\mathrm{E}[y]$: $\hat{y} \pm t_{1-\alpha/2,\,n-p-1}\, se(\hat{y})$
• $y_{n+1}$: $\hat{y}_{n+1} \pm t_{1-\alpha/2,\,n-p-1}\, se(\hat{y}_{n+1})$

Leverage
$h_i = \mathbf{x}_i^{T}(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{x}_i$, the $i$th diagonal entry of $\mathbf{H}$; for SLR, $h_i = \dfrac{1}{n} + \dfrac{(x_i - \bar{x})^2}{\sum_{k=1}^{n}(x_k - \bar{x})^2}$.
• $1/n \le h_i \le 1$ and $\sum_{i=1}^{n} h_i = p + 1$.
• Frees' rule of thumb: $h_i > 3(p + 1)/n$ indicates high leverage.

Standardized and Studentized Residuals
• Standardized residual: $e_i \big/ \big(s\sqrt{1 - h_i}\big)$.
• Studentized residual: $e_i \big/ \big(s_{(i)}\sqrt{1 - h_i}\big)$, where $s_{(i)}$ comes from the regression fitted without observation $i$.
• Frees' rule of thumb: a studentized residual exceeding 2 in absolute value flags a potential outlier.

Cook's Distance
$D_i = \dfrac{\sum_{k=1}^{n}\big(\hat{y}_k - \hat{y}_{k(i)}\big)^2}{(p + 1)\, s^2} = \dfrac{e_i^2\, h_i}{(p + 1)\, s^2\, (1 - h_i)^2}$

Plots of Residuals
• $\hat{e}$ versus $\hat{y}$: residuals are well-behaved if the points appear randomly scattered, the residuals seem to average to 0, and the spread of the residuals does not change.
• $\hat{e}$ versus $t$ (observation order): detects dependence of the error terms.
• qq-plot of $\hat{e}$: detects non-normality of the error terms.

Key Ideas
• As realizations of a $t$-distribution, studentized residuals can help identify outliers.
• When residuals have a larger spread for larger predictions, one solution is to transform the response variable with a concave function.
• There is no universal approach to handling multicollinearity; it is even possible to accept it, such as when there is a suppressor variable. On the other hand, it can be eliminated by using a set of orthogonal predictors.
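The leverage, standardized residual, and Cook's distance formulas above can all be computed from the hat matrix. A minimal Python sketch on simulated data; every input is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 1
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([0.5, 1.5]) + rng.normal(0, 1.0, n)

H = X @ np.linalg.solve(X.T @ X, X.T)             # hat matrix
h = np.diag(H)                                    # leverages
e = y - H @ y                                     # residuals
s2 = e @ e / (n - p - 1)
std_res = e / np.sqrt(s2 * (1 - h))               # standardized residuals
cooks = e**2 * h / ((p + 1) * s2 * (1 - h)**2)    # Cook's distance

print("sum of leverages:", h.sum())                        # equals p + 1
print("high leverage:", np.where(h > 3 * (p + 1) / n)[0])  # Frees' rule of thumb
print("potential outliers:", np.where(np.abs(std_res) > 2)[0])
print("max Cook's distance:", cooks.max())
```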
MODEL SELECTION

Notation
• $k$: total no. of predictors in consideration; $p$: no. of predictors for a specific model.
• $s_k^2$: MSE of the model that uses all $k$ predictors.
• $\mathcal{M}_p$: the best model with $p$ predictors.

Best Subset Selection
1. For $p = 0, 1, \dots, k$, fit all $\binom{k}{p}$ models with $p$ predictors. The model with the largest $R^2$ is $\mathcal{M}_p$.
2. Choose the best model among $\mathcal{M}_0, \dots, \mathcal{M}_k$ using a selection criterion of choice.

Forward Stepwise Selection
1. Fit all $k$ simple linear regression models. The model with the largest $R^2$ is $\mathcal{M}_1$.
2. For $p = 2, \dots, k$, fit the models that add one of the remaining predictors to $\mathcal{M}_{p-1}$. The model with the largest $R^2$ is $\mathcal{M}_p$.
3. Choose the best model among $\mathcal{M}_1, \dots, \mathcal{M}_k$ using a selection criterion of choice.

Backward Stepwise Selection
1. Fit the model with all $k$ predictors, $\mathcal{M}_k$.
2. For $p = k - 1, \dots, 1$, fit the models that drop one of the predictors from $\mathcal{M}_{p+1}$. The model with the largest $R^2$ is $\mathcal{M}_p$.
3. Choose the best model among $\mathcal{M}_1, \dots, \mathcal{M}_k$ using a selection criterion of choice.

Selection Criteria
• Mallows' $C_p = \dfrac{\mathrm{SS_{ERR}}}{s_k^2} - n + 2(p + 1)$
• Akaike information criterion: $\mathrm{AIC} = -2\, l(\hat{\boldsymbol\beta}) + 2(p + 1)$
• Bayesian information criterion: $\mathrm{BIC} = -2\, l(\hat{\boldsymbol\beta}) + (p + 1)\ln n$
• Adjusted $R^2$
• Validation set error or cross-validation error

Validation Set
• Randomly splits all available observations into two groups: the training set and the validation set.
• Only the observations in the training set are used to attain the fitted model; those in the validation set are used to estimate the test MSE.

k-fold Cross-Validation (here $k$ is the number of folds)
1. Randomly divide all available observations into $k$ folds.
2. For $j = 1, \dots, k$, obtain the $j$th fit by training on all observations except those in the $j$th fold.
3. For $j = 1, \dots, k$, use $\hat{y}$ from the $j$th fit to calculate a test MSE estimate with the observations in the $j$th fold.
4. To calculate the CV error, average the $k$ test MSE estimates from the previous step.

Leave-One-Out Cross-Validation (LOOCV)
• Calculate the LOOCV error as a special case of $k$-fold cross-validation where $k = n$.
• For MLR, no refitting is required (see the sketch below): $\mathrm{CV\ error} = \dfrac{1}{n}\sum_{i=1}^{n}\left(\dfrac{y_i - \hat{y}_i}{1 - h_i}\right)^2$.

Key Ideas on Cross-Validation
• The validation set approach has unstable results and tends to overestimate the test MSE. The two other approaches mitigate these issues.
• With respect to bias: LOOCV < $k$-fold CV < validation set.
• With respect to variance: LOOCV > $k$-fold CV > validation set.
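A minimal Python sketch of $k$-fold cross-validation for an OLS fit, plus the LOOCV leverage shortcut above; the data and the fold count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n, folds = 100, 5
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(0, 1.0, n)

idx = rng.permutation(n)                        # step 1: random fold assignment
parts = np.array_split(idx, folds)
mses = []
for j in range(folds):                          # steps 2-3: fit without fold j, score on fold j
    test = parts[j]
    train = np.setdiff1d(idx, test)
    b = np.linalg.solve(X[train].T @ X[train], X[train].T @ y[train])
    mses.append(np.mean((y[test] - X[test] @ b) ** 2))
print("k-fold CV error:", np.mean(mses))        # step 4: average the k estimates

# LOOCV via the MLR leverage shortcut, with no refitting
H = X @ np.linalg.solve(X.T @ X, X.T)
e = y - H @ y
print("LOOCV error:", np.mean((e / (1 - np.diag(H))) ** 2))
```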
OTHER REGRESSION APPROACHES

Standardizing Variables
• A centered variable is the result of subtracting the sample mean from a variable.
• A scaled variable is the result of dividing a variable by its sample standard deviation.
• A standardized variable is the result of first centering a variable, then scaling it.

Ridge Regression
Coefficients are estimated by minimizing $\mathrm{SS_{ERR}}$ while constrained by $\sum_{j=1}^{p}\beta_j^2 \le c$, or equivalently by minimizing $\mathrm{SS_{ERR}} + \lambda\sum_{j=1}^{p}\beta_j^2$.

Lasso Regression
Coefficients are estimated by minimizing $\mathrm{SS_{ERR}}$ while constrained by $\sum_{j=1}^{p}|\beta_j| \le c$, or equivalently by minimizing $\mathrm{SS_{ERR}} + \lambda\sum_{j=1}^{p}|\beta_j|$.

Key Ideas on Ridge and Lasso
• $x_1, \dots, x_p$ are scaled predictors.
• $\lambda$ is inversely related to flexibility.
• With a finite $\lambda$, none of the ridge estimates will equal 0, but the lasso estimates could equal 0.

Partial Least Squares
• The first partial least squares direction $z_1$ is a linear combination of the standardized predictors $x_1, \dots, x_p$, with coefficients based on the relation between each $x_j$ and $y$.
• Every subsequent direction is calculated iteratively as a linear combination of updated predictors, which are the residuals from regressing each previous predictor on the previous direction.
• The directions $z_1, \dots, z_m$ are used as predictors in a multiple linear regression. The number of directions $m$ is a measure of flexibility.

Weighted Least Squares
$\mathrm{Var}[\varepsilon_i] = \sigma^2 / w_i$
• Equivalent to running OLS with $\sqrt{w_i}\, y_i$ as the response and $\sqrt{w_i}\, \mathbf{x}_i$ as the predictors, hence minimizing $\sum_{i=1}^{n} w_i (y_i - \hat{y}_i)^2$.
• $\mathbf{b} = (\mathbf{X}^{T}\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{W}\mathbf{y}$, where $\mathbf{W}$ is the diagonal matrix of the weights.

k-Nearest Neighbors (KNN)
1. Identify the center of the neighborhood, i.e. the location of an observation with inputs $x_1, \dots, x_p$.
2. Starting from the center of the neighborhood, identify the $k$ nearest training observations.
3. For classification, $\hat{y}$ is the most frequent category among the $k$ observations; for regression, $\hat{y}$ is the average of the response among the $k$ observations.
$k$ is inversely related to flexibility.
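A minimal Python sketch of the KNN steps above for a regression target; the simulated training set and the values of $k$ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
X_train = rng.uniform(0, 1, (200, 2))
y_train = np.sin(X_train[:, 0]) + X_train[:, 1] + rng.normal(0, 0.1, 200)

def knn_predict(x0, k):
    d = np.sqrt(((X_train - x0) ** 2).sum(axis=1))  # step 2: distances from the center x0
    nearest = np.argsort(d)[:k]                     # indices of the k nearest neighbors
    return y_train[nearest].mean()                  # step 3 (regression): average response

x0 = np.array([0.5, 0.5])
for k in (1, 10, 50):                               # small k = high flexibility
    print(k, knn_predict(x0, k))
```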
NON-LINEAR MODELS: GENERALIZED LINEAR MODELS

Notation
• $\theta, \phi$: linear exponential family parameters; $\mu = \mathrm{E}[y]$: mean response.
• $b(\theta)$: cumulant function, with mean function $b'(\theta)$; $V(\mu)$: variance function.
• $g(\mu)$: link function.
• $\hat{\boldsymbol\beta}$: maximum likelihood estimate of $\boldsymbol\beta$; $l(\hat{\boldsymbol\beta})$: maximized log-likelihood.
• $l_0$: maximized log-likelihood, null model; $l_{SAT}$: maximized log-likelihood, saturated model.
• $e$: residual; $\mathbf{I}$: information matrix.
• $\chi^2_{1-\alpha,\,df}$: quantile of a chi-square distribution.
• $D^{*}$: scaled deviance; $D$: deviance statistic.

Linear Exponential Family
Probability function: $f(y; \theta, \phi) = \exp\left[\dfrac{y\theta - b(\theta)}{\phi} + S(y, \phi)\right]$
$\mathrm{E}[y] = b'(\theta) = \mu$, $\quad \mathrm{Var}[y] = \phi\, b''(\theta) = \phi\, V(\mu)$

Model Framework
• $g(\mu) = \mathbf{x}^{T}\boldsymbol\beta$
• The canonical link is the link function $g(\mu) = (b')^{-1}(\mu)$, i.e. the link for which $\theta = \mathbf{x}^{T}\boldsymbol\beta$.

Results for Distributions in the Exponential Family
• Normal ($\sigma^2$ fixed): $f(y) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[-\frac{(y-\mu)^2}{2\sigma^2}\right]$; $\theta = \mu$; $b(\theta) = \theta^2/2$; canonical link: $\mu$ (identity).
• Binomial ($n$ fixed): $f(y) = \binom{n}{y}\pi^{y}(1-\pi)^{n-y}$; $\theta = \ln\frac{\pi}{1-\pi}$; $b(\theta) = n\ln(1+e^{\theta})$; canonical link: $\ln\frac{\mu}{n-\mu}$ (logit).
• Poisson: $f(y) = \frac{\lambda^{y}e^{-\lambda}}{y!}$; $\theta = \ln\lambda$; $b(\theta) = e^{\theta}$; canonical link: $\ln\mu$ (log).
• Negative binomial ($r$ fixed): $f(y) = \binom{y+r-1}{y}(1-p)^{r}p^{y}$; $\theta = \ln p$; $b(\theta) = -r\ln(1-e^{\theta})$; canonical link: $\ln\frac{\mu}{\mu+r}$.
• Gamma ($\alpha$ fixed): $f(y) = \frac{y^{\alpha-1}e^{-y/\gamma}}{\Gamma(\alpha)\gamma^{\alpha}}$; $\theta = -1/\mu$; $b(\theta) = -\ln(-\theta)$; canonical link: $-1/\mu$.
• Inverse Gaussian ($\lambda$ fixed): $f(y) = \sqrt{\frac{\lambda}{2\pi y^{3}}}\exp\left[-\frac{\lambda(y-\mu)^2}{2\mu^2 y}\right]$; $\theta = -\frac{1}{2\mu^2}$; $b(\theta) = -\sqrt{-2\theta}$; canonical link: $-\frac{1}{2\mu^2}$.

Numerical Results
• $D^{*} = 2\big[l_{SAT} - l(\hat{\boldsymbol\beta})\big]$; $D = \phi D^{*}$. For MLR, $D = \mathrm{SS_{ERR}}$.
• Pseudo-$R^2 = \dfrac{l(\hat{\boldsymbol\beta}) - l_0}{l_{SAT} - l_0}$; max-scaled $R^2 = \dfrac{1 - \exp\{2[l_0 - l(\hat{\boldsymbol\beta})]/n\}}{1 - \exp\{2 l_0 / n\}}$.
• $\mathrm{AIC} = -2\, l(\hat{\boldsymbol\beta}) + 2(p+1)$; $\mathrm{BIC} = -2\, l(\hat{\boldsymbol\beta}) + (p+1)\ln n$. These assume only $\boldsymbol\beta$ needs to be estimated; if estimating $\phi$ is required, replace $p+1$ with $p+2$.

Residuals
• Raw residual: $e_i = y_i - \hat{\mu}_i$.
• Pearson residual: $e_i^{P} = \dfrac{y_i - \hat{\mu}_i}{\sqrt{V(\hat{\mu}_i)}}$; the Pearson chi-square statistic is $\sum_{i=1}^{n}(e_i^{P})^2$.
• Deviance residual: $e_i^{D} = \sqrt{D_i^{*}}$, whose sign follows the $i$th raw residual.
• Anscombe residual: $e_i^{A} = \dfrac{t(y_i) - \mathrm{E}[t(y_i)]}{\sqrt{\mathrm{Var}[t(y_i)]}}$ for a normalizing transformation $t$.

Parameter Estimation
$l(\boldsymbol\beta) = \sum_{i=1}^{n}\left[\dfrac{y_i\theta_i - b(\theta_i)}{\phi} + S(y_i, \phi)\right]$, where $\theta_i = (b')^{-1}\big(g^{-1}(\mathbf{x}_i^{T}\boldsymbol\beta)\big)$.
The score equations are the partial derivatives of $l(\boldsymbol\beta)$ with respect to each $\beta_j$, all set equal to 0. The solution to the score equations is $\hat{\boldsymbol\beta}$; then $\hat{\mu} = g^{-1}(\mathbf{x}^{T}\hat{\boldsymbol\beta})$.

Likelihood Ratio Tests
$\chi^2$ statistic $= 2\big[l(\hat{\boldsymbol\beta}_{full}) - l(\hat{\boldsymbol\beta}_{reduced})\big]$ for $H_0$: the dropped coefficients are all 0. Reject $H_0$ if the statistic $\ge \chi^2_{1-\alpha,\,df}$, where df is the number of dropped coefficients.

Goodness-of-Fit Tests
$H_0$: the data follow a distribution of choice with $r$ free parameters, whose domain is split into $w$ mutually exclusive intervals.
$\chi^2$ statistic $= \sum_{k=1}^{w}\dfrac{(n_k - E_k)^2}{E_k}$, where $n_k$ and $E_k$ are the observed and expected counts in interval $k$. Reject $H_0$ if the statistic $\ge \chi^2_{1-\alpha,\,w-r-1}$.

Tweedie Distribution
$\mathrm{E}[y] = \mu$, $\mathrm{Var}[y] = \phi\mu^{d}$. Power $d$: Normal 0; Poisson 1; Tweedie (compound Poisson-gamma) $1 < d < 2$; Gamma 2; Inverse Gaussian 3.

Inference
• Maximum likelihood estimators $\hat{\boldsymbol\beta}$ asymptotically have a multivariate normal distribution with mean $\boldsymbol\beta$ and asymptotic variance-covariance matrix $\mathbf{I}^{-1}$.
• To address overdispersion, scale the variance to $\mathrm{Var}[y_i] = \sigma^2 \phi\, b''(\theta_i)$ and estimate $\sigma^2$ as the Pearson chi-square statistic divided by $n - p - 1$.

Logistic and Probit Regression
• The odds of an event are the ratio of the probability that the event will occur to the probability that it will not: $\pi/(1-\pi)$.
• The odds ratio is the ratio of the odds of an event with the presence of a characteristic to the odds of the same event without the presence of that characteristic.

Binary Response Functions
• Logit: $g(\pi) = \ln\dfrac{\pi}{1-\pi}$
• Probit: $g(\pi) = \Phi^{-1}(\pi)$
• Complementary log-log: $g(\pi) = \ln(-\ln(1-\pi))$

Binary Response with Logit Link
$l(\boldsymbol\beta) = \sum_{i=1}^{n}\big[y_i\ln\pi_i + (1-y_i)\ln(1-\pi_i)\big]$
Score equations: $\dfrac{\partial}{\partial\boldsymbol\beta}\, l(\boldsymbol\beta) = \sum_{i=1}^{n}\mathbf{x}_i(y_i - \pi_i) = \mathbf{0}$; information matrix: $\mathbf{I} = \sum_{i=1}^{n}\pi_i(1-\pi_i)\,\mathbf{x}_i\mathbf{x}_i^{T}$.
$D^{*} = 2\sum_{i=1}^{n}\left[y_i\ln\dfrac{y_i}{\hat{\pi}_i} + (1-y_i)\ln\dfrac{1-y_i}{1-\hat{\pi}_i}\right]$
Pearson residual: $\dfrac{y_i - \hat{\pi}_i}{\sqrt{\hat{\pi}_i(1-\hat{\pi}_i)}}$; Pearson chi-square statistic: $\sum_{i=1}^{n}\dfrac{(y_i - \hat{\pi}_i)^2}{\hat{\pi}_i(1-\hat{\pi}_i)}$.

Nominal Response: Generalized Logit
Let $\pi_{i,c}$ be the probability that the $i$th observation is classified as category $c$, with reference category $r$:
$\ln\dfrac{\pi_{i,c}}{\pi_{i,r}} = \mathbf{x}_i^{T}\boldsymbol\beta_c$, so $\pi_{i,c} = \dfrac{\exp(\mathbf{x}_i^{T}\boldsymbol\beta_c)}{1 + \sum_{u\ne r}\exp(\mathbf{x}_i^{T}\boldsymbol\beta_u)}$ for $c \ne r$, and $\pi_{i,r} = \dfrac{1}{1 + \sum_{u\ne r}\exp(\mathbf{x}_i^{T}\boldsymbol\beta_u)}$.
$l(\boldsymbol\beta) = \sum_{i=1}^{n}\sum_{c} I(y_i = c)\ln\pi_{i,c}$

Ordinal Response: Proportional Odds Cumulative Model
$\Pi_c = \pi_1 + \cdots + \pi_c$; $\quad g(\Pi_c) = \alpha_c + \mathbf{x}_i^{T}\boldsymbol\beta$, where $\mathbf{x}_i = (x_{i,1}, \dots, x_{i,p})^{T}$ and $\boldsymbol\beta = (\beta_1, \dots, \beta_p)^{T}$ exclude the intercept.

Poisson Count Regression
$\ln\mu = \mathbf{x}^{T}\boldsymbol\beta$
$l(\boldsymbol\beta) = \sum_{i=1}^{n}\big[y_i\ln\mu_i - \mu_i - \ln(y_i!)\big]$; score equations: $\sum_{i=1}^{n}\mathbf{x}_i(y_i - \mu_i) = \mathbf{0}$.
$D^{*} = 2\sum_{i=1}^{n}\left[y_i\ln\dfrac{y_i}{\hat{\mu}_i} - (y_i - \hat{\mu}_i)\right]$
Pearson residual: $\dfrac{y_i - \hat{\mu}_i}{\sqrt{\hat{\mu}_i}}$; Pearson chi-square statistic: $\sum_{i=1}^{n}\dfrac{(y_i - \hat{\mu}_i)^2}{\hat{\mu}_i}$.

Poisson Regression with Exposures
$\ln\mu = \ln w + \mathbf{x}^{T}\boldsymbol\beta$, where $w$ is the exposure (an offset); a fitting sketch follows below.

Alternative Count Models
These models can incorporate a Poisson distribution while letting the mean of the response differ from the variance of the response:

Model             | Mean < Variance | Mean > Variance
Negative binomial | Yes             | No
Zero-inflated     | Yes             | No
Hurdle            | Yes             | Yes
Heterogeneity     | Yes             | No
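As a worked illustration of Poisson regression with exposures, the following minimal Python sketch maximizes the log-likelihood above by Fisher scoring (Newton-Raphson under the canonical log link), using the score equations and information matrix shown above. The simulated data and parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
w = rng.uniform(0.5, 2.0, n)                    # exposures (illustrative)
y = rng.poisson(w * np.exp(X @ np.array([-1.0, 0.5])))

beta = np.zeros(2)
for _ in range(25):                             # Fisher scoring iterations
    mu = w * np.exp(X @ beta)                   # ln(mu) = ln(w) + x'beta
    score = X.T @ (y - mu)                      # score equations
    info = X.T @ (X * mu[:, None])              # information matrix
    beta = beta + np.linalg.solve(info, score)

mu = w * np.exp(X @ beta)
se = np.sqrt(np.diag(np.linalg.inv(X.T @ (X * mu[:, None]))))  # asymptotic se from I^(-1)
pearson = np.sum((y - mu) ** 2 / mu)            # Pearson chi-square statistic
print("beta_hat:", beta, "se:", se)
print("overdispersion estimate:", pearson / (n - 2))  # divided by n - p - 1, here p = 1
```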
TIME SERIES

Trend Models

Notation
• Subscript $t$: index for observations; $T_t$: trends in time; $S_t$: seasonal trends; $\varepsilon_t$: random patterns.
• $\hat{y}_{n+k}$: $k$-step ahead forecast; $se$: estimated standard error; $t_{1-\alpha/2,\,df}$: quantile of a $t$-distribution.
• $n_1$: training sample size; $n_2$: test sample size.

Trends
• Additive: $y_t = T_t + S_t + \varepsilon_t$
• Multiplicative: $y_t = T_t \times S_t \times \varepsilon_t$

Autoregressive Models

Notation
• $\rho_k$: lag-$k$ autocorrelation; $r_k$: lag-$k$ sample autocorrelation.
• $\sigma^2$: variance of white noise; $s^2$: estimate of $\sigma^2$.
• $b_0, b_1$: estimates of $\beta_0, \beta_1$.
• $\bar{y}_{-}$: sample mean of the first $n-1$ observations; $\bar{y}_{+}$: sample mean of the last $n-1$ observations.

Autocorrelation
$r_k = \dfrac{\sum_{t=k+1}^{n}(y_{t-k} - \bar{y})(y_t - \bar{y})}{\sum_{t=1}^{n}(y_t - \bar{y})^2}$

Testing Autocorrelation
test statistic $= \dfrac{r_k}{se(r_k)}$, with $se(r_k) \approx 1/\sqrt{n}$. Reject $H_0\!: \rho_k = 0$ against $H_1\!: \rho_k \ne 0$ if the absolute value of the statistic is at least the $t_{1-\alpha/2}$ quantile.

White Noise
$\hat{y}_{n+k} = \bar{y}$; $\;se(\text{forecast}) = s\sqrt{1 + 1/n}$.
$100(1-\alpha)\%$ prediction interval for $y_{n+k}$: $\bar{y} \pm t_{1-\alpha/2,\,n-1}\; s\sqrt{1 + 1/n}$.

Random Walk
Differences: $w_t = y_t - y_{t-1}$, with sample mean $\bar{w}$ and standard deviation $s_w$.
$\hat{y}_{n+k} = y_n + k\bar{w}$; $\;se(\text{forecast}) = s_w\sqrt{k}$.
Approximate prediction interval for $y_{n+k}$: $\hat{y}_{n+k} \pm 2\, s_w\sqrt{k}$.

Model Comparison
Computed over the test set of size $n_2$, with forecast errors $e_t = y_t - \hat{y}_t$:
• $\mathrm{ME} = \dfrac{1}{n_2}\sum e_t$
• $\mathrm{MPE} = \dfrac{100}{n_2}\sum \dfrac{e_t}{y_t}$
• $\mathrm{MSE} = \dfrac{1}{n_2}\sum e_t^2$
• $\mathrm{MAE} = \dfrac{1}{n_2}\sum |e_t|$
• $\mathrm{MAPE} = \dfrac{100}{n_2}\sum \left|\dfrac{e_t}{y_t}\right|$

Stationarity
Stationarity describes how something does not vary with respect to time. Control charts can be used to identify stationarity.

AR(1) Model
$y_t = \beta_0 + \beta_1 y_{t-1} + \varepsilon_t$
• If $\beta_1 = 0$, $y_t$ follows a white noise process.
• If $\beta_1 = 1$, $y_t$ follows a random walk process.
• If $-1 < \beta_1 < 1$, $y_t$ is stationary.
Assumptions: $\mathrm{E}[\varepsilon_t] = 0$; $\mathrm{Var}[\varepsilon_t] = \sigma^2$; $\mathrm{Cov}[\varepsilon_t, \varepsilon_{t+k}] = 0$ for $k \ne 0$.

Properties of a Stationary AR(1) Model
$\mathrm{E}[y_t] = \dfrac{\beta_0}{1 - \beta_1}$, $\quad \mathrm{Var}[y_t] = \dfrac{\sigma^2}{1 - \beta_1^2}$, $\quad \rho_k = \beta_1^{k}$

Estimation
$b_1 = \dfrac{\sum_{t=2}^{n}(y_{t-1} - \bar{y}_{-})(y_t - \bar{y}_{+})}{\sum_{t=2}^{n}(y_{t-1} - \bar{y}_{-})^2}$, $\quad b_0 = \bar{y}_{+} - b_1\bar{y}_{-} \approx \bar{y}(1 - b_1)$, $\quad s^2 = \dfrac{\sum_{t=2}^{n}(e_t - \bar{e})^2}{n - 3}$

Predictions
$\hat{y}_{n+k} = b_0 + b_1\hat{y}_{n+k-1}$, starting from $\hat{y}_n = y_n$.
$se(\text{forecast}) = s\sqrt{1 + b_1^2 + b_1^4 + \cdots + b_1^{2(k-1)}}$
$100(1-\alpha)\%$ prediction interval for $y_{n+k}$: $\hat{y}_{n+k} \pm t_{1-\alpha/2,\,n-3}\; se(\text{forecast})$.
A fitting-and-forecasting sketch follows at the end of this section.

Other Time Series Models

Notation
• $k$: moving average length; $w$: smoothing parameter; $SB$: seasonal base; $m$: no. of trigonometric functions.

Smoothing with Moving Averages ($y_t = \beta_0 + \varepsilon_t$)
$\hat{s}_t = \dfrac{y_t + y_{t-1} + \cdots + y_{t-k+1}}{k} = \hat{s}_{t-1} + \dfrac{y_t - y_{t-k}}{k}$
Predictions: $\hat{y}_{n+l} = \hat{s}_n$, $\;l = 1, 2, \dots$

Double Smoothing with Moving Averages ($y_t = \beta_0 + \beta_1 t + \varepsilon_t$)
$\hat{s}_t^{(2)} = \dfrac{\hat{s}_t + \hat{s}_{t-1} + \cdots + \hat{s}_{t-k+1}}{k}$
Predictions: $\hat{y}_{n+l} = \big(2\hat{s}_n - \hat{s}_n^{(2)}\big) + l\,\dfrac{2\big(\hat{s}_n - \hat{s}_n^{(2)}\big)}{k - 1}$

Exponential Smoothing ($y_t = \beta_0 + \varepsilon_t$)
$\hat{s}_t = (1 - w)\big(y_t + w y_{t-1} + \cdots + w^{t} y_0\big) = (1 - w)y_t + w\hat{s}_{t-1}$, $\;0 < w < 1$
The value of $w$ is determined by minimizing $SS(w) = \sum_{t=1}^{n}(y_t - \hat{s}_{t-1})^2$.
Predictions: $\hat{y}_{n+l} = \hat{s}_n$.

Double Exponential Smoothing ($y_t = \beta_0 + \beta_1 t + \varepsilon_t$)
$\hat{s}_t^{(2)} = (1 - w)\hat{s}_t + w\hat{s}_{t-1}^{(2)}$
$b_1 = \dfrac{1 - w}{w}\big(\hat{s}_n - \hat{s}_n^{(2)}\big)$; predictions: $\hat{y}_{n+l} = \big(2\hat{s}_n - \hat{s}_n^{(2)}\big) + l\, b_1$

Key Ideas for Smoothing
• Single smoothing is only appropriate for time series data without a linear trend.
• Exponential smoothing is related to weighted least squares.
• A double smoothing procedure can be used to forecast time series data with a linear trend.
• Holt-Winter double exponential smoothing is a generalization of double exponential smoothing.

Seasonal Time Series Models
• Fixed seasonal effects with trigonometric functions: $S_t = \sum_{i=1}^{m}\big[\beta_{1,i}\sin(f_i t) + \beta_{2,i}\cos(f_i t)\big]$, where $f_i = \dfrac{2\pi i}{SB}$.
• Seasonal autoregressive models: $y_t = \beta_0 + \beta_1 y_{t-SB} + \cdots + \beta_p y_{t-p \cdot SB} + \varepsilon_t$.
• Holt-Winter seasonal additive model: $y_t = \beta_0 + \beta_1 t + S_t + \varepsilon_t$.

Unit Root Test
• A unit root test is used to evaluate the fit of a random walk model.
• A random walk model is a good fit if the time series possesses a unit root.
• The Dickey-Fuller test and augmented Dickey-Fuller test are two examples of unit root tests.

Volatility Models
ARCH($p$): $\sigma_t^2 = \gamma_0 + \gamma_1\varepsilon_{t-1}^2 + \cdots + \gamma_p\varepsilon_{t-p}^2$, with $\mathrm{Var}[\varepsilon_t] = \dfrac{\gamma_0}{1 - \sum_{j=1}^{p}\gamma_j}$.
GARCH($p, q$): $\sigma_t^2 = \gamma_0 + \gamma_1\varepsilon_{t-1}^2 + \cdots + \gamma_p\varepsilon_{t-p}^2 + \delta_1\sigma_{t-1}^2 + \cdots + \delta_q\sigma_{t-q}^2$, with $\mathrm{Var}[\varepsilon_t] = \dfrac{\gamma_0}{1 - \sum_{j=1}^{p}\gamma_j - \sum_{j=1}^{q}\delta_j}$.
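A minimal Python sketch of the AR(1) estimation and $k$-step forecast formulas above, on simulated data; the model parameters and series length are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
n, beta0, beta1 = 200, 2.0, 0.6
y = np.empty(n)
y[0] = beta0 / (1 - beta1)                     # start at the stationary mean
for t in range(1, n):
    y[t] = beta0 + beta1 * y[t - 1] + rng.normal(0, 1.0)

lag, cur = y[:-1], y[1:]                       # y_(t-1) and y_t
b1 = np.sum((lag - lag.mean()) * (cur - cur.mean())) / np.sum((lag - lag.mean()) ** 2)
b0 = cur.mean() - b1 * lag.mean()
e = cur - (b0 + b1 * lag)
s2 = np.sum((e - e.mean()) ** 2) / (n - 3)     # n - 1 residuals, 2 parameters

yhat = y[-1]
for k in range(1, 4):                          # k-step forecasts and standard errors
    yhat = b0 + b1 * yhat
    se_k = np.sqrt(s2 * np.sum(b1 ** (2 * np.arange(k))))  # s*sqrt(1 + b1^2 + ... + b1^(2(k-1)))
    print(f"{k}-step forecast {yhat:.3f}, se {se_k:.3f}")
```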
DECISION TREES

Regression and Classification Trees

Notation
• $R_m$: region of the predictor space; $n_m$: no. of observations in node $m$; $n_{m,c}$: no. of category-$c$ observations in node $m$.
• $I_m$: impurity; $E_m$: classification error rate; $G_m$: Gini index; $D_m$: cross entropy.
• $T$: subtree; $|T|$: no. of terminal nodes in $T$; $\lambda$: tuning parameter.

Algorithm
1. Construct a large tree with terminal nodes using recursive binary splitting.
2. Obtain a sequence of best subtrees, as a function of $\lambda$, using cost complexity pruning.
3. Choose $\lambda$ by applying $k$-fold cross-validation. Select the $\lambda$ that results in the lowest cross-validation error.
4. The best subtree is the subtree created in step 2 with the selected value of $\lambda$.

Recursive Binary Splitting
• Regression: minimize $\sum_{m=1}^{|T|}\sum_{i:\,\mathbf{x}_i\in R_m}\big(y_i - \hat{y}_{R_m}\big)^2$.
• Classification: minimize $\sum_{m=1}^{|T|} n_m I_m$.

Cost Complexity Pruning
• Regression: minimize $\sum_{m=1}^{|T|}\sum_{i:\,\mathbf{x}_i\in R_m}\big(y_i - \hat{y}_{R_m}\big)^2 + \lambda|T|$.
• Classification: minimize $\sum_{m=1}^{|T|} n_m I_m + \lambda|T|$.

Node Impurity Measures for Classification
$\hat{p}_{m,c} = n_{m,c}/n_m$
• Classification error rate: $E_m = 1 - \max_c \hat{p}_{m,c}$
• Gini index: $G_m = \sum_{c}\hat{p}_{m,c}\big(1 - \hat{p}_{m,c}\big)$
• Cross entropy: $D_m = -\sum_{c}\hat{p}_{m,c}\ln\hat{p}_{m,c}$
• Deviance $= -2\sum_{m=1}^{|T|}\sum_{c} n_{m,c}\ln\hat{p}_{m,c}$; residual mean deviance $=$ deviance$/(n - |T|)$.

Key Ideas
• Terminal nodes, or leaves, represent the partitions of the predictor space.
• Internal nodes are points along the tree where splits occur.
• Terminal nodes do not have child nodes, but internal nodes do.
• Branches are lines that connect any two nodes.
• A decision tree with only one internal node is called a stump.

Advantages of Trees
• Easy to interpret and explain.
• Can be presented visually.
• Manage categorical variables without the need for dummy variables.
• Mimic human decision-making.

Disadvantages of Trees
• Not robust.
• Do not have the same degree of predictive accuracy as other statistical methods.

Multiple Trees

Bagging
1. Create $B$ bootstrap samples from the original training dataset.
2. Construct a decision tree for each bootstrap sample using recursive binary splitting.
3. Predict the response of a new observation by averaging the predictions (regression trees) or by using the most frequent category (classification trees) across all $B$ trees.
Properties: increasing $B$ does not cause overfitting; bagging reduces variance; out-of-bag error is a valid estimate of test error.

Random Forests
1. Create $B$ bootstrap samples from the original training dataset.
2. Construct a decision tree for each bootstrap sample using recursive binary splitting. At each split, only a random subset of $m$ variables is considered.
3. Predict the response of a new observation by averaging the predictions (regression trees) or by using the most frequent category (classification trees) across all $B$ trees.
Properties: bagging is a special case of random forests ($m = p$); increasing $B$ does not cause overfitting; decreasing $m$ reduces the correlation between predictions.

Boosting
Let $r_1$ be the actual response variable $y$.
1. For $b = 1, 2, \dots, B$:
   • Use recursive binary splitting to fit a tree $\hat{f}_b$ with $d$ splits to the data, with $r_b$ as the response.
   • Update $r_b$ by subtracting $\lambda\hat{f}_b(\mathbf{x})$, i.e. let $r_{b+1} = r_b - \lambda\hat{f}_b(\mathbf{x})$.
2. Calculate the boosted model prediction as $\hat{f}(\mathbf{x}) = \lambda\sum_{b=1}^{B}\hat{f}_b(\mathbf{x})$.
Properties: increasing $B$ can cause overfitting; boosting reduces bias; $d$ controls the complexity of the boosted model; $\lambda$ controls the rate at which boosting learns.

UNSUPERVISED LEARNING

Principal Components Analysis

Notation
• $z$: principal component score; subscript $m$: index for principal components.
• $\phi_{j,m}$: principal component loading; $x_j$: centered explanatory variable.

Principal Components
$z_m = \sum_{j=1}^{p}\phi_{j,m}\, x_j$, $\quad z_{i,m} = \sum_{j=1}^{p}\phi_{j,m}\, x_{i,j}$
• $\sum_{j=1}^{p}\phi_{j,m}^2 = 1$
• $\sum_{j=1}^{p}\phi_{j,m}\phi_{j,u} = 0$ for $m \ne u$

Proportion of Variance Explained (PVE)
Total variance $= \sum_{j=1}^{p}\dfrac{1}{n-1}\sum_{i=1}^{n}x_{i,j}^2$; variance of the $m$th component $= \dfrac{1}{n-1}\sum_{i=1}^{n}z_{i,m}^2$.
$\mathrm{PVE}_m = \dfrac{\text{variance of the } m\text{th component}}{\text{total variance}}$

Key Ideas
• The variance explained by each subsequent principal component is always less than the variance explained by the previous principal component.
• All principal components are uncorrelated with one another.
• A dataset has $\min(n - 1, p)$ distinct principal components.
• The first $M$ principal component scores and loadings approximate the original dataset: $x_{i,j} \approx \sum_{m=1}^{M} z_{i,m}\phi_{j,m}$.

Principal Components Regression
$y = \theta_0 + \theta_1 z_1 + \cdots + \theta_M z_M + \varepsilon$
• If $M = p$, then $\beta_j = \sum_{m=1}^{M}\theta_m\phi_{j,m}$.

Cluster Analysis

Notation
• $C_k$: the set of indices in cluster $k$; $|C_k|$: no. of observations in cluster $k$; $W(C_k)$: within-cluster variation of cluster $k$.

Euclidean Distance
$d(\mathbf{x}_i, \mathbf{x}_u) = \sqrt{\sum_{j=1}^{p}(x_{i,j} - x_{u,j})^2}$

k-Means Clustering
1. Randomly assign a cluster to each observation. This serves as the initial cluster assignment.
2. Calculate the centroid of each cluster.
3. For each observation, identify the closest centroid and reassign it to that cluster.
4. Repeat steps 2 and 3 until the cluster assignments stop changing.
The goal is to minimize the total within-cluster variation, where
$W(C_k) = \dfrac{1}{|C_k|}\sum_{i,u\in C_k}\sum_{j=1}^{p}(x_{i,j} - x_{u,j})^2 = 2\sum_{i\in C_k}\sum_{j=1}^{p}(x_{i,j} - \bar{x}_{k,j})^2$.
A sketch of this algorithm follows at the end of this section.

Hierarchical Clustering
1. Select the dissimilarity measure and linkage to be used. Treat each observation as its own cluster.
2. For $k = n, n-1, \dots, 2$:
   • Compute the inter-cluster dissimilarity between all $k$ clusters, examining all $\binom{k}{2}$ pairwise dissimilarities.
   • The two clusters with the lowest inter-cluster dissimilarity are fused. The dissimilarity indicates the height in the dendrogram at which these two clusters join.

Linkage (inter-cluster dissimilarity)
• Complete: the largest pairwise dissimilarity.
• Single: the smallest pairwise dissimilarity.
• Average: the arithmetic mean of the pairwise dissimilarities.
• Centroid: the dissimilarity between the cluster centroids.

Key Ideas
• For $k$-means clustering, the algorithm needs to be repeated for each candidate number of clusters.
• For hierarchical clustering, the algorithm only needs to be performed once for any number of clusters.
• The result of clustering depends on many parameters, such as: the choice of the number of clusters in $k$-means clustering; the choice of the number of clusters, linkage, and dissimilarity measure in hierarchical clustering; and the choice to standardize variables.
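A minimal Python sketch of the $k$-means steps above, including the within-cluster variation identity; the two well-separated simulated groups are an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 0.5, (50, 2)),     # two well-separated groups
               rng.normal(3, 0.5, (50, 2))])
K = 2

labels = rng.integers(K, size=len(X))           # step 1: random initial clusters
for _ in range(100):                            # steps 2-4
    centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    new_labels = d2.argmin(axis=1)              # reassign each point to the closest centroid
    if np.array_equal(new_labels, labels):      # stop when assignments stop changing
        break
    labels = new_labels

# within-cluster variation via the centroid identity above
W = [2 * ((X[labels == k] - centroids[k]) ** 2).sum() for k in range(K)]
print("cluster sizes:", np.bincount(labels), "W(C_k):", W)
```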