SRM
Updated 03/13/23

STATISTICAL LEARNING

Types of Variables
Response      A variable of primary interest
Explanatory   A variable used to study the response variable
Count         A quantitative variable usually valid on non-negative integers
Continuous    A real-valued quantitative variable
Nominal       A categorical/qualitative variable having categories without a meaningful or logical order
Ordinal       A categorical/qualitative variable having categories with a meaningful or logical order

Notation
y, Y                         Response variable
x, X                         Explanatory variable
Subscript i                  Index for observations
n                            No. of observations
Subscript j                  Index for variables except response
p                            No. of variables except response
\mathbf{X}^{\mathsf{T}}      Transpose of matrix \mathbf{X}
\mathbf{X}^{-1}              Inverse of matrix \mathbf{X}
\varepsilon                  Error term
\hat{y}, \hat{Y}, \hat{f}(x) Estimate/estimator of f(x)

Statistical Learning Problems
Data modeling problems are either supervised (has a response variable) or unsupervised (no response variable); supervised problems are either regression (quantitative response variable) or classification (categorical response variable).

Regression Problems
Y = f(x_1, \ldots, x_p) + \varepsilon where E[\varepsilon] = 0, so E[Y] = f(x_1, \ldots, x_p)

Test MSE = E[(Y - \hat{Y})^2], which can be estimated using \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2

For fixed inputs x_1, \ldots, x_p, the test MSE is
\underbrace{Var[\hat{f}(x_1, \ldots, x_p)] + (Bias[\hat{f}(x_1, \ldots, x_p)])^2}_{reducible\ error} + \underbrace{Var[\varepsilon]}_{irreducible\ error}

Classification Problems
Test Error Rate = E[I(Y \ne \hat{Y})], which can be estimated using \frac{1}{n}\sum_{i=1}^{n} I(y_i \ne \hat{y}_i)

Bayes Classifier:
f(x_1, \ldots, x_p) = \arg\max_c \Pr(Y = c \mid X_1 = x_1, \ldots, X_p = x_p)

Contrasting Statistical Learning Elements
Supervised (has response variable)                    vs. Unsupervised (no response variable)
Regression (quantitative response variable)           vs. Classification (categorical response variable)
Parametric (functional form of f specified)           vs. Non-parametric (functional form of f not specified)
Prediction (output of \hat{f})                        vs. Inference (comprehension of f)
Flexibility (\hat{f}'s ability to follow the data)    vs. Interpretability (\hat{f}'s ability to be understood)
Training (observations used to train/obtain \hat{f})  vs. Test (observations not used to train/obtain \hat{f})

Key Ideas
• The disadvantage of parametric methods is the danger of choosing a form for f that is not close to the truth.
• The disadvantage of non-parametric methods is the need for an abundance of observations.
• Flexibility and interpretability are typically at odds.
• As flexibility increases, the training MSE (or error rate) decreases, but the test MSE (or error rate) follows a U-shaped pattern.
• Low flexibility leads to a method with low variance and high bias; high flexibility leads to a method with high variance and low bias.
LINEAR MODELS

Simple Linear Regression (SLR)
Special case of MLR where p = 1

Estimation
b_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}
b_0 = \bar{y} - b_1\bar{x}
SLR Inferences

Standard Errors
se_{b_0} = \sqrt{MSE\left(\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right)}
se_{b_1} = \sqrt{\frac{MSE}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}
se_{\hat{y}} = \sqrt{MSE\left(\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right)}
se_{\hat{y}_{n+1}} = \sqrt{MSE\left(1 + \frac{1}{n} + \frac{(x_{n+1} - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right)}
Multiple Linear Regression (MLR)
Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \varepsilon

Notation
\beta_j      The jth regression coefficient
b_j          Estimate of \beta_j
\sigma^2     Variance of response / irreducible error
MSE          Estimate of \sigma^2
\mathbf{X}   Design matrix
\mathbf{H}   Hat matrix
e            Residual
SST          Total sum of squares
SSR          Regression sum of squares
SSE          Error sum of squares
Assumptions
1. Y_i = \beta_0 + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p} + \varepsilon_i
2. x_{i,j}'s are non-random
3. E[\varepsilon_i] = 0
4. Var[\varepsilon_i] = \sigma^2
5. \varepsilon_i's are independent
6. \varepsilon_i's are normally distributed
7. The predictor x_j is not a linear combination of the other p predictors, for j = 0, 1, \ldots, p
Estimation – Ordinary Least Squares (OLS)
\hat{y} = b_0 + b_1 x_1 + \cdots + b_p x_p
\mathbf{b} = \begin{pmatrix} b_0 \\ \vdots \\ b_p \end{pmatrix} = (\mathbf{X}^{\mathsf{T}}\mathbf{X})^{-1}\mathbf{X}^{\mathsf{T}}\mathbf{y}
MSE = SSE/(n - p - 1)
residual standard error = \sqrt{MSE}
Other Numerical Results
\mathbf{H} = \mathbf{X}(\mathbf{X}^{\mathsf{T}}\mathbf{X})^{-1}\mathbf{X}^{\mathsf{T}}
\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}
e = y - \hat{y}
SST = \sum_{i=1}^{n}(y_i - \bar{y})^2 = total variability
SSR = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 = explained
SSE = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = unexplained
SST = SSR + SSE
R^2 = SSR/SST
R^2_{adj} = 1 - \frac{MSE}{s_y^2} = 1 - (1 - R^2)\left(\frac{n-1}{n-p-1}\right)
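As a concrete check of the OLS quantities above, here is a minimal NumPy sketch (the simulated data and variable names are ours, chosen for illustration):

```python
# Minimal OLS sketch: b = (X'X)^{-1} X'y, MSE, R^2, and adjusted R^2.
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # design matrix with intercept
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)        # OLS coefficient estimates
e = y - X @ b                                # residuals
sse = e @ e
sst = np.sum((y - y.mean()) ** 2)
mse = sse / (n - p - 1)                      # estimate of sigma^2
r2 = 1 - sse / sst
r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
```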
Key Ideas
• R^2 is a poor measure for model comparison because it will increase simply by adding more predictors to a model.
• A polynomial does not change consistently with unit increases of its variable, i.e. there is no constant slope.
• Only w - 1 dummy variables are needed to represent w classes of a categorical predictor; one of the classes acts as a baseline.
• In effect, dummy variables define a distinct intercept for each class. Without the interaction between a dummy variable and a predictor, the dummy variable cannot additionally affect that predictor's regression coefficient.
MLR Inferences

Notation
\hat{\beta}_j        Estimator for \beta_j
\hat{y}              Estimator for E[Y]
se                   Estimated standard error
H_0                  Null hypothesis
H_1                  Alternative hypothesis
df                   Degrees of freedom
t_{1-q, df}          q quantile of a t-distribution
\alpha               Significance level
c                    Confidence level
ndf                  Numerator degrees of freedom
ddf                  Denominator degrees of freedom
F_{1-q, ndf, ddf}    q quantile of an F-distribution
Y_{n+1}              Response of new observation
Subscript R          Reduced model
Subscript F          Full model

Standard Errors
se_{b_j} = \sqrt{\widehat{Var}[\hat{\beta}_j]}

Variance-Covariance Matrix
\widehat{Var}[\mathbf{b}] = MSE\,(\mathbf{X}^{\mathsf{T}}\mathbf{X})^{-1} =
\begin{bmatrix}
\widehat{Var}[\hat{\beta}_0] & \widehat{Cov}[\hat{\beta}_0, \hat{\beta}_1] & \cdots & \widehat{Cov}[\hat{\beta}_0, \hat{\beta}_p] \\
\widehat{Cov}[\hat{\beta}_0, \hat{\beta}_1] & \widehat{Var}[\hat{\beta}_1] & \cdots & \widehat{Cov}[\hat{\beta}_1, \hat{\beta}_p] \\
\vdots & \vdots & \ddots & \vdots \\
\widehat{Cov}[\hat{\beta}_0, \hat{\beta}_p] & \widehat{Cov}[\hat{\beta}_1, \hat{\beta}_p] & \cdots & \widehat{Var}[\hat{\beta}_p]
\end{bmatrix}

t Tests
t statistic = \frac{estimate - hypothesized\ value}{standard\ error}
H_0: \beta_j = hypothesized value

Test Type      Rejection Region
Two-tailed     |t statistic| \ge t_{\alpha/2,\, n-p-1}
Right-tailed   t statistic \ge t_{\alpha,\, n-p-1}
Left-tailed    t statistic \le -t_{\alpha,\, n-p-1}

F Tests
F statistic = \frac{MSR}{MSE} = \frac{SSR/p}{SSE/(n-p-1)}
H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0
Reject H_0 if F statistic \ge F_{\alpha, ndf, ddf}
• ndf = p
• ddf = n - p - 1
Partial F Tests
F statistic = \frac{(SSE_R - SSE_F)/(p_F - p_R)}{SSE_F/(n - p_F - 1)}
H_0: Some \beta_j's = 0
Reject H_0 if F statistic \ge F_{\alpha, ndf, ddf}
• ndf = p_F - p_R
• ddf = n - p_F - 1

For all hypothesis tests, reject H_0 if p-value \le \alpha.
Confidence and Prediction Intervals
estimate ± (t quantile)(standard error)

Quantity    Interval Expression
\beta_j     b_j \pm t_{(1-c)/2,\, n-p-1} \cdot se_{b_j}
E[Y]        \hat{y} \pm t_{(1-c)/2,\, n-p-1} \cdot se_{\hat{y}}
Y_{n+1}     \hat{y}_{n+1} \pm t_{(1-c)/2,\, n-p-1} \cdot se_{\hat{y}_{n+1}}
Linear Model Assumptions
Leverage
h_i = \mathbf{x}_i^{\mathsf{T}}(\mathbf{X}^{\mathsf{T}}\mathbf{X})^{-1}\mathbf{x}_i = \frac{se_{\hat{y}_i}^2}{MSE}
h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{k=1}^{n}(x_k - \bar{x})^2} for SLR
• 1/n \le h_i \le 1
• \sum_{i=1}^{n} h_i = p + 1
• Frees rule of thumb: h_i > 3\left(\frac{p+1}{n}\right)
Studentized and Standardized Residuals
e_{stu,i} = \frac{e_i}{\sqrt{MSE_{(i)}(1 - h_i)}}
e_{sta,i} = \frac{e_i}{\sqrt{MSE(1 - h_i)}}
• Frees rule of thumb: |e_{sta,i}| > 2

Cook's Distance
D_i = \frac{\sum_{k=1}^{n}(\hat{y}_k - \hat{y}_{(i)k})^2}{MSE(p+1)} = \frac{e_i^2 h_i}{MSE(p+1)(1 - h_i)^2}
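A short sketch of these diagnostics computed from the hat matrix (illustrative; the helper name is ours):

```python
# Leverages, standardized residuals, and Cook's distance for an OLS fit.
import numpy as np

def ols_diagnostics(X, y):
    n, q = X.shape                           # q = p + 1 columns, including intercept
    H = X @ np.linalg.inv(X.T @ X) @ X.T     # hat matrix
    h = np.diag(H)                           # leverages; sum(h) equals p + 1
    e = y - H @ y                            # residuals
    mse = e @ e / (n - q)
    e_sta = e / np.sqrt(mse * (1 - h))       # standardized residuals
    D = e**2 * h / (mse * q * (1 - h)**2)    # Cook's distance
    return h, e_sta, D
```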
Plots of Residuals
• e versus \hat{y}
  Residuals are well-behaved if
  o Points appear to be randomly scattered
  o Residuals seem to average to 0
  o Spread of residuals does not change
• e versus i
  Detects dependence of error terms
• qq plot of e
  Assesses normality of error terms
Variance Inflation Factor
VIF_j = \frac{s_{x_j}^2 (n-1)\, se_{b_j}^2}{MSE} = \frac{1}{1 - R_j^2}
Tolerance is the reciprocal of VIF.
• Frees rule of thumb: any VIF_j \ge 10
Key Ideas
• As realizations of a ๐ก๐ก-distribution,
studentized residuals can help
identify outliers.
• When residuals have a larger spread for
larger predictions, one solution is to
transform the response variable with a
concave function.
• There is no universal approach to
handling multicollinearity; it is even
possible to accept it, such as when there
is a suppressor variable. On the other
hand, it can be eliminated by using a set
of orthogonal predictors.
Model Selection

Notation
k        Total no. of predictors in consideration
p        No. of predictors for a specific model
MSE_k    MSE of the model that uses all k predictors
M_p      The "best" model with p predictors

Best Subset Selection
1. For p = 0, 1, \ldots, k, fit all \binom{k}{p} models with p predictors. The model with the largest R^2 is M_p.
2. Choose the best model among M_0, \ldots, M_k using a selection criterion of choice.
Forward Stepwise Selection
1. Fit all k simple linear regression models. The model with the largest R^2 is M_1.
2. For p = 2, \ldots, k, fit the models that add one of the remaining predictors to M_{p-1}. The model with the largest R^2 is M_p.
3. Choose the best model among M_0, \ldots, M_k using a selection criterion of choice.
Backward Stepwise Selection
1. Fit the model with all k predictors, M_k.
2. For p = k-1, \ldots, 1, fit the models that drop one of the predictors from M_{p+1}. The model with the largest R^2 is M_p.
3. Choose the best model among M_0, \ldots, M_k using a selection criterion of choice.
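A sketch of forward stepwise selection by training R^2 (illustrative helper names; in practice the resulting sequence M_1, ..., M_k is then compared with C_p, AIC, BIC, or CV error):

```python
# Forward stepwise: at each step, add the predictor that maximizes R^2.
import numpy as np

def ols_r2(X, y):
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    return 1 - (e @ e) / np.sum((y - y.mean()) ** 2)

def forward_stepwise(X, y):
    n, k = X.shape
    chosen, remaining, path = [], list(range(k)), []
    for _ in range(k):
        best = max(remaining, key=lambda j: ols_r2(
            np.column_stack([np.ones(n), X[:, chosen + [j]]]), y))
        chosen.append(best)
        remaining.remove(best)
        path.append(list(chosen))     # predictor indices of M_1, ..., M_k
    return path
```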
Selection Criteria
• Mallows' C_p
  C_p = \frac{SSE + 2p \cdot MSE_k}{n}
  C_p = \frac{SSE}{MSE_k} - n + 2(p + 1)
• Akaike information criterion
  AIC = \frac{SSE + 2p \cdot MSE_k}{n \cdot MSE_k}
• Bayesian information criterion
  BIC = \frac{SSE + \ln n \cdot p \cdot MSE_k}{n \cdot MSE_k}
• Adjusted R^2
• Cross-validation error
Validation Set
• Randomly splits all available
observations into two groups: the
training set and the validation set.
• Only the observations in the training set
are used to obtain the fitted model, and those in the validation set are used to estimate the test MSE.
k-fold Cross-Validation
1. Randomly divide all available observations into k folds.
2. For v = 1, \ldots, k, obtain the vth fit by training with all observations except those in the vth fold.
3. For v = 1, \ldots, k, use \hat{y} from the vth fit to calculate a test MSE estimate with the observations in the vth fold.
4. To calculate CV error, average the k test MSE estimates in the previous step.
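A minimal sketch of those four steps for an OLS fit (fold logic only; names are ours):

```python
# k-fold CV: average the test-MSE estimates across the k held-out folds.
import numpy as np

def cv_error(X, y, k=5, seed=0):
    idx = np.random.default_rng(seed).permutation(len(y))
    fold_mses = []
    for test in np.array_split(idx, k):          # step 1: k folds
        train = np.setdiff1d(idx, test)
        b, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)  # step 2
        e = y[test] - X[test] @ b                # step 3: held-out errors
        fold_mses.append(np.mean(e ** 2))
    return np.mean(fold_mses)                    # step 4: CV error
```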
Leave-one-out Cross-Validation (LOOCV)
• Calculate LOOCV error as a special case of k-fold cross-validation where k = n.
• For MLR:
  LOOCV Error = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{y}_i}{1 - h_i}\right)^2
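For MLR the shortcut above avoids refitting the model n times; a sketch:

```python
# LOOCV error from a single fit, using the leverage shortcut.
import numpy as np

def loocv_error(X, y):
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)                      # leverages
    e = y - H @ y                       # residuals from the full fit
    return np.mean((e / (1 - h)) ** 2)
```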
Key Ideas on Cross-Validation
• The validation set approach has unstable
results and will tend to overestimate the
test MSE. The two other approaches
mitigate these issues.
• With respect to bias, LOOCV < k-fold CV < Validation Set.
• With respect to variance, LOOCV > k-fold CV > Validation Set.
Other Regression Approaches
Standardizing Variables
• A centered variable is the result of subtracting the sample mean from a variable.
• A scaled variable is the result of dividing a variable by its sample standard deviation.
• A standardized variable is the result of first centering a variable, then scaling it.

Weighted Least Squares
• Var[\varepsilon_i] = \sigma^2/w_i
• Equivalent to running OLS with \sqrt{w}\,y as the response and \sqrt{w}\,\mathbf{x} as the predictors, hence minimizing \sum_{i=1}^{n} w_i(y_i - \hat{y}_i)^2.
• \mathbf{b} = (\mathbf{X}^{\mathsf{T}}\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^{\mathsf{T}}\mathbf{W}\mathbf{y} where \mathbf{W} is the diagonal matrix of the weights.

Partial Least Squares
• The first partial least squares direction, z_1, is a linear combination of standardized predictors x_1, \ldots, x_p, with coefficients based on the relation between x_j and y.
• Every subsequent partial least squares direction is calculated iteratively as a linear combination of "updated predictors", which are the residuals of fits with the "previous predictors" explained by the previous direction.
• The directions z_1, \ldots, z_k are used as predictors in a multiple linear regression. The number of directions, k, is a measure of flexibility.

k-Nearest Neighbors (KNN)
1. Identify the "center of the neighborhood", i.e. the location of an observation with inputs x_1, \ldots, x_p.
2. Starting from the "center of the neighborhood", identify the k nearest training observations.
3. For classification, \hat{y} is the most frequent category among the k observations; for regression, \hat{y} is the average of the response among the k observations.
k is inversely related to flexibility.

Ridge Regression
Coefficients are estimated by minimizing the SSE while constrained by \sum_{j=1}^{p} b_j^2 \le c, or equivalently, by minimizing the expression SSE + \lambda\sum_{j=1}^{p} b_j^2.

Lasso Regression
Coefficients are estimated by minimizing the SSE while constrained by \sum_{j=1}^{p}|b_j| \le c, or equivalently, by minimizing the expression SSE + \lambda\sum_{j=1}^{p}|b_j|.

Key Ideas on Ridge and Lasso
• x_1, \ldots, x_p are scaled predictors.
• \lambda is inversely related to flexibility.
• With a finite \lambda, none of the ridge estimates will equal 0, but the lasso estimates could equal 0.
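A short scikit-learn sketch contrasting the two penalties (assumes scikit-learn is available; alpha plays the role of \lambda, and the data are made up):

```python
# Ridge vs. lasso on scaled predictors: only lasso can zero out coefficients.
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)

Xs = StandardScaler().fit_transform(X)      # ridge and lasso use scaled predictors
ridge = Ridge(alpha=1.0).fit(Xs, y)         # shrinks coefficients, none exactly 0
lasso = Lasso(alpha=0.1).fit(Xs, y)         # some estimates can be exactly 0
print(ridge.coef_)
print(lasso.coef_)
```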
Key Results for Distributions in the Linear Exponential Family

Normal
  Probability function: \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(y-\mu)^2}{2\sigma^2}\right)
  \theta = \mu, \quad b(\theta) = \frac{\theta^2}{2}, \quad Canonical link (b')^{-1}(\mu) = \mu

Binomial (fixed n)
  Probability function: \binom{n}{y}\pi^{y}(1-\pi)^{n-y}
  \theta = \ln\left(\frac{\pi}{1-\pi}\right), \quad b(\theta) = n\ln(1+e^{\theta}), \quad Canonical link (b')^{-1}(\mu) = \ln\left(\frac{\mu}{n-\mu}\right)

Poisson
  Probability function: \frac{\mu^{y}e^{-\mu}}{y!}
  \theta = \ln\mu, \quad b(\theta) = e^{\theta}, \quad Canonical link (b')^{-1}(\mu) = \ln\mu

Negative binomial (fixed r)
  Probability function: \frac{\Gamma(y+r)}{y!\,\Gamma(r)}p^{r}(1-p)^{y}
  \theta = \ln(1-p), \quad b(\theta) = -r\ln(1-e^{\theta}), \quad Canonical link (b')^{-1}(\mu) = \ln\left(\frac{\mu}{\mu+r}\right)

Gamma
  Probability function: \frac{\gamma^{\alpha}}{\Gamma(\alpha)}y^{\alpha-1}e^{-y\gamma}
  \theta = -\frac{\gamma}{\alpha}, \quad b(\theta) = -\ln(-\theta), \quad Canonical link (b')^{-1}(\mu) = -\frac{1}{\mu}

Inverse Gaussian
  Probability function: \sqrt{\frac{\lambda}{2\pi y^{3}}}\exp\left(-\frac{\lambda(y-\mu)^2}{2\mu^{2}y}\right)
  \theta = -\frac{1}{2\mu^2}, \quad b(\theta) = -\sqrt{-2\theta}, \quad Canonical link (b')^{-1}(\mu) = -\frac{1}{2\mu^2}
NON-LINEAR MODELS
Generalized Linear Models

Notation
\theta, \phi           Linear exponential family parameters
E[Y], \mu              Mean response
b'(\theta)             Mean function
v(\mu)                 Variance function
h(\mu)                 Link function
\mathbf{b}             Maximum likelihood estimate of \boldsymbol{\beta}
l(\mathbf{b})          Maximized log-likelihood
l_0                    Maximized log-likelihood for null model
l_{sat}                Maximized log-likelihood for saturated model
e                      Residual
\mathbf{I}             Information matrix
\chi^2_{1-q, df}       q quantile of a chi-square distribution
D^*                    Scaled deviance
D                      Deviance statistic
Linear Exponential Family
Prob. fn. of Y = \exp\left(\frac{y\theta - b(\theta)}{\phi} + a(y, \phi)\right)
E[Y] = b'(\theta)
Var[Y] = \phi \cdot b''(\theta) = \phi \cdot v(\mu)

Model Framework
• h(\mu) = \mathbf{x}^{\mathsf{T}}\boldsymbol{\beta}
• The canonical link is the link function satisfying h(\mu) = (b')^{-1}(\mu), so that \theta = \mathbf{x}^{\mathsf{T}}\boldsymbol{\beta}
Numerical Results
D^* = 2[l_{sat} - l(\mathbf{b})]
D = \phi D^*
For MLR, D = SSE

Max-scaled R^2: R^2_{ms} = \frac{1 - \exp\{2[l_0 - l(\mathbf{b})]/n\}}{1 - \exp\{2 l_0/n\}}
Pseudo-R^2: R^2_{ps} = \frac{l(\mathbf{b}) - l_0}{l_{sat} - l_0}

AIC = -2 \cdot l(\mathbf{b}) + 2 \cdot (p + 1)*
BIC = -2 \cdot l(\mathbf{b}) + \ln n \cdot (p + 1)*
*Assumes only \boldsymbol{\beta} needs to be estimated. If estimating \phi is required, replace p + 1 with p + 2.
Residuals
Raw residual: e_i = y_i - \hat{\mu}_i
Pearson residual: e_i = \frac{y_i - \hat{\mu}_i}{\sqrt{\phi \cdot v(\hat{\mu}_i)}}
  The Pearson chi-square statistic is \sum_{i=1}^{n} e_i^2.
Deviance residual: e_i = \pm\sqrt{D_i^*}, whose sign follows the ith raw residual
Anscombe residual: e_i = \frac{t(y_i) - \widehat{E}[t(Y_i)]}{\sqrt{\widehat{Var}[t(Y_i)]}}
Inference
• Maximum likelihood estimators \hat{\boldsymbol{\beta}} asymptotically have a multivariate normal distribution with mean \boldsymbol{\beta} and asymptotic variance-covariance matrix \mathbf{I}^{-1}.
• To address overdispersion, change the variance to Var[Y_i] = \delta \cdot \phi \cdot b''(\theta_i) and estimate \delta as the Pearson chi-square statistic divided by n - p - 1.
Likelihood Ratio Tests
\chi^2 statistic = 2[l(\mathbf{b}_F) - l(\mathbf{b}_R)]
H_0: Some \beta_j's = 0
Reject H_0 if \chi^2 statistic \ge \chi^2_{\alpha,\, p_F - p_R}
Goodness-of-Fit Tests
Y follows a distribution of choice with m free parameters, whose domain is split into w mutually exclusive intervals; n_c is the observed count for interval c and p_c its probability under the hypothesized distribution.
\chi^2 statistic = \sum_{c=1}^{w}\frac{(n_c - n p_c)^2}{n p_c}
H_0: \Pr(Y \in interval\ c) = p_c for all c = 1, \ldots, w
Reject H_0 if \chi^2 statistic \ge \chi^2_{\alpha,\, w-m-1}
Tweedie Distribution
E[Y] = \mu, \quad Var[Y] = \phi \cdot \mu^{d}

Distribution                          d
Normal                                0
Poisson                               1
Tweedie (compound Poisson–gamma)      (1, 2)
Gamma                                 2
Inverse Gaussian                      3

Parameter Estimation
l(\boldsymbol{\beta}) = \sum_{i=1}^{n}\left[\frac{y_i\theta_i - b(\theta_i)}{\phi} + a(y_i, \phi)\right]
where \theta_i = (b')^{-1}\left(h^{-1}(\mathbf{x}_i^{\mathsf{T}}\boldsymbol{\beta})\right)

The score equations are the partial derivatives of l(\boldsymbol{\beta}) with respect to each \beta_j, all set equal to 0. The solution to the score equations is \mathbf{b}. Then, \hat{\mu} = h^{-1}(\mathbf{x}^{\mathsf{T}}\mathbf{b}).
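A sketch of maximum likelihood GLM fitting with statsmodels (assumed installed; the Poisson data with a log link are simulated for illustration):

```python
# Fit a Poisson GLM; .fit() solves the score equations numerically.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(200, 2)))   # design matrix with intercept
mu = np.exp(X @ np.array([0.5, 0.3, -0.2]))      # ln(mu) = x'beta
y = rng.poisson(mu)

fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(fit.params)        # b, the MLE of beta
print(fit.deviance)      # D
print(fit.aic)           # -2*l(b) + 2*(p + 1)
```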
Logistic and Probit Regression
• The odds of an event are the ratio of the probability that the event will occur to the probability that the event will not occur.
• The odds ratio is the ratio of the odds of an event with the presence of a characteristic to the odds of the same event without the presence of that characteristic.

Binary Response
Function Name            h(\pi)
Logit                    \ln\left(\frac{\pi}{1-\pi}\right)
Probit                   \Phi^{-1}(\pi)
Complementary log-log    \ln(-\ln(1-\pi))

l(\boldsymbol{\beta}) = \sum_{i=1}^{n}[y_i\ln\pi_i + (1-y_i)\ln(1-\pi_i)]
\frac{\partial l(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} = \sum_{i=1}^{n}\mathbf{x}_i(y_i - \pi_i)\frac{\pi_i'}{\pi_i(1-\pi_i)} = \mathbf{0}
D = 2\sum_{i=1}^{n}\left[y_i\ln\left(\frac{y_i}{\hat{\pi}_i}\right) + (1-y_i)\ln\left(\frac{1-y_i}{1-\hat{\pi}_i}\right)\right]
Pearson residual, e_i = \frac{y_i - \hat{\pi}_i}{\sqrt{\hat{\pi}_i(1-\hat{\pi}_i)}}
Pearson chi-square statistic = \sum_{i=1}^{n}\frac{(y_i - \hat{\pi}_i)^2}{\hat{\pi}_i(1-\hat{\pi}_i)}

Nominal Response – Generalized Logit
Let \pi_{i,c} be the probability that the ith observation is classified as category c. The reference category is r.
\ln\left(\frac{\pi_{i,c}}{\pi_{i,r}}\right) = \mathbf{x}_i^{\mathsf{T}}\boldsymbol{\beta}_c
\pi_{i,c} = \frac{\exp(\mathbf{x}_i^{\mathsf{T}}\boldsymbol{\beta}_c)}{1 + \sum_{u \ne r}\exp(\mathbf{x}_i^{\mathsf{T}}\boldsymbol{\beta}_u)}, \quad c \ne r
\pi_{i,r} = \frac{1}{1 + \sum_{u \ne r}\exp(\mathbf{x}_i^{\mathsf{T}}\boldsymbol{\beta}_u)}
l(\boldsymbol{\beta}) = \sum_{i=1}^{n}\sum_{c} I(y_i = c)\ln\pi_{i,c}

Ordinal Response – Proportional Odds Cumulative
h(\Pi_c) = \alpha_c + \mathbf{x}_i^{\mathsf{T}}\boldsymbol{\beta} where
• \Pi_c = \pi_1 + \cdots + \pi_c
• \mathbf{x}_i = (x_{i,1}, \ldots, x_{i,p})^{\mathsf{T}} and \boldsymbol{\beta} = (\beta_1, \ldots, \beta_p)^{\mathsf{T}} (no intercept term)
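A sketch tying the logit link to the odds-ratio interpretation above (statsmodels assumed; data simulated):

```python
# Binary logistic regression; exp(b1) is the fitted odds ratio for a
# one-unit increase in x.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=300)
pi = 1 / (1 + np.exp(-(-0.5 + 1.2 * x)))    # inverse logit
y = rng.binomial(1, pi)

fit = sm.GLM(y, sm.add_constant(x), family=sm.families.Binomial()).fit()
print(fit.params)                 # b0, b1
print(np.exp(fit.params[1]))      # odds ratio
```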
Poisson Count Regression
\ln\mu = \mathbf{x}^{\mathsf{T}}\boldsymbol{\beta}
l(\boldsymbol{\beta}) = \sum_{i=1}^{n}[y_i\ln\mu_i - \mu_i - \ln(y_i!)]
\frac{\partial l(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} = \sum_{i=1}^{n}\mathbf{x}_i(y_i - \mu_i) = \mathbf{0}
\mathbf{I} = \sum_{i=1}^{n}\mu_i\mathbf{x}_i\mathbf{x}_i^{\mathsf{T}}
D = 2\sum_{i=1}^{n}\left[y_i\ln\left(\frac{y_i}{\hat{\mu}_i}\right) - y_i + \hat{\mu}_i\right]
Pearson residual, e_i = \frac{y_i - \hat{\mu}_i}{\sqrt{\hat{\mu}_i}}
Pearson chi-square statistic = \sum_{i=1}^{n}\frac{(y_i - \hat{\mu}_i)^2}{\hat{\mu}_i}

Poisson Regression with Exposures Model
\ln\mu = \ln w + \mathbf{x}^{\mathsf{T}}\boldsymbol{\beta}

Alternative Count Models
These models can incorporate a Poisson distribution while letting the mean of the response differ from the variance of the response:

Model               Mean < Variance    Mean > Variance
Negative binomial   Yes                No
Zero-inflated       Yes                No
Hurdle              Yes                Yes
Heterogeneity       Yes                No
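A sketch of the exposures model via an offset (statsmodels assumed; the exposures w and coefficients are made up):

```python
# Poisson regression with exposures: ln(mu) = ln(w) + x'beta.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
w = rng.uniform(0.5, 2.0, size=200)              # exposures
X = sm.add_constant(rng.normal(size=200))
y = rng.poisson(w * np.exp(X @ np.array([0.2, 0.7])))

fit = sm.GLM(y, X, family=sm.families.Poisson(), offset=np.log(w)).fit()
print(fit.params)
```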
TIME SERIES

Trend Models

Notation
Subscript t       Index for observations
T_t               Trends in time
S_t               Seasonal trends
\varepsilon_t     Random patterns
\hat{y}_{n+l}     l-step ahead forecast
se                Estimated standard error
t_{1-q, df}       q quantile of a t-distribution
n_1               Training sample size
n_2               Test sample size
Trends
Additive: Y_t = T_t + S_t + \varepsilon_t
Multiplicative: Y_t = T_t \times S_t + \varepsilon_t

Stationarity
A process is stationary when its statistical properties do not vary with respect to time. Control charts can be used to identify stationarity.
White Noise
\hat{y}_{n+l} = \bar{y}
se_{\hat{y}_{n+l}} = s_y\sqrt{1 + 1/n}
100c% prediction interval for y_{n+l} is \hat{y}_{n+l} \pm t_{(1-c)/2,\, n-1} \cdot se_{\hat{y}_{n+l}}

Random Walk
w_t = y_t - y_{t-1}
\hat{y}_{n+l} = y_n + l\bar{w}
se_{\hat{y}_{n+l}} = s_w\sqrt{l}
Approximate 95% prediction interval for y_{n+l} is \hat{y}_{n+l} \pm 2 \cdot se_{\hat{y}_{n+l}}
Model Comparison
ME = \frac{1}{n_2}\sum_{t=n_1+1}^{n} e_t
MPE = \frac{100}{n_2}\sum_{t=n_1+1}^{n}\frac{e_t}{y_t}
MSE = \frac{1}{n_2}\sum_{t=n_1+1}^{n} e_t^2
MAE = \frac{1}{n_2}\sum_{t=n_1+1}^{n}|e_t|
MAPE = \frac{100}{n_2}\sum_{t=n_1+1}^{n}\left|\frac{e_t}{y_t}\right|
Autoregressive Models

Notation
\rho_k      Lag k autocorrelation
r_k         Lag k sample autocorrelation
\sigma^2    Variance of white noise
s^2         Estimate of \sigma^2
b_0         Estimate of \beta_0
b_1         Estimate of \beta_1
\bar{y}_-   Sample mean of the first n-1 observations
\bar{y}_+   Sample mean of the last n-1 observations

Autocorrelation
r_k = \frac{\sum_{t=k+1}^{n}(y_{t-k} - \bar{y})(y_t - \bar{y})}{\sum_{t=1}^{n}(y_t - \bar{y})^2}

Testing Autocorrelation
test statistic = r_k / se_{r_k} where se_{r_k} = 1/\sqrt{n}
H_0: \rho_k = 0 against H_1: \rho_k \ne 0
Reject H_0 if |test statistic| \ge z_{1-\alpha/2}
AR(1) Model
Y_t = \beta_0 + \beta_1 Y_{t-1} + \varepsilon_t

Assumptions
1. E[\varepsilon_t] = 0
2. Var[\varepsilon_t] = \sigma^2
3. Cov[\varepsilon_{t+k}, \varepsilon_t] = 0 for k > 0

• If \beta_1 = 0, Y_t follows a white noise process.
• If \beta_1 = 1, Y_t follows a random walk process.
• If -1 < \beta_1 < 1, Y_t is stationary.

Properties of Stationary AR(1) Model
E[Y_t] = \frac{\beta_0}{1 - \beta_1}
Var[Y_t] = \frac{\sigma^2}{1 - \beta_1^2}
\rho_k = \beta_1^{k}
Estimation
b_1 = \frac{\sum_{t=2}^{n}(y_{t-1} - \bar{y}_-)(y_t - \bar{y}_+)}{\sum_{t=2}^{n}(y_{t-1} - \bar{y}_-)^2} \approx r_1
b_0 = \bar{y}_+ - b_1\bar{y}_- \approx \bar{y}(1 - b_1)
s^2 = \frac{\sum_{t=2}^{n}(e_t - \bar{e})^2}{n - 3}
\widehat{Var}[Y_t] = \frac{s^2}{1 - b_1^2}

Smoothing and Predictions
\hat{y}_t = b_0 + b_1 y_{t-1}, \quad 2 \le t \le n
\hat{y}_{n+l} = b_0 + b_1 y_n for l = 1; \quad \hat{y}_{n+l} = b_0 + b_1\hat{y}_{n+l-1} for l > 1
se_{\hat{y}_{n+l}} = s\sqrt{1 + b_1^2 + b_1^4 + \cdots + b_1^{2(l-1)}}
100c% prediction interval for y_{n+l} is \hat{y}_{n+l} \pm t_{(1-c)/2,\, n-3} \cdot se_{\hat{y}_{n+l}}
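A sketch of AR(1) estimation on lagged pairs and chained forecasting (names are ours):

```python
# Fit y_t = b0 + b1*y_{t-1} and forecast l steps ahead by chaining.
import numpy as np

def ar1_forecast(y, l):
    y = np.asarray(y, dtype=float)
    x, z = y[:-1], y[1:]                   # (y_{t-1}, y_t) pairs
    b1 = np.sum((x - x.mean()) * (z - z.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = z.mean() - b1 * x.mean()
    forecasts, cur = [], y[-1]
    for _ in range(l):
        cur = b0 + b1 * cur                # yhat_{n+l} = b0 + b1 * yhat_{n+l-1}
        forecasts.append(cur)
    return forecasts
```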
Other Time Series Models

Notation
k    Moving average length
w    Smoothing parameter
b    Seasonal base
m    No. of trigonometric functions

Smoothing with Moving Averages
Y_t = \beta_0 + \varepsilon_t

Smoothing
\hat{s}_t = \frac{y_t + y_{t-1} + \cdots + y_{t-k+1}}{k}
\hat{s}_t = \hat{s}_{t-1} + \frac{y_t - y_{t-k}}{k}

Predictions
b_0 = \hat{s}_n
\hat{y}_{n+l} = b_0
Double Smoothing with Moving Averages
Y_t = \beta_0 + \beta_1 t + \varepsilon_t

Smoothing
\hat{s}_t^{(2)} = \frac{\hat{s}_t + \hat{s}_{t-1} + \cdots + \hat{s}_{t-k+1}}{k}
\hat{s}_t^{(2)} = \hat{s}_{t-1}^{(2)} + \frac{\hat{s}_t - \hat{s}_{t-k}}{k}

Predictions
b_0 = 2\hat{s}_n - \hat{s}_n^{(2)}
b_1 = \frac{2\left(\hat{s}_n - \hat{s}_n^{(2)}\right)}{k - 1}
\hat{y}_{n+l} = b_0 + b_1 \cdot l
Exponential Smoothing
Y_t = \beta_0 + \varepsilon_t

Smoothing
\hat{s}_t = (1-w)(y_t + w y_{t-1} + \cdots + w^{t} y_0)
\hat{s}_t = (1-w)y_t + w\hat{s}_{t-1}, \quad 0 \le w < 1
The value of w is determined by minimizing SS(w) = \sum_{t=1}^{n}(y_t - \hat{s}_{t-1})^2.

Predictions
b_0 = \hat{s}_n
\hat{y}_{n+l} = b_0
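A sketch of the smoothing recursion and the criterion for choosing w (initializing at y_0 is one common convention, an assumption here):

```python
# Exponential smoothing: s_t = (1-w)*y_t + w*s_{t-1}; choose w to minimize SS(w).
import numpy as np

def exp_smooth(y, w):
    s = np.empty(len(y))
    s[0] = y[0]                              # assumed initialization
    for t in range(1, len(y)):
        s[t] = (1 - w) * y[t] + w * s[t - 1]
    return s                                 # forecast: yhat_{n+l} = s[-1]

def ss(y, w):
    s = exp_smooth(y, w)
    return np.sum((y[1:] - s[:-1]) ** 2)     # SS(w), one-step-ahead errors
```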
Double Exponential Smoothing
Y_t = \beta_0 + \beta_1 t + \varepsilon_t

Smoothing
\hat{s}_t^{(2)} = (1-w)(\hat{s}_t + w\hat{s}_{t-1} + \cdots + w^{t}\hat{s}_0)
\hat{s}_t^{(2)} = (1-w)\hat{s}_t + w\hat{s}_{t-1}^{(2)}, \quad 0 \le w < 1

Predictions
b_0 = 2\hat{s}_n - \hat{s}_n^{(2)}
b_1 = \frac{1-w}{w}\left(\hat{s}_n - \hat{s}_n^{(2)}\right)
\hat{y}_{n+l} = b_0 + b_1 \cdot l

Seasonal Time Series Models

Fixed Seasonal Effects – Trigonometric Functions
S_t = \sum_{i=1}^{m}[\beta_{1,i}\sin(f_i t) + \beta_{2,i}\cos(f_i t)]
• f_i = 2\pi i/b
• m \le b/2

Seasonal Autoregressive Models, SAR(p)
Y_t = \beta_0 + \beta_1 Y_{t-b} + \cdots + \beta_p Y_{t-pb} + \varepsilon_t

Holt-Winter Seasonal Additive Model
Y_t = \beta_0 + \beta_1 t + S_t + \varepsilon_t
• S_t = S_{t-b}
• \sum_{t=1}^{b} S_t = 0
Unit Root Test
• A unit root test is used to evaluate the fit
of a random walk model.
• A random walk model is a good fit if the
time series possesses a unit root.
• The Dickey-Fuller test and augmented
Dickey-Fuller test are two examples of
unit root tests.
Volatility Models

ARCH(p) Model
\sigma_t^2 = \omega + \gamma_1\varepsilon_{t-1}^2 + \cdots + \gamma_p\varepsilon_{t-p}^2

GARCH(p, q) Model
\sigma_t^2 = \omega + \gamma_1\varepsilon_{t-1}^2 + \cdots + \gamma_p\varepsilon_{t-p}^2 + \delta_1\sigma_{t-1}^2 + \cdots + \delta_q\sigma_{t-q}^2
Var[\varepsilon_t] = \frac{\omega}{1 - \sum_{j=1}^{p}\gamma_j - \sum_{j=1}^{q}\delta_j}

Assumptions
• \omega > 0
• \gamma_j \ge 0
• \delta_j \ge 0
• \sum_{j=1}^{p}\gamma_j + \sum_{j=1}^{q}\delta_j < 1
Key Ideas for Smoothing
• Smoothing is only appropriate for time series data without a linear trend.
• Exponential smoothing is related to weighted least squares.
• A double smoothing procedure can be used to forecast time series data with a linear trend.
• Holt-Winter double exponential smoothing is a generalization of double exponential smoothing.
DECISION TREES

Regression and Classification Trees

Notation
R          Region of predictor space
n_m        No. of observations in node m
n_{m,c}    No. of category c observations in node m
I          Impurity
E          Classification error rate
G          Gini index
D          Cross entropy
T          Subtree
|T|        No. of terminal nodes in T
\lambda    Tuning parameter
Algorithm
1. Construct a large tree with M terminal nodes using recursive binary splitting.
2. Obtain a sequence of best subtrees, as a function of \lambda, using cost complexity pruning.
3. Choose \lambda by applying k-fold cross-validation. Select the \lambda that results in the lowest cross-validation error.
4. The best subtree is the subtree created in step 2 with the selected \lambda value.
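A scikit-learn sketch of this algorithm (assumed available; sklearn's ccp_alpha plays the role of the tuning parameter \lambda):

```python
# Grow a large tree, get the pruning path, and pick lambda by 5-fold CV.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.where(X[:, 0] > 0, 2.0, -1.0) + rng.normal(scale=0.5, size=200)

alphas = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y).ccp_alphas
scores = [cross_val_score(DecisionTreeRegressor(random_state=0, ccp_alpha=a),
                          X, y, cv=5, scoring="neg_mean_squared_error").mean()
          for a in alphas]
best = alphas[int(np.argmax(scores))]        # lambda with the lowest CV error
final_tree = DecisionTreeRegressor(random_state=0, ccp_alpha=best).fit(X, y)
```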
Recursive Binary Splitting
Regression:
Minimize \sum_{m=1}^{M}\sum_{i:\,\mathbf{x}_i\in R_m}(y_i - \bar{y}_{R_m})^2
Classification:
Minimize \frac{1}{n}\sum_{m=1}^{M} n_m \cdot I_m

More Under Classification:
\hat{p}_{m,c} = n_{m,c}/n_m
E_m = 1 - \max_c \hat{p}_{m,c}
G_m = \sum_{c=1}^{w}\hat{p}_{m,c}(1 - \hat{p}_{m,c})
D_m = -\sum_{c=1}^{w}\hat{p}_{m,c}\ln\hat{p}_{m,c}
deviance = -2\sum_{m=1}^{M}\sum_{c=1}^{w} n_{m,c}\ln\hat{p}_{m,c}
residual mean deviance = \frac{deviance}{n - M}
Cost Complexity Pruning
Regression:
Minimize \sum_{m=1}^{|T|}\sum_{i:\,\mathbf{x}_i\in R_m}(y_i - \bar{y}_{R_m})^2 + \lambda|T|
Classification:
Minimize \frac{1}{n}\sum_{m=1}^{|T|} n_m \cdot I_m + \lambda|T|

Key Ideas
• Terminal nodes or leaves represent the
partitions of the predictor space.
• Internal nodes are points along the tree
where splits occur.
• Terminal nodes do not have child nodes,
but internal nodes do.
• Branches are lines that connect any
two nodes.
• A decision tree with only one internal
node is called a stump.
Advantages of Trees
• Easy to interpret and explain
• Can be presented visually
• Manage categorical variables without the need for dummy variables
• Mimic human decision-making
Disadvantages of Trees
• Not robust
• Do not have the same degree of predictive
accuracy as other statistical methods
Multiple Trees

Bagging
1. Create b bootstrap samples from the original training dataset.
2. Construct a decision tree for each bootstrap sample using recursive binary splitting.
3. Predict the response of a new observation by averaging the predictions (regression trees) or by using the most frequent category (classification trees) across all b trees.

Properties
• Increasing b does not cause overfitting.
• Bagging reduces variance.
• Out-of-bag error is a valid estimate of test error.

Random Forests
1. Create b bootstrap samples from the original training dataset.
2. Construct a decision tree for each bootstrap sample using recursive binary splitting. At each split, a random subset of m variables is considered.
3. Predict the response of a new observation by averaging the predictions (regression trees) or by using the most frequent category (classification trees) across all b trees.

Properties
• Bagging is a special case of random forests.
• Increasing b does not cause overfitting.
• Decreasing m reduces the correlation between predictions.
Boosting
Let z_1 be the actual response variable, y.
1. For k = 1, 2, \ldots, B:
   • Use recursive binary splitting to fit a tree with d splits to the data with z_k as the response.
   • Update z_k by subtracting \lambda \cdot \hat{f}_k(\mathbf{x}), i.e. let z_{k+1} = z_k - \lambda \cdot \hat{f}_k(\mathbf{x}).
2. Calculate the boosted model prediction as \hat{f}(\mathbf{x}) = \sum_{k=1}^{B}\lambda \cdot \hat{f}_k(\mathbf{x}).
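A sketch of this residual-fitting loop (illustrative; max_depth stands in for the number of splits d, and sklearn trees are assumed):

```python
# Boosting: repeatedly fit small trees to the current residuals z_k.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(X, y, B=100, d=1, lam=0.1):
    z = np.asarray(y, dtype=float).copy()        # z_1 = y
    trees = []
    for _ in range(B):
        t = DecisionTreeRegressor(max_depth=d).fit(X, z)
        z -= lam * t.predict(X)                  # z_{k+1} = z_k - lam * fhat_k(x)
        trees.append(t)
    return lambda X0: lam * sum(t.predict(X0) for t in trees)  # fhat(x)
```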
Properties
• Increasing B can cause overfitting.
• Boosting reduces bias.
• d controls the complexity of the boosted model.
• \lambda controls the rate at which boosting learns.
UNSUPERVISED LEARNING

Principal Components Analysis

Notation
z, Z           Principal component (score)
Subscript m    Index for principal components
\phi           Principal component loading
x, X           Centered explanatory variable
Principal Components
z_m = \sum_{j=1}^{p}\phi_{j,m}x_j, \quad z_{i,m} = \sum_{j=1}^{p}\phi_{j,m}x_{i,j}
• \sum_{j=1}^{p}\phi_{j,m}^2 = 1
• \sum_{j=1}^{p}\phi_{j,m}\cdot\phi_{j,u} = 0, \quad m \ne u

Proportion of Variance Explained (PVE)
\sum_{j=1}^{p}s_{x_j}^2 = \sum_{j=1}^{p}\frac{1}{n-1}\sum_{i=1}^{n}x_{i,j}^2
s_{z_m}^2 = \frac{1}{n-1}\sum_{i=1}^{n}z_{i,m}^2
PVE_m = \frac{s_{z_m}^2}{\sum_{j=1}^{p}s_{x_j}^2}
Cluster Analysis

Notation
C       Cluster containing indices
W(C)    Within-cluster variation of cluster C
|C|     No. of observations in cluster C

Euclidean Distance = \sqrt{\sum_{j=1}^{p}(x_{i,j} - x_{m,j})^2}

k-Means Clustering
1. Randomly assign a cluster to each observation. This serves as the initial cluster assignments.
2. Calculate the centroid of each cluster.
3. For each observation, identify the closest centroid and reassign to that cluster.
4. Repeat steps 2 and 3 until the cluster assignments stop changing.

W(C_v) = \frac{1}{|C_v|}\sum_{i,m\in C_v}\sum_{j=1}^{p}(x_{i,j} - x_{m,j})^2 = 2\sum_{i\in C_v}\sum_{j=1}^{p}(x_{i,j} - \bar{x}_{v,j})^2
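A sketch of the algorithm above (illustrative; it ignores the empty-cluster edge case):

```python
# k-means by alternating centroid updates and reassignment.
import numpy as np

def k_means(X, k, iters=100, seed=0):
    labels = np.random.default_rng(seed).integers(k, size=len(X))   # step 1
    for _ in range(iters):
        centroids = np.array([X[labels == v].mean(axis=0) for v in range(k)])  # step 2
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new = np.argmin(dists, axis=1)                               # step 3
        if np.array_equal(new, labels):                              # step 4
            break
        labels = new
    return labels, centroids
```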
Key Ideas
• The variance explained by each subsequent principal component is always less than the variance explained by the previous principal component.
• All principal components are uncorrelated with one another.
• A dataset has \min(n-1, p) distinct principal components.
• The first k principal component scores and loadings approximate the original dataset, x_{i,j} \approx \sum_{m=1}^{k} z_{i,m}\phi_{j,m}.
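A NumPy sketch of scores, loadings, and PVE via the SVD of the centered data (names are ours):

```python
# PCA from centered data: loadings, scores z_{i,m}, and PVE per component.
import numpy as np

X = np.random.default_rng(0).normal(size=(100, 4))
Xc = X - X.mean(axis=0)                 # center each explanatory variable
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
loadings = Vt.T                         # columns are the loading vectors phi
scores = Xc @ loadings                  # z_{i,m}
pve = s**2 / np.sum(s**2)               # s_{z_m}^2 / sum_j s_{x_j}^2
```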
Hierarchical Clustering
1. Select the dissimilarity measure and linkage to be used. Treat each observation as its own cluster.
2. For k = n, n-1, \ldots, 2:
   • Compute the inter-cluster dissimilarity between all k clusters.
   • Examine all \binom{k}{2} pairwise dissimilarities. The two clusters with the lowest inter-cluster dissimilarity are fused. The dissimilarity indicates the height in the dendrogram at which these two clusters join.
Linkage     Inter-cluster dissimilarity
Complete    The largest dissimilarity
Average     The arithmetic mean of the dissimilarities
Single      The smallest dissimilarity
Centroid    The dissimilarity between the cluster centroids
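A SciPy sketch of the algorithm with a chosen linkage (SciPy assumed available):

```python
# Agglomerative clustering: build the dendrogram, then cut it.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(10, 2))
Z = linkage(X, method="complete", metric="euclidean")   # fuse least-dissimilar pair
labels = fcluster(Z, t=3, criterion="maxclust")         # cut into 3 clusters
```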
Key Ideas
• For k-means clustering, the algorithm needs to be repeated for each k.
• For hierarchical clustering, the algorithm only needs to be performed once for any number of clusters.
• The result of clustering depends on many parameters, such as:
  o Choice of k in k-means clustering
  o Choice of number of clusters, linkage, and dissimilarity measure in hierarchical clustering
  o Choice to standardize variables
Principal Components Regression
Y = \theta_0 + \theta_1 z_1 + \cdots + \theta_k z_k + \varepsilon
• If k = p, then \beta_j = \sum_{m=1}^{k}\theta_m\phi_{j,m}.
© 2023 Coaching Actuaries. All Rights Reserved | www.coachingactuaries.com