SRM Formula Sheet
Updated 03/13/23
STATISTICAL LEARNING
Types of Variables
Response     A variable of primary interest
Explanatory  A variable used to study the response variable
Count        A quantitative variable usually valid on non-negative integers
Continuous   A real-valued quantitative variable
Nominal      A categorical/qualitative variable having categories without a meaningful or logical order
Ordinal      A categorical/qualitative variable having categories with a meaningful or logical order
Notation
$y, Y$                    Response variable
$x, X$                    Explanatory variable
Subscript $i$             Index for observations
$n$                       No. of observations
Subscript $j$             Index for variables except response
$p$                       No. of variables except response
$\mathbf{A}^T$            Transpose of matrix $\mathbf{A}$
$\mathbf{A}^{-1}$         Inverse of matrix $\mathbf{A}$
$\varepsilon$             Error term
$\hat{y}, \hat{Y}, \hat{f}(x)$  Estimate/Estimator of $f(x)$
Statistical Learning Problems

Regression Problems
$Y = f(x_1, \dots, x_p) + \varepsilon$ where $\mathrm{E}[\varepsilon] = 0$, so $\mathrm{E}[Y] = f(x_1, \dots, x_p)$

For fixed inputs $x_1, \dots, x_p$, the test MSE is
$\text{Test MSE} = \mathrm{E}\left[(Y - \hat{Y})^2\right]$,
which can be estimated using $\frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{n}$.

$\text{Test MSE} = \underbrace{\mathrm{Var}\left[\hat{f}(x_1, \dots, x_p)\right] + \left(\mathrm{Bias}\left[\hat{f}(x_1, \dots, x_p)\right]\right)^2}_{\text{reducible error}} + \underbrace{\mathrm{Var}[\varepsilon]}_{\text{irreducible error}}$

Classification Problems
$\text{Test Error Rate} = \mathrm{E}\left[I(Y \ne \hat{Y})\right]$,
which can be estimated using $\frac{\sum_{i=1}^n I(y_i \ne \hat{y}_i)}{n}$.

Bayes Classifier:
$f(x_1, \dots, x_p) = \underset{c}{\arg\max}\; \Pr(Y = c \mid X_1 = x_1, \dots, X_p = x_p)$
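As a quick illustration (not part of the original sheet), both test-set estimates can be computed directly; a minimal Python sketch, assuming `y` and `y_hat` are numpy arrays of held-out responses and predictions:

```python
import numpy as np

def test_mse(y, y_hat):
    # (1/n) * sum of (y_i - yhat_i)^2 over the test observations
    return np.mean((y - y_hat) ** 2)

def test_error_rate(y, y_hat):
    # (1/n) * sum of I(y_i != yhat_i) over the test observations
    return np.mean(y != y_hat)
```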
Key Ideas
• The disadvantage of parametric methods is the danger of choosing a form for $f$ that is not close to the truth.
• The disadvantage of non-parametric methods is the need for an abundance of observations.
• Flexibility and interpretability are typically at odds.
• As flexibility increases, the training MSE (or error rate) decreases, but the test MSE (or error rate) follows a U-shaped pattern.
• Low flexibility leads to a method with low variance and high bias; high flexibility leads to a method with high variance and low bias.
Contrasting Statistical Learning Elements
Method            Properties
Regression        Quantitative response variable
Classification    Categorical response variable

Supervised        Has response variable
Unsupervised      No response variable

Parametric        Functional form of $f$ specified
Non-Parametric    Functional form of $f$ not specified

Prediction        Output of $\hat{f}$
Inference         Comprehension of $f$

Flexibility       $\hat{f}$'s ability to follow the data
Interpretability  $\hat{f}$'s ability to be understood

Training          Observations used to train/obtain $\hat{f}$
Test              Observations not used to train/obtain $\hat{f}$
LINEAR MODELS

Simple Linear Regression (SLR)
Special case of MLR where $p = 1$
Estimation
∑' (๐‘ฅ๐‘ฅ& − ๐‘ฅ๐‘ฅฬ… )(๐‘ฆ๐‘ฆ& − ๐‘ฆ๐‘ฆe)
๐‘๐‘# = &(# '
∑&(#(๐‘ฅ๐‘ฅ& − ๐‘ฅ๐‘ฅฬ… )%
๐‘๐‘4 = ๐‘ฆ๐‘ฆe − ๐‘๐‘# ๐‘ฅ๐‘ฅฬ…
SLR Inferences
Standard Errors
$se_{b_0} = \sqrt{\mathrm{MSE}\left(\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^n (x_i - \bar{x})^2}\right)}$
$se_{b_1} = \sqrt{\frac{\mathrm{MSE}}{\sum_{i=1}^n (x_i - \bar{x})^2}}$
$se_{\hat{y}} = \sqrt{\mathrm{MSE}\left(\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^n (x_i - \bar{x})^2}\right)}$
$se_{\hat{y}_{n+1}} = \sqrt{\mathrm{MSE}\left(1 + \frac{1}{n} + \frac{(x_{n+1} - \bar{x})^2}{\sum_{i=1}^n (x_i - \bar{x})^2}\right)}$
Multiple Linear Regression (MLR)
$Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \varepsilon$
Notation
$\beta_j$     The $j$th regression coefficient
$b_j$         Estimate of $\beta_j$
$\sigma^2$    Variance of response / Irreducible error
MSE           Estimate of $\sigma^2$
$\mathbf{X}$  Design matrix
$\mathbf{H}$  Hat matrix
$e$           Residual
SST           Total sum of squares
SSR           Regression sum of squares
SSE           Error sum of squares
Assumptions
1. $Y_i = \beta_0 + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p} + \varepsilon_i$
2. $x_{i,j}$'s are non-random
3. $\mathrm{E}[\varepsilon_i] = 0$
4. $\mathrm{Var}[\varepsilon_i] = \sigma^2$
5. $\varepsilon_i$'s are independent
6. $\varepsilon_i$'s are normally distributed
7. The predictor $x_j$ is not a linear combination of the other $p$ predictors, for $j = 0, 1, \dots, p$
Estimation – Ordinary Least Squares (OLS)
$\hat{y} = b_0 + b_1 x_1 + \cdots + b_p x_p$
$\mathbf{b} = \begin{bmatrix} b_0 \\ \vdots \\ b_p \end{bmatrix} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$
$\mathrm{MSE} = \mathrm{SSE}/(n - p - 1)$
residual standard error $= \sqrt{\mathrm{MSE}}$
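A minimal Python sketch of the OLS formulas above (an illustration, not part of the sheet); `X` is assumed to be the design matrix with a leading column of ones:

```python
import numpy as np

def ols_fit(X, y):
    b = np.linalg.solve(X.T @ X, X.T @ y)  # b = (X'X)^(-1) X'y
    e = y - X @ b                          # residuals
    n, k = X.shape                         # k = p + 1 columns
    mse = (e @ e) / (n - k)                # MSE = SSE / (n - p - 1)
    return b, mse
```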
Other Numerical Results
$\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$
$\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}$
$e = y - \hat{y}$
$\mathrm{SST} = \sum_{i=1}^n (y_i - \bar{y})^2 = \text{total variability}$
$\mathrm{SSR} = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 = \text{explained}$
$\mathrm{SSE} = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \text{unexplained}$
$\mathrm{SST} = \mathrm{SSR} + \mathrm{SSE}$
$R^2 = \mathrm{SSR}/\mathrm{SST}$
$R^2_{adj} = 1 - \frac{\mathrm{MSE}}{s_y^2} = 1 - (1 - R^2)\left(\frac{n-1}{n-p-1}\right)$
Key Ideas
• $R^2$ is a poor measure for model comparison because it will increase simply by adding more predictors to a model.
• A polynomial term does not change by a constant amount per unit increase in its variable, i.e. there is no constant slope.
• Only $w - 1$ dummy variables are needed to represent $w$ classes of a categorical predictor; one of the classes acts as a baseline.
• In effect, dummy variables define a distinct intercept for each class. Without the interaction between a dummy variable and a predictor, the dummy variable cannot additionally affect that predictor's regression coefficient.
๐‘Œ๐‘Œ,
๐‘ ๐‘ ๐‘ ๐‘ 
๐ป๐ป4
๐ป๐ป#
df
๐‘ก๐‘ก#">,+?
๐›ผ๐›ผ
๐‘˜๐‘˜
ndf
ddf
๐น๐น#">,@+?,++?
๐‘Œ๐‘Œ'8#
Subscript ๐‘Ÿ๐‘Ÿ
Subscript ๐‘“๐‘“
Standard Errors
Estimated standard error
Null hypothesis
Alternative hypothesis
Degrees of freedom
๐‘ž๐‘ž quantile of
a ๐‘ก๐‘ก-distribution
Significance level
Confidence level
Numerator degrees
of freedom
Denominator degrees
of freedom
๐‘ž๐‘ž quantile of
an ๐น๐น-distribution
Response of
new observation
Reduced model
Full model
Variance-Covariance Matrix
$\widehat{\mathrm{Var}}\left[\hat{\boldsymbol{\beta}}\right] = \mathrm{MSE}\,(\mathbf{X}^T\mathbf{X})^{-1} =
\begin{bmatrix}
\widehat{\mathrm{Var}}[\hat{\beta}_0] & \widehat{\mathrm{Cov}}[\hat{\beta}_0, \hat{\beta}_1] & \cdots & \widehat{\mathrm{Cov}}[\hat{\beta}_0, \hat{\beta}_p] \\
\widehat{\mathrm{Cov}}[\hat{\beta}_0, \hat{\beta}_1] & \widehat{\mathrm{Var}}[\hat{\beta}_1] & \cdots & \widehat{\mathrm{Cov}}[\hat{\beta}_1, \hat{\beta}_p] \\
\vdots & \vdots & \ddots & \vdots \\
\widehat{\mathrm{Cov}}[\hat{\beta}_0, \hat{\beta}_p] & \widehat{\mathrm{Cov}}[\hat{\beta}_1, \hat{\beta}_p] & \cdots & \widehat{\mathrm{Var}}[\hat{\beta}_p]
\end{bmatrix}$
๐‘ก๐‘ก Tests
estimate − hypothesized value
standard error
๐ป๐ป4 : ๐›ฝ๐›ฝ9 = hypothesized value
๐‘ก๐‘ก statistic =
Test Type
Two-tailed
Right-tailed
3. E[๐œ€๐œ€& ] = 0
4. Var[๐œ€๐œ€& ] = ๐œŽ๐œŽ %
5. ๐œ€๐œ€& ’s are independent
6. ๐œ€๐œ€& ’s are normally distributed
7. The predictor ๐‘ฅ๐‘ฅ9 is not a linear
Estimator for E[๐‘Œ๐‘Œ]
Ö S๐›ฝ๐›ฝ.9 T
๐‘ ๐‘ ๐‘ ๐‘ 5% = ÑVar
Left-tailed
2. ๐‘ฅ๐‘ฅ&,9 ’s are non-random
๐น๐น Tests
Rejection Region
|๐‘ก๐‘ก statistic| ≥ ๐‘ก๐‘กA⁄%,'"$"#
๐‘ก๐‘ก statistic ≤ −๐‘ก๐‘กA,'"$"#
๐‘ก๐‘ก statistic ≥ ๐‘ก๐‘กA,'"$"#
MSR
SSR⁄๐‘๐‘
=
MSE SSE⁄(๐‘›๐‘› − ๐‘๐‘ − 1)
๐ป๐ป4 : ๐›ฝ๐›ฝ# = ๐›ฝ๐›ฝ% = โ‹ฏ = ๐›ฝ๐›ฝ$ = 0
๐น๐น statistic =
Reject ๐ป๐ป4 if ๐น๐น statistic ≥ ๐น๐นA,@+?,++?
combination of the other ๐‘๐‘ predictors,
for ๐‘—๐‘— = 0, 1, … , ๐‘๐‘
© 2023 Coaching Actuaries. All Rights Reserved
MLR Inferences
Notation
Estimator for ๐›ฝ๐›ฝ9
๐›ฝ๐›ฝ.9
• ndf = ๐‘๐‘
• ddf = ๐‘›๐‘› − ๐‘๐‘ − 1
Partial $F$ Tests
$F\ \text{statistic} = \frac{(\mathrm{SSE}_r - \mathrm{SSE}_f)/(p_f - p_r)}{\mathrm{SSE}_f/(n - p_f - 1)}$
$H_0$: Some $\beta_j$'s $= 0$
Reject $H_0$ if $F$ statistic $\ge F_{\alpha,\,\mathrm{ndf},\,\mathrm{ddf}}$
• ndf $= p_f - p_r$
• ddf $= n - p_f - 1$

For all hypothesis tests, reject $H_0$ if $p$-value $\le \alpha$.
Confidence and Prediction Intervals
estimate $\pm$ ($t$ quantile)(standard error)

Quantity        Interval Expression
$\beta_j$       $b_j \pm t_{(1-k)/2,\, n-p-1} \cdot se_{b_j}$
$\mathrm{E}[Y]$ $\hat{y} \pm t_{(1-k)/2,\, n-p-1} \cdot se_{\hat{y}}$
$Y_{n+1}$       $\hat{y}_{n+1} \pm t_{(1-k)/2,\, n-p-1} \cdot se_{\hat{y}_{n+1}}$
Linear Model Assumptions

Leverage
$h_i = \mathbf{x}_i^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_i = \frac{se_{\hat{y}_i}^2}{\mathrm{MSE}}$
$h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{u=1}^n (x_u - \bar{x})^2}$ for SLR
• $1/n \le h_i \le 1$
• $\sum_{i=1}^n h_i = p + 1$
• Frees rule of thumb: $h_i > 3\left(\frac{p+1}{n}\right)$
Studentized and Standardized Residuals
$e_{\mathrm{stu},i} = \frac{e_i}{\sqrt{\mathrm{MSE}_{(i)}(1 - h_i)}}$
$e_{\mathrm{sta},i} = \frac{e_i}{\sqrt{\mathrm{MSE}(1 - h_i)}}$
• Frees rule of thumb: $|e_{\mathrm{sta},i}| > 2$
Cook's Distance
$D_i = \frac{\sum_{u=1}^n \left(\hat{y}_u - \hat{y}_{(i)u}\right)^2}{\mathrm{MSE}(p+1)} = \frac{e_i^2 h_i}{\mathrm{MSE}(p+1)(1 - h_i)^2}$
Plots of Residuals
• $e$ versus $\hat{y}$
Residuals are well-behaved if
  o Points appear to be randomly scattered
  o Residuals seem to average to 0
  o Spread of residuals does not change
• $e$ versus $i$
Detects dependence of error terms
• $qq$ plot of $e$
Assesses normality of error terms
Variance Inflation Factor
$\mathrm{VIF}_j = \frac{s_{x_j}^2 (n-1)\, se_{b_j}^2}{\mathrm{MSE}} = \frac{1}{1 - R_j^2}$
Tolerance is the reciprocal of VIF.
• Frees rule of thumb: any $\mathrm{VIF}_j \ge 10$ indicates severe multicollinearity
Key Ideas
• As realizations of a $t$-distribution, studentized residuals can help identify outliers.
• When residuals have a larger spread for larger predictions, one solution is to transform the response variable with a concave function.
• There is no universal approach to handling multicollinearity; it is even possible to accept it, such as when there is a suppressor variable. On the other hand, it can be eliminated by using a set of orthogonal predictors.
Model Selection
Notation
$g$               Total no. of predictors in consideration
$p$               No. of predictors for a specific model
$\mathrm{MSE}_g$  MSE of the model that uses all $g$ predictors
$M_p$             The "best" model with $p$ predictors
Best Subset Selection
1. For $p = 0, 1, \dots, g$, fit all $\binom{g}{p}$ models with $p$ predictors. The model with the largest $R^2$ is $M_p$.
2. Choose the best model among $M_0, \dots, M_g$ using a selection criterion of choice.
Forward Stepwise Selection
1. Fit all $g$ simple linear regression models. The model with the largest $R^2$ is $M_1$.
2. For $p = 2, \dots, g$, fit the models that add one of the remaining predictors to $M_{p-1}$. The model with the largest $R^2$ is $M_p$.
3. Choose the best model among $M_0, \dots, M_g$ using a selection criterion of choice (see the sketch below).
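A minimal sketch of the forward stepwise procedure (an illustration, not part of the sheet); `r_squared(X, y, cols)` is an assumed helper that fits OLS on the listed predictor columns plus an intercept and returns $R^2$:

```python
def forward_stepwise(X, y, g, r_squared):
    chosen, models = [], []
    for p in range(1, g + 1):
        remaining = [j for j in range(g) if j not in chosen]
        # add the remaining predictor that yields the largest R^2
        best = max(remaining, key=lambda j: r_squared(X, y, chosen + [j]))
        chosen.append(best)
        models.append(list(chosen))      # this is M_p
    return models  # compare M_1, ..., M_g with a selection criterion
```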
Backward Stepwise Selection
1. Fit the model with all $g$ predictors, $M_g$.
2. For $p = g - 1, \dots, 1$, fit the models that drop one of the predictors from $M_{p+1}$. The model with the largest $R^2$ is $M_p$.
3. Choose the best model among $M_0, \dots, M_g$ using a selection criterion of choice.
Selection Criteria
• Mallows' $C_p$
$C_p = \frac{\mathrm{SSE} + 2p \cdot \mathrm{MSE}_g}{n}$
$C_p = \frac{\mathrm{SSE}}{\mathrm{MSE}_g} - n + 2(p + 1)$
• Akaike information criterion
$\mathrm{AIC} = \frac{\mathrm{SSE} + 2p \cdot \mathrm{MSE}_g}{n \cdot \mathrm{MSE}_g}$
• Bayesian information criterion
$\mathrm{BIC} = \frac{\mathrm{SSE} + \ln n \cdot p \cdot \mathrm{MSE}_g}{n \cdot \mathrm{MSE}_g}$
• Adjusted $R^2$
• Cross-validation error
Validation Set
• Randomly splits all available observations into two groups: the training set and the validation set.
• Only the observations in the training set are used to attain the fitted model, and those in the validation set are used to estimate the test MSE.
$k$-fold Cross-Validation
1. Randomly divide all available observations into $k$ folds.
2. For $v = 1, \dots, k$, obtain the $v$th fit by training with all observations except those in the $v$th fold.
3. For $v = 1, \dots, k$, use $\hat{y}$ from the $v$th fit to calculate a test MSE estimate with observations in the $v$th fold.
4. To calculate CV error, average the $k$ test MSE estimates in the previous step.
Leave-one-out Cross-Validation (LOOCV)
• Calculate LOOCV error as a special case of $k$-fold cross-validation where $k = n$.
• For MLR:
$\text{LOOCV Error} = \frac{1}{n}\sum_{i=1}^n \left(\frac{y_i - \hat{y}_i}{1 - h_i}\right)^2$
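A minimal sketch of $k$-fold cross-validation for a regression model (not part of the sheet); `fit(X, y)` is an assumed helper that returns a prediction function:

```python
import numpy as np

def k_fold_cv_error(X, y, fit, k=10):
    n = len(y)
    folds = np.array_split(np.random.permutation(n), k)  # step 1
    mses = []
    for v in range(k):
        test = folds[v]
        train = np.setdiff1d(np.arange(n), test)
        model = fit(X[train], y[train])   # step 2: train without fold v
        mses.append(np.mean((y[test] - model(X[test])) ** 2))  # step 3
    return np.mean(mses)                  # step 4: average the k estimates
```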
Key Ideas on Cross-Validation
• The validation set approach has unstable results and will tend to overestimate the test MSE. The two other approaches mitigate these issues.
• With respect to bias, LOOCV < $k$-fold CV < Validation Set.
• With respect to variance, LOOCV > $k$-fold CV > Validation Set.
Other Regression Approaches

Weighted Least Squares
• $\mathrm{Var}[\varepsilon_i] = \sigma^2/w_i$
• Equivalent to running OLS with $\sqrt{w}\,y$ as the response and $\sqrt{w}\,\mathbf{x}$ as the predictors, hence minimizing $\sum_{i=1}^n w_i(y_i - \hat{y}_i)^2$.
$\mathbf{b} = (\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{W}\mathbf{y}$ where $\mathbf{W}$ is the diagonal matrix of the weights.

Standardizing Variables
• A centered variable is the result of subtracting the sample mean from a variable.
• A scaled variable is the result of dividing a variable by its sample standard deviation.
• A standardized variable is the result of first centering a variable, then scaling it.

Ridge Regression
Coefficients are estimated by minimizing the SSE while constrained by $\sum_{j=1}^p b_j^2 \le a$, or equivalently, by minimizing the expression $\mathrm{SSE} + \lambda\sum_{j=1}^p b_j^2$.

Lasso Regression
Coefficients are estimated by minimizing the SSE while constrained by $\sum_{j=1}^p |b_j| \le a$, or equivalently, by minimizing the expression $\mathrm{SSE} + \lambda\sum_{j=1}^p |b_j|$.

Key Ideas on Ridge and Lasso
• $x_1, \dots, x_p$ are scaled predictors.
• $\lambda$ is inversely related to flexibility.
• With a finite $\lambda$, none of the ridge estimates will equal 0, but the lasso estimates could equal 0.

Partial Least Squares
• The first partial least squares direction, $z_1$, is a linear combination of standardized predictors $x_1, \dots, x_p$, with coefficients based on the relation between $x_j$ and $y$.
• Every subsequent partial least squares direction is calculated iteratively as a linear combination of "updated predictors", which are the residuals of fits with the "previous predictors" explained by the previous direction.
• The directions $z_1, \dots, z_g$ are used as predictors in a multiple linear regression. The number of directions, $g$, is a measure of flexibility.

$k$-Nearest Neighbors (KNN)
1. Identify the "center of the neighborhood", i.e. the location of an observation with inputs $x_1, \dots, x_p$.
2. Starting from the "center of the neighborhood", identify the $k$ nearest training observations.
3. For classification, $\hat{y}$ is the most frequent category among the $k$ observations; for regression, $\hat{y}$ is the average of the response among the $k$ observations.
$k$ is inversely related to flexibility. (See the sketch below.)
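A minimal sketch of the KNN steps above for a regression response (not part of the sheet), using Euclidean distance; `x0` holds the inputs $x_1, \dots, x_p$ at the center of the neighborhood:

```python
import numpy as np

def knn_predict(X_train, y_train, x0, k):
    dist = np.sqrt(((X_train - x0) ** 2).sum(axis=1))  # steps 1 and 2
    nearest = np.argsort(dist)[:k]   # indices of the k nearest neighbors
    return y_train[nearest].mean()   # step 3 (regression: average response)
```

For classification, the mean in the last line would be replaced by the most frequent category among the $k$ neighbors.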
Key Results for Distributions in the Exponential Family

Normal
  Probability function: $\frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(y-\mu)^2}{2\sigma^2}\right)$
  $\theta = \mu$,  $b(\theta) = \frac{\theta^2}{2}$,  $\phi = \sigma^2$,  canonical link $b'^{-1}(\mu) = \mu$

Binomial (fixed $n$)
  Probability function: $\binom{n}{y}\pi^y(1-\pi)^{n-y}$
  $\theta = \ln\left(\frac{\mu}{n-\mu}\right)$,  $b(\theta) = n\ln(1 + e^\theta)$,  $\phi = 1$,  canonical link $= \ln\left(\frac{\pi}{1-\pi}\right)$

Poisson
  Probability function: $\frac{\lambda^y \exp(-\lambda)}{y!}$
  $\theta = \ln\lambda$,  $b(\theta) = e^\theta$,  $\phi = 1$,  canonical link $= \ln\mu$

Negative Binomial (fixed $r$)
  Probability function: $\frac{\Gamma(y+r)}{y!\,\Gamma(r)}p^r(1-p)^y$
  $\theta = \ln(1-p)$,  $b(\theta) = -r\ln(1 - e^\theta)$,  $\phi = 1$,  canonical link $= \ln\left(\frac{\mu}{r+\mu}\right)$

Gamma
  Probability function: $\frac{\gamma^\alpha}{\Gamma(\alpha)}y^{\alpha-1}\exp(-y\gamma)$
  $\theta = -\frac{\gamma}{\alpha}$,  $b(\theta) = -\ln(-\theta)$,  $\phi = \frac{1}{\alpha}$,  canonical link $= -\frac{1}{\mu}$

Inverse Gaussian
  Probability function: $\sqrt{\frac{\lambda}{2\pi y^3}}\exp\left(-\frac{\lambda(y-\mu)^2}{2\mu^2 y}\right)$
  $\theta = -\frac{1}{2\mu^2}$,  $b(\theta) = -\sqrt{-2\theta}$,  $\phi = \frac{1}{\lambda}$,  canonical link $= -\frac{1}{2\mu^2}$
NON-LINEAR MODELS
Generalized Linear Models
Notation
$\theta, \phi$        Linear exponential family parameters
$\mathrm{E}[Y], \mu$  Mean response
$b'(\theta)$          Mean function
$v(\mu)$              Variance function
$h(\mu)$              Link function
$\mathbf{b}$          Maximum likelihood estimate of $\boldsymbol{\beta}$
$l(\mathbf{b})$       Maximized log-likelihood
$l_0$                 Maximized log-likelihood for null model
$l_{sat}$             Maximized log-likelihood for saturated model
$e$                   Residual
$\mathbf{I}$          Information matrix
$\chi^2_{1-q,\,df}$   $q$ quantile of a chi-square distribution
$D^*$                 Scaled deviance
$D$                   Deviance statistic
Linear Exponential Family
Prob. fn. of $Y = \exp\left(\frac{y\theta - b(\theta)}{\phi} + a(y, \phi)\right)$
$\mathrm{E}[Y] = b'(\theta)$
$\mathrm{Var}[Y] = \phi \cdot b''(\theta) = \phi \cdot v(\mu)$
Model Framework
• $h(\mu) = \mathbf{x}^T\boldsymbol{\beta}$
• Canonical link is the link function where $h(\mu) = b'^{-1}(\mu)$.
Numerical Results
$D^* = 2[l_{sat} - l(\mathbf{b})]$
$D = \phi D^*$
For MLR, $D = \mathrm{SSE}$
$R^2_{ms} = \frac{1 - \exp\{2[l_0 - l(\mathbf{b})]/n\}}{1 - \exp\{2 l_0/n\}}$
$R^2_{pse} = \frac{l(\mathbf{b}) - l_0}{l_{sat} - l_0}$
$\mathrm{AIC} = -2 \cdot l(\mathbf{b}) + 2 \cdot (p + 1)$*
$\mathrm{BIC} = -2 \cdot l(\mathbf{b}) + \ln n \cdot (p + 1)$*
*Assumes only $\boldsymbol{\beta}$ needs to be estimated. If estimating $\phi$ is required, replace $p + 1$ with $p + 2$.
Residuals
Raw Residual
$e_i = y_i - \hat{\mu}_i$
Pearson Residual
$e_i = \frac{y_i - \hat{\mu}_i}{\sqrt{\phi \cdot v(\hat{\mu}_i)}}$
The Pearson chi-square statistic is $\sum_{i=1}^n e_i^2$.
Deviance Residual
$e_i = \pm\sqrt{D_i^*}$, whose sign follows the $i$th raw residual
Anscombe Residual
$e_i = \frac{t(y_i) - \hat{\mathrm{E}}[t(Y_i)]}{\sqrt{\widehat{\mathrm{Var}}[t(Y_i)]}}$
Inference
• Maximum likelihood estimators $\hat{\boldsymbol{\beta}}$ asymptotically have a multivariate normal distribution with mean $\boldsymbol{\beta}$ and asymptotic variance-covariance matrix $\mathbf{I}^{-1}$.
• To address overdispersion, change the variance to $\mathrm{Var}[Y_i] = \delta \cdot \phi \cdot b''(\theta_i)$ and estimate $\delta$ as the Pearson chi-square statistic divided by $n - p - 1$.
Likelihood Ratio Tests
$\chi^2\ \text{statistic} = 2\left[l(\mathbf{b}_f) - l(\mathbf{b}_r)\right]$
$H_0$: Some $\beta_j$'s $= 0$
Reject $H_0$ if $\chi^2$ statistic $\ge \chi^2_{\alpha,\, p_f - p_r}$
Goodness-of-Fit Tests
$Y$ follows a distribution of choice with $g$ free parameters, whose domain is split into $w$ mutually exclusive intervals.
$\chi^2\ \text{statistic} = \sum_{c=1}^w \frac{(n_c - nq_c)^2}{nq_c}$
$H_0: q_c = \frac{n_c}{n}$ for all $c = 1, \dots, w$
Reject $H_0$ if $\chi^2$ statistic $\ge \chi^2_{\alpha,\, w-g-1}$
Tweedie Distribution
$\mathrm{E}[Y] = \mu$,  $\mathrm{Var}[Y] = \phi \cdot \mu^d$

Distribution      Power $d$
Normal            0
Poisson           1
Tweedie           (1, 2)
Gamma             2
Inverse Gaussian  3

Parameter Estimation
$l(\boldsymbol{\beta}) = \sum_{i=1}^n \left[\frac{y_i\theta_i - b(\theta_i)}{\phi} + a(y_i, \phi)\right]$
where $\theta_i = b'^{-1}\left(h^{-1}(\mathbf{x}_i^T\boldsymbol{\beta})\right)$
The score equations are the partial derivatives of $l(\boldsymbol{\beta})$ with respect to each $\beta_j$, all set equal to 0. The solution to the score equations is $\mathbf{b}$. Then, $\hat{\mu} = h^{-1}(\mathbf{x}^T\mathbf{b})$.
Logistic and Probit Regression
• The odds of an event are the ratio of the
probability that the event will occur to
the probability that the event will
not occur.
• The odds ratio is the ratio of the odds
of an event with the presence of a
characteristic to the odds of the same
event without the presence of
that characteristic.
Binary Response
Function Name          $h(\mu)$
Logit                  $\ln\left(\frac{\mu}{1-\mu}\right)$
Probit                 $\Phi^{-1}(\mu)$
Complementary log-log  $\ln(-\ln(1-\mu))$
Nominal Response – Generalized Logit
Let $\pi_{i,c}$ be the probability that the $i$th observation is classified as category $c$. The reference category is $k$.
$\ln\left(\frac{\pi_{i,c}}{\pi_{i,k}}\right) = \mathbf{x}_i^T\boldsymbol{\beta}_c$
$\pi_{i,c} = \begin{cases} \dfrac{\exp(\mathbf{x}_i^T\boldsymbol{\beta}_c)}{1 + \sum_{u \ne k}\exp(\mathbf{x}_i^T\boldsymbol{\beta}_u)}, & c \ne k \\ \dfrac{1}{1 + \sum_{u \ne k}\exp(\mathbf{x}_i^T\boldsymbol{\beta}_u)}, & c = k \end{cases}$
$l(\boldsymbol{\beta}) = \sum_{i=1}^n \sum_{c=1}^w I(y_i = c)\ln\pi_{i,c}$
Ordinal Response – Proportional Odds Cumulative
$h(\Pi_c) = \alpha_c + \mathbf{x}_i^T\boldsymbol{\beta}$ where
• $\Pi_c = \pi_1 + \cdots + \pi_c$
• $\mathbf{x}_i = \begin{bmatrix} x_{i,1} \\ \vdots \\ x_{i,p} \end{bmatrix}$, $\boldsymbol{\beta} = \begin{bmatrix} \beta_1 \\ \vdots \\ \beta_p \end{bmatrix}$
๐‘™๐‘™(๐œท๐œท) = ∞[๐‘ฆ๐‘ฆ& ln ๐œ‡๐œ‡& + (1 − ๐‘ฆ๐‘ฆ& ) ln(1 − ๐œ‡๐œ‡& )]
&(#
'
๐œ•๐œ•
๐œ‡๐œ‡&M
๐‘™๐‘™(๐œท๐œท) = ∞ ๐ฑ๐ฑ& (๐‘ฆ๐‘ฆ& − ๐œ‡๐œ‡& )
= ๐ŸŽ๐ŸŽ
๐œ•๐œ•๐œท๐œท
๐œ‡๐œ‡& (1 − ๐œ‡๐œ‡& )
'
&(#
1 − ๐‘ฆ๐‘ฆ&
๐‘ฆ๐‘ฆ&
๐ท๐ท = 2 ∞ ๐‘ฆ๐‘ฆ& ln y z + (1 − ๐‘ฆ๐‘ฆ& ) ln y
zÀ
๐œ‡๐œ‡ฬ‚ &
1 − ๐œ‡๐œ‡ฬ‚ &
&(#
Pearson residual, ๐‘’๐‘’& =
๐‘ฆ๐‘ฆ& − ๐œ‡๐œ‡ฬ‚ &
'
&(#
(๐‘ฆ๐‘ฆ& − ๐œ‡๐œ‡ฬ‚ & )%
๐œ‡๐œ‡ฬ‚ & (1 − ๐œ‡๐œ‡ฬ‚ & )
© 2023 Coaching Actuaries. All Rights Reserved
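A minimal sketch (not from the sheet) of solving the score equations above by Newton-Raphson for the canonical logit link, where the score simplifies to $\sum_i \mathbf{x}_i(y_i - \mu_i)$:

```python
import numpy as np

def fit_logistic(X, y, iters=25):
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = 1 / (1 + np.exp(-X @ beta))  # mu_i under the logit link
        score = X.T @ (y - mu)            # score equations
        W = mu * (1 - mu)
        info = X.T @ (W[:, None] * X)     # information matrix
        beta = beta + np.linalg.solve(info, score)
    return beta
```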
Poisson Count Regression
$\ln\mu = \mathbf{x}^T\boldsymbol{\beta}$
$l(\boldsymbol{\beta}) = \sum_{i=1}^n [y_i \ln\mu_i - \mu_i - \ln(y_i!)]$
$\frac{\partial}{\partial\boldsymbol{\beta}} l(\boldsymbol{\beta}) = \sum_{i=1}^n \mathbf{x}_i(y_i - \mu_i) = \mathbf{0}$
$\mathbf{I} = \sum_{i=1}^n \mu_i \mathbf{x}_i\mathbf{x}_i^T$
$D = 2\sum_{i=1}^n \left[y_i \ln\left(\frac{y_i}{\hat{\mu}_i}\right) - (y_i - \hat{\mu}_i)\right]$
Pearson residual, $e_i = \frac{y_i - \hat{\mu}_i}{\sqrt{\hat{\mu}_i}}$
Pearson chi-square statistic $= \sum_{i=1}^n \frac{(y_i - \hat{\mu}_i)^2}{\hat{\mu}_i}$
Poisson Regression with Exposures Model
$\ln\mu = \ln w + \mathbf{x}^T\boldsymbol{\beta}$
Alternative Count Models
These models can incorporate a Poisson distribution while letting the mean of the response differ from the variance of the response:

Models             Mean < Variance   Mean > Variance
Negative binomial  Yes               No
Zero-inflated      Yes               No
Hurdle             Yes               Yes
Heterogeneity      Yes               No
TIME SERIES
Trend Models
Notation
Subscript $t$    Index for observations
$T_t$            Trends in time
$S_t$            Seasonal trends
$\varepsilon_t$  Random patterns
$\hat{y}_{n+l}$  $l$-step ahead forecast
$se$             Estimated standard error
$t_{1-q,\,df}$   $q$ quantile of a $t$-distribution
$n_1$            Training sample size
$n_2$            Test sample size
Trends
Additive: $Y_t = T_t + S_t + \varepsilon_t$
Multiplicative: $Y_t = T_t \times S_t + \varepsilon_t$
Stationarity
A stationary time series has properties (such as its mean and variance) that do not vary with respect to time. Control charts can be used to identify stationarity.
White Noise
$\hat{y}_{n+l} = \bar{y}$
$se_{\hat{y}_{n+l}} = s_y\sqrt{1 + 1/n}$
100$k$% prediction interval for $y_{n+l}$ is $\hat{y}_{n+l} \pm t_{(1-k)/2,\, n-1} \cdot se_{\hat{y}_{n+l}}$
Random Walk
$w_t = y_t - y_{t-1}$
$\hat{y}_{n+l} = y_n + l\bar{w}$
$se_{\hat{y}_{n+l}} = s_w\sqrt{l}$
Approximate 95% prediction interval for $y_{n+l}$ is $\hat{y}_{n+l} \pm 2 \cdot se_{\hat{y}_{n+l}}$
Model Comparison
ME =
'
1
∞ ๐‘’๐‘’W
๐‘›๐‘›%
W('" 8#
'
๐‘’๐‘’W
1
∞
MPE = 100 ⋅
๐‘ฆ๐‘ฆW
๐‘›๐‘›%
'
W('" 8#
1
∞ ๐‘’๐‘’W%
MSE =
๐‘›๐‘›%
MAE =
W('" 8#
'
1
∞ |๐‘’๐‘’W |
๐‘›๐‘›%
W('" 8#
'
๐‘’๐‘’W
1
∞ Ÿ Ÿ
MAPE = 100 ⋅
๐‘ฆ๐‘ฆW
๐‘›๐‘›%
W('" 8#
Autoregressive Models
Notation
$\rho_k$     Lag $k$ autocorrelation
$r_k$        Lag $k$ sample autocorrelation
$\sigma^2$   Variance of white noise
$s^2$        Estimate of $\sigma^2$
$b_0$        Estimate of $\beta_0$
$b_1$        Estimate of $\beta_1$
$\bar{y}_-$  Sample mean of the first $n - 1$ observations
$\bar{y}_+$  Sample mean of the last $n - 1$ observations
Autocorrelation
$r_k = \frac{\sum_{t=k+1}^n (y_{t-k} - \bar{y})(y_t - \bar{y})}{\sum_{t=1}^n (y_t - \bar{y})^2}$
Testing Autocorrelation
test statistic $= r_k/se_{r_k}$ where $se_{r_k} = 1/\sqrt{n}$
$H_0: \rho_k = 0$ against $H_1: \rho_k \ne 0$
Reject $H_0$ if $|$test statistic$| \ge z_{1-\alpha/2}$
AR(1) Model
$Y_t = \beta_0 + \beta_1 Y_{t-1} + \varepsilon_t$
Assumptions
1. $\mathrm{E}[\varepsilon_t] = 0$
2. $\mathrm{Var}[\varepsilon_t] = \sigma^2$
3. $\mathrm{Cov}[\varepsilon_{t+k}, Y_t] = 0$ for $k > 0$
• If $\beta_1 = 0$, $Y_t$ follows a white noise process.
• If $\beta_1 = 1$, $Y_t$ follows a random walk process.
• If $-1 < \beta_1 < 1$, $Y_t$ is stationary.
Properties of Stationary AR(1) Model
$\mathrm{E}[Y_t] = \frac{\beta_0}{1 - \beta_1}$
$\mathrm{Var}[Y_t] = \frac{\sigma^2}{1 - \beta_1^2}$
$\rho_k = \beta_1^k$
Estimation
$b_1 = \frac{\sum_{t=2}^n (y_{t-1} - \bar{y}_-)(y_t - \bar{y}_+)}{\sum_{t=2}^n (y_{t-1} - \bar{y}_-)^2} \approx r_1$
$b_0 = \bar{y}_+ - b_1\bar{y}_- \approx \bar{y}(1 - r_1)$
$s^2 = \frac{\sum_{t=2}^n (e_t - \bar{e})^2}{n - 3}$
$\widehat{\mathrm{Var}}[Y_t] = \frac{s^2}{1 - b_1^2}$
Smoothing and Predictions
$\hat{y}_t = b_0 + b_1 y_{t-1}, \quad 2 \le t \le n$
$\hat{y}_{n+l} = \begin{cases} b_0 + b_1 y_{n+l-1}, & l = 1 \\ b_0 + b_1 \hat{y}_{n+l-1}, & l > 1 \end{cases}$
$se_{\hat{y}_{n+l}} = s\sqrt{1 + b_1^2 + b_1^4 + \cdots + b_1^{2(l-1)}}$
100$k$% prediction interval for $y_{n+l}$ is $\hat{y}_{n+l} \pm t_{(1-k)/2,\, n-3} \cdot se_{\hat{y}_{n+l}}$
Other Time Series Models
Notation
$k$  Moving average length
$w$  Smoothing parameter
$g$  Seasonal base
$d$  No. of trigonometric functions
Smoothing with Moving Averages
$Y_t = \beta_0 + \varepsilon_t$
Smoothing
$\hat{s}_t = \frac{y_t + y_{t-1} + \cdots + y_{t-k+1}}{k}$
$\hat{s}_t = \hat{s}_{t-1} + \frac{y_t - y_{t-k}}{k}, \quad k = 1, 2, \dots$
Predictions
$b_0 = \hat{s}_n$
$\hat{y}_{n+l} = b_0$
Double Smoothing with Moving Averages
$Y_t = \beta_0 + \beta_1 t + \varepsilon_t$
Smoothing
$\hat{s}_t^{(2)} = \frac{\hat{s}_t + \hat{s}_{t-1} + \cdots + \hat{s}_{t-k+1}}{k}$
$\hat{s}_t^{(2)} = \hat{s}_{t-1}^{(2)} + \frac{\hat{s}_t - \hat{s}_{t-k}}{k}, \quad k = 1, 2, \dots$
Predictions
$b_0 = \hat{s}_n$
$b_1 = \frac{2\left(\hat{s}_n - \hat{s}_n^{(2)}\right)}{k - 1}$
$\hat{y}_{n+l} = b_0 + b_1 \cdot l$
๐‘๐‘4 = ๐‘ฆ๐‘ฆe8 − ๐‘๐‘# ๐‘ฆ๐‘ฆe" ≈ ๐‘ฆ๐‘ฆe(1 − ๐‘Ÿ๐‘Ÿ# )
∑'W(%(๐‘’๐‘’W − ๐‘’๐‘’ฬ…)%
๐‘ ๐‘  % =
๐‘›๐‘› − 3
๐‘ ๐‘  %
Ö [๐‘Œ๐‘ŒW ] =
Var
1 − ๐‘๐‘#%
Exponential Smoothing
$Y_t = \beta_0 + \varepsilon_t$
Smoothing
$\hat{s}_t = (1-w)(y_t + wy_{t-1} + \cdots + w^t y_0)$
$\hat{s}_t = (1-w)y_t + w\hat{s}_{t-1}, \quad 0 \le w < 1$
The value of $w$ is determined by minimizing $SS(w) = \sum_{t=1}^n (y_t - \hat{s}_{t-1})^2$.
Predictions
$b_0 = \hat{s}_n$
$\hat{y}_{n+l} = b_0$
Double Exponential Smoothing
$Y_t = \beta_0 + \beta_1 t + \varepsilon_t$
Smoothing
$\hat{s}_t^{(2)} = (1-w)(\hat{s}_t + w\hat{s}_{t-1} + \cdots + w^t \hat{s}_0)$
$\hat{s}_t^{(2)} = (1-w)\hat{s}_t + w\hat{s}_{t-1}^{(2)}, \quad 0 \le w < 1$
Predictions
$b_0 = 2\hat{s}_n - \hat{s}_n^{(2)}$
$b_1 = \frac{1-w}{w}\left(\hat{s}_n - \hat{s}_n^{(2)}\right)$
$\hat{y}_{n+l} = b_0 + b_1 \cdot l$

Seasonal Time Series Models
Fixed Seasonal Effects – Trigonometric Functions
$S_t = \sum_{i=1}^d \left[\beta_{1,i}\sin(f_i t) + \beta_{2,i}\cos(f_i t)\right]$
• $f_i = 2\pi i/g$
• $d \le g/2$

Seasonal Autoregressive Models, SAR(p)
$Y_t = \beta_0 + \beta_1 Y_{t-g} + \cdots + \beta_p Y_{t-pg} + \varepsilon_t$

Holt-Winter Seasonal Additive Model
$Y_t = \beta_0 + \beta_1 t + S_t + \varepsilon_t$
• $S_t = S_{t-g}$
• $\sum_{t=1}^g S_t = 0$

Unit Root Test
• A unit root test is used to evaluate the fit of a random walk model.
• A random walk model is a good fit if the time series possesses a unit root.
• The Dickey-Fuller test and augmented Dickey-Fuller test are two examples of unit root tests.

Volatility Models
$ARCH(p)$ Model
$\sigma_t^2 = \theta + \gamma_1\varepsilon_{t-1}^2 + \cdots + \gamma_p\varepsilon_{t-p}^2$
$GARCH(p, q)$ Model
$\sigma_t^2 = \theta + \gamma_1\varepsilon_{t-1}^2 + \cdots + \gamma_p\varepsilon_{t-p}^2 + \delta_1\sigma_{t-1}^2 + \cdots + \delta_q\sigma_{t-q}^2$
$\mathrm{Var}[\varepsilon_t] = \frac{\theta}{1 - \sum_{j=1}^p \gamma_j - \sum_{j=1}^q \delta_j}$
Assumptions
• $\theta > 0$
• $\gamma_j \ge 0$
• $\delta_j \ge 0$
• $\sum_{j=1}^p \gamma_j + \sum_{j=1}^q \delta_j < 1$
Key Ideas for Smoothing
• Smoothing is only appropriate for time series data without a linear trend.
• It is related to weighted least squares.
• A double smoothing procedure can be used to forecast time series data with a linear trend.
• Holt-Winter double exponential smoothing is a generalization of the double exponential smoothing.
DECISION TREES
Regression and Classification Trees
Notation
$R$        Region of predictor space
$n_m$      No. of observations in node $m$
$n_{m,c}$  No. of category $c$ observations in node $m$
$I$        Impurity
$E$        Classification error rate
$G$        Gini index
$D$        Cross entropy
$T$        Subtree
$|T|$      No. of terminal nodes in $T$
$\lambda$  Tuning parameter
Algorithm
1. Construct a large tree with $g$ terminal nodes using recursive binary splitting.
2. Obtain a sequence of best subtrees, as a function of $\lambda$, using cost complexity pruning.
3. Choose $\lambda$ by applying $k$-fold cross-validation. Select the $\lambda$ that results in the lowest cross-validation error.
4. The best subtree is the subtree created in step 2 with the selected $\lambda$ value.
Recursive Binary Splitting
Regression:
Minimize $\sum_{m=1}^g \sum_{i:\,\mathbf{x}_i \in R_m} \left(y_i - \bar{y}_{R_m}\right)^2$
Classification:
Minimize $\sum_{m=1}^g \frac{n_m}{n} \cdot I_m$
More Under Classification:
$\hat{p}_{m,c} = n_{m,c}/n_m$
$E_m = 1 - \max_c \hat{p}_{m,c}$
$G_m = \sum_{c=1}^w \hat{p}_{m,c}\left(1 - \hat{p}_{m,c}\right)$
$D_m = -\sum_{c=1}^w \hat{p}_{m,c} \ln\hat{p}_{m,c}$
deviance $= -2\sum_{m=1}^g \sum_{c=1}^w n_{m,c} \ln\hat{p}_{m,c}$
residual mean deviance $= \frac{\text{deviance}}{n - g}$
Cost Complexity Pruning
Regression:
Minimize $\sum_{m=1}^{|T|} \sum_{i:\,\mathbf{x}_i \in R_m} \left(y_i - \bar{y}_{R_m}\right)^2 + \lambda|T|$
Classification:
Minimize $\sum_{m=1}^{|T|} \frac{n_m}{n} \cdot I_m + \lambda|T|$

Key Ideas
• Terminal nodes or leaves represent the partitions of the predictor space.
• Internal nodes are points along the tree where splits occur.
• Terminal nodes do not have child nodes, but internal nodes do.
• Branches are lines that connect any two nodes.
• A decision tree with only one internal node is called a stump.
Advantages of Trees
• Easy to interpret and explain
• Can be presented visually
• Manage categorical variables without the
need of dummy variables
• Mimic human decision-making
Disadvantages of Trees
• Not robust
• Do not have the same degree of predictive
accuracy as other statistical methods
Multiple Trees
Bagging
1. Create $b$ bootstrap samples from the original training dataset.
2. Construct a decision tree for each bootstrap sample using recursive binary splitting.
3. Predict the response of a new observation by averaging the predictions (regression trees) or by using the most frequent category (classification trees) across all $b$ trees.
Random Forests
1. Create $b$ bootstrap samples from the original training dataset.
2. Construct a decision tree for each bootstrap sample using recursive binary splitting. At each split, a random subset of $k$ variables is considered.
3. Predict the response of a new observation by averaging the predictions (regression trees) or by using the most frequent category (classification trees) across all $b$ trees.
Properties (Random Forests)
• Bagging is a special case of random forests.
• Increasing $b$ does not cause overfitting.
• Decreasing $k$ reduces the correlation between predictions.
Boosting
Let $z_1$ be the actual response variable, $y$.
1. For $k = 1, 2, \dots, b$:
• Use recursive binary splitting to fit a tree with $d$ splits to the data with $z_k$ as the response.
• Update $z_k$ by subtracting $\lambda \cdot \hat{f}_k(\mathbf{x})$, i.e. let $z_{k+1} = z_k - \lambda \cdot \hat{f}_k(\mathbf{x})$.
2. Calculate the boosted model prediction as $\hat{f}(\mathbf{x}) = \sum_{k=1}^b \lambda \cdot \hat{f}_k(\mathbf{x})$ (sketched below).
Properties (Boosting)
• Increasing $b$ can cause overfitting.
• Boosting reduces bias.
• $d$ controls complexity of the boosted model.
• $\lambda$ controls the rate at which boosting learns.
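A minimal sketch of the boosting loop above (not part of the sheet); `fit_tree(X, z, d)` is an assumed helper returning a prediction function for a tree with $d$ splits, and `y` is assumed to be a numpy array:

```python
def boost(X, y, b, d, lam, fit_tree):
    z = y.astype(float)              # z_1 is the actual response
    trees = []
    for _ in range(b):
        f_k = fit_tree(X, z, d)      # fit a d-split tree to z_k
        z = z - lam * f_k(X)         # z_{k+1} = z_k - lam * f_k(x)
        trees.append(f_k)
    # boosted prediction: sum over k of lam * f_k(x)
    return lambda X_new: lam * sum(f(X_new) for f in trees)
```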
Properties (Bagging)
• Increasing $b$ does not cause overfitting.
• Bagging reduces variance.
• Out-of-bag error is a valid estimate of test error.
UNSUPERVISED LEARNING
Principal Components Analysis
Notation
$z, Z$         Principal component (score)
Subscript $m$  Index for principal components
$\phi$         Principal component loading
$x, X$         Centered explanatory variable
Principal Components
$z_m = \sum_{j=1}^p \phi_{j,m} x_j, \qquad z_{i,m} = \sum_{j=1}^p \phi_{j,m} x_{i,j}$
• $\sum_{j=1}^p \phi_{j,m}^2 = 1$
• $\sum_{j=1}^p \phi_{j,m} \cdot \phi_{j,u} = 0, \quad m \ne u$
Proportion of Variance Explained (PVE)
$\sum_{j=1}^p s_{x_j}^2 = \sum_{j=1}^p \frac{1}{n-1}\sum_{i=1}^n x_{i,j}^2$
$s_{z_m}^2 = \frac{1}{n-1}\sum_{i=1}^n z_{i,m}^2$
$\mathrm{PVE} = \frac{s_{z_m}^2}{\sum_{j=1}^p s_{x_j}^2}$
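A minimal sketch of the loadings, scores, and PVE above (not part of the sheet), taking loadings as eigenvectors of the sample covariance matrix of the centered data; signs of the loadings are arbitrary:

```python
import numpy as np

def pca_pve(X):
    Xc = X - X.mean(axis=0)                  # center each variable
    vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(vals)[::-1]           # sort by variance explained
    vals, vecs = vals[order], vecs[:, order]
    scores = Xc @ vecs                       # z_{i,m}
    return vecs, scores, vals / vals.sum()   # loadings, scores, PVE
```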
Cluster Analysis
Notation
$C$     Cluster containing indices
$W(C)$  Within-cluster variation of cluster
$|C|$   No. of observations in cluster

Euclidean Distance $= \sqrt{\sum_{j=1}^p \left(x_{i,j} - x_{m,j}\right)^2}$
$k$-Means Clustering
1. Randomly assign a cluster to each observation. This serves as the initial cluster assignments.
2. Calculate the centroid of each cluster.
3. For each observation, identify the closest centroid and reassign to that cluster.
4. Repeat steps 2 and 3 until the cluster assignments stop changing (see the sketch below).
$W(C_u) = \frac{1}{|C_u|}\sum_{i,m \in C_u}\sum_{j=1}^p \left(x_{i,j} - x_{m,j}\right)^2 = 2\sum_{i \in C_u}\sum_{j=1}^p \left(x_{i,j} - \bar{x}_{u,j}\right)^2$
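A minimal sketch of the four $k$-means steps above (not part of the sheet; it ignores the empty-cluster edge case):

```python
import numpy as np

def k_means(X, k, iters=100):
    labels = np.random.randint(k, size=len(X))        # step 1
    for _ in range(iters):
        centroids = np.array([X[labels == c].mean(axis=0)
                              for c in range(k)])     # step 2
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)             # step 3
        if (new_labels == labels).all():              # step 4
            break
        labels = new_labels
    return labels, centroids
```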
Key Ideas
• The variance explained by each subsequent principal component is always less than the variance explained by the previous principal component.
• All principal components are uncorrelated with one another.
• A dataset has $\min(n - 1, p)$ distinct principal components.
• The first $k$ principal component scores and loadings approximate the original dataset, $x_{i,j} \approx \sum_{m=1}^k z_{i,m}\phi_{j,m}$.
Hierarchical Clustering
1. Select the dissimilarity measure and linkage to be used. Treat each observation as its own cluster.
2. For $k = n, n-1, \dots, 2$:
• Compute the inter-cluster dissimilarity between all $k$ clusters.
• Examine all $\binom{k}{2}$ pairwise dissimilarities. The two clusters with the lowest inter-cluster dissimilarity are fused. The dissimilarity indicates the height in the dendrogram at which these two clusters join.
Linkage   Inter-cluster dissimilarity
Complete  The largest dissimilarity
Average   The arithmetic mean
Single    The smallest dissimilarity
Centroid  The dissimilarity between the cluster centroids
Key Ideas
• For $k$-means clustering, the algorithm needs to be repeated for each $k$.
• For hierarchical clustering, the algorithm only needs to be performed once for any number of clusters.
• The result of clustering depends on many parameters, such as:
  o Choice of $k$ in $k$-means clustering
  o Choice of number of clusters, linkage, and dissimilarity measure in hierarchical clustering
  o Choice to standardize variables
Principal Components Regression
$Y = \theta_0 + \theta_1 z_1 + \cdots + \theta_k z_k + \varepsilon$
• If $k = p$, then $\beta_j = \sum_{m=1}^k \theta_m \phi_{j,m}$.