
COMPARISON OF GLMs AND MACHINE LEARNING ALGORITHMS

2016 CAS SPRING MEETING

Scott Sobel, FCAS, MAAA, MSPA

March 22, 2016

© OLIVER WYMAN

What are the differences between statistics and machine learning?

[Diagram: in both statistics and machine learning, input variables X are mapped to a target Y]

Topics

1. Background

2. Analytical Process

3. Methodologies

4. Implementation

Section 1 Background

Linear Regression

• Earliest form of regression was the method of least squares, published by Legendre (1805) and used to determine the orbits of comets and planets around the Sun

• Form was: e(β) = Y − Xβ, where the β's were estimated by minimizing the sum of squared residuals Σe²

• Gauss (1809) introduced using the Normal distribution for the error terms

• Galton coined the term “regression” (1800s) to describe how the heights of children of tall parents tend to “regress down” and children of short parents tend to “regress up”. (“Regression towards the mean”).

• Current regression model form: Y = Xβ + ε = β₀ + β₁X₁ + β₂X₂ + ⋯ + ε

Source: https://en.wikipedia.org/wiki/Regression_analysis

Linear Regression

Current model form:

𝑌 = 𝛽₀ + 𝛽₁𝑋₁ + 𝛽₂𝑋₂ + ⋯ + 𝜀

Major variations:

• Logistic regression: Pr(𝑌 | 𝑋 = 𝑥) = 𝑒^(𝛽₀+𝛽₁𝑋₁+𝛽₂𝑋₂+⋯+𝜀) / (1 + 𝑒^(𝛽₀+𝛽₁𝑋₁+𝛽₂𝑋₂+⋯+𝜀))

• Generalized linear model: 𝑌 = 𝑔⁻¹(𝛽₀ + 𝛽₁𝑋₁ + 𝛽₂𝑋₂ + ⋯ + 𝜀)

Major assumptions:

• Error term e follows a distribution in the exponential family – Normal, exponential, Poisson, gamma, chi-squared, binomial, others

• Independence of observations and of the Xᵢ's, constant variance

• Linearity – a regression coefficient βᵢ describes the relationship between Xᵢ and Y, holding all other X's constant (see the fitting sketch below)
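As a concrete illustration of the GLM form 𝑌 = 𝑔⁻¹(𝛽₀ + 𝛽₁𝑋₁ + ⋯), the following is a minimal sketch that fits a Poisson GLM with a log link using statsmodels. The claim-count data and variable names are hypothetical, invented purely for illustration; only the library calls shown are assumed to be available.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical claim-count data: two predictors and a Poisson response
rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    "driver_age": rng.uniform(18, 80, n),
    "vehicle_age": rng.uniform(0, 20, n),
})
linear_pred = -2.0 - 0.01 * df["driver_age"] + 0.05 * df["vehicle_age"]
df["claim_count"] = rng.poisson(np.exp(linear_pred))   # Y = g^{-1}(Xb), with g = log

X = sm.add_constant(df[["driver_age", "vehicle_age"]])              # adds the intercept beta_0
model = sm.GLM(df["claim_count"], X, family=sm.families.Poisson())  # log link is the Poisson default
result = model.fit()
print(result.summary())   # coefficients, standard errors, deviance, AIC
```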

Machine Learning

Machine learning uses an algorithm to learn a function that maps input variables to the target output from training data:

Y = f(X)

– Specific model form is unknown

Statistical Learning = Parametric Models

– No matter how much data you throw at a parametric model, it won’t change its mind about how many parameters it needs

– Suited to simpler problems

Machine Learning = Nonparametric Models

– Feature selection is less important

– Requires fewer assumptions and can result in superior performance

– May require more training data

– Risk of overfitting, often more difficult to explain

Source: http://machinelearningmastery.com/parametric-and-nonparametric-machine-learning-algorithms/

Advancement of Machine Learning

1952 Arthur Samuel wrote the first computer learning program to play checkers that improved as it played.

1957 Frank Rosenblatt designed the first neural network.

1967 Nearest neighbor algorithm allowed computers to begin using basic pattern recognition.

1979 Students at Stanford University invent the “Stanford Cart” that navigated obstacles.

1981 Gerald Dejong introduced Explanation Based Learning (EBL), where a computer creates rules by discarding unimportant data.

1985 Terry Sejnowski invents NetTalk, which learns to pronounce words the same way a baby does.

1990s Scientists begin creating algorithms to learn from large amounts of data.

1997

IBM’s Deep Blue beats the world champion at chess.

2006 Geoffrey Hinton coined the term “deep learning” for algorithms that let computers distinguish objects and text in images and videos.

2010 Microsoft Kinect tracks 20 human features 30 times per second, allowing interaction with the computer via movements and gestures.

2011 Google Brain is developed, and its deep neural network can learn to discover and categorize objects much the way a cat does.

2011 IBM’s Watson beats its human competitors at Jeopardy.

2012

Google’s X Lab develops an algorithm to browse YouTube videos to identify ones containing cats.

2014 Facebook develops DeepFace, which is able to recognize people in photos to the same level as humans can.

2015 Amazon launches its own machine learning platform.

2015 Microsoft created Distributed Machine Learning Toolkit, enabling distribution of problems across multiple computers.

2016 Google’s artificial intelligence algorithm beats Lee Sedol at the Chinese board game Go. (Update! Lee wins a game!)

Source: http://www.datasciencecentral.com/profiles/blogs/a-short-history-of-machine-learning

Advancement of Machine Learning

Seeing, crawling, walking, learning, speaking, playing, driving, challenging humans…


Why is machine learning so popular right now?

4 reasons:

1. The field has matured both in terms of identity and in terms of methods and tools

2. There is an abundance of data available

3. There is an abundance of computation to run methods

4. There have been impressive results, increasing acceptance, respect, and competition

Stanford's Coursera machine learning course had more than 100,000 students express interest in its first year. That is crazy!

Resources + Ingredients + Tools + Desire = Popularity

Based on: http://machinelearningmastery.com/machine-learning-is-popular/?__s=yq1qzcnf67sfiuzmnvjf

Does machine learning leave statistical theory behind?

In his paper “Statistical Modeling: The Two Cultures”, Leo Breiman discusses two cultures that contrast traditional statistical theory and machine learning.

Traditional statistical theory, or data modeling, culture:

• Assumes data is generated by a stochastic process.

• Breiman argues this “has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems.”

Machine learning, or algorithmic modeling, culture:

• Has developed rapidly largely outside theoretical statistics

• Uses algorithmic models, treating the data mechanism as unknown

“If our goal is to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools.”

Source: http://cdn2.hubspot.net/hub/160602/file-18146220-pdf/docs/statistical_modeling_the_two_cultures.pdf

Does machine learning leave statistical theory behind?

(“Statistical Modeling: The Two Cultures”)

Data modeling

• Response = f(predictors, parameters, random noise)

• Parameters are estimated, random noise is what’s left over

• Model evaluation based on assumed fit to a statistical distribution

Algorithmic modeling

• Finds a function/algorithm f(x) to predict y

• Model evaluation based on predictive accuracy

Data modeling asks: how well can we fit the data using a given model?

Algorithmic modeling asks: how well can we fit a model to the given data?

Source: http://cdn2.hubspot.net/hub/160602/file-18146220-pdf/docs/statistical_modeling_the_two_cultures.pdf

Does machine learning leave statistical theory behind?

Origins

• Statistical Learning: subfield of statistics

• Machine Learning: subfield of artificial intelligence

Emphasis

• Statistical Learning: greater emphasis on model structure, interpretability, precision and uncertainty

• Machine Learning: greater emphasis on model flexibility, large-scale applications, and prediction accuracy

The distinction has become more and more blurred, with increasing cross-fertilization.

Source: “An Introduction to Statistical Learning with Applications in R”, James et al., 2013

Section 2 Analytical Process

Common analytical process

Source of image: https://leanpub.com/artofdatascience

In all algorithms, whether machine learning or statistical learning, you need to:

• Develop expectations

• Match expectations with data

• Collect the data

• Establish the specific modeling question

• Perform EDA, adjust the data

• Build model(s)

– Make assumptions

– Define “best fit” function

– Recognize concerns

– Evaluate performance

– Produce predictions

• Interpret model(s)

• Communicate results

Differences

Input data & data adjustments

Statistical Learning
• Wants “tidy data”
• Can easily handle missing values

Machine Learning
• Generally wants “tidy data”, but some algorithms are designed for unstructured data
• Does not handle missing values well

Modeling assumptions

Statistical Learning
• Assumes linear relationships between Y and the input X's
• Sensitive to multicollinearity
• Assumes residuals follow a distribution in the exponential family

Machine Learning
• Looks for patterns; does not assume linearity
• Generally more forgiving about multicollinearity
• Does not assume data follows a certain distribution

Differences

Minimization task

Statistical Learning
• Minimizes squared error

Machine Learning
• Minimizes a defined “loss” function (see the sketch after this table)

Concerns

Statistical Learning
• Feature selection very important
• Underfitting & overfitting
• Multicollinearity
• External confounding effects
• Patterns in the residuals
• Missed nonlinear relationships

Machine Learning
• Feature selection not as crucial
• Overfitting! (Bias vs. variance)
• General lack of transparency

Model evaluation

Statistical Learning
• Statistically-based goodness-of-fit: Adj. R², F-test, p-values, AIC/BIC, confidence intervals
• Residual analysis
• Can use cross-validation and/or hold-out data testing

Machine Learning
• Uses cross-validation and/or hold-out data testing as measures of convergence
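A minimal sketch, not from the presentation, contrasting the two minimization columns above: statistical learning typically minimizes squared error, while a machine learning method can minimize whatever loss function is defined for the problem. The data values are invented for illustration.

```python
import numpy as np

y_true = np.array([1.0, 0.0, 3.0, 2.0])   # hypothetical observed values
y_pred = np.array([0.8, 0.4, 2.5, 2.2])   # hypothetical model predictions

# Squared-error loss (the OLS objective)
squared_error = np.mean((y_true - y_pred) ** 2)

# An alternative loss, e.g. absolute error, which an algorithm could be asked
# to minimize instead if it better matches the business goal
absolute_error = np.mean(np.abs(y_true - y_pred))

print(f"mean squared error:  {squared_error:.3f}")
print(f"mean absolute error: {absolute_error:.3f}")
```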

Some concepts

Training Error

– Measure of how well the model performs on the input data

– Not the most important thing

Validation Error

– Measure of how well the model performs on unseen data

– More important than training error, it indicates generalization performance

Signal

– Data representative of the true underlying process

Noise

– Data not representative of the underlying process

Overfitting

– Degree to which the model is influenced by noise

– Results from an overly complex model

– Undesirable for good generalization/prediction

[Figure: underfitting vs. “just right” vs. overfitting]
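The training-versus-validation-error idea above can be made concrete with a small, hypothetical sketch: fit polynomials of increasing degree and watch training error keep falling while validation error eventually rises as the model starts chasing noise. The data, degrees, and function names are illustrative only; scikit-learn and NumPy are assumed to be available.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 80)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.3, 80)   # signal + noise

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in (1, 3, 12):
    poly = PolynomialFeatures(degree)
    model = LinearRegression().fit(poly.fit_transform(X_train), y_train)
    train_err = mean_squared_error(y_train, model.predict(poly.transform(X_train)))
    val_err = mean_squared_error(y_val, model.predict(poly.transform(X_val)))
    # Training error keeps improving with degree; validation error shows
    # underfitting (degree 1), "just right" (degree 3), and overfitting (degree 12)
    print(f"degree {degree:2d}: training MSE = {train_err:.3f}, validation MSE = {val_err:.3f}")
```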

Section 3 Methodologies

Methodology types

Statistical Learning

– Regression

Machine Learning

– Clustering
  • Decision Trees
  • k-Means

– Instance-Based Algorithms
  • k-Nearest Neighbors
  • Bayesian Algorithms
  • Support Vector Machines (SVM)

– Artificial Neural Networks
  • Self-Organizing Maps (SOMs) / Kohonen maps

– Ensembles
  • Random Forests

Regression

General Model Form:

𝑌 = 𝛽₀ + 𝛽₁𝑋₁ + 𝛽₂𝑋₂ + ⋯ + 𝜀

Loss Function:

RSS = Σᵢ [𝑦ᵢ − (𝛽₀ + 𝛽₁𝑥₁ᵢ + 𝛽₂𝑥₂ᵢ + ⋯)]²

[Figure: scatterplot of Y against X with the fitted least-squares line]

Source of images: “An Introduction to Statistical Learning with Applications in R”, James et al., 2013

Regression

Models the linear relationship between independent input variables and the known target:

– General model structure assumed

– Can be used for classification

Variations:

– Ordinary least squares (OLS)

– Generalized linear models (GLMs)

– Logistic

– Multivariate adaptive regression splines (MARS)

– Locally estimated scatterplot smoothing (LOESS)

Model Evaluation:

– Residual analyses

– Many goodness-of-fit tests based on an assumed statistical distribution: RSS, Adj. R², F-test, AIC, BIC, p-values, standard errors/confidence intervals

– Should also use cross-validation and/or hold-out test data (see the sketch below)

Source of image: http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/
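A hypothetical sketch of the evaluation ideas listed above: an OLS fit whose summary reports Adj. R², the F-test, p-values, AIC/BIC and confidence intervals, followed by a cross-validated error estimate on the same data. The variable names and data are invented for illustration; statsmodels and scikit-learn are assumed.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Hypothetical data with two predictors
rng = np.random.default_rng(1)
df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
df["y"] = 1.0 + 2.0 * df["x1"] - 0.5 * df["x2"] + rng.normal(0, 1.0, 200)

# Distribution-based diagnostics: Adj. R^2, F-test, p-values, AIC/BIC, confidence intervals
ols = smf.ols("y ~ x1 + x2", data=df).fit()
print(ols.summary())

# Empirical check: 5-fold cross-validated R^2 on held-out folds
scores = cross_val_score(LinearRegression(), df[["x1", "x2"]], df["y"], cv=5, scoring="r2")
print("cross-validated R^2:", round(scores.mean(), 3))
```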

Regression

Concerns:

– Unknown data points

– Sensitive to outliers

– Nonlinearity between X’s and Y

– Correlations within X variables

– Sufficient spread across the values of each X

– Extrapolation beyond data

– Heteroscedasticity / patterns in residuals / errors don’t follow assumed distribution

Enhancements:

– Feature selection

Forward, backward, stepwise regression

LASSO, Ridge regression (see the sketch below)

– Variable interactions

CHAID, Machine learning algorithms

– Categorical value groupings

Source of image: http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/
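One of the enhancements listed above, LASSO, can be sketched as follows on hypothetical data (scikit-learn assumed): the L1 penalty shrinks some coefficients exactly to zero, effectively performing feature selection.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 10))                            # 10 candidate predictors
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(0, 1, 300)     # only two actually matter

lasso = LassoCV(cv=5).fit(X, y)        # penalty strength chosen by cross-validation
kept = np.flatnonzero(lasso.coef_ != 0)
print("selected predictors:", kept)    # typically [0, 3]; the rest are shrunk to zero
```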

Clustering: Decision Trees

General Model Form: a tree 𝑇 that partitions the predictor space into regions R₁, R₂, …, R_J, with a constant prediction in each region

Loss Function:

RSS = Σⱼ Σ_{i∈Rⱼ} (yᵢ − ŷ_{Rⱼ})²

[Figure: example tree and the corresponding axis-parallel partition into regions R₁, R₂, R₃]

Source: “An Introduction to Statistical Learning with Applications in R”, James et al., 2013

Decision Trees vs. Linear Models – Depends on underlying data structure

1. True linear boundary: linear model fits perfectly

2. True nonlinear boundary: linear model fits poorly

3. True linear boundary: flexible decision tree can approximate

4. True nonlinear boundary: simple decision tree fits perfectly

Source: “An Introduction to Statistical Learning with Applications in R”, James et al., 2013

Clustering: Decision Trees

Stratifying or segmenting predictor space into simple axis-parallel regions

– Specific model form is data-driven

– Can be used for classification or regression

– Very fast but typically not competitive with better learning approaches, especially ensembles

Variations:

– Classification and Regression Tree (CART), Iterative Dichotomiser 3 (ID3), C4.5 and C5.0

– Chi-squared Automatic Interaction Detection (CHAID)

– Conditional Decision Trees

Model Evaluation – goodness-of-fit tests based on empirical performance:

– Confusion matrix for accuracy/misclassification

– Between vs. within variance

– Cross-validation and/or hold-out test data

Source of image: http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/
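A minimal, hypothetical sketch of a CART-style tree and the confusion-matrix evaluation mentioned above (scikit-learn assumed; the built-in breast cancer dataset stands in for real data).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import confusion_matrix

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# max_depth limits tree size (a simple form of pruning against overfitting)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

print(export_text(tree))                                # the learned if/then splits
print(confusion_matrix(y_test, tree.predict(X_test)))   # accuracy/misclassification on hold-out data
```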

Clustering: Decision Trees

Concerns:

– Overfitting – use pruning

– Sensitive to data structure, not as much to outliers

– Local minima & maxima

– Small variations in experiments may lead to seemingly dramatically different decision trees – test by similarity of predictions

Enhancements:

– Variable interactions – can create curved boundaries in addition to axis-parallel

– Categorical value groupings – how to handle low credibility?

– Ensembling

[Figure: example decision-tree split and the resulting regions R₁ and R₂ in (X₁, X₂) space]

Source of image: http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/

Clustering: k-Means

General Model Form: assign each observation to the cluster with the nearest centroid

Loss Function: variance or distance measure (minimize within-cluster variance), e.g. Σₖ Σ_{i∈Cₖ} ‖xᵢ − μₖ‖²

Source: “The Elements of Statistical Learning”, Hastie et al., 2nd ed., 2013

Clustering: k-Means

Finds k clusters based on a measure of “closeness”:

– Not necessarily axis parallel regions

– Specific model form is data-driven

– Diagnostics and modeler preference help determine k

– Typically can only use numeric predictors

Variations:

– k-Medians

– k-Modes for categorical predictors

– Hierarchical clustering

Model Evaluation

—Goodness-of-fit tests based on empirical performance:

– Confusion matrix

– Between vs. within variance

– Cross-validation and/or hold-out test data

Source of image: http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/
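A hypothetical k-means sketch along the lines above (scikit-learn assumed). The within-cluster variance being minimized is reported by scikit-learn as `inertia_`, and trying several values of k is one common diagnostic for choosing k.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
# Hypothetical two-dimensional features with three latent groups
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in ((0, 0), (3, 3), (0, 4))])

for k in (2, 3, 4, 5):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the within-cluster sum of squares the algorithm minimizes
    print(f"k = {k}: within-cluster variance = {km.inertia_:.1f}")

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)  # final cluster assignments
```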

Clustering: k-Means

Concerns:

– Overfitting

– Sensitive to data structure & outliers

– Sensitive to lopsided cluster variances such as with insurance data, especially fraud

– Local minima & maxima

Enhancements:

– Feature selection

– Variable interactions – can create additional feature space dimensions

– Ensembling

Source of image: http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/

Instance-Based Algorithm: k-Nearest Neighbors

General Model Form:

Loss Function: % Misclassified

[Figure: k-NN decision boundaries in the (X₁, X₂) plane]

Source: “An Introduction to Statistical Learning with Applications in R”, James et al., 2013

Instance-Based Algorithm: k-Nearest Neighbors

Finds separation boundary based on k neighboring data points:

– Certainly not axis-parallel regions

– Specific model form is data-driven

– Diagnostics and modeler preference help determine k

– Can be used for classification or regression

Model Evaluation —Goodness-of-fit tests based on empirical performance:

– Confusion matrix

– RMSE

– Cross-validation and/or hold-out test data

Concerns:

– Sensitive to data structure, can overfit

– Local minima & maxima

– Sensitive to lopsided cluster variances such as with insurance data, esp. fraud

Enhancements:

– Feature selection

– Extended k-NN (ENN)

– Weighted distances

Source of image: http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/
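A hypothetical k-nearest-neighbors sketch reflecting the points above (scikit-learn assumed): k is chosen by comparing hold-out performance, and distance weighting is one of the enhancements mentioned. The built-in iris dataset stands in for real data.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

for k in (1, 5, 15):
    knn = KNeighborsClassifier(n_neighbors=k, weights="distance").fit(X_train, y_train)
    # Small k can overfit the training data; larger k smooths the boundary
    print(f"k = {k}: hold-out accuracy = {knn.score(X_test, y_test):.3f}")

print(confusion_matrix(y_test, knn.predict(X_test)))   # confusion matrix for the last model
```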

Ensembling

Produces a more robust model by combining many base learners:

– Think of diversifying risk in Finance

– Prediction quality benefits from averaging many models

– Complexity increases but variance decreases faster as wayward errors cancel out

– Base learners can be simple/weak, decorrelation is key to reducing the variance

Variations:

– Boosting

– Bootstrapped Aggregation (Bagging)

– AdaBoost

– Stacked Generalization (blending)

– Gradient Boosting Machines (GBM)

– Gradient Boosted Regression Trees (GBRT)

– Random Forests

– …and essentially unlimited further variations

Source of image: http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/

Ensembling: Random Forests

Bootstrapping

– A way of creating additional training data to build a more stable model

– Build a dataset of the same size as your training data by repeatedly sampling with replacement from your original training data

– Some data points may be included more than once while others will be left out
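A tiny, hypothetical sketch of the bootstrapping step described above: sample row indices with replacement, so some observations repeat and others are left out (“out-of-bag”). The data size is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10                                           # pretend training set of 10 observations
boot_idx = rng.integers(0, n, size=n)            # sample n row indices with replacement
out_of_bag = np.setdiff1d(np.arange(n), boot_idx)

print("bootstrap sample indices:", np.sort(boot_idx))   # some indices appear more than once
print("left-out (out-of-bag) indices:", out_of_bag)     # on average roughly a third of the rows
```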

Random Forests Algorithm

1. Build many deep decision trees on bootstrapped training datasets

2. At each split, consider only a random subset of the predictors (√p is typical)

3. Average the predictions from all individual decision trees

– Random forests generally perform very well

– Very robust

– A go-to favorite among many data scientists
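A hypothetical sketch of the three steps above using scikit-learn's RandomForestRegressor: many trees grown on bootstrapped samples, a random subset of predictors considered at each split (√p via max_features="sqrt"), and predictions averaged across trees. Data and parameter values are illustrative only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(5)
X = rng.normal(size=(1000, 9))                                   # 9 predictors, sqrt(p) = 3
y = X[:, 0] ** 2 + np.sin(X[:, 1]) + 0.5 * X[:, 2] + rng.normal(0, 0.3, 1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

rf = RandomForestRegressor(
    n_estimators=500,        # 1. many trees on bootstrapped samples (bootstrap=True is the default)
    max_features="sqrt",     # 2. random subset of predictors considered at each split
    random_state=0,
).fit(X_train, y_train)

pred = rf.predict(X_test)    # 3. predictions averaged over all trees
print("hold-out MSE:", round(mean_squared_error(y_test, pred), 3))
```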

Ensembling: Random Forests

Concerns:

– Correlation of individual trees

– Can overfit – literature is divided on this

– Not as sensitive to data structure, outliers, local minima or maxima

– Transparency, communicability, feature contributions

Enhancements:

– Depth of decision trees

– Number of training iterations

– Variable interactions – can create curved boundaries in addition to axis-parallel

– Categorical value groupings – how to handle low credibility?

– Confidence intervals for input variables

Source of image: http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/

Section 4 Implementation

Implementation

GLMs

– Individual tables of factors, one for each input variable

– Easy to interpret

– Simply replace current factors in same structure of ratepages

Decision Trees

– Set of if/then conditions that define groupings/tiers/segments

– Easy to interpret

– New step in rating algorithm

k-Means

– Based on distance to cluster centroids

– Simply assign new data point to closest cluster centroid

– Because clustering is unsupervised, it can be more difficult to accept, since human intuition about the groupings has not yet formed

– Could be used for target marketing to customer lifestyles

– Can enhance variable groupings

Variable X    Factor
A             1.200
B             1.000
C             0.800

if (condition) group 1
else if (condition) group 2
…
else group n
end if
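A hypothetical sketch of how the two implementation patterns above might look in a rating step: the GLM becomes a simple factor-table lookup in the existing rate-page structure, while a fitted decision tree becomes a set of if/then conditions assigning a group. The factor values and conditions are illustrative, not from the presentation.

```python
# GLM implementation: factor-table lookup, same structure as existing rate pages
glm_factors = {"A": 1.200, "B": 1.000, "C": 0.800}   # illustrative factors for Variable X

def glm_rating_step(variable_x: str) -> float:
    return glm_factors[variable_x]

# Decision-tree implementation: if/then conditions defining groups/tiers/segments
def tree_rating_step(driver_age: float, prior_claims: int) -> int:
    if driver_age < 25:
        return 1 if prior_claims > 0 else 2   # group 1 or group 2
    elif prior_claims > 2:
        return 3
    else:
        return 4                              # "else group n"

print(glm_rating_step("B"))       # 1.000
print(tree_rating_step(22, 1))    # group 1
```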

Implementation

k-Nearest Neighbors

– Assign new data points according to decision boundaries

– Supervised, but nonlinear and multidimensional, so assignments may not be intuitive and may seem arbitrary near boundaries

– New step in rating algorithm

Ensembling

– Based on learned composite set of rules to assign prediction

– May be non-intuitive, but influence of variables can be determined

– Can significantly improve accuracy

– Calculation of confidence intervals for Random Forests is possible

– New step in rating algorithm

Score        Factor
1–100        1.200
101–200      1.000
…            …
901–1000     0.800

Testing & explaining impacts

GLMs

– Re-rate inforce book, calculate distribution of changes as some DOIs require

– Examine extremes, but unclear how to fix – temper selections (how?) or cap changes?

– Multiple factors may change, but can test impacts by hand

– Easy to explain to customer

Decision Trees

– Impacts are few and discrete

– Easy to see and temper extremes

– Could be large jumps to different segments

– May be difficult to explain

k-Means

– Re-groupings – simpler to determine distribution of changes

– Easy to see and temper transition to new grouping

– Could be large jumps to different clusters

– May be difficult to explain

Testing & explaining impacts

k-Nearest Neighbors

– Could be large jumps to different classes for points near the border

– May be difficult to explain

Ensembling: Random Forests

– Provides a granular and gradual spectrum of indications, often grouped (deciles, etc.)

– Easy to see and temper extremes

– May be difficult to explain – “natural” progressions could cause unexpected, undesirable, and compounded impacts

So what's the verdict?

Theoretical

GLMs
✓ Vast literature
✗ Linear, dependence on exponential distribution – models real-world relationships?
✗ Feature selection, outliers, missing data, multicollinearity very important
✗ Underfitting & overfitting

Machine Learning
✓ Vast literature
✓ Nonlinear, nonparametric – could be more accurate
✓ Less concern over feature selection, outliers, missing data, multicollinearity
OK – risk of overfitting, but this can be controlled

Practical

GLMs
✓ Easier to interpret
✓ Accepted by DOIs
✓ No changes to infrastructure
✓ Easy to productionalize
✗/✓ May not be as accurate, but may be the better choice if you can't take advantage of full machine learning indications

Machine Learning
✗ May be difficult to interpret
✓/✗ Beginning to be accepted (remember the early stages of filing CAT models?)
✗ Most likely need a new rating step
✗ Most likely need a rules engine
✓/? May be more accurate if you can change your infrastructure to take advantage

The practical issues are rapidly going away in favor of machine learning…

Scott Sobel, FCAS, MAAA, MSPA

Oliver Wyman Actuarial Consulting, Inc.

325 John H. McConnell Blvd

Columbus, OH 43215
phone: +1 614 227 6225
cell: +1 803 429 3153
scott.sobel@oliverwyman.com

www.oliverwyman.com

@ssobel2010
