Statistics vs. Machine Learning
1. Background
2. Analytical Process
3. Methodologies
4. Implementation
• The earliest form of regression was the method of least squares, published by Legendre (1805) and used to determine the orbits of comets and planets around the Sun
• The form was e(β) = Y − Xβ, where the β's were estimated by minimizing the sum of squared residuals Σe²
• Gauss (1809) introduced the use of the Normal distribution for the error terms
• Galton coined the term "regression" in the 1880s to describe how the heights of children of tall parents tend to "regress down" toward the average, while children of short parents tend to "regress up" ("regression towards the mean")
• Current regression model form:
Y = βX + ε = β₀ + β₁X₁ + β₂X₂ + ⋯ + ε
Source: https://en.wikipedia.org/wiki/Regression_analysis
Current model form:
Y = β₀ + β₁X₁ + β₂X₂ + ⋯ + ε
Major variations:
• Logistic regression: Pr(Y = 1 | X = x) = e^(β₀ + β₁X₁ + β₂X₂ + ⋯) / (1 + e^(β₀ + β₁X₁ + β₂X₂ + ⋯))
• Generalized linear model: Y = g⁻¹(β₀ + β₁X₁ + β₂X₂ + ⋯) + ε
Major assumptions:
• The error term ε follows a distribution in the exponential family – Normal, exponential, Poisson, gamma, chi-squared, binomial, among others
• Independence of the observations and of the Xᵢ's; constant variance
• Linearity – each regression coefficient βᵢ describes the relationship between Xᵢ and Y, holding all other X's constant
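As a hedged illustration of the linear and logistic model forms above, here is a minimal sketch using scikit-learn on synthetic data; the variable names and coefficient values are invented for the example and are not taken from the slides.

```python
# Minimal sketch (assumed setup): fit the linear and logistic forms above
# on simulated data with scikit-learn. All coefficients below are made up.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
n = 1_000
X = rng.normal(size=(n, 2))                      # two predictors, X1 and X2

# Linear model: Y = b0 + b1*X1 + b2*X2 + error
y_lin = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=n)
ols = LinearRegression().fit(X, y_lin)
print("OLS intercept, coefficients:", ols.intercept_, ols.coef_)

# Logistic model: Pr(Y=1|X) = exp(linear predictor) / (1 + exp(linear predictor))
p = 1.0 / (1.0 + np.exp(-(0.5 + 1.5 * X[:, 0] - 1.0 * X[:, 1])))
y_bin = rng.binomial(1, p)
logit = LogisticRegression().fit(X, y_bin)
print("Logistic intercept, coefficients:", logit.intercept_, logit.coef_)
```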
Machine learning uses an algorithm to learn, from training data, a function that maps input variables to the target output:
Y = f(X)
– Specific model form is unknown
Statistical Learning = Parametric Models
– No matter how much data you throw at a parametric model, it won’t change its mind about how many parameters it needs
– Suited to simpler problems
Machine Learning = Nonparametric Models
– Feature selection is less important
– Requires fewer assumptions and can result in superior performance
– May require more training data
– Risk of overfitting, often more difficult to explain
Source: http://machinelearningmastery.com/parametric-and-nonparametric-machine-learning-algorithms/
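To make the parametric vs. nonparametric contrast concrete, here is a minimal sketch (an assumed example, using scikit-learn) comparing a fixed-form linear regression with a nonparametric k-nearest-neighbors regressor on a deliberately nonlinear target; the data-generating function is invented for illustration.

```python
# Sketch: parametric model (fixed number of parameters) vs. nonparametric model
# (complexity grows with the data). The true function here is hypothetical.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(2_000, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=len(X))   # nonlinear signal + noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)

linear = LinearRegression().fit(X_tr, y_tr)                # parametric: 2 parameters
knn = KNeighborsRegressor(n_neighbors=25).fit(X_tr, y_tr)  # nonparametric: keeps the data

print("Linear test MSE:", mean_squared_error(y_te, linear.predict(X_te)))
print("k-NN   test MSE:", mean_squared_error(y_te, knn.predict(X_te)))
```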
1952 Arthur Samuel writes the first computer learning program, a checkers player that improves as it plays.
1957 Frank Rosenblatt designs the first neural network.
1967 The nearest neighbor algorithm allows computers to begin using basic pattern recognition.
1979 Students at Stanford University invent the "Stanford Cart," which navigates obstacles.
1981 Gerald Dejong introduces Explanation Based Learning (EBL), in which a computer creates rules by discarding unimportant data.
1985 Terry Sejnowski invents NetTalk, which learns to pronounce words the same way a baby does.
1990s Scientists begin creating algorithms to learn from large amounts of data.
1997 IBM's Deep Blue beats the world champion at chess.
2006 Geoffrey Hinton coins the term "deep learning" for algorithms that let computers distinguish objects and text in images and videos.
2010 Microsoft Kinect tracks 20 human features at 30 times per second, allowing interaction with the computer via movements and gestures.
2011 Google Brain is developed; its deep neural network can learn to discover and categorize objects much the way a cat does.
2011 IBM's Watson beats its human competitors at Jeopardy.
2012 Google's X Lab develops an algorithm that browses YouTube videos to identify the ones containing cats.
2014 Facebook develops DeepFace, which can recognize people in photos as well as humans can.
2015 Amazon launches its own machine learning platform.
2015 Microsoft creates the Distributed Machine Learning Toolkit, enabling distribution of problems across multiple computers.
2016 Google's artificial intelligence algorithm beats Lee Sedol at the board game Go. (Update: Lee wins a game!)
Source: http://www.datasciencecentral.com/profiles/blogs/a-short-history-of-machine-learning
Seeing, crawling, walking, learning, speaking, playing, driving, challenging humans…
Four reasons machine learning has become so popular:
1. The field has matured both in terms of identity and in terms of methods and tools
2. There is an abundance of data available
3. There is an abundance of computation to run methods
4. There have been impressive results, increasing acceptance, respect, and competition
Stanford's Coursera machine learning course had more than 100,000 students expressing interest in its first year. That is crazy!
Resources + Ingredients + Tools + Desire = Popularity
Based on: http://machinelearningmastery.com/machine-learning-is-popular/?__s=yq1qzcnf67sfiuzmnvjf
In his paper "Statistical Modeling: The Two Cultures," Leo Breiman discusses two cultures that contrast traditional statistical theory and machine learning.
Traditional statistical theory, or data modeling, culture:
• Assumes data is generated by a stochastic process.
• Breiman argues this "has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems."
Machine learning, or algorithmic modeling, culture:
• Has developed rapidly largely outside theoretical statistics
• Uses algorithmic models, treating the data mechanism as unknown
"If our goal is to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools."
Source: http://cdn2.hubspot.net/hub/160602/file-18146220-pdf/docs/statistical_modeling_the_two_cultures.pdf
(“Statistical Modeling: The Two Cultures”)
Data modeling
• Response = f(predictors, parameters, random noise)
• Parameters are estimated, random noise is what’s left over
• Model evaluation based on assumed fit to a statistical distribution
Algorithmic modeling
• Finds a function/algorithm f(x) to predict y
• Model evaluation based on predictive accuracy
Data modeling asks: how well can we fit the data using a given model?
Algorithmic modeling asks: how well can we fit a model to the given data?
Source: http://cdn2.hubspot.net/hub/160602/file-18146220-pdf/docs/statistical_modeling_the_two_cultures.pdf
Origins:
• Statistical Learning – a subfield of statistics
• Machine Learning – a subfield of artificial intelligence
Emphasis:
• Statistical Learning – greater emphasis on model structure, interpretability, precision, and uncertainty
• Machine Learning – greater emphasis on model flexibility, large-scale applications, and prediction accuracy
The distinction has become more and more blurred, with increasing cross-fertilization.
Source: “An Introduction to Statistical Learning with Applications in R”, James et al., 2013
Source of image: https://leanpub.com/artofdatascience
In all algorithms, whether machine learning or statistical learning, you need to:
• Develop expectations
• Collect the data
• Match expectations with data
• Establish the specific modeling question
• Perform EDA, adjust the data
• Build model(s)
– Make assumptions
– Define “best fit” function
– Recognize concerns
– Evaluate
– Produce predictions
• Interpret model(s)
• Communicate results
Statistical Learning
• Input data: wants "tidy data"
• Data adjustments: can easily handle missing values
• Modeling assumptions:
– Assumes linear relationships between Y and the input X's
– Sensitive to multicollinearity
– Assumes residuals follow a distribution in the exponential family

Machine Learning
• Input data: generally wants "tidy data," but some algorithms are designed for unstructured data
• Data adjustments: does not handle missing values well
• Modeling assumptions:
– Looks for patterns; does not assume linearity
– Generally more forgiving about multicollinearity
– Does not assume the data follow a certain distribution
Statistical Learning
• Minimization task: minimizes squared error
• Concerns:
– Feature selection very important
– Underfitting & overfitting
– Multicollinearity
– External confounding effects
– Patterns in the residuals
– Missed nonlinear relationships
• Model evaluation:
– Statistically-based goodness-of-fit: Adj R², F-test, p-values, AIC/BIC, confidence intervals
– Residual analysis
– Can use cross-validation and/or hold-out data testing

Machine Learning
• Minimization task: minimizes a defined "loss" function
• Concerns:
– Feature selection not as crucial
– Overfitting! (bias vs. variance)
– General lack of transparency
• Model evaluation:
– Uses cross-validation and/or hold-out data testing as measures of convergence
Training Error
– Measure of how well the model performs on the input data
– Not the most important thing
Validation Error
– Measure of how well the model performs on unseen data
– More important than training error, it indicates generalization performance
Signal
– Data representative of the true underlying process
– Noise: data that is not representative of the true underlying process
Overfitting
– Degree to which the model is influenced by noise
– Results from an overly complex model
– Undesirable for good generalization/prediction
(Figure: underfit vs. "just right" vs. overfit)
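As a rough illustration of training vs. validation error, the sketch below (an assumed example, not from the slides) fits polynomials of increasing degree with NumPy; training error keeps falling, while validation error eventually rises once the model starts chasing noise.

```python
# Sketch: overfitting shows up as a gap between training and validation error.
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(-1, 1, size=200)
y = np.sin(3 * x) + rng.normal(scale=0.3, size=x.size)    # signal + noise (hypothetical)

# Split into training and validation sets
idx = rng.permutation(x.size)
tr, va = idx[:150], idx[150:]

for degree in (1, 3, 10, 15):
    coefs = np.polyfit(x[tr], y[tr], deg=degree)           # fit on training data only
    mse_tr = np.mean((np.polyval(coefs, x[tr]) - y[tr]) ** 2)
    mse_va = np.mean((np.polyval(coefs, x[va]) - y[va]) ** 2)
    print(f"degree {degree:2d}: train MSE {mse_tr:.3f}, validation MSE {mse_va:.3f}")
```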
Statistical Learning
– Regression
Machine Learning
– Clustering: k-Means
– Decision Trees
– Instance-Based Algorithms: k-Nearest Neighbors
– Bayesian Algorithms
– Support Vector Machines (SVM)
– Artificial Neural Networks: Self-Organizing Maps (SOMs) / Kohonen maps
– Ensembles: Random Forests
General Model Form:
Y = β₀ + β₁X₁ + β₂X₂ + ⋯ + ε
Loss Function:
RSS = Σᵢ [yᵢ − (β₀ + β₁x₁ᵢ + β₂x₂ᵢ + ⋯)]²
(Plot: fitted regression line of Y against X)
Source of images: “An Introduction to Statistical Learning with Applications in R”, James et al., 2013
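To connect the RSS loss to an actual fit, here is a minimal sketch (assumed example) that solves the least-squares problem with NumPy and then evaluates the residual sum of squares; the simulated data and coefficients are invented for illustration.

```python
# Sketch: estimate beta by minimizing RSS, then compute RSS explicitly.
import numpy as np

rng = np.random.default_rng(7)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])  # intercept, X1, X2
beta_true = np.array([2.0, 1.5, -0.7])                                     # hypothetical coefficients
y = X @ beta_true + rng.normal(scale=0.5, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares solution (minimizes RSS)
residuals = y - X @ beta_hat
rss = np.sum(residuals ** 2)
print("estimated beta:", beta_hat)
print("RSS:", rss)
```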
Models the independent linear relationships between the input variables and a known target:
– General model structure assumed
– Can be used for classification
Variations:
– Ordinary least squares (OLS)
– Generalized linear models (GLMs)
– Logistic
– Multivariate adaptive regression splines (MARS)
– Locally estimated scatterplot smoothing (LOESS)
Model Evaluation:
– Residual analyses
– Many goodness-of-fit tests based on assumption of statistical distribution:
RSS, Adj R², F-test, AIC, BIC, p-values, standard errors/confidence intervals
– Should also use cross-validation and/or hold-out test data
Source of image: http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/
Concerns:
– Unknown data points
– Sensitive to outliers
– Nonlinearity between X’s and Y
– Correlations within X variables
– Need sufficient spread across each X's values
– Extrapolation beyond data
– Heteroscedasticity / patterns in residuals / errors don’t follow assumed distribution
Enhancements:
– Feature selection
Forward, backward, stepwise regression
LASSO, Ridge regression (see the sketch below)
– Variable interactions
CHAID, Machine learning algorithms
– Categorical value groupings
Source of image: http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/
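As a hedged illustration of the LASSO and Ridge enhancements listed above, the sketch below (an assumed example using scikit-learn) fits both on synthetic data where only a few predictors matter, showing how LASSO shrinks irrelevant coefficients toward zero.

```python
# Sketch: LASSO tends to zero out irrelevant coefficients (feature selection);
# Ridge shrinks them but keeps all features. Data are simulated for illustration.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(3)
n, p = 500, 10
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [3.0, -2.0, 1.5]                 # only the first 3 predictors matter (hypothetical)
y = X @ beta + rng.normal(scale=1.0, size=n)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)
print("LASSO coefficients:", np.round(lasso.coef_, 2))
print("Ridge coefficients:", np.round(ridge.coef_, 2))
```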
General Model Form: split the predictor space into regions R₁, R₂, …, R_J and predict a constant in each region:
f(x) = Σⱼ ŷ(Rⱼ) · 1{x ∈ Rⱼ}, where ŷ(Rⱼ) is the mean response of the training observations in Rⱼ
Loss Function:
RSS = Σⱼ Σ(i ∈ Rⱼ) (yᵢ − ŷ(Rⱼ))²
(Figure: two alternative partitions of the predictor space into regions R₁, R₂, R₃)
Source: “An Introduction to Statistical Learning with Applications in R”, James et al., 2013
1. True linear boundary: linear model fits perfectly
2. True nonlinear boundary: linear model fits poorly
3. True linear boundary: flexible decision tree can approximate
4. True nonlinear boundary: simple decision tree fits perfectly
Source: "An Introduction to Statistical Learning with Applications in R", James et al., 2013
Stratifies or segments the predictor space into simple, axis-parallel regions:
– Specific model form is data-driven
– Can be used for classification or regression
– Very fast but typically not competitive with better learning approaches, especially ensembles
Variations:
– Classification and Regression Tree (CART), Iterative Dichotomiser 3 (ID3), C4.5 and C5.0
– Chi-squared Automatic Interaction Detection (CHAID)
– Conditional Decision Trees
Model Evaluation – goodness-of-fit tests based on empirical performance:
– Confusion matrix for accuracy/misclassification
– Between vs. within variance
– Cross-validation and/or hold-out test data
Source of image: http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/
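As an assumed illustration (not from the slides), the sketch below grows a small classification tree with scikit-learn and prints its if/then splits, which correspond to the axis-parallel regions described above.

```python
# Sketch: a shallow decision tree produces axis-parallel if/then rules.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0)   # limit depth to curb overfitting
tree.fit(data.data, data.target)

# Print the learned rules as nested if/then conditions
print(export_text(tree, feature_names=list(data.feature_names)))
```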
Concerns:
– Overfitting – use pruning
– Sensitive to data structure, not as much to outliers
– Local minima & maxima
– Small variations in experiments may lead to dramatically different-looking decision trees – test by comparing the similarity of their predictions
Enhancements:
– Variable interactions – can create curved boundaries in addition to axis-parallel
– Categorical value groupings – how to handle low credibility?
– Ensembling
Y < X
1
X
2
R
1
Y ≥ X
1
X
2
R
2
R
1
R
2
Source of image: http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/
General Model Form: assign each observation to one of k clusters C₁, …, Cₖ, each represented by its centroid μⱼ
Loss Function: variance or distance measure
(minimize the within-cluster variance, Σⱼ Σ(i ∈ Cⱼ) ‖xᵢ − μⱼ‖²)
Source: "The Elements of Statistical Learning", Hastie et al., 2nd ed., 2013
Finds k clusters based on a measure of “closeness”:
– Not necessarily axis parallel regions
– Specific model form is data-driven
– Diagnostics and modeler preference help determine k
– Typically can only use numeric predictors
Variations:
– k-Medians
– k-Modes for categorical predictors
– Hierarchical clustering
Model Evaluation – goodness-of-fit tests based on empirical performance:
– Confusion matrix
– Between vs. within variance
– Cross-validation and/or hold-out test data
Source of image: http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/
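Here is a minimal k-means sketch (assumed example with scikit-learn); the blob data and the choice of k = 3 are invented for illustration.

```python
# Sketch: k-means partitions points into k clusters by minimizing
# within-cluster variance around the cluster centroids.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("Cluster centroids:\n", kmeans.cluster_centers_)
print("Within-cluster sum of squares (inertia):", kmeans.inertia_)
print("First 10 cluster assignments:", kmeans.labels_[:10])
```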
Concerns:
– Overfitting
– Sensitive to data structure & outliers
– Sensitive to lopsided cluster variances such as with insurance data, especially fraud
– Local minima & maxima
Enhancements:
– Feature selection
– Variable interactions – can create additional feature space dimensions
– Ensembling
Source of image: http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/
General Model Form: predict a point's class by majority vote among its k nearest neighbors (or average them for regression)
Loss Function: % misclassified
(Figure: k-NN decision boundaries in the (X₁, X₂) plane)
Source: "An Introduction to Statistical Learning with Applications in R", James et al., 2013
Finds separation boundary based on k neighboring data points:
– Certainly not axis-parallel regions
– Specific model form is data driven
– Diagnostics and modeler preference help determine k
– Can be used for classification or regression
Model Evaluation – goodness-of-fit tests based on empirical performance:
– Confusion matrix
– RMSE
– Cross-validation and/or hold-out test data
Concerns:
– Sensitive to data structure, can overfit
– Local minima & maxima
– Sensitive to lopsided cluster variances such as with insurance data, esp. fraud
Enhancements:
– Feature selection
– Extended k-NN (ENN)
– Weighted distances
Source of image: http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/
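As an assumed illustration of choosing k by cross-validation (one of the diagnostics mentioned above), the sketch below scores a k-nearest-neighbors classifier for several values of k with scikit-learn; the dataset and candidate k values are arbitrary examples.

```python
# Sketch: pick k for k-NN by comparing cross-validated accuracy.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

for k in (1, 5, 15, 51):
    # Scale features first: k-NN is distance-based, so units matter
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, X, y, cv=5)
    print(f"k = {k:2d}: mean CV accuracy = {scores.mean():.3f}")
```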
Produces more robust model by combining many base learners:
– Think of diversifying risk in Finance
– Prediction quality benefits from averaging many models
– Complexity increases but variance decreases faster as wayward errors cancel out
– Base learners can be simple/weak, decorrelation is key to reducing the variance
Variations:
– Boosting
– Bootstrapped Aggregation (Bagging)
– AdaBoost
– Stacked Generalization (blending)
– Gradient Boosting Machines (GBM)
– Gradient Boosted Regression Trees (GBRT)
– Random Forests
– Unlimited
Source of image: http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/
Bootstrapping
– A way of creating additional training data to build a more stable model
– Build a dataset of the same size as your training data by repeatedly sampling with replacement from your original training data
– Some data points may be included more than once while others will be left out
Random Forests Algorithm (see the sketch below)
1. Build many deep decision trees on bootstrapped training datasets
2. At each split of each tree, consider only a random subset of the predictors (√p is typical)
3. Average the predictions from all of the individual decision trees
– Random forests generally perform very well
– Very robust
– A go-to favorite among many data scientists
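Below is a minimal sketch (assumed example) of the ideas above: one manual bootstrap sample, then a random forest fit with scikit-learn using roughly √p features per split; the dataset and settings are illustrative only.

```python
# Sketch: bootstrapping (sampling rows with replacement) and a random forest
# that decorrelates trees by considering a random subset of features per split.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# A single bootstrap sample: same size as the training data, drawn with replacement
rng = np.random.default_rng(0)
boot_idx = rng.integers(0, len(X_tr), size=len(X_tr))
print("Unique rows in one bootstrap sample:", len(np.unique(boot_idx)), "of", len(X_tr))

# Random forest: many deep trees on bootstrap samples, ~sqrt(p) features per split
forest = RandomForestClassifier(n_estimators=500, max_features="sqrt", random_state=0)
forest.fit(X_tr, y_tr)
print("Hold-out accuracy:", forest.score(X_te, y_te))
```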
Concerns:
– Correlation of individual trees
– Can overfit – literature is divided on this
– Not as sensitive to data structure, outliers, local minima or maxima
– Transparency, communicability, feature contributions
Enhancements:
– Depth of decision trees
– Number of training iterations
– Variable interactions – can create curved boundaries in addition to axis-parallel
– Categorical value groupings – how to handle low credibility?
– Confidence intervals for input variables
Source of image: http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/
GLMs
– Individual tables of factors, one for each input variable
– Easy to interpret
– Simply replace current factors in same structure of ratepages
Decision Trees
– Set of if/then conditions that define groupings/tiers/segments
– Easy to interpret
– New step in the rating algorithm
k-Means
– Based on distance to cluster centroids
– Simply assign each new data point to the closest cluster centroid
– Since it is unsupervised, harder to accept because human intuition has not formed
– Could be used for target marketing to customer lifestyles
– Can enhance variable groupings
Variable X    Factor
A             1.200
B             1.000
…             …
C             0.800

if (condition) then group 1
else if (condition) then group 2
…
else group n
end if
k-Nearest Neighbors
– Assign new data points according to decision boundaries
– Supervised, but nonlinear and multidimensional, so assignments may not be intuitive and may seem arbitrary near boundaries
– New step in rating algorithm
Ensembling
– Based on learned composite set of rules to assign prediction
– May be non-intuitive, but influence of variables can be determined
– Can significantly improve accuracy
– Calculation of confidence intervals for Random Forests is possible
– New step in rating algorithm
Score         Factor
1-100         1.200
101-200       1.000
…             …
901-1000      0.800
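As a hypothetical illustration of the score-to-factor table above (the "new step in the rating algorithm"), the sketch below bins a model score into ranges and looks up a rating factor; the bin edges and factors mirror the example table and are not real rates.

```python
# Sketch: map a model score (e.g., a random forest prediction scaled to 1-1000)
# to a rating factor via a simple lookup table. Values are illustrative only.
import bisect

# Upper bounds of each score band and the factor applied to that band
band_upper_bounds = [100, 200, 1000]      # 1-100, 101-200, ..., 901-1000 (middle bands elided)
band_factors = [1.200, 1.000, 0.800]

def score_to_factor(score: int) -> float:
    """Return the rating factor for the band containing `score`."""
    i = bisect.bisect_left(band_upper_bounds, score)
    return band_factors[i]

print(score_to_factor(57))    # 1.200
print(score_to_factor(150))   # 1.000
print(score_to_factor(995))   # 0.800
```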
GLMs
– Re-rate the in-force book and calculate the distribution of changes, as some DOIs require
– Examine extremes, but unclear how to fix – temper selections (how?) or cap changes?
– Multiple factors may change, but can test impacts by hand
– Easy to explain to customer
Decision Trees
– Impacts are few and discrete
– Easy to see and temper extremes
– Could be large jumps to different segments
– May be difficult to explain
k-Means
– Re-groupings – simpler to determine the distribution of changes
– Easy to see and temper the transition to a new grouping
– Could be large jumps to different clusters
– May be difficult to explain
k-Nearest Neighbors
– Could be large jumps to different classes for points near the border
– May be difficult to explain
Ensembling: Random Forests
– Provides a granular and gradual spectrum of indications, often grouped (deciles, etc.)
– Easy to see and temper extremes
– May be difficult to explain – “natural” progressions could cause unexpected, undesirable, and compounded impacts
GLMs vs. Machine Learning

Theoretical
GLMs:
✓ Vast literature
✗ Linear, dependence on an exponential-family distribution – models real-world relationships?
✗ Feature selection, outliers, missing data, multicollinearity very important
✗ Underfitting & overfitting
Machine Learning:
✓ Vast literature
✓ Nonlinear, nonparametric – could be more accurate
✓ Less concern over feature selection, outliers, missing data, multicollinearity
✓ Risk of overfitting, but this can be controlled

Practical
GLMs:
✓ Easier to interpret
✓ Accepted by DOIs
✓ No changes to infrastructure
✓ Easy to productionalize
✓/✗ May not be as accurate, but may be the better choice if you can't take advantage of full machine learning indications
Machine Learning:
✗ May be difficult to interpret
✓/✗ Beginning to be accepted (remember the early stages of filing CAT models?)
✗ Most likely need a new rating step
✗ Most likely need a rules engine
✓/? May be more accurate if you can change your infrastructure to take advantage
The practical issues are rapidly going away in favor of machine learning…
Scott Sobel, FCAS, MAAA, MSPA
Oliver Wyman Actuarial Consulting, Inc.
325 John H. McConnell Blvd
Columbus, OH 43215
Phone: +1 614 227 6225
Cell: +1 803 429 3153
scott.sobel@oliverwyman.com
www.oliverwyman.com
@ssobel2010