Regularized Regression

Introduction to Modern Regression:
From OLS to GPS® to MARS®
September 2014
Dan Steinberg
Mikhail Golovnya
Salford Systems
Course Outline
Today’s Topics
• Regression Problem – quick overview
• Classical OLS/LOGIT – the starting point
• RIDGE/LASSO/GPS – regularized regression
• MARS – adaptive non-linear regression
Next Week Topics
• Regression Trees
• GBM – stochastic gradient boosting
• ISLE/RULE-LEARNER – model compression
Regression
• Regression analysis is at least 200 years old
• Surely the most used predictive modeling technique
(including logistic regression)
• American Statistical Association reports 18,900 members
• Bureau of Labor Statistics reported more than 22,000
statisticians in the work force in 2008 in the USA
• Many other professionals involved in the sophisticated
analysis of data not included in these counts
o Statistical specialists in scientific disciplines such as economics, medicine, bioinformatics
o Machine Learning specialists, ‘Data Scientists’, database experts
o Market researchers studying traditional targeted marketing
o Web analytics, social media analytics, text analytics
• Few of these other researchers will call themselves
statisticians but may make extensive use of variations of
regression
Regression Challenges
• Preparation of data – errors, missing values, etc.
• Determination of predictors to include in model
o Hundreds, thousands, even tens and hundreds of thousands available
• Transformation or coding of predictors
o Conventional approaches consider logarithm, power, inverse, etc.
• Detecting and modeling important interactions
• Possibly huge number of records
o Super large samples render all predictors “significant”
• Complexity of underlying relationship
• Lack of external knowledge
OLS Regression
• OLS – ordinary least squares regression
o Discovered by Legendre (1805) and Gauss (1809) to solve problems in
astronomy using pen and paper
o Solid statistical foundation laid by Fisher in the 1920s
o 1950s – use of electro-mechanical calculators
• The model is always of the form:
Response = A + B1X1 + B2X2 + B3X3 + …
• The response surface is a hyper-plane!
• A – the intercept term
• B1, B2, B3, … – parameter estimates
• A usually unique combination of values exists which minimizes the mean squared error of predictions on the learn sample
• Step-wise approaches to determine model size
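As a minimal illustration of how such a model is fit, here is a numpy sketch (toy data and variable names are ours, not from the course):

```python
import numpy as np

# Toy learn sample: 100 records, 3 predictors
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2.0 + 1.5 * X[:, 0] - 0.7 * X[:, 1] + rng.normal(scale=0.5, size=100)

# Prepend a column of ones so the intercept A is estimated alongside B1..B3
X1 = np.column_stack([np.ones(len(X)), X])

# OLS: the coefficient vector minimizing mean squared error on the learn sample
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
print("A =", coef[0].round(3), " B1..B3 =", coef[1:].round(3))
```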
Logistic Regression
• Models Bernoulli response (binary outcome)
• The log-odds of the probability of the event of
interest is a linear combination of predictors
F(X) = Log[p/(1-p)] = A + B1X1 + B2X2 + B3X3 + …
• The solution minimizes the loss (or equivalently
maximizes the logistic likelihood) on the available
data (assuming the response is coded as +1 and -1)
Loss = Σi=1..N Log[1 + exp(-Yi Fi)]
• The solution has no closed form and is usually found
by a series of iterations using Newton’s algorithm
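A small Python sketch of this loss, assuming the ±1 response coding above (toy scores of our own):

```python
import numpy as np

def logistic_loss(y, f):
    """Sum over i of Log[1 + exp(-Yi * Fi)]; y must be coded +1/-1."""
    return np.sum(np.log1p(np.exp(-y * f)))

y = np.array([1, -1, 1, 1, -1])           # observed responses
f = np.array([0.8, -1.2, 0.3, 2.0, 0.5])  # F(Xi) from the linear combination
print(logistic_loss(y, f))  # Newton's iterations drive this downward
```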
Key Problems with OLS and LOGIT
• In addition to the challenges already mentioned
above, note the following features:
o OLS/LOGIT optimize a specific loss function (mean squared error / log-likelihood)
using all available data (the learn sample)
o Solution over-fits to the learn sample
o Solution becomes unstable in the presence of collinearity
o A unique solution does not exist when the data become wide (more predictors than records)
Look for
• Alternative strategies to construct useful linear
combinations of predictors
o By jointly optimizing MSE and the L2-norm of the coefficients – Ridge regression
o By jointly optimizing MSE and the L1-norm of the coefficients – Lasso regression
o Hybrid of the two above
o All of the above plus extensions into very sparse solutions – Generalized Path Seeker
Motivation for Regularized Regression
• Unsatisfactory regression results when modeling physical processes with correlated predictors (circa 1970)
o Coefficient estimates could change dramatically with small changes in data
o Some coefficients judged to be much too large
o Frequent appearance of coefficients with the “wrong” sign
o Problem severe with substantial multicollinearity but always present
• Solution proposed by Hoerl and Kennard,
(Technometrics, 1970) was “Ridge regression”
o First proposed by Hoerl in 1962, originally just for stabilization of coefficients
o Initially poorly received by academic statistics profession
• Ridge is intentionally biased but yields more satisfactory
coefficient estimates and superior generalization
o Better performance (test MSE) on previously unseen data (lower variance)
• “Shrinkage” of regression coefficients towards zero
• OLS coefficient vector length is biased upwards
Lasso Regularized Regression
• Introduced by Tibshirani in 1996 explicitly as an
improvement on RIDGE regression
• Least Absolute Shrinkage and Selection Operator
• Desire to gain the stability and lower variance of ridge
regression while also performing variable selection
• Especially in the context of many possible predictors
looking for a simple, stable, low predictive variance
model
• Historical note: the Lasso was inspired by related work in
1993 by Leo Breiman (of CART and Random Forests
fame), the ‘non-negative garrote’
• Breiman’s simulation studies showed the potential for
improved prediction via selection and shrinkage
Regularized Regression - Theory
OLS Regression:
Minimize Mean Squared Error
Regularized Regression:
Minimize Mean Squared Error + λ × Model Complexity
Model complexity can be measured as:
o Ridge: sum of squared coefficients
o Lasso: sum of absolute coefficients
o Best Subsets: number of coefficients
• Regularized regression approach balances model
performance and model complexity
• λ – regularization parameter
o λ = ∞ – zero-coefficient solution
o λ = 0 – OLS solution
• Model complexity expression defines type of model
Penalty Function – Model Complexity
• RIDGE penalty: aj² (squared)
• LASSO penalty: |aj| (absolute value)
• COMPACT penalty: |aj|⁰ (count)
• Extended power family: |aj|^g, 0 <= g <= 2
• Elastic net family: Σ {(β - 1)aj²/2 + (2 - β)|aj|}, 1 ≤ β ≤ 2
• Regularization parameter λ multiplies a penalty function of the coefficient vector a
• Each elasticity is based on fitting a model that minimizes the
sum (residual sum of squared errors + penalty)
• Intermediate elasticities are mixtures, e.g. we could have a
50/50 mix of RIDGE and LASSO (g = 1.5)
• GPS extends beyond the “elastic net” of Tibshirani/Hastie to
include the “other half” of the power family 0 ≤ β ≤ 1
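A small sketch of these penalty families in Python (a helper of our own, following the formulas above):

```python
import numpy as np

def penalty(a, kind="lasso", gamma=1.0, beta=1.0):
    """Model-complexity penalties from the slide; a is the coefficient vector."""
    a = np.asarray(a, dtype=float)
    if kind == "ridge":
        return np.sum(a ** 2)              # squared
    if kind == "lasso":
        return np.sum(np.abs(a))           # absolute value
    if kind == "compact":
        return np.count_nonzero(a)         # |aj|^0 = count of nonzero terms
    if kind == "power":
        return np.sum(np.abs(a) ** gamma)  # 0 <= gamma <= 2
    if kind == "elastic":
        # sum of (beta - 1)*aj^2/2 + (2 - beta)*|aj|, for 1 <= beta <= 2
        return np.sum((beta - 1) * a ** 2 / 2 + (2 - beta) * np.abs(a))
    raise ValueError(kind)

a = [0.0, 0.3, -1.2]
print(penalty(a, "ridge"), penalty(a, "compact"), penalty(a, "elastic", beta=1.5))
```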
Lambda-Centric Approach
• A number of regularized regression algorithms were
developed to find the value of the coefficients for
the given lambda
o Ridge regression: a(ridge) = (X′X + λI)⁻¹X′y
o LARS algorithm and its modification to obtain LASSO regression
o Elastic Net algorithm to generate a family of solutions “between” LASSO
and RIDGE regressions that closely approximate the power family
Pβ(a) = Σ {(β - 1)aj²/2 + (2 - β)|aj|},  1 ≤ β ≤ 2  (Ridge: β = 2.0; Lasso: β = 1.0)
• All of these approaches are lambda-centric! Pick
any combination of beta and lambda (usually on a
user-defined grid) and then solve for the
coefficients
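For ridge, the lambda-centric solution is available in the closed form shown above. A minimal numpy sketch on toy data of our own:

```python
import numpy as np

def ridge_solution(X, y, lam):
    """a(ridge) = (X'X + lam*I)^(-1) X'y, for a given regularization lambda."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
y = X @ np.array([1.0, 0.5, 0.0, -2.0]) + rng.normal(scale=0.3, size=50)

for lam in (0.0, 1.0, 100.0):  # lam = 0 is OLS; large lam shrinks toward zero
    print(lam, ridge_solution(X, y, lam).round(3))
```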
Path-Centric Approach
• Instead of focusing on the lambda, observe that
solutions traverse a path in the parameter space
• The terminating points of the path are always
known:
o OLS solution corresponding to λ = 0
o Zero-coefficient solution corresponding to λ = ∞
• Start at either of the terminating points and work out
an update algorithm to find the next point along
the path
o Starting at the OLS end requires finding the OLS solution first, which may
be time consuming
o Starting at the zero end is more convenient and allows early path
termination upon reaching excessive compute burden or acceptable
model performance
• Every point on the path maps (implicitly) to a
monotone sequence of lambdas
Regularized Regression – Practical Algorithm
Introducing a new variable (the X4 coefficient moves off zero):
Current model: X3 = 0.2, X4 = 0.0, X5 = 0.4, X6 = 0.5 (X1, X2, X7, X8 all 0.0)
Next model:    X3 = 0.2, X4 = 0.1, X5 = 0.4, X6 = 0.5
Updating an existing model (the X3 coefficient is nudged):
Current model: X3 = 0.2, X4 = 0.1, X5 = 0.4, X6 = 0.5
Next model:    X3 = 0.3, X4 = 0.1, X5 = 0.4, X6 = 0.5
• Start with the zero-coefficient solution
• Series of iterations:
o Update one of the coefficients by a small amount
• If the selected coefficient was zero, a new variable effectively enters the model
• If the selected coefficient was not zero, the model is simply updated
• The end result is a collection of linear models which can be
visualized as a path in the parameter space (a simplified sketch follows)
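Here is a toy sketch of that loop in Python; it is our simplification of the idea, not Friedman's actual GPS update rule. At each iteration the coefficient whose small change most reduces the learn-sample squared error is nudged:

```python
import numpy as np

def build_path(X, y, step=0.01, n_iter=500):
    """Start at the zero-coefficient solution; repeatedly update one
    coefficient by a small amount, recording each intermediate model."""
    n, p = X.shape
    a = np.zeros(p)
    path = [a.copy()]
    for _ in range(n_iter):
        r = y - X @ a                 # residuals of the current model
        g = X.T @ r / n               # how much each coefficient would help
        j = np.argmax(np.abs(g))      # pick the most promising coefficient
        a[j] += step * np.sign(g[j])  # zero coefficients "enter" the model here
        path.append(a.copy())
    return np.array(path)             # one linear model per point on the path

# Usage on a toy learn sample with 8 predictors
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = X @ np.array([0, 0, 0.3, 0, 0.4, 0.5, 0, 0]) + rng.normal(scale=0.1, size=200)
print(build_path(X, y)[-1].round(2))  # the final point approaches the OLS solution
```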
Path Building Process
Zero-coefficient model (λ = ∞) → sequence of 1-variable models → a variable is added → sequence of 2-variable models → a variable is added → sequence of 3-variable models → a variable is added → … → final OLS solution (λ = 0)
Variable Selection Strategy
• Elasticity Parameter – controls the variable selection
strategy along the path (using the LEARN sample
only); it can be anywhere between 0 and 2, inclusive
o Elasticity = 2 – fast approximation of Ridge Regression, introduces
variables as quickly as possible and then jointly varies the magnitude of
coefficients – lowest degree of compression
o Elasticity = 1 – fast approximation of Lasso Regression, introduces
variables sparingly letting the current active variables develop their
coefficients – good degree of compression versus accuracy
o Elasticity = 0 – fast approximation of Stepwise Regression, introduces new
variables only after the current active variables were fully developed –
excellent degree of compression but may lose accuracy
Points Versus Steps
[Diagram: three paths from the Zero Solution to the OLS Solution, each with its own number of steps; a common grid of 10 points is overlaid on all three paths (the Point Selection Strategy).]
• Each path will have a different number of steps
• To facilitate model comparison among different
paths, the Point Selection Strategy extracts a fixed
collection of models into the points grid
o This eliminates some of the original irregularity among individual paths and
facilitates model extraction and comparison
OLS versus GPS
OLS Regression – a sequence of linear models built on the learn sample:
1-variable model, 2-variable model, 3-variable model, … over predictors X1, X2, X3, X4, X5, X6, …
GPS Regression – a large collection of linear models (paths), evaluated on the test sample:
1-variable models with varying coefficients, 2-variable models with varying coefficients, 3-variable models with varying coefficients, …
• GPS (Generalized Path Seeker) introduced by Jerome Friedman in 2008 (“Fast Sparse Regression and Classification”)
• Dramatically expands the pool of potential linear models by including different sets of variables in addition to varying the magnitude of coefficients
• The optimal model of any desirable size can then be selected based on its performance on the TEST sample
Boston Housing Data Set
• Concerns housing values in the Boston area
• Harrison, D. and D. Rubinfeld. Hedonic Prices and the
Demand For Clean Air. Journal of Environmental
Economics and Management, v5, 81-102 , 1978
• Combined information from 10 separate governmental
and educational sources to produce this data set
• 506 census tracts in City of Boston for the year 1970
• Goal: study relationship between quality-of-life variables and property values
o MV – median value of owner-occupied homes in tract ($1,000’s)
o CRIM – per capita crime rate
o NOX – concentration of nitric oxides (parts per 10 million)
o AGE – percent built before 1940
o DIS – weighted distance to centers of employment
o RM – average number of rooms per house
o LSTAT – % lower status of the population
o RAD – accessibility to radial highways
o CHAS – borders Charles River (0/1)
o INDUS – percent non-retail business
o TAX – property tax rate per $10,000
o PT – pupil-teacher ratio
OLS on Boston Data
[Screenshot: a 3-variable OLS solution with coefficients -0.597, +5.247, and -0.858.]
• 414 records in the
learn sample
• 92 records in the
test sample
• Good agreement
o LEARN MSE = 27.455
o TEST MSE = 26.147
Paths Produced by GPS
• Example of 21 paths with different variable selection
strategies
Path Points on Boston Data
[Screenshots: path development at points 30, 100, 150, and 190.]
• Each path uses a different variable selection
strategy and separate coefficient updates
GPS on Boston Data
[Screenshot: a 3-variable GPS solution compared with OLS (coefficients -0.597, +5.247, -0.858; OLS TEST MSE = 26.147).]
• 414 records in the learn
sample
• 92 records in the test sample
• 15% performance
improvement on the test
sample
o TEST MSE = 22.669
(OLS MSE was 26.147)
Key Problems with GPS
• GPS model is still a linear regression!
• Response surface is still a global hyper-plane
• Incapable of discovering local structure in the data
Look for
• Develop non-linear algorithms that build response
surface locally based on the data itself
o By trying all possible data cuts as local boundaries
o By fitting first-order adaptive splines locally
Motivation for MARS
Multivariate Adaptive Regression Splines
• Developed by Jerome H. Friedman in the late 1980s
• After his work on CART (for classification)
• Adapting many CART ideas for regression
o Automatic variable selection
o Automatic missing value handling
o Allow for nonlinearity
o Allow for interactions
o Leverage the power of regression where linearity
can be exploited
From Linear to Non-linear
[Figure: MV plotted against LSTAT; a single global regression line (left) is localized into connected linear segments (right), with the transition points marked as knots.]
• Classical regression and regularized regression build globally linear models
• Further accuracy can be achieved by building locally linear models nicely connected to each other at boundary points called knots
Key Concept for Spline is the “knot”
• Knot marks end of one region of data and beginning of
another
• Knot is where behavior of function changes
• In a classical spline knot positions are predetermined
and are often evenly spaced
• In MARS, knots are determined by search procedure
• Only as many knots as needed end up in the MARS
model
• If a straight line is an adequate fit there will be no interior knots
o In MARS there is always at least one knot
o Could correspond to the smallest observed value of the predictor
Placement of Knots
• With only one predictor and one knot to select,
placement is straightforward:
o test every possible knot location
o choose model with best fit (smallest SSE)
o perhaps constrain by requiring a minimum amount of
data in each interval
• Prevents interior knot being placed too close to a
boundary
• For computational efficiency, knots are always
placed exactly at observed predictor values
o Can cause rare modeling artifacts that sometimes
appear due to the discrete nature of the data
Finding Knots Automatically
[Figure: Y versus X for a flat-top function; the true knots are marked in the top panel, and knots 1–6 are placed stage-wise in the fitted panels.]
• Stage-wise knot placement process on a flat-top function
Basis Functions
• Example for knot selection works very well to illustrate
splines in one dimension
• Thinking in terms of knot locations is unwieldy for working
with a large number of variables simultaneously
o need a concise notation and programming expressions that
are easy to manipulate
o Not clear how to construct or represent interactions using knot
locations
• Basis Functions (BF) provide analytical machinery to
express the knot placement strategy
• MARS creates sets of basis functions to decompose the
information in each variable individually
The Hockey Stick Basis Function
• The Hockey Stick basis function is the core MARS
building block
o can be applied to a single variable multiple times
• Hockey stick function:
o Direct: max(0, X - c)
o Mirror: max(0, c - X)
o Maps variable X to new variable X*
o X* = 0 for all values of X up to some threshold value c
o X* = X - c for all values of X greater than c
• the amount by which X exceeds threshold c
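These two functions translate directly into Python (helper names are ours):

```python
import numpy as np

def bf_direct(x, c):
    """Direct hockey stick: max(0, X - c); the amount by which X exceeds c."""
    return np.maximum(0.0, np.asarray(x) - c)

def bf_mirror(x, c):
    """Mirror hockey stick: max(0, c - X); nonzero only below the threshold."""
    return np.maximum(0.0, c - np.asarray(x))

x = np.array([0, 10, 25, 55, 65, 90])
print(bf_direct(x, 25))   # [ 0.  0.  0. 30. 40. 65.]
print(bf_mirror(x, 25))   # [25. 15.  0.  0.  0.  0.]
```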
Basis Function Example
• X ranges from 0 to 100
• 8 basis functions displayed (c=10,20,30,…,80)
[Figure: the eight basis functions BF10–BF80 plotted over X from 0 to 110; each is zero up to its threshold c and rises linearly beyond it.]
Basis Functions: Separate Displays
o Each function is graphed with the same dimensions
o BF10 is offset from the original value by 10
o BF80 is zero for most of its range
o Basis functions can be constructed for any value of c
o MARS considers constructing one for EVERY actual data value
Tabular Display of Basis Functions
• Each new BF results in a different number of zeroes in the
transformed variable
• The resulting collection is resistant to multicollinearity
issues
• Three basis functions with knots at 25, 55, and 65:
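The table itself is a screenshot in the original slides; a quick sketch that reproduces the idea for those three knots on toy X values of our own:

```python
import numpy as np

x = np.array([10, 20, 30, 40, 50, 60, 70, 80])
for c in (25, 55, 65):
    print(f"BF{c}:", np.maximum(0.0, x - c))  # later knots leave more zeros
```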
Spline with 1 Basis Function
MV = 27.395 - 0.659*(INDUS - 4)+
[Figure: MV versus INDUS with a single knot at 4; slope 0 to the left of the knot and -0.659 to the right.]
Spline with 2 Basis Functions
MV = 30.290 - 2.439*(INDUS - 4)+ + 2.215*(INDUS - 8)+
[Figure: MV versus INDUS with knots at 4 and 8; slope 0 before the first knot, -2.439 between the knots, and -2.439 + 2.215 = -0.224 after the second.]
Adding Mirror Image BF
• A standard basis function (X - knot)+ does not provide for a non-zero slope for values below the knot
• To handle this MARS uses a “mirror image” basis function, which looks at the interval of a variable X which lies below the threshold c
MV = 29.433 + 0.925*(4 - INDUS)+ - 2.180*(INDUS - 4)+ + 1.939*(INDUS - 8)+
[Figure: MV versus INDUS with knots at 4 and 8; slope -0.925 below the first knot, -2.180 between the knots, and -2.180 + 1.939 = -0.241 after the second.]
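The three-basis-function model above is easy to evaluate directly; a sketch of the arithmetic in Python:

```python
import numpy as np

def mv_hat(indus):
    """The spline from this slide: intercept, one mirror BF, two direct BFs."""
    indus = np.asarray(indus, dtype=float)
    return (29.433
            + 0.925 * np.maximum(0.0, 4 - indus)   # mirror BF, knot at 4
            - 2.180 * np.maximum(0.0, indus - 4)   # direct BF, knot at 4
            + 1.939 * np.maximum(0.0, indus - 8))  # direct BF, knot at 8

print(mv_hat([2, 4, 6, 10]).round(3))  # piecewise-linear MV predictions
```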
MARS Algorithm
• Stands for Multivariate Adaptive Regression Splines
• Forward stage:
o Add pairs of BFs (a direct and mirror pair of basis functions represents a single
knot) in a step-wise regression manner (see the sketch after this list)
o The process stops once a user specified upper limit is reached
o Possible linear dependency is handled automatically by discarding
redundant BFs
• Backward stage:
o Remove BFs one at a time in a step-wise regression manner
o This creates a sequence of candidate models of varying complexity
• Selection stage:
o Select optimal model based on the TEST performance (modern approach)
o Select optimal model based on GCV criterion (legacy approach)
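A compact sketch of the forward stage under simplifying assumptions (one predictor, no interactions; SSE is evaluated by an ordinary least-squares refit at every candidate knot):

```python
import numpy as np

def forward_stage(x, y, max_bfs=4):
    """Greedily add direct/mirror basis-function pairs at the best knots."""
    cols = [np.ones_like(x)]                  # start with the intercept only
    knots = []
    while 2 * len(knots) < max_bfs:
        best = None
        for c in np.unique(x):                # knots only at observed values
            A = np.column_stack(cols + [np.maximum(0, x - c),
                                        np.maximum(0, c - x)])
            coef, *_ = np.linalg.lstsq(A, y, rcond=None)
            sse = np.sum((y - A @ coef) ** 2)
            if best is None or sse < best[0]:
                best = (sse, c)
        c = best[1]
        knots.append(c)
        cols += [np.maximum(0, x - c), np.maximum(0, c - x)]
    return knots

x = np.linspace(0, 10, 51)
y = np.abs(x - 5)                     # V shape: one true knot at 5
print(forward_stage(x, y, max_bfs=2)) # should recover a knot near 5
```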
MARS on Boston Data
[Screenshot: the 9-BF (7-variable) MARS solution.]
• 414 records in the learn
sample
• 92 records in the test
sample
• 40% performance
improvement on the
test sample
o TEST MSE = 15.749
(OLS was 26.147)
Non-linear Response Surface
• MARS automatically determined transition points
between various local regions
• This model provides major insights into the nature of the
relationship
200 Replications
[Charts: test-sample performance of Regression, GPS, and MARS across the replications.]
• All of the above
models were
repeated on 200
randomly selected
20% test partitions
• GPS shows marginal
performance
improvement but
much smaller model
• MARS shows
dramatic
performance
improvement
Regression Trees
• Regression trees result in piece-wise constant
models (a multi-dimensional staircase) on an
orthogonal partition of the data space
o Thus usually not the best possible performer in terms of conventional
regression loss functions
• Only a very limited number of controls is available
to influence the modeling process
o Priors and costs are no longer applicable
o There are two splitting rules: LS and LAD
• Very powerful in capturing high-order interactions
but somewhat weak in explaining simple main
effects
Split Improvement
• Parent node: SSE_P = Σ_{i∈P} (y_i - ȳ_P)²
• Left child: SSE_L = Σ_{i∈L} (y_i - ȳ_L)²
• Right child: SSE_R = Σ_{i∈R} (y_i - ȳ_R)²
• Improvement: Δ = SSE_P - SSE_L - SSE_R
• One can show: Δ = (N_L N_R / N_P) (ȳ_L - ȳ_R)²
• Find the split with the largest improvement by conducting
exhaustive search over all splits in the parent node
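A direct Python implementation of this exhaustive search for one predictor, using the Δ identity above (our own helper):

```python
import numpy as np

def best_split(x, y):
    """Exhaustive search over all splits of one predictor in a parent node."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    n = len(y)
    best_delta, best_cut = -np.inf, None
    for k in range(1, n):                 # left child = first k sorted cases
        if xs[k] == xs[k - 1]:
            continue                      # cannot split between tied values
        yl, yr = ys[:k].mean(), ys[k:].mean()
        delta = k * (n - k) / n * (yl - yr) ** 2   # the identity above
        if delta > best_delta:
            best_delta, best_cut = delta, (xs[k - 1] + xs[k]) / 2
    return best_cut, best_delta

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0.0, 0.0, 0.0, 5.0, 5.0, 5.0])
print(best_split(x, y))                   # cut at 6.5, the obvious boundary
```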
Splitting the Root Node
• Improvement is defined in terms of the greatest reduction in the sum of squared errors when a single constant prediction is replaced by two separate constants, one on each side of the split
Regression Tree Model
• All cases in the given node are assigned the same
predicted response – the node average of the original
target
• Nodes are color-coded according to the predicted
response
• We have a convenient segmentation of the population
according to the average response levels
The Best and the Worst Segments
Stochastic Gradient Boosting
• New approach to machine learning / function
approximation developed by Jerome H. Friedman at
Stanford University
o Co-author of CART® with Breiman, Olshen and Stone
o Author of MARS®, PRIM, Projection Pursuit
• Also known as Treenet®
• Good for classification and regression problems
• Built on small trees and thus
o Fast and efficient
o Data driven
o Immune to outliers
o Invariant to monotone transformations of variables
• Resistant to over-training – generalizes very well
• Can be remarkably accurate with little effort
• BUT resulting model may be very complex
The Algorithm
• Begin with a very small tree as initial model
o Could be as small as ONE split generating 2 terminal nodes
o Typical model will have 3-5 splits in a tree, generating 4-6 terminal nodes
o Model is intentionally “weak”
• Compute “residuals” (prediction errors) for this simple
model for every record in data
• Grow a second small tree to predict the residuals from the
first tree
• Compute residuals from this new 2-tree model and grow
a 3rd tree to predict revised residuals
• Repeat this process to grow a sequence of trees
Tree 1 + Tree 2 + Tree 3 + … (more trees)
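A minimal sketch of this residual-fitting loop using scikit-learn's small trees; this is our own illustration of the idea, not Salford's TreeNet implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(X, y, n_trees=100, max_leaf_nodes=4, learn_rate=0.1):
    """Grow a sequence of small trees, each fit to the current residuals."""
    pred = np.full(len(y), y.mean())        # stage 0: a constant model
    trees = []
    for _ in range(n_trees):
        resid = y - pred                    # prediction errors so far
        t = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes).fit(X, resid)
        pred += learn_rate * t.predict(X)   # add the new tree's contribution
        trees.append(t)
    return trees, pred

# The saddle function from the next slide: Y = X1 * X2 on a [-3, +3] box
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 2))
y = X[:, 0] * X[:, 1]
trees, pred = boost(X, y)
print("learn MSE:", np.mean((y - pred) ** 2).round(4))
```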
Illustration: Saddle Function
• 500 {X1,X2} points randomly drawn from a [-3,+3] box to
produce the XOR response surface Y = X1 * X2
• Will use 3-node trees to show the evolution of Treenet
response surface
[Figure: evolution of the Treenet response surface after 1, 2, 3, 4, 10, 20, 30, 40, 100, and 195 trees.]
Notes on Treenet Solution
• The solution evolves slowly and
usually includes hundreds or even
thousands of small trees
• The process is myopic – only the next
best tree given the current set of
conditions is added
• There is a high degree of similarity
and overlap among the resulting
trees
• Very large tree sequences make the
model scoring time and resource
intensive
• Thus, ever present need to simplify
(reduce) model complexity
A Tree is a Variable Transformation
Node 1 (Avg = -0.105, N = 506): split on X1 <= 1.47
  Node 2 (Avg = -0.068, N = 476): split on X2 <= -1.83
    Terminal Node 1 (Avg = -0.250, N = 15) if X2 <= -1.83
    Terminal Node 2 (Avg = -0.062, N = 461) if X2 > -1.83
  Terminal Node 3 (Avg = -0.695, N = 30) if X1 > 1.47
TREE_1 = F(X1, X2)
• Any tree in a Treenet model
can be represented by a
derived continuous variable
as a function of inputs
ISLE Compression of Treenet
TN Model: all trees enter with equal coefficients (1.0, 1.0, 1.0, …)
ISLE Model: trees are re-weighted (1.3, 0.0, 0.4, 0.0, 0.0, 1.8, 1.2, 0.3, 0.0, 0.0, 0.0, 0.1, 0.0), with many coefficients driven to zero
• The original Treenet model combines all trees with equal
coefficients
• ISLE accomplishes model compression by removing
redundant trees and changing the relative contribution
of the remaining trees by adjusting the coefficients
• Regularized Regression methodology provides the
required machinery to accomplish this task!
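A sketch of the idea with scikit-learn (our own illustration): each tree's prediction becomes a derived variable, and an L1-regularized regression re-weights the trees, zeroing out the redundant ones.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.tree import DecisionTreeRegressor

# Toy stand-in for a Treenet ensemble: small trees on bootstrap samples
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(500, 2))
y = X[:, 0] * X[:, 1]
trees = []
for _ in range(20):
    idx = rng.integers(0, len(X), len(X))   # bootstrap, so the trees differ
    trees.append(DecisionTreeRegressor(max_leaf_nodes=3).fit(X[idx], y[idx]))

# ISLE step: column j holds tree j's predictions (TREE_j as a derived variable)
T = np.column_stack([t.predict(X) for t in trees])
coef = Lasso(alpha=0.01).fit(T, y).coef_
print("surviving trees:", np.flatnonzero(coef))  # many weights driven to zero
```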
A Node is a Variable Transformation
Node 1 (Avg = -0.105, N = 506): split on X1 <= 1.47
  Node 2 (Avg = -0.068, N = 476): split on X2 <= -1.83
    Terminal Node 1 (Avg = -0.250, N = 15) if X2 <= -1.83
    Terminal Node 2 (Avg = -0.062, N = 461) if X2 > -1.83
  Terminal Node 3 (Avg = -0.695, N = 30) if X1 > 1.47
NODE_X = F(X1, X2)
• Any node in a Treenet model
can be represented by a
derived dummy variable as a
function of inputs
RuleLearner Compression
TN Model expands into node dummies:
T1_N1 + T1_T1 + T1_T2 + T1_T3 + T2_N1 + T2_T1 + T2_T2 + T2_T3 + T3_N1 + T3_T1 + …
GPS Regression then selects: Coefficient 1 × Rule-set 1 + Coefficient 2 × Rule-set 2 + Coefficient 3 × Rule-set 3 + Coefficient 4 × Rule-set 4 + …
• Create an exhaustive set of dummy variables for every
node (internal and terminal) and every tree in a TN
model
• Run GPS regularized regression to extract an informative
subset of node dummies along with the modeling
coefficients
• Thus, a model compression can be achieved by
eliminating redundant nodes
• Each selected node dummy represents a specific rule-set which can be interpreted directly for further insights (a sketch of the idea follows)
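A sketch of the node-dummy construction with scikit-learn (our own illustration, not Salford's RuleLearner itself); decision_path marks every node, internal and terminal, that each record passes through:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(500, 2))
y = X[:, 0] * X[:, 1]
trees = []
for _ in range(10):                          # toy stand-in for a TN model
    idx = rng.integers(0, len(X), len(X))
    trees.append(DecisionTreeRegressor(max_leaf_nodes=3).fit(X[idx], y[idx]))

# One dummy per node of every tree: entry (i, n) is 1 when record i visits node n
D = np.hstack([t.decision_path(X).toarray() for t in trees])
fit = Lasso(alpha=0.01).fit(D, y)            # GPS-like regularized regression
print("informative node dummies:", np.flatnonzero(fit.coef_))
```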
Salford Predictive Modeler (SPM)
• Download a current version from our website
http://www.salford-systems.com
• Version will run without a license key for 10 days
• Request a license key from
unlock@salford-systems.com
• Request configuration to meet your needs
o Data handling capacity
o Data mining engines made available