3.5 Neural Networks for Predictive Data Mining
A. Review
Data mining is concerned with extracting knowledge from large (mainly
corporate) databases that were originally developed for operational
purposes.
Applications of Predictive Data Mining:
 Database Marketing
 Response modeling
 Attrition prediction
 Credit Scoring
 Fraud Detection
Types of Predictive Data Mining:
• Regression Models (linear and generalized linear models)
   - Easy to train
   - Interpretable
   - Inflexible
   - Low predictive power
• Decision Tree Models
   - Relatively easy to train
   - Easy to interpret
   - Flexible
   - Moderate predictive power
• Artificial Neural Networks
   - Flexible
   - High predictive power
   - Troublesome to train
   - Incomprehensible (black box)
Some Remarks about ANN:
 Considerable computational effort is often required to optimize the
large number of parameters in a typical neural network model. This is
partially self-inflicted. Data analysts often use inefficient optimization
methods such as back-propagation. Even with an efficient algorithm,
local minima are troublesome.
 There is no simple method for selecting the best network from the
enormous number of possible architectures. Input variable selection
increases the combinatorial complexity of the problem. Overfitting is
always a serious concern with flexible models.
 Multilayer perceptrons are black boxes with respect to interpretation.
The fitted network maps the data to a prediction, $(y, x) \Rightarrow \hat{y}$, without revealing how the inputs affect the output.
In some pattern recognition applications, such as handwriting
recognition, pure prediction is the goal; understanding how the inputs
affect the prediction is immaterial. In many scientific applications, the
opposite is true: understanding is the goal, and predictive power is a
consideration only to the extent that it validates the interpretive power
of the model. This is the domain of formal statistical inference, such
as hypothesis testing and confidence intervals.
Domains such as database marketing often have both goals. Scoring
new cases is the ultimate purpose of predictive modeling. However,
some understanding, even informal, of the factors affecting the
prediction can be helpful in determining how to market to segments of
likely responders. Understanding the effects of the inputs can be
useful for decisions about costly data acquisitions. In credit scoring,
the opaqueness of the model can have legal ramifications. The US
Equal Credit Opportunity Act requires that creditors provide a
statement of specific reasons why an adverse action was taken.
 Neural networks have an allure unmatched by other methods with
similar capabilities. Neural networks are often characterized as
capable of learning complex patterns by thinking about past
experiences. This is falsely contrasted with more difficult statistical
methods that require manual tuning and adherence to restrictive
assumptions. The chief benefit of neural networks for predictive
modeling is their flexibility in representing nonlinear phenomena
(universal approximation). However, their ascribed similarities to
organic neural networks have led to unhealthy anthropomorphism
and, with it, unrealistic expectations.
 Most of the effort in predictive modeling is in the data preparation
steps: creating and eliminating input variables, imputing missing
values, collapsing and dummy-coding nominal inputs, and nullifying
the effect of outliers. Even more effort is spent on determining and
acquiring the data that is relevant to the particular business problem.
Neural networks provide no particular relief with regard to these efforts.
Keep in Mind!!!
While performance improvements are possible with sophisticated
modeling methods, such as neural networks, the greatest benefit comes
not from a better modeling technique but from better data. For example,
discovering a new input that is strongly associated with the target usually
does more for prediction than any change of modeling method.
B. Generalized Additive Neural Networks (GANN)
Question: Can we have a class of models that lies between generalized
regression models and neural network models in terms of efficiency and
training complexity?
Solution: Generalized additive models (GAMs).
They are a compromise between inflexible, but docile, linear models and
flexible, but troublesome, universal approximators.
GAMs
g01  E( y)   w0  f1 ( x1 )  f2 ( x2 ) 
 f k ( xk )
In a GAM, the link-transformed expected target is expressed as the sum
of individual unspecified univariate functions.
GAMs are usually presented as extensions of linear models, but they can
also be portrayed as constrained forms of multivariate function estimators
such as projection pursuit regression and neural networks.
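A sketch of that connection: projection pursuit regression expresses the link-transformed mean as a sum of unspecified functions of linear combinations (projections) of the inputs,
$$g_0^{-1}\big(E(y)\big) = w_0 + \sum_{m=1}^{M} f_m(\alpha_m' x),$$
and a GAM is the constrained case in which each direction $\alpha_m$ is restricted to a single coordinate axis, so each $f_m$ depends on only one input.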
References:
 Hastie, T.J. and Tibshirani, R. J. (1990) Generalized Additive Models,
New York: Chapman and Hall.
Friedman, J.H. and Stuetzle, W. (1981) "Projection Pursuit
Regression," Journal of the American Statistical Association, 76, 817-823.
 Sarle, W.S. (1994) “Neural Networks and Statistical Models,”
Proceedings of the Nineteenth Annual SAS Users Group International
Conference.
Generalized Additive Neural Network

[Architecture diagram: each input x_j has its own hidden unit H_j1 (weights w_0j1 and w_1j1) and a skip connection (weight w_j0); the hidden-unit output (weight w_j1) and the skip connection both feed the output y.]

$$H_{11} = \tanh\big(w_{011} + w_{111}x_1\big), \qquad H_{21} = \tanh\big(w_{021} + w_{121}x_2\big)$$
$$g_0^{-1}\big(E(y)\big) = w_0 + \big(w_{11}H_{11} + w_{10}x_1\big) + \big(w_{21}H_{21} + w_{20}x_2\big)$$
 The basic architecture for a generalized additive neural network
(GANN) has a separate set of layers for each input variable. The
hidden layers do not have connections from more than one input
variable.
 The combined output from the layers associated with each input (a
mini neural network) forms the individual univariate functions.
 The type of architecture could vary across inputs.
 Including a skip layer for each input makes the generalized linear
model (GLM) a special case of the GANN, as sketched below.
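If the hidden-unit weights are set to zero, only the skip-layer terms remain and the two-input GANN above reduces to the generalized linear model
$$g_0^{-1}\big(E(y)\big) = w_0 + w_{10}x_1 + w_{20}x_2.$$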
Consolidation Layer:

[Architecture diagram: each input x_j feeds its hidden unit H_j1 and a skip connection; both are combined in a linear consolidation unit f_j, and the consolidation units are connected to the output y with weights fixed at 1.]

$$H_{11} = \tanh\big(w_{011} + w_{111}x_1\big), \qquad H_{21} = \tanh\big(w_{021} + w_{121}x_2\big)$$
$$g_0^{-1}\big(E(y)\big) = w_0 + \underbrace{\big(w_{11}H_{11} + w_{10}x_1\big)}_{f_1} + \underbrace{\big(w_{21}H_{21} + w_{20}x_2\big)}_{f_2}$$
 A useful modification for PROC NEURAL is the addition of a
linear layer to consolidate the output from the hidden layer for
each input.
 This layer does not transform its input (identity activation
function).
• If the connections to the output layer are fixed at one, then the outputs from the
consolidation layer are the individual univariate functions.
 PROC NEURAL automatically creates a variable in the output
data set for each unit of the consolidation layer. This facilitates
plotting the individual functions.
GANN Estimation
Stepwise Construction:
1. Fit a generalized linear model to give initial estimates of the skip
layer and the output bias.
g01  E( y)   w0  w10 x1  w20 x2
2. Construct a GANN with one neuron and a skip layer for each input.
This gives four parameters for each input variable. Binary inputs
(dummy variables) only need a direct connection (1df)
3. Initialize the remaining three parameters in each hidden layer as
small random numbers.
g 01  E ( y )   w0  f1 ( x1 )  f 2 ( x2 )
f1 ( x1 )  w11 H11  w10 x1 , H11  tanh  w011  w111 x1 
f 2 ( x2 )  w21 H 22  w20 x2 , H 22  tanh  w021  w121 x2 
7
w111 , w011 , and w11 and w121, w121, and w21 need to be initialized .
w10 and w20 can be chosen by the estimates from Step 1.
4. Train the full GANN (4k+1 df) model.
5. Examine each of the fitted univariate functions, $\hat{f}_j(x_j)$, overlaid on
their partial residuals.
Partial Residuals (Identity Link):
$$PR_{ij} = y_i - \Big(\hat{w}_0 + \sum_{m \neq j}\hat{f}_m(x_{im})\Big) = y_i - \hat{y}_i + \hat{f}_j(x_{ij})$$
Partial residuals are calculated as the difference between the target
and the part of the additive model that does not include the input under
investigation. Under the identity link, $\hat{y}_i = \hat{w}_0 + \sum_m \hat{f}_m(x_{im})$, so this
difference equals the ordinary residual plus the fitted function of the input
under investigation.
In contrast to simple plots of the target versus an input, partial
residual plots show the effects of the input variables adjusted for the
effects of the other inputs, and thus are a more effective visual
diagnostic.
Upshot:
• If no trend is present, the corresponding input variable should probably
be eliminated.
• If there is a linear trend, probably only the skip layer should be
included.
• If curvature is present, the corresponding hidden-layer architecture
should be kept, or new neurons even added.
6. Prune the hidden layers with apparently linear effects and add
neurons to hidden layers where the nonlinear trend appears to be
underfitted.
Notes:
 The random initialization of the hidden layer is designed to give
well-placed and stable starting values. However, local minima can
still occur. Inferior local minima can usually be seen in the partial
residual plots, in which case a new random initialization should be
tried.
 The growing and pruning process starts with a single neuron plus a
skip layer instead of the linear model (the linear fit is only for
initialization). The partial residuals based on a general additive fit
are more reliable than those based on a linear fit. Moreover,
starting with a 4 degree of freedom smoother is a common practice
with GAM estimation.
 Model selection based on partial residual plots is subjective.
Formal measures of fit can be incorporated into the process
Example 3.16 (GANN: Interval Target) In this example, we revisit the
HOUSING data set and fit a generalized additive neural network
following the stepwise procedure outlined above.
First, a generalized linear model is fitted to provide initial estimates for
the GANN. A generalized linear model could be fitted easily by using the
GLIM option on the ARCH statement. However, to initialize the GANN
a more complicated program is needed. The linear model needs to have
the same structure as the GANN so that the parameter estimates have the
same names. Furthermore, the inputs are not standardized in a model
using the GLIM option. By setting up a parallel structure for the linear
model, the GANN model can automatically use the final estimates for
starting values.
The GANN structure has a hidden and skip layer for each input. It also
has a linear consolidation layer between the hidden layer and the output.
The connections between the consolidation layer and the output layer are
set to one and frozen.
The PROC NEURAL syntax for GANNs is long because separate hidden
layers are built for each input variable. In the HOUSING data, there are
13 inputs, so there are 13 INPUT statements, 13 HIDDEN statements for
the hidden layer, 13 HIDDEN statements for the linear consolidation
layer, and the CONNECT statements to connect all the layers.
Step 1. Convert the data set to a data mining data set:
proc dmdb batch data=neuralnt.housing dmdbcat=cbh
out=dbh;
var crim zn indus chas nox rm age dis rad tax
ptratio b lstat medv;
run;
Step 2. Linear initialization model:

proc neural data=dbh dmdbcat=cbh graph;
   input crim    / level=int id=x1;
   input zn      / level=int id=x2;
   input indus   / level=int id=x3;
   input chas    / level=int id=x4;
   input nox     / level=int id=x5;
   input rm      / level=int id=x6;
   input age     / level=int id=x7;
   input dis     / level=int id=x8;
   input rad     / level=int id=x9;
   input tax     / level=int id=x10;
   input ptratio / level=int id=x11;
   input b       / level=int id=x12;
   input lstat   / level=int id=x13;
   target medv   / level=int id=out act=exp error=poisson;
   hidden 1 / id=f1 act=identity comb=lin nobias; /* Consolidation layer */
   connect f1 out;
   connect x1 f1;                                 /* Skip layer */
   hidden 1 / id=f2 act=identity comb=lin nobias;
   connect f2 out;
   connect x2 f2;
   hidden 1 / id=f3 act=identity comb=lin nobias;
   connect f3 out;
   connect x3 f3;
   hidden 1 / id=f4 act=identity comb=lin nobias;
   connect f4 out;
   connect x4 f4;
   hidden 1 / id=f5 act=identity comb=lin nobias;
   connect f5 out;
   connect x5 f5;
   hidden 1 / id=f6 act=identity comb=lin nobias;
   connect f6 out;
   connect x6 f6;
   hidden 1 / id=f7 act=identity comb=lin nobias;
   connect f7 out;
   connect x7 f7;
   hidden 1 / id=f8 act=identity comb=lin nobias;
   connect f8 out;
   connect x8 f8;
   hidden 1 / id=f9 act=identity comb=lin nobias;
   connect f9 out;
   connect x9 f9;
   hidden 1 / id=f10 act=identity comb=lin nobias;
   connect f10 out;
   connect x10 f10;
   hidden 1 / id=f11 act=identity comb=lin nobias;
   connect f11 out;
   connect x11 f11;
   hidden 1 / id=f12 act=identity comb=lin nobias;
   connect f12 out;
   connect x12 f12;
   hidden 1 / id=f13 act=identity comb=lin nobias;
   connect f13 out;
   connect x13 f13;
   initial;
   freeze f1->out f2->out f3->out f4->out
          f5->out f6->out f7->out f8->out
          f9->out f10->out f11->out f12->out f13->out / value=1;
   train outfit=of outest=start;
run;
• The input units are given the IDs X1 through X13.
• The TARGET statement uses the log link and the Poisson error
function.
• Each of the 13 inputs has a block of three statements: a HIDDEN
statement to set up the consolidation layer, a CONNECT statement
to connect the consolidation layer to the output, and a CONNECT
statement to connect the input to the consolidation layer (the skip
layer).
• The units in the consolidation layer are given the IDs F1 through
F13. They use the identity activation function and a linear
combination function with no bias term. It would be redundant to
include 13 output biases, one for each input.
• The output layer uses a linear combination function by default.
These weights are now redundant, so they are set equal to one and
prevented from changing using the FREEZE statement. The
INITIAL statement is needed to activate the FREEZE statement.
The only free parameter in the output layer is the bias.
• The OUTEST= option on the TRAIN statement creates the
START data set containing the final estimates. They are used to
initialize the GANN.
proc sql;
   select _err_,
          _dfm_-13 as p,
          _err_ + (calculated p)*log(_dft_) as sbc
   from of
   where _name_='OVERALL';
quit;
Various fit statistics are computed in the OUTFIT= data set, including
the deviance and the SBC. The SBC is automatically calculated and given
in the variable _SBC_. However, it is computed incorrectly for this type of
network. This network has 27 parameters, 13 of which are frozen. The
frozen parameters are counted in the SBC penalty $p\,\ln(n)$.
Consequently, the SBC statistic needs to be corrected by subtracting
the number of inputs from the model degrees of freedom. The variable
_DFM_ is the number of parameters in the model (counting the
frozen parameters). The variable _ERR_ is the deviance. The variable
_DFT_ is the sample size.
Train: Error Function          p          sbc
            359.2707          14     446.4423
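As a check of the corrected formula (the HOUSING training data contain $n = 506$ cases, so $\ln n \approx 6.23$):
$$\mathrm{SBC} = 359.27 + 14 \times \ln(506) \approx 359.27 + 87.17 = 446.44,$$
which agrees with the value reported above.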
In this example, SBC is used along with the partial residual plots for
model selection.
The first GANN is fitted with one hidden unit and a skip layer for each
input (4 df per input):
proc neural data=dbh dmdbcat=cbh ranscale=.1 graph;
   input crim    / level=int id=x1;
   input zn      / level=int id=x2;
   input indus   / level=int id=x3;
   input chas    / level=int id=x4;
   input nox     / level=int id=x5;
   input rm      / level=int id=x6;
   input age     / level=int id=x7;
   input dis     / level=int id=x8;
   input rad     / level=int id=x9;
   input tax     / level=int id=x10;
   input ptratio / level=int id=x11;
   input b       / level=int id=x12;
   input lstat   / level=int id=x13;
   target medv   / level=int id=out act=exp error=poisson;
   hidden 1 / id=f1 act=identity comb=lin nobias; /* Consolidation layer */
   connect f1 out;
   connect x1 f1;                                 /* Skip layer */
   hidden 1 / id=h1;                              /* Hidden layer */
   connect x1 h1;
   connect h1 f1;
   hidden 1 / id=f2 act=identity comb=lin nobias;
   connect f2 out;
   connect x2 f2;
   hidden 1 / id=h2;
   connect x2 h2;
   connect h2 f2;
   hidden 1 / id=f3 act=identity comb=lin nobias;
   connect f3 out;
   connect x3 f3;
   hidden 1 / id=h3;
   connect x3 h3;
   connect h3 f3;
   hidden 1 / id=f4 act=identity comb=lin nobias;
   connect f4 out;
   connect x4 f4;
   hidden 1 / id=f5 act=identity comb=lin nobias;
   connect f5 out;
   connect x5 f5;
   hidden 1 / id=h5;
   connect x5 h5;
   connect h5 f5;
   hidden 1 / id=f6 act=identity comb=lin nobias;
   connect f6 out;
   connect x6 f6;
   hidden 1 / id=h6;
   connect x6 h6;
   connect h6 f6;
   hidden 1 / id=f7 act=identity comb=lin nobias;
   connect f7 out;
   connect x7 f7;
   hidden 1 / id=h7;
   connect x7 h7;
   connect h7 f7;
   hidden 1 / id=f8 act=identity comb=lin nobias;
   connect f8 out;
   connect x8 f8;
   hidden 1 / id=h8;
   connect x8 h8;
   connect h8 f8;
   hidden 1 / id=f9 act=identity comb=lin nobias;
   connect f9 out;
   connect x9 f9;
   hidden 1 / id=h9;
   connect x9 h9;
   connect h9 f9;
   hidden 1 / id=f10 act=identity comb=lin nobias;
   connect f10 out;
   connect x10 f10;
   hidden 1 / id=h10;
   connect x10 h10;
   connect h10 f10;
   hidden 1 / id=f11 act=identity comb=lin nobias;
   connect f11 out;
   connect x11 f11;
   hidden 1 / id=h11;
   connect x11 h11;
   connect h11 f11;
   hidden 1 / id=f12 act=identity comb=lin nobias;
   connect f12 out;
   connect x12 f12;
   hidden 1 / id=h12;
   connect x12 h12;
   connect h12 f12;
   hidden 1 / id=f13 act=identity comb=lin nobias;
   connect f13 out;
   connect x13 f13;
   hidden 1 / id=h13;
   connect x13 h13;
   connect h13 f13;
   initial inest=start;
   freeze f1->out f2->out f3->out f4->out
          f5->out f6->out f7->out f8->out
          f9->out f10->out f11->out f12->out f13->out;
   train outfit=of maxiter=200;
   score data=neuralnt.housing nodmdb out=parres;
run;

proc sql;
   select _err_,
          _dfm_-13 as p,
          _err_ + (calculated p)*log(_dft_) as sbc
   from of
   where _name_='OVERALL';
quit;
 Each input has a block of six statements: the three for the linear
model and three new statements for a one-neuron hidden layer.
The hidden layer for CHAS was not included because it is a
binary variable.
• The INITIAL statement brings in the final estimates from the
generalized linear model as starting values. Any parameters not
included in the INEST= data set are set, by default, to random values
drawn from a normal distribution with mean zero.
• The standard deviation is determined by the value of the
RANSCALE= option on the PROC NEURAL statement. A
standard deviation of 0.1 is used to encourage the training
algorithm to move slowly away from the linear model.
• The FREEZE statement keeps the output connections equal to 1.
• A SCORE statement is added to create data for the partial
residual plots.
• The maximum number of iterations is increased with the
MAXITER= option, since the Levenberg-Marquardt method
needs more than 100 iterations to converge.
Train: Error Function          p          sbc
            224.9645          50     536.2913
The 36 new parameters decreased the deviance from 359 to
225. However, this 50-parameter model increased the SBC from 446
to 536.
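The count of 50 free parameters can be reconciled with the architecture: each of the 12 interval inputs contributes 4 parameters (a skip weight, a hidden-unit bias, a hidden-unit weight, and a consolidation weight), the binary input CHAS contributes only its skip weight, and there is one output bias:
$$12 \times 4 + 1 + 1 = 50.$$
(The 13 frozen consolidation-to-output weights are excluded, which is why p is computed as _DFM_ - 13.)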
The SCORE data set automatically contains variables representing the
units in the consolidation layer. They are named F11 through F131: the
name of each variable is the ID value with the unit number as a
suffix. The predicted values are automatically named P_MEDV.
The partial residuals can be calculated by adding each univariate
function to the residuals. In this case, the partial residuals are
calculated on the log scale because the log link function was used.
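On the link (log) scale, the partial residual for input $j$ is
$$\mathrm{parres}_j = \log(y_i) - \log(\hat{y}_i) + \hat{f}_j(x_{ij}),$$
which is what the following DATA step computes, with LRES $= \log(\texttt{medv}) - \log(\texttt{p\_medv})$.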
data parres;
set parres;
lres=log(medv)-log(p_medv);
parres1 =lres+f11;
parres2 =lres+f21;
parres3 =lres+f31;
parres4 =lres+f41;
parres5 =lres+f51;
parres6 =lres+f61;
parres7 =lres+f71;
parres8 =lres+f81;
parres9 =lres+f91;
parres10=lres+f101;
parres11=lres+f111;
parres12=lres+f121;
parres13=lres+f131;
run;
goptions reset=all;
proc gplot data=parres;
   symbol1 c=b  v=circle i=none;
   symbol2 c=bl v=none   i=splines;
   plot parres1 *crim   =1 f11 *crim   =2 / overlay frame;
   plot parres2 *zn     =1 f21 *zn     =2 / overlay frame;
   plot parres3 *indus  =1 f31 *indus  =2 / overlay frame;
   plot parres4 *chas   =1 f41 *chas   =2 / overlay frame;
   plot parres5 *nox    =1 f51 *nox    =2 / overlay frame;
   plot parres6 *rm     =1 f61 *rm     =2 / overlay frame;
   plot parres7 *age    =1 f71 *age    =2 / overlay frame;
   plot parres8 *dis    =1 f81 *dis    =2 / overlay frame;
   plot parres9 *rad    =1 f91 *rad    =2 / overlay frame;
   plot parres10*tax    =1 f101*tax    =2 / overlay frame;
   plot parres11*ptratio=1 f111*ptratio=2 / overlay frame;
   plot parres12*b      =1 f121*b      =2 / overlay frame;
   plot parres13*lstat  =1 f131*lstat  =2 / overlay frame;
run;
quit;
The I=SPLINES option on the SYMBOL statement connects the function
values with an interpolating spline. Using I=JOIN would require the data
to be sorted by each input variable for each plot.
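For instance, if I=JOIN were used instead, each plot would need the data sorted by its input variable first, with a step like this (shown for CRIM only, as an illustration):

proc sort data=parres;
   by crim;   /* sort so the joined line follows the input values */
run;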
[Partial residual plots for the 13 inputs]
Comments:
Based on the partial residual plots and the increase in SBC, some
modifications to the architecture should be made.
• 5 inputs (X2, X3, X4, X7, and X12) were eliminated, since no
trend is shown in their partial residual plots.
• 4 of the remaining 8 inputs (X9, X10, X11, and X13) were pruned back
to a linear fit.
• The 4 remaining inputs (X1, X5, X6, and X8) kept their hidden layers;
in the program below, X5 is given a second hidden neuron, and the
others are kept at 4 degrees of freedom.
These changes were actually made in several steps of pruning and
refitting.
In the program, the pruning is accomplished by commenting out the code
for the hidden layers (in SAS, the code between an asterisk and a
semicolon is treated as a comment).
proc neural data=dbh dmdbcat=cbh ranscale=.1 graph random=9199;
   input crim    / level=int id=x1;
   input zn      / level=int id=x2;
   input indus   / level=int id=x3;
   input chas    / level=int id=x4;
   input nox     / level=int id=x5;
   input rm      / level=int id=x6;
   input age     / level=int id=x7;
   input dis     / level=int id=x8;
   input rad     / level=int id=x9;
   input tax     / level=int id=x10;
   input ptratio / level=int id=x11;
   input b       / level=int id=x12;
   input lstat   / level=int id=x13;
   target medv   / level=int id=out act=exp error=poisson;
   hidden 1 / id=f1 act=identity comb=lin nobias;
   connect f1 out;
   connect x1 f1;
   hidden 1 / id=h1;
   connect x1 h1;
   connect h1 f1;
*  hidden 1 / id=f2 act=identity comb=lin nobias;
*  connect f2 out;
*  connect x2 f2;
*  hidden 1 / id=h2;
*  connect x2 h2;
*  connect h2 f2;
*  hidden 1 / id=f3 act=identity comb=lin nobias;
*  connect f3 out;
*  connect x3 f3;
*  hidden 1 / id=h3;
*  connect x3 h3;
*  connect h3 f3;
*  hidden 1 / id=f4 act=identity comb=lin nobias;
*  connect f4 out;
*  connect x4 f4;
   hidden 1 / id=f5 act=identity comb=lin nobias;
   connect f5 out;
   connect x5 f5;
   hidden 2 / id=h5;
   connect x5 h5;
   connect h5 f5;
   hidden 1 / id=f6 act=identity comb=lin nobias;
   connect f6 out;
   connect x6 f6;
   hidden 1 / id=h6;
   connect x6 h6;
   connect h6 f6;
*  hidden 1 / id=f7 act=identity comb=lin nobias;
*  connect f7 out;
*  connect x7 f7;
*  hidden 1 / id=h7;
*  connect x7 h7;
*  connect h7 f7;
   hidden 1 / id=f8 act=identity comb=lin nobias;
   connect f8 out;
   connect x8 f8;
   hidden 1 / id=h8;
   connect x8 h8;
   connect h8 f8;
   hidden 1 / id=f9 act=identity comb=lin nobias;
   connect f9 out;
   connect x9 f9;
*  hidden 1 / id=h9;
*  connect x9 h9;
*  connect h9 f9;
   hidden 1 / id=f10 act=identity comb=lin nobias;
   connect f10 out;
   connect x10 f10;
*  hidden 1 / id=h10;
*  connect x10 h10;
*  connect h10 f10;
   hidden 1 / id=f11 act=identity comb=lin nobias;
   connect f11 out;
   connect x11 f11;
*  hidden 1 / id=h11;
*  connect x11 h11;
*  connect h11 f11;
*  hidden 1 / id=f12 act=identity comb=lin nobias;
*  connect f12 out;
*  connect x12 f12;
*  hidden 1 / id=h12;
*  connect x12 h12;
*  connect h12 f12;
   hidden 1 / id=f13 act=identity comb=lin nobias;
   connect f13 out;
   connect x13 f13;
*  hidden 1 / id=h13;
*  connect x13 h13;
*  connect h13 f13;
   initial inest=start;
   freeze f1->out f2->out f3->out f4->out
          f5->out f6->out f7->out f8->out
          f9->out f10->out f11->out f12->out f13->out;
   train outfit=of;
   score data=neuralnt.housing nodmdb out=parres;
run;
proc sql;
select _err_,
_dfm_-8 as p,
_err_+(calculated p)*log(_dft_) as sbc
from of
where _name_='OVERALL';
quit;
Since 5 inputs have been eliminated, only 8 consolidation-layer connections
are actually frozen, so the corrected degrees of freedom are _DFM_-8.
Train: Error Function          p          sbc
            263.2337          24     412.6705
The reduced model has 24 parameters. The deviance decreased from 359 to
263 and the SBC decreased from 446 to 413. The improvement over both the
linear model and the full GANN (SBC 536) is clear.
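The 24 free parameters can be reconciled with the final architecture: CRIM, RM, and DIS keep one hidden neuron plus a skip connection (4 parameters each), NOX has two hidden neurons plus a skip connection (7 parameters), RAD, TAX, PTRATIO, and LSTAT are reduced to skip connections only (1 each), and there is one output bias. Again taking $n = 506$,
$$3 \times 4 + 7 + 4 \times 1 + 1 = 24, \qquad \mathrm{SBC} \approx 263.23 + 24 \times \ln(506) = 412.67.$$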
Generalized Partial Residuals:

Transformed:
$$g_0^{-1}(y_i) - \Big(\hat{w}_0 + \sum_{m \neq j}\hat{f}_m(x_{im})\Big) = g_0^{-1}(y_i) - g_0^{-1}(\hat{y}_i) + \hat{f}_j(x_{ij})$$
Note that partial residuals were originally developed for linear regression.
When $g_0^{-1}$ is nonlinear, there are two alternatives:
1. Link-transform the target, or use a first-order approximation,
$$\big(y_i - \hat{y}_i\big)\,\frac{d\,g_0^{-1}(\hat{y}_i)}{d\hat{y}_i} + \hat{f}_j(x_{ij}).$$
For logistic regression this yields
$$\frac{y_i - \hat{y}_i}{\hat{y}_i\,(1 - \hat{y}_i)} + \hat{f}_j(x_{ij}).$$
The partial residuals need to be approximated because the logit is
infinite at zero and one; however, the first-order approximation often
produces wild outliers when the predicted probabilities $\hat{y}_i$ are close
to zero or one.
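A sketch of where the approximation comes from: expanding the link around the predicted mean gives
$$g_0^{-1}(y_i) \approx g_0^{-1}(\hat{y}_i) + (y_i - \hat{y}_i)\,\frac{d\,g_0^{-1}(\mu)}{d\mu}\bigg|_{\mu=\hat{y}_i},$$
and for the logit link $d\,\mathrm{logit}(\mu)/d\mu = 1/\big(\mu(1-\mu)\big)$, so the difference $g_0^{-1}(y_i) - g_0^{-1}(\hat{y}_i)$ is replaced by $(y_i - \hat{y}_i)/\big(\hat{y}_i(1-\hat{y}_i)\big)$.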
2. Empirical Partial Residuals:
The empirical partial residuals for the jth input are calculated by
binning the input into $r = 1, 2, \ldots, R$ equally sized groups (quantiles)
and transforming the sample proportions:
$$pr_{rj} = \ln\!\left(\frac{m_{1rj} + 1}{m_{rj} - m_{1rj} + 1}\right) - \ln\!\left(\frac{\hat{y}_{rj}}{1 - \hat{y}_{rj}}\right) + \hat{f}_{rj}(x_j),$$
where
$m_{1rj}$ = the number of events,
$m_{rj}$ = the number of cases,
$\hat{y}_{rj}$ = the mean predicted probability, and
$\hat{f}_{rj}(x_j)$ = the mean component in the rth bin.
Example 3.17 (GANN: Binary Target) The BANK8DTR data set has a
binary target. A GANN model is constructed in the same fashion as in the
previous example. The empirical partial residuals require additional
programming.
Initialize Data Mining Database:
proc dmdb batch data=neuralnt.bank8dtr dmdbcat=ctr
out=dtr;
var atmct adbdda ddatot ddadep income invest savbal
atres;
class acquire(desc);
run;
(logit) linear model for initialization:
proc neural data=dtr dmdbcat=ctr graph;
   input atmct  / level=int id=x1;
   input adbdda / level=int id=x2;
   input ddatot / level=int id=x3;
   input ddadep / level=int id=x4;
   input income / level=int id=x5;
   input invest / level=int id=x6;
   input savbal / level=int id=x7;
   input atres  / level=int id=x8;
target acquire / level=nom id=out;
hidden 1 / id=f1 act=identity comb=lin nobias;
connect f1 out;
connect x1 f1;
hidden 1 / id=f2 act=identity comb=lin nobias;
connect f2 out;
connect x2 f2;
hidden 1 / id=f3 act=identity comb=lin nobias;
connect f3 out;
connect x3 f3;
hidden 1 / id=f4 act=identity comb=lin nobias;
connect f4 out;
connect x4 f4;
hidden 1 / id=f5 act=identity comb=lin nobias;
connect f5 out;
connect x5 f5;
hidden 1 / id=f6 act=identity comb=lin nobias;
connect f6 out;
connect x6 f6;
hidden 1 / id=f7 act=identity comb=lin nobias;
connect f7 out;
connect x7 f7;
hidden 1 / id=f8 act=identity comb=lin nobias;
connect f8 out;
connect x8 f8;
initial;
freeze f1->out
f2->out
f3->out
f4->out
f5->out
f6->out
f7->out
f8->out / value=1;
train outest=start;
run;
GANN with 4 df per input:
proc neural data=dtr dmdbcat=ctr ranscale=.1 graph;
   input atmct  / level=int id=x1;
   input adbdda / level=int id=x2;
   input ddatot / level=int id=x3;
   input ddadep / level=int id=x4;
   input income / level=int id=x5;
   input invest / level=int id=x6;
   input savbal / level=int id=x7;
   input atres  / level=int id=x8;
   target acquire / level=nom id=out;
   hidden 1 / id=f1 act=identity comb=lin nobias;
   connect f1 out;
   connect x1 f1;
   hidden 1 / id=h1;
   connect x1 h1;
   connect h1 f1;
   hidden 1 / id=f2 act=identity comb=lin nobias;
   connect f2 out;
   connect x2 f2;
   hidden 1 / id=h2;
   connect x2 h2;
   connect h2 f2;
   hidden 1 / id=f3 act=identity comb=lin nobias;
   connect f3 out;
   connect x3 f3;
   hidden 1 / id=h3;
   connect x3 h3;
   connect h3 f3;
   hidden 1 / id=f4 act=identity comb=lin nobias;
   connect f4 out;
   connect x4 f4;
   hidden 1 / id=h4;
   connect x4 h4;
   connect h4 f4;
   hidden 1 / id=f5 act=identity comb=lin nobias;
   connect f5 out;
   connect x5 f5;
   hidden 1 / id=h5;
   connect x5 h5;
   connect h5 f5;
   hidden 1 / id=f6 act=identity comb=lin nobias;
   connect f6 out;
   connect x6 f6;
   hidden 1 / id=h6;
   connect x6 h6;
   connect h6 f6;
   hidden 1 / id=f7 act=identity comb=lin nobias;
   connect f7 out;
   connect x7 f7;
   hidden 1 / id=h7;
   connect x7 h7;
   connect h7 f7;
   hidden 1 / id=f8 act=identity comb=lin nobias;
   connect f8 out;
   connect x8 f8;
   hidden 1 / id=h8;
   connect x8 h8;
   connect h8 f8;
   initial inest=start;
   freeze f1->out f2->out f3->out f4->out
          f5->out f6->out f7->out f8->out;
   train;
   score data=neuralnt.bank8dtr nodmdb out=str;
run;
The empirical partial residual plots are created using a macro called
PRLOGIT. The macro has two arguments: the name of the input variable
and the name of its univariate function (consolidation layer unit).
%macro prlogit(var,f);
   proc rank data=str groups=100 out=binned;
      var &var;
      ranks bin;
   run;
   proc sql;
      create table bins as
      select count(acquire) as tot,
             sum(acquire) as m,
             mean(p_acquire1) as p_acqui1,
             mean(&var) as &var,
             mean(&f) as &f,
             log((calculated m+1)/(calculated tot-calculated m+1)) as elogit,
             calculated elogit
                - log(calculated p_acqui1/(1-calculated p_acqui1))
                + calculated &f as parres
      from binned
      group by bin;
   quit;
   goptions reset=all;
   proc gplot data=bins;
      symbol1 c=b  v=circle i=none;
      symbol2 c=bl v=none   i=join;
      plot parres*&var=1 &f*&var=2 / overlay frame;
   run;
   quit;
%mend;
PROC RANK with the GROUPS= option bins the input variable
(&VAR) into 100 approximately equal groups. If there are tied values, fewer than 100
groups might be created and the groups may be unequal. The output data
set (BINNED) contains a bin identification variable (BIN) that ranges
from 0 to 99. It also contains all the variables in the input data set,
including the target, inputs, and predicted values.
The PROC SQL step creates a data set (BINS) containing the mean input
variable, the mean univariate function, and the empirical partial residual
for each bin. The empirical logits are calculated as the logits of the
proportions of events in each bin. The sum of ACQUIRE is the number
of events. The value 1 is added to the counts to shrink the estimates and
prevent problems with zeros.
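For example, under this adjustment a bin with $m_{1rj} = 0$ events out of $m_{rj} = 50$ cases (illustrative numbers) yields an empirical logit of $\ln(1/51) \approx -3.93$ instead of $-\infty$.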
PROC GPLOT overlays the function and the partial residuals versus the
mean input. I=JOIN is used in the SYMBOL statement because the binned
data are already sorted by the input variable.
Each macro call produces a single empirical partial residual plot.
%prlogit(atmct ,f11);
%prlogit(adbdda,f21);
%prlogit(ddatot,f31);
%prlogit(ddadep,f41);
%prlogit(income,f51);
%prlogit(invest,f61);
%prlogit(savbal,f71);
%prlogit(atres ,f81);
[Empirical partial residual plots for the eight inputs]
Leverage and Transformations:
The empirical partial residual plots show poor fits for several of the
inputs. The distributions of the inputs DDATOT, ADBDDA, DDADEP,
and SAVBAL are highly skewed. The fitted values appear to be overly
sensitive to the few large values in the tails of the distributions.
Power transformations are applied to DDATOT, ADBDDA, DDADEP,
and SAVBAL to encourage the GANN to learn the variation in the center
of the distributions. The cube-root transformation is used because of the
positive skewness and the presence of many zero values. The choice of
transformation is not as crucial here as it is with linear models, because the
neural network can accommodate nonlinearity. The purpose of the
transformation is merely to reduce data sparsity. For simplicity, the
negative values of ADBDDA are truncated at zero.
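To see the effect of the cube root on a skewed input, illustrative balances of 0, 1,000, and 125,000 map to $0$, $1000^{1/3} = 10$, and $125000^{1/3} = 50$, so the long right tail is pulled in while the ordering of the values is preserved.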
data neuralnt.transtr;
set neuralnt.bank8dtr;
tddatot=ddatot**(1/3);
tadbdda=(adbdda*(adbdda>=0))**(1/3);
tddadep=ddadep**(1/3);
tsavbal=savbal**(1/3);
run;
data neuralnt.transte;
set neuralnt.bank8dte;
tddatot=ddatot**(1/3);
tadbdda=(adbdda*(adbdda>=0))**(1/3);
tddadep=ddadep**(1/3);
tsavbal=savbal**(1/3);
run;
The addition of the new transformed inputs requires that the entire
program be rerun.
proc dmdb batch data=neuralnt.transtr dmdbcat=ctr
out=dtr;
var atmct tadbdda tddatot tddadep income invest
tsavbal atres;
class acquire(desc);
run;
proc neural data=dtr dmdbcat=ctr graph;
   input atmct   / level=int id=x1;
   input tadbdda / level=int id=x2;
   input tddatot / level=int id=x3;
   input tddadep / level=int id=x4;
   input income  / level=int id=x5;
   input invest  / level=int id=x6;
   input tsavbal / level=int id=x7;
   input atres   / level=int id=x8;
target acquire / level=nom id=out comb=lin;
hidden 1 / id=f1 act=identity comb=lin nobias;
connect f1 out;
connect x1 f1;
hidden 1 / id=f2 act=identity comb=lin nobias;
connect f2 out;
connect x2 f2;
hidden 1 / id=f3 act=identity comb=lin nobias;
connect f3 out;
connect x3 f3;
hidden 1 / id=f4 act=identity comb=lin nobias;
connect f4 out;
connect x4 f4;
hidden 1 / id=f5 act=identity comb=lin nobias;
connect f5 out;
connect x5 f5;
hidden 1 / id=f6 act=identity comb=lin nobias;
connect f6 out;
connect x6 f6;
hidden 1 / id=f7 act=identity comb=lin nobias;
connect f7 out;
connect x7 f7;
hidden 1 / id=f8 act=identity comb=lin nobias;
connect f8 out;
connect x8 f8;
initial;
freeze f1->out
f2->out
f3->out
f4->out
f5->out
f6->out
f7->out
f8->out / value=1;
train outest=start;
run;
The quasi-Newton algorithm is used to speed up training.
proc neural data=dtr dmdbcat=ctr ranscale=.1 graph;
   input atmct   / level=int id=x1;
   input tadbdda / level=int id=x2;
   input tddatot / level=int id=x3;
   input tddadep / level=int id=x4;
   input income  / level=int id=x5;
   input invest  / level=int id=x6;
   input tsavbal / level=int id=x7;
   input atres   / level=int id=x8;
   target acquire / level=nom id=out comb=lin;
   hidden 1 / id=f1 act=identity comb=lin nobias;
   connect f1 out;
   connect x1 f1;
   hidden 1 / id=h1;
   connect x1 h1;
   connect h1 f1;
   hidden 1 / id=f2 act=identity comb=lin nobias;
   connect f2 out;
   connect x2 f2;
   hidden 1 / id=h2;
   connect x2 h2;
   connect h2 f2;
   hidden 1 / id=f3 act=identity comb=lin nobias;
   connect f3 out;
   connect x3 f3;
   hidden 1 / id=h3;
   connect x3 h3;
   connect h3 f3;
   hidden 1 / id=f4 act=identity comb=lin nobias;
   connect f4 out;
   connect x4 f4;
   hidden 1 / id=h4;
   connect x4 h4;
   connect h4 f4;
   hidden 1 / id=f5 act=identity comb=lin nobias;
   connect f5 out;
   connect x5 f5;
   hidden 1 / id=h5;
   connect x5 h5;
   connect h5 f5;
   hidden 1 / id=f6 act=identity comb=lin nobias;
   connect f6 out;
   connect x6 f6;
   hidden 1 / id=h6;
   connect x6 h6;
   connect h6 f6;
   hidden 1 / id=f7 act=identity comb=lin nobias;
   connect f7 out;
   connect x7 f7;
   hidden 1 / id=h7;
   connect x7 h7;
   connect h7 f7;
   hidden 1 / id=f8 act=identity comb=lin nobias;
   connect f8 out;
   connect x8 f8;
   hidden 1 / id=h8;
   connect x8 h8;
   connect h8 f8;
   initial inest=start;
   freeze f1->out f2->out f3->out f4->out
          f5->out f6->out f7->out f8->out;
   train tech=quanew;
   score data=neuralnt.transtr nodmdb out=str;
   score data=neuralnt.transte nodmdb out=ste;
run;
%prlogit(atmct ,f11);
%prlogit(tadbdda,f21);
%prlogit(tddatot,f31);
%prlogit(tddadep,f41);
%prlogit(income ,f51);
%prlogit(invest ,f61);
%prlogit(tsavbal,f71);
%prlogit(atres ,f81);
[Empirical partial residual plots for the transformed inputs]