3.5 Neural Networks for Predictive Data Mining

A. Review

Data mining is concerned with extracting knowledge from large (mainly corporate) databases that were originally developed for operational purposes.

Applications of predictive data mining:
- Database marketing
- Response modeling
- Attrition prediction
- Credit scoring
- Fraud detection

Types of predictive data mining models:
- Regression models (linear and generalized linear models)
  - Easy to train
  - Interpretable
  - Inflexible
  - Low power in prediction
- Decision tree models
  - Relatively easy to train
  - Easy to interpret
  - Flexible
  - Moderate power in prediction
- Artificial neural networks
  - Flexible
  - High power in prediction
  - Troublesome in training
  - Incomprehensible (a black box)

Some remarks about ANNs:

Considerable computational effort is often required to optimize the large number of parameters in a typical neural network model. This is partially self-inflicted: data analysts often use inefficient optimization methods such as back-propagation. Even with an efficient algorithm, local minima are troublesome. There is no simple method for selecting the best network from the enormous number of possible architectures, and input variable selection increases the combinatorial complexity of the problem. Overfitting is always a serious concern with flexible models. Multilayer perceptrons are black boxes with respect to interpretation: the training data (y, x) go in and a prediction ŷ comes out, but how the inputs produce the prediction is not easily described.

In some pattern recognition applications, such as handwriting recognition, pure prediction is the goal; understanding how the inputs affect the prediction is immaterial. In many scientific applications, the opposite is true: understanding is the goal, and predictive power is a consideration only to the extent that it validates the interpretive power of the model. This is the domain of formal statistical inference, such as hypothesis tests and confidence intervals. Domains such as database marketing often have both goals. Scoring new cases is the ultimate purpose of predictive modeling, but some understanding, even informal, of the factors affecting the prediction can be helpful in deciding how to market to segments of likely responders. Understanding the effects of the inputs can also be useful for decisions about costly data acquisition. In credit scoring, the opaqueness of the model can have legal ramifications: the US Equal Credit Opportunity Act requires that creditors provide a statement of specific reasons why an adverse action was taken.

Neural networks have an allure unmatched by other methods with similar capabilities. They are often characterized as capable of learning complex patterns by "thinking about" past experiences, and this is falsely contrasted with supposedly more difficult statistical methods that require manual tuning and adherence to restrictive assumptions. The chief benefit of neural networks for predictive modeling is their flexibility in representing nonlinear phenomena (universal approximation). However, their ascribed similarities to organic neural networks have led to unhealthy anthropomorphism and, with it, unrealistic expectations.

Most of the effort in predictive modeling is in the data preparation steps: creating and eliminating input variables, imputing missing values, collapsing and dummy-coding nominal inputs, and nullifying the effect of outliers. Even more effort is spent on determining and acquiring the data that are relevant to the particular business problem. Neural networks provide no particular relief with regard to these efforts.
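As a small illustration of what these preparation steps can look like in practice (a sketch only — the data set RAW and the variables INCOME and REGION are hypothetical, and the imputation and capping constants would normally be derived from the training data):

data prepped;
   set raw;
   /* impute a missing interval input with a constant chosen from the training data */
   if income = . then income = 38000;
   /* cap an extreme value to nullify the effect of outliers */
   if income > 500000 then income = 500000;
   /* dummy-code a nominal input with levels 'NORTH', 'SOUTH', 'WEST' */
   region_n = (region = 'NORTH');
   region_s = (region = 'SOUTH');
run;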
Keep in Mind!!!

While performance improvements are possible with sophisticated modeling methods such as neural networks, the greatest benefit comes not from a better modeling technique but from better data. For example, discovering a new input that is strongly associated with the target usually does more for prediction than any change of model.

B. Generalized Additive Neural Networks (GANN)

Question: Can we have a class of models that lies between generalized regression models and neural network models in terms of efficiency and training complexity?

Solution: generalized additive models (GAMs). They are a compromise between inflexible, but docile, linear models and flexible, but troublesome, universal approximators.

GAMs:

$g_0^{-1}(E(y)) = w_0 + f_1(x_1) + f_2(x_2) + \cdots + f_k(x_k)$

In a GAM, the link-transformed expected target is expressed as the sum of individual, unspecified univariate functions. GAMs are usually presented as extensions of linear models, but they can also be portrayed as constrained forms of multivariate function estimators such as projection pursuit regression and neural networks.

References:
Hastie, T.J. and Tibshirani, R.J. (1990), Generalized Additive Models, New York: Chapman and Hall.
Friedman, J.H. and Stuetzle, W. (1981), "Projection Pursuit Regression," Journal of the American Statistical Association, 76, 817-823.
Sarle, W.S. (1994), "Neural Networks and Statistical Models," Proceedings of the Nineteenth Annual SAS Users Group International Conference.

Generalized Additive Neural Network

[Network diagram: each input feeds its own tanh hidden unit and also connects directly to the output through a skip layer.] For two inputs,

$H_{11} = \tanh(w_{011} + w_{111}x_1)$
$H_{21} = \tanh(w_{021} + w_{121}x_2)$
$g_0^{-1}(E(y)) = w_0 + w_{11}H_{11} + w_{10}x_1 + w_{21}H_{21} + w_{20}x_2$

The basic architecture for a generalized additive neural network (GANN) has a separate set of layers for each input variable. The hidden layers do not have connections from more than one input variable. The combined output from the layers associated with each input (a mini neural network) forms that input's univariate function. The type of architecture could vary across inputs. Including a skip layer for each input makes the generalized linear model (GLM) a special case of the GANN.

Consolidation Layer:

[Network diagram: the hidden unit and skip connection for each input are combined in a linear consolidation unit ($f_1$, $f_2$), and the consolidation units are connected to the output with weights fixed at 1.]

$H_{11} = \tanh(w_{011} + w_{111}x_1)$
$H_{21} = \tanh(w_{021} + w_{121}x_2)$
$g_0^{-1}(E(y)) = w_0 + \underbrace{w_{11}H_{11} + w_{10}x_1}_{f_1} + \underbrace{w_{21}H_{21} + w_{20}x_2}_{f_2}$

A useful modification for PROC NEURAL is the addition of a linear layer to consolidate the output from the hidden layer for each input. This layer does not transform its input (identity activation function). If the output connections are set to one, then the outputs from the consolidation layer are the individual univariate functions. PROC NEURAL automatically creates a variable in the output data set for each unit of the consolidation layer, which facilitates plotting the individual functions.
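Before turning to the full programs below, it may help to see the block of PROC NEURAL statements for a single input in isolation. The following sketch is a simplified excerpt of the Example 3.16 program (same DBH data mining data set, CBH catalog, CRIM input, and MEDV target); it is meant only to show the pattern of statements that is repeated for every input.

proc neural data=dbh dmdbcat=cbh;
   input crim  / level=int id=x1;
   target medv / level=int id=out act=exp error=poisson;
   hidden 1 / id=f1 act=identity comb=lin nobias;  /* consolidation layer (identity) */
   hidden 1 / id=h1;                               /* one tanh hidden unit           */
   connect x1 h1;                                  /* input  -> hidden               */
   connect h1 f1;                                  /* hidden -> consolidation        */
   connect x1 f1;                                  /* skip layer                     */
   connect f1 out;                                 /* consolidation -> output        */
   initial;
   freeze f1->out / value=1;                       /* fix the redundant output weight at 1 */
   train;
run;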
GANN Estimation

Stepwise construction:

1. Fit a generalized linear model to give initial estimates of the skip layer and the output bias:
   $g_0^{-1}(E(y)) = w_0 + w_{10}x_1 + w_{20}x_2$

2. Construct a GANN with one neuron and a skip layer for each input. This gives four parameters per input variable. Binary inputs (dummy variables) need only a direct connection (1 df).

3. Initialize the remaining three parameters in each hidden layer as small random numbers:
   $g_0^{-1}(E(y)) = w_0 + f_1(x_1) + f_2(x_2)$
   $f_1(x_1) = w_{11}H_{11} + w_{10}x_1, \qquad H_{11} = \tanh(w_{011} + w_{111}x_1)$
   $f_2(x_2) = w_{21}H_{21} + w_{20}x_2, \qquad H_{21} = \tanh(w_{021} + w_{121}x_2)$
   The parameters $w_{111}$, $w_{011}$, $w_{11}$ and $w_{121}$, $w_{021}$, $w_{21}$ need to be initialized; $w_{10}$ and $w_{20}$ can be set to the estimates from Step 1.

4. Train the full GANN (4k + 1 df) model.

5. Examine each of the fitted univariate functions, $\hat f_j(x_j)$, overlaid on its partial residuals.

   Partial residuals (identity link):
   $PR_{ij} = y_i - \hat w_0 - \sum_{m \ne j} \hat f_m(x_{im}) = y_i - \hat y_i + \hat f_j(x_{ij})$

   Partial residuals are calculated as the difference between the target and the part of the additive model that does not include the input under investigation. In contrast to simple plots of the target versus an input, partial residual plots show the effect of each input adjusted for the effects of the other inputs, and thus they are a more effective visual diagnostic.

   Upshot:
   - If no trend is present, the corresponding input variable should probably be eliminated.
   - If the trend is linear, probably only the skip layer should be included.
   - If curvature is present, the corresponding hidden-layer architecture should be retained, or neurons even added.

6. Prune the hidden layers with apparently linear effects and add neurons to hidden layers where the nonlinear trend appears to be underfitted.

Notes:
- The random initialization of the hidden layers is designed to give well-placed and stable starting values. However, local minima can still occur. Inferior local minima can usually be seen in the partial residual plots, in which case a new random initialization should be tried.
- The growing and pruning process starts with a single neuron plus a skip layer rather than with the linear model (the linear fit is only for initialization). Partial residuals based on an additive fit are more reliable than those based on a linear fit. Moreover, starting with a 4-degree-of-freedom smoother is common practice in GAM estimation.
- Model selection based on partial residual plots is subjective. Formal measures of fit can be incorporated into the process.

Example 3.16 (GANN: Interval Target)

In this example, the HOUSING data set is refit following the GANN construction procedure. First, a generalized linear model is fitted to provide initial estimates for the GANN. A generalized linear model could be fitted easily by using the GLIM option on the ARCH statement. However, to initialize the GANN a more complicated program is needed: the linear model must have the same structure as the GANN so that the parameter estimates have the same names. Furthermore, the inputs are not standardized in a model using the GLIM option. By setting up a parallel structure for the linear model, the GANN can automatically use the final estimates as starting values.

The GANN structure has a hidden layer and a skip layer for each input. It also has a linear consolidation layer between the hidden layer and the output. The connections between the consolidation layer and the output layer are set to one and frozen. The PROC NEURAL syntax for GANNs is long because separate hidden layers are built for each input variable. In the HOUSING data there are 13 inputs, so there are 13 INPUT statements, 13 HIDDEN statements for the hidden layer, 13 HIDDEN statements for the linear consolidation layer, and the CONNECT statements to connect all the layers.

Step 1. Convert the data set to a data mining data set.

proc dmdb batch data=neuralnt.housing dmdbcat=cbh out=dbh;
   var crim zn indus chas nox rm age dis rad tax ptratio b lstat medv;
run;
Step 2. Linear initialization model.

proc neural data=dbh dmdbcat=cbh graph;
   input crim    / level=int id=x1;
   input zn      / level=int id=x2;
   input indus   / level=int id=x3;
   input chas    / level=int id=x4;
   input nox     / level=int id=x5;
   input rm      / level=int id=x6;
   input age     / level=int id=x7;
   input dis     / level=int id=x8;
   input rad     / level=int id=x9;
   input tax     / level=int id=x10;
   input ptratio / level=int id=x11;
   input b       / level=int id=x12;
   input lstat   / level=int id=x13;
   target medv   / level=int id=out act=exp error=poisson;

   hidden 1 / id=f1 act=identity comb=lin nobias;   /* Consolidation layer */
   connect f1 out;
   connect x1 f1;                                   /* Skip layer */

   hidden 1 / id=f2 act=identity comb=lin nobias;
   connect f2 out;
   connect x2 f2;

   hidden 1 / id=f3 act=identity comb=lin nobias;
   connect f3 out;
   connect x3 f3;

   hidden 1 / id=f4 act=identity comb=lin nobias;
   connect f4 out;
   connect x4 f4;

   hidden 1 / id=f5 act=identity comb=lin nobias;
   connect f5 out;
   connect x5 f5;

   hidden 1 / id=f6 act=identity comb=lin nobias;
   connect f6 out;
   connect x6 f6;

   hidden 1 / id=f7 act=identity comb=lin nobias;
   connect f7 out;
   connect x7 f7;

   hidden 1 / id=f8 act=identity comb=lin nobias;
   connect f8 out;
   connect x8 f8;

   hidden 1 / id=f9 act=identity comb=lin nobias;
   connect f9 out;
   connect x9 f9;

   hidden 1 / id=f10 act=identity comb=lin nobias;
   connect f10 out;
   connect x10 f10;

   hidden 1 / id=f11 act=identity comb=lin nobias;
   connect f11 out;
   connect x11 f11;

   hidden 1 / id=f12 act=identity comb=lin nobias;
   connect f12 out;
   connect x12 f12;

   hidden 1 / id=f13 act=identity comb=lin nobias;
   connect f13 out;
   connect x13 f13;

   initial;
   freeze f1->out f2->out f3->out f4->out f5->out f6->out f7->out
          f8->out f9->out f10->out f11->out f12->out f13->out / value=1;
   train outfit=of outest=start;
run;

The input units are given the IDs X1 through X13. The TARGET statement uses the log link (ACT=EXP) and the Poisson error function. Each of the 13 inputs has a block of three statements: a HIDDEN statement to set up the consolidation layer, a CONNECT statement to connect the consolidation layer to the output, and a CONNECT statement to connect the input to the consolidation layer (the skip layer). The units in the consolidation layer are given the IDs F1 through F13. The consolidation layer uses the identity activation function and a linear combination function with no bias terms; it would be redundant to include 13 output biases, one for each input. The output layer uses a linear combination function by default. The connections from the consolidation layer to the output are likewise redundant, so they are set equal to one and prevented from changing with the FREEZE statement. The INITIAL statement is needed to activate the FREEZE statement. The only free parameter in the output layer is its bias.

The OUTEST= option on the TRAIN statement creates the START data set containing the final estimates. They are used to initialize the GANN.

proc sql;
   select _err_, _dfm_-13 as p,
          _err_+(calculated p)*log(_dft_) as sbc
   from of
   where _name_='OVERALL';
quit;

Various fit statistics are computed in the OUTFIT= data set, including the deviance and SBC. SBC is automatically calculated and stored in the variable _SBC_. However, it is computed incorrectly for this type of network. This network has 27 parameters, 13 of which are frozen, and the frozen parameters are counted in the SBC penalty $p\ln(n)$. Consequently, the SBC statistic needs to be corrected by subtracting the number of inputs from the model degrees of freedom. The variable _DFM_ is the number of parameters in the model (counting the frozen parameters), _ERR_ is the deviance, and _DFT_ is the sample size.
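To make the correction concrete, a worked check with the numbers in the listing that follows (the HOUSING data contain $n = 506$ observations, so $\ln(506) \approx 6.227$):

$\mathrm{SBC} = \text{deviance} + p\,\ln(n) = 359.2707 + 14 \times 6.227 \approx 446.44$

which agrees with the corrected SBC reported by the PROC SQL step.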
   Train: Error Function.           p           sbc
   --------------------------------------------------
                359.2707           14      446.4423

In this example, SBC is used along with the partial residual plots for model selection. The first GANN is fit with one hidden unit and a skip layer for each input (4 df per input).

proc neural data=dbh dmdbcat=cbh ranscale=.1 graph;
   input crim    / level=int id=x1;
   input zn      / level=int id=x2;
   input indus   / level=int id=x3;
   input chas    / level=int id=x4;
   input nox     / level=int id=x5;
   input rm      / level=int id=x6;
   input age     / level=int id=x7;
   input dis     / level=int id=x8;
   input rad     / level=int id=x9;
   input tax     / level=int id=x10;
   input ptratio / level=int id=x11;
   input b       / level=int id=x12;
   input lstat   / level=int id=x13;
   target medv   / level=int id=out act=exp error=poisson;

   hidden 1 / id=f1 act=identity comb=lin nobias;   /* Consolidation layer */
   connect f1 out;
   connect x1 f1;                                   /* Skip layer */
   hidden 1 / id=h1;                                /* Hidden layer */
   connect x1 h1;
   connect h1 f1;

   hidden 1 / id=f2 act=identity comb=lin nobias;
   connect f2 out;
   connect x2 f2;
   hidden 1 / id=h2;
   connect x2 h2;
   connect h2 f2;

   hidden 1 / id=f3 act=identity comb=lin nobias;
   connect f3 out;
   connect x3 f3;
   hidden 1 / id=h3;
   connect x3 h3;
   connect h3 f3;

   hidden 1 / id=f4 act=identity comb=lin nobias;
   connect f4 out;
   connect x4 f4;

   hidden 1 / id=f5 act=identity comb=lin nobias;
   connect f5 out;
   connect x5 f5;
   hidden 1 / id=h5;
   connect x5 h5;
   connect h5 f5;

   hidden 1 / id=f6 act=identity comb=lin nobias;
   connect f6 out;
   connect x6 f6;
   hidden 1 / id=h6;
   connect x6 h6;
   connect h6 f6;

   hidden 1 / id=f7 act=identity comb=lin nobias;
   connect f7 out;
   connect x7 f7;
   hidden 1 / id=h7;
   connect x7 h7;
   connect h7 f7;

   hidden 1 / id=f8 act=identity comb=lin nobias;
   connect f8 out;
   connect x8 f8;
   hidden 1 / id=h8;
   connect x8 h8;
   connect h8 f8;

   hidden 1 / id=f9 act=identity comb=lin nobias;
   connect f9 out;
   connect x9 f9;
   hidden 1 / id=h9;
   connect x9 h9;
   connect h9 f9;

   hidden 1 / id=f10 act=identity comb=lin nobias;
   connect f10 out;
   connect x10 f10;
   hidden 1 / id=h10;
   connect x10 h10;
   connect h10 f10;

   hidden 1 / id=f11 act=identity comb=lin nobias;
   connect f11 out;
   connect x11 f11;
   hidden 1 / id=h11;
   connect x11 h11;
   connect h11 f11;

   hidden 1 / id=f12 act=identity comb=lin nobias;
   connect f12 out;
   connect x12 f12;
   hidden 1 / id=h12;
   connect x12 h12;
   connect h12 f12;

   hidden 1 / id=f13 act=identity comb=lin nobias;
   connect f13 out;
   connect x13 f13;
   hidden 1 / id=h13;
   connect x13 h13;
   connect h13 f13;

   initial inest=start;
   freeze f1->out f2->out f3->out f4->out f5->out f6->out f7->out
          f8->out f9->out f10->out f11->out f12->out f13->out;
   train outfit=of maxiter=200;
   score data=neuralnt.housing nodmdb out=parres;
run;

proc sql;
   select _err_, _dfm_-13 as p,
          _err_+(calculated p)*log(_dft_) as sbc
   from of
   where _name_='OVERALL';
quit;

Each input has a block of six statements: the three from the linear model and three new statements for a one-neuron hidden layer. The hidden layer for CHAS is not included because it is a binary variable. The INITIAL statement brings in the final estimates from the generalized linear model as starting values. Any parameters not included in the INEST= data set default to random values drawn from a normal distribution with mean zero. The standard deviation is determined by the RANSCALE= option on the PROC NEURAL statement; a standard deviation of 0.1 is used to encourage the training algorithm to move slowly away from the linear model.
The FREEZE statement keeps the output connections equal to 1. A SCORE statement is added to create the data for the partial residual plots. The maximum number of iterations is increased with the MAXITER= option because the Levenberg-Marquardt method needs more than 100 iterations to converge here.

   Train: Error Function.           p           sbc
   --------------------------------------------------
                224.9645           50      536.2913

The 36 new parameters caused the deviance to decrease from 359 to 225. However, this 50-parameter model (12 inputs with 4 parameters each, plus the skip connection for the binary CHAS and the output bias) increased the SBC from 446 to 536.

The SCORE data set automatically contains variables representing the units in the consolidation layer. They are named F11, F21, ..., F131: the ID of the unit with the unit number appended as a suffix. The predicted values are automatically named P_MEDV. The partial residuals can be calculated by adding each univariate function to the residuals. In this case, the partial residuals are calculated on the log scale because the log link function was used.

data parres;
   set parres;
   lres=log(medv)-log(p_medv);
   parres1 =lres+f11;
   parres2 =lres+f21;
   parres3 =lres+f31;
   parres4 =lres+f41;
   parres5 =lres+f51;
   parres6 =lres+f61;
   parres7 =lres+f71;
   parres8 =lres+f81;
   parres9 =lres+f91;
   parres10=lres+f101;
   parres11=lres+f111;
   parres12=lres+f121;
   parres13=lres+f131;
run;

goptions reset=all;
proc gplot data=parres;
   symbol1 c=b v=circle i=none;
   symbol2 c=bl v=none i=splines;
   plot parres1 *crim   =1 f11 *crim   =2 / overlay frame;
   plot parres2 *zn     =1 f21 *zn     =2 / overlay frame;
   plot parres3 *indus  =1 f31 *indus  =2 / overlay frame;
   plot parres4 *chas   =1 f41 *chas   =2 / overlay frame;
   plot parres5 *nox    =1 f51 *nox    =2 / overlay frame;
   plot parres6 *rm     =1 f61 *rm     =2 / overlay frame;
   plot parres7 *age    =1 f71 *age    =2 / overlay frame;
   plot parres8 *dis    =1 f81 *dis    =2 / overlay frame;
   plot parres9 *rad    =1 f91 *rad    =2 / overlay frame;
   plot parres10*tax    =1 f101*tax    =2 / overlay frame;
   plot parres11*ptratio=1 f111*ptratio=2 / overlay frame;
   plot parres12*b      =1 f121*b      =2 / overlay frame;
   plot parres13*lstat  =1 f131*lstat  =2 / overlay frame;
run;
quit;

The I=SPLINES option on the second SYMBOL statement connects the fitted function with an interpolating spline. Using I=JOIN would require the data to be sorted by each input variable for each plot.

[Partial residual plots for the 13 inputs, each overlaid with its fitted univariate function, appear here.]

Comments: Based on the partial residual plots and the increase in SBC, the architecture was modified:
- Five inputs, X2, X3, X4, X7, and X12, were eliminated because no trend appears in their partial residual plots.
- Four of the remaining eight inputs, X9, X10, X11, and X13, were pruned back to a linear fit.
- The remaining four inputs were kept at 4 degrees of freedom; in the final program below, NOX is given a second hidden unit.
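As a quick check on the reduced architecture (counting the free parameters in the program that follows): the output bias (1), the four linear-only inputs (one skip weight each), the three one-neuron inputs CRIM, RM, and DIS (4 parameters each), and the two-neuron NOX layer (7 parameters) give $1 + 4 + 12 + 7 = 24$ parameters, so the corrected criterion is $\mathrm{SBC} = 263.2337 + 24\ln(506) \approx 412.67$, matching the value reported after the program.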
These changes were actually made in several steps of pruning and refitting. In the program, the pruning is accomplished by commenting out the code for the hidden layers (code between an asterisk and a semicolon is treated as a comment).

proc neural data=dbh dmdbcat=cbh ranscale=.1 graph random=9199;
   input crim    / level=int id=x1;
   input zn      / level=int id=x2;
   input indus   / level=int id=x3;
   input chas    / level=int id=x4;
   input nox     / level=int id=x5;
   input rm      / level=int id=x6;
   input age     / level=int id=x7;
   input dis     / level=int id=x8;
   input rad     / level=int id=x9;
   input tax     / level=int id=x10;
   input ptratio / level=int id=x11;
   input b       / level=int id=x12;
   input lstat   / level=int id=x13;
   target medv   / level=int id=out act=exp error=poisson;

   hidden 1 / id=f1 act=identity comb=lin nobias;
   connect f1 out;
   connect x1 f1;
   hidden 1 / id=h1;
   connect x1 h1;
   connect h1 f1;

   *hidden 1 / id=f2 act=identity comb=lin nobias;
   *connect f2 out;
   *connect x2 f2;
   *hidden 1 / id=h2;
   *connect x2 h2;
   *connect h2 f2;

   *hidden 1 / id=f3 act=identity comb=lin nobias;
   *connect f3 out;
   *connect x3 f3;
   *hidden 1 / id=h3;
   *connect x3 h3;
   *connect h3 f3;

   *hidden 1 / id=f4 act=identity comb=lin nobias;
   *connect f4 out;
   *connect x4 f4;

   hidden 1 / id=f5 act=identity comb=lin nobias;
   connect f5 out;
   connect x5 f5;
   hidden 2 / id=h5;
   connect x5 h5;
   connect h5 f5;

   hidden 1 / id=f6 act=identity comb=lin nobias;
   connect f6 out;
   connect x6 f6;
   hidden 1 / id=h6;
   connect x6 h6;
   connect h6 f6;

   *hidden 1 / id=f7 act=identity comb=lin nobias;
   *connect f7 out;
   *connect x7 f7;
   *hidden 1 / id=h7;
   *connect x7 h7;
   *connect h7 f7;

   hidden 1 / id=f8 act=identity comb=lin nobias;
   connect f8 out;
   connect x8 f8;
   hidden 1 / id=h8;
   connect x8 h8;
   connect h8 f8;

   hidden 1 / id=f9 act=identity comb=lin nobias;
   connect f9 out;
   connect x9 f9;
   *hidden 1 / id=h9;
   *connect x9 h9;
   *connect h9 f9;

   hidden 1 / id=f10 act=identity comb=lin nobias;
   connect f10 out;
   connect x10 f10;
   *hidden 1 / id=h10;
   *connect x10 h10;
   *connect h10 f10;

   hidden 1 / id=f11 act=identity comb=lin nobias;
   connect f11 out;
   connect x11 f11;
   *hidden 1 / id=h11;
   *connect x11 h11;
   *connect h11 f11;

   *hidden 1 / id=f12 act=identity comb=lin nobias;
   *connect f12 out;
   *connect x12 f12;
   *hidden 1 / id=h12;
   *connect x12 h12;
   *connect h12 f12;

   hidden 1 / id=f13 act=identity comb=lin nobias;
   connect f13 out;
   connect x13 f13;
   *hidden 1 / id=h13;
   *connect x13 h13;
   *connect h13 f13;

   initial inest=start;
   freeze f1->out f2->out f3->out f4->out f5->out f6->out f7->out
          f8->out f9->out f10->out f11->out f12->out f13->out;
   train outfit=of;
   score data=neuralnt.housing nodmdb out=parres;
run;

proc sql;
   select _err_, _dfm_-8 as p,
          _err_+(calculated p)*log(_dft_) as sbc
   from of
   where _name_='OVERALL';
quit;

Since five inputs have been eliminated, only eight consolidation layers are actually frozen, so the model degrees of freedom must be corrected with _DFM_-8.

   Train: Error Function.           p           sbc
   --------------------------------------------------
                263.2337           24      412.6705

The reduced model has 24 parameters. Relative to the linear model, the deviance is reduced from 359 to 263 and the SBC from 446 to 413, a clear improvement.

Generalized Partial Residuals:

Transformed (link-scale) partial residuals:

$g_0^{-1}(y_i) - \hat w_0 - \sum_{m \ne j} \hat f_m(x_{im}) = g_0^{-1}(y_i) - g_0^{-1}(\hat y_i) + \hat f_j(x_{ij})$

Note that partial residuals were originally developed for linear regression. When the link $g_0^{-1}$ is nonlinear, there are two alternatives:

1. Link-transform the target, as above, or use a first-order approximation of the link-scale residual,
   $g_0^{-1}(y_i) - g_0^{-1}(\hat y_i) \approx \left.\dfrac{d\,g_0^{-1}(y)}{dy}\right|_{y=\hat y_i}(y_i - \hat y_i)$,
   giving the approximate partial residual
   $\left.\dfrac{d\,g_0^{-1}(y)}{dy}\right|_{y=\hat y_i}(y_i - \hat y_i) + \hat f_j(x_{ij})$.
   For logistic regression this yields
   $\dfrac{y_i - \hat y_i}{\hat y_i (1 - \hat y_i)} + \hat f_j(x_{ij})$.
   For a binary target the partial residuals have to be approximated in some other way, because the logit is infinite at zero and one, and the first-order approximation often produces wild outliers when the predicted probabilities $\hat y_i$ are close to zero or one.

2. Empirical partial residuals: The empirical partial residuals for the jth input are calculated by binning the input into $r = 1, 2, \ldots, R$ equally sized groups (quantiles) and transforming the sample proportions:

   $pr_{rj} = \ln\!\left(\dfrac{m_{1rj}}{m_{rj} - m_{1rj}}\right) - \ln\!\left(\dfrac{\hat y_{rj}}{1 - \hat y_{rj}}\right) + \hat f_{rj}(x_j)$

   where $m_{1rj}$ is the number of events, $m_{rj}$ the number of cases, $\hat y_{rj}$ the mean predicted probability, and $\hat f_{rj}(x_j)$ the mean fitted component in the rth bin.

Example 3.17 (GANN: Binary Target)

The BANK8DTR data set has a binary target. A GANN model is constructed in the same fashion as in the previous example. The empirical partial residuals require additional programming.

Initialize the data mining database:

proc dmdb batch data=neuralnt.bank8dtr dmdbcat=ctr out=dtr;
   var atmct adbdda ddatot ddadep income invest savbal atres;
   class acquire(desc);
run;

(Logit) linear model for initialization:

proc neural data=dtr dmdbcat=ctr graph;
   input atmct  / level=int id=x1;
   input adbdda / level=int id=x2;
   input ddatot / level=int id=x3;
   input ddadep / level=int id=x4;
   input income / level=int id=x5;
   input invest / level=int id=x6;
   input savbal / level=int id=x7;
   input atres  / level=int id=x8;
   target acquire / level=nom id=out;

   hidden 1 / id=f1 act=identity comb=lin nobias;
   connect f1 out;
   connect x1 f1;

   hidden 1 / id=f2 act=identity comb=lin nobias;
   connect f2 out;
   connect x2 f2;

   hidden 1 / id=f3 act=identity comb=lin nobias;
   connect f3 out;
   connect x3 f3;

   hidden 1 / id=f4 act=identity comb=lin nobias;
   connect f4 out;
   connect x4 f4;

   hidden 1 / id=f5 act=identity comb=lin nobias;
   connect f5 out;
   connect x5 f5;

   hidden 1 / id=f6 act=identity comb=lin nobias;
   connect f6 out;
   connect x6 f6;

   hidden 1 / id=f7 act=identity comb=lin nobias;
   connect f7 out;
   connect x7 f7;

   hidden 1 / id=f8 act=identity comb=lin nobias;
   connect f8 out;
   connect x8 f8;

   initial;
   freeze f1->out f2->out f3->out f4->out f5->out f6->out f7->out f8->out / value=1;
   train outest=start;
run;

GANN with 4 df per input:

proc neural data=dtr dmdbcat=ctr ranscale=.1 graph;
   input atmct  / level=int id=x1;
   input adbdda / level=int id=x2;
   input ddatot / level=int id=x3;
   input ddadep / level=int id=x4;
   input income / level=int id=x5;
   input invest / level=int id=x6;
   input savbal / level=int id=x7;
   input atres  / level=int id=x8;
   target acquire / level=nom id=out;

   hidden 1 / id=f1 act=identity comb=lin nobias;
   connect f1 out;
   connect x1 f1;
   hidden 1 / id=h1;
   connect x1 h1;
   connect h1 f1;

   hidden 1 / id=f2 act=identity comb=lin nobias;
   connect f2 out;
   connect x2 f2;
   hidden 1 / id=h2;
   connect x2 h2;
   connect h2 f2;

   hidden 1 / id=f3 act=identity comb=lin nobias;
   connect f3 out;
   connect x3 f3;
   hidden 1 / id=h3;
   connect x3 h3;
   connect h3 f3;

   hidden 1 / id=f4 act=identity comb=lin nobias;
   connect f4 out;
   connect x4 f4;
   hidden 1 / id=h4;
   connect x4 h4;
   connect h4 f4;

   hidden 1 / id=f5 act=identity comb=lin nobias;
   connect f5 out;
   connect x5 f5;
   hidden 1 / id=h5;
   connect x5 h5;
   connect h5 f5;

   hidden 1 / id=f6 act=identity comb=lin nobias;
   connect f6 out;
   connect x6 f6;
   hidden 1 / id=h6;
   connect x6 h6;
   connect h6 f6;

   hidden 1 / id=f7 act=identity comb=lin nobias;
   connect f7 out;
   connect x7 f7;
   hidden 1 / id=h7;
   connect x7 h7;
   connect h7 f7;
   hidden 1 / id=f8 act=identity comb=lin nobias;
   connect f8 out;
   connect x8 f8;
   hidden 1 / id=h8;
   connect x8 h8;
   connect h8 f8;

   initial inest=start;
   freeze f1->out f2->out f3->out f4->out f5->out f6->out f7->out f8->out;
   train;
   score data=neuralnt.bank8dtr nodmdb out=str;
run;

The empirical partial residual plots are created with a macro called PRLOGIT. The macro has two arguments: the name of the input variable and the name of its univariate function (consolidation layer unit).

%macro prlogit(var,f);
proc rank data=str groups=100 out=binned;
   var &var;
   ranks bin;
run;
proc sql;
   create table bins as
   select count(acquire) as tot,
          sum(acquire) as m,
          mean(p_acquire1) as p_acqui1,
          mean(&var) as &var,
          mean(&f) as &f,
          log((calculated m+1)/(calculated tot-calculated m+1)) as elogit,
          calculated elogit
             - log(calculated p_acqui1/(1-calculated p_acqui1))
             + calculated &f as parres
   from binned
   group by bin;
quit;
goptions reset=all;
proc gplot data=bins;
   symbol1 c=b v=circle i=none;
   symbol2 c=bl v=none i=join;
   plot parres*&var=1 &f*&var=2 / overlay frame;
run;
quit;
%mend;

PROC RANK with the GROUPS= option bins the input variable (&VAR) into 100 equal groups. If there are tied values, fewer than 100 groups might be created and the groups may be unequal. The output data set (BINNED) contains a bin identification variable (BIN) that ranges from 0 to 99. It also contains all the variables in the input data set, including the target, inputs, and predicted values. The PROC SQL step creates a data set (BINS) containing the mean input variable, the mean univariate function, and the empirical partial residual for each bin. The empirical logits are calculated as the logit of the proportion of events in each bin; the sum of ACQUIRE is the number of events. The value 1 is added to the counts to shrink the estimates and prevent problems with zeros. PROC GPLOT overlays the function and the partial residuals versus the mean input. I=JOIN can be used in the SYMBOL statement because the binned data are already sorted by the input variable. Each macro call produces a single empirical partial residual plot.

%prlogit(atmct ,f11);
%prlogit(adbdda,f21);
%prlogit(ddatot,f31);
%prlogit(ddadep,f41);
%prlogit(income,f51);
%prlogit(invest,f61);
%prlogit(savbal,f71);
%prlogit(atres ,f81);

[Empirical partial residual plots for the eight inputs appear here.]

Leverage and Transformations:

The empirical partial residual plots show poor fits for several of the inputs. The distributions of the inputs DDATOT, ADBDDA, DDADEP, and SAVBAL are highly skewed, and the fitted values appear to be overly sensitive to the few large values in the tails of the distributions. Power transformations are applied to DDATOT, ADBDDA, DDADEP, and SAVBAL to encourage the GANN to learn the variation in the center of the distributions. The cube-root transformation is used because of the positive skewness and the presence of many zero values. The choice of transformation is not as crucial here as it is with linear models, because the neural network can accommodate nonlinearity; the purpose of the transformation is merely to reduce data sparsity. For simplicity, the negative values of ADBDDA are truncated at zero.

data neuralnt.transtr;
   set neuralnt.bank8dtr;
   tddatot=ddatot**(1/3);
   tadbdda=(adbdda*(adbdda>=0))**(1/3);
   tddadep=ddadep**(1/3);
   tsavbal=savbal**(1/3);
run;

data neuralnt.transte;
   set neuralnt.bank8dte;
   tddatot=ddatot**(1/3);
   tadbdda=(adbdda*(adbdda>=0))**(1/3);
   tddadep=ddadep**(1/3);
   tsavbal=savbal**(1/3);
run;
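To see how much the cube root compresses the long right tails (illustrative arithmetic, not values from the data): a balance of 8,000 becomes $8000^{1/3} = 20$, while a balance of 1,000,000 becomes $1{,}000{,}000^{1/3} = 100$, so extreme values are pulled in toward the bulk of the data while the ordering of the cases is preserved.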
The addition of the new transformed inputs requires that the entire program be rerun.

proc dmdb batch data=neuralnt.transtr dmdbcat=ctr out=dtr;
   var atmct tadbdda tddatot tddadep income invest tsavbal atres;
   class acquire(desc);
run;

proc neural data=dtr dmdbcat=ctr graph;
   input atmct   / level=int id=x1;
   input tadbdda / level=int id=x2;
   input tddatot / level=int id=x3;
   input tddadep / level=int id=x4;
   input income  / level=int id=x5;
   input invest  / level=int id=x6;
   input tsavbal / level=int id=x7;
   input atres   / level=int id=x8;
   target acquire / level=nom id=out comb=lin;

   hidden 1 / id=f1 act=identity comb=lin nobias;
   connect f1 out;
   connect x1 f1;

   hidden 1 / id=f2 act=identity comb=lin nobias;
   connect f2 out;
   connect x2 f2;

   hidden 1 / id=f3 act=identity comb=lin nobias;
   connect f3 out;
   connect x3 f3;

   hidden 1 / id=f4 act=identity comb=lin nobias;
   connect f4 out;
   connect x4 f4;

   hidden 1 / id=f5 act=identity comb=lin nobias;
   connect f5 out;
   connect x5 f5;

   hidden 1 / id=f6 act=identity comb=lin nobias;
   connect f6 out;
   connect x6 f6;

   hidden 1 / id=f7 act=identity comb=lin nobias;
   connect f7 out;
   connect x7 f7;

   hidden 1 / id=f8 act=identity comb=lin nobias;
   connect f8 out;
   connect x8 f8;

   initial;
   freeze f1->out f2->out f3->out f4->out f5->out f6->out f7->out f8->out / value=1;
   train outest=start;
run;

The quasi-Newton algorithm is used to speed up training.

proc neural data=dtr dmdbcat=ctr ranscale=.1 graph;
   input atmct   / level=int id=x1;
   input tadbdda / level=int id=x2;
   input tddatot / level=int id=x3;
   input tddadep / level=int id=x4;
   input income  / level=int id=x5;
   input invest  / level=int id=x6;
   input tsavbal / level=int id=x7;
   input atres   / level=int id=x8;
   target acquire / level=nom id=out comb=lin;

   hidden 1 / id=f1 act=identity comb=lin nobias;
   connect f1 out;
   connect x1 f1;
   hidden 1 / id=h1;
   connect x1 h1;
   connect h1 f1;

   hidden 1 / id=f2 act=identity comb=lin nobias;
   connect f2 out;
   connect x2 f2;
   hidden 1 / id=h2;
   connect x2 h2;
   connect h2 f2;

   hidden 1 / id=f3 act=identity comb=lin nobias;
   connect f3 out;
   connect x3 f3;
   hidden 1 / id=h3;
   connect x3 h3;
   connect h3 f3;

   hidden 1 / id=f4 act=identity comb=lin nobias;
   connect f4 out;
   connect x4 f4;
   hidden 1 / id=h4;
   connect x4 h4;
   connect h4 f4;

   hidden 1 / id=f5 act=identity comb=lin nobias;
   connect f5 out;
   connect x5 f5;
   hidden 1 / id=h5;
   connect x5 h5;
   connect h5 f5;

   hidden 1 / id=f6 act=identity comb=lin nobias;
   connect f6 out;
   connect x6 f6;
   hidden 1 / id=h6;
   connect x6 h6;
   connect h6 f6;

   hidden 1 / id=f7 act=identity comb=lin nobias;
   connect f7 out;
   connect x7 f7;
   hidden 1 / id=h7;
   connect x7 h7;
   connect h7 f7;

   hidden 1 / id=f8 act=identity comb=lin nobias;
   connect f8 out;
   connect x8 f8;
   hidden 1 / id=h8;
   connect x8 h8;
   connect h8 f8;

   initial inest=start;
   freeze f1->out f2->out f3->out f4->out f5->out f6->out f7->out f8->out;
   train tech=quanew;
   score data=neuralnt.transtr nodmdb out=str;
   score data=neuralnt.transte nodmdb out=ste;
run;

%prlogit(atmct  ,f11);
%prlogit(tadbdda,f21);
%prlogit(tddatot,f31);
%prlogit(tddadep,f41);
%prlogit(income ,f51);
%prlogit(invest ,f61);
%prlogit(tsavbal,f71);
%prlogit(atres  ,f81);

[Empirical partial residual plots for the transformed inputs appear here.]