Chapter 3. Advanced Data Mining –Neural Networks
3.1 Supervised Prediction
A. Introduction
B. Univariate Function Estimation
C. Multivariate Function Estimation
3.2 Network Architecture
A. Multilayer Perceptrons
B. Link Functions
C. Normalized Radial Basis Functions
3.3 Training Model
A. Estimation
B. Optimization
C. Issues in Training Model
3.4 Model Complexity
A. Model Generalization
B. Architecture Selection
C. Pruning Inputs
D. Regularization
3.5 Predictive Data Mining
A. Overview
B. Generalized Additive Neural Networks
3.1 Supervised Prediction
A. Introduction
Data set:
• n cases (observations, examples, instances)
• A vector of input variables x1, x2, ..., xk
• A target variable y (response, outcome)
Supervised prediction: given a target variable y, build a model to
predict y using the inputs x1, x2, ..., xk.
Predictive data mining: predictive modeling methods applied to
large operational (mainly corporate) databases.
Types of targets:
1. Interval measurement scale (amounts of some quantity)
2. Binary response (indicator of an event)
3. Nominal measurement scale (classifications)
4. Ordinal measurement scale (grades)
What do we model? (See the summary formulas below.)
1. The expected amount of some quantity (interval measurement)
2. The probability of the event (binary response)
3. The probability of each class (nominal or ordinal measurement)
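Written out, the three cases correspond to estimating the following
quantities as functions of the inputs x = (x1, ..., xk). (The symbols
p, p_j, and c below are introduced only for this summary; they are not
notation from the notes.)

% interval target: the conditional mean of the target
E(y \mid x)
% binary target: the posterior probability of the event
p(x) = \Pr(y = 1 \mid x)
% nominal or ordinal target: the posterior probability of each of c classes
p_j(x) = \Pr(y = j \mid x), \qquad j = 1, \dots, c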
B. Univariate Function Estimation
Example 3.1 The EXPCAR data set contains simulated data relating a
single target, Y1, to a single input, X. The variable EY represents the
true functional relationship, given in the form

E(y) = θ1 / [ (θ3/(θ3 − θ4)) exp(θ4 x) + (θ1/θ2 − θ3/(θ3 − θ4)) exp(−θ3 x) ]
This functional form is generally unknown in applications.
Nonlinear Regression:
Reference
Seber, G.A.F. and Wild, C.J. (1989). Nonlinear Regression. New York:
John Wiley & Sons.
Main idea: assume that the functional form of the relationship
between y and x is known, up to a set of unknown parameters. These
unknown parameters can be estimated by fitting the model to the data.
The SAS NLIN procedure performs nonlinear regression.
Model 1: The model is correctly specified, i.e., we know that

E(y) = θ1 / [ (θ3/(θ3 − θ4)) exp(θ4 x) + (θ1/θ2 − θ3/(θ3 − θ4)) exp(−θ3 x) ]
proc nlin data=Neuralnt.expcar;
parms theta1=100
theta2=2
theta3=.5
theta4=-.01;
model y1=theta1/((theta3/(theta3-theta4))*exp(theta4*x)+(theta1/theta2-(theta3/(theta3-theta4)))*exp(-theta3*x));
output out=pred p=y1hat;
run;
The estimation procedure is iterative, so initial guesses for the
parameter estimates are required; they are supplied in the PARMS statement.
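When good starting values are hard to come by, PROC NLIN can also
evaluate a grid of candidate starting values listed in the PARMS
statement and begin the iterations from the best combination. A minimal
sketch (the grid bounds here are arbitrary, chosen only for illustration):

proc nlin data=Neuralnt.expcar;
/* a range for each parameter: PROC NLIN computes the residual sum of
   squares at every grid point and starts from the best-fitting one */
parms theta1=50 to 150 by 50
theta2=1 to 3 by 1
theta3=.1 to .9 by .4
theta4=-.05 to -.005 by .015;
model y1=theta1/((theta3/(theta3-theta4))*exp(theta4*x)+(theta1/theta2-(theta3/(theta3-theta4)))*exp(-theta3*x));
output out=pred p=y1hat;
run;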
The P= option in the OUTPUT statement names the variable that holds the
predicted values.
After fitting the model, we plot the predicted results:
goptions reset=all;
proc gplot data=pred;
symbol1 c=b v=none i=join;
symbol2 c=bl v=circle i=none;
symbol3 c=r v=none i=join;
legend1 across=3 position=(top inside right)
label=none mode=share;
plot ey*x=1 y1*x=2 y1hat*x=3 / frame overlay
legend=legend1;
run;
quit;
The plot shows that the model fits the data very well. This is to be
expected, since the model is correctly specified.
By default, PROC NLIN finds the least-squares estimates using the
Gauss-Newton method.
The least-squares estimates θ̂ = (θ̂1, θ̂2, θ̂3, θ̂4) satisfy

θ̂ = argmin_θ Σ_{i=1}^{n} ( y_i − ŷ_i(θ) )²
Note that the algorithm is sensitive to the initial guess; it may
diverge from a bad starting value.
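For context (this is standard material on the method, not something
stated in these notes), each Gauss-Newton iteration linearizes the model
around the current estimate and solves a linear least-squares problem
for the update step:

% J is the n x p Jacobian of the predicted values with respect to the
% parameters, evaluated at the current estimate theta^(t); r is the
% vector of current residuals y_i - yhat_i(theta^(t)).
\theta^{(t+1)} = \theta^{(t)} + \left( J^{\top} J \right)^{-1} J^{\top} r

When J^T J is nearly singular, this step can be far too large, which is
one way the iterations wander off and fail to converge from a poor
starting point.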
proc nlin data=Neuralnt.expcar;
parms theta1=10
theta2=.2
theta3=.05
theta4=-.001;
model y1=theta1/((theta3/(theta3-theta4))*exp(theta4*x)+(theta1/theta2-(theta3/(theta3-theta4)))*exp(-theta3*x));
output out=pred p=y1hat;
run;
goptions reset=all;
proc gplot data=pred;
symbol1 c=b v=none i=join;
symbol2 c=bl v=circle i=none;
symbol3 c=r v=none i=join;
legend1 across=3 position=(top inside right)
label=none mode=share;
plot ey*x=1 y1*x=2 y1hat*x=3 / frame overlay
legend=legend1;
run;
quit;
Clearly, this is not the right estimate.
The NLIN Procedure
Iterative Phase
Dependent Variable Y1
Method: Gauss-Newton

Iter      theta1     theta2     theta3     theta4    Sum of Squares
 88     -0.00848     1.3514    0.00315     0.5310        160349
 89     -0.00822     1.3591    0.00303     0.5295        159871
 90     -0.00791     1.3684    0.00289     0.5277        159633
 91     -0.00771     1.3741    0.00280     0.5266        159106
 92     -0.00746     1.3810    0.00269     0.5252        158662
 93     -0.00716     1.3895    0.00255     0.5236        158444
 94     -0.00697     1.3946    0.00247     0.5227        157959
 95     -0.00673     1.4009    0.00237     0.5215        157556
 96     -0.00643     1.4086    0.00225     0.5201        157371
 97     -0.00625     1.4133    0.00218     0.5192        156930
 98     -0.00602     1.4190    0.00208     0.5182        156572
 99     -0.00574     1.4260    0.00197     0.5169        156433
100     -0.00556     1.4303    0.00190     0.5162        156037
WARNING: Maximum number of iterations exceeded.
WARNING: PROC NLIN failed to converge.
Sometimes alternative algorithms do better. The Marquardt
(Levenberg-Marquardt) method is a modification of the Gauss-Newton
method designed for ill-conditioned problems. In this case it is able to
find the least-squares estimates in 12 iterations.
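As further background (again standard material rather than something
derived in these notes), one common form of the Marquardt modification
damps the Gauss-Newton step, which keeps it under control when J^T J is
ill-conditioned:

% lambda >= 0 is a damping parameter: it is increased when a step fails
% to reduce the sum of squares and decreased when a step succeeds;
% lambda = 0 recovers the plain Gauss-Newton step.
\theta^{(t+1)} = \theta^{(t)} + \left( J^{\top} J + \lambda\,\mathrm{diag}(J^{\top} J) \right)^{-1} J^{\top} r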
proc nlin data=Neuralnt.expcar method=marquardt;
parms theta1=10
theta2=.2
theta3=.05
theta4=-.001;
model y1=theta1/((theta3/(theta3-theta4))*exp(theta4*x)+(theta1/theta2-(theta3/(theta3-theta4)))*exp(-theta3*x));
output out=pred p=y1hat;
run;
Parametric Model Fitting:
Parametric regression models can be used even when the mathematical
mechanism of the data generation is unknown.
Empirical model building
If we look at the scatter plot carefully, we may find a trend: y is very
small when x is close to zero, increases quickly, and then decays back
toward zero as x increases.
This encourages us to try a parametric model with two parameters,
θ1 and θ2:

E(y) = θ1 x exp(−θ2 x)
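A quick check (a one-line calculus argument added here, not part of the
original notes) shows why this form matches the described trend: the
curve is zero at x = 0, rises, and then decays back toward zero.

% derivative of the empirical model with respect to x
\frac{d}{dx}\, \theta_1 x e^{-\theta_2 x} \;=\; \theta_1 e^{-\theta_2 x}\,(1 - \theta_2 x)
% For theta_1, theta_2 > 0 this is positive for x < 1/theta_2 and
% negative afterwards, so the curve peaks at x = 1/theta_2 and then
% decays toward zero as x increases.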
proc nlin data=Neuralnt.expcar;
parms theta1=1
theta2=.05;
model y1=theta1*x*exp(-theta2*x);
output out=pred p=y1hat;
run;
goptions reset=all;
proc gplot data=pred;
symbol1 c=b v=none i=join;
symbol2 c=bl v=circle i=none;
symbol3 c=r v=none i=join;
legend1 across=3 position=(top inside right)
label=none mode=share;
plot ey*x=1 y1*x=2 y1hat*x=3 / frame overlay
legend=legend1;
run;
quit;
This empirical model fits the data quite well.
Polynomial parametric models:
By approximation theory, any smooth function can be approximated
arbitrarily well by a polynomial of sufficiently high degree.
Consider a polynomial model
E(y) = β0 + Σ_{i=1}^{d} βi x^i ,
where d is the degree of the polynomial.
Polynomials are linear with respect to the regression parameters. The
least squares estimates have a closed-form solution and do not require
iterative algorithms. Polynomials can be fitted in a number of SAS
procedures. Alternatively, polynomials (up to degree 3) can be fitted
in GPLOT using the i=r<l|q|c> option on the SYMBOL statement.
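For example, a cubic fit to the EXPCAR data could be obtained with
PROC REG. This is only a sketch in the spirit of the earlier examples;
the derived variables x2 and x3 are created here just for the
illustration.

data cubic;
set Neuralnt.expcar;
x2=x**2;  /* quadratic term */
x3=x**3;  /* cubic term */
run;
proc reg data=cubic;
/* the model is linear in the parameters, so the least squares
   estimates are obtained in closed form, without iteration */
model y1=x x2 x3;
run;
quit;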
Linear regression:
Model: E(y) = β0 + β1 x
goptions reset=all;
proc gplot data=pred;
symbol1 c=b v=none i=join;
symbol2 c=bl v=circle i=none;
symbol3 c=r v=none i=rl;
legend1 across=3 position=(top inside right)
label=none mode=share;
plot ey*x=1 y1*x=2 y1*x=3 / frame overlay
legend=legend1;
run;
quit;
Cubic Regression:
Model: E(y) = β0 + β1 x + β2 x² + β3 x³
goptions reset=all;
proc gplot data=pred;
symbol1 c=b v=none i=join;
symbol2 c=bl v=circle i=none;
symbol3 c=r v=none i=rc;
legend1 across=3 position=(top inside right)
label=none mode=share;
plot ey*x=1 y1*x=2 y1*x=3 / frame overlay
legend=legend1;
run;
quit;
Apparently, low-order polynomials do not fit this data set adequately.
Instead of continuing to increase the degree, a better strategy is to use
modern smoothing methods.
Nonparametric Model Fitting:
A nonparametric smoother can fit data without having to specify a
parametric functional form. Three popular methods are:
• Loess - Each predicted value results from a separate (weighted)
polynomial regression on a subset of the data centered on that data
point (window).
• Kernel Regression - Each predicted value is a weighted average of
the data in a window around that data point (see the formula sketch
after this list).
• Smoothing Splines - Smoothing splines are made up of piecewise
cubic polynomials, joined continuously and smoothly at a set of knots.
A smoothing spline can be fitted using GPLOT with the option i=SM<nn>.
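For the kernel regression item above, a standard formulation (textbook
material, not taken from these notes) is the Nadaraya-Watson estimator,
which makes the idea of a weighted average in a window explicit:

% K is a kernel (weight) function and h > 0 is the bandwidth that
% controls the width of the window around the point x.
\hat{m}(x) \;=\; \frac{\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right) y_i}
{\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)}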
Reference:
1. Hastie, T.J. and Tibshirani, R.J. (1990). Generalized Additive
Models, New York: Chapman and Hall.
2. Fan, J. and Gijbels, I. (1996). Local Polynomial Modeling and its
Applications. New York: Chapman and Hall.
Smoothing Splines
Goal: to minimize

Σ_{i=1}^{n} ( y_i − f(x_i) )²  +  λ ∫ ( f″(x) )² dx ,

subject to the constraint that f(x), f′(x), and f″(x) are continuous at
each knot.
• A smoothing spline is over-parameterized, since there are more
parameters than data points, but it does not necessarily interpolate
(overfit) the data.
• The parameters are estimated using penalized least squares. The
penalty term favors curves where the average squared second derivative
is low. This discourages a bumpy fit.
• λ is the smoothing parameter. When it increases, the fit becomes
smoother and less flexible. For smoothing splines in GPLOT, the amount
of smoothing is specified by the constant after SM in the SYMBOL
statement; this constant ranges from 0 to 99. Incorporating
(roughness/complexity) penalties into the estimation criterion is
called regularization.
• A smoothing spline is actually a parametric model, but the
interpretation of the individual parameters is uninformative.
Let’s fit this data using some smoothing splines.
goptions reset=all;
proc gplot data=Neuralnt.expcar;
symbol1 c=b v=none i=join;
symbol2 c=bl v=circle i=none;
symbol3 c=r v=none i=sm0;
title1 h=1.5 c=r j=right 'Smoothing Spline, SM=0';
plot ey*x=1 y1*x=2 y1*x=3 / frame overlay;
run;
quit;
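For comparison, a sketch with a larger smoothing constant (SM=70 here,
an arbitrary choice for illustration) shows how increasing the smoothing
parameter produces a smoother, less flexible fit:

goptions reset=all;
proc gplot data=Neuralnt.expcar;
symbol1 c=b v=none i=join;      /* true function EY */
symbol2 c=bl v=circle i=none;   /* observed data */
symbol3 c=r v=none i=sm70;      /* heavily smoothed spline */
title1 h=1.5 c=r j=right 'Smoothing Spline, SM=70';
plot ey*x=1 y1*x=2 y1*x=3 / frame overlay;
run;
quit;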
C. Multivariate Function Estimation
Most of the modeling methods devised for supervised prediction
problems are multivariate function estimators. The expected value of the
target can be thought of as a surface in the space spanned by the input
variables.
• It is uncommon to use parametric nonlinear regression for more than
one, or possibly two, inputs.
• Smoothing splines and loess suffer because of the relative sparseness
of data in higher dimensions (a rough arithmetic illustration follows
below).
Multivariate function estimation is a challenging analytical task!
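To make the sparseness point concrete (an illustrative calculation added
here, not from the original notes): if each input is divided into 10
intervals, the number of cells needed to cover the input space grows
exponentially with the number of inputs k,

% cells needed when each of k inputs is split into 10 intervals
10^{k}: \quad 10 \ (k=1), \qquad 100 \ (k=2), \qquad 10^{10} \ (k=10),

so with a few hundred cases most cells in even a moderately
high-dimensional input space contain no data at all, and window-based
smoothers have little or nothing to average over.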
Some notable works on multivariate function estimation:
• Friedman, J.H. and Stuetzle, W. (1981). "Projection Pursuit
Regression", Journal of the American Statistical Association, 76,
817-823.
• Friedman, J.H. (1991). "Multivariate Adaptive Regression Splines",
Annals of Statistics, 19, 1-141.
Example 3.2. The ECC2D data set has 500 cases, two inputs X1 and X2, and
an interval-scaled target Y1. The ECC2DTE data set has an additional
3321 cases. The variable Y1 in the ECC2DTE set is the true, usually
unknown, value of the target. ECC2DTE serves as a test set; the fitted
values and residuals can be plotted and evaluated.
The true surface is shown in the following 3-D surface plot.
[Figure: true surface of Y1 over the (X1, X2) plane]
Method 1: E(y) = β0 + β1 x1 + β2 x2
proc reg data=neuralnt.ecc2d outest=betas;
model y1=x1 x2;
quit;
proc score data=neuralnt.ecc2dte type=parms scores=betas
out=ste;
var x1 x2;
run;
data ste;
set ste;
r_y1=model1-y1;
run;
proc g3d data=ste;
plot x1*x2=model1 / rotate=315;
plot x1*x2=r_y1
/ rotate=315 zmin=-50 zmax=50
zticknum=5;
run;
quit;
Method 2: E(y) = β0 + β1 x1 + β2 x2 + β11 x1² + β22 x2² + β12 x1 x2
+ β111 x1³ + β222 x2³ + β112 x1² x2 + β221 x1 x2²
data a;
set neuralnt.ecc2d;
x11=x1**2;
x22=x2**2;
x12=x1*x2;
x111=x1**3;
x222=x2**3;
x112=x1**2*x2;
x221=x2**2*x1;
run;
data te;
set neuralnt.ecc2dte;
x11=x1**2;
x22=x2**2;
x12=x1*x2;
x111=x1**3;
x222=x2**3;
x112=x1**2*x2;
x221=x2**2*x1;
run;
proc reg data=a outest=betas;
model y1=x1 x2 x11 x22 x12 x111 x222 x112 x221;
quit;
proc score data=te type=parms scores=betas out=ste;
var x1 x2 x11 x22 x12 x111 x222 x112 x221;
run;
data ste;
set ste;
r_y1=model1-y1;
run;
proc g3d data=ste;
plot x1*x2=model1 / rotate=315;
plot x1*x2=r_y1
/ rotate=315 zmin=-50 zmax=50
zticknum=5;
run;
quit;
Apparently, these multivariate polynomials also lack fit!
Method 3: Neural Network Modeling
1. Input Data Source node (top):
• Select the ECC2D data
• In the Variables tab, change the model role of Y1 to target
• Change the model role of all other variables except X1 and X2
to rejected
2. Input Data Source node (bottom):
• Select the ECC2DTE data
• In the Data tab, set the role to score
3. Score node:
• Select the option to apply the training data score code to the score
data set
4. SAS Code node:
• Type in the following program (a residual-plot variant is sketched
after this list):
proc g3d data=&_score;
plot x1*x2=p_y1 /rotate=315;
run;
5. Run the flow from the Score node.
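Because the score data set ECC2DTE also contains the true target Y1, the
same SAS Code node could additionally plot the residual surface, in the
spirit of the residual plots from Methods 1 and 2. A sketch (the
prediction variable p_y1 comes from the code in step 4; the residual
variable r_y1 is created here only for this illustration):

data resid;
set &_score;
r_y1=p_y1-y1;  /* neural network prediction minus true target */
run;
proc g3d data=resid;
plot x1*x2=r_y1 / rotate=315 zmin=-50 zmax=50 zticknum=5;
run;
quit;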