Abstract

The aim of this master thesis is to predict the outcome of a metric K which describes the usage of Scania vehicles on different roads. This metric is of great interest for the company and it is used during the process of developing vehicle components. Throughout this work we discuss two well-known supervised learning methods, decision trees and neural networks, which enable us to build the predictive models. The data set used consists of approximately 30,000 vehicles, and it is based on a set of features which, on theoretical grounds and according to expert opinion within Scania, were considered to contain relevant information and to be related to the output metric K. The selected data set represents the largest product segment in Scania, long haulage vehicles.

CART (Classification and Regression Trees) and CHAID (Chi-squared Automatic Interaction Detection) regression trees of different sizes were first fitted, given their simplicity and predictive power. However, evaluation of the performance of these algorithms, based on the Nash-Sutcliffe efficiency measure (0.65 and 0.61 for the CART and CHAID tree respectively), demonstrates that the tree methods were not able to extract the patterns and relationships present in the data. Finally, knowing that given enough data, hidden units, and training time, a feedforward multilayer perceptron (MLP) can learn to approximate virtually any function to any degree of accuracy, an MLP neural network model with one hidden layer and four neurons was fitted. A Nash-Sutcliffe efficiency of 0.86 shows that the predictive results obtained with the selected network were more accurate than those acquired with the regression tree methods. Predicted values for the fraction of the data set that did not contain the metric K as the target value were also obtained, and the results showed that it is possible to rely on the predictive power of the neural network model for further analysis, including other groups of vehicles built by Scania for different purposes.
Acknowledgements
I would first like to express my deep appreciation and sincere gratitude to
Professor Anders Grimvall for encouraging me and giving me the opportunity to
take part in the Master's Programme in Statistics, Data Analysis and Knowledge
Discovery, and especially thank him for recommending me to Scania. Thank you
Anders for your guidance, for sharing your knowledge with me, for being so
patient, for always trying to explain everything really clearly and carefully during
our lessons and our consulting sessions. Your pedagogical spirit and your never
ending stream of ideas have always inspired me and motivated me during my
studies and work.
I would also like to thank everyone who in one way or another is involved in the
success of this thesis work. I thank Scania for its permission to carry out this
project and for making the data that has been analyzed available. Ann Lindqvist,
my supervisor at Scania, who gave me the opportunity and confidence to work
with this challenging project. Thank you Ann for helping within Scania to obtain
all the necessary knowledge and for introducing me to all the people that in
different ways contributed to improving the quality of my work. I also want to
thank you for your friendship, for making sure I would always find my way in
Scania and in Södertälje, and for sharing time with me after a hard working day.
Thank you for all the help you offered me inside and outside the office.
Thank you Klas Levin, Mikael Curbo, Anders Forsen, Erik Landström and all of
you at Scania who were always interested in my work, offering me valuable
advice and helping me during the development of my thesis.
Special thanks also go to Anders Noorgard, my supervisor at Linköping University, who gave me valuable tips about the project work and report writing, and Oleg Sysoev for taking the time to review my work and for sharing his knowledge about machine learning.
Last, but not least, I especially want to thank my beloved family: my mother
Juneida Sánchez, for your endless love, your blessings and your wishes for the
successful culmination of all my projects. My sister Marbella Covino, for
believing in me and for always being there for me. My boyfriend Karl Aronsson,
for all your love and support, for always encouraging me, for being next to me
during happy and difficult times, for always making me laugh, and for showing
me the positive side of every situation. My Swedish family, the Aronsson family,
because you have made me feel this is my second home, thank you for supporting
me and helping me ever since I decided to move to Sweden.
Table of contents

1 Introduction
  1.1 Scania
  1.2 Background
  1.3 Objective
2 Data
3 Methodology
  3.1 Methodology step by step
  3.2 Supervised learning methods
    3.2.1 Decision Trees
      3.2.1.1 CHAID
      3.2.1.2 CART
    3.2.2 Neural Networks
  3.3 Approximation efficiency
4 Results
  4.1 Decision Trees
  4.2 Neural Networks
  4.3 Scoring Process
5 Discussion and conclusions
6 Literature
7 Appendix
1 Introduction
1.1 Scania
Scania is one of the world’s leading manufacturers of trucks and buses for heavy
transport applications. The company operates in about 100 countries and employs almost 33,000 people. Scania's objective is to deliver optimized heavy trucks and buses, engines and services, provide the best total operating economy for its customers, and thereby be the leading company in the industry. Research and development are concentrated in Södertälje, Sweden, and production units are
located in Europe and Latin America. This master thesis has been carried out at
RESD, the department responsible for diagnostic protocols. Software modules for
diagnostic communication between electrical control unit systems and external
tools are developed in this department, as well as off board systems for remotely
retrieving and analyzing diagnostic data. (Scania Inline, 2010)
1.2 Background
The electrical system in Scania vehicles is based on a number of control units that
communicate with each other via a common network based on serial
communication. Scania’s serial communication is based on the CAN protocol. The
principal features of a CAN bus system are control and interaction. At the heart of
the Scania’s CAN bus is a central control unit (coordinator) through which all
functions are monitored and managed. From here, the truck’s electrical functions
are arranged in three circuits: red, yellow and green. Red functions cover all main
management units: engine, gearbox, brakes and suspension. Yellow covers
instruments, bodywork systems, locking and alarm systems, and lights. Green
covers comfort systems, such as climate control, audio and informatics.
Figure 1. Vehicle Applications of Controller Area Network (CAN).
All control units found in the Scania electrical system can be checked with a plug-in diagnostic software tool (SDP3) that is used by Scania's workshops, among other purposes, to decipher and interpret the operational data. Data about the operation of the
vehicles stored in the control units is read with SDP3 and sent via the Internet to
Scania’s servers in Södertälje for analysis. Only authorized dealer workshops and
distributors have the necessary identities and access rights to collect, use, and
transfer operational data.
Figure 2. Operational data collection system.
A huge amount of operational data has been gathered and analyzed to understand vehicle usage, for example how the accelerator pedal and vehicle momentum are utilized in varying topography. The frequency and harshness of brake applications, the efficient use of the auxiliary brake system (Scania retarder and exhaust brake), and the matching of gear selection to engine revolutions have also been evaluated from the collected data.
Figure 3. Histogram of operational data collected from 2006 to 2010.
A metric used in the company to describe the usage of Scania vehicles due to the
road conditions and the driving needs (starts and stops, accelerations, etc.), and
which from now on we will call K, is calculated by using data collected from one
of the control units currently installed in the electrical system of the new
generation Scania vehicles. However, this value cannot be estimated for those
vehicles that are not equipped with the required control unit. Hence, it is of interest
to build a predictive model to estimate the values of K by making use of the data
available for all vehicles.
Given the nature of the problem we have decided to carefully select a set of data
consisting of a group of variables which are believed to contain potentially
predictive relationships with the variable K. Afterwards, different algorithms can
be implemented to capture the patterns and relations found in the data, and
generalize to unseen situations in a reasonable way. These algorithms, also called
supervised learning methods, apply various mechanisms capable of inducing
knowledge from examples of data.
1.3 Objective
The aim of this thesis work is to build a predictive model, by making use of
supervised learning methods, which could accurately predict the outcome of a
metric that describes the usage of Scania vehicles due to the road conditions and
the driving needs.
2 Data
Our analysis will be concentrated on the segment of long haulage vehicles, in which Scania has had a strong market presence for many years. The
selection of the data is based on an assortment of physical components in order to
obtain a group of vehicles that are mostly dedicated to this specific product
segment. The selected group represents 78% of the total operational data collected
in the company.
As illustrated in Figure 4, approximately 43% of these vehicles are not equipped
with the required control unit from which the necessary data for the calculation of
the metric K is collected. Thus, only 35% of the vehicles are selected for building
the predictive model and the remaining data will be used during the scoring
process. Some potential predictor variables are selected on theoretical grounds and based on expert opinions. In addition, the corresponding values of K for this 35% of the data are calculated from a sequence of measurements made by specialists at Scania, and they are based on a series of studies.
Figure 4. Pie chart of operational data: long haulage vehicles with the required control units (35%), long haulage vehicles without them (43%), and vehicles built for other purposes, with and without the required control units (the remaining 22%).
Once the selected data had been extracted from the different databases, it was
finally integrated into one data set consisting of approximately 30,000
observations. After removing input variables that had low or no predictive power,
the input data set was represented by four variables.
The first variable corresponds to an 11×12 matrix called L. The second and third
variables represent two vectors of 10 positions each, called S and G. The variables
L, S and G implicitly contain information of the usage of Scania vehicles. The last
variable is called E and it corresponds to the different categories for one of the
vehicle components. All input variables excluding E are represented by continuous
values, whereas the variable E contains nominal values.
Afterwards, for simplicity and to reduce the size of the data, we performed transformations of the raw data to create new input variables. We calculated averaged values of the vectors S and G. However, it was not possible to make this
estimation for the values in the matrix L due to the importance of the information
contained in each of its positions; every position in the matrix is crucial for the
pattern recognition process. Hence, the matrix was just reorganized into a feature
vector of 132 positions for possible handling of the variable by the predictive
methods. As there was no need to transform the output variable K, this variable
was used in its raw form.
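As an illustration of this preprocessing, the sketch below (Python with NumPy, not the software actually used in the thesis) shows how one raw record could be turned into model inputs; the function and variable names are hypothetical.

    import numpy as np

    def build_features(L_matrix, S_vector, G_vector):
        """Illustrative preprocessing of one vehicle record (hypothetical names).

        L_matrix : 11x12 array whose individual positions are all kept,
                   so it is only flattened into a 132-element feature vector.
        S_vector, G_vector : 10-element vectors reduced to their averages.
        """
        L_features = np.asarray(L_matrix).reshape(-1)   # 132 positions, none discarded
        S_mean = float(np.mean(S_vector))                # single averaged value
        G_mean = float(np.mean(G_vector))                # single averaged value
        return np.concatenate([L_features, [S_mean, G_mean]])

    # Example with random data of the right shapes:
    rng = np.random.default_rng(0)
    x = build_features(rng.random((11, 12)), rng.random(10), rng.random(10))
    print(x.shape)   # (134,) -- 132 matrix positions plus the two averages

A nominal variable such as E is handled directly by tools like SAS Enterprise Miner, but would typically be dummy-encoded before being passed to a purely numeric learner.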
A quantitative analysis of the data set is given by the descriptive statistics of the
input and output variables, shown by the color maps in Figures 5-8, the Tables 1
and 2, and the histograms in Figures 9-11. They provide simple summaries of the data set being analyzed.
Figure 5. Minimum values for the input variable Matrix L.
Figure 6. Maximum values for the input variable Matrix L.
Figure 7. Mean values for the input variable Matrix L.
Figure 8. Median values for the input variable Matrix L.
Table 1. Descriptive statistics for the input variables S and G, and for the output variable K.

Variable    Min       Max       Mean      Q1        Q3        Median
S            1.527    82.218    54.538    48.826    62.869    57.240
G            2.000    86.774    35.238    29.023    40.250    35.278
K           17.090    79.310    34.564    30.750    37.460    33.560
Table 2. Counts for the input variable E.

Variable   Count     Variable   Count     Variable   Count
Total E    28883     E15           30     E31           52
E01          162     E16         1175     E32            8
E02           13     E17           22     E33            4
E03          627     E18          297     E34            1
E04           86     E19         1096     E35            7
E05          342     E20          480     E36           13
E06         1860     E21         1389     E37            1
E07         2382     E22          529     E38           96
E08         3470     E23         1129     E39          700
E09           20     E24          149     E40           47
E10          507     E25          146     E41          504
E11            2     E26          813     E42         5701
E12          391     E27           69     E43           42
E13           10     E28           32     E44           56
E14          616     E29            8     E45         3786
                     E30           12     E46            1
In addition, as a summary of the frequency of the continuous input and output variables, the histogram plots in Figures 9 through 11 were also obtained:
Figure 9. Histogram of the input variable S.
Figure 10. Histogram of the input variable G.
Figure 11. Histogram of the output variable K.
Further information about how each of the chosen supervised learning methods
interprets and utilizes the selected variables when building the predictive models is
given in detail in the results chapter.
3 Methodology
A sequence of steps was followed during the development of the project in order to successfully reach the main objective of this master thesis: to build a predictive model, by making use of supervised learning methods, which could accurately predict the outcome of a metric that describes the usage of Scania vehicles due to the road conditions and the driving needs.
3.1 Methodology step by step:
1. First, we gathered the training set of data which needed to be characteristic
of the real-world use of the function to be learned. Thus, approximately
30,000 observations were collected, characterized by a set of input
variables which implicitly contained descriptive information of the usage of
the vehicles, and that were considered to have enough predictive power to
be able to estimate the values of the output variable K. The corresponding
values of K were also collected for each observation from a sequence of
measurements made by specialists in Scania.
2. Second, we determined the input feature representation of the function.
During this step, the input variables were reorganized or transformed into
suitable values for the predictive methods. Thus, matrices were reorganized
into vectors where all positions were kept, and vectors were transformed
into single averaged values. The number of features should not be too large,
because of the curse of dimensionality; but should be large enough to
accurately predict the output. The output variable was used in its raw form.
3. Third, we carried out graphical representations of the data, and analysis of
the descriptive statistics which were useful for detecting spurious
observations. Inconsistent records were eliminated, thus increasing the
quality of the data.
4. Subsequently, we selected two supervised learning methods which were
thought to be appropriate for the given problem and data at hand, decision
trees and neural networks.
5. The selected predictive methods required partitioning the dataset into
training and validation sets. The training set teaches the model, and the
validation set measures and assesses the model performance and reliability
for applying the model to future unseen data. The validation process avoids
the over-fitting problem by validating the model on a different set of data.
Our model data set was split into a training data set and a validation data set, 70% and 30% respectively (a minimal sketch of such a split is shown after this list), in order to create a large enough validation
data set. A validation data set that is too small might lead to erroneous
conclusions when evaluating the reliability of the model.
6. We completed the design by running the learning algorithms on the
gathered training set. Parameters of the algorithms were adjusted to
optimize the performance on a subset (validation set) of data. A manual
forward selection method was also implemented during this step for
selecting the combination of input variables that increased the predictive
power of the learning methods.
7. Finally, we assessed the performance of the chosen learning algorithms
based on the produced average squared error, and compared the efficiency
of the predictions obtained based on the Nash-Sutcliffe efficiency measure.
The best predictive method was selected and applied to a new set of data in order to compute the corresponding values of K.
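To make step 5 of the list concrete, the sketch below performs a 70/30 split in Python with scikit-learn on synthetic stand-in data; the thesis itself used the partitioning facilities of SAS Enterprise Miner, and the arrays here are hypothetical.

    import numpy as np
    from sklearn.model_selection import train_test_split

    # Synthetic stand-ins for the real data: 30,000 rows of inputs and the metric K.
    rng = np.random.default_rng(0)
    X = rng.random((30_000, 134))          # hypothetical feature matrix
    y = rng.uniform(17, 80, size=30_000)   # hypothetical K values

    # 70% of the rows teach the model; the held-out 30% assess its reliability.
    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, test_size=0.30, random_state=42)
    print(len(X_train), len(X_valid))      # 21000 9000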
3.2 Supervised learning methods:
In a typical scenario for supervised learning methods, we have an outcome
measurement, usually quantitative or categorical, that we wish to predict based on
a set of features. We also have a training set of data, in which we observe the
outcome and feature measurements for a set of objects. Using this data we build a
prediction model, or learner, which will enable us to predict the outcome for new
unseen objects. A good learner is one that accurately predicts such an outcome.
A supervised learning method is a machine learning technique for deducing a function from training data. The function-fitting paradigm from a machine learning point of view is as follows. Suppose for simplicity that the errors are additive and that the model $Y = f(X) + \varepsilon$ is a reasonable assumption. Supervised learning attempts to learn $f$ by example through a teacher. One observes the system under study, both the inputs and the outputs, and assembles a training set of observations $\mathcal{T} = \{(x_i, y_i)\}$, $i = 1,\dots,N$. The observed input values $x_i$ to the system are also fed into an artificial system, known as a learning algorithm, which also produces outputs $\hat{f}(x_i)$ in response to the inputs. The learning algorithm has the property that it can modify its input/output relationship $\hat{f}$ in response to the differences $y_i - \hat{f}(x_i)$ between the original and generated outputs. This process is known as learning by example. Upon completion of the learning process the hope is that the artificial and real outputs will be close enough to be useful for all sets of inputs likely to be encountered in practice. (Hastie et al., 2001)
3.2.1 Decision Trees:
Decision trees belong to a class of data mining techniques that break up a
collection of heterogeneous records into smaller groups of more homogeneous
records using directed knowledge discovery. Directed knowledge discovery is goal-oriented: it explains the target field in terms of the input fields in order to find meaningful patterns and predict future events using a chain of decision rules. In this way, decision trees provide accurate and explanatory models, where the decision tree model is able to explain the reason for certain decisions using these decision rules. Decision trees can be used in classification problems and also in estimation problems where the output is a continuous value; in the latter case the tree is called a regression tree. (Abdullah, 2010)
For a tree to be useful, the data in the leaves (the final groups or unsplit nodes)
must be similar with respect to some target measure, so that the tree represents the
segregation of a mixture of data into purified groups. (Neville, 1999). The general
form of this modeling approach is illustrated in Figure 12.
Decision trees attempt to find a strong relationship between input values and target
values in a group of observations that form a data set. When a set of input values is
identified as having a strong relationship with a target value, then all of these
values are grouped in a bin that becomes a branch on the decision tree. These
groupings are determined by the observed form of the relationship between the bin
values and the target. Binning involves taking each input, determining how the
values in the input are related to the target, and, based on the input-target
relationship, depositing inputs with similar values into bins that are formed by the
relationship. A strong input-target relationship is formed when knowledge of the
value of an input improves the ability to predict the value of the target. (De Ville,
2006)
Figure 12. Illustration of a decision tree.
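To make the split-search idea concrete, the following minimal Python sketch (not code from the thesis, which used SAS Enterprise Miner) finds the single threshold on one continuous input that most reduces the squared error of the two resulting groups; the data are synthetic and the names are hypothetical.

    import numpy as np

    def best_split(x, y):
        """Return the threshold on input x that minimizes the summed squared error
        of the two resulting child nodes (the criterion used by regression trees)."""
        order = np.argsort(x)
        x_sorted, y_sorted = x[order], y[order]
        best_threshold, best_sse = None, np.inf
        for i in range(1, len(x_sorted)):
            left, right = y_sorted[:i], y_sorted[i:]
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if sse < best_sse:
                best_sse, best_threshold = sse, (x_sorted[i - 1] + x_sorted[i]) / 2
        return best_threshold, best_sse

    rng = np.random.default_rng(1)
    x = rng.uniform(0, 10, 500)
    y = np.where(x < 4, 20, 40) + rng.normal(0, 2, 500)   # a step-shaped target
    print(best_split(x, y))   # threshold close to 4

A full tree-growing algorithm simply repeats this kind of search over all inputs and then recurses into the resulting child nodes.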
Decision trees have many useful features, both in the traditional fields of science
and engineering and in a range of applied areas, including business intelligence
and data mining. These useful features include (De Ville, 2006):
Decision trees produce results that communicate very well in symbolic and
visual terms. Decision trees are easy to produce, easy to understand, and
easy to use. One valuable feature is the ability to incorporate multiple
predictors in a simple, step-by-step fashion. The ability to incrementally
build highly complex rule sets (which are built on simple, single association
rules) is both simple and powerful.
Decision trees readily incorporate various levels of measurement (nominal,
ordinal, and interval), regardless of whether a variable serves as the target or as an
input.
Decision trees readily adapt to various twists and turns in data (unbalanced
effects, nested effects, offsetting effects, interactions and nonlinearities)
that frequently defeat other one-way and multi-way statistical and numeric
approaches.
Trees require little data preparation and perform well with large data in a
short time.
Decision trees are nonparametric and highly robust (for example, they
readily accommodate the incorporation of missing values) and produce
similar effects regardless of the level of measurement of the fields that are
used to construct decision tree branches (for example, a decision tree of
income distribution will reveal similar results regardless of whether income
is measured in thousands, in tens of thousands, or even as a discrete range
of values from 1 to 5).
Trees also have their shortcomings (Neville, 1999):
When the data contain no simple relationship between the inputs and the
target, a tree of limited complexity is too simplistic. Even when a simple description
is accurate, the description may not be the only accurate one.
A tree gives an impression that certain inputs uniquely explain the
variations in the target. A completely different set of inputs might give a
different explanation that is just as good.
Trees may deceive; they may fit the data well but then predict new data
worse than having no model at all. This is called over-fitting. They may fit
the data well, predict well, and convey a good story, but then, if some of the
original data are replaced with a fresh sample and a new tree is created, a
completely different tree may emerge using completely different inputs in
the splitting rules and consequently conveying a completely different story.
Specific decision tree methods include the CHAID (Chi-squared Automatic Interaction Detection) and CART (Classification and Regression Trees)
algorithms. The following discussion provides a brief description of these
algorithms for building decision trees.
3.2.1.1 CHAID
CHAID is an acronym for “Chi-Squared Automatic Interaction Detection”. This
algorithm accepts either nominal or ordinal inputs; however, some software
packages, such as SAS Business Analytics and Business Intelligence Software, accept
interval inputs and automatically group the values into ranges before growing the
tree.
The splitting criterion is based on P-values from the F-distribution (interval
targets) or Chi-squared distribution (nominal targets). The P-values are adjusted to
accommodate multiple testing.
Missing values are treated as separate values. For nominal inputs, a missing value
constitutes a new category. For ordinal inputs, a missing value is free of any order
restrictions.
The search for a split on an input proceeds stepwise. Initially, a branch is allocated
for each value of the input. Branches are alternately merged and re-split as seems
warranted by the P-values. The algorithm stops when no merge or re-splitting
operation creates an adequate P-value. The final split is adopted. A common
alternative, sometimes called the exhaustive method, continues merging to a
binary split and then adopts the split with the most favorable P-value among all
splits the algorithm considered.
The tests of significance are used to determine whether inputs are significant
descriptors of the target values and, if so, what their strengths are relative to other
inputs. Thus, after a split is adopted for an input, its P-value is adjusted, and the
input with the best adjusted P-value is selected as the splitting variable.
If the adjusted P-value is smaller than a specified threshold, then the node is split.
Tree construction ends when all the adjusted P-values of the splitting variables in
the unsplit nodes are above the user-specified threshold. (SAS Enterprise Miner
Tutorial, 2010)
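As a rough illustration of the p-value criterion for an interval target, the sketch below applies a one-way ANOVA F-test to a hypothetical candidate grouping of observations into three branches; this is only the unadjusted test on synthetic data, not the multiplicity-adjusted procedure that SAS applies.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    # Hypothetical interval target values falling into three candidate branches.
    branch_a = rng.normal(30, 3, 200)
    branch_b = rng.normal(33, 3, 150)
    branch_c = rng.normal(38, 3, 120)

    # One-way ANOVA: does branch membership explain variation in the target?
    f_stat, p_value = stats.f_oneway(branch_a, branch_b, branch_c)
    print(f"F = {f_stat:.1f}, p = {p_value:.3g}")
    # A CHAID-style procedure would compare an adjusted version of this p-value
    # with a threshold to decide whether to adopt the split or merge branches.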
3.2.1.2 CART
The following is a description of the Breiman, Friedman, Olshen, and Stone
Classification and Regression Trees method for building decision trees. More
detailed information can be found in the following text: Breiman, L., Friedman,
J.H., Olshen, R. A., and Stone, C. J. (1984), Classification and Regression Trees,
Pacific Grove: Wadsworth.
For this method, the inputs are either nominal or interval. Ordinal inputs are
treated as interval. CART trees employ a binary splitting methodology, which produces binary decision trees; they do not embrace the kind of merge-and-split heuristic developed in the CHAID algorithm to grow multi-way splits, so multi-way splits are not included in this approach. Classification and Regression Trees do not use the statistical hypothesis testing approach proposed in the CHAID algorithm; instead, they rely on the empirical properties of a validation or resample data set to guard against overfitting. (De Ville, 2006)
The full methodology for growing and pruning branches in CART trees includes
the following (De Ville, 2006; SAS Enterprise Miner Tutorial, 2010):
For a continuous response field, both least squares and least absolute
deviation measures can be employed. Deviations between training and test
measures can be used to assess when the error rate has reached a point that
justifies pruning the subtree below the error-calculation point.
For a categorical-dependent response field, it is possible to use either the
Gini diversity measure or Twoing criteria.
Ordered Twoing is a criterion for splitting ordinal target fields.
Calculating misclassification costs of smaller decision trees is possible.
Selecting the decision tree with the lowest or near-lowest cost is an option.
Costs can be adjusted.
Picking the smallest decision tree within one standard error of the lowest
cost decision tree is an option.
In addition to a validated decision tree structure, CART trees also:
work with both continuous and categorical response variables.
omit observations with a missing value in the splitting variable when
creating a split.
create surrogate splits and use them to assign observations to branches
when the primary splitting variable is missing. If missing values prevent the
use of the primary and surrogate splitting rules, then the observation is
assigned to the largest branch (based on the within-node training sample).
grow a larger-than-optimal decision tree and then prune it to a final
decision tree using a variety of pruning rules.
consider misclassification costs in the desirability of a split.
use cost-complexity rules in the desirability of a split.
split on linear and multiple linear combinations.
do subsampling with large data sets.
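As an illustration of this style of tree, the sketch below fits a binary regression tree with cost-complexity pruning using scikit-learn's DecisionTreeRegressor on synthetic data; it approximates CART behaviour only (scikit-learn offers no surrogate rules or multi-way splits) and is not the SAS implementation used in the thesis.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.random((5_000, 10))
    y = 30 + 10 * X[:, 0] - 5 * X[:, 1] + rng.normal(0, 1, 5_000)   # synthetic target

    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

    # Grow a large binary tree, then prune it with a cost-complexity penalty (ccp_alpha).
    tree = DecisionTreeRegressor(ccp_alpha=0.01, random_state=0).fit(X_tr, y_tr)
    ase_valid = np.mean((tree.predict(X_va) - y_va) ** 2)
    print(tree.get_n_leaves(), round(ase_valid, 2))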
3.2.2 Neural Networks
Neural networks (NNs) form a joint framework for regression and classification
that has become widely used during the past decades, traditionally associated with
machine learning and data mining. Because of their ability to approximate any
dataset, NNs are sometimes called universal approximators (Hornik et al., 1989).
The study of artificial neural networks is motivated by their similarity to successfully working biological systems, which consist of very simple but numerous nerve cells that work massively in parallel and have the capability to learn. There is no need to explicitly program a neural network; instead, it can learn from training examples. One result from this
learning procedure is the capability of neural networks to generalize and associate
data. After successful training, a neural network can find reasonable solutions for
similar problems of the same class that were not explicitly trained.
A technical neural network consists of simple processing units or neurons which
are connected by directed, weighted connections. Data are transferred between
neurons via connections with the connecting weight being either excitatory or
inhibitory.
A propagation function converts vector inputs to scalar network inputs. For a
neuron the propagation function receives the outputs of other neurons and
transforms them, in consideration of the connecting weights, into a network input (net) that can be used by the activation function.
The activation function is the “switching status” of a neuron. Based on the model
of nature every neuron is always active to a certain extent. The reactions of the
neurons to the input values depend on this activation state. Neurons get activated,
if the network input exceeds their threshold value. The threshold value is explicitly
assigned to the neurons and marks the position of the maximum gradient value of
the activation function. When centered on the threshold value, the activation
function of a neuron reacts particularly sensitively. The activation of a neuron
depends on the prior activation state of the neuron and the external input.
Finally, an output function may be used to process the activation once again. The
output function of a neuron calculates the values which are transferred to the other
connected neurons. The learning strategy is an algorithm that can be used to
change the neural network and thus such a network can be trained to produce a
desired output for a given input. An error is computed from the difference between the desired response and the system output. This error information is fed
back to the system and adjusts the system parameters in a systematic fashion. The
process is repeated until the performance is acceptable. It is clear from this
description that the performance hinges heavily on the data. If one does not have
data that cover a significant portion of the operating conditions then neural
network technology is probably not the right solution. (Kriesel, 2005)
The term neural network has evolved to encompass a large class of models and
learning methods. Here we describe the most commonly used neural net, a
feedforward multilayer perceptron (MLP) neural network model with one hidden
layer. A more general description and analysis of the neural network framework
can be found in Bishop (1995).
This neural network is a two-stage regression or classification model typically
represented by a network diagram such as the one shown in Figure 13.
Figure 13. Schematic of a single hidden layer, feed-forward neural network.
For regression, there is typically only one output unit ($K = 1$); however, these networks can handle multiple responses in a seamless fashion. Derived features $Z_m$ are created from linear combinations of the inputs, and then the target $Y_k$ is modeled as a function of linear combinations of the $Z_m$:

$Z_m = \sigma(\alpha_{0m} + \alpha_m^T X)$,  $m = 1,\dots,M$,
$T_k = \beta_{0k} + \beta_k^T Z$,  $k = 1,\dots,K$,          (1)
$f_k(X) = g_k(T)$,  $k = 1,\dots,K$,

where $Z = (Z_1, Z_2, \dots, Z_M)$ and $T = (T_1, T_2, \dots, T_K)$.

The activation function $\sigma(v)$ is usually chosen to be the sigmoid $\sigma(v) = 1/(1 + e^{-v})$. Sometimes a Gaussian radial basis function (Hastie et al., 2001) is used for $\sigma(v)$, producing what is known as a radial basis function network.

Neural network diagrams like the one in Figure 13 are sometimes drawn with an additional bias unit feeding into every unit in the hidden and output layers. Thinking of the constant "1" as an additional input feature, this bias unit captures the intercepts $\alpha_{0m}$ and $\beta_{0k}$ in model (1).

The output function $g_k(T)$ allows a final transformation of the vector of outputs $T$. For regression we typically choose the identity function $g_k(T) = T_k$. Early work in classification also used the identity function, but this was later abandoned in favor of the softmax function $g_k(T) = e^{T_k} / \sum_{l=1}^{K} e^{T_l}$. This is of course exactly the transformation used in a multilogit model, and produces positive estimates that sum to one.

The units in the middle of the network, computing the derived features $Z_m$, are called hidden units because the values $Z_m$ are not directly observed. In general there can be more than one hidden layer. We can think of the $Z_m$ as a basis expansion of the original inputs $X$; the neural network is then a standard linear model, or a linear multilogit model, using these transformations as inputs. (Hastie et al., 2001)

The network shown in Figure 13 belongs to the class of feed-forward networks, in which the connections go from one layer to its successor only; there are no feedback connections. The fitting of the neural network model is done by searching for the weights that minimize the error function, which for a regression problem often takes the form of a sum of squared errors:

$R(\theta) = \sum_{k=1}^{K} \sum_{i=1}^{N} \left( y_{ik} - f_k(x_i) \right)^2$.
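For concreteness, a minimal NumPy sketch of the forward pass in model (1) is given below, assuming the sigmoid activation and the identity output function used for regression; the weights are random placeholders rather than fitted values, and none of this code comes from the thesis.

    import numpy as np

    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))

    def mlp_forward(X, alpha0, alpha, beta0, beta):
        """One-hidden-layer MLP for regression, following model (1):
        Z = sigma(alpha0 + X alpha),  T = beta0 + Z beta,  f(X) = T (identity output)."""
        Z = sigmoid(alpha0 + X @ alpha)   # derived features (hidden units)
        T = beta0 + Z @ beta              # linear combination of the hidden units
        return T                          # identity output function g(T) = T

    rng = np.random.default_rng(0)
    p, M = 134, 4                          # inputs and hidden units (as in the thesis model)
    X = rng.random((5, p))                 # five example observations
    alpha0, alpha = rng.normal(size=M), rng.normal(size=(p, M))
    beta0, beta = 0.0, rng.normal(size=M)
    print(mlp_forward(X, alpha0, alpha, beta0, beta))   # one predicted value per row

Training would then adjust alpha and beta so that the error function above is minimized, typically by gradient-based back-propagation.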
The practical use of neural networks has clear advantages but also some
limitations.
Advantages:
NNs involve human like thinking.
There is no need to assume an underlying probability distribution such as
usually is done in statistical modeling.
They handle noisy or missing data.
They can work with large numbers of variables or parameters.
They create their own relationship amongst information.
NNs are applicable to multivariate non-linear problems. A neural network
can perform tasks that a linear program cannot.
When an element of the neural network fails, the network can continue
without any problem because of its parallel nature.
NNs learn and do not need to be reprogrammed.
They provide general solutions with good predictive accuracy.
Disadvantages:
Large NNs require long processing times.
The individual relations between the input variables and the output
variables are not derived by engineering judgment, so NN models tend
to be black boxes or input/output tables without an analytical basis.
3.3 Approximation efficiency:
The efficiency of the predictions obtained by the different supervised learning
methods can be quantified in many different ways. We have decided to use the
Nash-Sutcliffe efficiency measure. The efficiency $E$ proposed by Nash and Sutcliffe (1970) is defined as one minus the sum of the absolute squared differences between the predicted and observed values normalized by the variance of the observed values during the period under investigation. $E$ is calculated as (Krause et al., 2005):

$E = 1 - \dfrac{\sum_{i=1}^{n} (O_i - P_i)^2}{\sum_{i=1}^{n} (O_i - \bar{O})^2}$,

where $O_i$ are the observed values, $P_i$ the predicted values, and $\bar{O}$ the mean of the observed values.
This measure can take values from minus infinity to one, and it is close to one if
the prediction errors are small.
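A direct translation of this definition into a small Python function (a sketch for illustration; not code used in the thesis, which relied on SAS Enterprise Miner output):

    import numpy as np

    def nash_sutcliffe(observed, predicted):
        """Nash-Sutcliffe efficiency: 1 minus the squared prediction errors
        normalized by the variance of the observed values around their mean."""
        observed = np.asarray(observed, dtype=float)
        predicted = np.asarray(predicted, dtype=float)
        return 1.0 - np.sum((observed - predicted) ** 2) / np.sum((observed - observed.mean()) ** 2)

    # A perfect prediction gives E = 1; predicting the observed mean gives E = 0.
    print(nash_sutcliffe([1, 2, 3, 4], [1, 2, 3, 4]))          # 1.0
    print(nash_sutcliffe([1, 2, 3, 4], [2.5, 2.5, 2.5, 2.5]))  # 0.0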
4 Results
4.1 Decision Trees:
Different techniques do better with different data, but trees should compete along with other methods. We decided to fit approximations of the CHAID and CART regression trees by making use of the Tree node in SAS Enterprise Miner. SAS Enterprise Miner provides a visual programming environment for predictive modeling, and its algorithms incorporate and extend most of the good ideas of the tree methods discussed in the methodology chapter.
The two chosen tree methods were performed producing a series of trees which
were based on selected parameters. A number of common tree parameters were set
to specific values to support appropriate assessment efforts. The remaining
parameters were set according to the different algorithms performed in each tree
node. Details about the parameters setting can be found in Appendix A.
We first applied the CHAID tree method by building trees of different depths, varying from 6 to 15. Given that the target is a continuous value, we used the average squared error as the assessment measure. The results obtained are shown in Table 3:
Table 3. Depth and average squared error - CHAID tree.

Depth   ASE Training   ASE Validation
6       11.40          12.75
7       11.40          12.75
8       11.40          12.75
9       11.40          12.75
10      11.40          12.75
11      11.40          12.75
12      11.40          12.75
13      11.40          12.75
14      11.40          12.75
15      11.40          12.75
These results confirm that the predictive power of the tree will not be improved by building a more complex model. Thus, the depth of the CHAID tree was set to 6. In the same fashion, when varying the values for the depth of the tree in the CART model we obtained the results shown in Table 4:
Table 4. Depth and average squared error - CART tree.

Depth   ASE Training   ASE Validation
6       12.00          13.17
7       10.74          12.37
8       10.00          11.75
9        9.65          11.55
10       9.48          11.43
11       9.38          11.37
12       9.34          11.36
13       9.33          11.36
14       9.34          11.36
15       9.34          11.36
Figure 14 shows a plot of average squared error vs. depth for the CART model. There, we observe that the lines for the training and validation sets decrease as the depth of the tree increases; however, after a depth of 10 the reduction in the average squared error is not significant. Therefore, 10 is chosen as an appropriate value for the depth of the CART tree.
Figure 14. Average squared error vs. Depth - CART tree.
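The depth-sweep experiments summarized in Tables 3 and 4 and Figure 14 can be reproduced in outline with scikit-learn, as in the sketch below on synthetic data (not the SAS Enterprise Miner runs reported above):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.random((20_000, 20))
    y = 30 + 8 * np.sin(6 * X[:, 0]) + 4 * X[:, 1] + rng.normal(0, 2, 20_000)

    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

    for depth in range(6, 16):                      # depths 6 to 15, as in Tables 3 and 4
        model = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_tr, y_tr)
        ase_tr = np.mean((model.predict(X_tr) - y_tr) ** 2)
        ase_va = np.mean((model.predict(X_va) - y_va) ** 2)
        print(depth, round(ase_tr, 2), round(ase_va, 2))
    # Pick the smallest depth beyond which the validation ASE no longer improves noticeably.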
To verify that the performance of the selected trees was acceptable, we carefully
analyzed the results of the tree nodes where we could find a number of diagnostic
tools. First, we reviewed the assessment plots which show tree evaluation
information; trees are evaluated using the number of cases that are correctly
predicted. For each tree size, a tree that correctly predicts the most training cases is
selected to represent that size. The selected tree is then evaluated again with
validation cases.
The assessment plots in Figures 15 and 16 display lines of the modeling
assessment statistic between the training and validation data sets across the
number of leaves that are created. The plots allow us to evaluate the accuracy of the decision tree models by viewing the change in the average squared error as the trees grow, based on the number of leaves in the design.
Figure 15. Assessment plot - CHAID tree. The selected tree contains 253 leaves.
Figure 16. Assessment plot - CART tree. The selected tree contains 134 leaves.
The following situations can be identified in these plots:
The lines for the training and validation data progress as the number
of leaves increases.
In Figure 15, the validation data confirms the progress of the training data
until the number of leaves is around 100. After this point, the line for the
validation data starts to flatten out and move away from the training data line.
A similar situation is encountered in Figure 16, where the validation data
confirms the progress of the training data until the number of leaves is
around 30.
If any of the decision tree models were selected as the best and final model, these
plots would help us evaluate smaller trees that still perform well in terms of the
assessment measure but are less complex and more reliable, and therefore they
might be more appropriate for the prediction process.
The trees chosen by SAS Enterprise Miner as the best trees to use were the one
with 253 leaves for the CHAID model and the one with 134 for the CART model.
These trees were selected because they optimized the assessment value on the
training data set.
The average number of observations assigned in each leaf was around 80 and 150
for the CHAID and CART tree respectively, which represents 0.28 and 0.52
percent of the total number of cases in the model set. The appropriate value of
observations in a leaf to avoid overfitting or underfitting the training data set
depends on the context, i.e., the size of the training data set; however, as a rule of thumb, an appropriate value would be between 0.25 and 1 percent of the model
set. (Berry and Linoff, 1999)
Subsequently, we constructed the color maps shown in Figures 17 and 18 in order
to illustrate the importance each of the input variables had when building the
decision trees. The higher the importance measure the better the variable
approximates the target values, and therefore variables with high importance
represent strong splits.
Figure 17. Variable importance - CHAID tree.
Figure 18. Variable importance - CART tree.
Afterwards, we analyzed graphical diagnostics of the model fit. Figures 19 and 20 are scatter plots of observed vs. predicted values for the validation sets of the CHAID and CART models, respectively.
Figure 19. Scatter plot of predicted vs. observed values - CHAID tree.
Figure 20. Scatter plot of predicted vs. observed values - CART tree.
From the two plots, it is easy to observe a large discrepancy between the observed and predicted values. Points lie far away from the 45-degree reference line that passes through the origin, indicating low predictive accuracy.
Residual plots of the tree models were also evaluated. Residuals are helpful in
evaluating the adequacy of the model itself relative to the data and any assumption
made in the analysis. If the model fits the data well, and the typical assumption of
independent normally distributed residuals is also made, the plots of the residuals
versus predicted values should not show any patterns or trends, i.e., they should be
a random scatter of points.
The plots of residuals in Figures 21 and 22 show a slightly increasing variation of
the residuals as the predicted values increase, which may suggest that the
assumption of equal variance of the residuals is not valid for this data.
Nevertheless, it is hard to confirm this assumption and it would be more natural to
consider the plots of residuals within the limits one may expect when building a
complex predictive model.
Figure 21. Scatter plot of residuals vs. predicted values - CHAID tree.
Figure 22. Scatter plot of residuals vs. predicted values - CART tree.
In addition, the histograms shown in Figures 23 and 24 provide a view of the overall distribution of the residuals. The plots appear to be bell-shaped; however, the pattern found in the plots of residuals vs. predicted values is also revealed in these histograms, which show tails that are too long for the residuals to be considered approximately normal.
Figure 23. Histogram of residuals - CHAID tree.
Figure 24. Histogram of residuals - CART tree.
Finally, we calculated the Nash-Sutcliffe efficiency measure for the validation sets of the CHAID and CART models to evaluate the performance of these trees. The values obtained, 0.61 for the CHAID tree and 0.65 for the CART tree, are far from 1, indicating a poor fit.
4.2 Neural Networks:
The Neural Network node in SAS Enterprise Miner enables us to fit nonlinear
models such as a multilayer perceptron (MLP). NNs are flexible prediction models
that, when carefully tuned, often provide optimal performance in regression and
classification problems. There is no theory that tells us how to set the parameters
of the network to approximate any given function. It will generally be impossible
to determine the correct design without training numerous networks and
estimating the generalization error for each model. The design process and the
training process are both iterative.
We made use of the advanced user interface provided by the Neural Network node
to create a MLP model. Figure 25 displays the constructed network.
Figure 25. Schematic representation of the MLP neural
network model built in SAS Enterprise Miner.
The layer on the left represents the input layer and it consists of all interval and
nominal inputs. The middle layer is the hidden layer, in which hidden units
(neurons) were varied from 1 to 40, and 4 was selected as the optimal value based
on the results shown in Table 5 and Figure 26. The layer on the right is the output
layer, which corresponds to the target variable K. The propagation, activation and
output functions were selected based on the default configuration specified in the
methodology chapter.
Table 5. Average squared error of a feedforward MLP neural network model with one hidden layer.

Neurons   ASE Training   ASE Validation     Neurons   ASE Training   ASE Validation
1         5.49           6.39               21        2.24           4.09
2         4.12           5.30               22        2.72           4.11
3         3.70           4.93               23        2.75           4.37
4         3.57           4.25               24        2.49           3.78
5         3.38           4.73               25        2.39           4.12
6         3.25           4.23               26        2.24           3.91
7         3.07           4.34               27        2.49           3.87
8         3.19           4.39               28        2.24           4.05
9         3.31           4.39               29        2.32           3.94
10        2.73           4.40               30        2.27           4.03
11        3.43           4.57               31        2.14           4.00
12        2.95           4.41               32        2.33           3.94
13        3.20           4.40               33        2.23           4.07
14        2.97           4.64               34        2.30           3.97
15        2.74           4.05               35        2.05           4.06
16        2.70           4.22               36        1.98           3.92
17        2.65           4.34               37        1.92           4.09
18        2.78           4.30               38        1.96           4.04
19        2.78           3.94               39        2.20           3.95
20        2.51           4.20               40        2.00           4.10
Figure 26. Average squared error vs. number of neurons.
The number of hidden neurons affects how well the network is able to predict the output variable. A large number of hidden neurons will ensure correct learning and prediction of the data the network has been trained on, but its performance on new data may be compromised. On the other hand, with too few hidden neurons the network may be unable to learn the relationships among the data. Thus, the selection of the number of hidden neurons is crucial. The trial-and-error approach used for selecting an appropriate number of hidden neurons started with a small number of neurons and gradually increased the number when the network failed to reduce the error.
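The trial-and-error search over the number of hidden neurons can be mimicked with scikit-learn's MLPRegressor, as sketched below on synthetic data (the thesis itself used the Neural Network node in SAS Enterprise Miner):

    import numpy as np
    from sklearn.neural_network import MLPRegressor
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.random((10_000, 20))
    y = 30 + 8 * np.sin(6 * X[:, 0]) + 4 * X[:, 1] + rng.normal(0, 1, 10_000)

    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

    for n_hidden in (1, 2, 4, 8, 16, 32):                 # candidate hidden-layer sizes
        net = MLPRegressor(hidden_layer_sizes=(n_hidden,), activation="logistic",
                           max_iter=2000, random_state=0).fit(X_tr, y_tr)
        ase_va = np.mean((net.predict(X_va) - y_va) ** 2)
        print(n_hidden, round(ase_va, 2))
    # Choose the smallest network whose validation error is no longer clearly improving.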
Although one hidden layer is always sufficient provided we have enough data,
there are situations where a network with two or more hidden layers may require
fewer hidden units and weights than a network with one hidden layer, thus using
extra hidden layers can sometimes improve generalization. We built a second model with two hidden layers, with the number of neurons varying from two to four and distributed differently among the layers in each run. Estimates of the average squared error for each network are displayed in Table 6, and the results reveal that better predictions are not obtained by adding an extra hidden layer.
Table 6. Average squared error of a feedforward MLP neural network model with two hidden layers.

Neurons First Layer   Neurons Second Layer   ASE Training   ASE Validation
1                     1                      5.64            6.26
1                     2                      5.01            5.62
1                     3                      5.55            6.33
2                     1                      9.29           10.05
3                     1                      5.26            5.69
2                     2                      5.93            6.80
The plot shown in Figure 27 displays the average squared error for each iteration
of the training and validation sets of the MLP model with one hidden layer and
four neurons.
Figure 27. Assessment plot - MLP (1 hidden layer and 4 neurons).
The average squared error for the training and validation data decreases as the number of iterations increases. By default, the node completed 100 iterations, but we could have continued the training process. However, given that the reduction in the average squared error was becoming less and less significant after the hundredth iteration, we decided to evaluate the default model.
Color maps of the weight factors were constructed and they are displayed in
Figures 28-33. Each input has its own relative weight, which gives the input the impact that is needed during the training process. Weights determine the intensity of the input signals as registered by the neurons. Some input variables are considered more important than others, and the color maps illustrate the effect that each input has on the network.
Figure 28. Weight 1 - Variable L.
Figure 29. Weight 2 - Variable L.
Figure 30. Weight 3 - Variable L.
Figure 31. Weight 4 - Variable L.
Figure 32. Weights - Variable E.
Figure 33. Weights - Variables G and S.
Subsequently, a scatter plot of predicted vs. observed values was obtained and it is
shown in Figure 34. This plot reveals that the MLP neural network model with
one hidden layer and four neurons produced better predictive results for the output
variable K than the CHAID and CART tree models. Observed and predicted
values are very close to each other which is expected from an accurate model.
Observations lie close to the 45 degree reference line that passes through the
origin showing a high correlation between the observed and predicted values.
However, the closer we get to the minimum and especially the maximum values of the data, the more dispersed the points tend to be, indicating that predictions of those values are less accurate. These points, lying far away from the diagonal line, represent cases with few observations.
Figure 34. Scatter plot of predicted vs. observed values.
Additionally, in Figure 35 we can observe that even though the residuals are fairly scattered around zero, there is a slight but discernible tendency for the residuals to increase as the predicted values increase. This indicates that the model performs less well when predicting high observed values.
Figure 35. Scatter plot of residuals vs. predicted values.
Once again, the histogram of the residuals shown in Figure 36 appears to follow a normal distribution pattern; however, the tails are too long. When building complex predictive models, such as trees or neural networks, it is acceptable to obtain residuals that behave as the ones in this figure.
Figure 36. Histogram of residuals.
Finally, we calculated the Nash-Sutcliffe efficiency measure for the validation set to evaluate the performance of the selected neural network. This time, the value obtained, 0.86, is much closer to 1, indicating a reasonably good fit.
4.3 Scoring Process:
The final and most important step during the process of building a predictive
model is the generalization or scoring process, i.e., how well the model makes
predictions for cases that were not available at the time of training and that do not
contain a target value.
The Score node in SAS Enterprise Miner generates and manages scoring code that
is produced by the tree or neural network nodes. The code is encapsulated and can
be used in most SAS environments even without the presence of Enterprise Miner.
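Outside of SAS Enterprise Miner, the same scoring idea can be sketched as follows in Python with scikit-learn; the arrays and the fitted network are hypothetical stand-ins for the thesis data and model:

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)
    # Hypothetical stand-ins: labelled data for training, unlabelled data to be scored.
    X_model = rng.random((10_000, 134))
    y_model = rng.uniform(17, 80, 10_000)
    X_unscored = rng.random((12_000, 134))           # vehicles without the control unit

    net = MLPRegressor(hidden_layer_sizes=(4,), activation="logistic",
                       max_iter=2000, random_state=0).fit(X_model, y_model)
    predicted_K = net.predict(X_unscored)            # scored values of the metric K
    print(predicted_K[:5])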
After scoring the 43% of the collected data for which the value of the variable K could not be calculated, we produced the overlaid histograms of observed and predicted values shown in Figure 37. The distribution of the predicted values is very similar to the distribution of the observed values, which indicates that we have obtained reliable predictive results.
Figure 37. Histograms of observed and predicted values for the variable K.
5 Discussion and conclusions
Throughout this thesis work two well known supervised learning methods,
regression trees and neural networks, were performed in order to build a predictive
model that could accurately predict the output of the metric K. This metric
describes the usage of Scania vehicles due to the road conditions and the driving
needs, and it is used in the company during the process of developing vehicle
components.
The first major problem encountered when selecting the appropriate predictive
method was the high dimensionality of the input data, as the presence of a large
number of input variables can present some severe problems for pattern
recognition systems. In addition, the underlying distribution of the input dataset
was unknown, as well as the relationships between the input variables and the
output variable, and the possible relations among all input variables. Given the
complexity of the input dataset, methods that assume no distributional patterns in
the data, and that can at the same time handle unknown high dimensional
relationships were required.
We first decided to implement CHAID and CART regression trees as they are easy to produce, understand, and use. The tree methods' ability to incrementally build complex rules is simple and powerful, and they readily adapt to various twists and turns in the data. Nevertheless, given that the predictive results were not satisfactory, an MLP neural network model with one hidden layer and four neurons was then fitted.
Neural networks are also commonly implemented to model complex relationships between inputs and outputs when there is little prior knowledge of these relationships. They also have the ability to detect all possible interactions between predictor variables. Moreover, no assumptions about the model have to be made; neural networks can solve difficult problems that cannot be solved quickly or accurately with conventional methods, given the latter's limitation to strict assumptions of normality, linearity, variable independence, etc. Finally, MLPs can approximate almost any function with a high degree of accuracy given enough data, enough hidden units, and enough training time.
Evaluation of the methods' performance was based on the Nash-Sutcliffe efficiency measure, which showed that the selected neural network model was able to capture the patterns and unknown relations existing between the input data and the output metric K with an efficiency of 0.86, whereas the measures of model performance for the CHAID and CART trees were 0.61 and 0.65 respectively. One reason for the high accuracy of the neural network model is its computation of adequate weights for each of the input attributes, thus accounting for all the predictive information each of these attributes contains. These weights are then combined, and the computed values are passed along connections to other hidden units and output units, where internal computations provide the nonlinearity that makes neural networks so powerful, and finally predicted output values close to the observed values are generated.
On the other hand, both the CHAID and CART regression trees use fewer inputs than the neural network model. They attempt to find strong relationships between the input and target variables, and only relationships that are strong enough are used for building the model. Some input attributes are treated as irrelevant or redundant and are not taken into account when building the predictive tree. Thus, the patterns and relations existing between these "irrelevant" input attributes and the output variable K are not captured, and the predicted values produced are not as accurate as those obtained with the neural network model.
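This behaviour can be observed directly on a fitted tree. The sketch below uses scikit-learn's CART-style DecisionTreeRegressor, with placeholder data standing in for the Scania inputs, and counts how many input attributes actually appear in at least one splitting rule; attributes that never appear contribute nothing to the predictions.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    # Placeholder data standing in for the Scania inputs and the metric K.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 20))
    K = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=1000)

    tree = DecisionTreeRegressor(max_depth=15, min_samples_leaf=20).fit(X, K)

    # Internal nodes store the index of the input they split on; leaf nodes store -2.
    split_features = tree.tree_.feature
    used = np.unique(split_features[split_features >= 0])
    print(f"{used.size} of {X.shape[1]} inputs appear in at least one splitting rule")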
In addition, knowledge about the input variable matrix L indicates that some of its adjacent positions must be considered as a whole when analyzing the patterns present in the data, even when they are to some extent correlated. The tree methods do not take this special feature of the input data into account, because attributes are treated one at a time when producing the splitting rules, and in certain cases only some of them are considered important inputs.
In contrast, neural networks take all input attributes into account when building the model, even if some of them show a certain degree of correlation. Some attempts were made to understand how the weights produced by the neural network were distributed over the input data set in a way that could capture the patterns shaped by adjacent positions of the matrix L. However, plots of the computed weights did not reveal any apparent pattern in the distribution of the weights over the entire input set, so no evident explanation of how the neural network model relates adjacent positions of the matrix was found.
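For completeness, a weight plot of this kind can be drawn from the scikit-learn stand-in introduced earlier: the input-to-hidden weight matrix of a fitted MLPRegressor is exposed as coefs_[0]. The snippet below only illustrates how such a plot can be produced; it does not reproduce the figures examined in this work.

    import matplotlib.pyplot as plt

    # 'mlp' is assumed to be the fitted MLPRegressor from the earlier sketch.
    W = mlp.coefs_[0]              # shape: (number of inputs, 4 hidden neurons)
    plt.imshow(W.T, aspect="auto", cmap="coolwarm")
    plt.xlabel("input attribute index")
    plt.ylabel("hidden neuron")
    plt.colorbar(label="weight")
    plt.title("Input-to-hidden weights")
    plt.show()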
One of the disadvantages of a neural network model is its "black box" nature, which is why neural networks are often used when the prediction task is more important than the interpretation of the built model. Even though the neural network model outperformed the tree models, its complex structure means that it lacks a clear graphical representation of the results and that it requires longer computation time.
Satisfactory results were also achieved when applying the scoring formula from the neural network model to new cases, i.e., when generating predicted values for the fraction of the data set that did not contain the metric K as the target value. The results showed that it is possible to rely on the predictive power of the neural network model, and further analysis, including other groups of vehicles built in Scania for different purposes, can be based on the proposed model.
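In scikit-learn terms, this scoring step simply amounts to calling the fitted model on the records for which K was never observed. The sketch below assumes the fitted scaler and mlp from the earlier sketch and a placeholder array X_unscored holding the input attributes of those vehicles; both names are illustrative, not part of the thesis workflow.

    # X_unscored: input attributes of the vehicles whose metric K was never measured
    # (placeholder name; scaler and mlp are the fitted objects from the earlier sketch).
    K_scored = mlp.predict(scaler.transform(X_unscored))
    print("Mean predicted K for the unscored vehicles:", K_scored.mean())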
6 Literature
1. Abdullah, M. (2010). Decision Tree Induction & Clustering Techniques in
SAS Enterprise Miner, SPSS Clementine, and IBM Intelligent Miner – A
Comparative Analysis. IABR & ITLC Conference Proceedings.
2. Berry, M.J.A. and Linoff, G. (1999). Mastering Data Mining: The Art and
Science of Customer Relationship Management. New York: John Wiley &
Sons, Inc.
3. Bishop, C. (1995). Neural Networks for Pattern Recognition. Oxford
University Press Inc., New York.
4. De Ville, B. (2006). Decision Trees for Business Intelligence and Data
Mining: Using SAS® Enterprise Miner™. SAS Publishing.
5. Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics.
6. Krause, P., Boyle, D.P. and Bäse, F. (2005). Comparison of different efficiency criteria for hydrological model assessment. Advances in Geosciences, 5, 89-97.
7. Kriesel, D. (2005). A Brief Introduction to Neural Networks. www.dkriesel.com.
8. Neville, P. (1999). Decision Trees for Predictive Modeling. SAS Institute
Inc.
9. SAS Enterprise Miner Tutorial, retrieved in 2010.
10. Scania Inline, retrieved in 2010 from www.sacnia.inline.com.
7 Appendix
Appendix A. Parameter settings for the CHAID and CART algorithms
Tree parameter settings to support appropriate assessment efforts:
Minimum number of observations in a leaf:
The smaller this value is, the more likely it is that the tree will overfit the training
data set. If the value is too large, it is likely that the tree will underfit the training
data set and miss relevant patterns in the data. In SAS the default setting is max(5, n/1000), where n is the number of observations in the training set. In our case, the default value for the minimum number of observations in a leaf is 20, and no better predictive results were obtained when different values of this parameter were tried.
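As a small check of this default, with the 20218 training observations reported later in this appendix, n/1000 truncates to 20, which exceeds the lower bound of 5. The two-line snippet below (assuming integer truncation of n/1000) reproduces the value:

    n = 20218                      # observations in the training set
    min_leaf = max(5, n // 1000)   # SAS default rule max(5, n/1000), assuming truncation
    print(min_leaf)                # prints 20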
Observations required for a split search:
This option prevents the splitting of nodes with few observations. In other words,
nodes with fewer observations than the value specified in observations required
for a split search will not be split. The default is a calculated value that depends
on the number of observations and the value stored in minimum number of
observations in a leaf. The default value for our model is 202, and no better predictive results were obtained when other values were tried.
Maximum depth of tree:
This parameter was changed from 6 to 15 to allow complex trees to be grown. The
size of a tree may be the most important single determinant of quality, more
important, perhaps, than creating good individual splits. Trees that are too small
do not describe the data well. Trees that are too large have leaves with too little
data to make any reliable predictions about the contents of the leaf when the tree is
applied to a new sample. Splits deep in a large tree may be based on too little data
to be reliable.
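These settings have rough counterparts in other tree implementations. As an illustration only, the following scikit-learn constructor maps the values used in this work (a minimum of 20 observations per leaf, at least 202 observations required for a split search, and a maximum depth of 15) onto a CART-style regression tree; scikit-learn does not implement CHAID, so this is not an exact reproduction of the SAS Enterprise Miner settings.

    from sklearn.tree import DecisionTreeRegressor

    tree = DecisionTreeRegressor(
        min_samples_leaf=20,    # minimum number of observations in a leaf
        min_samples_split=202,  # observations required for a split search
        max_depth=15,           # maximum depth of tree
    )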
Parameter settings according to the different algorithms performed with the Tree node:
Approximation of the CHAID algorithm by using the Tree node:
The Model assessment measure property was set to Average squared error.
This measure is the average of the square of the difference between the
predicted outcome and the actual outcome, and it is used to calculate the
worth of a tree when the target is continuous. The worth of the tree is
calculated by using the validation set to compare trees of different sizes in
order to pick the tree with the optimal number of leaves.
The Splitting Criterion was set to F test to measure the degree of separation
achieved by a split.
The F test significance level was set to 0.05, as a stopping rule that
accounts for the predictive reliability of the data. Partitioning stops when no
split meets the threshold level of significance.
To avoid automatic pruning, the Subtree method was set to The most leaves.
The subtree method determines which subtree is selected from the fully
grown tree. This option selects the full tree given that other options are
relied on for stopping the training.
The Maximum number of branches from a node option was varied between 2 and 100, and 10 was chosen, since no better predictive results were obtained when this value was increased further.
The Surrogate rules saved in each node option was set to 0. A surrogate rule is a back-up to the main splitting rule. When the main splitting rule relies on an input whose value is missing, the first surrogate rule is invoked. If the first surrogate also relies on an input whose value is missing, the next surrogate is invoked. If missing values prevent the main rule and all of the surrogates from applying to an observation, then the main rule assigns the observation to the branch it has designated as receiving missing values. However, since missing values are not present in the data, surrogate rules were not used.
To force a heuristic search, the Maximum tries in an exhaustive split search option was set to 0. This option sets an upper limit on the number of candidate splits that are evaluated exhaustively when searching for the optimal split; with the value 0, the heuristic search is always used instead of evaluating every possible split on a variable.
The Observations sufficient for split search option was set to the size of the
training data set (20218). This option sets an upper limit on the number of
observations used in the sample to determine a split. All observations in the
node are then passed to the branches and a new sample is taken within each
branch independently.
The P-value adjustment was set to Kass, and the Apply Kass after choosing
number of branches option was also selected. By choosing this option, the
P-value is multiplied by a Bonferroni factor that depends on the number of
branches, target values, and sometimes on the number of distinct input
values. The algorithm applies this factor after the split is selected. The
adjusted P-values are used in comparing splits on the same input and splits
on different inputs.
Approximation of the CART algorithm by using the Tree node:
Trees created with the Tree node are very similar to those grown with the Classification and Regression Trees method without linear combination splits or the Twoing or ordered Twoing splitting criteria. The Classification and Regression Trees method recommends using validation data unless the data set contains too few observations. The Tree node is intended for large data sets. The options in the Tree node were set as follows:
The Model assessment measure property was set to Average squared error.
The Splitting Criterion was set to Variance reduction. This value measures the reduction in the squared error from the node means (a short numerical illustration of this criterion is given at the end of this appendix).
The Maximum number of branches from a node option was set to 2.
The Treat missing as an acceptable value check box was selected.
However, this option did not affect the results since the data did not contain
missing values.
The Surrogate rules saved in each node option was set to 5; however, for the same reason mentioned above, these rules were never invoked.
The Subtree method was set to Best assessment value. This option selects
the smallest subtree with the best assessment value. Validation data is used
during the selection process.
The Observations sufficient for split search option was set to 1000.
The Maximum tries in an exhaustive split search option was set to 5000. To find
the optimal split, it is sometimes necessary to evaluate every possible split
on a variable. Sometimes the number of possible splits is extremely large.
In this case, if the number for a specific variable in a specific node is larger
than 5000, then a heuristic (stepwise, hill-climbing) search algorithm is
used instead for that variable in that node.
The P-value adjustment was set to Depth. By selecting this option, the
P-values are adjusted for the number of ancestor splits where the
adjustment depends on the depth of the tree at which the split is done.
Depth is measured as the number of branches in the path from the current
node, where the splitting is taking place, to the root node. The calculated
P-value is multiplied by a depth multiplier, based on the depth in the tree of
the current node, to arrive at the depth-adjusted P-value of the split.
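To make the variance-reduction splitting criterion mentioned above concrete, the sketch below computes, for a candidate split of a node into a left and a right branch, the reduction in the sample-size-weighted variance of the target around the node means. This is a generic illustration of the criterion, with placeholder values, and not the exact computation performed by SAS Enterprise Miner.

    import numpy as np

    def variance_reduction(parent, left, right):
        """Reduction in squared error from node means achieved by a split:
        Var(parent) minus the weighted average of Var(left) and Var(right)."""
        parent, left, right = map(np.asarray, (parent, left, right))
        n = parent.size
        weighted_child_var = (left.size / n) * left.var() + (right.size / n) * right.var()
        return parent.var() - weighted_child_var

    # Placeholder target values in a node and its two candidate branches:
    parent = np.array([10.0, 12.0, 30.0, 33.0, 35.0])
    left, right = parent[:2], parent[2:]
    print(round(variance_reduction(parent, left, right), 2))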