Abstract

The aim of this master thesis is to predict the outcome of a metric K which describes the usage of Scania vehicles on different roads. This metric is of great interest for the company and is used during the development of vehicle components. In this work we discuss two well-known supervised learning methods, decision trees and neural networks, which enable us to build the predictive models. The data set consists of approximately 30,000 vehicles and is based on a set of features which, from theoretical bases and expert opinions within Scania, were considered to contain relevant information and to be related to the output metric K. The selected data set represents the largest product segment in Scania, long haulage vehicles. CART (Classification and Regression Trees) and CHAID (Chi-squared Automatic Interaction Detection) regression trees of different sizes were first built, given their simplicity and predictive power. However, evaluation of the performance of these algorithms, based on the Nash-Sutcliffe efficiency measure (0.61 and 0.65 for the CART and CHAID trees respectively), shows that the tree methods were not able to extract the patterns and relationships present in the data. Finally, knowing that given enough data, hidden units, and training time, a feedforward multilayer perceptron (MLP) can learn to approximate virtually any function to any degree of accuracy, an MLP neural network model with one hidden layer and four neurons was fitted. An efficiency of 0.86 shows that the predictions obtained with the selected network were more accurate than those acquired with the regression tree methods.
Predicted values for the fraction of the data set that did not contain the metric K as the target value were also obtained, and the results showed that it is possible to rely on the predictive power of the neural network model for further analysis, including other groups of vehicles built by Scania for different purposes.

Acknowledgements

I would first like to express my deep appreciation and sincere gratitude to Professor Anders Grimvall for encouraging me and giving me the opportunity to take part in the Master's Programme in Statistics, Data Analysis and Knowledge Discovery, and especially thank him for recommending me to Scania. Thank you Anders for your guidance, for sharing your knowledge with me, for being so patient, and for always trying to explain everything clearly and carefully during our lessons and consulting sessions. Your pedagogical spirit and your never-ending stream of ideas have always inspired and motivated me during my studies and work. I would also like to thank everyone who in one way or another was involved in the success of this thesis work. I thank Scania for its permission to carry out this project and for making the analyzed data available. Ann Lindqvist, my supervisor at Scania, gave me the opportunity and confidence to work with this challenging project. Thank you Ann for helping me obtain all the necessary knowledge within Scania and for introducing me to all the people who in different ways contributed to improving the quality of my work. I also want to thank you for your friendship, for making sure I would always find my way in Scania and in Södertälje, and for sharing time with me after a hard working day. Thank you for all the help you offered me inside and outside the office. Thank you Klas Levin, Mikael Curbo, Anders Forsen, Erik Landström and all of you at Scania who were always interested in my work, offering me valuable advice and helping me during the development of my thesis.
Special thanks also go to Anders Noorgard, my supervisor at Linköping University, who gave me valuable tips about the project work and report writing, and Oleg Sysoev for taking the time to review my work and for sharing his knowledge about machine learning. Last, but not least, I especially want to thank my beloved family: my mother Juneida Sánchez, for your endless love, your blessings and your wishes for the successful culmination of all my projects. My sister Marbella Covino, for believing in me and for always being there for me. My boyfriend Karl Aronsson, for all your love and support, for always encouraging me, for being next to me during happy and difficult times, for always making me laugh, and for showing me the positive side of every situation. My Swedish family, the Aronsson family, because you have made me feel this is my second home; thank you for supporting me and helping me ever since I decided to move to Sweden.

Table of contents

1 Introduction
  1.1 Scania
  1.2 Background
  1.3 Objective
2 Data
3 Methodology
  3.1 Methodology step by step
  3.2 Supervised learning methods
    3.2.1 Decision Trees
      3.2.1.1 CHAID
      3.2.1.2 CART
    3.2.2 Neural Networks
  3.3 Approximation efficiency
4 Results
  4.1 Decision Trees
  4.2 Neural Networks
  4.3 Scoring Process
5 Discussion and conclusions
6 Literature
7 Appendix

1 Introduction

1.1 Scania

Scania is one of the world's leading manufacturers of trucks and buses for heavy transport applications. The company operates in about 100 countries and employs almost 33,000 people. Scania's objective is to deliver optimized heavy trucks and buses, engines and services, provide the best total operating economy for its customers, and thereby be the leading company in the industry.
Research and development are concentrated in Södertälje, Sweden, and production units are located in Europe and Latin America. This master thesis has been carried out at RESD, the department responsible for diagnostic protocols. Software modules for diagnostic communication between electrical control unit systems and external tools are developed in this department, as well as off-board systems for remotely retrieving and analyzing diagnostic data. (Scania Inline, 2010)

1.2 Background

The electrical system in Scania vehicles is based on a number of control units that communicate with each other via a common network based on serial communication. Scania's serial communication is based on the CAN protocol. The principal features of a CAN bus system are control and interaction. At the heart of Scania's CAN bus is a central control unit (coordinator) through which all functions are monitored and managed. From here, the truck's electrical functions are arranged in three circuits: red, yellow and green. Red functions cover all main management units: engine, gearbox, brakes and suspension. Yellow covers instruments, bodywork systems, locking and alarm systems, and lights. Green covers comfort systems, such as climate control, audio and informatics.

Figure 1. Vehicle Applications of Controller Area Network (CAN).

All control units found in the Scania electrical system can be checked with plug-in diagnostic software (SDP3) used by Scania's workshops, among other purposes, to decipher and interpret the operational data. Data about the operation of the vehicles stored in the control units is read with SDP3 and sent via the Internet to Scania's servers in Södertälje for analysis. Only authorized dealer workshops and distributors have the necessary identities and access rights to collect, use, and transfer operational data.

Figure 2. Operational data collection system.
A huge amount of operational data has been gathered and analyzed to understand vehicle usage, for example how the accelerator pedal and vehicle momentum are utilized in varying topography. The frequency and harshness of brake applications, the efficiency of auxiliary brake use (Scania retarder and exhaust brake), and the matching of gear selection to engine revolutions have also been evaluated from the collected data.

Figure 3. Histogram of operational data collected from 2006 to 2010.

A metric used in the company to describe the usage of Scania vehicles due to the road conditions and the driving needs (starts and stops, accelerations, etc.), which from now on we will call K, is calculated using data collected from one of the control units currently installed in the electrical system of the new generation of Scania vehicles. However, this value cannot be estimated for those vehicles that are not equipped with the required control unit. Hence, it is of interest to build a predictive model to estimate the values of K by making use of the data available for all vehicles. Given the nature of the problem, we have decided to carefully select a set of data consisting of a group of variables which are believed to have potentially predictive relationships with the variable K. Afterwards, different algorithms can be implemented to capture the patterns and relations found in the data and generalize to unseen situations in a reasonable way. These algorithms, also called supervised learning methods, apply various mechanisms capable of inducing knowledge from examples of data.

1.3 Objective

The aim of this thesis work is to build a predictive model, by making use of supervised learning methods, which can accurately predict the outcome of a metric that describes the usage of Scania vehicles due to the road conditions and the driving needs.
2 Data

Our analysis is concentrated on the segment of long haulage vehicles, in which Scania has had a strong market presence for many years. The selection of the data is based on an assortment of physical components in order to obtain a group of vehicles that are mostly dedicated to this specific product segment. The selected group represents 78% of the total operational data collected in the company. As illustrated in Figure 4, approximately 43% of these vehicles are not equipped with the required control unit from which the necessary data for the calculation of the metric K is collected. Thus, only 35% of the vehicles are selected for building the predictive model, and the remaining data will be used during the scoring process. Some potential predictor variables are selected based on theoretical bases and expert opinions. In addition, the corresponding values of K for this 35% of the data are calculated from a sequence of measurements made by specialists in Scania, based on a series of studies.

Figure 4. Pie chart of operational data: required control unit (yes), long haulage: 35%; required control unit (yes), other purposes: 16%; required control unit (no), long haulage: 43%; required control unit (no), other purposes: 6%.

Once the selected data had been extracted from the different databases, it was integrated into one data set consisting of approximately 30,000 observations. After removing input variables that had low or no predictive power, the input data set was represented by four variables. The first variable corresponds to an 11×12 matrix called L. The second and third variables represent two vectors of 10 positions each, called S and G. The variables L, S and G implicitly contain information about the usage of Scania vehicles. The last variable is called E, and it corresponds to the different categories of one of the vehicle components.
All input variables excluding E are represented by continuous values, whereas the variable E contains nominal values. Afterwards, for simplicity and data reduction, we transformed the raw data to create new input variables. We calculated averaged values of the vectors S and G. However, it was not possible to make this reduction for the values in the matrix L because of the importance of the information contained in each of its positions; every position in the matrix is crucial for the pattern recognition process. Hence, the matrix was reorganized into a feature vector of 132 positions so that the predictive methods could handle the variable. As there was no need to transform the output variable K, this variable was used in its raw form. A quantitative analysis of the data set is given by the descriptive statistics of the input and output variables, shown by the color maps in Figures 5-8, Tables 1 and 2, and the histograms in Figures 9-11. They provide simple summaries of the data set being analyzed and the measures.

Figure 5. Minimum values for the input variable Matrix L.

Figure 6. Maximum values for the input variable Matrix L.

Figure 7. Mean values for the input variable Matrix L.

Figure 8. Median values for the input variable Matrix L.

Table 1. Descriptive statistics for the input variables S and G, and for the output variable K.

Variable     Min       Max       Mean      Q1        Q3        Median
S             1.527    82.218    54.538    48.826    62.869    57.240
G             2.000    86.774    35.238    29.023    40.250    35.278
K            17.090    79.310    34.564    30.750    37.460    33.560

Table 2. Counts for the input variable E.
Variable   Count     Variable   Count     Variable   Count
E01          162     E15           30     E31           52
E02           13     E16         1175     E32            8
E03          627     E17           22     E33            4
E04           86     E18          297     E34            1
E05          342     E19         1096     E35            7
E06         1860     E20          480     E36           13
E07         2382     E21         1389     E37            1
E08         3470     E22          529     E38           96
E09           20     E23         1129     E39          700
E10          507     E24          149     E40           47
E11            2     E25          146     E41          504
E12          391     E26          813     E42         5701
E13           10     E27           69     E43           42
E14          616     E28           32     E44           56
                     E29            8     E45         3786
                     E30           12     E46            1
Total E    28883

In addition, as a summary of the frequency of the continuous input and output variables, the histograms in Figures 9 through 11 were also obtained:

Figure 9. Histogram of the input variable S.

Figure 10. Histogram of the input variable G.

Figure 11. Histogram of the output variable K.

Further information about how each of the chosen supervised learning methods interprets and utilizes the selected variables when building the predictive models is given in detail in the results chapter.

3 Methodology

A sequence of steps was followed during the development of the project in order to successfully reach the main objective of this master thesis: to build a predictive model, by making use of supervised learning methods, which can accurately predict the outcome of a metric that describes the usage of Scania vehicles due to the road conditions and the driving needs.

3.1 Methodology step by step

1. First, we gathered the training set of data, which needed to be characteristic of the real-world use of the function to be learned. Thus, approximately 30,000 observations were collected, characterized by a set of input variables which implicitly contained descriptive information about the usage of the vehicles and were considered to have enough predictive power to estimate the values of the output variable K.
The corresponding values of K were also collected for each observation from a sequence of measurements made by specialists in Scania.

2. Second, we determined the input feature representation of the function. During this step, the input variables were reorganized or transformed into suitable values for the predictive methods. Thus, matrices were reorganized into vectors where all positions were kept, and vectors were transformed into single averaged values. The number of features should not be too large, because of the curse of dimensionality, but should be large enough to accurately predict the output. The output variable was used in its raw form.

3. Third, we carried out graphical representations of the data and analyses of the descriptive statistics, which were useful for detecting spurious observations. Inconsistent records were eliminated, thus increasing the quality of the data.

4. Subsequently, we selected two supervised learning methods which were thought to be appropriate for the given problem and data at hand: decision trees and neural networks.

5. The selected predictive methods required partitioning the data set into training and validation sets. The training set teaches the model, and the validation set measures and assesses the model's performance and reliability before applying the model to future unseen data. The validation process guards against the over-fitting problem by validating the model on a different set of data. Our model data set was split into a training data set and a validation data set, 70% and 30% respectively, in order to create a large enough validation data set. A validation data set that is too small might lead to erroneous conclusions when evaluating the reliability of the model.

6. We completed the design by running the learning algorithms on the gathered training set. Parameters of the algorithms were adjusted to optimize the performance on a subset (validation set) of the data.
A manual forward selection method was also implemented during this step for selecting the combination of input variables that increased the predictive power of the learning methods.

7. Finally, we assessed the performance of the chosen learning algorithms based on the produced average squared error, and compared the efficiency of the predictions based on the Nash-Sutcliffe efficiency measure. The best predictive method was selected and applied to a new set of data in order to compute the corresponding values of K.

3.2 Supervised learning methods

In a typical scenario for supervised learning methods, we have an outcome measurement, usually quantitative or categorical, that we wish to predict based on a set of features. We also have a training set of data, in which we observe the outcome and feature measurements for a set of objects. Using these data we build a prediction model, or learner, which will enable us to predict the outcome for new unseen objects. A good learner is one that accurately predicts such an outcome. A supervised learning method is a machine learning technique for deducing a function from training data. The function fitting paradigm from a machine learning point of view is as follows: suppose for simplicity that the errors are additive, so that the model Y = f(X) + ε is a reasonable assumption. Supervised learning attempts to learn by example through a teacher. One observes the system under study, both the inputs and the outputs, and assembles a training set of observations (x_i, y_i), i = 1,…,N. The observed input values to the system are also fed into an artificial system, known as a learning algorithm, which produces outputs in response to the inputs. The learning algorithm has the property that it can modify its input/output relationship in response to differences between the original and generated outputs. This process is known as learning by example.
Upon completion of the learning process, the hope is that the artificial and real outputs will be close enough to be useful for all sets of inputs likely to be encountered in practice. (Hastie et al., 2001)

3.2.1 Decision Trees

Decision trees belong to a class of data mining techniques that break up a collection of heterogeneous records into smaller groups of more homogeneous records using directed knowledge discovery. Directed knowledge discovery is goal-oriented: it explains the target field in terms of the input fields in order to find meaningful patterns and predict future events using a chain of decision rules. In this way, decision trees provide accurate and explanatory models, where the decision tree model is able to explain the reason for certain decisions using these decision rules. Decision trees can be used in classification problems and also in estimation problems where the output is a continuous value; in the latter case the tree is called a regression tree. (Abdullah, 2010) For a tree to be useful, the data in the leaves (the final groups or unsplit nodes) must be similar with respect to some target measure, so that the tree represents the segregation of a mixture of data into purified groups. (Neville, 1999) The general form of this modeling approach is illustrated in Figure 12. Decision trees attempt to find a strong relationship between input values and target values in a group of observations that form a data set. When a set of input values is identified as having a strong relationship with a target value, all of these values are grouped in a bin that becomes a branch on the decision tree. These groupings are determined by the observed form of the relationship between the bin values and the target.
Binning involves taking each input, determining how the values in the input are related to the target, and, based on the input-target relationship, depositing inputs with similar values into bins that are formed by the relationship. A strong input-target relationship is formed when knowledge of the value of an input improves the ability to predict the value of the target. (De Ville, 2006)

Figure 12. Illustration of a decision tree.

Decision trees have many useful features, both in the traditional fields of science and engineering and in a range of applied areas, including business intelligence and data mining. These useful features include (De Ville, 2006):

- Decision trees produce results that communicate very well in symbolic and visual terms. Decision trees are easy to produce, easy to understand, and easy to use.
- One valuable feature is the ability to incorporate multiple predictors in a simple, step-by-step fashion. The ability to incrementally build highly complex rule sets (which are built on simple, single association rules) is both simple and powerful.
- Decision trees readily incorporate various levels of measurement (nominal, ordinal, and interval), regardless of whether the field serves as the target or as an input.
- Decision trees readily adapt to various twists and turns in data (unbalanced effects, nested effects, offsetting effects, interactions and nonlinearities) that frequently defeat other one-way and multi-way statistical and numeric approaches.
- Trees require little data preparation and perform well with large data sets in a short time.
- Decision trees are nonparametric and highly robust (for example, they readily accommodate missing values) and produce similar effects regardless of the level of measurement of the fields used to construct the tree branches (for example, a decision tree of income distribution will reveal similar results regardless of whether income is measured in thousands, in tens of thousands, or even as a discrete range of values from 1 to 5).

Trees also have their shortcomings (Neville, 1999):

- When the data contain no simple relationship between the inputs and the target, a tree of limited complexity is too simplistic.
- Even when a simple description is accurate, it may not be the only accurate one. A tree gives the impression that certain inputs uniquely explain the variations in the target; a completely different set of inputs might give a different explanation that is just as good.
- Trees may deceive; they may fit the data well but then predict new data worse than having no model at all. This is called over-fitting.
- They may fit the data well, predict well, and convey a good story, but then, if some of the original data are replaced with a fresh sample and a new tree is created, a completely different tree may emerge, using completely different inputs in the splitting rules and consequently conveying a completely different story.

Specific decision tree methods include the CHAID (Chi-squared Automatic Interaction Detection) and CART (Classification and Regression Trees) algorithms. The following discussion provides a brief description of these algorithms for building decision trees.

3.2.1.1 CHAID

CHAID is an acronym for "Chi-Squared Automatic Interaction Detection". This algorithm accepts either nominal or ordinal inputs; however, some software packages, such as SAS Business Analytics and Business Intelligence Software, accept interval inputs and automatically group the values into ranges before growing the tree.
The splitting criterion is based on P-values from the F distribution (interval targets) or the Chi-squared distribution (nominal targets). The P-values are adjusted to accommodate multiple testing. Missing values are treated as separate values: for nominal inputs, a missing value constitutes a new category; for ordinal inputs, a missing value is free of any order restrictions. The search for a split on an input proceeds stepwise. Initially, a branch is allocated for each value of the input. Branches are alternately merged and re-split as seems warranted by the P-values. The algorithm stops when no merge or re-splitting operation creates an adequate P-value. The final split is adopted. A common alternative, sometimes called the exhaustive method, continues merging to a binary split and then adopts the split with the most favorable P-value among all splits the algorithm considered. The tests of significance are used to determine whether inputs are significant descriptors of target values and, if so, what their strengths are relative to other inputs. Thus, after a split is adopted for an input, its P-value is adjusted, and the input with the best adjusted P-value is selected as the splitting variable. If the adjusted P-value is smaller than a specified threshold, then the node is split. Tree construction ends when all the adjusted P-values of the splitting variables in the unsplit nodes are above the user-specified threshold. (SAS Enterprise Miner Tutorial, 2010)

3.2.1.2 CART

The following is a description of the Breiman, Friedman, Olshen, and Stone Classification and Regression Trees method for building decision trees. More detailed information can be found in the following text: Breiman, L., Friedman, J.H., Olshen, R. A., and Stone, C. J. (1984), Classification and Regression Trees, Pacific Grove: Wadsworth. For this method, the inputs are either nominal or interval; ordinal inputs are treated as interval.
CART trees employ a binary splitting methodology, which produces binary decision trees; they do not use the kind of merge-and-split heuristic developed in the CHAID algorithm to grow multi-way splits. Classification and Regression Trees do not use the statistical hypothesis testing approach of the CHAID algorithm; instead, they rely on the empirical properties of a validation or resample data set to guard against overfitting. (De Ville, 2006)

The full methodology for growing and pruning branches in CART trees includes the following (De Ville, 2006; SAS Enterprise Miner Tutorial, 2010):

- For a continuous response field, both least squares and least absolute deviation measures can be employed. Deviations between training and test measures can be used to assess when the error rate has reached a point that justifies pruning the subtree below the error-calculation point.
- For a categorical response field, it is possible to use either the Gini diversity measure or the Twoing criterion. Ordered Twoing is a criterion for splitting ordinal target fields.
- Calculating misclassification costs of smaller decision trees is possible. Selecting the decision tree with the lowest or near-lowest cost is an option. Costs can be adjusted. Picking the smallest decision tree within one standard error of the lowest-cost decision tree is an option.

In addition to a validated decision tree structure, CART trees also:

- work with both continuous and categorical response variables.
- omit observations with a missing value in the splitting variable when creating a split.
- create surrogate splits and use them to assign observations to branches when the primary splitting variable is missing. If missing values prevent the use of the primary and surrogate splitting rules, then the observation is assigned to the largest branch (based on the within-node training sample).
- grow a larger-than-optimal decision tree and then prune it to a final decision tree using a variety of pruning rules.
- consider misclassification costs in the desirability of a split.
- use cost-complexity rules in the desirability of a split.
- split on linear and multiple linear combinations.
- do sub-sampling with large data sets.

3.2.2 Neural Networks

Neural networks (NNs) form a joint framework for regression and classification that has become widely used during the past decades, traditionally associated with machine learning and data mining. Because of their ability to approximate virtually any function, NNs are sometimes called universal approximators (Hornik et al., 1989). The study of artificial neural networks is motivated by their similarity to successfully working biological systems which, compared to the complete system, consist of very simple but numerous nerve cells that work massively in parallel and have the capability to learn. There is no need to explicitly program a neural network; instead, it can learn from training examples. One result of this learning procedure is the capability of neural networks to generalize and associate data. After successful training, a neural network can find reasonable solutions for similar problems of the same class that were not explicitly trained. A technical neural network consists of simple processing units, or neurons, which are connected by directed, weighted connections. Data are transferred between neurons via connections, with the connecting weight being either excitatory or inhibitory. A propagation function converts vector inputs to scalar network inputs. For a neuron, the propagation function receives the outputs of other neurons and transforms them, in consideration of the connecting weights, into a network input that can be used by the activation function. The activation function is the "switching status" of a neuron. Following its biological model, every neuron is always active to a certain extent.
The reactions of the neurons to the input values depend on this activation state. Neurons become activated if the network input exceeds their threshold value. The threshold value is explicitly assigned to the neurons and marks the position of the maximum gradient of the activation function. Centered on the threshold value, the activation function of a neuron reacts particularly sensitively. The activation of a neuron depends on the prior activation state of the neuron and the external input. Finally, an output function may be used to process the activation once again; the output function of a neuron calculates the values which are transferred to the other connected neurons. The learning strategy is an algorithm that can be used to change the neural network so that the network can be trained to produce a desired output for a given input. An error is computed from the difference between the desired response and the system output. This error information is fed back to the system and adjusts the system parameters in a systematic fashion. The process is repeated until the performance is acceptable. It is clear from this description that the performance hinges heavily on the data: if one does not have data that cover a significant portion of the operating conditions, then neural network technology is probably not the right solution. (Kriesel, 2005) The term neural network has evolved to encompass a large class of models and learning methods. Here we describe the most commonly used neural net, a feedforward multilayer perceptron (MLP) neural network model with one hidden layer. A more general description and analysis of the neural network framework can be found in Bishop (1995). This neural network is a two-stage regression or classification model, typically represented by a network diagram such as the one shown in Figure 13.

Figure 13. Schematic of a single hidden layer, feed-forward neural network.
For regression, there is typically only one output unit $Y_1$; however, these networks can handle multiple responses in a seamless fashion. Derived features $Z_m$ are created from linear combinations of the inputs, and the target $Y_k$ is then modeled as a function of linear combinations of the $Z_m$:

$$Z_m = \sigma(\alpha_{0m} + \alpha_m^T X), \quad m = 1,\ldots,M,$$
$$T_k = \beta_{0k} + \beta_k^T Z, \quad k = 1,\ldots,K,$$
$$f_k(X) = g_k(T), \quad k = 1,\ldots,K, \qquad (1)$$

where $Z = (Z_1, Z_2, \ldots, Z_M)$ and $T = (T_1, T_2, \ldots, T_K)$. The activation function $\sigma(v)$ is usually chosen to be the sigmoid $\sigma(v) = 1/(1+e^{-v})$. Sometimes a Gaussian radial basis function (Hastie et al., 2001) is used instead, producing what is known as a radial basis function network.

Neural network diagrams like the one in Figure 13 are sometimes drawn with an additional bias unit feeding into every unit in the hidden and output layers. Thinking of the constant "1" as an additional input feature, this bias unit captures the intercepts $\alpha_{0m}$ and $\beta_{0k}$ in model (1). The output function $g_k(T)$ allows a final transformation of the vector of outputs $T$. For regression we typically choose the identity function $g_k(T) = T_k$. Early work in classification also used the identity function, but this was later abandoned in favor of the softmax function

$$g_k(T) = \frac{e^{T_k}}{\sum_{l=1}^{K} e^{T_l}}.$$

This is of course exactly the transformation used in a multilogit model, and produces positive estimates that sum to one. The units in the middle of the network, computing the derived features $Z_m$, are called hidden units because the values $Z_m$ are not directly observed. In general there can be more than one hidden layer. We can think of the $Z_m$ as a basis expansion of the original inputs $X$; the neural network is then a standard linear model, or a linear multilogit model, using these transformations as inputs. (Hastie et al., 2001) The network shown in Figure 13 belongs to the class of feed-forward networks, in which the connections go from one layer to its successor only; there are no feedbacks. The fitting of the neural network model is done by searching for the weights that minimize the error function, which for regression often takes the form of a sum of squared errors:

$$R(\theta) = \sum_{k=1}^{K} \sum_{i=1}^{N} \left( y_{ik} - f_k(x_i) \right)^2.$$
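A literal reading of model (1) for the single-output regression case can be sketched in pure Python. The function names are ours, and this is purely illustrative, not the implementation used in the thesis:

```python
import math

def sigmoid(v):
    # The usual activation choice: sigma(v) = 1 / (1 + e^(-v)).
    return 1.0 / (1.0 + math.exp(-v))

def mlp_forward(x, alpha0, alpha, beta0, beta):
    # Derived features: Z_m = sigmoid(alpha_{0m} + alpha_m . X).
    z = [sigmoid(a0 + sum(a_j * x_j for a_j, x_j in zip(a, x)))
         for a0, a in zip(alpha0, alpha)]
    # Output layer with identity output function g(T) = T (regression):
    # f(X) = beta_0 + beta . Z
    return beta0 + sum(b * z_m for b, z_m in zip(beta, z))
```

With all $\alpha$ weights set to zero, every hidden unit outputs $\sigma(0) = 0.5$, so the prediction reduces to $\beta_0$ plus half the sum of the output weights, which makes the role of the bias units in model (1) easy to see.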
The practical use of neural networks has clear advantages but also some limitations.

Advantages:
- NNs involve human-like thinking.
- There is no need to assume an underlying probability distribution, as is usually done in statistical modeling.
- They handle noisy or missing data.
- They can work with a large number of variables or parameters.
- They create their own relationships amongst the information.
- NNs are applicable to multivariate non-linear problems.
- A neural network can perform tasks that a linear program cannot.
- When an element of the neural network fails, the network can continue without any problem because of its parallel nature.
- NNs learn and do not need to be reprogrammed.
- They provide general solutions with good predictive accuracy.

Disadvantages:
- Large NNs require high processing time.
- The individual relations between the input variables and the output variables are not developed by engineering judgment, so NN models tend to be black boxes or input/output tables without an analytical basis.

3.3 Approximation efficiency:

The efficiency of the predictions obtained by the different supervised learning methods can be quantified in many different ways. We have decided to use the Nash-Sutcliffe efficiency measure. The efficiency E proposed by Nash and Sutcliffe (1970) is defined as one minus the sum of the squared differences between the predicted and observed values, normalized by the variance of the observed values during the period under investigation. E is calculated as

$$E = 1 - \frac{\sum_{i=1}^{N} (O_i - P_i)^2}{\sum_{i=1}^{N} (O_i - \bar{O})^2}$$

(Krause et al., 2005), where $O_i$ are the observed values, $P_i$ the predicted values, and $\bar{O}$ the mean of the observed values. This measure can take values from minus infinity to one, and it is close to one if the prediction errors are small.

4 Results

4.1 Decision Trees:

Different techniques do better with different data, but trees should compete along with other methods. We decided to make an approximation of the CHAID and CART regression trees by making use of the Tree node in SAS Enterprise Miner. SAS Enterprise Miner provides a visual programming environment for predictive modeling.
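The Nash-Sutcliffe efficiency is straightforward to compute from paired observed and predicted values; a minimal sketch (the function name is ours):

```python
def nash_sutcliffe(observed, predicted):
    # E = 1 - sum((O_i - P_i)^2) / sum((O_i - mean(O))^2)
    mean_obs = sum(observed) / len(observed)
    squared_errors = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    obs_variance = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - squared_errors / obs_variance
```

Perfect predictions give E = 1, a model that always predicts the observed mean gives E = 0, and models worse than the mean give negative values, which is why values near 1 indicate small prediction errors.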
The SAS algorithms incorporate and extend most of the good ideas of the tree methods discussed in the methodology chapter. The two chosen tree methods were performed, producing a series of trees based on selected parameters. A number of common tree parameters were set to specific values to support appropriate assessment efforts. The remaining parameters were set according to the different algorithms performed in each tree node. Details about the parameter settings can be found in Appendix A. We first performed the CHAID tree method by building trees of different depths, varying from 6 to 15. Given that the target is a continuous value, we used the average squared error as the assessment measure. The results obtained are shown in Table 3:

Table 3. Depth and average squared error - CHAID tree.

Depth | ASE Training | ASE Validation
6     | 11.40        | 12.75
7     | 11.40        | 12.75
8     | 11.40        | 12.75
9     | 11.40        | 12.75
10    | 11.40        | 12.75
11    | 11.40        | 12.75
12    | 11.40        | 12.75
13    | 11.40        | 12.75
14    | 11.40        | 12.75
15    | 11.40        | 12.75

These results confirm that the predictive power of the tree will not be improved by building a more complex model. Thus, the depth of the CHAID tree was set to 6. In the same fashion, when varying the values for the depth of the tree in the CART model we obtained the results shown in Table 4:

Table 4. Depth and average squared error - CART tree.

Depth | ASE Training | ASE Validation
6     | 12.00        | 13.17
7     | 10.74        | 12.37
8     | 10.00        | 11.75
9     | 9.65         | 11.55
10    | 9.48         | 11.43
11    | 9.38         | 11.37
12    | 9.34         | 11.36
13    | 9.33         | 11.36
14    | 9.34         | 11.36
15    | 9.34         | 11.36

Figure 14 shows a plot of average squared error vs. depth for the CART model. There, we observe that the lines for the training and validation sets improve as the depth of the tree increases; however, after a depth of 10 the reduction in the average squared error is not significant.
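The selection rule applied to Table 4 — keep deepening the tree only while the validation ASE still drops appreciably — can be sketched as follows. The 0.1 gain threshold is our assumption for illustration; the thesis judges significance by eye from Figure 14:

```python
def select_depth(depths, valid_ase, min_gain=0.1):
    # Return the smallest depth beyond which every additional unit of
    # depth reduces the validation ASE by at most min_gain.
    for i, depth in enumerate(depths):
        remaining_gains = [valid_ase[j] - valid_ase[j + 1]
                           for j in range(i, len(depths) - 1)]
        if all(g <= min_gain for g in remaining_gains):
            return depth
    return depths[-1]

# Validation ASE column of Table 4 (CART tree, depths 6 to 15):
cart_valid_ase = [13.17, 12.37, 11.75, 11.55, 11.43,
                  11.37, 11.36, 11.36, 11.36, 11.36]
chosen = select_depth(list(range(6, 16)), cart_valid_ase)  # depth 10
```

Applied to the flat CHAID column of Table 3, the same rule returns depth 6 immediately, matching the choice made for the CHAID tree.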
Therefore, 10 is chosen as an appropriate value for the depth of the CART tree.

Figure 14. Average squared error vs. depth - CART tree.

To verify that the performance of the selected trees was acceptable, we carefully analyzed the results of the tree nodes, where we could find a number of diagnostic tools. First, we reviewed the assessment plots, which show tree evaluation information; trees are evaluated using the number of cases that are correctly predicted. For each tree size, the tree that correctly predicts the most training cases is selected to represent that size. The selected tree is then evaluated again with validation cases. The assessment plots in Figures 15 and 16 display lines of the modeling assessment statistic for the training and validation data sets across the number of leaves that are created. The plots allow us to evaluate the accuracy of the decision tree models by viewing the change in the average squared error as the trees grow, based on the number of leaves in the design.

Figure 15. Assessment plot - CHAID tree. The selected tree contains 253 leaves.

Figure 16. Assessment plot - CART tree. The selected tree contains 134 leaves.

The following situations can be identified in these plots: The lines for the training and validation data improve as the number of leaves increases. In Figure 15, the validation data confirm the progress of the training data until the number of leaves is around 100. After this point, the line for the validation data starts to flatten out and move apart from the training data line. A similar situation is encountered in Figure 16, where the validation data confirm the progress of the training data until the number of leaves is around 30.
If any of the decision tree models were selected as the best and final model, these plots would help us evaluate smaller trees that still perform well in terms of the assessment measure but are less complex and more reliable, and therefore might be more appropriate for the prediction process. The trees chosen by SAS Enterprise Miner as the best trees to use were the one with 253 leaves for the CHAID model and the one with 134 leaves for the CART model. These trees were selected because they optimized the assessment value on the training data set. The average number of observations assigned to each leaf was around 80 and 150 for the CHAID and CART tree respectively, which represents 0.28 and 0.52 percent of the total number of cases in the model set. The appropriate number of observations in a leaf needed to avoid overfitting or underfitting the training data set depends on the context, i.e., the size of the training data set; however, as a rule-of-thumb, an appropriate value would be between 0.25 and 1 percent of the model set. (Berry and Linoff, 1999) Subsequently, we constructed the color maps shown in Figures 17 and 18 in order to illustrate the importance each of the input variables had when building the decision trees. The higher the importance measure, the better the variable approximates the target values, and therefore variables with high importance represent strong splits.

Figure 17. Variable importance - CHAID tree.

Figure 18. Variable importance - CART tree.

Afterwards, we analyzed diagnostic graphics of the model fit. Figures 19 and 20 are scatter plots of observed vs. predicted values for the validation sets of the CHAID and CART models, respectively.

Figure 19. Scatter plot of predicted vs. observed values - CHAID tree.

Figure 20.
Scatter plot of predicted vs. observed values - CART tree.

From the two plots, it is easy to observe a large discrepancy between the observed and predicted values. Points lie far away from the 45 degree reference line that passes through the origin, indicating low predictive accuracy. Residual plots of the tree models were also evaluated. Residuals are helpful in evaluating the adequacy of the model itself relative to the data and any assumptions made in the analysis. If the model fits the data well, and the typical assumption of independent normally distributed residuals is also made, the plots of the residuals versus predicted values should not show any patterns or trends, i.e., they should be a random scatter of points. The plots of residuals in Figures 21 and 22 show a slightly increasing variation of the residuals as the predicted values increase, which may suggest that the assumption of equal variance of the residuals is not valid for this data. Nevertheless, it is hard to confirm this assumption, and it would be more natural to consider the plots of residuals within the limits one may expect when building a complex predictive model.

Figure 21. Scatter plot of residuals vs. predicted values - CHAID tree.

Figure 22. Scatter plot of residuals vs. predicted values - CART tree.

In addition, the histograms shown in Figures 23 and 24 provide a view of the overall distribution of the residuals. The plots appear to be bell-shaped; however, the pattern found in the plots of residuals vs. predicted values is also revealed in these histograms, which show tails that are too long for the residuals to be considered approximately normal.

Figure 23. Histogram of residuals - CHAID tree.

Figure 24.
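The funnel pattern suspected in Figures 21 and 22 can also be probed numerically, for example by comparing the residual variance in the lower and upper halves of the predicted range. This is a crude diagnostic of our own for illustration, not one used in the thesis:

```python
def variance_ratio(predicted, residuals):
    # Split cases at the median predicted value and compare the raw
    # residual variance (residuals assumed centered near zero) in the
    # two halves; a ratio well above 1 hints at unequal variance.
    med = sorted(predicted)[len(predicted) // 2]
    low = [r for p, r in zip(predicted, residuals) if p < med]
    high = [r for p, r in zip(predicted, residuals) if p >= med]

    def raw_var(xs):
        return sum(x * x for x in xs) / len(xs)

    return raw_var(high) / raw_var(low)
```

A ratio near 1 is consistent with the equal-variance assumption; a markedly larger value supports the visual impression that the residual spread grows with the predicted K.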
Histogram of residuals - CART tree.

Finally, we calculated the Nash-Sutcliffe efficiency measure for the validation sets of the CHAID and CART models to evaluate the performance of these trees. The values obtained, 0.61 for the CHAID tree and 0.65 for the CART tree, are far from 1, indicating a poor fit.

4.2 Neural Networks:

The Neural Network node in SAS Enterprise Miner enables us to fit nonlinear models such as a multilayer perceptron (MLP). NNs are flexible prediction models that, when carefully tuned, often provide optimal performance in regression and classification problems. There is no theory that tells us how to set the parameters of the network to approximate any given function. It is generally impossible to determine the correct design without training numerous networks and estimating the generalization error for each model. The design process and the training process are both iterative. We made use of the advanced user interface provided by the Neural Network node to create an MLP model. Figure 25 displays the constructed network.

Figure 25. Schematic representation of the MLP neural network model built in SAS Enterprise Miner.

The layer on the left represents the input layer and consists of all interval and nominal inputs. The middle layer is the hidden layer, in which the number of hidden units (neurons) was varied from 1 to 40, and 4 was selected as the optimal value based on the results shown in Table 5 and Figure 26. The layer on the right is the output layer, which corresponds to the target variable K. The propagation, activation and output functions were selected based on the default configuration specified in the methodology chapter.

Table 5. Average squared error of a feedforward MLP neural network model with one hidden layer.
Neurons | ASE Training | ASE Validation | Neurons | ASE Training | ASE Validation
1  | 5.49 | 6.39 | 21 | 2.24 | 4.09
2  | 4.12 | 5.30 | 22 | 2.72 | 4.11
3  | 3.70 | 4.93 | 23 | 2.75 | 4.37
4  | 3.57 | 4.25 | 24 | 2.49 | 3.78
5  | 3.38 | 4.73 | 25 | 2.39 | 4.12
6  | 3.25 | 4.23 | 26 | 2.24 | 3.91
7  | 3.07 | 4.34 | 27 | 2.49 | 3.87
8  | 3.19 | 4.39 | 28 | 2.24 | 4.05
9  | 3.31 | 4.39 | 29 | 2.32 | 3.94
10 | 2.73 | 4.40 | 30 | 2.27 | 4.03
11 | 3.43 | 4.57 | 31 | 2.14 | 4.00
12 | 2.95 | 4.41 | 32 | 2.33 | 3.94
13 | 3.20 | 4.40 | 33 | 2.23 | 4.07
14 | 2.97 | 4.64 | 34 | 2.30 | 3.97
15 | 2.74 | 4.05 | 35 | 2.05 | 4.06
16 | 2.70 | 4.22 | 36 | 1.98 | 3.92
17 | 2.65 | 4.34 | 37 | 1.92 | 4.09
18 | 2.78 | 4.30 | 38 | 1.96 | 4.04
19 | 2.78 | 3.94 | 39 | 2.20 | 3.95
20 | 2.51 | 4.20 | 40 | 2.00 | 4.10

Figure 26. Average squared error vs. number of neurons.

The number of hidden neurons affects how well the network is able to predict the output variable. A large number of hidden neurons will ensure correct learning and prediction of the data the network has been trained on, but its performance on new data may be compromised. On the other hand, with too few hidden neurons the network may be unable to learn the relationships among the data. Thus, selection of the number of hidden neurons is crucial. The trial and error approach performed for the selection of an appropriate number of hidden neurons started with a small number of neurons and gradually increased the number when the network failed to reduce the error. Although one hidden layer is always sufficient provided we have enough data, there are situations where a network with two or more hidden layers may require fewer hidden units and weights than a network with one hidden layer; thus, using extra hidden layers can sometimes improve generalization. We built a second model with two hidden layers, with the number of neurons varying from two to four and distributed differently among the layers during each run.
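The choice of 4 neurons from Table 5 follows a parsimony rule: take the smallest hidden layer whose validation ASE comes close to the best value seen anywhere in the sweep. A sketch of that rule (the 0.5 ASE tolerance is our assumption; the thesis makes the judgment from Figure 26):

```python
def select_neurons(neurons, valid_ase, tol=0.5):
    # Smallest hidden-layer size whose validation ASE is within tol of
    # the best validation ASE observed over the whole sweep.
    best = min(valid_ase)
    for n, ase in zip(neurons, valid_ase):
        if ase <= best + tol:
            return n

# Validation ASE column of Table 5 (1 to 40 hidden neurons):
valid_ase = [6.39, 5.30, 4.93, 4.25, 4.73, 4.23, 4.34, 4.39, 4.39, 4.40,
             4.57, 4.41, 4.40, 4.64, 4.05, 4.22, 4.34, 4.30, 3.94, 4.20,
             4.09, 4.11, 4.37, 3.78, 4.12, 3.91, 3.87, 4.05, 3.94, 4.03,
             4.00, 3.94, 4.07, 3.97, 4.06, 3.92, 4.09, 4.04, 3.95, 4.10]
chosen = select_neurons(list(range(1, 41)), valid_ase)  # 4 neurons
```

The overall minimum (3.78 at 24 neurons) is only slightly better than the 4.25 achieved with 4 neurons, so the far smaller network is preferred to limit overfitting.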
Estimations of the average squared error for each network are displayed in Table 6, and the results reveal that better predictions are not obtained by adding an extra hidden layer.

Table 6. Average squared error of a feedforward MLP neural network model with two hidden layers.

Neurons First Layer | Neurons Second Layer | ASE Training | ASE Validation
1 | 1 | 5.64 | 6.26
1 | 2 | 5.01 | 5.62
1 | 3 | 5.55 | 6.33
2 | 1 | 9.29 | 10.05
3 | 1 | 5.26 | 5.69
2 | 2 | 5.93 | 6.80

The plot shown in Figure 27 displays the average squared error for each iteration of the training and validation sets of the MLP model with one hidden layer and four neurons.

Figure 27. Assessment plot - MLP (1 hidden layer and 4 neurons).

The lines for the training and validation data improve as the number of iterations increases. By default, the node completed 100 iterations, but we could have continued the training process. However, given that the reduction in the average squared error was becoming less and less significant after the hundredth iteration, we decided to evaluate the default model. Color maps of the weight factors were constructed and are displayed in Figures 28-33. Each input has its own relative weight, which gives the inputs the impact that is needed during the training process. Weights determine the intensity of the input signals as registered by the neurons. Some input variables are considered more important than others, and the color maps illustrate the effect that each input has on the network.

Figure 28. Weight 1 - Variable L.
Figure 29. Weight 2 - Variable L.
Figure 30. Weight 3 - Variable L.
Figure 31. Weight 4 - Variable L.
Figure 32. Weights - Variable E.
Figure 33. Weights - Variables G and S.

Subsequently, a scatter plot of predicted vs. observed values was obtained; it is shown in Figure 34. This plot reveals that the MLP neural network model with one hidden layer and four neurons produced better predictive results for the output variable K than the CHAID and CART tree models.
Observed and predicted values are very close to each other, which is expected from an accurate model. Observations lie close to the 45 degree reference line that passes through the origin, showing a high correlation between the observed and predicted values. However, the closer we get to the minimum and especially the maximum values of the data, the more dispersed the points tend to be, indicating that predictions of those values are less accurate. These points, lying far away from the diagonal line, represent cases with a small number of observations.

Figure 34. Scatter plot of predicted vs. observed values.

Additionally, in Figure 35 we can observe that even though the residuals are fairly scattered around zero, there is a slight but discernible tendency for the residuals to increase as the predicted values increase. This indicates that the model performs less well when predicting high observed values.

Figure 35. Scatter plot of residuals vs. predicted values.

Once again, the histogram of the residuals shown in Figure 36 appears to follow a normal distribution pattern; however, the tails are too long. When building complex predictive models, such as trees or neural networks, it is acceptable to obtain residuals that behave as the ones in this figure.

Figure 36. Histogram of residuals.

Finally, we calculated the Nash-Sutcliffe efficiency measure for the validation set to evaluate the performance of the selected neural network. This time, the value obtained, 0.86, is closer to 1, indicating a reasonably good fit.
4.3 Scoring Process:

The final and most important step during the process of building a predictive model is the generalization or scoring process, i.e., how well the model makes predictions for cases that were not available at the time of training and that do not contain a target value. The Score node in SAS Enterprise Miner generates and manages scoring code that is produced by the tree or neural network nodes. The code is encapsulated and can be used in most SAS environments, even without the presence of Enterprise Miner. After scoring the 43% of the collected data for which the value of the variable K could not be calculated, we produced the overlaid histograms of observed and predicted values shown in Figure 37. The distribution of the predicted values is very similar to the distribution of the observed values, which indicates that we have obtained reliable predictive results.

Figure 37. Histograms of observed and predicted values for the variable K.

5 Discussion and conclusions

Throughout this thesis work, two well known supervised learning methods, regression trees and neural networks, were applied in order to build a predictive model that could accurately predict the output of the metric K. This metric describes the usage of Scania vehicles due to the road conditions and the driving needs, and it is used in the company during the process of developing vehicle components. The first major problem encountered when selecting the appropriate predictive method was the high dimensionality of the input data, as the presence of a large number of input variables can present severe problems for pattern recognition systems. In addition, the underlying distribution of the input dataset was unknown, as were the relationships between the input variables and the output variable, and the possible relations among the input variables themselves.
Given the complexity of the input dataset, methods that assume no distributional patterns in the data, and that can at the same time handle unknown high-dimensional relationships, were required. We first decided to implement CHAID and CART regression trees, as they are easy to produce, understand, and use. The tree methods' ability to incrementally build complex rules is simple and powerful, and they readily adapt to various twists and turns in the data. Nevertheless, given that the predictive results were not satisfactory, an MLP neural network model with one hidden layer and four neurons was performed. Neural networks are likewise commonly used to model complex relationships between inputs and outputs when there is little prior knowledge of these relationships. They also have the ability to detect all possible interactions between predictor variables. Moreover, no model assumptions have to be made; neural networks can solve difficult problems that cannot be solved quickly or accurately with conventional methods, given the latter's reliance on strict assumptions of normality, linearity, variable independence, etc. Finally, MLPs can approximate almost any function with a high degree of accuracy given enough data, enough hidden units, and enough training time. Evaluation of the methods' performance was based on the Nash-Sutcliffe efficiency measure, which showed that the selected neural network model was able to capture the patterns and unknown relations existing between the input data and the output metric K with an accuracy of 0.86, whereas the measures of model performance for the CHAID and CART trees were 0.61 and 0.65 respectively. One of the reasons for the high accuracy of the neural network model is its computation of adequate weights for each one of the input attributes, thus accounting for all the predictive information each of these attributes contains.
These weights are combined, and the computed values are passed along connections to hidden units and output units, where internal computations provide the nonlinearity that makes neural networks so powerful, and finally predicted output values close to the observed values are generated. On the other hand, both the CHAID and CART regression trees use a smaller number of inputs than the neural network model. They attempt to find strong relationships between the input and target variables, and only relationships that are strong enough are used for building the model. Some input attributes are treated as irrelevant or redundant and are not taken into account when building the predictive tree. Thus, the patterns and relations existing between these "irrelevant" input attributes and the output variable K are not captured, and the predicted values produced are not as accurate as those obtained with the neural network model. In addition, knowledge about the input variable matrix L indicates that some of its adjacent positions must be considered as a whole when analyzing the patterns present in the data, even when they are somehow correlated. The tree methods do not take this special feature of the input data into account, because attributes are treated one by one when producing the splitting rules, and in certain cases only some of them are considered as important inputs. On the contrary, neural networks take all input attributes into account when building the model, even if some of them have a certain degree of correlation. Some attempts were made to understand how the weights produced by the neural network were distributed over the input data set in such a way that they could capture the patterns shaped by adjacent positions of the matrix L.
However, plots of the computed weights did not reveal any apparent pattern in the distribution of the weights over the entire input set; thus, no evident explanation of how the neural network model relates adjacent positions of the matrix was found. One of the disadvantages of a neural network model is its "black box" nature, and therefore such models are often implemented when the prediction task is more important than the interpretation of the built model. Even though the neural network model outperformed the tree models, due to its complex structure it lacks a clear graphical representation of the results, and it also requires a longer computation time. Satisfactory results were also achieved when applying the scoring formula from the neural network model to new cases, i.e., the generation of predicted values for the fraction of the data set that did not contain the metric K as the target value. The results obtained showed that it is possible to rely on the predictive power of the neural network model, and further analysis, including other groups of vehicles built by Scania for different purposes, can be made based on the proposed model.

6 Literature

1. Abdullah, M. (2010). Decision Tree Induction & Clustering Techniques in SAS Enterprise Miner, SPSS Clementine, and IBM Intelligent Miner - A Comparative Analysis. IABR & ITLC Conference Proceedings.
2. Berry, M.J.A. and Linoff, G. (1999). Mastering Data Mining: The Art and Science of Customer Relationship Management. New York: John Wiley & Sons, Inc.
3. Bishop, C. (1995). Neural Networks for Pattern Recognition. New York: Oxford University Press Inc.
4. De Ville, B. (2006). Decision Trees for Business Intelligence and Data Mining: Using SAS® Enterprise Miner™. SAS Publishing.
5. Hastie, T., Tibshirani, R., Friedman, J. (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics.
6. Krause, P., Boyle, D.P., Bäse, F. (2005).
Comparison of different efficiency criteria for hydrological model assessment. Advances in Geosciences, 5, 89-97.
7. Kriesel, D. (2005). A Brief Introduction to Neural Networks. www.dkriesel.com.
8. Neville, P. (1999). Decision Trees for Predictive Modeling. SAS Institute Inc.
9. SAS Enterprise Miner Tutorial, retrieved in 2010.
10. Scania Inline, retrieved in 2010 from www.sacnia.inline.com.

7 Appendix

Appendix A. Parameters setting for the CHAID and CART algorithms

Tree parameters set to support appropriate assessment efforts:

Minimum number of observations in a leaf: The smaller this value is, the more likely it is that the tree will overfit the training data set. If the value is too large, it is likely that the tree will underfit the training data set and miss relevant patterns in the data. In SAS the default setting is max(5, n/1000), where n is the number of observations in the training set. In our case, the default value for the minimum number of observations in a leaf is 20, and better predictive results were not obtained when trying different values for this parameter.

Observations required for a split search: This option prevents the splitting of nodes with few observations. In other words, nodes with fewer observations than the value specified in observations required for a split search will not be split. The default is a calculated value that depends on the number of observations and the value stored in minimum number of observations in a leaf. The default value for our model is 202, and better predictive results were not obtained when trying different values.

Maximum depth of tree: This parameter was varied from 6 to 15 to allow complex trees to be grown. The size of a tree may be the most important single determinant of quality, more important, perhaps, than creating good individual splits. Trees that are too small do not describe the data well.
Trees that are too large have leaves with too little data to make reliable predictions about the contents of the leaf when the tree is applied to a new sample. Splits deep in a large tree may be based on too little data to be reliable.

Parameters set according to the different algorithms performed in each tree node:

Approximation of the CHAID algorithm by using the Tree node:

The Model assessment measure property was set to Average squared error. This measure is the average of the square of the difference between the predicted outcome and the actual outcome, and it is used to calculate the worth of a tree when the target is continuous. The worth of the tree is calculated by using the validation set to compare trees of different sizes in order to pick the tree with the optimal number of leaves.

The Splitting Criterion was set to F test to measure the degree of separation achieved by a split. The F test significance level was set to 0.05 as a stopping rule that accounts for the predictive reliability of the data. Partitioning stops when no split meets the threshold level of significance.

To avoid automatic pruning, the Subtree method was set to The most leaves. The subtree method determines which subtree is selected from the fully grown tree. This option selects the full tree, given that other options are relied on for stopping the training.

The Maximum number of branches from a node option was varied from 2 to 100, and 10 was chosen given that better predictive results were not obtained when this value was increased.

The Surrogate rules saved in each node option was set to 0. A surrogate rule is a back-up to the main splitting rule. When the main splitting rule relies on an input whose value is missing, the first surrogate rule is invoked. If the first surrogate also relies on an input whose value is missing, the next surrogate is invoked.
If missing values prevent the main rule and all of the surrogates from applying to an observation, then the main rule assigns the observation to the branch it has designated as receiving missing values. However, since missing values are not present in the data, surrogate rules were not used.

To force a heuristic search, the Maximum tries in an exhaustive split search option was set to 0. This option allows finding the optimal split, even if it is necessary to evaluate every possible split on a variable.

The Observations sufficient for split search option was set to the size of the training data set (20218). This option sets an upper limit on the number of observations used in the sample to determine a split. All observations in the node are then passed to the branches, and a new sample is taken within each branch independently.

The P-value adjustment was set to Kass, and the Apply Kass after choosing number of branches option was also selected. By choosing this option, the P-value is multiplied by a Bonferroni factor that depends on the number of branches, target values, and sometimes on the number of distinct input values. The algorithm applies this factor after the split is selected. The adjusted P-values are used when comparing splits on the same input and splits on different inputs.

Approximation of the CART algorithm by using the Tree node:

Trees created by using the Tree node are very similar to the ones grown by using the Classification and Regression Trees method without linear combination splits or Twoing or ordered Twoing splitting criteria. The Classification and Regression Trees method recommends using validation data unless the data set contains too few observations. The Tree node is intended for large data sets. The options in the Tree node were set as follows:

The Model assessment measure property was set to Average squared error.

The Splitting Criterion was set to Variance reduction.
This value measures the reduction in the squared error from the node means.

The Maximum number of branches from a node was set to 2.

The Treat missing as an acceptable value check box was selected. However, this option did not affect the results, since the data did not contain missing values.

The Surrogate rules saved in each node were set to 5. Yet, for the same reason mentioned above, these rules were not applied.

The Subtree method was set to Best assessment value. This option selects the smallest subtree with the best assessment value. Validation data is used during the selection process.

The Observations sufficient for split search were set to 1000.

The Maximum tries in an exhaustive split search were set to 5000. To find the optimal split, it is sometimes necessary to evaluate every possible split on a variable, and sometimes the number of possible splits is extremely large. In this case, if the number for a specific variable in a specific node is larger than 5000, then a heuristic (stepwise, hill-climbing) search algorithm is used instead for that variable in that node.

The P-value adjustment was set to Depth. By selecting this option, the P-values are adjusted for the number of ancestor splits, where the adjustment depends on the depth of the tree at which the split is done. Depth is measured as the number of branches in the path from the current node, where the splitting is taking place, to the root node. The calculated P-value is multiplied by a depth multiplier, based on the depth in the tree of the current node, to arrive at the depth-adjusted P-value of the split.
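Two of the rules quoted in this appendix — the default minimum leaf size and the Bonferroni-style (Kass or depth) p-value adjustment — can be sketched as follows. This is illustrative only; the actual multipliers are computed internally by SAS Enterprise Miner:

```python
def default_min_leaf(n_train):
    # SAS Enterprise Miner default for 'minimum number of observations
    # in a leaf': max(5, n/1000), where n is the training-set size.
    return max(5, n_train // 1000)

def adjusted_p_value(raw_p, bonferroni_factor):
    # Bonferroni-style adjustment: multiply the raw split p-value by a
    # factor (depending on branches, target values, or tree depth) and
    # cap the result at 1 before comparing candidate splits.
    return min(1.0, raw_p * bonferroni_factor)

# For this thesis's training set of 20218 observations, the default
# minimum leaf size evaluates to 20, matching the value reported above.
```

The cap at 1 matters: with many candidate branches the multiplier can be large, so weak splits are pushed to an adjusted p-value of 1 and effectively excluded from the comparison.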