CAR EVALUATION
THE NEURAL NET PROJECT

Rashid Lepshokov
Nikolay Pukhovskiy

HALDEN 2011

TABLE OF CONTENTS

Abstract
Introduction
Related work
Analyze the data set
Experiment
Input Data
Output Data
Dividing the dataset
Creation and configuration of the neural network
Training Algorithms
    Levenberg-Marquardt
    BFGS Quasi-Newton
    Resilient Backpropagation
Findings
Comparison with the C5.0
Conclusion
References
Appendix

Abstract

Because of the number and variety of cars on the market, customers have to think carefully about the right choice before buying one. A modern car has many different characteristics, so it is necessary to give the customer trustworthy information and to make choosing a car easier. This paper presents a C# application that converts the dataset into a suitable format, and a neural network built in the Neural Network Toolbox that solves the car evaluation task. The dataset is a modified version of the output of Marko Bohanec's simulation program (4). The network developed and investigated is a two-layer feedforward network with a sigmoid transfer function in the hidden layer and a sigmoid transfer function in the output layer. Experiments were carried out with this network to identify the best combination of network settings.

Introduction

Because of the number and variety of cars on the market, customers have to think carefully about the right choice before buying one. A modern car has many different characteristics, so it is necessary to give the customer trustworthy information and to make choosing a car easier.

The topic is relevant because of the importance of studying the range of cars available and the possibility of applying the resulting information in practice, for both legal entities and individuals.

A body of literature already helps consumers understand the variety of models, their consumer characteristics, and the various car brands. Changes and novelties in the automotive world are surveyed in current periodicals.

The purpose of this work is to apply decomposition methods to our data and to create an auxiliary system that helps the user make the correct choice.

Related work

While studying our data, we considered previous works (1)(2)(3). First, we consider the form of our data and its structure.
Our data is described by the following features:

Class values: unacc, acc, good, vgood

Attributes:
    buying:   vhigh, high, med, low
    maint:    vhigh, high, med, low
    doors:    2, 3, 4, 5more
    persons:  2, 4, more
    lug_boot: small, med, big
    safety:   low, med, high

Our data look as follows:

    vhigh,vhigh,2,4,big,med,unacc
    vhigh,vhigh,2,4,big,high,unacc
    vhigh,vhigh,2,more,small,low,unacc

As you can see, each record is a comma-separated string of attribute values. A review of similar research showed that this kind of data set is used very rarely. Other car evaluation studies used parameters such as fuel consumption, and in general relied on more structured and informative data about each car. For example, in our data the size of the boot is represented in this form:

    lug_boot: small, med, big

In related works it appears in another form:

    lug_boot: 10_l, 20_l, 30_l, 40_l, 50_l, 60_l, 70_l, 80_l

Their projects can therefore provide more useful information about the car. Other differences:

    Attribute   Our data set             Previous work
    persons     2, 4, more               2, 4, 5, 6, 7
    doors       2, 3, 4, 5more           2, 3, 4, 5, 6
    buying      vhigh, high, med, low    50000kr, 80000kr, 100000kr, 150000kr…
    maint       vhigh, high, med, low    not used
    safety      low, med, high           low, med, high
    lug_boot    small, med, big          10_l, 20_l, 30_l, 40_l, 50_l, 60_l, 70_l, 80_l

Analyze the data set

Based on the findings from previous works, we decided to change the structure of our data. We wrote a special program to produce the new data structure. Our dataset contains 1209 instances.
Now our data is as follows:

Class values: unacc, acc, good, vgood

Attributes (old):

    buying  maint  doors  persons  lug_boot  safety
    vhigh   vhigh  2      2        small     low
    high    high   3      4        med       med
    med     med    4      more     big       high
    low     low    5more

Attributes (new):

    buying  maint  doors  persons  lug_boot  safety
    0       0      2      2        0         0
    1       1      3      4        1         1
    2       2      4      5        2         2
    3       3      5

For the subsequent work with neural networks we used MATLAB.

Experiment

The dataset consists of six input attributes and one output attribute. All attributes in the dataset are stored as strings. To use the Neural Network Toolbox correctly, all attributes were converted to numeric values. The old and new values are shown in the tables below. To convert the dataset into a form suitable for the Neural Network Toolbox, a program was written in C# in Visual Studio 2008.

Input Data

Integers between 0 and 5 were chosen as the new values for the attributes.

Values (old):

    buying  maint  doors  persons  lug_boot  safety
    vhigh   vhigh  2      2        small     low
    high    high   3      4        med       med
    med     med    4      more     big       high
    low     low    5more

Values (new):

    buying  maint  doors  persons  lug_boot  safety
    0       0      2      2        0         0
    1       1      3      4        1         1
    2       2      4      5        2         2
    3       3      5

[Figure: histogram of values for the input attributes]

The histogram shows that the distribution of values of the input attributes is virtually uniform.

Output Data

    Old target   unacc  acc  good  vgood
    New target   0      1    2     3

[Figure: histogram of values for the target attribute]

The dataset is unevenly distributed. The most frequent output in the dataset is 0 (unacc).

[Figure: values of the target attribute across the file]

In this plot the vertical axis gives the value of the target attribute, and the horizontal axis gives the sequence number of the record in the file. As the chart shows, the distribution of the data in the dataset is uneven. This distribution influences the choice of the algorithm for dividing the dataset into training, validation and test sets.
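
The uneven class distribution described above can be checked directly on the encoded target values. The following is a minimal sketch, not part of the original program: the class name TargetDistribution, the method Count, and the sample values are hypothetical and serve only as illustration; only the codes 0 to 3 and their labels come from the Output Data table.

```csharp
using System;
using System.Collections.Generic;

static class TargetDistribution
{
    // Class labels in the order of their numeric codes (0..3), as in the Output Data table.
    static readonly string[] Labels = { "unacc", "acc", "good", "vgood" };

    // Counts how many times each encoded target value (0..3) occurs.
    public static int[] Count(IEnumerable<int> targets)
    {
        int[] counts = new int[Labels.Length];
        foreach (int t in targets)
        {
            if (t < 0 || t >= counts.Length)
                throw new ArgumentOutOfRangeException("targets", "unexpected target code " + t);
            counts[t]++;
        }
        return counts;
    }

    static void Main()
    {
        // Hypothetical sample of encoded targets, for illustration only
        // (these values are NOT taken from the dataset).
        int[] sample = { 0, 0, 0, 1, 0, 2, 0, 1, 3, 0 };
        int[] counts = Count(sample);

        // Print the absolute count and the share of each class.
        for (int i = 0; i < counts.Length; i++)
        {
            double share = 100.0 * counts[i] / sample.Length;
            Console.WriteLine("{0} ({1}): {2} records, {3:F1}%", Labels[i], i, counts[i], share);
        }
    }
}
```

Running the same count separately on the training, validation and test sets would show whether each of them still contains all four target values.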

Dividing the dataset

In this case it was convenient to select the training, validation and test sets at random. The reason is that the data in the dataset are ordered strictly, in groups of five units (Figure N). Choosing one contiguous chunk of data for each set would therefore have been wrong: it could lead to a situation where a set does not contain the complete range of values of the target attribute. With random selection we can be highly confident that the sets are evenly distributed. The recommended settings were used as the division ratio:

    Training:   70%
    Validation: 15%
    Testing:    15%

Creation and configuration of the neural network

A two-layer feedforward network was used, with a sigmoid transfer function in the hidden layer and a sigmoid transfer function in the output layer.

[Figure: diagram of the neural network]

The neural network consists of an input preprocessing unit, followed by a hidden layer, the core layer, a postprocessing unit, and the output.

Several experiments were conducted with different numbers of neurons in the hidden layer and different optimization algorithms.

[Figure: regression plots of network outputs against targets for the training, validation and test data]

The regression plots display the network outputs with respect to the targets for the training, validation, and test sets. For a perfect fit, the data should fall along a 45 degree line, where the network outputs are equal to the targets. For our problem the fit is reasonably good for all data sets, with R values in each case of 0.91 or above.

Also, the target and output plots differ slightly for output values 2 and 3. The reason is the uneven distribution of the target attribute in the dataset.
This shows that almost all of the network errors fall on these values. As a result, despite the very low overall number of errors, errors are more likely when the target attribute value is 2 or 3.

Training Algorithms

To study how the quality of the neural network depends on its configuration, we varied the number of neurons in the hidden layer and the training algorithm.

Three training algorithms were chosen:
    Levenberg-Marquardt;
    BFGS Quasi-Newton;
    Resilient Backpropagation.

The following numbers of neurons in the hidden layer were used: 3, 5, 10, 20, 40, 60, 80, 100.

The following parameters were used to compare the results:
    Mean Squared Error;
    Correlation between target and output data;
    Number of epochs.

The comparison was made on the test data set.

Levenberg-Marquardt

    Number of neurons     Mean Squared   Correlation between target   Number of
    in the hidden layer   Error (MSE)    and output data (R)          epochs
    3                     0.08           0.85                         38
    5                     0.06           0.86                         9
    10                    0.05           0.91                         8
    20                    0.03           0.98                         22
    40                    0.05           0.95                         8
    60                    0.09           0.86                         6
    80                    0.05           0.92                         7
    100                   0.07           0.908                        8

This algorithm showed good results for our neural network. Increasing the number of neurons in the hidden layer improves the quality of the neural network, but when the number of neurons reaches 40 the results begin to fall off.

[Figure: performance plot for the training, validation and test data]

The performance plot also shows that the curve for the training data deviates quite significantly from the curves for the validation and test data. This may indicate that the neural network is overfitting.

The very high value of R indicates a very strong correlation between the target and the output data. The value is very close to 1, which indicates a good fit.

The plot of the gradient at each iteration shows a smooth change in values, which may indicate that the network was trained with no apparent discontinuities of the gradient.
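
The two quality measures used in the comparison can be written out explicitly. The sketch below is an illustration under stated assumptions, not the Toolbox implementation: it assumes that R is the Pearson correlation coefficient between targets and outputs, and the class name FitMetrics, its methods, and the sample values are hypothetical.

```csharp
using System;

static class FitMetrics
{
    // Mean Squared Error: the average of the squared differences
    // between target values and network outputs.
    public static double MeanSquaredError(double[] targets, double[] outputs)
    {
        CheckLengths(targets, outputs);
        double sum = 0.0;
        for (int i = 0; i < targets.Length; i++)
        {
            double diff = targets[i] - outputs[i];
            sum += diff * diff;
        }
        return sum / targets.Length;
    }

    // Pearson correlation coefficient R between targets and outputs
    // (assumed here to be the R reported in the tables).
    // Returns NaN if either series is constant.
    public static double Correlation(double[] targets, double[] outputs)
    {
        CheckLengths(targets, outputs);
        int n = targets.Length;
        double meanT = 0.0, meanO = 0.0;
        for (int i = 0; i < n; i++)
        {
            meanT += targets[i];
            meanO += outputs[i];
        }
        meanT /= n;
        meanO /= n;

        double cov = 0.0, varT = 0.0, varO = 0.0;
        for (int i = 0; i < n; i++)
        {
            double dt = targets[i] - meanT;
            double d = outputs[i] - meanO;
            cov += dt * d;
            varT += dt * dt;
            varO += d * d;
        }
        return cov / Math.Sqrt(varT * varO);
    }

    static void CheckLengths(double[] a, double[] b)
    {
        if (a == null || b == null || a.Length == 0 || a.Length != b.Length)
            throw new ArgumentException("targets and outputs must be non-empty and of equal length");
    }

    static void Main()
    {
        // Hypothetical targets and network outputs, for illustration only.
        double[] targets = { 0, 0, 1, 2, 3, 0, 1 };
        double[] outputs = { 0.1, -0.1, 0.9, 1.6, 2.5, 0.2, 1.1 };
        Console.WriteLine("MSE = {0:F4}", MeanSquaredError(targets, outputs));
        Console.WriteLine("R   = {0:F4}", Correlation(targets, outputs));
    }
}
```

A lower MSE and an R closer to 1 both indicate a better fit, which is how the rows of the tables in this section are compared.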

BFGS Quasi-Newton

    Mean Squared   Correlation between target   Number of
    Error (MSE)    and output data (R)          epochs
    ~avg 0.3       ~avg 0.87                    ~avg 21

With the default settings the network finished its learning and adaptation too early, which resulted in a high Mean Squared Error. After the value of delta was increased to 0.9, the neural network showed much better results. With the default settings the R value was below 0.2; after the parameters were adjusted it fluctuated between 0.8 and 0.9 for different numbers of neurons in the hidden layer.

Resilient Backpropagation

    Number of neurons     Mean Squared   Correlation between target   Number of
    in the hidden layer   Error (MSE)    and output data (R)          epochs
    3                     0.13           0.72                         12
    5                     0.06           0.86                         41
    10                    0.04           0.93                         133
    20                    0.05           0.928                        149
    40                    0.03           0.96                         110
    60                    0.05           0.97                         84
    80                    0.02           0.98                         109
    100                   0.04           0.95                         137

This algorithm showed very good results with respect to the Mean Squared Error and the correlation between the outputs and the targets. However, it was more demanding in terms of performance, because learning and adaptation of the neural network took about 100 iterations on average. It was also noted that during the first 10-15 iterations the value of the Mean Squared Error function increased significantly and its change was not smooth.

Findings

[Chart: Levenberg-Marquardt and Resilient Backpropagation compared by number of neurons in the hidden layer]

The chart shows how the performance of the neural networks, the error rate and the correlation, depends on the number of neurons in the hidden layer. The curves for both algorithms behave similarly, but Resilient Backpropagation requires four times as many neurons in the hidden layer as Levenberg-Marquardt.

Comparison with the C5.0

After the experiments with the neural network in the Neural Network Tool, the results obtained were significantly better than those of the decision tree in C5.0. This means that the neural network is better suited to this type of classification problem than the decision tree.
Since there were missing values in the dataset, a decision tree could not give correct results in all cases. The neural network can generalize, which allows it to train on an incomplete set of data. When testing the neural network, we applied input data that were absent from the training sample, and the network classified them correctly. In the previous project we obtained an error rate of 0.37% on the test data set; with the neural network this figure was below 0.25%.

However, the decision tree in C5.0 took less time to run than training the neural network. C5.0 also offered less flexibility and fewer settings than the neural network, which could be configured with many different parameters and so gave a wide field for experiments.

In terms of comprehensibility, the Neural Network tool with its graphical interface has an advantage over C5.0 with its command-line interface.

Conclusion

In this work a two-layer feedforward network, with a sigmoid transfer function in the hidden layer and a sigmoid transfer function in the output layer, was developed and investigated. Experiments were conducted in which we changed the training algorithm and the number of neurons in the hidden layer. As a result of these experiments, we found that the best configuration for the network uses the Levenberg-Marquardt training algorithm with 20 neurons in the hidden layer. In this case the correlation between outputs and targets is 0.97 and the Mean Squared Error is 0.03. The neural network was trained for 23 iterations in this case.

After analyzing the results we see that the target and output plots in the regression graph differ slightly for output values 2 and 3.
The reason is the uneven distribution of the target attribute in the dataset. This tells us that almost all of the network error falls on these values. As a result, despite the very low overall error rate, errors are more likely when the target attribute value is 2 or 3. We can solve this problem by adding the missing input data for the relevant target values.

Our dataset was also processed for use in the Neural Network Toolbox. An external application was created for this purpose to convert the attribute values of the dataset.

References

1. E. Levrat, A. Voisin, S. Bombardier, J. Brémont. "Subjective evaluation of car seat comfort with fuzzy set techniques". International Journal of Intelligent Systems, Volume 12, Issue 11-12, pages 891–913, November-December 1997.
2. P. Chakroborty, S. Kikuchi. "Evaluation of the General Motors based car-following models and a proposed fuzzy inference model". Transportation Research Part C: Emerging Technologies, Volume 7, Issue 4, pages 209-235, August 1999.
3. S. Chundury, B. Wolshon. "Evaluation of CORSIM Car-Following Model by Using Global Positioning System Field Data". Transportation Research Record: Journal of the Transportation Research Board, pages 114-121.
4. Car Evaluation Data Set. Machine Learning Repository. [Online] 1990. [Cited: 25 April 2011.] http://archive.ics.uci.edu/ml/datasets/Car+Evaluation.
5. B. Zupan, M. Bohanec. Knowledge acquisition and explanation for multi-attribute decision making. In 8th Intl Workshop on Expert Systems and their Applications, Avignon, France, 1988, p. 223.

Appendix

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using System.IO;

    namespace CarDataSetFormater
    {
        class Program
        {
            static void Main(string[] args)
            {
                Divide();
            }

            // Collects the encoded target (class) values, one per line.
            static StringBuilder resBuilder = new StringBuilder();

            // Reads car.txt, encodes every record, and writes the inputs and
            // the targets to two separate files.
            static private void Divide()
            {
                StreamReader sr;
                StringBuilder builder;
                builder = new StringBuilder();
                builder.AppendLine("buying maint doors persons lug_boot safety");
                using (sr = new StreamReader(@"car.txt"))
                {
                    String line;
                    while ((line = sr.ReadLine()) != null)
                    {
                        line = formated(line);
                        builder.AppendLine(line);
                    }
                    sr.Close();
                }

                // Encoded input attributes (opened in append mode).
                using (StreamWriter sw = new StreamWriter(@"inputCar.txt", true, Encoding.UTF8))
                {
                    sw.Write(builder.ToString());
                    sw.Close();
                }

                // Encoded target attribute (opened in append mode).
                using (StreamWriter sw = new StreamWriter(@"targetCar.txt", true, Encoding.UTF8))
                {
                    sw.Write(resBuilder.ToString());
                    sw.Close();
                }

                Console.WriteLine(builder);
                Console.WriteLine(builder.Length);
                Console.ReadKey();
            }

            // Converts one comma-separated record into six numeric input
            // values and appends the numeric class value to resBuilder.
            private static string formated(string line)
            {
                String[] attrs = line.Split(',');
                if (attrs.Length != 7)
                    return " error ";

                String p1 = attrs[0];   // buying
                String p2 = attrs[1];   // maint
                String p3 = attrs[2];   // doors
                String p4 = attrs[3];   // persons
                String p5 = attrs[4];   // lug_boot
                String p6 = attrs[5];   // safety
                String resP = attrs[6]; // class

                // "vhigh" must be replaced before "high", because "high" is
                // a substring of "vhigh"; the "v1" line guards against that.
                p1 = p1.Replace("vhigh", "0");
                p1 = p1.Replace("high", "1");
                p1 = p1.Replace("med", "2");
                p1 = p1.Replace("low", "3");
                p1 = p1.Replace("v1", "0");

                p2 = p2.Replace("vhigh", "0");
                p2 = p2.Replace("high", "1");
                p2 = p2.Replace("med", "2");
                p2 = p2.Replace("low", "3");
                p2 = p2.Replace("v1", "0");

                p3 = p3.Replace("2", "0");
                p3 = p3.Replace("3", "1");
                p3 = p3.Replace("4", "2");
                p3 = p3.Replace("5more", "3");

                p4 = p4.Replace("2", "0");
                p4 = p4.Replace("4", "1");
                p4 = p4.Replace("more", "2");

                p5 = p5.Replace("small", "0");
                p5 = p5.Replace("med", "1");
                p5 = p5.Replace("big", "2");

                p6 = p6.Replace("high", "0");
                p6 = p6.Replace("med", "1");
                p6 = p6.Replace("low", "2");

                // "unacc" is replaced before "acc"; "vgood" becomes "v2"
                // after the "good" replacement and is fixed up afterwards.
                resP = resP.Replace("unacc", "0");
                resP = resP.Replace("acc", "1");
                resP = resP.Replace("good", "2");
                resP = resP.Replace("vgood ", "3");
                resP = resP.Replace("v2", "3");
                resP = resP.Replace("un1", "0");

                resBuilder.AppendLine(resP);
                return String.Format("{0} {1} {2} {3} {4} {5}", p1, p2, p3, p4, p5, p6);
            }
        }
    }
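
The chained Replace calls in the appendix program depend on the order in which they are applied, since "high" is a substring of "vhigh" and "good" is a substring of "vgood". A lookup table per column avoids this dependence. The following is a hedged sketch of that alternative, not the program that was actually used: the names CarEncoder, Map and Encode are hypothetical, and the numeric codes are assumed to be the ones produced by the appendix code.

```csharp
using System;
using System.Collections.Generic;

static class CarEncoder
{
    // Builds a lookup table that maps each value to its position in the list.
    static Dictionary<string, int> Map(params string[] values)
    {
        var map = new Dictionary<string, int>();
        for (int i = 0; i < values.Length; i++)
            map[values[i]] = i;
        return map;
    }

    // One lookup table per column, in the column order of car.txt.
    // The codes are assumed to match those produced by the appendix program.
    static readonly Dictionary<string, int>[] Columns =
    {
        Map("vhigh", "high", "med", "low"),    // buying
        Map("vhigh", "high", "med", "low"),    // maint
        Map("2", "3", "4", "5more"),           // doors
        Map("2", "4", "more"),                 // persons
        Map("small", "med", "big"),            // lug_boot
        Map("high", "med", "low"),             // safety
        Map("unacc", "acc", "good", "vgood")   // class (target)
    };

    // Encodes one comma-separated record into seven integers:
    // six input attributes followed by the target.
    public static int[] Encode(string line)
    {
        string[] attrs = line.Split(',');
        if (attrs.Length != Columns.Length)
            throw new FormatException("expected " + Columns.Length + " fields: " + line);

        int[] codes = new int[attrs.Length];
        for (int i = 0; i < attrs.Length; i++)
        {
            int code;
            if (!Columns[i].TryGetValue(attrs[i].Trim(), out code))
                throw new FormatException("unknown value '" + attrs[i] + "' in column " + i);
            codes[i] = code;
        }
        return codes;
    }

    static void Main()
    {
        // A record taken from the sample rows shown in the Related work section.
        int[] codes = Encode("vhigh,vhigh,2,4,big,med,unacc");
        Console.WriteLine(string.Join(" ", Array.ConvertAll(codes, c => c.ToString())));
    }
}
```

Unlike the Replace chain, this variant rejects unknown values instead of passing them through unchanged, which makes malformed records visible early.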