CAR EVALUATION
THE NEURAL NET PROJECT
Rashid Lepshokov
Nikolay Pukhovskiy
HALDEN 2011
TABLE OF CONTENTS
Abstract
Introduction
Related work
Analyze the data set
Experiment
Input Data
Output Data
Dividing the dataset
Creation and configuration of the neural network
Training Algorithms
    Levenberg-Marquardt
    BFGS Quasi-Newton
    Resilient Backpropagation
Findings
Comparison with the C5.0
Conclusion
References
Appendix
Abstract
Because of the variety and number of cars on the market, customers have to
think carefully before buying one. A modern car has many different
characteristics, so customers need trustworthy information and a simpler
procedure for choosing a car.
This paper presents a C# application that converts the dataset into a
suitable format, and a neural network built in the Neural Network Toolbox that
solves the task of car evaluation. The dataset is a modified result of the
simulation program of Marko Bohanec (4). The network developed and
investigated is a two-layer feedforward network with a sigmoid transfer
function in the hidden layer and a sigmoid transfer function in the output
layer. Experiments were carried out with this network to identify the best
combination of network settings.
Introduction
Because of the variety and number of cars on the market, customers have to
think carefully before buying one. A modern car has many different
characteristics, so customers need trustworthy information and a simpler
procedure for choosing a car. The topic is relevant because studying the range
of available cars matters, and because the resulting information can be applied
in practice by both legal entities and private individuals. A body of
literature already helps consumers understand the variety of models, consumer
characteristics, and car brands, and changes and novelties in the automotive
world are surveyed in current periodicals. The purpose of this work is to apply
decomposition methods to our data and to create an auxiliary system that helps
the user make the correct choice.
Related work
While studying our data, we considered previous works (1)(2)(3). First, we
look at the form and structure of our data.
Our data are described by the following features:
Class Values:
unacc, acc, good, vgood
Attributes:
buying: vhigh, high, med, low.
maint: vhigh, high, med, low.
doors: 2, 3, 4, 5more.
persons: 2, 4, more.
lug_boot: small, med, big.
safety: low, med, high.
Our data are as follows:
vhigh,vhigh,2,4,big,med,unacc
vhigh,vhigh,2,4,big,high,unacc
vhigh,vhigh,2,more,small,low,unacc
As can be seen, each record is a comma-separated string of attribute values. A
review of similar research showed that this kind of data set is used very
rarely. Other car evaluation studies used parameters such as fuel consumption,
and the related works used more structured and more informative data about each
car. For example, in our data the size of the boot is represented in this form:
lug_boot: small, med, big.
In related works it appears in another form:
lug_boot: 10_l, 20_l, 30_l, 40_l, 50_l, 60_l, 70_l, 80_l.
As a result, those projects can provide more useful information about the car.
Other differences:
Attribute   Our data set             Previous work
persons     2, 4, more               2, 4, 5, 6, 7
doors       2, 3, 4, 5more           2, 3, 4, 5, 6
buying      vhigh, high, med, low    50000kr, 80000kr, 100000kr, 150000kr, …
maint       vhigh, high, med, low    <not used>
safety      low, med, high           low, med, high
lug_boot    small, med, big          10_l, 20_l, 30_l, 40_l, 50_l, 60_l, 70_l, 80_l
Analyze the data set
Based on the findings of previous works, we concluded that the structure of our
data had to be changed. We wrote a special program to produce the new data
structure. Recall that our dataset contains 1209 instances.
The data now have the following structure:
Class Values:
unacc, acc, good, vgood
Attributes (old):
Attributes   buying   maint   doors   persons   lug_boot   safety
Type         vhigh    vhigh   2       2         small      low
             high     high    3       4         med        med
             med      med     4       more      big        high
             low      low     5more
Attributes (new):
Attributes   buying   maint   doors   persons   lug_boot   safety
Type         0        0       2       2         0          0
             1        1       3       4         1          1
             2        2       4       5         2          2
             3        3       5
For the subsequent work with neural networks we used MATLAB.
Experiment
The dataset consists of six input attributes and one output attribute. All
attributes are stored as strings. To use the Neural Network Toolbox correctly,
all attributes were converted to numeric values. The tables below show the old
and the new values. The conversion was done by a program written in C# in
Visual Studio 2008.
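The conversion can be illustrated with a minimal Python sketch. It is not the C# converter from the Appendix: the value tables are copied from the "Attributes (new)" table, and the names ENCODING, TARGET and encode_record are ours, introduced only for this illustration.

```python
# Value tables copied from the "Attributes (new)" table of this report.
# The names ENCODING, TARGET and encode_record are illustrative only.
ENCODING = {
    "buying":   {"vhigh": 0, "high": 1, "med": 2, "low": 3},
    "maint":    {"vhigh": 0, "high": 1, "med": 2, "low": 3},
    "doors":    {"2": 2, "3": 3, "4": 4, "5more": 5},
    "persons":  {"2": 2, "4": 4, "more": 5},
    "lug_boot": {"small": 0, "med": 1, "big": 2},
    "safety":   {"low": 0, "med": 1, "high": 2},
}
TARGET = {"unacc": 0, "acc": 1, "good": 2, "vgood": 3}
ATTRIBUTES = list(ENCODING)  # column order of the source file


def encode_record(line):
    """Convert one comma-separated record into (inputs, target)."""
    fields = [field.strip() for field in line.split(",")]
    if len(fields) != len(ATTRIBUTES) + 1:
        raise ValueError(f"expected 7 fields, got {len(fields)}: {line!r}")
    # zip stops after the six input attributes; the last field is the class.
    inputs = [ENCODING[name][value] for name, value in zip(ATTRIBUTES, fields)]
    return inputs, TARGET[fields[-1]]


print(encode_record("vhigh,vhigh,2,4,big,med,unacc"))
```

A dictionary lookup fails loudly on an unknown value, which makes malformed records easy to spot before they reach the network.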
Input Data
Integers between 0 and 5 were chosen as new values for attributes.
Values (old):
buying   maint   doors   persons   lug_boot   safety
vhigh    vhigh   2       2         small      low
high     high    3       4         med        med
med      med     4       more      big        high
low      low     5more
Values (new):
buying   maint   doors   persons   lug_boot   safety
0        0       2       2         0          0
1        1       3       4         1          1
2        2       4       5         2          2
3        3       5
Below is a histogram of the values of the input attributes.
The graph shows that the distribution of the values of the input attributes is
virtually uniform.
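The counts behind such a histogram can be obtained with a short Python sketch. The helper value_counts is our own name, and the three sample records from the "Related work" section are far too few to show uniformity; they only demonstrate the counting.

```python
from collections import Counter

ATTRIBUTES = ["buying", "maint", "doors", "persons", "lug_boot", "safety"]


def value_counts(lines):
    """Count how often each value occurs for every input attribute."""
    counts = {name: Counter() for name in ATTRIBUTES}
    for line in lines:
        fields = line.strip().split(",")
        # zip ignores the seventh field, which is the class label.
        for name, value in zip(ATTRIBUTES, fields):
            counts[name][value] += 1
    return counts


sample = [
    "vhigh,vhigh,2,4,big,med,unacc",
    "vhigh,vhigh,2,4,big,high,unacc",
    "vhigh,vhigh,2,more,small,low,unacc",
]
for name, counter in value_counts(sample).items():
    print(name, dict(counter))
```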
Output Data
Old Target   unacc   acc   good   vgood
New Target   0       1     2      3
The histogram below shows the values of the target attribute.
The dataset is unevenly distributed. The most frequent output value in the
dataset is 0 (unacc).
The chart below plots the values of the target attribute in the order in which
they appear in the file.
The vertical axis shows the value of the attribute, and the horizontal axis
shows the serial number of the record in the file. As the chart shows, the
distribution of the data in the dataset is uneven. This distribution influences
the choice of the algorithm for dividing the data into training, validation,
and test sets.
Dividing the dataset
In this case, it was convenient to select the training, validation, and test
sets at random. The reason is that the data in the dataset are arranged
strictly in order, in groups of five (Figure N). Choosing one contiguous chunk
of data for each set would therefore have been wrong, because a set could end
up without a complete range of target attribute values. Random selection gives
high confidence that the sets are evenly distributed. The recommended settings
were used as the split percentages:
• Training – 70%;
• Validation – 15%;
• Testing – 15%.
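A random division with these proportions can be sketched in Python as follows. This is an illustration rather than the toolbox's own routine: the function name split_indices and the fixed seed are our assumptions, and the figure 1209 is the dataset size stated earlier.

```python
import random


def split_indices(n, train=0.70, val=0.15, seed=0):
    """Randomly split range(n) into training, validation and test index lists."""
    indices = list(range(n))
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(indices)
    n_train = round(n * train)
    n_val = round(n * val)
    # Whatever remains after training and validation goes to the test set.
    return (indices[:n_train],
            indices[n_train:n_train + n_val],
            indices[n_train + n_val:])


train_idx, val_idx, test_idx = split_indices(1209)
print(len(train_idx), len(val_idx), len(test_idx))
```

Because the indices are shuffled before slicing, each set draws records from the whole file instead of from one ordered block.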
Creation and configuration of the neural network
A two-layer feedforward network was used, with a sigmoid transfer
function in the hidden layer and a sigmoid transfer function in the output layer.
The diagram below shows the neural network.
The neural network consists of an input preprocessing unit, followed by a
hidden layer, the core layer, a postprocessing unit, and the output.
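The forward pass of such a two-layer network can be sketched in plain Python. The sketch assumes the logistic form of the sigmoid and a single output neuron, it leaves out the preprocessing and postprocessing units, and all weights are hypothetical values chosen only for illustration.

```python
import math


def sigmoid(x):
    """Logistic sigmoid, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """Forward pass: inputs -> sigmoid hidden layer -> one sigmoid output."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + bias)
        for row, bias in zip(hidden_weights, hidden_biases)
    ]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)


# Hypothetical weights for a network with 6 inputs and 2 hidden neurons.
hidden_weights = [[0.1, -0.2, 0.05, 0.0, 0.3, -0.1],
                  [-0.3, 0.1, 0.2, 0.1, -0.2, 0.4]]
hidden_biases = [0.0, 0.1]
output_weights = [0.7, -0.5]
output_bias = 0.2

record = [0, 0, 2, 4, 2, 1]  # one encoded record (illustrative values)
print(forward(record, hidden_weights, hidden_biases, output_weights, output_bias))
```

With a logistic output the raw result lies between 0 and 1, so a postprocessing step is needed to map it back onto the target range 0 to 3.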
Several experiments were conducted with different numbers of neurons in the
hidden layer and different optimization algorithms.
The figure below shows the correlation between targets and outputs for the
training, validation, and test data.
The regression plots display the network outputs with respect to the targets
for the training, validation, and test sets. For a perfect fit, the data should
fall along a 45 degree line, where the network outputs are equal to the
targets. For our problem the fit is reasonably good for all data sets, with R
values in each case of 0.91 or above.
The target and output plots also differ slightly for output values 2 and 3. The
reason is the uneven distribution of the target attribute in the dataset:
almost all of the network's errors fall on these values. As a result, although
the overall number of errors is very low, errors are more likely when the
target attribute value is 2 or 3.
Training Algorithms
To study how the quality of the neural network depends on its configuration, we
varied the number of neurons in the hidden layer and the training algorithm.
Three training algorithms were chosen:
• Levenberg-Marquardt;
• BFGS Quasi-Newton;
• Resilient Backpropagation.
The following numbers of neurons in the hidden layer were used: 3, 5, 10, 20,
40, 60, 80, 100.
The following parameters were used to compare the results:
• Mean Squared Error;
• Correlation between target and output data;
• Number of epochs.
The comparison was made on the test data set.
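The first two comparison parameters can be computed as in the Python sketch below. It assumes that R is the Pearson correlation coefficient between targets and outputs; the sample values are invented for illustration and are not results from our experiments.

```python
import math


def mean_squared_error(targets, outputs):
    """Average of the squared differences between targets and outputs."""
    return sum((t - o) ** 2 for t, o in zip(targets, outputs)) / len(targets)


def correlation(targets, outputs):
    """Pearson correlation coefficient R between targets and outputs."""
    n = len(targets)
    mean_t = sum(targets) / n
    mean_o = sum(outputs) / n
    cov = sum((t - mean_t) * (o - mean_o) for t, o in zip(targets, outputs))
    var_t = sum((t - mean_t) ** 2 for t in targets)
    var_o = sum((o - mean_o) ** 2 for o in outputs)
    return cov / math.sqrt(var_t * var_o)


# Invented example values, not measurements from this project.
targets = [0, 0, 1, 2, 3]
outputs = [0.1, -0.1, 1.2, 1.8, 2.9]
print(mean_squared_error(targets, outputs))
print(correlation(targets, outputs))
```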
Levenberg-Marquardt
Number of neurons     Mean Squared   Correlation between target   Number of
in the hidden layer   Error (MSE)    and output data (R)          epochs
3                     0.08           0.85                         38
5                     0.06           0.86                         9
10                    0.05           0.91                         8
20                    0.03           0.98                         22
40                    0.05           0.95                         8
60                    0.09           0.86                         6
80                    0.05           0.92                         7
100                   0.07           0.908                        8
This algorithm showed good results for our neural network. Increasing the
number of neurons in the hidden layer improves the quality of the network up to
20 neurons; from 40 neurons onward the results begin to fall off.
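The best row of the Levenberg-Marquardt table can be picked out mechanically, as in this small Python sketch. The rows are copied from the table above; the selection rule (lowest MSE, ties broken by highest R) is our assumption for the illustration.

```python
# (hidden neurons, MSE, R, epochs), copied from the Levenberg-Marquardt table.
LM_RESULTS = [
    (3, 0.08, 0.85, 38),
    (5, 0.06, 0.86, 9),
    (10, 0.05, 0.91, 8),
    (20, 0.03, 0.98, 22),
    (40, 0.05, 0.95, 8),
    (60, 0.09, 0.86, 6),
    (80, 0.05, 0.92, 7),
    (100, 0.07, 0.908, 8),
]

# Lowest MSE wins; among equal MSE values the higher R is preferred.
best = min(LM_RESULTS, key=lambda row: (row[1], -row[2]))
print(best)
```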
The picture above also shows that the curve for the training data deviates
quite significantly from the curves for the validation and test data. This may
indicate that the neural network is overfitting.
The very high value of R indicates a strong correlation between the target and
the output data. The value is very close to 1, which indicates a good fit.
The graph of the gradient at each iteration changes smoothly, which may
indicate that the network was trained with no apparent discontinuities of the
gradient.
BFGS Quasi-Newton
Mean Squared   Correlation between target   Number of
Error (MSE)    and output data (R)          epochs
~avg 0.3       ~avg 0.87                    ~avg 21
With the default settings, the network finished its learning and adaptation too
early, which resulted in a high Mean Squared Error. After the value of delta
was increased to 0.9, the neural network showed much better results. With the
default value, R was less than 0.2. After the parameters were adjusted, R
fluctuated between 0.8 and 0.9 for different numbers of neurons in the hidden
layer.
Resilient Backpropagation
Number of neurons     Mean Squared   Correlation between target   Number of
in the hidden layer   Error (MSE)    and output data (R)          epochs
3                     0.13           0.72                         12
5                     0.06           0.86                         41
10                    0.04           0.93                         133
20                    0.05           0.928                        149
40                    0.03           0.96                         110
60                    0.05           0.97                         84
80                    0.02           0.98                         109
100                   0.04           0.95                         137
This algorithm showed very good results with respect to the Mean Squared Error
and the correlation between the outputs and the targets. However, it was more
demanding computationally, because learning and adaptation of the neural
network took over 100 iterations in most configurations. It was also noted that
during the first 10-15 iterations the value of the Mean Squared Error function
increased significantly and its change was not smooth.
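The idea behind resilient backpropagation can be sketched in Python. This is a simplified variant without weight backtracking, applied to a toy quadratic function rather than to our network; the function names and the parameter values (growth factor 1.2, shrink factor 0.5) are common textbook choices that we assume here, not settings taken from the toolbox.

```python
def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def rprop_minimize(gradient, weights, epochs=100, step_init=0.1,
                   eta_plus=1.2, eta_minus=0.5, step_min=1e-6, step_max=50.0):
    """Simplified Rprop: only the sign of each partial derivative is used."""
    weights = list(weights)
    steps = [step_init] * len(weights)
    prev_grad = [0.0] * len(weights)
    for _ in range(epochs):
        grad = gradient(weights)
        for i, g in enumerate(grad):
            if prev_grad[i] * g > 0:
                # Same direction as before: grow the step size.
                steps[i] = min(steps[i] * eta_plus, step_max)
            elif prev_grad[i] * g < 0:
                # Sign flip means the minimum was overshot: shrink the step.
                steps[i] = max(steps[i] * eta_minus, step_min)
            weights[i] -= sign(g) * steps[i]
            prev_grad[i] = g
    return weights


# Toy problem: minimize (w0 - 3)^2 + (w1 + 1)^2, whose minimum is at (3, -1).
def toy_gradient(w):
    return [2 * (w[0] - 3), 2 * (w[1] + 1)]


print(rprop_minimize(toy_gradient, [0.0, 0.0]))
```

Because each weight moves by a fixed step regardless of the gradient magnitude, the error can jump while the step sizes are still adapting, which may help explain the non-smooth behaviour in the first iterations.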
Findings
[Chart: Levenberg-Marquardt and Resilient Backpropagation series]
The chart shows how the performance of the neural networks (error rate and
correlation) depends on the number of neurons in the hidden layer. Both
algorithms behave similarly, but Resilient Backpropagation requires four times
as many neurons in the hidden layer as Levenberg-Marquardt to reach its best
result (80 versus 20).
Comparison with the C5.0
The results obtained in experiments with the neural network in the Neural
Network Tool were significantly better than those of the decision tree in C5.0.
This suggests that the neural network is better suited to this type of
classification problem than the decision tree. Because there were missing
values in the dataset, the decision tree could not give correct results in all
cases. The neural network is able to generalize, which allows it to be trained
on an incomplete set of data. When testing the neural network, we applied input
data that were absent from the training sample, and the network classified them
correctly. In the previous project we obtained an error rate of 0.37% on the
test data set; with the neural network this figure was below 0.25%. However,
the decision tree in C5.0 took less time to run than training the neural
network. The decision tree in C5.0 also offered less flexibility and fewer
settings than the neural network, whose many configurable parameters left a
wide field for experiments. In terms of comprehensibility, the Neural Network
Tool with its GUI has an advantage over C5.0 with its command-line interface.
Conclusion
In this work, a two-layer feedforward network with a sigmoid transfer function
in the hidden layer and a sigmoid transfer function in the output layer was
developed and investigated. Experiments were conducted in which we changed the
training algorithm and the number of neurons in the hidden layer. These
experiments showed that the best configuration for the network uses
Levenberg-Marquardt as the training algorithm with 20 neurons in the hidden
layer. In this case, the correlation between outputs and targets is 0.97 and
the Mean Squared Error is 0.03. The neural network in this case was trained for
23 iterations.
After analyzing the results, we see that the target and output values in the
regression graph differ slightly for output values 2 and 3. The reason is the
uneven distribution of the target attribute in the dataset: almost all of the
network error falls on these values. As a result, although the overall error
rate is very low, errors are more likely when the target attribute value is 2
or 3.
This problem can be addressed by adding the missing input data for the relevant
target attributes. In addition, our dataset was processed for use in the Neural
Network Toolbox; for this purpose an external application was created to
convert the attribute values of the dataset.
References
1. E. Levrat, A. Voisin, S. Bombardier, J. Brémont. "Subjective evaluation of
car seat comfort with fuzzy set techniques." International Journal of
Intelligent Systems, Volume 12, Issue 11-12, pages 891–913, November-December
1997.
2. P. Chakroborty, S. Kikuchi. "Evaluation of the General Motors based
car-following models and a proposed fuzzy inference model." Transportation
Research Part C: Emerging Technologies, Volume 7, Issue 4, pages 209-235,
August 1999.
3. S. Chundury, B. Wolshon. "Evaluation of CORSIM Car-Following Model by Using
Global Positioning System Field Data." Transportation Research Record: Journal
of the Transportation Research Board, pages 114-121.
4. Car Evaluation Data Set. Machine Learning Repository. [Online] 1990.
[Cited: April 25, 2011.] http://archive.ics.uci.edu/ml/datasets/Car+Evaluation.
5. B. Zupan, M. Bohanec. Knowledge acquisition and explanation for
multi-attribute decision making. In 8th Intl Workshop on Expert Systems and
their Applications, Avignon, France, 1988, p. 223.
Appendix
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
namespace CarDataSetFormater
{
    class Program
    {
        // Collects the numeric target value of every record, one per line.
        static StringBuilder resBuilder = new StringBuilder();

        static void Main(string[] args)
        {
            Divide();
        }

        // Reads car.txt, converts every record, and writes the inputs to
        // inputCar.txt and the targets to targetCar.txt.
        static private void Divide()
        {
            StringBuilder builder = new StringBuilder();
            builder.AppendLine("buying maint doors persons lug_boot safety");
            using (StreamReader sr = new StreamReader(@"car.txt"))
            {
                String line;
                while ((line = sr.ReadLine()) != null)
                {
                    line = formated(line);
                    builder.AppendLine(line);
                }
            }
            // The second argument (true) appends to an existing file.
            using (StreamWriter sw = new StreamWriter(@"inputCar.txt", true, Encoding.UTF8))
            {
                sw.Write(builder.ToString());
            }
            using (StreamWriter sw = new StreamWriter(@"targetCar.txt", true, Encoding.UTF8))
            {
                sw.Write(resBuilder.ToString());
            }
            Console.WriteLine(builder);
            Console.WriteLine(builder.Length);
            Console.ReadKey();
        }

        // Converts one comma-separated record into a line of numeric input
        // values and appends the numeric target value to resBuilder.
        private static string formated(string line)
        {
            String[] attrs = line.Split(',');
            if (attrs.Length != 7) return " error ";
            String p1 = attrs[0];
            String p2 = attrs[1];
            String p3 = attrs[2];
            String p4 = attrs[3];
            String p5 = attrs[4];
            String p6 = attrs[5];
            String resP = attrs[6];
            // buying: "vhigh" is replaced before "high", which it contains.
            p1 = p1.Replace("vhigh", "0");
            p1 = p1.Replace("high", "1");
            p1 = p1.Replace("med", "2");
            p1 = p1.Replace("low", "3");
            // maint: same values as buying.
            p2 = p2.Replace("vhigh", "0");
            p2 = p2.Replace("high", "1");
            p2 = p2.Replace("med", "2");
            p2 = p2.Replace("low", "3");
            // doors: the order matters, because replaced digits must not be
            // matched again by a later rule.
            p3 = p3.Replace("2", "0");
            p3 = p3.Replace("3", "1");
            p3 = p3.Replace("4", "2");
            p3 = p3.Replace("5more", "3");
            // persons
            p4 = p4.Replace("2", "0");
            p4 = p4.Replace("4", "1");
            p4 = p4.Replace("more", "2");
            // lug_boot
            p5 = p5.Replace("small", "0");
            p5 = p5.Replace("med", "1");
            p5 = p5.Replace("big", "2");
            // safety
            p6 = p6.Replace("high", "0");
            p6 = p6.Replace("med", "1");
            p6 = p6.Replace("low", "2");
            // target: "unacc" and "vgood" are replaced before "acc" and
            // "good", which they contain.
            resP = resP.Replace("unacc", "0");
            resP = resP.Replace("vgood", "3");
            resP = resP.Replace("acc", "1");
            resP = resP.Replace("good", "2");
            resBuilder.AppendLine(resP);
            return String.Format("{0} {1} {2} {3} {4} {5}", p1, p2, p3, p4, p5, p6);
        }
    }
}