Assignment 1 – Data Mining Techniques

Cliff Voetelink, 1554506, MSc. Econometrics & Operations Research, VU-net ID: cvk600.
1. Introduction
During the first lecture of the Data Mining Techniques (DMT) course in March 2011, each participating student
individually answered a series of 18 questions on a form. The questions were asked verbally by the teacher,
one after another. An example of such a question was: which of the colors red, blue, or pink calms you down
the most? At the end of the class, the filled-out forms of all students were collected by the teacher and (pre-)
processed by a fellow student.
In total there are 51 records, and each record contains the values (classes) of the various attributes (such as the
previously mentioned most calming color) for a specific student participating in the course. I consider the
recorded data to be fairly reliable, since I witnessed the whole experiment and a detailed report was made
about how the pre-processing was done.
This report contains a short analysis and summary of the specific dataset that has been constructed. It also
contains a classification model whose purpose is to predict the study-program class of a student using the
available attributes. We finish with a discussion of the results.
2. Preprocessing of Data
Using Excel, I have made some changes to the data as part of the pre-processing stage carried out by Elektra,
before loading the data into RapidMiner (a code sketch of these steps is given after the list):
• Since the attribute ‘college year’ contains the value 2011 in every record, this part of the data is not
very interesting and it has therefore been removed from the summary.
• Furthermore, there is an error in record 32: the stand_up attribute has value 4 there, while only values
from {0,1} are allowed. It has been corrected to 1, the most likely value since almost everyone stood
up; I suspect a typo.
• For the video_rec_device attribute, the value `1d' has been edited to `d' due to reading errors of RapidMiner.
• The set of possible classes of the attribute study_program has been edited: the different AI masters are
now grouped under the same class AI, while students who were the only ones of their study program
are put together as one group under the class ‘other’.
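The following is a minimal pandas sketch of these cleaning steps, not the actual Excel work; the file name dmt_forms.csv and the exact column labels are assumptions for illustration.

    import pandas as pd

    # Load the raw form data (hypothetical file and column names).
    df = pd.read_csv("dmt_forms.csv")

    # 'college_year' is 2011 in every record, so it carries no information.
    df = df.drop(columns=["college_year"])

    # Record 32: stand_up = 4 is outside the allowed set {0, 1};
    # correct it to 1, the by far most frequent value.
    df.loc[31, "stand_up"] = 1  # row index 31 = record 32 (0-based)

    # Repair the RapidMiner reading error '1d' -> 'd'.
    df["video_rec_device"] = df["video_rec_device"].replace("1d", "d")

    # Pool all study programs that occur only once into the class 'other'
    # (the different AI masters would be merged into 'AI' in the same way).
    counts = df["study_program"].value_counts()
    singletons = counts[counts == 1].index
    df.loc[df["study_program"].isin(singletons), "study_program"] = "other"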
3. Data Summary and Data Analysis
After the pre-processing stage, a quick overview of the data is given in Table 3.1.
Data Size and Attribute Type Overview
# Records                     | 51
# Attributes                  | 18
# Polynominal Type Attributes | 8
# Integer Type Attributes     | 8
# Real Valued Attributes      | 2
Table 3.1: A quick overview of the size of the data and the number of attributes of each type.
An example of an integer type attribute is whether or not the student had already followed a course on
machine learning, with class 0 representing no and class 1 representing yes. An example of a real-valued
attribute is the estimated length of a line, and an example of a polynominal attribute is the type of study
program the student is enrolled in (BMI, Econometrics, etc.); polynominal attributes can contain numbers as well.
After some preprocessing, the data we keep can be well summarized using the RapidMiner meta data view
output (Table 3.2).
Attribute (ID)   | Attribute Type | Relevant Statistics                               | Value Range or Set of Possible Values (frequency)                                  | # Missing
study_program    | polynominal    | mode = BMI (23), least = CS (3)                   | BMI (23), AI (9), Ect (7), Other (5), Bioinform (4), CS (3)                        | 0
mach.learning    | integer        | avg +/- stdev = 0.686 +/- 0.469                   | [0.000 ; 1.000]                                                                    | 0
inf.retrieval    | integer        | avg +/- stdev = 0.120 +/- 0.328                   | [0.000 ; 1.000]                                                                    | 1
statistics       | integer        | avg +/- stdev = 0.880 +/- 0.328                   | [0.000 ; 1.000]                                                                    | 1
databases        | integer        | avg +/- stdev = 0.706 +/- 0.460                   | [0.000 ; 1.000]                                                                    | 0
video_rec_device | polynominal    | mode = 1 (47), least = 0 (2)                      | 1 (47), 0 (2), d (2)                                                               | 0
color            | polynominal    | mode = blue (38), least = red (1)                 | blue (38), pink (11), red (1), color_prob (1)                                      | 0
birthday_t       | polynominal    | mode = -0,622582873 (2), least = -0,138812155 (1) | masked values, almost all unique, e.g. -0,622582873 (2), 0,169889503 (2), -0,138812155 (1), -0,24551105 (1), ... | 0
year_t           | polynominal    | mode = 0,520378909 (9), least = 0,987657929 (1)   | 0,520378909 (9), -0,180539621 (8), 0,286739399 (7), 0,053099889 (6), -0,414179131 (4), 0,754018419 (4), -0,881458152 (2), -5,554248353 (1), -1,348737172 (1), -0,647818642 (1), 0,987657929 (1) | 7
neighbours       | integer        | avg +/- stdev = 4.392 +/- 4.186                   | [0.000 ; 25.000]                                                                   | 0
stand_up         | integer        | avg +/- stdev = 0.765 +/- 0.428                   | [0.000 ; 1.000]                                                                    | 0
stress_level     | integer        | avg +/- stdev = 35.196 +/- 24.387                 | [0.000 ; 80.000]                                                                   | 0
line_length      | real           | avg +/- stdev = 114.985 +/- 29.524                | [45.000 ; 180.000]                                                                 | 0
gorilla          | integer        | avg +/- stdev = 0.235 +/- 0.428                   | [0.000 ; 1.000]                                                                    | 0
random_num       | real           | avg +/- stdev = 6.235 +/- 2.187                   | [1.000 ; 10.000]                                                                   | 0
bed_time         | polynominal    | mode = 0,041666667 (5), least = 0,882638889 (1)   | masked values, e.g. 0,041666667 (5), 0,958333333 (3), 0,979166667 (3), 0,020833333 (3), 0,052083333 (2), 0,0625 (2), ... | 0
weather          | polynominal    | mode = sunny (40), least = stormy (1)             | sunny (40), other (7), stormy (1), rainy (1), cloudy (1)                           | 1
breakfast        | polynominal    | mode = good (38), least = n/a (6)                 | good (38), bad (7), n/a (6)                                                        | 0
Table 3.2: The RapidMiner output already gives us a decent and compact summary of our dataset.
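The same per-attribute summary can be reproduced outside RapidMiner; the following is a small sketch in pandas (reusing the assumed DataFrame df from Section 2), not the tool output itself.

    import pandas as pd

    for col in df.columns:
        missing = df[col].isna().sum()
        if pd.api.types.is_numeric_dtype(df[col]):
            # Numeric attributes: mean, standard deviation, range.
            print(col,
                  f"avg +/- stdev = {df[col].mean():.3f} +/- {df[col].std():.3f}",
                  f"range = [{df[col].min():.3f} ; {df[col].max():.3f}]",
                  f"missing = {missing}")
        else:
            # Polynominal attributes: most and least frequent class.
            freq = df[col].value_counts()
            print(col,
                  f"mode = {freq.idxmax()} ({freq.max()})",
                  f"least = {freq.idxmin()} ({freq.min()})",
                  f"missing = {missing}")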
Referring back to the question ‘which color calms you down the most?’, we see from the highlighted row of
Table 3.2 that one color stood out in particular: a large majority of the students, 38 out of 51, picked blue as
the most calming color, while pink takes second place with 11 students.
Fig. 3.1: A histogram of the various study backgrounds of the students.
Fig. 3.2: A histogram of the chosen random numbers, denoting the frequency of each chosen number.
After some pre-processing of the data, we split the students into 6 categories according to their master
programs, as shown in Fig. 3.1. The ‘other’ category contains students who were the only ones of their study
following the DMT course. Notice that almost half of the participants are studying BMI, and that most students
following this course are either into applied mathematics, where data can be very important, and/or into some
form of computer science.
There was another question in which each student was asked to pick an integer from the set {1, 2, 3, …, 9, 10};
the results are displayed in Fig. 3.2. There we see that the number 7 was chosen far more frequently than any
other number. While you might (or might not) expect that if you ask many people to pick an integer between
1 and 10, every number would be chosen about equally often, the distribution does not seem to be uniform
(a sketch of a formal test is given below). There can be a number of explanations for why the number 7 is
chosen more often than any other number, e.g., it is considered a lucky number in some religions.
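One way to make the non-uniformity claim precise is a chi-square goodness-of-fit test; the sketch below applies scipy's chisquare to the random_num column of the assumed DataFrame df, with equal expected counts (51/10 = 5.1 per number) under the null hypothesis of uniformity.

    from scipy.stats import chisquare

    # Observed frequency of each integer 1..10 (0 for numbers never chosen).
    observed = (df["random_num"].astype(int)
                .value_counts()
                .reindex(range(1, 11), fill_value=0))
    stat, p_value = chisquare(observed)  # null: all 10 counts equal
    print(f"chi2 = {stat:.2f}, p = {p_value:.3f}")  # small p -> reject uniformity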
4. Classification of Study-Programs
In this section we are going to build a classification model that can classify students according to their study
programs based on all the other attribute data. Different studies have different course programs, and there are
attributes that specify whether a student has already taken courses in statistics, machine learning, etc. For
example, the econometrics study program does not have a course on machine learning while the BMI program
certainly does, so it makes sense that the study program of a student can be predicted using the attributes we
have. Apart from that, we are also interested in the accuracy of the model that will be built.
We have selected all attributes to be used for the classification model. By default, we use 10-fold cross-validation
with stratified sampling. The process is summarized in Fig. 4.1; the validation part of building a model is a nested
process and is shown in Fig. 4.2.
Fig. 4.1: The main process to create and validate a model that is able to label the study programs
using all other data. The first operator retrieves the preprocessed data. The second operator makes
sure that study_program becomes a label instead of a regular attribute. The X-Validation operator is
a nested process that creates, trains and validates a classification model using (by default) 10-fold
cross-validation.
Fig. 4.2: The validation process with a 10-fold cross-validation using stratified sampling. We use the
Naive Bayes algorithm on the 17 other attributes; the dataset is used to train and validate the
performance of the model.
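For readers without RapidMiner, the following is a rough scikit-learn analogue of the process in Figs. 4.1 and 4.2, a sketch under the assumption that the preprocessed data sits in the DataFrame df from Section 2; being a different implementation (e.g. Gaussian rather than discrete Naive Bayes), it will not exactly reproduce the numbers in Table 4.1.

    import pandas as pd
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.tree import DecisionTreeClassifier

    # One-hot encode the polynominal attributes; study_program is the label.
    X = pd.get_dummies(df.drop(columns=["study_program"]))
    X = X.fillna(X.mean())  # simple imputation of the few missing values
    y = df["study_program"]

    # 10-fold cross-validation with stratified sampling, as in Fig. 4.2
    # (sklearn will warn that the smallest class has fewer members than folds).
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    for clf in (GaussianNB(), DecisionTreeClassifier(random_state=0)):
        scores = cross_val_score(clf, X, y, cv=cv)
        print(type(clf).__name__,
              f"accuracy = {scores.mean():.2%} +- {scores.std():.2%}")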
Perf. Measure \ Algorithm     | Naïve Bayes | Decision Tree
Estimated Classification Rate | 42.67%      | 59.00%
St. Dev                       | 19.60%      | 13.00%
Table 4.1: The results of the two algorithms. The only difference in the complete process is that the
Naive Bayes algorithm in Fig. 4.2 is replaced by the Decision Tree algorithm. Clearly the Decision
Tree algorithm performs better.
From the results in Table 4.1 we conclude that it is better to use the Decision Tree algorithm when using all
attributes, since its classification rate is 59% while this is only 42.67% for Naïve Bayes; the standard deviation is
also lower for the Decision Tree algorithm. The decision tree that was found is shown in Appendix A, Figure A1.
A possible explanation for the worse performance of Naïve Bayes is that some attributes are strongly
dependent, which violates its independence assumption. For example, there will be a correlation between the
attributes ‘statistics’ and ‘machine learning’, since all BMI students have followed both courses. A suggestion
for further research would be to first narrow down the optimal set of attributes and/or the most important
attributes and check for independence.
More detailed results can be seen in the confusion matrices (Table 4.2 and Table A1) for the two algorithms.
Notice that for both algorithms the BMI students are the easiest to classify under both accuracy measures
below (for both measures, higher is better):
Precision of class A: CP_A = P{ true class = A | predicted class = A }
Recall of class A:    CR_A = P{ predicted class = A | true class = A }
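As a concrete check, the sketch below recomputes these two measures from the decision-tree confusion matrix of Table 4.2 with numpy (rows are predicted classes, columns are true classes, in the order BMI, AI, Other, Ect, Bioinf, CS).

    import numpy as np

    cm = np.array([
        [22, 0, 2, 0, 3, 0],   # pred. BMI
        [ 0, 4, 0, 0, 1, 1],   # pred. AI
        [ 0, 0, 0, 4, 0, 1],   # pred. Other
        [ 1, 0, 3, 3, 0, 0],   # pred. Econometrics
        [ 0, 2, 0, 0, 0, 0],   # pred. Bioinformatics
        [ 0, 3, 0, 0, 0, 1],   # pred. CS
    ])

    precision = cm.diagonal() / cm.sum(axis=1)  # per predicted class (rows)
    recall    = cm.diagonal() / cm.sum(axis=0)  # per true class (columns)
    print(precision)  # [0.815 0.667 0.    0.429 0.    0.25 ] as in Table 4.2
    print(recall)     # [0.957 0.444 0.    0.429 0.    0.333] as in Table 4.2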
The bioinformatics study is the hardest to classify (0% recall and precision for both algorithms). Furthermore,
notice that the Naïve Bayes classifier does a lot better at predicting whether someone is from the econometrics
program.
Confusion Matrix DT  | true BMI | true AI | true Other | true Ect | true Bioinf | true CS | class precision
pred. BMI            | 22       | 0       | 2          | 0        | 3           | 0       | 81.48%
pred. AI             | 0        | 4       | 0          | 0        | 1           | 1       | 66.67%
pred. Other          | 0        | 0       | 0          | 4        | 0           | 1       | 0.00%
pred. Econometrics   | 1        | 0       | 3          | 3        | 0           | 0       | 42.86%
pred. Bioinformatics | 0        | 2       | 0          | 0        | 0           | 0       | 0.00%
pred. CS             | 0        | 3       | 0          | 0        | 0           | 1       | 25.00%
class recall         | 95.65%   | 44.44%  | 0.00%      | 42.86%   | 0.00%       | 33.33%  | 59.00% +- 13.00%
Table 4.2: The confusion matrix when the Decision Tree algorithm is used. The bottom-right cell is
the overall classification rate +- st. dev.
We conclude that the best classification rates were obtained using the Decision Tree algorithm; better results
could be obtained by optimal attribute selection and further exploration of other algorithms, e.g., neural
networks.
5. Masked Data
We finish up with something completely different. It may be interesting to know how sensitive information,
such as the date of birth, was masked in this experiment. It was given that a linear transformation was used to
mask the original data, and using this information I have uncovered the following mappings (a sketch for
inverting them is given after the list):
• For the date-month attribute in DDMM format, the following formula was used to mask the data:
y = ax + b with a = -0.000345303867662849 and b = -1000a.
• For the birth year, the relation between the masked value y and the true birth year x is given by
y = ax + b with a = -0.233620618439508 and b = -463.917184.
• The transformation of the bedtime data was done in a similar way: consider the interval [0, 24),
where 0 represents 00:00 hours and 24 represents 24:00 hours. The original bedtimes were mapped
onto [0, 1) by the transformation y = x / 24, where x denotes the number of hours past midnight,
x ∈ [0, 24).
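Since each mask is linear, y = ax + b, inverting it amounts to x = (y - b) / a; the sketch below implements the inverses with the coefficients uncovered above (the function names are illustrative, not part of the original experiment).

    def unmask_ddmm(y: float) -> float:
        # Inverse of y = a*x + b for the date-month (DDMM) mask.
        a = -0.000345303867662849
        b = -1000 * a
        return (y - b) / a

    def unmask_birth_year(y: float) -> float:
        # Inverse of y = a*x + b for the birth-year mask.
        a = -0.233620618439508
        b = -463.917184
        return (y - b) / a

    def unmask_bed_time(y: float) -> float:
        # Inverse of y = x / 24: back to hours past midnight in [0, 24).
        return 24 * y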
Appendix A
Fig. A1: The decision tree that is found when the Decision Tree algorithm is used.
Confusion Matrix NB  | true BMI | true AI | true Other | true Ect | true Bioinf | true CS | class precision
pred. BMI            | 9        | 0       | 1          | 0        | 3           | 0       | 69.23%
pred. AI             | 0        | 4       | 2          | 0        | 0           | 2       | 50.00%
pred. Other          | 3        | 1       | 1          | 0        | 1           | 0       | 16.67%
pred. Econometrics   | 1        | 0       | 1          | 7        | 0           | 0       | 77.78%
pred. Bioinformatics | 10       | 3       | 0          | 0        | 0           | 0       | 0.00%
pred. CS             | 0        | 1       | 0          | 0        | 0           | 1       | 50.00%
class recall         | 39.13%   | 44.44%  | 20.00%     | 100.00%  | 0.00%       | 33.33%  | 42.67% +- 19.60%
Table A1: The confusion matrix when the Naive Bayes algorithm is used. The bottom-right cell is the
overall classification rate +- st. dev.