Assignment 1 – Data Mining Techniques

Cliff Voetelink, 1554506, MSc. Econometrics & Operations Research, VU-net ID: cvk600.

1. Introduction

During the first lecture of the Data Mining Techniques (DMT) course in March 2011, each participating student individually answered a series of 18 questions on a form. The questions were asked verbally by the teacher, one after another. An example of such a question was: which of the colors red, blue or pink calms you down the most? At the end of the class, the filled-out forms of all students were collected by the teacher and (pre-)processed by a fellow student. In total there are 51 records, and each record contains the values (classes) of the various attributes (such as the previously mentioned most calming color) for one student participating in the course. I consider the recorded data to be fairly reliable, since I witnessed the whole experiment and a detailed report was made about how the pre-processing was done.

This report contains a short analysis and summary of the dataset that was constructed. It also contains a classification model whose purpose is to predict the study-program class of a student using the available attributes. We finish with a discussion of the results.

2. Preprocessing of Data

Using Excel, I made some changes to the data, on top of the pre-processing already done by Elektra, before loading the data into RapidMiner.

• Since the attribute 'college year' contains the value 2011 in every record, this attribute carries no information and has therefore been removed from the summary.
• There is an error in record 32: the stand_up attribute has the value 4 there, while only values from {0, 1} are allowed. I suspect a typo; the value has been corrected to 1, the most likely intended value, since almost everyone stood up.
• For the video_rec_device attribute, the value '1d' has been edited to 'd', because RapidMiner could not read the original value.
• The set of possible classes of the attribute study_program has been edited: the different AI masters are now grouped under the single class AI, while students who were the only ones from their study program are put together under the class 'other'.

A code sketch of these corrections is given below, after the attribute overview.

3. Data Summary and Data Analysis

After the pre-processing stage, a quick overview of the data is given in Table 3.1.

# Records: 51
# Attributes: 18
# Polynominal type attributes: 8
# Integer type attributes: 8
# Real-valued attributes: 2

Table 3.1: A quick overview of the size of the data and the number of attributes of each type.

An example of an integer type attribute is whether or not the student had already followed a course on machine learning, where class 0 represents no and class 1 represents yes. An example of a real-valued attribute is the estimated length of a line, and an example of a polynominal attribute is the study program the student is enrolled in: BMI, Econometrics, etc. (a polynominal attribute may also contain numbers).
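The corrections listed in Section 2 can also be expressed compactly in code. The following is a minimal sketch in Python with pandas, not the actual Excel procedure that was used; the file name, the record_id column and the exact AI class names are assumptions for illustration.

    import pandas as pd

    # Load the raw form data (file name is an assumption).
    df = pd.read_csv("dmt_forms.csv")

    # 'college_year' is 2011 in every record, so it carries no information.
    df = df.drop(columns=["college_year"])

    # Record 32: stand_up = 4 lies outside {0, 1}; correct it to the most
    # likely value 1, since almost everyone stood up (suspected typo).
    df.loc[df["record_id"] == 32, "stand_up"] = 1

    # video_rec_device: replace the unreadable value '1d' by 'd'.
    df["video_rec_device"] = df["video_rec_device"].replace("1d", "d")

    # Group the different AI masters under one class 'AI' (names assumed) ...
    df["study_program"] = df["study_program"].replace(
        {"AI - Cognitive Science": "AI", "AI - Technical": "AI"}
    )
    # ... and map study programs that occur only once to the class 'other'.
    counts = df["study_program"].value_counts()
    singletons = counts[counts == 1].index
    df.loc[df["study_program"].isin(singletons), "study_program"] = "other"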
After this pre-processing, the data we keep is well summarized by the RapidMiner meta view output (Table 3.2).

Attribute (ID) | Type | Relevant statistics | Value range or set of values (frequency) | # Missing
study_program | polynominal | mode = BMI (23), least = CS (3) | BMI (23), AI (9), Other (5), Ect (7), Bioinform (4), CS (3) | 0
mach.learning | integer | avg ± stdev = 0.686 ± 0.469 | [0.000; 1.000] | 0
inf.retrieval | integer | avg ± stdev = 0.120 ± 0.328 | [0.000; 1.000] | 1
statistics | integer | avg ± stdev = 0.880 ± 0.328 | [0.000; 1.000] | 1
databases | integer | avg ± stdev = 0.706 ± 0.460 | [0.000; 1.000] | 0
video_rec_device | polynominal | mode = 1 (47), least = 0 (2) | 1 (47), 0 (2), d (2) | 0
color | polynominal | mode = blue (38), least = red (1) | blue (38), pink (11), red (1), color_prob (1) | 0
birthday_t | polynominal | mode = -0,622582873 (2), least = -0,138812155 (1) | -0,622582873 (2), 0,169889503 (2), and many further distinct masked date values occurring once each (list abbreviated) | 0
year_t | polynominal | mode = 0,520378909 (9), least = 0,987657929 (1) | 0,520378909 (9), 0,754018419 (4), 0,053099889 (6), 0,987657929 (1), -0,180539621 (8), 0,286739399 (7), -0,414179131 (4), -5,554248353 (1), -1,348737172 (1), -0,881458152 (2), -0,647818642 (1) | 7
neighbours | integer | avg ± stdev = 4.392 ± 4.186 | [0.000; 25.000] | 0
stand_up | integer | avg ± stdev = 0.765 ± 0.428 | [0.000; 1.000] | 0
stress_level | integer | avg ± stdev = 35.196 ± 24.387 | [0.000; 80.000] | 0
line_length | real | avg ± stdev = 114.985 ± 29.524 | [45.000; 180.000] | 0
gorilla | integer | avg ± stdev = 0.235 ± 0.428 | [0.000; 1.000] | 0
random_num | real | avg ± stdev = 6.235 ± 2.187 | [1.000; 10.000] | 0
bed_time | polynominal | mode = 0,041666667 (5) | 0,041666667 (5), 0,958333333 (3), and many further distinct masked bedtime values (list abbreviated) | 0
weather | polynominal | mode = sunny (40), least = stormy (1) | other (7), sunny (40), stormy (1), rainy (1), cloudy (1) | 1
breakfast | polynominal | mode = good (38), least = n/a (6) | good (38), bad (7), n/a (6) | 0

Table 3.2: The RapidMiner meta view output already gives a decent and compact summary of our dataset.

Referring back to the question 'which color calms you down the most?', we see from the color row of Table 3.2 that one color stood out in particular: a large majority of the students, 38 out of 51, picked blue as the most calming color, while pink takes second place with 11 students.

Fig. 3.1: A histogram of the study-program backgrounds of the students.

Fig. 3.2: A histogram of the chosen random numbers, showing the frequency of each chosen number.

After the pre-processing of the data, the students are split into 6 categories according to their master programs, as shown in Fig. 3.1. The 'other' category contains students who were the only ones of their study program following the DMT course. Notice that many students from BMI follow this course and that most participants study either a form of applied mathematics, where data can be very important, or a form of computer science. Also notice that almost half of the participants are studying BMI.
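Frequency plots like Figs. 3.1 and 3.2 are straightforward to reproduce. Below is a minimal sketch with pandas and matplotlib, assuming the preprocessed frame df from the sketch above; the figures in this report were in fact produced differently.

    import matplotlib.pyplot as plt

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

    # Fig. 3.1: frequency of each study program.
    df["study_program"].value_counts().plot(kind="bar", ax=ax1)
    ax1.set_title("Study programs")

    # Fig. 3.2: frequency of each chosen random number (1..10).
    df["random_num"].value_counts().sort_index().plot(kind="bar", ax=ax2)
    ax2.set_title("Chosen random numbers")

    plt.tight_layout()
    plt.show()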
There was another question in which each student was asked to pick an integer from the set {1, 2, 3, …, 10}; the results are displayed in Fig. 3.2. There we see that the number 7 was chosen far more frequently than any other number. One might (or might not) expect that if you ask many people to pick an integer between 1 and 10, every number would be chosen about equally often, but the observed distribution does not appear to be uniform. There can be a number of explanations for why the number 7 is chosen more often than any other, e.g., it is considered a lucky number in some cultures and religions.

4. Classification of Study Programs

In this section we build a classification model that classifies students according to their study programs, based on all the other attribute data. Different studies have different course programs, and there are attributes that specify whether a student has already taken courses in statistics, machine learning, etc. For example, the econometrics program does not contain a course on machine learning, while the BMI program certainly does, so it is plausible that the study program of a student can be predicted from the attributes we have. Apart from that, we are also interested in the accuracy of the model that will be built.

We have selected all attributes to be used for the classification model. We use 10-fold cross-validation with stratified sampling (the RapidMiner defaults). The main process is summarized in Fig. 4.1; the validation part of building the model is a nested process, shown in Fig. 4.2.

Fig. 4.1: The main process to create and validate a model that labels the study programs using all other data. The first operator retrieves the preprocessed data. The second operator makes study_program the label instead of a regular attribute. The X-Validation operator is a nested process that creates, trains and validates a classification model using (by default) 10-fold cross-validation.

Fig. 4.2: The validation process with 10-fold cross-validation using stratified sampling. We use the Naive Bayes algorithm on the 17 other attributes; the dataset is used to train the model and to validate its performance.

Perf. measure \ Algorithm | Naïve Bayes | Decision Tree
Estimated classification rate | 42.67% | 59.00%
St. dev | 19.60% | 13.00%

Table 4.1: The results of the two algorithms. The only difference in the complete process is that the Naive Bayes operator in Fig. 4.2 is replaced by the Decision Tree operator.

Clearly the Decision Tree algorithm performs better. From the results in Table 4.1 we conclude that it is better to use the Decision Tree algorithm when using all attributes, since its classification rate is 59.00% while Naïve Bayes reaches only 42.67%; the standard deviation is also lower for the Decision Tree algorithm. The decision tree that was found is shown in Appendix A, Figure A1. A possible explanation for the weaker performance of Naïve Bayes is that some attributes are strongly dependent, which violates the conditional-independence assumption underlying that algorithm. For example, there will be a correlation between the attributes 'statistics' and 'machine learning', since all BMI students have followed both courses. A suggestion for further research is to first narrow down the optimal set of attributes and/or the most important attributes, and to check for independence. More detailed results can be seen in the confusion matrices of the two algorithms (Table 4.2 and Table A1).
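The same experiment can be reproduced outside RapidMiner. Below is a minimal sketch using scikit-learn; the library, the one-hot encoding of the polynominal attributes and the imputation of the few missing values are my own choices, not part of the original RapidMiner workflow. It assumes the preprocessed frame df from Section 2.

    import pandas as pd
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.tree import DecisionTreeClassifier

    # One-hot encode the polynominal attributes and impute missing values.
    X = pd.get_dummies(df.drop(columns=["study_program"]))
    X = X.fillna(X.mean())
    y = df["study_program"]

    # Stratified 10-fold cross-validation, as in Fig. 4.2. With only 3 CS
    # students, sklearn will warn that perfect stratification is impossible.
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

    for name, model in [("Naive Bayes", GaussianNB()),
                        ("Decision Tree", DecisionTreeClassifier(random_state=0))]:
        scores = cross_val_score(model, X, y, cv=cv)
        print(f"{name}: {scores.mean():.2%} +/- {scores.std():.2%}")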
Notice that for both algorithms the BMI students are the easiest to classify, on both of the performance measures below (for both measures, higher is better):

Precision of class A: CP_A = P{true class = A | predicted class = A}
Recall of class A: CR_A = P{predicted class = A | true class = A}

The bioinformatics program is the hardest to classify (0% class recall with both algorithms). Furthermore, notice that the Naïve Bayes classifier does much better at predicting whether someone is from the econometrics program (100% class recall versus 42.86% for the Decision Tree).

Confusion matrix (Decision Tree) | true BMI | true AI | true Other | true Ect | true Bioinf | true CS | class precision
pred. BMI | 22 | 0 | 2 | 0 | 3 | 0 | 81.48%
pred. AI | 0 | 4 | 0 | 0 | 1 | 1 | 66.67%
pred. Other | 0 | 0 | 0 | 4 | 0 | 1 | 0.00%
pred. Econometrics | 1 | 0 | 3 | 3 | 0 | 0 | 42.86%
pred. Bioinformatics | 0 | 2 | 0 | 0 | 0 | 0 | 0.00%
pred. CS | 0 | 3 | 0 | 0 | 0 | 1 | 25.00%
class recall | 95.65% | 44.44% | 0.00% | 42.86% | 0.00% | 33.33% |

Table 4.2: The confusion matrix when the Decision Tree algorithm is used. The overall classification rate is 59.00% ± 13.00%.

5. Conclusion

We conclude that the best classification rates were obtained using the Decision Tree algorithm; better results could be obtained by using optimal attribute selection and by further exploration of other algorithms, e.g., neural networks.

6. Masked Data

We finish with something completely different. It may be interesting to know how the sensitive information in this experiment, such as dates of birth, was masked. It was given that a linear transformation was used to mask the original data, and using this information I have uncovered the following mappings:

• For the date-month attribute in DDMM format, the following formula was used to mask the data: y = ax + b, where a = -0.000345303867662849 and b = -1000a.
• For the birth year, the relation between the masked value y and the true birth year x is given by y = ax + b, where a = -0.233620618439508 and b = -463.917184.
• The transformation of the bedtime data has been done in a similar way: consider the interval [0, 24), where 0 represents 00:00 hours and 24 represents 24:00 hours. The original bedtimes were mapped onto [0, 1) by the transformation y = x/24, where x denotes the number of hours past midnight, x ∈ [0, 24).

A code sketch that inverts these transformations is given after the appendix.

Appendix A

Fig. A1: The decision tree that is found when the Decision Tree algorithm is used.

Confusion matrix (Naive Bayes) | true BMI | true AI | true Other | true Ect | true Bioinf | true CS | class precision
pred. BMI | 9 | 0 | 1 | 0 | 3 | 0 | 69.23%
pred. AI | 0 | 4 | 2 | 0 | 0 | 2 | 50.00%
pred. Other | 3 | 1 | 1 | 0 | 1 | 0 | 16.67%
pred. Econometrics | 1 | 0 | 1 | 7 | 0 | 0 | 77.78%
pred. Bioinformatics | 10 | 3 | 0 | 0 | 0 | 0 | 0.00%
pred. CS | 0 | 1 | 0 | 0 | 0 | 1 | 50.00%
class recall | 39.13% | 44.44% | 20.00% | 100.00% | 0.00% | 33.33% |

Table A1: The confusion matrix when the Naive Bayes algorithm is used. The overall classification rate is 42.67% ± 19.60%.
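The class precision and recall values in Tables 4.2 and A1 follow directly from the confusion matrices. As a worked check, here is a minimal sketch in Python with NumPy; the matrix is the Decision Tree matrix of Table 4.2, re-entered by hand.

    import numpy as np

    # Rows = predicted class, columns = true class,
    # in the order BMI, AI, Other, Ect, Bioinf, CS (Table 4.2).
    cm = np.array([
        [22, 0, 2, 0, 3, 0],
        [ 0, 4, 0, 0, 1, 1],
        [ 0, 0, 0, 4, 0, 1],
        [ 1, 0, 3, 3, 0, 0],
        [ 0, 2, 0, 0, 0, 0],
        [ 0, 3, 0, 0, 0, 1],
    ])

    precision = np.diag(cm) / cm.sum(axis=1)  # per predicted class (row sums)
    recall = np.diag(cm) / cm.sum(axis=0)     # per true class (column sums)
    accuracy = np.trace(cm) / cm.sum()        # pooled accuracy: 30/51, ~58.8%

    # The pooled accuracy differs slightly from the reported 59.00%,
    # presumably because the reported rate is the average over the 10
    # cross-validation folds (with 13.00% its standard deviation).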
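Finally, the linear maskings of Section 6 can be inverted to recover original values. A minimal sketch follows; the function name is my own, and the test values are the modes taken from Table 3.2.

    # Invert the linear masking y = a*x + b, i.e. x = (y - b) / a.
    def unmask(y: float, a: float, b: float) -> float:
        return (y - b) / a

    # Date-month (DDMM): b = -1000*a, so x = y/a + 1000.
    a_dm = -0.000345303867662849
    print(round(unmask(-0.622582873, a_dm, -1000 * a_dm)))  # 2803 -> 28 March

    # Birth year: with the parameters exactly as reported in Section 6, the
    # mode of year_t maps to about -1988; the magnitude suggests the year 1988.
    a_y, b_y = -0.233620618439508, -463.917184
    print(unmask(0.520378909, a_y, b_y))

    # Bedtime: y = x/24, so x = 24*y hours past midnight.
    print(24 * 0.041666667)  # ~1.0 -> a bedtime of 01:00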