CIS338B: Data Mining
Assignment Summer 2014

Submission deadline: 28 August at 11am

Submission instructions: The coursework report has to be submitted in hard copy and has to reach the Computing departmental secretariat by the deadline. Make sure you mention the course name, the course code and your name on the first page.

Requirements and marking scheme

This assignment requires you to work individually. The coursework is marked out of 100, according to the marks indicated in [] for each task. 94 marks are awarded for Tasks 1 to 3, and 6 marks for the presentation (readability, appropriate indentation, clarity of ideas, submission format).

Tasks

Task 1 [55 marks]: Download the following Cardiology dataset, which contains information about patients who have been diagnosed for heart problems. The output attribute, called class, has two values, ‘healthy’ and ‘sick’, showing the result of the diagnosis.

http://www.doc.gold.ac.uk/~mas01ds/cis338/cardiology.arff

Using the Weka data mining software, you are required to:

a) Pre-process the dataset in order to select the 9 best attributes. Include in your report screenshots showing the algorithm you have applied for pre-processing (include the chosen parameter values, if any). [5]

b) On the dataset obtained at point (a), apply precisely 4 different classification algorithms in order to produce 8 models (up to 3 models per algorithm) that can be used to automatically diagnose future patients. At least one of the produced models has to be a decision tree. All the models will be learned and tested by splitting the dataset into a training and a test dataset, consisting of 70% and 30% of the instances, respectively. For each built model, include in your report a screenshot with the algorithm name and the parameter values chosen for its application, and a screenshot with the confusion matrix and the measures of performance of the model (in particular the accuracy).
Include a screenshot with one decision tree that you obtained. [24]

c) Calculate for each model the precision, sensitivity, specificity and lift for the class ‘sick’. [16]

d) Choose the best, the second best and the third best model from (b). Justify your answer. [6]

e) Mention three characteristics of sick people based on the decision tree built and displayed in (b). List the production rule(s) that you have used to identify these characteristics. [4]

Task 2 [24 marks]: The following dataset, with the input attributes SIZE and WEIGHT, needs to be clustered into two clusters by applying the K-Means algorithm. The first centres are instances 1 and 4 (where instances are numbered from top to bottom: the first instance is at the top, the last instance is at the bottom). The clustering will stop in this case after two iterations even though the centres changed. The resulting clusters will be displayed together with their centres and the within-cluster squared error. Mention the purpose of this measure.

SIZE WEIGHT
6 6 8 8 10 14 6 12 10 6 8 15

Task 3 [15 marks]: Enter the dataset from the table above in a text file with comma-separated values (csv), having on the first row the attributes X and Y separated by a comma. Name the file “dataset.csv”, load it in Weka, and cluster it using the K-means algorithm. Choose 2 clusters to be produced. Include in the report a screenshot showing the chosen parameter values, and another screenshot with the two final centres. In addition, include the composition of each cluster (showing which point goes in which cluster), and a screenshot from Weka with the cluster points in a plane. What is the squared error you got and what is its role?
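
Note on Task 1(c): the four measures can be derived from the counts in a model's confusion matrix. The following is a minimal Python sketch, not part of Weka. It assumes that ‘sick’ is treated as the positive class and takes lift to be the precision divided by the proportion of ‘sick’ instances in the test set; this is one common definition and may differ from the one used in the course. The function name class_metrics and the counts in the example are hypothetical.

```python
def _ratio(numerator, denominator):
    """Divide safely; return NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def class_metrics(tp, fn, fp, tn):
    """Compute measures for the positive class from a 2x2 confusion matrix.

    tp: 'sick' instances classified as 'sick'
    fn: 'sick' instances classified as 'healthy'
    fp: 'healthy' instances classified as 'sick'
    tn: 'healthy' instances classified as 'healthy'
    (Treating 'sick' as the positive class is an assumption of this sketch.)
    """
    total = tp + fn + fp + tn
    precision = _ratio(tp, tp + fp)
    sensitivity = _ratio(tp, tp + fn)      # also called recall
    specificity = _ratio(tn, tn + fp)
    base_rate = _ratio(tp + fn, total)     # share of 'sick' in the test set
    # Assumed definition of lift: precision relative to the base rate.
    lift = _ratio(precision, base_rate)
    return {
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "lift": lift,
    }


# Hypothetical counts, for illustration only (not taken from the dataset).
example = class_metrics(tp=40, fn=10, fp=5, tn=45)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

The counts have to be read off the confusion matrix reported by Weka for each model; check which row and column correspond to ‘sick’ before filling them in.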
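
Note on Task 2: the procedure described there (fixed initial centres, a fixed number of iterations, then the within-cluster squared error) can be sketched in a few lines of Python. This is a sketch under stated assumptions, not the Weka implementation: one iteration is taken to be one assignment step followed by one update of the centres, distance is squared Euclidean distance, and ties go to the centre with the lower index. The points in the example are hypothetical and are not the SIZE/WEIGHT table from the assignment.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centres):
    """Label each point with the index of its nearest centre."""
    return [
        min(range(len(centres)), key=lambda k: squared_distance(p, centres[k]))
        for p in points
    ]


def update(points, labels, centres):
    """Move each centre to the mean of the points assigned to it."""
    new_centres = []
    for k, old in enumerate(centres):
        members = [p for p, lab in zip(points, labels) if lab == k]
        if members:
            mean = tuple(sum(col) / len(members) for col in zip(*members))
            new_centres.append(mean)
        else:
            new_centres.append(old)  # keep an empty cluster's centre unchanged
    return new_centres


def kmeans(points, initial_centres, iterations=2):
    """Run a fixed number of K-Means iterations.

    Returns the cluster labels, the final centres and the within-cluster
    squared error (sum of squared distances of points to their centre).
    """
    centres = [tuple(c) for c in initial_centres]
    labels = []
    for _ in range(iterations):
        labels = assign(points, centres)
        centres = update(points, labels, centres)
    sse = sum(squared_distance(p, centres[lab]) for p, lab in zip(points, labels))
    return labels, centres, sse


# Hypothetical points; the first centres are instances 1 and 4 (indices 0 and 3).
points = [(1, 1), (2, 1), (1, 2), (8, 8), (9, 8), (8, 9)]
labels, centres, sse = kmeans(points, [points[0], points[3]], iterations=2)
print(labels, centres, sse)
```

Stopping after a fixed number of iterations means the centres may still be moving when the loop ends, so the reported squared error describes the clustering at that point and not necessarily a converged one.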