CIS338B: Data Mining Assignment Summer 2014

Submission deadline: 28 August at 11am
Submission instructions: The coursework report has to be submitted in hard copy
and has to reach the Computing departmental secretariat by the deadline. Make sure you
mention the course name, the course code and your name on the first page.
Requirements and marking scheme
This assignment requires you to work individually. The coursework is marked out of 100
marks, according to the marks indicated in [] for each task. 94 marks are awarded for
Tasks 1 to 3, and 6 marks for the presentation (readability, appropriate indentation, clarity
of the ideas, submission format).
Tasks
Task 1 [55 marks]:
Download the following Cardiology dataset containing information about patients who have been
examined for heart problems. The output attribute, called class, has two values, ‘healthy’ and
‘sick’, showing the result of the diagnosis.
http://www.doc.gold.ac.uk/~mas01ds/cis338/cardiology.arff
Using Weka Data Mining software, you are required to:
a) Pre-process the dataset in order to select the 9 best attributes. Include in your report
screenshots showing the algorithm you have applied for pre-processing (include the
chosen parameter values if any).
[5]
b) On the dataset obtained at point (a), apply precisely 4 different classification algorithms in
order to produce 8 models (up to 3 models per algorithm) that can be used to
automatically diagnose future patients. At least one of the produced models has to be a
decision tree. All the models will be learned and tested by splitting the dataset into a
training set and a test set, consisting of 70% and 30% of the instances,
respectively. For each built model, include in your report a screenshot with the algorithm
name and the parameter values chosen for its application, and a screenshot with the
confusion matrix and the measures of performance of the model (in particular the
accuracy). Include a screenshot of one decision tree that you obtained.
[24]
c) Calculate for each model the precision, sensitivity, specificity and the lift, for the class
‘sick’.
[16]
d) Choose the best, the second best and the third best model from (b). Justify your answer.
[6]
e) State three characteristics of sick people based on the decision tree built and displayed
in (b). List the production rule(s) that you have used to derive these characteristics.
[4]
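The measures requested in point (c) can all be derived from the four cells of a two-class
confusion matrix. The sketch below is only an illustration: the counts are hypothetical and not
taken from the Cardiology dataset, the function name class_metrics is made up for this example,
and lift is assumed to mean the precision for ‘sick’ divided by the overall proportion of
‘sick’ instances. Other definitions of lift exist, so check which one the course uses.

```python
def class_metrics(tp, fn, fp, tn):
    """Measures for the positive class ('sick') from a 2x2 confusion matrix.

    tp: sick predicted as sick        fn: sick predicted as healthy
    fp: healthy predicted as sick     tn: healthy predicted as healthy
    """
    total = tp + fn + fp + tn
    precision = tp / (tp + fp)      # how many predicted 'sick' are really sick
    sensitivity = tp / (tp + fn)    # how many sick patients are detected
    specificity = tn / (tn + fp)    # how many healthy patients are recognised
    base_rate = (tp + fn) / total   # proportion of 'sick' in the test set
    lift = precision / base_rate    # assumed definition, see the note above
    accuracy = (tp + tn) / total
    return {
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "lift": lift,
        "accuracy": accuracy,
    }


# Hypothetical counts, for illustration only.
metrics = class_metrics(tp=30, fn=10, fp=20, tn=40)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

A lift above 1 under this definition means the model picks out sick patients better than
selecting patients at random would.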
Task 2 [24 marks]:
The following dataset, which has the input attributes SIZE and WEIGHT, needs to be
grouped into two clusters by applying the K-Means algorithm. The initial centres are
instances 1 and 4 (where instances are numbered from top to bottom: the first instance is
at the top, the last instance is at the bottom). In this case the clustering will stop after two
iterations, even if the centres have changed. The resulting clusters are to be displayed
together with their centres and the within-cluster squared error. State the purpose of
this measure.
SIZE    WEIGHT
6       6
8       8
10      14
6       12
10      6
8       15
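The procedure described in Task 2 (assign each instance to its nearest centre, recompute the
centres, stop after a fixed number of iterations) might be sketched as follows. This is a
minimal illustration under stated assumptions, not a reference solution: the points are
hypothetical and are not the SIZE/WEIGHT table, the function names are made up for this example,
one iteration is taken to mean one assignment step followed by one centre update, and the
within-cluster squared error is taken to be the sum of squared Euclidean distances from each
point to the final centre of its cluster.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, centres, iterations):
    """Run a fixed number of K-Means iterations from the given initial centres."""
    centres = list(centres)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centre.
        clusters = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)),
                          key=lambda i: squared_distance(p, centres[i]))
            clusters[nearest].append(p)
        # Update step: each centre moves to the mean of its cluster
        # (an empty cluster keeps its previous centre).
        centres = [mean_point(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    # Within-cluster squared error with respect to the final centres.
    sse = sum(squared_distance(p, centres[i])
              for i, c in enumerate(clusters) for p in c)
    return clusters, centres, sse


# Hypothetical points, for illustration only.
points = [(1, 1), (2, 1), (1, 2), (8, 8), (9, 8), (8, 9)]
initial_centres = [points[0], points[3]]  # instances 1 and 4
clusters, centres, sse = k_means(points, initial_centres, iterations=2)
print(clusters)
print(centres)
print(sse)
```

Stopping after a fixed number of iterations means the centres may not have converged yet; a
lower squared error indicates more compact clusters for the same number of clusters.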
Task 3 [15 marks]:
Enter the dataset from the table above into a comma-separated values (CSV) text file whose
first row contains the attribute names X and Y, separated by a comma. Name the file “dataset.csv”, load it into
Weka, and cluster it using the K-means algorithm. Choose 2 clusters to be produced. Include in
the report a screenshot showing the chosen parameter values, and another screenshot with the two
final centres. In addition, include the composition of each cluster (showing which point goes in
which cluster), and a Weka screenshot of the clustered points plotted in the plane. What squared
error did you obtain, and what is its role?
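The file layout described in Task 3 could be produced by hand in any text editor, or with a
short script such as the sketch below. It assumes that the values in the table are read in order
as (SIZE, WEIGHT) pairs, one instance per row, and that SIZE maps to X and WEIGHT to Y; check
both assumptions against the original table. The function name write_dataset is made up for
this example.

```python
import csv

# Assumption: the table values read in order as (SIZE, WEIGHT) pairs.
rows = [(6, 6), (8, 8), (10, 14), (6, 12), (10, 6), (8, 15)]


def write_dataset(path, rows):
    """Write a header row 'X,Y' followed by one comma-separated row per instance."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["X", "Y"])
        writer.writerows(rows)


write_dataset("dataset.csv", rows)
```

The resulting file starts with the line X,Y and then has one line per instance, for example 6,6.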