Assignment 3 - Steven Graham - B00444855

Assignment 3
Health Informatics
Steven Graham
B0044855
Table of Contents
Introduction
Task 1
Task 2
Task 3
Task 4
Task 5
Task 6
  Random Tree
  Decision Table
  Classifier Analysis
Task 7
  SimpleKmeans
Task 8
Works Cited
Introduction
For this assignment I will investigate the mining of clinical and medical data. Data mining is the
process of extracting useful information from raw data. The objective of the assignment is to
investigate a medical dataset and classify the data using well-known classification algorithms. For
the classification I will be using a piece of open source data mining software called WEKA.
(Weka, 2010)
Task 1
The term attribute simply means a feature or a variable within the dataset. For example, within the
dataset I will be using for this assignment, diagnosis is an attribute.
Definition taken from Wikipedia - In computing, an attribute is a specification that defines a property
of an object, element, or file. It may also refer to or set the specific value for a given instance of
such.
(Attribute, 2010)
Task 2
For this assignment I will download two files from the Machine Learning Repository. The first will be
the data set description and the second will be the actual data. Below is a screen shot of the data set
description file, which I will use later to convert the data file to a suitable format for WEKA.
(Breast Cancer Wisconsin (Diagnostic) Data Set, 1995)
Task 3
The dataset contains data on breast cancer diagnosis. The data contains information on both benign
and malignant tumours. The data set also gives further characteristics of the benign and malignant
tumours.
I believe that the dataset is intended for research into ways in which to improve tumour diagnosis,
by the introduction of computerised diagnosis. The dataset contains the information that a system
could be tested on to see how accurate it is.
The data set could also be used to see if there are any similarities or if there are any trends within
the data.
Task 4
1. The dataset contains 569 instances.
2. The dataset contains 32 attributes; the details are shown below.
3. There are two classes within the dataset: Malignant and Benign.
Task 5
To enable me to use WEKA to classify the raw data, I must first place it into a format that WEKA
will understand. To convert the text file to WEKA format I simply list the attributes that are
contained within the data file and then save it as an ARFF file. Below is a screen shot of the
attributes placed at the beginning of the file.
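The conversion can also be scripted. Below is a minimal Python sketch, assuming the data file is comma separated with the ID first, the diagnosis second and the 30 measurements after that (ten means, ten standard errors, ten "worst" values). The attribute names, the relation name and the function names are my own assumptions for illustration and are not the ones shown in the screen shot.

```python
# Base measurements; each one is assumed to appear three times in the data file
# (mean, standard error, worst), giving 30 numeric attributes.
BASE_FEATURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave_points", "symmetry",
    "fractal_dimension",
]


def build_arff_header(relation="wdbc"):
    """Return the ARFF header lines for the 32-attribute data set."""
    lines = [f"@relation {relation}", ""]
    lines.append("@attribute id numeric")
    # Nominal class attribute: M = malignant, B = benign.
    lines.append("@attribute diagnosis {M,B}")
    for suffix in ("mean", "se", "worst"):
        for name in BASE_FEATURES:
            lines.append(f"@attribute {name}_{suffix} numeric")
    lines.extend(["", "@data"])
    return lines


def to_arff(data_rows, relation="wdbc"):
    """Join the header and the comma-separated data rows into ARFF text."""
    return "\n".join(build_arff_header(relation) + list(data_rows)) + "\n"


header = build_arff_header()
attribute_lines = [line for line in header if line.startswith("@attribute")]
```

The text returned by to_arff for the lines of the downloaded data file could then be saved with an .arff extension and opened in WEKA.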
Task 6
Supervised learning is where the machine learns a task by inferring a function from a labelled
training dataset. The training dataset contains a number of training examples, each consisting
of an input object and a desired output. A supervised learning algorithm analyses the
training data and produces a classifier, which should predict the correct output for any valid input.
(Supervised learning, 2010)
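To make the idea of learning from input and desired output pairs concrete, below is a minimal Python sketch of a 1-nearest-neighbour classifier. This is not one of the algorithms I ran in WEKA; it is only an illustration, and the training examples and function names are invented.

```python
import math


def train(examples):
    """'Training' for 1-nearest-neighbour just stores the labelled examples."""
    return list(examples)


def predict(model, features):
    """Return the label of the stored example closest to `features`."""
    nearest = min(model, key=lambda example: math.dist(example[0], features))
    return nearest[1]


# Hypothetical (input, desired output) pairs, not taken from the data set.
training_set = [
    ((12.0, 15.0), "B"),
    ((13.0, 14.0), "B"),
    ((20.0, 25.0), "M"),
    ((22.0, 27.0), "M"),
]

model = train(training_set)
prediction = predict(model, (21.0, 24.0))
```

The classifier here is the pair of train and predict: once the labelled examples are stored, predict can be called on an input it has never seen.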
Random Tree
Below are the results from the random tree supervised classification.
The random tree classification algorithm classified 92.091% of the data set correctly, with an
average TP rate of 0.921.
Decision Table
Below are the results from the decision table supervised classification.
The decision table classification algorithm classified 94.024% of the data set correctly, with an
average TP rate of 0.94.
Classifier Analysis
Out of the two supervised classification methods I selected, the decision table performed the best.
The decision table classified 94.024% of the instances correctly, compared to 92.091% for the random
tree. The decision table also had a better weighted TP rate, with 0.94 compared to 0.921 for the
random tree.
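The two figures quoted for each classifier are closely related. The weighted TP rate weights each class's TP rate by the share of instances in that class, which works out to the same value as the overall accuracy. Below is a minimal Python sketch of both calculations. The confusion matrix is hypothetical: only the total of 569 and the 535 correct instances follow from the reported 94.024%, and the split of the errors between the classes is invented.

```python
def accuracy(confusion):
    """Fraction of instances on the diagonal of a confusion matrix."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


def weighted_tp_rate(confusion):
    """Per-class TP rate, weighted by the number of instances in each class."""
    total = sum(sum(row) for row in confusion)
    weighted = 0.0
    for i, row in enumerate(confusion):
        class_size = sum(row)
        if class_size == 0:
            continue  # a class with no instances contributes nothing
        tp_rate = row[i] / class_size
        weighted += tp_rate * class_size / total
    return weighted


# Hypothetical confusion matrix (rows = actual class, columns = predicted class).
example = [
    [340, 17],   # actual Benign
    [17, 195],   # actual Malignant
]

example_accuracy = accuracy(example)
example_weighted_tp = weighted_tp_rate(example)
```

Because the two values agree, the comparison between the classifiers could be made on either figure.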
Task 7
Unsupervised learning is a set of problems in which one seeks to determine how the data is
organised. Many of the methods employed here are based on data mining methods used to pre-process
data. It differs from supervised learning in that the learner is given only unlabelled examples.
(Unsupervised learning, 2010)
SimpleKmeans
Below are the results from the SimpleKmeans unsupervised classification.
From the results of the SimpleKmeans classification I can see that the algorithm ran for 4 iterations.
The 569 instances were classified into two clusters, with Benign having 63% or 358 instances and
Malignant having 37% or 211 instances.
In comparison to the supervised algorithms, the SimpleKmeans classified around 13 more instances as
Benign and around 21 more instances as Malignant.
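To show what an iteration means here, below is a minimal Python sketch of the basic k-means procedure (assign each point to its nearest centroid, then move each centroid to the mean of its members, and repeat until nothing changes). It is not WEKA's SimpleKMeans implementation; the points, the starting centroids and the function name are invented for illustration. The last line checks the cluster percentages quoted above.

```python
import math


def kmeans(points, centroids, max_iterations=100):
    """Basic k-means: returns (centroids, cluster index per point, iterations run)."""
    centroids = [tuple(c) for c in centroids]
    assignments = None
    for iteration in range(1, max_iterations + 1):
        # Assignment step: attach every point to its nearest centroid.
        new_assignments = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_assignments == assignments:
            # Nothing moved between clusters, so the algorithm has converged.
            return centroids, assignments, iteration
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:
                centroids[k] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments, max_iterations


# Hypothetical two-dimensional points forming two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, assignments, iterations = kmeans(points, [points[0], points[3]])
cluster_sizes = [assignments.count(k) for k in range(len(centroids))]

# Share of the 569 instances in each cluster, as whole percentages.
shares = [round(100 * n / 569) for n in (358, 211)]
```

Note that the clusters themselves carry no labels; calling one Benign and the other Malignant is an interpretation made after comparing the clusters with the diagnosis attribute.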
Task 8
Unsupervised methods of classification, although useful, only work well when the user has an idea
of the expected result. If the user does not have an idea of what the results should look like then
they should use a supervised method, as this is more accurate.
If the dataset has missing data this can have an effect on the performance of the classifier. The data
may not be placed into the right class, or the classifier may not display how many values
were missing and the user may not notice this. One method which has been used to treat missing
values is deleting every instance that contains at least one missing value of a feature.
(The treatment of missing values and its effect, 2005)
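The deletion method described above can be sketched in a few lines of Python. This is a minimal sketch, assuming the rows have already been split into fields and that a missing value is marked with a question mark, as in the ARFF format; the rows and the function name are invented for illustration.

```python
MISSING = "?"  # ARFF marks a missing value with a question mark


def drop_incomplete(rows, missing=MISSING):
    """Keep only the rows in which no field is missing."""
    return [row for row in rows if missing not in row]


# Hypothetical rows: id, diagnosis, two measurements.
rows = [
    ["1", "M", "17.99", "10.38"],
    ["2", "B", "?", "14.36"],
    ["3", "B", "13.54", "?"],
    ["4", "M", "20.57", "17.77"],
]

complete = drop_incomplete(rows)
removed = len(rows) - len(complete)
```

Reporting the number of removed instances alongside the results helps the user notice how much data was lost.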
Works Cited
Breast Cancer Wisconsin (Diagnostic) Data Set. (1995, 11 01). Retrieved 10 2010, from Machine
Learning Repository:
http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.names
The treatment of missing values and its effect. (2005). Retrieved 12 2010, from University of Puerto
Rico: http://academic.uprm.edu/~eacuna/IFCS04r.pdf
Attribute. (2010, 11 17). Retrieved 10 2010, from Wikipedia:
http://en.wikipedia.org/wiki/Attribute_(computing)
Supervised learning. (2010, 12 02). Retrieved 12 2010, from Wikipedia:
http://en.wikipedia.org/wiki/Supervised_learning
Unsupervised learning. (2010, 10 04). Retrieved 12 2010, from Wikipedia:
http://en.wikipedia.org/wiki/Unsupervised_learning
Weka. (2010). Retrieved 10 2010, from Machine Learning Group at University of Waikato:
http://www.cs.waikato.ac.nz/~ml/weka/