Assignment 3 Health Informatics
Steven Graham B0044855

Table of Contents

Introduction
Task 1
Task 2
Task 3
Task 4
Task 5
Task 6
    Random Tree
    Decision Table
    Classifier Analysis
Task 7
    SimpleKmeans
Task 8
Works Cited

Introduction

For this assignment I will investigate the mining of clinical and medical data. Data mining is the process of extracting useful information from raw data. The objective of the assignment is to investigate the medical datasets and classify the data using well-known classification algorithms. For the classification I will be using WEKA, a piece of open source data mining software. (Weka, 2010)

Task 1

The term attribute simply means a feature or a variable within a dataset. For example, in the dataset I will be using for this assignment, diagnosis is an attribute. Wikipedia defines the term as follows: "In computing, an attribute is a specification that defines a property of an object, element, or file. It may also refer to or set the specific value for a given instance of such." (Attribute, 2010)

Task 2

For this assignment I will download two files from the Machine Learning Repository. The first is the data set description and the second is the actual data. Below is a screen shot of the data set description file, which I will use later to convert the data file to a suitable format for WEKA. (Breast Cancer Wisconsin (Diagnostic) Data Set, 1995)

Task 3

The dataset contains data on breast cancer diagnosis. It covers both benign and malignant tumours and gives further characteristics of each. I believe that the dataset is intended for research into improving tumour diagnosis through the introduction of computerised diagnosis. The dataset contains the information that a system could be tested on to see how accurate it is.
The data set could also be used to see if there are any similarities or trends within the data.

Task 4

1. The dataset contains 569 instances.
2. The dataset contains 32 attributes; the details are shown below.
3. There are two classes within the dataset: Malignant and Benign.

Task 5

To use WEKA to classify the raw data, I must first place it into a format that WEKA understands. To convert the text file to WEKA format, I simply list the attributes that are contained within the data file at the beginning of the file and then save it as an ARFF file. Below is a screen shot of the attributes placed at the beginning of the file.

Task 6

Supervised learning is where the machine learns a task by inferring a function from supervised training data. The training data set contains a number of training examples, each consisting of an input object and a desired output. A supervised learning algorithm analyses the training data and produces a classifier, which should predict the correct output for any valid input. (Supervised learning, 2010)

Random Tree

Below are the results from the random tree supervised classification.

The random tree classification algorithm classified 92.091% of the data set correctly, with an average TP rate of 0.921.

Decision Table

Below are the results from the decision table supervised classification.

The decision table classification algorithm classified 94.024% of the data set correctly, with an average TP rate of 0.94.

Classifier Analysis

Of the two supervised classification methods I selected, the decision table performed best. The decision table classified 94.024% of the instances correctly, compared with 92.091% for the random tree. The decision table also had a better weighted TP rate, at 0.94 compared with 0.921 for the random tree.

Task 7

Unsupervised learning is a class of problems in which one seeks to determine how the data is organised. Many of the methods employed here are based on data mining methods used to preprocess data.
It differs from supervised learning in that the learner is given only unlabelled examples. (Unsupervised learning, 2010)

SimpleKmeans

Below are the results from the SimpleKmeans unsupervised classification.

From the results of the SimpleKmeans classification I can see that the algorithm ran for 4 iterations. The 569 instances were divided between the two clusters, with Benign having 358 instances (63%) and Malignant having 211 instances (37%). In comparison to the supervised algorithms, SimpleKmeans classified more of the instances as Benign by around 13, and more instances as Malignant by around 21.

Task 8

Unsupervised methods of classification, although useful, only work well when the user has an idea of the expected result. If the user does not have an idea of what the results should look like, they should use a supervised method, as this is more accurate.

If the dataset has missing data, this can affect the performance of the classifier. An instance may not be placed into the right class, or the classifier may not display how many values were missing, so the user may not notice the problem. One method which has been used to treat missing values is deleting every instance that has a missing value for at least one feature. (The treatment of missing values and its effect, 2005)

Works Cited

Breast Cancer Wisconsin (Diagnostic) Data Set. (1995, 11 01). Retrieved 10 2010, from Machine Learning Repository: http://archive.ics.uci.edu/ml/machine-learning-databases/breastcancer-wisconsin/wdbc.names

The treatment of missing values and its effect. (2005). Retrieved 12 2010, from University of Puerto Rico: http://academic.uprm.edu/~eacuna/IFCS04r.pdf

Attribute. (2010, 11 17). Retrieved 10 2010, from Wikipedia: http://en.wikipedia.org/wiki/Attribute_(computing)

Supervised learning. (2010, 12 02). Retrieved 12 2010, from Wikipedia: http://en.wikipedia.org/wiki/Supervised_learning

Unsupervised learning. (2010, 10 04). Retrieved 12 2010, from Wikipedia: http://en.wikipedia.org/wiki/Unsupervised_learning

Weka. (2010). Retrieved 10 2010, from Machine Learning Group at University of Waikato: http://www.cs.waikato.ac.nz/~ml/weka/
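
As a supplement to Task 5, the conversion of the raw data file to ARFF could also be scripted rather than done by hand. The following is only a minimal sketch under stated assumptions, not the procedure used in this assignment. The function names arff_header and to_arff, the relation name wdbc, and the attribute names are hypothetical. The names assume the column order given in the data set description file: an ID, the diagnosis (M or B), then ten measurements each reported as a mean, a standard error and a worst value.

```python
# Ten base measurements (assumed from the data set description file); each is
# reported as a mean, a standard error and a "worst" value: 30 numeric features.
BASE = ["radius", "texture", "perimeter", "area", "smoothness",
        "compactness", "concavity", "concave_points", "symmetry",
        "fractal_dimension"]
STATS = ["mean", "se", "worst"]


def arff_header(relation="wdbc"):
    """Build the ARFF header: one @attribute line per column, then @data."""
    lines = [f"@relation {relation}", ""]
    lines.append("@attribute id numeric")
    # The class attribute is nominal, so its two values are listed in braces.
    lines.append("@attribute diagnosis {M,B}")
    for stat in STATS:
        for name in BASE:
            lines.append(f"@attribute {name}_{stat} numeric")
    lines += ["", "@data"]
    return "\n".join(lines) + "\n"


def to_arff(data_text, relation="wdbc"):
    """Prepend the header to the comma-separated rows, skipping blank lines."""
    rows = [ln.strip() for ln in data_text.splitlines() if ln.strip()]
    return arff_header(relation) + "\n".join(rows) + "\n"


# Usage (file names are hypothetical):
# with open("wdbc.data") as src, open("wdbc.arff", "w") as dst:
#     dst.write(to_arff(src.read()))
```

The header produced this way declares 32 attributes, matching the count given in Task 4. The data rows themselves are copied unchanged, since the raw file is already comma-separated.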