Computational Intelligence in Biomedical and Health Care Informatics
HCA 590 (Topics in Health Sciences)
Rohit Kate

Machine Learning: Some Topics and Weka Software

Learning Curves
• Train the classifier with increasing numbers of training examples and plot accuracy vs. size of training set
• Helps to answer:
  – Has maximum accuracy nearly been reached, or will more training examples help?
  – Is one technique better when training data is limited?
• Most learners eventually converge to the maximum accuracy given sufficient training examples
[Figure: learning curves for Method 1 and Method 2; y-axis: test accuracy (up to 100%, with the maximum accuracy marked); x-axis: # training examples]

Comparing Learning Curves
• Gap usually has a “banana shape”
• Often a better picture emerges if learning curves are compared “horizontally” instead of “vertically”
[Figure: learning curves for Method 1 and Method 2; y-axis: test accuracy, with 85% and the maximum accuracy marked; x-axis: # training examples, with 300 and 600 marked]
Method 1 can achieve 85% accuracy with half the training data needed by Method 2!

Datasets
• Datasets are important for empirically evaluating machine learning techniques
• It is important to test techniques on a variety of domains. Testing on 20+ datasets is common.
• Variety of freely available datasets
  – UCI Machine Learning Repository http://www.ics.uci.edu/~mlearn/MLRepository.html
  – KDD Cup (large datasets for data mining) http://www.kdnuggets.com/datasets/kddcup.html

Which is the Best Machine Learning Technique?
• There is no single machine learning technique that performs better than every other technique on every dataset
• One can always come up with a dataset on which a particular machine learning technique will do miserably
  – Flip its predictions and call them the correct answers
• As such, there is no basis for preferring one label over another when classifying a never-before-seen test example, even after seeing a lot of training data
  – It is unknown so it could be anything!
• Hence every machine learning technique makes some assumptions (“bias”) which help it generalize from training data to test data

Which is the Best Machine Learning Technique?
• Depending upon how well the assumptions of a machine learning technique hold in a given dataset, some techniques perform better than others
Assumptions:
  – Naïve Bayes & Bayesian networks: Conditional independence assumptions
  – SVM & NN: A hyperplane can separate the examples (possibly in a transformed feature space)
  – Decision Trees: Some feature values separate the examples

Training Data
• Training data is critical for applying any machine learning technique
• Obtaining it is often the most difficult part
  – Availability of data, particularly medical data
  – Obtaining correct labels, often done manually by experts, is expensive and labor intensive
• As learning curves show, “more data is better data”
  – But it is expensive to get more training data
• Some approaches have been designed to compensate for the lack of training data

Various Forms of Supervision
• If all the training examples have correct labels, the task is called supervised learning
• Some methods also utilize unlabeled training data in addition to the labeled data; this is called semi-supervised learning
  – Most learning methods can be extended to leverage unlabeled training data
  – Predict labels for unlabeled examples, take them as the correct labels, and train again; iterate a few times
• Often helps as if by magic!
• Some methods, like clustering examples into groups, learn completely unsupervised, but they are useful only in limited situations

Weka: The Most Well-Known Machine Learning Software
• Freely available
• Includes several machine learning techniques
• Download from the website: http://www.cs.waikato.ac.nz/ml/weka/
• A tutorial (only classification part): http://prdownloads.sourceforge.net/weka/weka.ppt

ARFF Format for Data
• Once the data is in the ARFF format (attribute-relation file format), you can play with several machine learning techniques using Weka!
• See Weka tutorial slides 5 & 6
• More description of the ARFF format: http://weka.wikispaces.com/ARFF+%28book+version%29
• Plain text file (use Notepad etc. to open or create)
• Save with .arff extension
• See several examples: http://repository.seasr.org/Datasets/UCI/arff/
• Comments after ‘%’ character
• Unknowns marked by ‘?’
• If the last attribute is nominal then it is a classification task; if it is numeric then it is a regression task
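
The ARFF points above can be illustrated with a small sketch. The toy file below (its relation, attribute names, and rows are invented for illustration) shows a ‘%’ comment, a ‘?’ unknown, and a nominal last attribute. The helper functions `last_attribute_type`, `task_type`, and `count_missing` are hypothetical names written for this example; they are not part of Weka, and they assume simple attribute names without spaces or quotes.

```python
# A toy ARFF file held in a string. The relation, attribute names, and rows
# are invented for illustration; they do not come from a real dataset.
ARFF_TEXT = """\
% Comments start with the '%' character
@relation toy_patients

@attribute age numeric
@attribute smoker {yes,no}
@attribute diagnosis {sick,healthy}

@data
63,yes,sick
41,no,healthy
57,?,sick
"""


def last_attribute_type(arff_text):
    """Return the declared type of the last @attribute line, or None.

    Simplification: assumes attribute names contain no spaces or quotes.
    """
    last_type = None
    for raw_line in arff_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        keyword = line.split(None, 1)[0].lower()
        if keyword == "@data":
            break  # header is over; the rest is data rows
        if keyword == "@attribute":
            parts = line.split(None, 2)  # keyword, name, type
            if len(parts) == 3:
                last_type = parts[2].strip()
    return last_type


def task_type(arff_text):
    """Nominal last attribute -> classification; numeric -> regression."""
    declared = last_attribute_type(arff_text)
    if declared is None:
        raise ValueError("no @attribute declarations found")
    if declared.startswith("{"):
        return "classification"
    if declared.lower() in ("numeric", "real", "integer"):
        return "regression"
    return "unknown"


def count_missing(arff_text):
    """Count '?' values in the rows that follow the @data line."""
    in_data = False
    missing = 0
    for raw_line in arff_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("%"):
            continue
        if not in_data:
            in_data = line.lower().startswith("@data")
            continue
        missing += sum(1 for value in line.split(",") if value.strip() == "?")
    return missing


if __name__ == "__main__":
    print(task_type(ARFF_TEXT))      # classification
    print(count_missing(ARFF_TEXT))  # 1
```

Saving the contents of `ARFF_TEXT` to a file with the .arff extension should give a file that can be opened in Weka; changing the last declaration to `@attribute diagnosis numeric` (with numeric values in the rows) would turn the same layout into a regression task.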