Machine Learning Topics and Weka Software

Computational Intelligence in Biomedical and Health Care Informatics
HCA 590 (Topics in Health Sciences)
Rohit Kate

Machine Learning: Some Topics and Weka Software

Learning Curves
• Train the classifier with increasing numbers of training examples and plot accuracy vs. size of the training set
• Helps to answer:
– Has maximum accuracy nearly been reached, or will more training examples help?
– Is one technique better when training data is limited?
• Most learners eventually converge to the maximum accuracy given sufficient training examples
[Figure: test accuracy (axis up to 100%) vs. # training examples for Method 1 and Method 2, with a line marking the maximum accuracy]

Comparing Learning Curves
• The gap between two learning curves usually has a “banana shape”
• Often a better picture emerges if learning curves are compared “horizontally” (training examples needed to reach a given accuracy) instead of “vertically” (accuracy at a given training-set size)
[Figure: test accuracy vs. # training examples for Method 1 and Method 2; the 85% accuracy level is marked, with 300 and 600 on the training-examples axis]
• Method 1 can achieve 85% accuracy with half the training data needed by Method 2!

Datasets
• Datasets are important for empirically evaluating machine learning techniques
• It is important to test techniques on a variety of domains; testing on 20+ datasets is common
• A variety of datasets are freely available:
– UCI Machine Learning Repository: http://www.ics.uci.edu/~mlearn/MLRepository.html
– KDD Cup (large datasets for data mining): http://www.kdnuggets.com/datasets/kddcup.html

Which is the Best Machine Learning Technique?
• There is no single machine learning technique that performs better than every other technique on every dataset
• One can always come up with a dataset on which a particular machine learning technique will do miserably
– Take the technique's predictions, flip them, and call the flipped labels the correct answers
• Strictly speaking, there is no basis for preferring one label over another when classifying a never-before-seen test example, even after seeing a lot of training data
– It is unknown so it could be anything!
• Hence every machine learning technique makes some assumptions (“bias”) that help it generalize from training data to test data

Which is the Best Machine Learning Technique?
• Depending on how well the assumptions of a machine learning technique hold in a given dataset, some techniques perform better than others
• Assumptions:
– Naïve Bayes & Bayesian networks: conditional independence assumptions
– SVM & NN: a hyperplane can separate the examples
– Decision trees: some feature values separate the examples

Training Data
• Training data is critical for applying any machine learning technique
• Obtaining it is often the most difficult part
– Availability of data, particularly medical data
– Obtaining correct labels, which is often done manually by experts and is expensive and labor intensive
• As learning curves show, “more data is better data”
– But it is expensive to get more training data
• Some approaches have been designed to compensate for the lack of training data

Various Forms of Supervision
• If all the training data have correct labels, then it is called supervised learning
• Some methods also utilize unlabeled training data in addition to the labeled data; this is called semi-supervised learning
– Most learning methods can be extended to leverage unlabeled training data
– Predict labels for the unlabeled examples, take them as the correct labels, and train again; iterate a few times
• Often helps as if by magic!
• Some methods, like clustering examples into groups, learn completely unsupervised, but they are useful only in limited situations

Weka: The Most Well-Known Machine Learning Software
• Freely available
• Includes several machine learning techniques
• Download from the website: http://www.cs.waikato.ac.nz/ml/weka/
• A tutorial (classification part only): http://prdownloads.sourceforge.net/weka/weka.ppt

ARFF Format for Data
• Once the data is in the ARFF format (attribute-relation file format), you can try out several machine learning techniques using Weka!
• See Weka tutorial slides 5 & 6
• More description of the ARFF format: http://weka.wikispaces.com/ARFF+%28book+version%29
• Plain text file (use Notepad etc. to open or create)
• Save with the .arff extension
• See several examples: http://repository.seasr.org/Datasets/UCI/arff/
• Comments start with the ‘%’ character
• Unknown values are marked by ‘?’
• If the last attribute is nominal then it is a classification task; if it is numeric then it is a regression task
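
The ARFF points above can be illustrated with a small Python sketch. The toy dataset, its attribute names, and the helper functions parse_arff and task_type are all made up for illustration; they are not part of Weka. The parser is deliberately minimal and assumes a simple file: it ignores quoted names, sparse rows, and string or date attributes.

```python
# Minimal sketch of reading simple ARFF text; NOT a complete ARFF parser
# (no quoted names, sparse rows, string or date attributes).

# Hypothetical toy dataset, invented for this example.
SAMPLE_ARFF = """\
% Comment lines start with the percent character
@relation patients

@attribute age numeric
@attribute smoker {yes,no}
@attribute diagnosis {sick,healthy}

@data
63,yes,sick
41,no,healthy
?,no,healthy
"""


def parse_arff(text):
    """Return (relation, attributes, rows) from simple ARFF text."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):   # blank line or comment
            continue
        if in_data:
            # '?' marks an unknown value; store it as None
            rows.append([None if v.strip() == "?" else v.strip()
                         for v in line.split(",")])
            continue
        keyword = line.split(None, 1)[0].lower()
        if keyword == "@relation":
            relation = line.split(None, 1)[1]
        elif keyword == "@attribute":
            _, name, kind = line.split(None, 2)
            if kind.startswith("{"):           # nominal: list of allowed values
                kind = [v.strip() for v in kind.strip("{}").split(",")]
            else:
                kind = kind.lower()
            attributes.append((name, kind))
        elif keyword == "@data":
            in_data = True
    return relation, attributes, rows


def task_type(attributes):
    """Nominal last attribute -> classification; numeric -> regression."""
    last_kind = attributes[-1][1]
    if isinstance(last_kind, list):
        return "classification"
    if last_kind in ("numeric", "real", "integer"):
        return "regression"
    return "unknown"


relation, attributes, rows = parse_arff(SAMPLE_ARFF)
print(relation, task_type(attributes), rows[2])
```

Saving the text in SAMPLE_ARFF to a file with the .arff extension should give a file of the kind described above; here the last attribute (diagnosis) is nominal, so the sketch reports a classification task.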

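Returning to the learning-curve slides, the procedure (train on increasing numbers of examples, measure test accuracy, then compare curves "horizontally") can be sketched in plain Python. Everything here is an assumption made for illustration: the synthetic one-feature data, the two stand-in learners (nearest centroid and 1-nearest-neighbour), the chosen training sizes, and the 80% target. None of it comes from the slides or from Weka.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable


def make_examples(n):
    """Synthetic two-class data: one noisy feature per example."""
    examples = []
    for _ in range(n):
        label = random.choice([0, 1])
        x = random.gauss(2.0 * label, 1.0)   # class means 0.0 and 2.0
        examples.append((x, label))
    return examples


def train_centroid(train):
    """Nearest-centroid learner: remember the mean feature value per class."""
    means = {}
    for label in (0, 1):
        xs = [x for x, y in train if y == label]
        # Fall back to 0.0 if a class happens to be absent from a tiny sample
        means[label] = sum(xs) / len(xs) if xs else 0.0
    return lambda x: min(means, key=lambda c: abs(x - means[c]))


def train_one_nn(train):
    """1-nearest-neighbour learner: copy the label of the closest example."""
    return lambda x: min(train, key=lambda ex: abs(x - ex[0]))[1]


def accuracy(predict, test):
    """Fraction of test examples whose label is predicted correctly."""
    return sum(predict(x) == y for x, y in test) / len(test)


def learning_curve(trainer, pool, test, sizes):
    """Train on the first n pool examples for each n; return (n, accuracy) pairs."""
    return [(n, accuracy(trainer(pool[:n]), test)) for n in sizes]


def examples_needed(curve, target):
    """Horizontal reading: smallest training size whose accuracy reaches target."""
    for n, acc in curve:
        if acc >= target:
            return n
    return None   # target not reached within the sizes tried


pool, test = make_examples(400), make_examples(500)
sizes = [10, 25, 50, 100, 200, 400]
for name, trainer in [("centroid", train_centroid), ("1-NN", train_one_nn)]:
    curve = learning_curve(trainer, pool, test, sizes)
    print(name, [(n, round(acc, 2)) for n, acc in curve],
          "needs", examples_needed(curve, 0.80), "examples for 80%")
```

Plotting is left out; the (size, accuracy) pairs could be passed to any plotting tool. The examples_needed helper is the "horizontal" comparison: it asks how many training examples each method needs for a fixed accuracy, instead of how much accuracy each method gets at a fixed training-set size.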