Machine Learning Topics and Weka Software

Computational Intelligence in
Biomedical and Health Care Informatics
HCA 590 (Topics in Health Sciences)
Rohit Kate
Machine Learning: Some Topics
and Weka Software
Learning Curves
• Train the classifier with an increasing number of training
examples and plot accuracy vs. size of training set
• Helps to answer:
– Has maximum accuracy nearly been reached, or will more
training examples help?
– Is one technique better when training data is limited?
• Most learners eventually converge to the maximum accuracy
given sufficient training examples
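A learning curve can be sketched in a few lines of Python. The example below is only an illustration under stated assumptions: the data is synthetic (two overlapping classes on one numeric feature), and the learner is a simple nearest-centroid classifier chosen to keep the code self-contained. None of the function names come from the slides.

```python
import random

def make_data(n, rng):
    """Synthetic 2-class data (an assumption): one noisy numeric feature per example."""
    data = []
    for _ in range(n):
        label = rng.choice([0, 1])
        # Class 0 is centered at 0.0 and class 1 at 2.0, so the classes overlap
        x = rng.gauss(2.0 * label, 1.0)
        data.append((x, label))
    return data

def train_centroids(train):
    """Nearest-centroid learner: store the mean feature value of each class seen."""
    sums, counts = {0: 0.0, 1: 0.0}, {0: 0, 1: 0}
    for x, y in train:
        sums[y] += x
        counts[y] += 1
    return {y: sums[y] / counts[y] for y in sums if counts[y] > 0}

def predict(centroids, x):
    """Predict the class whose centroid is closest to x."""
    return min(centroids, key=lambda y: abs(x - centroids[y]))

def accuracy(centroids, test):
    return sum(predict(centroids, x) == y for x, y in test) / len(test)

def learning_curve(train, test, sizes):
    """Train on the first n examples for each n and record test accuracy."""
    return [(n, accuracy(train_centroids(train[:n]), test)) for n in sizes]

rng = random.Random(0)  # fixed seed so the run is repeatable
train, test = make_data(400, rng), make_data(1000, rng)
curve = learning_curve(train, test, sizes=[5, 10, 25, 50, 100, 200, 400])
for n, acc in curve:
    print(f"{n:4d} training examples -> test accuracy {acc:.3f}")
```

Plotting the (size, accuracy) pairs in `curve` gives the learning curve. The test set is held fixed so that points on the curve are comparable.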
[Figure: test accuracy (up to 100%) vs. # training examples for Method 1 and Method 2, with the maximum accuracy marked]
Comparing Learning Curves
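• The gap between two learning curves usually has a “banana shape”
• Often a better picture emerges if learning
curves are compared “horizontally” (training examples needed for a
given accuracy) instead of “vertically” (accuracy at a given
training set size)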
[Figure: test accuracy vs. # training examples for Method 1 and Method 2; axis marks at 85% accuracy and at 300 and 600 training examples]
Method 1 can achieve 85% accuracy with
half the training data needed by Method 2!
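A horizontal comparison asks how many training examples each method needs to reach a fixed accuracy. The sketch below assumes each learning curve is a list of (size, accuracy) points and uses linear interpolation between points. The curve values are hypothetical, made up so that the 85% level falls near the 300 and 600 marks mentioned on the slide. They are not measured results.

```python
def examples_needed(curve, target):
    """Smallest training-set size at which the curve reaches `target` accuracy.

    `curve` is a list of (num_examples, accuracy) points sorted by size.
    Linear interpolation is used between neighboring points.
    Returns None if the target is never reached.
    """
    prev_n, prev_acc = None, None
    for n, acc in curve:
        if acc >= target:
            if prev_n is None:
                return float(n)  # target already met at the first point
            frac = (target - prev_acc) / (acc - prev_acc)
            return prev_n + frac * (n - prev_n)
        prev_n, prev_acc = n, acc
    return None

# Hypothetical learning curves (illustrative numbers only)
method1 = [(100, 0.70), (200, 0.80), (400, 0.90), (800, 0.93)]
method2 = [(100, 0.55), (200, 0.65), (400, 0.75), (800, 0.95)]

n1 = examples_needed(method1, 0.85)
n2 = examples_needed(method2, 0.85)
print(f"Method 1 needs about {n1:.0f} examples, Method 2 about {n2:.0f}")
```

A vertical comparison of the same two curves would instead compare the accuracies at a single training-set size.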
Datasets
• Datasets are important for empirically
evaluating machine learning techniques
• It is important to test techniques on a variety of
domains. Testing on 20+ datasets is
common.
• A variety of datasets are freely available:
– UCI Machine Learning Repository
http://www.ics.uci.edu/~mlearn/MLRepository.html
– KDD Cup (large data sets for data mining)
http://www.kdnuggets.com/datasets/kddcup.html
Which is the Best Machine
Learning Technique?
• There is no single machine learning technique that
performs better than every other technique on every
dataset
• One can always construct a dataset on which a
particular machine learning technique will do
miserably
– Take its predictions, flip them, and call the flipped labels the correct answers
• Without assumptions, there is no basis for preferring one label over
another when classifying a never-before-seen test
example, even after seeing a lot of training data
– Its label is unknown, so it could be anything!
• Hence every machine learning technique makes some
assumptions (“bias”) that help it generalize from
training data to test data
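The "flip its predictions" argument can be made concrete. The sketch below assumes binary 0/1 labels and uses a deliberately simple majority-class learner as a stand-in. The same construction works for any deterministic classifier. All names and data here are illustrative.

```python
def majority_learner(train):
    """A simple stand-in learner: always predict the most common training label."""
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority

def adversarial_test_set(classifier, inputs):
    """Label each input with the opposite of the classifier's prediction (labels are 0/1)."""
    return [(x, 1 - classifier(x)) for x in inputs]

def accuracy(classifier, data):
    return sum(classifier(x) == y for x, y in data) / len(data)

train = [(1, 0), (2, 0), (3, 1), (4, 0)]          # made-up training data
clf = majority_learner(train)
test = adversarial_test_set(clf, inputs=[5, 6, 7, 8])
print(accuracy(clf, test))  # every prediction is wrong by construction: 0.0
```

Nothing in the training data rules out such a test set, which is why a learner needs assumptions to generalize at all.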
Which is the Best Machine
Learning Technique?
• Depending upon how well the assumptions of a
machine learning technique hold in a given
dataset, some techniques perform better than
others
Assumptions:
– Naïve Bayes & Bayesian networks: Conditional
independence assumptions
– SVM & NN: A hyperplane can separate the
examples
– Decision Trees: Some feature values separate the
examples
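As a small illustration of an assumption holding or failing, the sketch below trains a basic perceptron (a single linear unit, used here as a stand-in for a hyperplane-based learner) on two tiny datasets. AND can be separated by a line, while XOR cannot, so no setting of the weights classifies all four XOR points correctly. The function names and the epoch count are choices made for this example.

```python
def train_perceptron(data, epochs=50):
    """Classic perceptron update rule for 2-D inputs with labels 0/1."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), y in data:
            pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            err = y - pred          # 0 if correct, +1 or -1 if wrong
            w[0] += err * x1
            w[1] += err * x2
            b += err
    return w, b

def accuracy(model, data):
    w, b = model
    hits = sum((1 if w[0] * x1 + w[1] * x2 + b > 0 else 0) == y
               for (x1, x2), y in data)
    return hits / len(data)

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]  # linearly separable
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]  # not linearly separable

and_acc = accuracy(train_perceptron(AND), AND)
xor_acc = accuracy(train_perceptron(XOR), XOR)
print(f"AND accuracy: {and_acc:.2f}, XOR accuracy: {xor_acc:.2f}")
```

Accuracy is measured on the training points themselves, since the point is only whether a separating line exists.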
Training Data
• Training data is critical for applying any machine
learning technique
• Obtaining it is often the most difficult part
– Data may not be available, particularly medical data
– Correct labels are often assigned manually by
experts, which is expensive and labor-intensive
• As learning curves show, “more data is better
data”
– But it is expensive to get more training data
• Some approaches have been designed to
compensate for the lack of training data
Various Forms of Supervision
• If all the training data have correct labels, it is
called supervised learning
• Some methods also use unlabeled training data in
addition to the labeled data; this is called semi-supervised learning
– Most learning methods can be extended to leverage
unlabeled training data
– Predict labels for the unlabeled examples, treat them as the
correct labels, and train again; iterate a few times
• Often helps as if by magic!
• Some methods, like clustering examples into groups,
learn completely unsupervised, but they are useful only
in limited situations
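The predict-and-retrain loop described above for semi-supervised learning can be sketched as follows. This is a minimal version under stated assumptions: 1-D inputs, a nearest-centroid learner chosen for brevity, and made-up data. Practical variants often keep only the predictions the model is most confident about, which this sketch does not do.

```python
def fit_centroids(examples):
    """Nearest-centroid learner on 1-D inputs: mean feature value per class."""
    by_label = {}
    for x, y in examples:
        by_label.setdefault(y, []).append(x)
    return {y: sum(xs) / len(xs) for y, xs in by_label.items()}

def predict(centroids, x):
    """Predict the class whose centroid is closest to x."""
    return min(centroids, key=lambda y: abs(x - centroids[y]))

def self_train(labeled, unlabeled, rounds=3):
    """Train, label the unlabeled pool with the model's own predictions, retrain."""
    model = fit_centroids(labeled)
    for _ in range(rounds):
        pseudo = [(x, predict(model, x)) for x in unlabeled]
        model = fit_centroids(labeled + pseudo)
    return model

labeled = [(0.0, "neg"), (10.0, "pos")]        # two hand-labeled examples (made up)
unlabeled = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]     # unlabeled pool (made up)
model = self_train(labeled, unlabeled)
print(model)
```

In this toy run the unlabeled points pull each class centroid toward the middle of its cluster, which the two labeled examples alone could not show.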
Weka: The Most Well-Known
Machine Learning Software
• Freely available
• Includes several machine learning techniques
• Download from the website:
http://www.cs.waikato.ac.nz/ml/weka/
• A tutorial (only classification part):
http://prdownloads.sourceforge.net/weka/weka.ppt
ARFF Format for Data
• Once the data is in the ARFF format (attribute-relation file
format), you can play with several machine learning
techniques using Weka!
• See Weka tutorial slides 5 & 6
• More description of the ARFF format:
http://weka.wikispaces.com/ARFF+%28book+version%29
• Plain text file (use Notepad or any text editor to open or create it)
• Save with .arff extension
• See several examples:
http://repository.seasr.org/Datasets/UCI/arff/
• Comments begin with the ‘%’ character
• Unknown values are marked by ‘?’
• If the last attribute is nominal, it is a classification task;
if it is numeric, it is a regression task
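To make the format concrete, the sketch below embeds a small ARFF file as a string and reads it with a minimal parser. The dataset (relation name, attributes, and rows) is hypothetical and made up for illustration. The parser is not Weka's reader and handles only simple files with no quoted values, sparse data, or date attributes.

```python
ARFF_TEXT = """\
% Hypothetical toy dataset, made up for illustration
@relation patients

@attribute age numeric
@attribute smoker {yes,no}
@attribute diagnosis {healthy,sick}

@data
45,yes,sick
30,no,healthy
?,no,healthy
"""

def parse_arff(text):
    """Minimal reader for simple ARFF text; returns (attributes, rows)."""
    attributes, rows, in_data = [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comment lines
        if in_data:
            # '?' marks an unknown value; represent it as None
            rows.append([None if v.strip() == "?" else v.strip()
                         for v in line.split(",")])
        elif line.lower().startswith("@attribute"):
            _, name, attr_type = line.split(None, 2)
            attributes.append((name, attr_type))
        elif line.lower().startswith("@data"):
            in_data = True
    return attributes, rows

def task_type(attributes):
    """Classification if the last attribute is nominal ({...}), regression if numeric."""
    last_type = attributes[-1][1].lower()
    if last_type.startswith("{"):
        return "classification"
    if last_type in ("numeric", "real", "integer"):
        return "regression"
    return "unknown"

attributes, rows = parse_arff(ARFF_TEXT)
print(task_type(attributes), len(rows), "rows")
```

Saving the text of `ARFF_TEXT` to a file with the .arff extension gives a file in the same layout as the examples linked above.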