
BCB 444/544
Lab 10 (11/8/07)
Machine Learning
Due Monday 11/12/07 by 5 pm – email to terrible@iastate.edu
Objectives
1. Experiment with applying machine learning algorithms to biological problems.
2. Learn about how to set up a machine learning experiment.
Introduction
Machine learning combines principles from computer science, statistics, psychology, and
other disciplines to develop computer programs for specific tasks. The tasks that
machine learning programs have been developed for vary widely, from diagnosing cancer
to driving a car. In biology, machine learning approaches are very popular for problems
such as protein secondary structure prediction, gene prediction, analyzing microarray
data, and many others. Machine learning is often quite effective, especially on problems
that have a lot of data available. Molecular biology certainly has lots of data.
A note about our training and test set files:
The data set we are using in this lab is a set of RNA-binding proteins. Our input data is
15 amino acids from the protein sequence and a label of 1 or 0 indicating whether the
central amino acid in the list of 15 binds to RNA or not (1 means binding, 0 means not
binding). The training set contains an equal number of RNA-binding and non-binding
residues, which is not the natural distribution. The entire data set contains only about
20% binding residues. We are using a set with equal numbers of binding and non-binding residues to make things a little easier. The test set is a single protein sequence,
the 50S ribosomal protein L20 from E. coli. We will use this test case to see how well
the classifiers we build perform on a protein sequence that is not in the training set.
Exercises
Before we get started on the exercises, we need to learn a little about machine learning
experiments. The first concept is training and testing. In order to estimate the
performance of any classifier, we need to train the classifier on some data and then
measure performance on some other data. There are a few ways to do this. In the lab
today, we will use cross validation and a separate test set.
Go to http://en.wikipedia.org/wiki/Cross_validation and read about cross validation.
1. What is K-fold cross validation? What data is used for training? What data is
used for testing?
The most important point in training and testing is that the same data can never be used
in both the training set and the test set. Usually, we have limited data and want to use as
much as possible in training, which is why we do cross validation experiments. In our
lab today, we will give ourselves the luxury of a test case that is not in the training set to
test our classifiers on.
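To make the idea of K-fold cross validation concrete, here is a minimal sketch using Weka's Java API. The file name rna_train.arff is just a placeholder for our training set file, and we assume the class label (1 or 0) is the last attribute; the sketch only prints how the data is split in each round.

    import java.util.Random;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class FoldSplitSketch {
        public static void main(String[] args) throws Exception {
            // Load the training data; the file name is a placeholder for our training set.
            Instances data = new DataSource("rna_train.arff").getDataSet();
            data.setClassIndex(data.numAttributes() - 1); // assume the 1/0 label is last
            data.randomize(new Random(1));                // shuffle before splitting into folds

            int k = 5; // K-fold cross validation with K = 5
            for (int fold = 0; fold < k; fold++) {
                // Each round trains on K-1 folds and tests on the one held-out fold,
                // so every instance is used for testing exactly once.
                Instances train = data.trainCV(k, fold);
                Instances test  = data.testCV(k, fold);
                System.out.printf("Fold %d: %d training instances, %d test instances%n",
                        fold + 1, train.numInstances(), test.numInstances());
            }
        }
    }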
Algorithms
Read the sections on Naïve Bayes, J48 Decision tree, and SVM here:
http://www.d.umn.edu/~padhy005/Chapter5.html
2. What assumption is used in the Naïve Bayes classifier?
3. What criterion does the decision tree classifier use to decide which attribute to
put first in the decision tree?
4. What is the purpose of the kernel function in a SVM classifier?
5. Based on what you read, which method(s) can a human interpret? What
method(s) can a human not interpret, i.e., “black box” method(s)?
6. According to this web page, which algorithm tends to have the highest
classification accuracy?
Experiments
In this lab, we will be using the program Weka. Weka is a program that contains
implementations of many machine learning algorithms in a standard framework that
makes it easy to experiment with many methods. If you are in the computer lab in 1340
MBB, Weka is already installed on the machines. If you are working from home, you
will have to download and install Weka.
Weka is available at:
http://www.cs.waikato.ac.nz/ml/weka/
The instructions should be fairly easy to follow for installing Weka on your computer. If
you have trouble, send me an email and I may be able to help you. Or come into the lab
and use the machines here. The lab in MBB is open most of the time; our class is the
only one that currently uses this room.
Running Weka:
A few final notes on what we will do with Weka, before the instructions for how to do it.
First, Weka implements a lot of different algorithms. We will use Naïve Bayes, J48
decision tree, and SVM and then you will get to choose a fourth algorithm. Each
algorithm has quite a few parameters that can be changed, and as we have seen all
semester, changing parameters can drastically change the results. That being said, we
will accept the default parameters for all of the algorithms, with only one small tweak
that will be described later.
Second, Weka allows you to run cross validation experiments or use a supplied test set
(along with some other options). We will do both in this lab.
Finally, here is a short primer on how to read the results that Weka gives you.
The output includes a lot of information, some useful, some not so useful. The top of the
output shows the information about your data set, the algorithm used, information about
the model produced after training (which is really only useful if you are using an
algorithm that is interpretable by humans), and finally the amount of time it took to build
the model. The most useful section for our purposes is at the bottom, which has the
performance statistics. For this lab, all of the results you are required to fill into the
tables can be read directly from the output as long as you know where to find them. The
table asks for accuracy, which in the Weka output is listed as “Correctly Classified
Instances.” The next entries in your results table are TPRate, FPRate, Precision, and
Recall, which can be read in the section called “Detailed Accuracy By Class.” These
values are listed for both classes (1 and 0 for RNA-binding and non-RNA-binding
respectively). I only want to see the values for class 1.
The final numbers for your results table are TP (true positive), FP (false positive), FN
(false negative), and TN (true negative). TP means we predicted RNA-binding and it
actually is RNA-binding. FP means we predicted RNA-binding and it is not actually
RNA-binding. FN means we predicted non-RNA-binding and it actually is RNA-binding. TN means we predicted non-RNA-binding and it actually is non-RNA-binding.
Our correct predictions are TP and TN, our incorrect predictions are FP and FN. The
counts of TP, FP, FN, and TN can be found in the section called “Confusion Matrix.”
Our confusion matrix shows four numbers. In Weka the rows are the actual classes and the
columns are the predicted classes, so with class 1 listed first the top left number is TP, the
top right is FN, the bottom left is FP, and the bottom right is TN.
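All of the numbers in the results tables are simple functions of these four counts for class 1. The sketch below shows the relationships; the counts used here are made up purely for illustration.

    public class MetricsFromCounts {
        public static void main(String[] args) {
            // Hypothetical counts read from a confusion matrix.
            double tp = 40, fp = 10, fn = 15, tn = 35;

            double accuracy  = (tp + tn) / (tp + fp + fn + tn); // "Correctly Classified Instances" as a fraction
            double tpRate    = tp / (tp + fn);                  // also called recall or sensitivity
            double fpRate    = fp / (fp + tn);                  // fraction of non-binding residues predicted as binding
            double precision = tp / (tp + fp);                  // fraction of binding predictions that are correct
            double recall    = tpRate;

            System.out.printf("Accuracy=%.3f TPRate=%.3f FPRate=%.3f Precision=%.3f Recall=%.3f%n",
                    accuracy, tpRate, fpRate, precision, recall);
        }
    }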
Finally, on to running some programs. For the lab machines, you can simply double-click on the weka.jar file on the desktop.
Click on the button that says “Explorer” to get started.
We will use the following files:
Training set
Test set
Click on the Open file button and choose the training set file.
Click on the Classify tab to get to the classification algorithms.
To choose the algorithm, click on the Choose button near the top in the classifier section.
Click on bayes, then NaiveBayes. Be sure that Cross validation is selected, and change
the number in the box from 10 to 5. Then click on the Start button to run the classifier.
Record the performance in the table below.
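For reference, the same experiment can be run through Weka's Java API instead of the Explorer. This is only a sketch: the file name rna_train.arff is a placeholder for our training set, and it assumes the class label is the last attribute and that the RNA-binding class is named "1" in the data file.

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class NaiveBayesCV {
        public static void main(String[] args) throws Exception {
            Instances train = new DataSource("rna_train.arff").getDataSet(); // placeholder file name
            train.setClassIndex(train.numAttributes() - 1);

            NaiveBayes nb = new NaiveBayes();             // default parameters, as in the Explorer
            Evaluation eval = new Evaluation(train);
            eval.crossValidateModel(nb, train, 5, new Random(1)); // 5-fold cross validation

            int pos = train.classAttribute().indexOfValue("1");   // assumes the binding class is labeled "1"
            System.out.println(eval.toSummaryString());           // includes "Correctly Classified Instances"
            System.out.printf("Class 1: TPRate=%.3f FPRate=%.3f Precision=%.3f Recall=%.3f%n",
                    eval.truePositiveRate(pos), eval.falsePositiveRate(pos),
                    eval.precision(pos), eval.recall(pos));
            System.out.println(eval.toMatrixString());            // confusion matrix
        }
    }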
To run the predictions on the test case, click the circle next to Supplied test set, then click
the Set… button and choose the test file. Then click Start to build the classifier and make
predictions on our test case. Record the performance in the table below.
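The supplied-test-set run can be sketched through the Java API in the same way; the two file names below are placeholders for our training and test set files.

    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class NaiveBayesTestSet {
        public static void main(String[] args) throws Exception {
            // File names are placeholders for the training set and the L20 test protein.
            Instances train = new DataSource("rna_train.arff").getDataSet();
            Instances test  = new DataSource("rna_test.arff").getDataSet();
            train.setClassIndex(train.numAttributes() - 1);
            test.setClassIndex(test.numAttributes() - 1);

            NaiveBayes nb = new NaiveBayes();
            nb.buildClassifier(train);           // build the model on the training set only

            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(nb, test);        // predict every residue of the test protein
            System.out.println(eval.toSummaryString());
            System.out.println(eval.toMatrixString());
        }
    }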
For our next algorithm, we will use the J48 decision tree. To choose the algorithm, click
on the Choose button near the top in the classifier section. Click on trees, then J48. Be
sure that Cross validation is selected, and make sure the number in the box is 5. Then
click on the Start button to run the classifier. Record the performance in the table below.
To run the predictions on the test case, click the circle next to Supplied test set, then click
the Set… button and choose the test file. Then click Start to build the classifier and make
predictions on our test case. Record the performance in the table below.
For our next algorithm, we will use a SVM. To choose the algorithm, click on the
Choose button near the top in the classifier section. Click on functions, then SMO. Be
sure that Cross validation is selected, and make sure the number in the box is 5. Then
click on the Start button to run the classifier. Record the performance in the table below.
To run the predictions on the test case, click the circle next to Supplied test set, then click
the Set… button and choose the test file. Then click Start to build the classifier and make
predictions on our test case. Record the performance in the table below.
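In the API sketches above, the only line that changes between these experiments is the one that creates the classifier; a minimal illustration:

    import weka.classifiers.Classifier;
    import weka.classifiers.functions.SMO;
    import weka.classifiers.trees.J48;

    public class ClassifierChoices {
        // Swap one of these into the cross validation or test set sketches above.
        static Classifier decisionTree()  { return new J48(); } // "trees > J48" in the Explorer
        static Classifier supportVector() { return new SMO(); } // "functions > SMO" in the Explorer
    }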
Next, we will run the SVM algorithm again using a different kernel function. To change
the kernel function, click on the text box next to the Choose button at the top. This will
bring up a window showing the algorithm parameters. At the bottom of this window,
there is a line that says “useRBF.” Change the value in this box to true to use the RBF
kernel function. Click OK. Be sure that Cross validation is selected, and make sure the
number in the box is 5. Then click on the Start button to run the classifier. Record the
performance in the table below.
To run the predictions on the test case, click the circle next to Supplied test set, then click
the Set… button and choose the test file. Then click Start to build the classifier and make
predictions on our test case. Record the performance in the table below.
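For reference, here is a sketch of the RBF run through the Java API. It assumes a newer Weka release in which SMO takes a kernel object via setKernel; in the version described above, the equivalent change is the useRBF flag in the parameter window. The file name is again a placeholder.

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.functions.SMO;
    import weka.classifiers.functions.supportVector.RBFKernel;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class SmoRbfSketch {
        public static void main(String[] args) throws Exception {
            Instances train = new DataSource("rna_train.arff").getDataSet(); // placeholder file name
            train.setClassIndex(train.numAttributes() - 1);

            SMO smo = new SMO();
            smo.setKernel(new RBFKernel());   // switch from the default polynomial kernel to RBF

            Evaluation eval = new Evaluation(train);
            eval.crossValidateModel(smo, train, 5, new Random(1)); // 5-fold cross validation
            System.out.println(eval.toSummaryString());
        }
    }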
Finally, choose a different algorithm and run both 5 fold cross validation and predictions
on the test case, and add the performance values to the tables in the blank line at the
bottom. Be sure to include the name of the algorithm you chose. You can choose any of
the algorithms available in Weka.
(Side note – some of the algorithms will not work on this data set. They will produce an
error message saying something about incompatibility. If this happens to you, simply
choose a different algorithm. Also, some algorithms run much faster than others. If the
algorithm you chose is taking much longer than our SVM runs, you may want to choose a
different algorithm.)
To find out more about the algorithms, you can select the algorithm with the Choose
button at the top, then click on the text box near the Choose button (just like we did for
the SVM when we changed to the RBF kernel). In the window that opens up, there is a
section called “About” that gives a one line description of the algorithm. Click on the
More button in this section to get more information, including a reference to a paper
describing the algorithm. Another option for finding out more about the algorithm is to
do an internet search with the name of the algorithm.
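Most Weka classifiers can also describe themselves in code; the small sketch below, using J48 as the example, prints the same text that appears under the About section and the More button in the Explorer.

    import weka.classifiers.trees.J48;

    public class AlgorithmInfo {
        public static void main(String[] args) {
            // globalInfo() returns the algorithm description, including its reference.
            System.out.println(new J48().globalInfo());
        }
    }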
5-Fold Cross Validation Results

Algorithm        | Accuracy | TPRate | FPRate | Precision | Recall | TP | FP | FN | TN
NB               |          |        |        |           |        |    |    |    |
J48              |          |        |        |           |        |    |    |    |
SVM              |          |        |        |           |        |    |    |    |
SVM (RBF)        |          |        |        |           |        |    |    |    |
(your algorithm) |          |        |        |           |        |    |    |    |

Test Case Results

Algorithm        | Accuracy | TPRate | FPRate | Precision | Recall | TP | FP | FN | TN
NB               |          |        |        |           |        |    |    |    |
J48              |          |        |        |           |        |    |    |    |
SVM              |          |        |        |           |        |    |    |    |
SVM (RBF)        |          |        |        |           |        |    |    |    |
(your algorithm) |          |        |        |           |        |    |    |    |
7. What algorithm did the best and under what conditions?
8. Did the cross validation results accurately indicate what the performance on the test
case would be?
9. Briefly describe what the algorithm you chose does.