BCB 444/544 Lab 10 (11/8/07) Machine Learning
Due Monday 11/12/07 by 5 pm – email to terrible@iastate.edu

Objectives
1. Experiment with applying machine learning algorithms to biological problems.
2. Learn how to set up a machine learning experiment.

Introduction
Machine learning combines principles from computer science, statistics, psychology, and other disciplines to develop computer programs for specific tasks. The tasks that machine learning programs have been developed for vary widely, from diagnosing cancer to driving a car. In biology, machine learning approaches are very popular for problems such as protein secondary structure prediction, gene prediction, and microarray data analysis, among many others. Machine learning is often quite effective, especially on problems with a lot of available data, and molecular biology certainly has lots of data.

A note about our training and test set files: The data set we are using in this lab is a set of RNA-binding proteins. Each input example is a window of 15 amino acids from a protein sequence, together with a label of 1 or 0 indicating whether the central amino acid of the window binds RNA (1 means binding, 0 means non-binding). The training set contains an equal number of RNA-binding and non-binding residues, which is not the natural distribution: the full data set contains only about 20% binding residues. We are using a balanced set to make things a little easier. The test set is a single protein sequence, the 50S ribosomal protein L20 from E. coli. We will use this test case to see how well the classifiers we build perform on a protein sequence that is not in the training set.

Exercises
Before we get started on the exercises, we need to learn a little about machine learning experiments. The first concept is training and testing: to estimate the performance of any classifier, we need to train it on some data and then measure its performance on some other data.
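The rule above (evaluate only on data the classifier never saw during training) can be sketched in a few lines of plain Python. This is an illustrative sketch, not part of the lab's required work; the function name and toy data are made up for the example:

```python
import random

def train_test_split(examples, test_fraction=0.2, seed=0):
    """Shuffle the examples, then hold out a fraction for testing.
    The two sets are disjoint: no example appears in both."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

examples = list(range(100))        # stand-ins for labeled residues
train, test = train_test_split(examples)
assert not set(train) & set(test)  # never evaluate on training data
```

Cross validation, which we read about next, is a way of repeating this split so that every example gets a turn in the test set.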
There are a few ways to do this. In the lab today, we will use both cross validation and a separate test set. Go to http://en.wikipedia.org/wiki/Cross_validation and read about cross validation.

1. What is K-fold cross validation? What data is used for training? What data is used for testing?

The most important point in training and testing is that the same data can never appear in both the training set and the test set. Usually we have limited data and want to use as much of it as possible for training, which is why we run cross validation experiments. In today's lab, we will also give ourselves the luxury of a test case that is not in the training set.

Algorithms
Read the sections on Naïve Bayes, J48 decision tree, and SVM here: http://www.d.umn.edu/~padhy005/Chapter5.html

2. What assumption is used in the Naïve Bayes classifier?
3. What criterion does the decision tree classifier use to decide which attribute to put first in the decision tree?
4. What is the purpose of the kernel function in an SVM classifier?
5. Based on what you read, which method(s) can a human interpret? Which method(s) can a human not interpret, i.e., "black box" method(s)?
6. According to this web page, which algorithm tends to have the highest classification accuracy?

Experiments
In this lab we will be using the program Weka, which provides implementations of many machine learning algorithms in a standard framework that makes it easy to experiment with different methods. If you are in the computer lab in 1340 MBB, Weka is already installed on the machines. If you are working from home, you will have to download and install Weka yourself. Weka is available at: http://www.cs.waikato.ac.nz/ml/weka/ The instructions there should be fairly easy to follow for installing Weka on your computer. If you have trouble, send me an email and I may be able to help you, or come into the lab and use the machines here.
The lab in MBB is open most of the time; our class is the only one that currently uses this room.

Running Weka:
Some final notes on what we will do with Weka before the instructions for how to do it. First, Weka implements a lot of different algorithms. We will use Naïve Bayes, the J48 decision tree, and an SVM, and then you will get to choose a fourth algorithm. Each algorithm has quite a few parameters that can be changed, and as we have seen all semester, changing parameters can drastically change the results. That said, we will accept the default parameters for all of the algorithms, with only one small tweak that will be described later. Second, Weka allows you to run cross validation experiments or use a supplied test set (along with some other options); we will do both in this lab.

Finally, a short primer on how to read the results Weka gives you. A typical results screen (not reproduced here) includes a lot of information, some useful, some not so useful. The top of the output shows information about your data set, the algorithm used, the model produced after training (which is really only useful if you are using an algorithm that is interpretable by humans), and the amount of time it took to build the model. The most useful section for our purposes is at the bottom, which holds the performance statistics. All of the results you are required to fill into the tables can be read directly from the output, as long as you know where to find them. The tables ask for accuracy, which the Weka output lists as "Correctly Classified Instances." The next entries in your results tables are TPRate, FPRate, Precision, and Recall, which can be read in the section called "Detailed Accuracy By Class." These values are listed for both classes (1 for RNA-binding and 0 for non-RNA-binding); I only want to see the values for class 1.
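All of the statistics above can be derived from the four confusion-matrix counts (TP, FP, FN, TN, defined in the next paragraph), which is handy for sanity-checking the numbers you copy into the tables. A sketch of the formulas (the function name and the example counts are made up for illustration):

```python
def classifier_stats(tp, fp, fn, tn):
    """Derive Weka's summary statistics from confusion-matrix counts.
    TPRate and Recall are the same quantity under two names."""
    total = tp + fp + fn + tn
    accuracy  = (tp + tn) / total  # "Correctly Classified Instances"
    tp_rate   = tp / (tp + fn)     # fraction of real positives found
    fp_rate   = fp / (fp + tn)     # fraction of negatives mislabeled
    precision = tp / (tp + fp)     # fraction of positive calls that are right
    recall    = tp_rate
    return accuracy, tp_rate, fp_rate, precision, recall

# A made-up matrix with 40 TP, 10 FP, 20 FN, 30 TN gives
# accuracy 0.7, FPRate 0.25, precision 0.8, TPRate = recall = 2/3.
print(classifier_stats(40, 10, 20, 30))
```

Note that precision and recall only describe the positive (class 1) predictions, which is why the lab asks for the class 1 row of "Detailed Accuracy By Class."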
The final numbers for your results tables are TP (true positive), FP (false positive), FN (false negative), and TN (true negative). TP means we predicted RNA-binding and the residue actually is RNA-binding. FP means we predicted RNA-binding and it is not actually RNA-binding. FN means we predicted non-RNA-binding and it actually is RNA-binding. TN means we predicted non-RNA-binding and it actually is non-RNA-binding. Our correct predictions are TP and TN; our incorrect predictions are FP and FN. The counts of TP, FP, FN, and TN can be found in the section called "Confusion Matrix." In Weka's confusion matrix, rows are the actual classes and columns are the predicted classes (see the "classified as" labels). So the TP count is in the row for class 1 and the column for class 1; FN is in the row for class 1, column for class 0; FP is in the row for class 0, column for class 1; and TN is in the row for class 0, column for class 0.

Finally, on to running some programs. On the lab machines, you can simply double-click on the weka.jar file on the desktop. Click on the button that says "Explorer" to get started. We will use the following files:

Training set
Test set

Click on the Open file button and choose the training set file. Click on the Classify tab to get to the classification algorithms. To choose the algorithm, click the Choose button near the top of the classifier section, then click bayes, then NaiveBayes. Be sure that Cross-validation is selected, and change the number in the box from 10 to 5. Then click the Start button to run the classifier, and record the performance in the table below. To run the predictions on the test case, click the circle next to Supplied test set, click the Set… button, and choose the test file. Then click Start to build the classifier and make predictions on our test case. Record the performance in the table below.

For our next algorithm, we will use the J48 decision tree. Click the Choose button near the top of the classifier section, then click trees, then J48. Be sure that Cross-validation is selected, and make sure the number in the box is 5.
Then click the Start button to run the classifier, and record the performance in the table below. Then, as before, click the circle next to Supplied test set, choose the test file with the Set… button, click Start, and record the test case performance in the table below.

For our next algorithm, we will use an SVM. Click the Choose button near the top of the classifier section, then click functions, then SMO. Be sure that Cross-validation is selected and the number in the box is 5, then click Start and record the performance in the table below. Then run the predictions on the test case with Supplied test set, as before, and record that performance as well.

Next, we will run the SVM algorithm again with a different kernel function. To change the kernel, click on the text box next to the Choose button at the top. This brings up a window showing the algorithm's parameters; near the bottom there is a line that says "useRBF." Change its value to true to use the RBF kernel function, and click OK. Be sure that Cross-validation is selected and the number in the box is 5, then click Start and record the performance in the table below. Then run the predictions on the test case with Supplied test set, as before, and record that performance as well.

Finally, choose a different algorithm and run both 5-fold cross validation and predictions on the test case, adding the performance values to the blank line at the bottom of each table.
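For some intuition about the useRBF tweak above: the RBF (radial basis function) kernel replaces the SVM's default linear kernel with a Gaussian similarity measure between feature vectors. A minimal sketch of the function itself, where gamma is an assumed width parameter chosen for illustration (Weka has its own default):

```python
import math

def rbf_kernel(x, y, gamma=0.1):
    """K(x, y) = exp(-gamma * ||x - y||^2): equals 1 for identical
    vectors and falls toward 0 as they move apart."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

print(rbf_kernel([1.0, 2.0], [1.0, 2.0]))  # identical points give 1.0
print(rbf_kernel([1.0, 2.0], [4.0, 6.0]))  # distant points give a value near 0
```

Because the kernel only measures similarity locally, an RBF SVM can draw curved decision boundaries that the linear kernel cannot, which is why swapping the kernel can change your results.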
Be sure to include the name of the algorithm you chose. You can choose any of the algorithms available in Weka. (Side note: some of the algorithms will not work on this data set and will produce an error message about incompatibility. If this happens to you, simply choose a different algorithm. Also, some algorithms run much faster than others; if the algorithm you chose is taking much longer than our SVM runs, you may want to choose a different one.)

To find out more about an algorithm, select it with the Choose button at the top, then click on the text box near the Choose button (just as we did for the SVM when we changed to the RBF kernel). In the window that opens, there is a section called "About" that gives a one-line description of the algorithm. Click the More button in this section to get more information, including a reference to a paper describing the algorithm. Another option is to do an internet search for the name of the algorithm.

5 Fold Cross Validation Results

Algorithm    Accuracy  TPRate  FPRate  Precision  Recall  TP  FP  FN  TN
NB
J48
SVM
SVM RBF
(your choice)

Test Case Results

Algorithm    Accuracy  TPRate  FPRate  Precision  Recall  TP  FP  FN  TN
NB
J48
SVM
SVM RBF
(your choice)

7. Which algorithm did the best, and under what conditions?
8. Did the cross validation results accurately indicate what performance on the test case would be?
9. Briefly describe what the algorithm you chose does.