CAP4770 – Introduction to Data Mining
Final Project Instructions

Data: We use a gene expression data set for the final project. The data are in attributes-in-rows format, comma-separated values. They can be downloaded from this link:
http://users.cis.fiu.edu/~lli003/teaching/hw-sol/finalproject_datafiles.zip
Username/Password: CAP4770/student

The zip file contains three files:
o train.csv: training data, consisting of 69 instances with 7,070 attributes.
o train_class.txt: training classes, giving the true label for each instance in the training data, in the same order. There are 5 classes in total: MED, MGL, RHB, JPA and EPD.
o test.csv: test data, consisting of 112 unlabeled instances with 7,070 attributes.

Goal: Learn the best classifier from the training data and use it to predict the classes of the test data.

Due Date: December 6th, 2012

Submission:
1. A project report describing how you built your classifier step by step. Specifically, your report should include: 1) how you did data cleaning; 2) how you did feature selection; 3) how you trained the classifier.
2. The predicted results (in a YourPantherID.txt file, one class per line in uppercase, in the same order as the test data).
3. Make sure that all the files you submit are zipped into one single file, named "CAP4770_finalproject_firstname_lastname_pantherid".

Important hints:
1. The training and test data are both in attribute-in-row format. You will probably need to transform the data into attribute-in-column format so that they can be properly fed into Weka.
2. You need to do some preprocessing, e.g., feature selection to choose the N attributes that are most important and predictive, using Weka or a Java program you write yourself.
3. You can use any classification method available in Weka or other tools, as long as you explain in your report why you chose it.
4. You can use ensemble classification to obtain a more robust classifier; however, you need to consider carefully, from a technical perspective, how to integrate the different classifiers to get better results.
5. It is better if I can reconstruct your classifier in my own environment; therefore, please provide specific instructions in your report on how to build your classifier.
6. The more accurate your classifier and the more detailed the explanation in your report, the more points you get.

The following steps suggest one way of finding the best classifier. Note that this is just one way of doing the project. You can certainly make improvements or use other approaches to obtain a classifier with higher accuracy.

1. Data cleaning
Threshold both the train and test data to a minimum value of 20 and a maximum of 16,000.

2. Selecting top genes (i.e., attributes) by class
You can use any feature selection method from Weka. Here is one method:
o Remove from the train data any gene whose fold difference across the samples is less than 2. (Note that fold difference is defined here as (max - min)/2, where max and min are the maximum and minimum values of the gene's expression over all instances, respectively. A fold difference of less than 2 means that the gene's value does not change much across the samples and, as such, cannot influence the class significantly.)
o For each class, generate subsets with the top 2, 4, 6, 8, 10, 12, 15, 20, 25, and 30 genes with the highest T-value.
o Optional: for each class, select the top genes by the highest absolute T-value (i.e.,
also include genes with a large negative T-value).
(Note that the T-value is calculated as
    T = (Avg1 - Avg2) / sqrt(Stdev1^2/N1 + Stdev2^2/N2),
where Avg1 is the average of the gene over the samples of one class and Avg2 is the average over the other 4 classes; Stdev1 is the standard deviation for that one class and Stdev2 is the standard deviation for the other classes; similarly, N1 is the number of samples belonging to the class whose T-value we are computing, and N2 is the number of samples that do not belong to that class. A small Java sketch of the fold-difference and T-value computations is given in the code sketches at the end of this document.)
o For each N = 2, 4, 6, 8, 10, 12, 15, 20, 25, 30, combine the top genes of each class into one file (removing duplicates, if any) and call the resulting file train_topN.csv.
o Add the class as the last column, remove the sample number, transpose each file to "attributes-in-columns" format, and convert it to .arff format.

3. Find the best classifier/best gene set combination
You can use any classifiers you want. For example, you can use:
o NaiveBayes
o J48
o IB1
o IBk (for each value of K = 2, 3, 4)
o one more Weka classifier of your choice that can work with multi-class data
a. For each classifier, using default settings, measure the classifier's accuracy on the training set using the previously generated files with the top N = 2, 4, 6, 8, 10, 12, 15, 20, 25, 30 genes. For IBk, test accuracy with K = 2, 3 and 4. (See the cross-validation sketch at the end of this document.)
b. Select the model and the gene set with the lowest cross-validation error. Optional: once you have found the gene set with the lowest cross-validation error, you can vary 1-2 additional relevant parameters for each classifier to see if the accuracy improves; e.g., for J48, you can vary reducedErrorPruning and binarySplits.
c. Take the gene names from the best train gene set and extract the data corresponding to these genes from the test set.
d. Convert the test set to genes-in-columns format.
e. Add a Class column with "?" values as the last column.

4. Generate predictions for the test set
You should now have the best train file, call it train.bestN.csv (with 69 samples and bestN genes, for whatever bestN you found), and a corresponding test file, call it test.bestN.csv, with the same genes and 112 test samples. The train file has all Class values, while the Class column of the test file contains only "?".
a. Convert the test file to .arff format (you should already have the .arff for the train file from Step 3). Important: in Weka, the attribute declarations must be exactly the same for the test and train files. To achieve that, change the Class entry in the header section of test.bestN.arff to be the same as in the train file, i.e.
    @attribute Class {MED,MGL,RHB,EPD,JPA}
b. Use the best train file and the matching test file to generate predictions for the test file's Class. (See the prediction sketch at the end of this document.)

5. Write a paper describing your effort.

The above steps are just one way to classify the test data. You can come up with your own solution, as long as the procedure is reasonable.
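
Code sketches (for illustration only):
The three sketches below show one possible way to carry out Steps 1-4 in Java with the Weka API. They are not the required solution; every class name, file name, and classifier choice in them is an assumption that you should replace with your own.

For Steps 1-2, the sketch below covers only the statistics used in the suggested method (thresholding, fold difference, and the per-class T-value). It assumes you have already parsed train.csv into a double[][] with one row per gene and one column per sample, and train_class.txt into a String[] of labels; the CSV parsing and file writing are left out, and the sample-variance form of the standard deviation is used.

    public class GeneStats {

        // Step 1: clamp a single expression value to the [20, 16000] range.
        // Apply this to every value of train and test before computing the statistics below.
        static double threshold(double v) {
            return Math.min(16000.0, Math.max(20.0, v));
        }

        // Fold difference of one gene, as defined in Step 2: (max - min) / 2
        static double foldDifference(double[] gene) {
            double max = Double.NEGATIVE_INFINITY, min = Double.POSITIVE_INFINITY;
            for (double v : gene) {
                max = Math.max(max, v);
                min = Math.min(min, v);
            }
            return (max - min) / 2.0;
        }

        // T-value of one gene for class c versus all the other classes:
        // (Avg1 - Avg2) / sqrt(Stdev1^2/N1 + Stdev2^2/N2)
        static double tValue(double[] gene, String[] labels, String c) {
            double sum1 = 0, sum2 = 0;
            int n1 = 0, n2 = 0;
            for (int s = 0; s < gene.length; s++) {
                if (labels[s].equals(c)) { sum1 += gene[s]; n1++; }
                else                     { sum2 += gene[s]; n2++; }
            }
            double avg1 = sum1 / n1, avg2 = sum2 / n2;
            double ss1 = 0, ss2 = 0;                    // sums of squared deviations
            for (int s = 0; s < gene.length; s++) {
                double d = gene[s] - (labels[s].equals(c) ? avg1 : avg2);
                if (labels[s].equals(c)) ss1 += d * d; else ss2 += d * d;
            }
            double var1 = ss1 / (n1 - 1);               // Stdev1^2
            double var2 = ss2 / (n2 - 1);               // Stdev2^2
            return (avg1 - avg2) / Math.sqrt(var1 / n1 + var2 / n2);
        }
    }

Rank the genes of each class by this T-value (or by its absolute value for the optional variant), keep the top N, take the union over the classes, and write the result as train_topN.csv as described in Step 2.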
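
For Steps 3a-3b, a minimal sketch of comparing the suggested classifiers with 10-fold cross-validation through the Weka API. The file names train_topN.arff are an assumed naming for the converted files from Step 2; adjust them to whatever you produced. IB1 is represented here as IBk with K = 1.

    import java.util.Random;
    import weka.classifiers.Classifier;
    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.classifiers.lazy.IBk;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class ModelSelection {
        public static void main(String[] args) throws Exception {
            int[] topN = {2, 4, 6, 8, 10, 12, 15, 20, 25, 30};
            for (int n : topN) {
                // Assumed file name from Step 2; change to your own.
                Instances data = DataSource.read("train_top" + n + ".arff");
                data.setClassIndex(data.numAttributes() - 1);   // Class is the last column

                String[] names = {"NaiveBayes", "J48", "IB1 (IBk K=1)", "IBk K=2", "IBk K=3", "IBk K=4"};
                Classifier[] models = {new NaiveBayes(), new J48(),
                                       new IBk(1), new IBk(2), new IBk(3), new IBk(4)};
                for (int i = 0; i < models.length; i++) {
                    Evaluation eval = new Evaluation(data);
                    eval.crossValidateModel(models[i], data, 10, new Random(1));  // 10-fold CV
                    System.out.printf("top %2d genes  %-15s  CV error = %.4f%n",
                                      n, names[i], eval.errorRate());
                }
            }
        }
    }

The gene set / classifier pair with the lowest printed cross-validation error is the one to carry forward to Step 4.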
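
For Step 4b, a minimal sketch of generating the prediction file with the Weka API. It assumes, purely for illustration, that J48 turned out to be the winning model and that the files are named train.bestN.arff and test.bestN.arff; substitute the classifier and the bestN files you actually selected in Step 3, and rename the output file to your own Panther ID.

    import java.io.PrintWriter;
    import weka.classifiers.Classifier;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class Predict {
        public static void main(String[] args) throws Exception {
            Instances train = DataSource.read("train.bestN.arff");  // from Step 3
            Instances test  = DataSource.read("test.bestN.arff");   // from Step 4a
            train.setClassIndex(train.numAttributes() - 1);
            test.setClassIndex(test.numAttributes() - 1);

            Classifier best = new J48();     // assumption: replace with your winning classifier
            best.buildClassifier(train);

            // One predicted class per line, uppercase, in test-set order
            PrintWriter out = new PrintWriter("YourPantherID.txt");
            for (int i = 0; i < test.numInstances(); i++) {
                double label = best.classifyInstance(test.instance(i));
                out.println(test.classAttribute().value((int) label).toUpperCase());
            }
            out.close();
        }
    }

The resulting text file matches the submission format described above: one uppercase class label per line, in the same order as the test instances.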