Assignment 1: Using the WEKA Workbench A. Become familiar with the use of the WEKA workbench to invoke several different machine learning schemes. Use latest stable version. Use both the graphical interface (Explorer) and command line interface (CLI). See Weka home page for Weka documentation. B. Use the following learning schemes, with the default settings to analyze the weather data (in weather.arff). For test options, first choose "Use training set", then choose "Percentage Split" using default 66% percentage split. Report model percent error rate. ZeroR (majority class) OneR Naive Bayes Simple J4.8 C. Which of these classifiers are you more likely to trust when determining whether to play? Why? D. What can you say about accuracy when using training set data and when using a separate percentage to train? Assignment 2: Preparing the data and mining it A. Take the file genes-leukemia.csv (here is the description of the data) and convert it to Weka file genes-a.arff. You can convert the file either using a text editor like emacs (brute force way) or find a Weka command that converts .csv file to .arff (a smart way). B. Target field is CLASS. Use J48 on genes-leukemia with "Use training set" option. C. Use genes-leukemia.arff to create two subsets: genes-leukemia-train.arff, with the first 38 samples (s1 ... s38) of the data genes-leukemia-test.arff, with the remaining 34 samples (s39 ... s72). D. Train J48 on genes-leukemia-train.arff and specify "Use training set" as the test option. What decision tree do you get? What is its accuracy? E. Now specify genes-leukemia-test.arff as the test set. What decision tree do you get and how does its accuracy compare to one in the previous question? F. Now remove the field "Source" from the classifier (unclick checkmark next to Source, and click on Apply Filter in the top menu) and repeat steps D and E. What do you observe? Does the accuracy on test set improve and if so, why do you think it does? G. Extra credit: which classifier gives the highest accuracy on the test set? Assignment 3: Data Cleaning and Preparing for Modeling The previous assignment was with the selected subset of top 50 genes for a particular Leukemia dataset. In this assignment you will be doing the work of real data miner, and you will be working with an actual genetic dataset, starting from the beginning. You will see that the process of data mining frequently has many small steps that all need to be done correctly to get good results. However tedious these steps may seem, the goal is a worthy one -- help make an early diagnosis for leukemia -- a common form of cancer. Making a correct diagnosis is literally a life and death decision, and so we need to be careful that we do the analysis correctly. 3A. Get data Take ALL_AML_original_data.zip file from Data directory and extract from it Train file: data_set_ALL_AML_train.txt Test file: data_set_ALL_AML_independent.txt Sample and class data: table_ALL_AML_samples.txt This data comes from pioneering work by Todd Golub et al at MIT Whitehead Institute (now MIT Broad Institute). 1. Rename the train file to ALL_AML_grow.train.orig.txt and test file to ALL_AML_grow.test.orig.txt . Convention: we use the same file root for files of similar type and use different extensions for different versions of these files. Here "orig" stands for original input files and "grow" stands for genes in rows. We will use extension .tmp for temporary files that are typically used for just one step in the process and can be deleted later. Note: The pioneering analysis of MIT biologists is described in their paper Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring (pdf). Both train and test datasets are tab-delimited files with 7130 records. The "train" file should have 78 fields and "test" 70 fields. The first two fields are Gene Description (a long description like GB DEF = PDGFRalpha protein) and Gene Accession Number (a short name like X95095_at) The remaining fields are pairs of a sample number (e.g. 1,2,..38) and an Affymetix "call" (P is gene is present, A if absent, M if marginal). Think of the training data as a very tall and narrow table with 7130 rows and 78 columns. Note that it is "sideways" from machine learning point of view. That is the attributes (genes) are in rows, and observations (samples) are in columns. This is the standard format for microarray data, but to use with machine learning algorithms like WEKA, we will need to do "matrix transpose" (flip) the matrix to make files with genes in columns and samples in rows. We will do that in step 3B.6 of this assignment. Here is a small extract Gene Description Gene Accession 1 call 2 call ... Number GB DEF = GABAa receptor alpha-3 subunit A28102_at 151 A 263 P ... ... AB000114_at 72 A 21 A ... ... AB000115_at 281 A 250 P ... ... AB000220_at 36 A 43 A ... 3B: Clean the data Perform the following cleaning steps on both the train and test sets. Use unix tools, scripts or other tools for each task. Document all the steps and create intermediate files for each step. After each step, report the number of fields and records in train and test files. (Hint: Use unix command wc to find the number of records and use awk or gawk to find the number of fields). Microarray Data Cleaning Steps 1. Remove the initial records with Gene Description containing "control". (Those are Affymetrix controls, not human genes). Call the resulting files ALL_AML_grow.train.noaffy.tmp and ALL_AML_grow.test.noaffy.tmp. Hint: You can use unix command grep to remove the control records. How many such control records are in each file? 2. Remove the first field (long description) and the "call" fields, i.e. keep fields numbered 2,3,5,7,9,... Hint: use unix cut command to do that. 3. Replace all tabs with commas 4. Change "Gene Accession Number" to "ID" in the first record. (You can use emacs here). (Note: That will prevent possible problems that some data mining tools have with blanks in field names.) 5. Normalize the data: for each value, set the minimum field value to 20 and the maximum to 16,000. (Note: The expression values less than 20 or over 16,000 were considered by biologists unreliable for this experiment.) Write a small Java program or Perl script to do that. Call the generated files ALL_AML_grow.train.norm.tmp and ALL_AML_grow.test.norm.tmp 6. Write a short java program or shell script to transpose the training data to get ALL_AML_gcol.test.tmp and ALL_AML_gcol.train.tmp ("gcol" stands for genes in columns). These files should each have 7071 fields, and 39 records in "train", 35 records in "test" datasets. 7. Extract from file table_ALL_AML_samples.txt tables ALL_AML_idclass.train.txt and ALL_AML_idclass.test.txt with sample id and sample labels, space separated. Here you can use a combination of unix commands and manual editing by emacs. Add a header row with "ID Class" to each of the files. File ALL_AML_idclass.train.txt should have 39 records and two columns. First record (header) has "ID Class", next 27 records have class "ALL" and last 11 records have class "AML". Be sure to remove all spaces and tabs from this file. ALL_AML_idclass.test.txt should have 20 "ALL" samples and 14 "AML" samples, intermixed. 8. Note that the sample numbers in ALL_AML_gcol*.csv files are in different order than in *idclass files. Use Unix commands to create combined files ALL_AML_gcol_class.train.csv and ALL_AML_gcol_class.test.csv which have ID as the first field, Class as the last field, and gene expression fields in between. 3C: Build Models on a full dataset 1. As in assignment 2, convert ALL_AML_gcol_class.train.csv to ALL_AML_allgenes.train.arff ALL_AML_gcol_class.test.csv to ALL_AML_allgenes.test.arff 2. Using ALL_AML_allgenes.train.arff as train file and ALL_AML_allgenes.test.arff as test, build a model using OneR. What accuracy do you get? 3. Now, excluding the field ID, build models using OneR, NaiveBayes Simple, and J4.8, using training set only. What models and error rates you get with each method? Warning: some of the methods may not finish or give you errors due to the large number of attributes for this data. 4. If you got thus far -- congratulations! Based on your experience, what three things are important in the process of data mining ? Assignment 4: Feature Reduction Start from the ALL_AML_train_processed data set (zipped file in Data directory). This data was thresholded to >= 20 and <= 16,000, with genes in rows. You should have generated it as part of assignment 3. Name it ALL_AML_gr.thr.train.csv (Note almost all of this data preprocessing is much easier done with data with genes-in-rows format). We will convert the data to genes-in-columns when we are ready to do the modeling). 1. Examining gene variation A. Write a program (or a script) to compute a fold difference for each gene. Fold difference is the Maximum Value across samples divided by minimum value. This value is frequently used by biologists to assess gene variability. B. What is the largest fold difference and how many genes have it? C. What is the lowest fold difference and how many genes have it? D. Count how many genes have fold ratio in the following ranges Range Count Val <= 2 .. 2 <Val <= 4 .. 4 <Val <= 8 .. 8 <Val <= 16 .. 16 <Val <= 32 .. 32 <Val <= 64 .. 64 <Val <= 128 .. 128 <Val <= 256 .. 256 <Val <= 512 .. 512 <Val .. E: Extra Credit: Graph fold ratio distribution appropriately. 2. Finding most significant genes For train set, samples 1-27 belong to class ALL, and 28-38 to class AML. Let Avg1, Avg2 be the average expression values. Let Stdev1, Stdev2 be the sample standard deviations, which can be computed as Stdev = sqrt((N*Sum_sq - Sum_val*Sum_val)/(N*(N-1))) Here N is the number of observations, Sum_val is the sum of values, Sum_sq is the sum of squares of values. Signal to Noise (S2N) ratio is defined as (Avg1 - Avg2) / (Stdev1 + Stdev2) T-value is defined as (Avg1 - Avg2) / sqrt(Stdev1*Stdev/N1 + Stdev2*Stdev2/N2) A. Write a script that will compute for each gene, the average and standard deviation for both classes ("ALL" and "AML"). Also compute for each gene T-value and Signal to Noise ratio. B. Select for each class, top 50 genes with the highest S2N ratio. Which gene has the highest S2N for "ALL" (i.e. its high value is most correlated with class "ALL")? 50th highest ? Give gene names and S2N values Same question for "AML" class. What is the relationship between S2N values for ALL and AML ? C. Select for each class top 50 genes with the highest T-value Which gene has the highest T-value for class "ALL"? 50th highest ? Give gene names and T values For AML ? What is the relationship between T values for ALL and AML ? Will a similar relationship hold if there are more than 2 classes? D. How many genes are in common between top 50 genes for ALL selected using S2N and those selected using T-value ? How many genes are in common among top 3 genes in each list? E. Same question for top genes for "AML" . 3. Lessons Learned What have you learned so far about the process of feature selection and data preparation? Predict Treatment Outcome Note: For this assignment, we used CART from Salford Systems, which was available to us under an educational license. If CART is not available, another decision tree tool, such as J4.8 in Weka can be used instead. Start with genes-leukemia.csv dataset used in assignment 2. (See Dataset directory). As a predictor use field TREATMENT_RESPONSE, which has values Success, Failure or "?" (missing) Step 1. Examine the records where TREATMENT_RESPONSE is non-missing. Q1: How many such records are there? Q2: Can you describe these records using other sample fields (e.g. Year from XXXX to YYYY , or Gender = X, etc) Q3: Why is it not correct to build predictive models for TREATMENT_RESPONSE using records where it is missing? Step 2. Select only the records with non-missing TREATMENT_RESPONSE. Keep SNUM (sample number) but remove sample fields that are all the same or missing. Call the reduced dataset genes- reduced.csv Q4: Which sample fields you should keep? Step 3. Build a CART Model using leave-one-out cross validation. Q5: what tree do you get? and what is the expected error rate? Q6: what are the important variables and their relative importance, according to CART? Q7: Remove the top predictor -- and re-run the CART -- what do you get? Step 4: Extra credit (10%): Use Google to search the web for the name of top gene that predicts the outcome and briefly report relevant information that you find. Step 5: Randomization test Randomize the TREATMENT_RESPONSE variable 10 times and re-run the CART for each randomized class. Q8: Report the trees and error rates you get. Q9: Based on the results in Q8, do you think the tree that you found with the original data is significant? ID Mid I/15 Mid II/15 120060194 120070044 120070063 120070167 207210067 208110169 208110227 8 -11 5 11 13 9 8 -11 11 8 12 11 Hws(13)/30 10 15 20 10 10 25 20 Sub Total/60 26 15 42 32 32 51 42 Final/40 Total/100 Grade