BMI 214 Spring 2006
Assignment 2: Machine learning for expression data and Genotype-phenotype associations

Contents:
Part 1: Weka for machine learning in expression data
  a. Introduction
  b. Supervised learning
  c. Feature selection
  d. Unsupervised learning
Part 2: Genotype-phenotype associations
  a. Introduction
  b. Feature selection

Submission: Send your answers as a PDF document to biomedin214-spr0506-submit@lists.stanford.edu

Part 1: Weka for machine learning in expression data

Part 1a: Introduction

Download / familiarize yourself with Weka.

As discussed in class, Weka is a useful tool that implements most of the major machine learning algorithms. In this part you will re-live Project 2, but now let Weka do the work. It will be both a tutorial and a problem set.

We will be using the Weka GUI in this assignment. Weka also has an excellent command-line interface and a Java API, which will not be covered. Read more in the documentation if interested.

Download and start the Weka GUI. Follow the instructions on the Weka site:
http://www.cs.waikato.ac.nz/ml/weka/

You will see four buttons; in this class we will only use the Explorer functionality.

There is much Weka documentation available. You may familiarize yourself with Explorer as much as you like by reading the user guide and using the provided sample datasets. Some good (optional) starting points:
Explorer guide: http://easynews.dl.sourceforge.net/sourceforge/weka/ExplorerGuide-3.4.pdf
Weka Wiki: http://weka.sourceforge.net/wekadoc/index.php/Main_Page

Part 1b: Supervised learning

We will be using the same leukemia data as in Project 2. Recall that that dataset comprised 72 leukemia patients, 28 of whom were AML and 44 of whom were ALL. Each patient had expression measurements for 7129 genes.

We have compiled these data into a Weka-compatible CSV file.
http://helix-web.stanford.edu/bmi214/assignment2/data/leukemia.csv

This file has the 72 leukemia patients (rows) and expression values for 150 genes (columns).
This data matrix is slightly altered from the one in Project 2: it is transposed, and we chose a subset of 150 genes to make your life easier. The gene and experiment names, as well as a link to the reference paper, are available in Project 2.

Open the leukemia.csv file in Weka.

Question 1. What is the mean value of expression of the gene labeled “CD33 CD33 antigen (differentiation antigen)”?

Now we will run KNN. Go to the “Classify” tab. Under “Classifier” click the “Choose” button. Expand the “lazy” menu (yes, we use only the best!). Choose “IBk” and OK. This is KNN: IBk stands for Instance-Based k.

Now next to the button it should say “IBk” with some parameters. Click on this text and a menu will pop up. For “KNN”, enter 5. Recall that this means the algorithm will use the five nearest neighbors to classify each data point. Leave the rest of the values as default.

Under “Test options” choose the radio button “Cross-validation” and under “Folds” enter 5.

The dropdown menu below Test options should say “(Nom) leukemia_type”. This means that the algorithm will classify “leukemia_type” (AML or ALL), using the gene expression values as attributes.

Click the “Start” button. The main window will show a variety of result summary statistics, such as accuracy, true positives, false positives, and a confusion matrix.

Question 2. What is the % of correctly classified instances?

ROC curves, as discussed in lecture, illustrate the tradeoffs between sensitivity and specificity. ROC curves plot Sensitivity vs. 1 – Specificity, or TP (true positive) rate vs. FP (false positive) rate. Recall that

TP rate = Sensitivity = TP / (TP + FN)
FP rate = 1 – Specificity = FP / (FP + TN)
Specificity = 1 – FP rate = TN / (FP + TN)

where TP = number of true positives, FP = number of false positives, TN = number of true negatives, and FN = number of false negatives.

Question X. What are the TP and FP rates for ALL and AML?
ALL TP Rate =
ALL FP Rate =
AML TP Rate =
AML FP Rate =

Optional: Right click on your result in the “Result list” on the left side of the screen.
Choose “visualize threshold curve” and “ALL”. An ROC curve plots true positive (TP) rate vs. false positive (FP) rate, which are the defaults. You can also view other types of curves by clicking the dropdown menus. For example, precision-recall curves are an alternative to ROC curves; precision and recall are options in the dropdown menu.

Recall that we can trivially achieve a TP rate of 1 for AML by classifying all of the patients as positive (AML). That would correspond to a TP rate of 0 for ALL, since none of the patients could then be classified ALL.

Question 3. Now we’ll try a different classification algorithm, ZeroR. Click the Choose button under Classifier, and expand the “rules” folder. Choose “ZeroR”. Again use cross-validation with Folds=5. Run it.
% correctly classified instances =
ALL TP Rate =
ALL FP Rate =
AML TP Rate =
AML FP Rate =

Explanation: ZeroR is a baseline classifier that simply identifies the class that is most abundant (in this case, ALL with 44 patients), and predicts all instances (patients) to be in that class. This illustrates that when the instances are unevenly distributed across classes, looking at accuracy alone can be misleading. For example, if you have a dataset of 100 instances (patients), where 90 are ALL and 10 are AML, then ZeroR will predict them all to be ALL, with 90% correctly classified (90% accuracy). Another classifier might have 85% accuracy, which looks pretty good until you compare it with ZeroR.

You can get around the problem of uneven class sizes by weighting, a topic which we won’t cover here. (But check out the “CostSensitiveClassifier” and “Cost Sensitive evaluation” if you’re interested.)

We’ll try a few more algorithms on this dataset.

Question 4. Under the “bayes” Classifier folder, choose “NaiveBayes” and run.
% correctly classified instances =
ALL TP Rate =
ALL FP Rate =
AML TP Rate =
AML FP Rate =

Question 5. Under the “functions” Classifier folder, choose “SMO” (a popular implementation of SVM) and run.
% correctly classified instances =
ALL TP Rate =
ALL FP Rate =
AML TP Rate =
AML FP Rate =

Question 6. Under the “trees” Classifier folder, choose “ADTree” (a popular implementation of a decision tree) and run.
% correctly classified instances =
ALL TP Rate =
ALL FP Rate =
AML TP Rate =
AML FP Rate =

You can read about the various other algorithms in the documentation if interested.

Part 1c: Feature selection (aka attribute selection)

We will now use feature selection algorithms to extract the most “informative” genes for classifying AML vs ALL. As mentioned in lecture, this is a useful real world exercise because many times we are interested not in measuring all the gene expressions, but in finding one or a few whose expression level can be used as a marker for the disease. The general area of “Biomarkers” is economically important because it may lead to diagnostic tests that can be marketed to clinical laboratories.

If interested, you can learn more about the theory behind feature selection in machine learning textbooks and online resources. Wikipedia is always a good place to start.
http://en.wikipedia.org/wiki/Feature_selection

First we will choose a subset of five genes arbitrarily – the first five. To do this in Weka, go back to the “Preprocess” tab and click the checkbox next to the first five genes (Zyxin through RNS2). Scroll down to the bottom of the gene list and click “leukemia_type”, which is the class label (AML or ALL). Check this box, too, so that there are 6 total boxes checked. Click the “Invert” button above the list, so that the other 145 boxes become checked. Click the “Remove” button below the list, so that you are left with 5 genes and leukemia_type.

Go to the “Classify” tab again. Classify using IBk and K = 5.

Question 7. What is the % correctly classified instances using just the first five genes?

Re-load the dataset with “Open file…” to get all 150 genes back.

We’ll now find five genes that are “informative” according to some metric.
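As a rough illustration of what “informative” can mean, the sketch below computes information gain, the metric used in the next step, for two made-up genes. The gene names, expression values, and the median-split discretization are assumptions chosen for this example only; Weka uses its own discretization of numeric attributes, so its scores on the real data will not match numbers produced this way.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def info_gain(values, labels):
    """Information gain of a two-way split at the median expression value.

    The median split is a simplifying assumption for this sketch.
    """
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        median = ordered[mid]
    else:
        median = (ordered[mid - 1] + ordered[mid]) / 2
    low = [lab for v, lab in zip(values, labels) if v <= median]
    high = [lab for v, lab in zip(values, labels) if v > median]
    # Weighted entropy remaining after the split.
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in (low, high) if part)
    return entropy(labels) - remainder

# Toy data: six hypothetical patients, two hypothetical genes.
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
toy_genes = {
    "gene_A": [1.0, 1.2, 0.9, 5.1, 4.8, 5.5],  # low in ALL, high in AML
    "gene_B": [1.0, 5.0, 2.0, 1.5, 4.5, 2.5],  # no clear pattern
}

# Rank genes from most to least informative.
ranking = sorted(toy_genes,
                 key=lambda g: info_gain(toy_genes[g], labels),
                 reverse=True)
for gene in ranking:
    print(gene, round(info_gain(toy_genes[gene], labels), 3))
```

In this toy table, gene_A splits the two classes cleanly and therefore ranks above gene_B, which is the same kind of ordering a ranking search produces over all 150 genes.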
Choose the “Select attributes” tab at the top of the screen. Under “Attribute Evaluator”, choose the algorithm called “InfoGainAttributeEval” and Search Method “Ranker” and Start.

The output shows the genes ranked by information gain, which is one possible metric for measuring the “goodness” of a partition. The genes at the top of the list are the most “informative” as attributes for classification.

Question 8. What are the top five genes in this output?

Back under “Preprocess”, remove all attributes except the top five genes from the previous question, in the same manner as described above, so that you are left with the five genes from the previous question and the class attribute leukemia_type.

Under “Classify”, run the exact same algorithm as before (IBk, K=5).

Question 9. What is the % of correctly classified instances?

Part 1d: Unsupervised learning

K-means clustering

This part will use the same dataset that you used for Project 2 k-means clustering. Recall that that dataset comprised expression profiles for 2467 genes across 79 experiments.

Download a Weka-compatible CSV file at
http://helix-web.stanford.edu/bmi214/assignment2/data/yeast.dat.csv

The rows are genes and the columns are experiments. The gene and experiment names, as well as a link to the reference paper, are available in the Project 2 instructions. In this particular file, we have added one additional column called “ribosomal” with value Y or N, to indicate whether a gene is ribosomal.

Question 10. What is the mean value of expression of the experiment labeled “alpha 21”?

Now we will run K-means clustering. Go to the Cluster tab. Under Clusterer choose SimpleKMeans. Click the text “SimpleKMeans” and set numClusters to 2. “Cluster mode” should be “Use training set”. Run it.

Question 11. How many genes are in each of the two clusters?
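For reference, the basic k-means procedure can be sketched in a few lines. This is a minimal illustration on made-up 2-D points, not Weka's SimpleKMeans implementation; the function name kmeans, the toy points, and the use of Python's random module to pick the starting centroids are all assumptions of this example.

```python
import random

def kmeans(points, k, seed, max_iter=100):
    """Basic k-means: returns a cluster index for each point."""
    rng = random.Random(seed)           # the seed fixes the starting centroids
    centroids = rng.sample(points, k)   # start from k distinct data points
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid
        # (squared Euclidean distance).
        new_assignment = [
            min(range(k),
                key=lambda c: sum((p - q) ** 2
                                  for p, q in zip(pt, centroids[c])))
            for pt in points
        ]
        if new_assignment == assignment:
            break                       # converged: nothing moved
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [pt for pt, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return assignment

# Two well-separated toy groups of three points each.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3),
          (5.0, 5.1), (5.2, 4.9), (4.9, 5.3)]

for seed in (1, 2, 15):
    labels = kmeans(points, k=2, seed=seed)
    sizes = sorted(labels.count(c) for c in range(2))
    print("seed", seed, "cluster sizes", sizes)
```

On these toy points the two groups are so far apart that every seed yields cluster sizes [3, 3]. Real expression data is rarely that clean, which is why the starting seed can matter.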
An optional exercise for those interested: if you want to see which genes were assigned to which clusters, right-click on the result in the Result list and choose “Visualize cluster assignments”. Click the Save button and choose a location. This will save a file in ARFF format (the Weka format), where the last value on each line will be a cluster assignment. (See the Weka documentation for more on ARFF format.)

Question 12. Recall that K-means depends on a random starting seed. Change this seed by clicking on the text “SimpleKMeans” and changing the “seed” value to 15. How many genes are now in the two clusters?

Question 13. Try adjusting the seed value to various values (e.g. 1, 2, 5, 20, 100). How much do the numbers of genes in each cluster change? What might this say about the genes and their assignments to clusters? (I.e. are the assignments robust or weak?)

Question 14. Now under “Cluster mode”, choose “Classes to clusters evaluation” and run. How many of the 121 ribosomal genes are assigned to the same cluster?

Question 15. How many other genes are assigned to the same cluster as the one that is predominantly ribosomal?

(If you were interested in what those genes were, you might save the ARFF output file as described above, extract the genes in that cluster, and enter them in the GO Term Finder, as discussed in Project 2, to get a feel for their functions. That’s left as an optional exercise.)

Question 16. For completeness, we’ll now classify the yeast data. Under the Classify tab, again choose the IBk algorithm, and use KNN=5. Use cross-validation with Folds=5 and run.
% correctly classified instances =
Class N TP Rate =
Class N FP Rate =
Class Y TP Rate =
Class Y FP Rate =

Question 17. Run as in the previous question but using the trivial classifier ZeroR, which was discussed above.
% correctly classified instances =
Class N TP Rate =
Class N FP Rate =
Class Y TP Rate =
Class Y FP Rate =

Part 2: Genotype/Phenotype

Part 2a: Introduction

Researching and answering these introductory questions will help you understand the genotype-phenotype section below.

Question 18. What does diploid mean? How many copies of every chromosome are there in one cell of a diploid organism? How many copies of every gene?

Question 19. What is an “allele”, with respect to a single SNP? How many alleles are theoretically possible for one SNP locus? Usually, how many alternate forms actually arise in life for one SNP locus?

Question 20. Approximately how many SNPs are there in the human genome, assuming a definition of “SNP” as 1% minor allele frequency? (1% minor allele frequency means that 1% of the population has the uncommon allele.) Hint: It should be in millions.

Question 21. SNPs are one type of genetic variation – what other types exist?

Question 22. What is penetrance, in terms of phenotype? What does 50% penetrance mean? 100%? (three sentences or less)

Question 23. What is a haplotype? How is it related to linkage disequilibrium? (three sentences or less)

Part 2b: Feature selection (aka attribute selection)

Background: With genome sequencing cost drastically falling, we will soon be able to sequence millions of people cheaply. With this wealth of new data, we will be able to associate genotypes with phenotypes (e.g. diseases or drug responses). The goal is to find the genes (or more specifically, the SNPs) that are found in conjunction with certain phenotypes.

We will be using a semi-synthetic dataset to explore this type of study. The dataset can be found at
http://helix-web.stanford.edu/bmi214/assignment2/data/genotenureitus1.arff

Dataset description: Many past genotype-phenotype association studies only examined a single gene hypothesized to cause the phenotype (disease). In reality, there are complex interactions of genes that cause disease.
We will use machine learning to find such complex associations, which is a very cutting-edge research area.

We will use Weka for this section. Open the file “genotenureitis1.arff”. Instead of CSV, this file is in Weka’s ARFF format, which is described at http://www.cs.waikato.ac.nz/~ml/weka/arff.html.

This file contains genotype/phenotype data for 557 subjects (people). Each row in the data section represents one subject (person). The (comma-delimited) columns are the attributes of the subjects.

The first column is the person’s identifier.

The next 188 columns represent genotypes, which in this case are SNPs (single nucleotide polymorphisms). The SNPs were identified from three regions in the genome. The SNPs are named RjSNPi where j is the region (1, 2, or 3), and i is the SNP number within that region.

The values each SNP can take are {11, 12, 22}. Recall that humans are diploid. In this file, a value of “11” for a subject’s SNP means that both copies are allele 1 (where allele 1’s possible values are the nucleotides {A,T,G,C}; the actual allele isn’t specified). “12” means that one copy is allele 1 and the other is allele 2. “22” means that both copies are allele 2. (Allele 2 would be a nucleotide other than allele 1.)

The last 20 columns in the file represent (artificially created) phenotypes. Some of the phenotypes, for example, are
- “degree”, with possible values MD, MD/PhD, or PhD
- “npubs” (number of publications), with possible values 4 through 12
- “gotgrants” (total grant dollars earned, in millions), with possible values 0 through 19

Note that the genotype data are real, but the phenotype data are artificial.

A lengthier (and far more humorous) description of the phenotypes can be found here:
http://helix-web.stanford.edu/bmi214/assignment2/data/pgrn_2005.pdf
This is an optional read; it is not necessary for understanding this assignment. This document and the genotenureitis data file were created for a recent pharmacogenetics conference.
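To make the 11/12/22 genotype coding described above concrete, the following sketch counts allele copies for a single SNP and derives its minor allele frequency. The five genotypes are invented for illustration and are not taken from the data file; the function names are likewise hypothetical.

```python
from collections import Counter

def allele_counts(genotypes):
    """Count copies of allele 1 and allele 2 across diploid genotype codes."""
    counts = Counter()
    for code in genotypes:
        for allele in code:      # "12" -> one copy of "1", one copy of "2"
            counts[allele] += 1
    return counts

def minor_allele_frequency(genotypes):
    """Fraction of all allele copies that carry the less common allele."""
    counts = allele_counts(genotypes)
    total = sum(counts.values())  # two allele copies per diploid subject
    return min(counts["1"], counts["2"]) / total

# Hypothetical genotypes at one SNP for five subjects.
toy_snp = ["11", "12", "22", "11", "12"]

counts = allele_counts(toy_snp)
print("allele 1 copies:", counts["1"])
print("allele 2 copies:", counts["2"])
print("minor allele frequency:", minor_allele_frequency(toy_snp))
```

Because each subject contributes two allele copies, five subjects give ten copies in total; here allele 2 is the minor allele.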
Data filtering

Back to Weka and genotenureitis1.arff. Scroll to the bottom of the attribute names. Examine the distributions for the attribute “irep” by clicking on it.

Question 24. How many class labels exist for the attribute irep?

The attribute irep is not useful to us; we will remove it and other useless fields with a filter. Under “Filter”, choose unsupervised->attribute->RemoveUseless, and click Apply. This will remove the “irep” attribute and the “ID” attribute (which actually appeared twice, as attribute numbers 1 and 192). You can undo operations like this one by clicking the Undo button at the top.

If you don’t apply this filter, you may get erroneous results in the following sections.

Genotype-genotype associations

First we’ll examine whether SNPs are associated with each other. Recall the discussion in class about haplotypes and linkage disequilibrium.

Question 25. Under the “Associate” tab, run the default associator (Apriori). Copy here the output under “Best rules found:” (This should be nine SNP association pairs and one phenotype-SNP pair.)

Question 26. Within all but one SNP-SNP pair, the two SNPs have something in common. What is it, and what does it mean biologically? Hint: look at the SNP names and the explanation of SNP names above. (3 sentences or less.)

Attribute selection

We now perform attribute selection to find which SNPs best predict some of the phenotypes.

Phenotype 1: gotgrants

The first phenotype we’ll investigate is “gotgrants”. We want to know whether there are any SNPs related to the amount of grant money earned. We disclose that the first phenotype was created artificially as a simple linear function of the penetrance of a single SNP. You will try to discover what this SNP was.

Choose “gotgrants” from the dropdown menu on the left. Under the “Select attributes” tab, choose the GainRatioAttributeEval evaluator with Ranker Search Method. Use cross-validation with Folds=5. Run.

Question 27.
What is the top SNP (genotype) in this feature selection result?

Question 28. Run multiple Attribute Evaluators with multiple Search Methods; is the top SNP the same in general? (Note that some Evaluators require specific Search Methods; see the error log and/or documentation if you get unexpected results.)

Question 29. Look at the top 80 or so SNPs. Do you see anything interesting about them? What does this mean biologically? (3 sentences or less.)

Phenotype 2: pctdrivel

The next phenotype is “pctdrivel”; again, we’ll investigate whether there are any SNPs that are good predictors of this phenotype. This phenotype was created artificially as a function of two other phenotypes. The inclusion of other phenotypes represents the fact that environment, not just genetic makeup, can impact phenotype (cf. the nature vs. nurture debate).

Run multiple Attribute Evaluators with multiple Search Methods.

Question 30. What do you think are the two phenotypes affecting pctdrivel and why? A well-justified answer will be accepted. (3 sentences or less.)

EXTRA CREDIT: Phenotype 3: rivalside

This phenotype is very difficult to decipher; doing so will earn 5% extra credit. This phenotype was created artificially as a more complex function of one other phenotype and six SNPs. What attributes do you think are affecting rivalside and why?