doc - BMI 214

advertisement
BMI 214
Spring 2006
Assignment 2:
Machine learning for expression data and
Genotype-phenotype associations
Contents:
Part 1: Weka for machine learning in expression data
a. Introduction
b. Supervised learning
c. Feature selection
d. Unsupervised learning
Part 2: Genotype-phenotype associations
a. Introduction
b. Feature selection
Submission: Send your answers as a PDF document to
biomedin214-spr0506-submit@lists.stanford.edu
Part 1: Weka for machine learning in expression data
Part 1a: Introduction
Download / familiarize yourself with Weka.
As discussed in class, Weka is a useful tool that has implemented most of the major
machine learning algorithms. In this part you will re-live Project 2, but now let Weka do
the work. It will be both a tutorial and problem set.
We will be using the Weka GUI in this assignment. There are also an excellent
command line interface and Java API, which will not be covered. Read more in the
documentation if interested.
Download and start the Weka GUI. Follow the instructions on the Weka site:
http://www.cs.waikato.ac.nz/ml/weka/
You will see four buttons; in this class we will only use the Explorer functionality.
There is much Weka documentation available. You may familiarize yourself with
Explorer as much as you like by reading the user guide and using their provided sample
datasets. Some good (optional) starting points:
Explorer guide:
http://easynews.dl.sourceforge.net/sourceforge/weka/ExplorerGuide-3.4.pdf
Weka Wiki:
http://weka.sourceforge.net/wekadoc/index.php/Main_Page
Part 1b: Supervised learning
We will be using the same leukemia data as in Project 2. Recall that that dataset
comprised 72 leukemia patients, 28 of which were AML and 44 of which were ALL.
Each patient had expression measurements for 7129 genes.
We have compiled these data into a Weka-compatible CSV file.
http://helix-web.stanford.edu/bmi214/assignment2/data/leukemia.csv
This file has the 72 leukemia patients (rows) and expression values for 150 genes
(columns). This data matrix is slightly altered from the one in Project 2: it is transposed,
and we chose a subset of 150 genes to make your life easier. The gene and experiment
names, as well as a link to the reference paper, are available in Project 2.
Open the leukemia.csv file in Weka.
Question 1. What is the mean value of expression of the gene labeled “CD33
CD33 antigen (differentiation antigen)”?
Now we will run KNN. Go to the “classify” tab. Under “Classifier” click the “Choose”
button. Expand the “lazy” menu (yes, we use only the best!). Choose “IBk” and OK.
This is KNN: IBk stands for Instance-Based k. Now next to the button it should say
“IBk” with some parameters. Click on this text and a menu will pop up. For “KNN”, enter
5. Recall that this means the algorithm will use the five nearest neighbors to classify
each data point. Leave the rest of the values as default.
Under “Test options” choose the radio button “Cross-validation” and under “Folds” enter
5. The dropdown menu below Test options should say “(Nom) leukemia_type”. This
means that the algorithm will classify “leukemia_type” (AML or ALL), using the
experiments as attributes.
Click the “Start’ button. The main window will show a variety of result summary
statistics, such as accuracy, true positives, false positives, and a confusion matrix.
Question 2. What is the % of correctly classified instances?
ROC curves, as discussed in lecture, illustrate the tradeoffs between sensitivity and
specificity. Roc curves plot Sensitivity vs. 1 – Specificity, or TP (true positive) rate vs.
FP (false positive) rate. Recall that
TP rate = Sensitivity = TP / (TP + FN)
FP rate = 1 – Specificity = FP / (TN + FN)
Specificity = 1 = FP rate = TN / (FP + TN)
Where TP = number of true positives and FP = number of false positives.
Question X. What are the TP and FP rates for ALL and AML?
ALL TP Rate =
ALL FP Rate =
AML TP Rate =
AML FP Rate =
Optional: Right click on your result in the “Result list” on the left side of the screen.
Choose “visualize threshold curve” and “ALL”. An ROC curve plots true positive (TP)
rate vs. false positive (FP) rate, which are the defaults. You can also view other types of
curves by clicking the dropdown menus. For example, precision-recall curves are an
alternative to ROC curves; precision and recall are options in the dropdown menu.
Recall that we can trivially achieve a TP rate of 1 for AML by classifying all of the
patients as positive (AML). That would correspond to a TP rate of 0 for ALL, since none
of the patients could then be classified ALL.
Question 3. Now we’ll try a different classification algorithm, ZeroR. Click the
Choose button under Classifier, and expand the “rules” folder. Choose “ZeroR”.
Again use cross-validation with Folds=5. Run it.
% correctly classified instances =
ALL TP Rate =
ALL FP Rate =
AML TP Rate =
AML FP Rate =
Explanation: ZeroR is a baseline classifier that simply identifies the class that is most
abundant (in this case, ALL with 44 patients), and predicts all variables (patients) to be
in that class. This illustrates that when the variables are unevenly assigned to classes, it
can be a problem to simply look at accuracy. For example, if you have a dataset of 100
variables (patients), where 90 were ALL, and 10 were AML, then ZeroR will predict them
all to be ALL, with 90% correctly classified (90% accuracy). Another classifier might
have 85% accuracy, which looks pretty good until you compare it with ZeroR. You can
get around the problem of uneven class sizes by weighting, a topic which we won’t cover
here. (But check out the “CostSensitiveClassifier” and “Cost Sensitive evaluation” if
you’re interested.)
We’ll try a few more algorithms on this dataset.
Question 4. Under the “bayes” Classifier folder, choose “NaiveBayes” and run.
% correctly classified instances =
ALL TP Rate =
ALL FP Rate =
AML TP Rate =
AML FP Rate =
Question 5. Under the “functions” Classifier folder, choose “SMO” (a popular
implementation of SVM) and run.
% correctly classified instances =
ALL TP Rate =
ALL FP Rate =
AML TP Rate =
AML FP Rate =
Question 6. Under the “trees” Classifier folder, choose “ADTree” (a popular
implementation of a decision tree) and run.
% correctly classified instances =
ALL TP Rate =
ALL FP Rate =
AML TP Rate =
AML FP Rate =
You can read about the various other algorithms in the documentation if interested.
Part 1c: Feature selection (aka attribute selection)
We will now use feature selection algorithms to extract the most “informative” genes for
classifying AML vs ALL. As mentioned in lecture, this is a useful real world exercise
because many times we are interested not in measuring all the gene expressions, but in
finding one or a few whose expression level can be used as a marker for the disease.
The general area of “Biomarkers” is economically important because it may lead to
diagnostic tests that can be marketed to clinical laboratories.
If interested, you can learn more about the theory behind feature selection in machine
learning textbooks and online resources. Wikipedia is always a good place to start.
http://en.wikipedia.org/wiki/Feature_selection
First we will choose a subset of five genes arbitrarily – the first five. To do this in Weka,
go back to the “Preprocess” tab and click the checkbox next to the first five genes (Zyxin
through RNS2). Scroll down to the bottom of the gene list and click “leukemia_type”,
which is the class label (AML or ALL). Check this box, too, so that there are 6 total
boxes checked. Click the “Invert” button above the list, so that the other 145 boxes
become checked. Click the “Remove” button below the list, so that you are left with 5
genes and leukemia_type.
Go to the “Classify” tab again. Classify using IBk and K = 5.
Question 7. What is the % correctly classified instances using just the first five
genes?
Re-load the dataset with “Open file…” to get all 150 genes back.
We’ll now find five genes that are “informative” according to some metric. Choose the
“Select attributes” tab at the top of the screen. Under ‘Attribute Evaluator”, choose the
algorithm called “InfoGainAttributeEval” and Search Method “Ranker” and Start. The
output shows the genes ranked by information gain, which is one possible metric for
measuring the “goodness” of a partition. The genes at the top of the list are the most
“informative” as attributes for classification.
Question 8. What are the top five genes in this output?
Back under “Preprocess”, remove all attributes except the top five genes from the
previous question, in the same manner as described above, so that you are left with the
five genes from the previous question and the class attribute leukemia_type. Under
“Classify”, run the exact same algorithm as before (IBk, K=5).
Question 9. What is the % of correctly classified instances?
Part 1d: Unsupervised learning
K-means clustering
This part will use the same dataset that you used for Project 2 k-means clustering.
Recall that that dataset comprised expression profiles for 2467 genes across 79
experiments.
Download a Weka-compatible CSV file at
http://helix-web.stanford.edu/bmi214/assignment2/data/yeast.dat.csv
The rows are genes and the columns are experiments. The gene and experiment
names, as well as a link to the reference paper, is available on the Project 2 instructions.
In this particular file, we have added one additional column called “ribosomal” with value
Y or N, to indicate whether a gene is ribosomal.
Question 10. What is the mean value of expression of the experiment labeled
“alpha 21”?
Now we will run K-means clustering. Go to the Cluster tab. Under Clusterer choose
SimpleKMeans. Click the text “SimpleKMeans” and set numClusters to 2. “Cluster
mode” should be “Use training set”. Run it.
Question 11. How many genes in each of the two clusters?
An optional exercise for those interested -- If you want to see which genes were
assigned to which clusters, right-click on the result in the Result list and choose
“Visualize cluster assignments”. Click the Save button and choose a location. This will
save a file in ARFF format (the Weka format), where the last value on each line will be a
cluster assignment. (See the Weka documentation for more on ARFF format).
Question 12. Recall that K-means depends on a random starting seed. Change
this seed by clicking on the text “SimpleKMeans” and changing the “seed” value
to 15. How many genes are now in the two clusters?
Question 13. Try adjusting the seed value to various values (e.g. 1.2, 5, 20, 100).
How much do the numbers of genes in each cluster change? What might this say
about the genes and their assignments to clusters? (I.e. are the assignments
robust or weak?)
Question 14. Now under “Cluster mode”, choose “Classes to clusters evaluation”
and run. How many of the 121 ribosomal genes are assigned to the same cluster?
Question 15. How many other genes are assigned to the same cluster as the one
that is predominantly ribosomal?
(If you were interested in what those genes were, you might save the ARFF output file
as described above, extract the genes in that cluster, and enter them in the GO Term
Finder, as discussed in Project 2, to get a feel for their functions. That’s left as an
optional exercise.)
Question 16. For completeness, we’ll now classify the yeast data. Under the
Classify tab, again choose the IBk algorithm, and use KNN=5. Use crossvalidation with Folds=5 and run.
% correctly classified instances =
Class N TP Rate =
Class N FP Rate =
Class Y TP Rate =
Class Y FP Rate =
Question 17. Run as in the previous question but using trivial classifier ZeroR,
which was discussed above.
% correctly classified instances =
Class N TP Rate =
Class N FP Rate =
Class Y TP Rate =
Class Y FP Rate =
Part 2:
Genotype/Phenotype
Part 2a: Introduction
Researching and answering these introductory questions will help you understand the
genotype-phenotype section below.
Question 18. What does diploid mean? How many copies of every chromosome
are there in one cell of a diploid organism? How many copies of every gene?
Question 19. What is an “allele”, with respect to a single SNP? How many alleles
are theoretically possible for one SNP locus? Usually, how many alternate forms
actually arise in life for one SNP locus?
Question 20. Approximately how many SNPs are there in the human genome,
assuming a definition of “SNP” as 1% minor allele frequency? (1% minor allele
frequency means that 1% of the population has the uncommon allele.)
Hint: It should be in millions.
Question 21. SNPs are one type of genetic variation – what other types exist?
Question 22. What is penetrance, in terms of phenotype? What does 50%
penetrance mean? 100%? (three sentences or less)
Question 23. What is a haplotype? How is it related to linkage disequilibrium?
(three sentences or less)
Part 2b: Feature selection (aka attribute selection)
Background:
With genome sequencing cost drastically falling, we will soon be able to sequence
millions of people cheaply. With this wealth of new data, we will be able to associate
genotypes with phenotypes (e.g. diseases or drug responses). The goal is to find the
genes (or more specifically, the SNPs) that are found in conjunction with certain
phenotypes. We will be using a semi-synthetic dataset to explore this type of study.
The dataset can be found at
http://helix-web.stanford.edu/bmi214/assignment2/data/genotenureitus1.arff
Dataset description:
Many past genotype-phenotype association studies only examined a single gene
hypothesized to cause the phenotype (disease). In reality, there are complex
interactions of genes that cause disease. We will use machine learning to find such
complex associations, which is a very cutting-edge research area.
We will use Weka for this section. Open the file “genotenureitis1.arff”. Instead of CSV,
this file is in Weka’s ARFF format, which is described at
http://www.cs.waikato.ac.nz/~ml/weka/arff.html.
This file contains genotype/phenotype data for 557 subjects (people). Each row in the
data section represents one subject (person).
The (comma-delimited) columns are the attributes of the subjects. The first column is
the person’s identifier.
The next 188 columns represent genotypes, which in this case are SNPs (single
nucleotide polymorphisms). The SNPs were identified from three regions in the
genome. The SNPs are named
RjSNPi
where j is the region (1,2, or 3), and i is the SNP number within that region. The values
each SNP can take are {11, 12, 22}.
Recall that humans are diploid. In this file, a value of “11” for a subject’s SNP means
that the both copies are allele 1 (where allele 1’s possible values are the nucleotides
{A,T,G,C}; the actual allele isn’t specified.). “12” means that one was allele 1, and the
other was allele 2. “22” means that both were allele 2. (Allele 2 would be a nucleotide
other than allele 1.)
The last 20 columns in the file represent (artificially created) phenotypes. Some of the
phenotypes, for example, are
- “degree”, with possible values MD, MD/PhD, or PhD
- “npubs” (number of publications), with possible values 4 through 12.
- “gotgrants” (total grant dollars earned, in millions), with possible values 0 through
19.
Note that the genotype data are real, but the phenotype data are artificial. A more
lengthy (and far more humorous) description of the phenotypes can be found here:
http://helix-web.stanford.edu/bmi214/assignment2/data/pgrn_2005.pdf
This is an optional read; it is not necessary for understanding this assignment. This
document and the genotenuritis data file were created for a recent pharmacogenetics
conference.
Data filtering
Back to Weka and genotenureitis1.arff. Scroll to the bottom of the attribute names.
Examine the distributions for the attribute “irep” by clicking on it.
Question 24. How many class labels exist for the attribute irep?
The attribute irep is not useful to us; we will remove it and other useless fields with a
filter. Under “Filter”, choose unsupervised->attribute->RemoveUseless, and click Apply.
This will remove the “irep” attribute and the “ID” attribute(which actually appeared twice –
attribute number 1 and 192). You can undo operations like this one by clicking the Undo
button at the top.
If you don’t apply this filter, you may get erroneous results in the following sections.
Genotype-genotype associations
First we’ll examine whether SNPs are associated with each other. Recall the discussion
in class about haplotypes and linkage disequilibrium.
Question 25. Under the “Associate” tab, run the default associator (Apriori).
Copy here the output under “Best rules found:” (This should be nine SNP
association pairs and one phenotype-SNP pair.)
Question 26. Within all but one SNP-SNP pair, the two SNPs have something in
common. What is it, and what does it mean biologically?
Hint: look at the SNP names and the explanation of SNP names above.
(3 sentences or less.)
Attribute selection
We now perform attribute selection to find which SNPs best predict some of the
phenotypes.
Phenotype 1: gotgrants
The first phenotype we’ll investigate is “gotgrants”. We want to know whether there are
any SNPs related to the amount of grant money earned.
We disclose that the first phenotype was created artificially as a simple linear function of
the penetrance of a single SNP. You will try to discover what this SNP was.
Choose “gotgrants” from the dropdown menu on the left. Under the “Select attributes”
tab, choose the GainRatioAttributeEval evaluator with Ranker Search Method. Use
cross-validation with Folds=5. Run.
Question 27. What is the top SNP (genotype) in this feature selection result?
Question 28. Run multiple Attribute Evaluators with multiple Search Methods; is
the top gene the same in general?
(Note that some Evaluators require specific Search Methods; see the error log and/or
documentation if you get funny results.)
Question 29. Look at the top 80 or so SNPs. Do you see anything interesting
about them? What does this mean biologically? (3 sentences or less.)
Phenotype 2: pctdrivel
The next phenotype is “pctdrivel”; again, we’ll investigate whether there are any SNPs
that are good predictors of this phenotype.
This phenotype was created artificially as a function of two other phenotypes. The
inclusion of other phenotypes represents the fact that environment can impact
phenotype, not just genetic makeup (cf. the nature vs. nurture debate).
Run multiple Attribute Evaluators with multiple Search Methods.
Question 30. What do you think are the two phenotypes affecting pctdrivel and
why? A well-justified answer will be accepted. (3 sentences or less.)
EXTRA CREDIT:
Phenotype 3: rivalside
This phenotype is very difficult to decipher; doing so will earn 5% extra credit.
This phenotype was created artificially as a more complex function of one other
phenotype and six SNPs.
What attributes do you think are affecting rivalside and why?
Download