BCB 444/544 Lab 11 Answer Key

Each question is worth ½ point, except the last question, which is worth 1 point.
1) In your own words, describe in several sentences what normalization is, and
why we need to do it prior to any further processing of the data.
I expected a general description that normalization is a statistical
technique used to account for various sources of experimental variability.
2.1) How many gene expression patterns are in the data set?
6153
2.2) How many of them are replicated?
51
2.3) Are there a lot of missing values, in total?
No. There were only 199 missing values before merging replicates and
195 missing after merging, which is only 0.5%.
2.4) Was the data set already in a symmetrical scale?
No. For repressed genes the range of possible values is 0 to 1, while for
over-expressed genes the range of possible values is 1 to ∞. (This can be seen by
viewing the histogram of values and comparing it to the histogram of values in
log2 scale, below. Notice the difference in tail lengths in each distribution.) This
asymmetry is the main reason log2 values are typically used in microarray
analysis.
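As a small illustration of the symmetry argument, the sketch below applies a
log2 transform to a handful of made-up expression ratios (the ratios are
hypothetical and are not taken from the lab data set). Equal fold changes in
opposite directions end up equally far from 0.

```python
import math

def log2_ratio(ratio):
    """Return the base-2 logarithm of an expression ratio (ratio must be > 0)."""
    return math.log2(ratio)

# Hypothetical ratios: 4-fold and 2-fold repression, no change,
# 2-fold and 4-fold over-expression.
ratios = [0.25, 0.5, 1.0, 2.0, 4.0]
log_ratios = [log2_ratio(r) for r in ratios]

for r, lr in zip(ratios, log_ratios):
    print(f"ratio {r:>5} -> log2 ratio {lr:+.1f}")
```

On the raw scale, all repressed values are squeezed between 0 and 1; after the
transform, a 2-fold decrease and a 2-fold increase map to –1 and +1.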
2.5) How many gene expression patterns have 3 missing values?
0
2.6) How many genes will remain if we want to keep only genes with at least 3
peaks higher than 1 or lower than -1 (in a log2 scale)?
142
This answer is found by using the appropriate filter setting and running the
analysis.
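The filter itself is easy to sketch. The code below is only an illustration of
the idea under my own assumptions: the function names and the three example
profiles are hypothetical, a "peak" is taken to be any log2 value strictly
above 1 or strictly below –1, and missing values are written as None. The
actual tool may handle edge cases differently.

```python
def count_peaks(profile, threshold=1.0):
    """Count log2 values strictly above +threshold or below -threshold.

    Missing values (None) are skipped.
    """
    return sum(1 for v in profile if v is not None and abs(v) > threshold)

def filter_by_peaks(profiles, min_peaks=3, threshold=1.0):
    """Keep only the genes whose profile has at least min_peaks peaks."""
    return {
        gene: profile
        for gene, profile in profiles.items()
        if count_peaks(profile, threshold) >= min_peaks
    }

# Hypothetical log2 expression profiles (not from the lab data set).
profiles = {
    "geneA": [1.5, -2.0, 0.1, 1.2, 0.0],    # 3 peaks, kept
    "geneB": [0.2, 0.9, -0.5, 1.1, None],   # 1 peak, dropped
    "geneC": [-1.4, -1.6, -3.0, 0.3, 2.2],  # 4 peaks, kept
}

kept = filter_by_peaks(profiles)
print(sorted(kept))
```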
2.7) Which functions does the pre-analyzer recommend activating?
Apply log transform, base 2; merge replicates (by averaging); filter missing
values, with the minimum percentage of existing values set to 70%.
KNNimpute with a K value of 15 appears to be activated, but if you look at the
Impute checkbox, it is not selected, so imputing is not performed. Points were not
subtracted for listing this as one of the activated options, due to the confusion.
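For reference, here is a rough sketch of what the two recommended steps
"merge replicates (by averaging)" and "filter missing values at 70%" could look
like. This is not the pre-analyzer's code. The function names and example
values are hypothetical, missing values are written as None, and I am assuming
that averaging ignores missing values and that a profile passes the filter when
at least 70% of its values are present.

```python
def merge_replicates(rows):
    """Average replicate profiles position by position.

    Missing values (None) are ignored; a position stays missing only if it
    is missing in every replicate.
    """
    merged = []
    for values in zip(*rows):
        present = [v for v in values if v is not None]
        merged.append(sum(present) / len(present) if present else None)
    return merged

def passes_missing_filter(profile, min_present=0.70):
    """Return True if the fraction of non-missing values is at least min_present."""
    present = sum(1 for v in profile if v is not None)
    return present / len(profile) >= min_present

# Two hypothetical replicate profiles for the same gene.
replicates = [
    [1.0, None, 3.0, None],
    [3.0, 2.0, None, None],
]

merged = merge_replicates(replicates)
print(merged, passes_missing_filter(merged))
```

Under these assumptions, merging can reduce the number of missing values,
because a value missing in one replicate may be present in another.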
3.1 In Scenario I, a single number can be changed for Gene 1 that results in
dramatic changes in the correlation. Use the two graphs for the scenario to guide
your experimentation of the following changes.
a. Change a single sample for Gene 1 that causes the correlation to jump up to
approximately 0.68, and list the change made.
Change Scenario I, Gene 1, Sample 4, from 100 to –150. The correlation
jumps up to approximately 0.68.
b. Change Scenario I, Gene 1, Sample 8, from 100 to –150. Note that the
correlation jumps down to approximately –0.63. Explain this jump by seeing what
changes in each of the two graphs.
In the first graph, it can be seen that this change causes gene 2
expression to be nearly inversely related to gene 1, thus explaining the significant
negative correlation. In the scatter plot in the second graph, it can be seen that
initially there is no apparent linear trend in the points, but after the change the
negative correlation is apparent.
3.2 To help answer the following, first notice that in Scenario II, the pattern for
Gene 2 is evenly spaced between 10 and 80, changing in increments of 10.
a. Change the pattern for Gene 1 in Scenario II such that the correlation is
exactly 1, and list the changes made. You will need to change all but one or two
of the values.
Change the pattern for Gene 1 to be evenly spaced and increasing, for
example, increasing from –100 to 110 in increments of 30. You can watch
the correlation steadily approach 1 as you change the numbers for samples
1 through 8.
b. Change the pattern for Gene 1 in Scenario II such that the correlation is
exactly –1, and list the changes made. You will need to change all but one or two
of the values.
Change the pattern for Gene 1 to be evenly spaced and decreasing, for
example, decreasing from 110 to –100 in increments of 30.
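The answers to 3.2 can be checked with a short calculation. The sketch below
computes the Pearson correlation directly from its definition (the pearson
function is my own helper, not part of the online exercise) for the Gene 2
pattern 10 to 80 and the two example Gene 1 patterns given above.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length patterns with nonzero variance."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

gene2 = list(range(10, 90, 10))            # 10, 20, ..., 80
gene1_increasing = list(range(-100, 111, 30))  # -100, -70, ..., 110
gene1_decreasing = gene1_increasing[::-1]      # 110, 80, ..., -100

print(pearson(gene1_increasing, gene2))
print(pearson(gene1_decreasing, gene2))
```

Any evenly spaced increasing pattern is a linear function of Gene 2 with
positive slope, so the correlation is 1; reversing the order makes the slope
negative and the correlation –1.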
3.3 Scenario III illustrates how sensitive the correlation can be to small changes.
Here we examine a gene whose log ratio changes substantially across samples
and a gene with essentially constant log ratio across samples.
a. Find a pair of samples for which Gene 2 can be changed from 7 to 6,
resulting in a much larger positive correlation, and list the changes made.
By changing only samples 2 and 6 for Gene 2 from 7 to 6, the correlation
jumps to nearly 0.78.
b. Return the two samples found in part (a) to their original values of 7, and find
a new pair of samples for which Gene 2 can be changed from 7 to 6, resulting in
a fairly large negative correlation, and list the changes made.
By changing only samples 4 and 8 for Gene 2 from 7 to 6, the correlation
falls to approximately –0.38.
3.4 Scenario IV shows that correlation is undefined if one of the patterns is
constant across samples. As in the previous scenario, changing just one of the
values for Gene 2 has a significant effect on the correlation.
a. Change the value for sample 1 from 4 to 3, and record the effect on
correlation.
Changing sample 1 causes correlation to jump to 0.632.
b. Change the value for sample 8 from 4 to 3, and record the effect on
correlation.
Changing sample 8 causes correlation to jump to –0.577.
c. Explain why one of these changes has a greater magnitude effect than the
other.
The first change has a greater-magnitude impact on the correlation. When
Sample 8 is changed, Sample 4 (note that in the rightmost graph, this is the
2nd point from the right, NOT the middle point), with a Gene 1 value of 100,
keeps the line from dropping too far on the right, because the line tries to
stay close to all sample points. On the left end, the closest point to Sample 1
is Sample 2, but it is farther from Sample 1 than Sample 4 is from Sample 8, so
it restrains the line less when Sample 1 is changed. As a result, the “pull” on
the line when Sample 8 is changed is not as great.
d. Which single change from 4 to 3 would give the correlation nearest to 0?
Why?
Changing Sample 3, Gene 2, from 4 to 3 gives a correlation of 0.027. No
other single change from 4 to 3 results in a correlation this close to 0. The
reason this correlation is so near 0 is that this Gene 1 value (10) is closest
to the average, so changing its Gene 2 value has little effect on the line of
best fit.
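Both points from Scenario IV can be reproduced in a few lines. The sketch
below uses hypothetical Gene 1 values (they are NOT the values from the online
exercise), a Gene 2 pattern that is constant at 4, and a helper that returns
None when a pattern has zero variance. It then changes one Gene 2 value at a
time from 4 to 3 and reports which change leaves the correlation nearest to 0.

```python
import math

def pearson(x, y):
    """Return Pearson's r, or None if either pattern is constant (undefined)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None  # zero variance: the formula would divide by zero
    return sxy / math.sqrt(sxx * syy)

# Hypothetical Gene 1 pattern (mean is 5); Gene 2 is constant at 4.
gene1 = [-40, -25, 10, 90, 20, -10, -30, 25]
gene2 = [4] * 8

undefined = pearson(gene1, gene2)  # None, because Gene 2 is constant

# Change one sample of Gene 2 at a time from 4 to 3.
r_after_change = []
for k in range(len(gene2)):
    changed = list(gene2)
    changed[k] = 3
    r_after_change.append(pearson(gene1, changed))

# Index of the single change that leaves r closest to 0.
nearest_zero = min(range(len(gene1)), key=lambda k: abs(r_after_change[k]))
print(undefined, nearest_zero + 1, r_after_change[nearest_zero])
```

With only one Gene 2 value changed, the correlation works out to be
proportional to the (negated) distance of that sample's Gene 1 value from the
Gene 1 mean. Changing a sample far below the mean gives a large positive
correlation, one far above the mean gives a large negative correlation, and
the sample nearest the mean gives a correlation nearest 0.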
4. Write a couple paragraphs comparing and contrasting the 4 different methods.
For the tree results, be sure to expand the tree several levels to get an idea of
how well the genes are clustered at different levels. Did any of the four methods
perform particularly poorly? How about particularly well?
This was intended to be an open-ended question, without any specific
answer required. Credit was given as long as there was clearly reasonable
thought put into the answer. One obvious difference in the four methods is that
two of the methods (Cluster and SOTA) result in a hierarchical tree, whereas the
other two (SOM and k-means clustering) simply return a number of distinct
clusters, without specifying any relationship between each of the final clusters. In
other words, the CAAT viewer isn’t going to show us anything new for these
results, since CAAT is a tree viewer.
Another aspect I was looking for, but did not require for credit, was the
difference in the trees produced by the Cluster and SOTA methods. As nodes
from the Cluster tree are expanded, you should have noticed that one branch
would contain only a few similar expression profiles, with the majority of
instances remaining in the other branch. SOTA, on the other hand, appears to
split examples fairly evenly at each node, with each branch splitting the examples
into two clusters having fairly distinct mean expression profiles.
Answers for 3.1-3.4 are from exercises provided in the online companion to:
Campbell, M., & Heyer, L. (2006). Discovering Genomics, Proteomics, &
Bioinformatics (http://www.geneticsplace.com)