BaySeq lab_ANSWERS

baySeq Lab
Genetics 875
October 8, 2012
Today we will analyze RNA-seq data generated from four S. cerevisiae yeast strains grown in three
environments: standard conditions (25C or ‘t0’), 30 min after a shift from 25C to 37C (‘heat’), and 30
min after a shift to medium containing 5% ethanol (‘EtOH’). Strains were collected in duplicates, with
‘rep1’ of all samples done on one day and ‘rep2’ done on a different day. Our goal is to identify genes
that are a) differentially expressed in response to each condition and b) differentially expressed across
strains grown under different conditions.
We will use the methods of Hardcastle and Kelly (2009, “Empirical Bayesian analysis of patterns of
differential expression in count data”) as implemented in the Bioconductor package baySeq. This method
uses an empirical Bayesian approach to identify differential gene expression starting from unnormalized
count data per gene. The newest version can handle paired data and can identify differential expression
based on a variety of multi-sample models. The required data are unnormalized summed read counts per
gene (or transcript/exon unit) and a list of gene/transcript/exon unit lengths.
We will perform three comparisons in class today, and follow up on these results next week in
the clustering lab. The first is a practice run to identify the number of genes whose expression
responds to heat shock – follow the attached baySeq Short Course handout to do this.
** An important hint: when you get errors, look at them closely and try to figure out where the
problem is – often it is a missing character (,) or typo.
[ Short Course walk through for simple two-sample comparison ]
HOMEWORK QUESTION: Open the file of differentially expressed genes in Excel to survey
the results. What was the total number of genes on your list and how many got assigned a
PP/FDR? How many genes, and what percent of the total genes, were selected as
differentially expressed at an FDR of 0.05? Which of these two gene sets do you have more
confidence in and why? Which set is better?
Out of 9018 genes on the list, 7856 were assigned PP/FDRs. Those that weren’t scored had at
least one missing value in the duplicate data.
Of the 7856 that were scored, 3215 (40.9%) were differentially expressed at an FDR of 5% and
1964 (25%) were differentially expressed at an FDR of 1%.
Which of these gene sets is “better” depends on what you want out of the data: FDR 1% will
have fewer false positives but more false negatives than FDR 5%.
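The same counts can be obtained in R instead of Excel. The sketch below is only an illustration: the `fdr` vector is made-up toy data standing in for the FDR column of your own output file, and it assumes a gene is "selected" when its FDR is strictly below the cutoff.

```r
# Toy FDR values standing in for the FDR column of the baySeq output;
# NA marks a gene that was not scored (hypothetical data).
fdr <- c(0.001, 0.004, 0.03, 0.2, NA, 0.6, 0.009, 0.049)

scored  <- sum(!is.na(fdr))                # genes that received a PP/FDR
n_fdr05 <- sum(fdr < 0.05, na.rm = TRUE)   # genes called DE at FDR 5%
n_fdr01 <- sum(fdr < 0.01, na.rm = TRUE)   # genes called DE at FDR 1%

# Percentages are taken over the scored genes, as in the answer above.
pct_fdr05 <- 100 * n_fdr05 / scored
pct_fdr01 <- 100 * n_fdr01 / scored
```

Every gene that passes the 1% cutoff also passes the 5% cutoff, so the stricter list is always a subset of the looser one.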
How does this compare to using an average fold-change cutoff of >2X? (You can do this in
Excel by comparing heat shock rep1 to unstressed sample rep1, and the same for rep2, then taking
the absolute value of the average fold change, abs(average(B:B)).) How many of these genes
with an average fold-change >2X had an FDR < 5%?
3582 genes had an average fold-change greater than 2X. 2669 of these had an FDR < 0.05.
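For those who prefer R to Excel, the fold-change comparison could look roughly like the sketch below. It assumes the fold change is taken as a log2 ratio, so that "absolute value of the average" treats induction and repression symmetrically and >2X corresponds to an absolute value above 1. The counts, gene names, and FDR values are invented for illustration.

```r
# Hypothetical read counts for four genes in t0 and heat, two replicates each.
counts <- data.frame(
  t0_rep1   = c(100, 50, 200, 80),
  t0_rep2   = c(120, 40, 220, 75),
  heat_rep1 = c(450, 55,  40, 90),
  heat_rep2 = c(500, 35,  60, 70),
  row.names = c("geneA", "geneB", "geneC", "geneD")
)
fdr <- c(geneA = 0.001, geneB = 0.40, geneC = 0.20, geneD = 0.70)  # hypothetical FDRs

# Per-replicate log2 fold change (heat vs. t0), pairing samples from the same day.
# Note: a zero count would give Inf or NaN here and would need separate handling.
lfc_rep1 <- log2(counts$heat_rep1 / counts$t0_rep1)
lfc_rep2 <- log2(counts$heat_rep2 / counts$t0_rep2)

# Absolute value of the average log2 fold change; > 1 means more than 2X up or down.
abs_mean_lfc <- abs((lfc_rep1 + lfc_rep2) / 2)
big_change   <- abs_mean_lfc > 1

n_big     <- sum(big_change)                # genes with > 2X average change
n_big_sig <- sum(big_change & fdr < 0.05)   # of those, genes with FDR < 5%
```

In this toy example geneC changes more than 2X but does not pass the FDR cutoff, which is the kind of disagreement the question asks you to look for.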
Next, we will look at genes that respond to heat shock versus ethanol shock in the lab strain,
DBY8268. Load the data file “YeastStrains_HSEtOH_75p.txt”. This file has duplicate data for
t0, heat, and ethanol for four different strains. Open the file with read.delim, with row.names in
the first column.
> data2<-read.delim("PATH/YeastSTrains_HSEtOH_90p.txt", row.names=1)
Survey the column headers by printing out the column names:
> names(data2)
You will see that columns 1:2, 3:4, 5:6 correspond to replicate t0, heat, and ethanol for the first
strain, 7:8, 9:10, 11:12 for the second strain, 13:14, 15:16, 17:18 for the third strain, and 19:20,
21:22, 23:24 for the fourth strain. Here, we will just focus on the first strain and identify genes
differentially expressed in response to heat and/or ethanol. The easiest approach is to make a separate data
container for just the first six columns:
> subset<-data2[, 1:6]
> row.names(subset)<-row.names(data2)
Walk through the instructions in the Short Course manual. It is advisable to use new names for
these objects so as to avoid confusion with the ones you already created that are still in memory
(** don’t forget to change those container names within the functions you type).
This time we will use a more complicated set of models for differential expression:
NDE = no difference in expression in any sample
DEH = expression difference upon heat but NOT ethanol
DEE = expression difference upon ethanol but NOT heat
DEHE = expression difference upon heat and ethanol
(*Note that we could split the last model into two related models: one in which the H and E
response is the same, and one in which the H and E responses are different from each other (and
from the t0 control).)
HOMEWORK QUESTION: What are the vectors of models you will use to create the ‘groups’
container?
NDE = 1,1,1,1,1,1
DEH = 1,1,2,2,1,1
DEE = 1,1,1,1,2,2
DEHE = 1,1,2,2,2,2
(you could also have DEHE2 = 1,1,2,2,3,3)
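Written out in R, the model vectors above can be collected into a named list, one entry per model. This is a minimal sketch of the list that would be supplied as the groups when the countData object is created. The object names `groups` and `replicates` and the replicate labels are chosen here for illustration, so follow the Short Course handout for the exact call.

```r
# Column order for the first strain: t0 rep1, t0 rep2, heat rep1, heat rep2,
# EtOH rep1, EtOH rep2. Within a model, samples sharing a number are assumed
# to share the same expression level.
groups <- list(
  NDE  = c(1, 1, 1, 1, 1, 1),  # no difference in any sample
  DEH  = c(1, 1, 2, 2, 1, 1),  # heat differs; ethanol looks like t0
  DEE  = c(1, 1, 1, 1, 2, 2),  # ethanol differs; heat looks like t0
  DEHE = c(1, 1, 2, 2, 2, 2)   # heat and ethanol both differ from t0
)

# Hypothetical replicate labels, one per column, in the same order.
replicates <- c("t0", "t0", "heat", "heat", "EtOH", "EtOH")

# Sanity check: every model must have one entry per sample column.
stopifnot(all(sapply(groups, length) == length(replicates)))
```

A mismatch between the length of a model vector and the number of columns is an easy mistake to make when reusing code from the two-sample practice run, which is why the check is included.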
Create a new “countData” object (with a new name for clarity) and walk through the procedure
to get the posterior probabilities and FDRs. (you can jump ahead to the next section while
waiting for the computation …)
HOMEWORK QUESTION: How many genes are assigned to each of your models of
differential expression at an FDR of 0.01?
DEH: 162 genes
DEE: 258 genes
DEHE: 1575 genes
Transcript potentCRB_71 had a large fold-change but did not meet the FDR cutoff – why?
Data were missing from one replicate, and therefore no statistical analysis was performed.
Transcript 638212746 had a nearly 4X average expression change in response to ethanol, but
did not have a significant FDR level for any of the models – why?
I’m not even sure why this transcript didn’t get selected in the combined heat-ethanol analysis.
In the heat-only analysis, this gene is called differentially expressed with an FDR of 2.5%. In the
combined analysis with the more ‘complex’ expression model (i.e. more possibilities), the
likelihood is by far the highest for the NDE model, which is not correct. This is a good lesson
that the more complex the model, the harder it is to assign a particular model with confidence.
3. Assessing false positives and negatives
* You can do the remaining in Excel outside of class if short of time – if so, email yourself the
output of the first heat shock analysis.
The yeast transcription factor Hsf1p is known to be activated by heat shock, and therefore we
expect Hsf1p target genes to be induced in these experiments. You will use a list of “known”
targets of Hsf1p to assess the sensitivity of your selection. These targets were identified in
genome-wide chromatin-immunoprecipitation experiments (“ChIP-chip” experiments) from
Hahn et al. 2004 http://mcb.asm.org.ezproxy.library.wisc.edu/cgi/content/full/24/12/5249?view=long&pmid=15169889.
We will focus on differentially expressed genes from your first analysis of the heat shock data.
Use the “VLOOKUP” function to copy the DE-model FDR from your baySeq output onto the
Hsf1_targets.xls list: this function takes the gene name in the adjacent cell, looks for the
corresponding gene name in the output file, and retrieves the value from the specified column of
your output file.
a. In the Hsf1_targets file, add a new column after the JGI UID column. In the first
cell, Insert -> Function -> “VLOOKUP”
b. Lookup_value: select the gene name in the adjacent JGI UID cell to the LEFT
c. Table_array: highlight the columns of your DE-model FDR file, where the first cell
has the exact same gene names as our Hsf1_targets file.
d. Col_index_num: enter “7” – this will pull the value from the 7th column in the
designated Table_array (which should be the FDR column of your file)
e. Range_lookup: enter “false” – this will return values for exact matches only
f. This should return the FDR value from the output file for that gene
g. Fill the whole column with this function, to retrieve all the values
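If you would rather do the lookup in R, the base function match() plays the same role as VLOOKUP with exact matching. The sketch below uses two small made-up tables. The column names `gene` and `FDR` are assumptions, so substitute the actual column names of your output file and of the Hsf1 target list.

```r
# Hypothetical stand-in for the baySeq output table.
bayseq_out <- data.frame(
  gene = c("g1", "g2", "g3", "g4"),
  FDR  = c(0.002, 0.30, 0.04, 0.008),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the Hsf1 target list; "g9" has no data.
hsf1_targets <- data.frame(
  gene = c("g3", "g1", "g9"),
  stringsAsFactors = FALSE
)

# match() gives, for each target, the row of bayseq_out with the same name
# (NA if the gene is absent), like VLOOKUP with Range_lookup = FALSE.
idx <- match(hsf1_targets$gene, bayseq_out$gene)
hsf1_targets$FDR <- bayseq_out$FDR[idx]
```

Targets without a match end up with NA rather than an error, so they are easy to exclude when counting known positives "with data".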
HOMEWORK QUESTION:
At an FDR cutoff of 0.01:
How many true positives have you identified? How many false negatives? What is the
sensitivity of your selection? Please show your values and calculations.
38/64 known positives (with data!!) are identified at FDR 0.01. 26/64 were missed at
this cutoff. The sensitivity is how well we identified known true positives: 38/64 =
59.4%
At an FDR cutoff of 0.05:
How many true positives have you identified? How many false negatives? What is the
sensitivity of your selection?
45/64 known positives are identified at FDR 0.05. 19/64 were missed at this cutoff.
The sensitivity is 45/64 = 70.3%
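The arithmetic behind these answers can be checked with a few lines of R. The sketch takes the true-positive counts and the 64 known targets with data from the answers above. False negatives are the known positives that were not selected, and sensitivity is TP / (TP + FN). The variable names are illustrative only.

```r
known_positives <- 64              # Hsf1p targets with data
tp <- c(fdr01 = 38, fdr05 = 45)    # true positives at each FDR cutoff

fn <- known_positives - tp         # false negatives: known targets that were missed
sensitivity <- tp / (tp + fn)      # fraction of known positives recovered

round(100 * sensitivity, 1)        # sensitivity as a percentage
```

Relaxing the cutoff from 0.01 to 0.05 recovers more known targets, at the cost of admitting more false positives into the overall gene list.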