Lab 05: Selecting differentially expressed genes

Genetics 875: Clustering Yeast Gene Expression Data
Monday, October 15, 2012
Today you will cluster yeast gene expression data in different ways to see how your
choice of clustering parameters and data selection affect the clustering output. You will
use the program Gene Cluster 3.0 (coded by M. de Hoon, based on Mike Eisen’s original
Cluster program for PCs; http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/) to
cluster data and the program Java Treeview (http://jtreeview.sourceforge.net/) to
visualize the clustered data.
Data:
Heat+EtOH_RNAseq_FDR0.05: This includes the gene expression data you analyzed
last week in the baySeq lab. The data have been converted to log2 fold-change in
expression after a 15 min heat shock or ethanol exposure (compared to unstressed cells
immediately before treatment). Only genes that met the 5% FDR cutoff in heat or
ethanol (*analyzed separately) are included in this file. Note that the gene annotations
visible in Treeview include “HS” or “EtOH” if the gene met the 5% FDR cutoff in the
respective experiment.
NEW_HS+EtOH+H2O2+aa+starvation_90p.txt: This file contains the RNA-seq data
used in the above analysis merged with some older microarray data looking at the
response to several other stress conditions.
In both cases, the log2 data reflect expression in stressed cells versus unstressed cells:
therefore a positive value indicates an increase in transcript abundance upon stress (i.e.
gene induction), and a negative value represents a decrease in transcript abundance (gene
repression). Each file has been made compatible with the Cluster 3.0 and Java Treeview
software; open the tab-delimited text file in Excel to peruse the format. The first
column has a unique gene identifier, the second column has a gene annotation, and the
remaining columns are log2 fold-change data.
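To make the sign convention concrete, here is a minimal Python sketch of how a log2 fold-change is computed. The function name and the abundance values are hypothetical and are not taken from the lab files.

```python
import math

def log2_fold_change(stressed, unstressed):
    """Return log2(stressed / unstressed); both values must be positive."""
    return math.log2(stressed / unstressed)

# Hypothetical abundance values, for illustration only.
print(log2_fold_change(400.0, 100.0))  # 2.0  (4-fold induction)
print(log2_fold_change(50.0, 100.0))   # -1.0 (2-fold repression)
```

A value of 1 in log2 space therefore corresponds to a 2-fold change, which is the cutoff used in the filtering step of this lab.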
1-1. Clustering Heat+EtOH data
Open Cluster 3.0. Take a moment to observe what functions are included in the
program.
a. Load the data file Heat+EtOH_RNAseq_FDR0.05: File->Open data file. Size of the
dataset is shown.
b. We will initially work with a smaller subset of the data. Under the Filter Data field,
select genes that have “At least 2 observations with abs(Val) >= 1”. Hit “Apply Filter”
and then “Accept Filter”. ** Change the job name to
“Heat+EtOH_RNAseq_FDR0.05_2obs2X” to keep track of this selection. It should now
say that your “Data set has 2,439 Rows 4 Columns”.
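The filter in step b can be sketched in a few lines of Python. This is only an illustration of the rule "at least 2 observations with an absolute value of at least 1", not Cluster 3.0's own code; the function name, gene names, and values are hypothetical.

```python
def passes_filter(values, min_obs=2, cutoff=1.0):
    """True if at least min_obs non-missing values have abs(value) >= cutoff."""
    hits = sum(1 for v in values if v is not None and abs(v) >= cutoff)
    return hits >= min_obs

# Hypothetical rows: gene ID -> log2 fold-changes (None marks a missing value).
rows = {
    "geneA": [1.8, 2.1, 0.2, 0.1],
    "geneB": [0.4, -0.3, 0.9, 0.0],
    "geneC": [-1.5, None, -1.0, 0.3],
}
kept = [gene for gene, vals in rows.items() if passes_filter(vals)]
print(kept)  # ['geneA', 'geneC']
```

Because the test uses the absolute value, strongly repressed genes (such as the hypothetical geneC) pass the filter just as strongly induced genes do.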
c. Go to the Hierarchical tab. Under Genes select “Cluster” – we will not cluster arrays
for now, so leave the “Cluster” box under Arrays deselected.
d. Use the default similarity metric, “Correlation (uncentered)” which is the uncentered
Pearson correlation.
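For reference, the uncentered Pearson correlation is the ordinary Pearson formula with the means assumed to be zero, which is the same as the cosine of the angle between the two expression vectors. The sketch below compares it with the centered version on hypothetical profiles; the function names are illustrative and are not part of Cluster 3.0.

```python
import math

def uncentered_correlation(x, y):
    """Pearson correlation computed without subtracting the means."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def centered_correlation(x, y):
    """Standard Pearson correlation: subtract each mean, then correlate."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return uncentered_correlation([a - mean_x for a in x],
                                  [b - mean_y for b in y])

# Hypothetical profiles: y has the same shape as x but is offset by +2.
x = [1.0, 2.0, 3.0]
y = [3.0, 4.0, 5.0]
print(centered_correlation(x, y))    # essentially 1: the offset is ignored
print(uncentered_correlation(x, y))  # below 1: the offset counts as a difference
```

For log2 fold-change data, zero means "no change", so keeping the offset (the uncentered version) is often the more meaningful choice.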
e. Cluster the data using the left-most button, “Centroid linkage.” The program will
generate two new files for you: JobName.cdt and JobName.gtr. Open the .cdt file in
Excel. Note that the file is very similar to the input .txt file, except that it contains a) data
only for the selected genes b) in clustered order c) along with an extra column, “GID”
that links the .cdt file to the .gtr file that contains information to draw the dendrogram,
and d) a few extra “WEIGHT” columns. You need both the .cdt and .gtr files to fully
visualize the clustering in TreeView.
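The idea behind centroid linkage can be sketched as follows: at each step, join the two clusters whose centroids (average profiles) are closest, and record the join. This is a minimal teaching sketch, not Cluster 3.0's implementation. It uses plain Euclidean distance rather than the correlation metric chosen above, and the gene names and values are hypothetical.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def centroid(vectors):
    """Average profile of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def centroid_linkage(profiles):
    """Return the merges as (members_a, members_b, distance), in join order."""
    clusters = {name: [name] for name in profiles}  # label -> member genes
    merges = []
    while len(clusters) > 1:
        labels = list(clusters)
        best = None
        for i, a in enumerate(labels):
            for b in labels[i + 1:]:
                ca = centroid([profiles[g] for g in clusters[a]])
                cb = centroid([profiles[g] for g in clusters[b]])
                d = euclidean(ca, cb)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters.pop(b)  # fold b into a
    return merges

# Hypothetical two-sample profiles.
profiles = {
    "g1": [2.0, 2.1], "g2": [2.2, 1.9],
    "g3": [-1.5, -1.4], "g4": [-1.6, -1.2],
}
merges = centroid_linkage(profiles)
for members_a, members_b, dist in merges:
    print(members_a, "+", members_b, round(dist, 3))
```

The ordered list of joins plays a role loosely similar to the node information the .gtr file holds for drawing the dendrogram.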
1-2. Visualize the clustering output
Open the program Java Treeview and open your .cdt file (File -> Open). The left-most
panel shows a zoomed-out image of the clustered data along with the dendrogram. Click
and drag on that image, and a zoomed-in image of your selection will be displayed in
the center panel. Above are the sample annotations and to the right are the gene
annotations. Click on one of the annotations to link to (in the case of this data) the
Saccharomyces Genome Database, SGD. [Note that you can link annotations to any
specified website that uses the same gene identifiers]
Take a few minutes to look at your clustering and observe the output.
2. Changing clustering parameters
Go back and repeat the clustering using “Euclidean distance” (change the job-name) and
Centroid Linkage. Compare the clustering to the previous clustering using the Pearson
correlation (uncentered).
How does using Euclidean distance affect the clustering, and why? When might you
want to use this metric?
Euclidean distance is more sensitive to the magnitude of expression, rather than the
pattern of expression across samples. You would use this distance in cases where you
cared about magnitude (e.g. identifying genes based on drug response).
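The difference between the two metrics can be seen on a small hypothetical example. The sketch below uses the textbook Euclidean distance (Cluster 3.0's exact scaling may differ) and the uncentered correlation; the profiles are made up for illustration.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def uncentered_correlation(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) *
                  math.sqrt(sum(b * b for b in y)))

a = [1.0, 2.0, 3.0]   # reference gene
b = [2.0, 4.0, 6.0]   # same pattern as a, twice the magnitude
c = [1.0, 2.0, 2.5]   # similar magnitude to a, slightly different pattern

print("Euclidean   a-b:", euclidean(a, b), " a-c:", euclidean(a, c))
print("Correlation a-b:", uncentered_correlation(a, b),
      " a-c:", uncentered_correlation(a, c))
```

By Euclidean distance, c is much closer to a than b is. By correlation, b is the better match because its pattern is identical to a's, only scaled.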
Go back and repeat the clustering using “Absolute correlation (uncentered)” (change the
job-name) and Centroid Linkage.
How does using Absolute correlation affect the clustering, and why?
Absolute correlation takes the absolute value of the Pearson correlation. In this case,
information about anticorrelation is lost, so induced genes and repressed genes with
similar (but opposite) patterns of expression will group together.
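A small hypothetical example shows the effect. Here the distance is taken as 1 minus the similarity, which is an assumption made for illustration; the profiles are made up.

```python
import math

def uncentered_correlation(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) *
                  math.sqrt(sum(b * b for b in y)))

induced = [1.0, 2.0, 3.0]
repressed = [-1.0, -2.0, -3.0]   # mirror image of the induced profile

r = uncentered_correlation(induced, repressed)
print("correlation:", r)                       # close to -1
print("distance, signed:", 1 - r)              # close to 2 (far apart)
print("distance, absolute:", 1 - abs(r))       # close to 0 (grouped together)
```

With the signed correlation the two genes are as far apart as possible; with the absolute correlation they are treated as nearly identical.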
List a circumstance in which you might want to use absolute correlation.
You would use this whenever you don’t care about the directionality of the expression
change (e.g. if trying to identify downstream targets of a pathway that may regulate
induced and repressed genes with the same timing).
3. Clustering combined datasets.
Hierarchical clustering will give different gene clusters depending on the experiments
being clustered.
Download the file NEW_HS+EtOH+H2O2+aa+starvation.txt. This file contains the
heat shock and ethanol data from above plus corresponding microarray data from
timecourses measuring the response to hydrogen peroxide, amino acid starvation, and
nutrient starvation (labeled ‘YPD’).
Open the file in Cluster, select genes that are affected >2X (1 in log2 space) in at least 4
experiments, and cluster the data using Pearson correlation (uncentered) and centroid
linkage. View the data in a New Window of Treeview (in your existing browser, go to
“Window” and select “New Window”) and compare to the initial clustering of Heat +
Ethanol data.
Describe briefly how the two clustered files (HS+EtOH vs
HS+EtOH+H2O2+aa+starvation) compare and describe why.
The second analysis has a more detailed hierarchical tree, with finer resolution and
generally a larger number of smaller clusters. This is because the additional conditions
give the algorithm more (differential) data with which to distinguish expression patterns.
4. Using clustering to assess reproducibility and confidence.
Find the gene SHM2 (Analysis -> Find Genes...; search for and then click on the gene of
interest to display). If you hold your mouse over the thumbnail/zoomed out view, you
can use the up and down arrow keys to expand the node according to the hierarchical
clustering tree.
Based on the FDR cutoff (shown in the annotation), do you believe this gene was
differentially expressed in response to heat shock?
The FDR for this gene is >5%. With nothing else to go on, we probably would not call
this gene significant.
Now look at the cluster to which SHM2 belongs. Does this information change your
opinion – why or why not?
When you consider the other genes in the cluster, you should see that a) many genes have
expression patterns similar to that of SHM2 and b) many of those genes are functionally
related to SHM2, suggesting that the whole biological pathway is likely similarly (and
significantly) affected.
5. Functional prediction based on clustering.
Based on the clustered datasets, you will assign a hypothetical function to the gene
YIL165C.
Return to the Heat+EtOH_RNAseq_FDR0.05 dataset and cluster all the data (*without
filtering based on fold-cutoff).
Search for YIL165C and expand the cluster to see what genes it clusters with: Move your
mouse to the far-left dendrogram panel, then use the up arrow key to expand the cluster
to a reasonable cutoff.
Scan the gene annotations and look for common features in the gene set, if you can.
Next, use the GO tools webpage at Princeton to look for functional enrichment:
1. Retrieve the gene names in your selected cluster by going to ‘Export’ and
‘Save List’ in Treeview. Copy the unique identifiers to paste into the
website below.
2. Go to http://go.princeton.edu/
3. Select ‘Generic GO Term Finder’
4. Enter gene list into the appropriate field, enter email address
5. Try the functional enrichment based on all 3 GO classifications.
6. Go back and repeat the process based on the clustered
NEW_HS+EtOH+H2O2+aa+starvation.txt dataset.
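To see what "functional enrichment" means numerically, the sketch below computes a hypergeometric p-value: the chance of seeing at least k annotated genes in a cluster of n genes drawn from a genome of N genes, K of which carry the annotation. This is a common basis for enrichment tools of this kind, but it is offered here as an assumption-laden illustration rather than a description of the website's exact procedure. All counts are hypothetical, and the multiple-testing correction such tools usually apply is omitted.

```python
from math import comb

def hypergeom_enrichment_p(N, K, n, k):
    """P(X >= k) when drawing n genes from N, of which K carry the annotation."""
    upper = min(K, n)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / comb(N, n)

# Hypothetical counts: 6000 genes in the genome, 100 annotated to a GO term,
# 20 genes in the selected cluster, 5 of which carry the annotation.
p = hypergeom_enrichment_p(N=6000, K=100, n=20, k=5)
print(p)
```

A small p-value means the annotation appears in the cluster more often than expected by chance. Note that the result depends on which genes you include, which is why the cluster boundary you choose in Treeview matters.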
What cellular process would you predict this gene functions in and why? Which
clustering was more effective in assigning the hypothetical function?
First analysis: most people got different answers, some related to membrane protein, heat
response, and other functions, and many people did not find a significant enrichment.
The differences likely reflect the fact that it’s harder to define the cluster (and different
people pulled different gene sets depending on how they ‘defined’ the cluster).
For the second analysis on the more detailed stress dataset, most people found that the
gene belongs to a cluster linked to amino acid metabolism. Most people found it easier to
identify the boundaries of the cluster in this analysis, and the pattern is obvious by eye as
well as by the hierarchical tree.