Genetics 875: Clustering Yeast Gene Expression Data
Monday, October 15, 2012

Today you will cluster yeast gene expression data in different ways to see how your choice of clustering parameters and data selection affects the clustering output. You will use the program Gene Cluster 3.0 (coded by M. de Hoon, based on Mike Eisen’s original Cluster program for PCs; http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/) to cluster data and the program Java Treeview (http://jtreeview.sourceforge.net/) to visualize the clustered data.

Data:

Heat+EtOH_RNAseq_FDR0.05: This includes the gene expression data you analyzed last week in the baySeq lab. The data have been converted to log2 fold-change in expression after a 15 min heat shock or ethanol exposure (compared to unstressed cells immediately before treatment). Only genes that met the 5% FDR cutoff in heat or ethanol (*analyzed separately) are included in this file. Note that the gene annotations visible in Treeview include “HS” or “EtOH” if the gene met the 5% FDR cutoff in the respective experiment.

NEW_HS+EtOH+H2O2+aa+starvation_90p.txt: This file contains the RNA-seq data used in the above analysis merged with some older microarray data looking at the response to several other stress conditions.

In both cases, the log2 data reflect expression in stressed cells versus unstressed cells: a positive value therefore indicates an increase in transcript abundance upon stress (i.e. gene induction), and a negative value represents a decrease in transcript abundance (gene repression). The file has been made compatible with the Cluster 3.0 and Java Treeview software; open the tab-delimited text file in Excel to peruse the format. The first column has a unique gene identifier, the second column has a gene annotation, and the remaining columns are log2 fold-change data.

1-1. Clustering Heat+EtOH data

Open Cluster 3.0. Take a moment to observe what functions are included in the program.

a. Load the data file Heat+EtOH_RNAseq_FDR0.05: File->Open data file. The size of the dataset is shown.
b. We will initially work with a smaller subset of the data. Under the Filter Data field, select genes that have “At least 2 observations with abs(Val) >= 1”. Hit “Apply Filter” and then “Accept Filter”. ** Change the job name to “Heat+EtOH_RNAseq_FDR0.05_2obs2X” to keep track of this selection. It should now say that your “Data set has 2,439 Rows 4 Columns”.
c. Go to the Hierarchical tab. Under Genes select “Cluster”. We will not cluster arrays for now, so leave the “Cluster” box under Arrays deselected.
d. Use the default similarity metric, “Correlation (uncentered)”, which is the uncentered Pearson correlation.
e. Cluster the data using the left-most button, “Centroid linkage.” The program will generate two new files for you: JobName.cdt and JobName.gtr. Open the .cdt file in Excel. Note that the file is very similar to the input .txt file, except that it contains a) data only for the selected genes b) in clustered order c) along with an extra column, “GID”, that links the .cdt file to the .gtr file, which contains the information needed to draw the dendrogram, and d) a few extra “WEIGHT” columns. You need both the .cdt and .gtr files to fully visualize the clustering in TreeView.

1-2. Visualize the clustering output

Open the program Java Treeview and open your .cdt file (File -> Open). The left-most panel shows a zoomed-out image of the clustered data along with the dendrogram. Click and drag on that image, and a zoomed-in image of your selection will be displayed in the center panel. Above are the sample annotations and to the right are the gene annotations. Click on one of the annotations to link to (in the case of this data) the Saccharomyces Genome Database, SGD. [Note that you can link annotations to any specified website that uses the same gene identifiers.] Take a few minutes to look at your clustering and observe the output.

2. Changing clustering parameters

Go back and repeat the clustering using “Euclidean distance” (change the job name) and Centroid Linkage. Compare the clustering to the previous clustering using the Pearson correlation (uncentered). How does using Euclidean distance affect the clustering, and why? When might you want to use this metric?

Euclidean distance is more sensitive to the magnitude of expression than to the pattern of expression across samples. You would use this distance in cases where you care about magnitude (e.g. identifying genes based on drug response).

Go back and repeat the clustering using “Absolute correlation (uncentered)” (change the job name) and Centroid Linkage. How does using Absolute correlation affect the clustering, and why?

Absolute correlation takes the absolute value of the Pearson correlation. In this case, information about anticorrelation is lost, so induced genes and repressed genes with similar (but opposite) patterns of expression will group together.

List a circumstance in which you might want to use absolute correlation.

You would use this whenever you don’t care about the directionality of the expression change (e.g. if trying to identify downstream targets of a pathway that may regulate induced and repressed genes with the same timing).

3. Clustering combined datasets.

Hierarchical clustering will give different gene clusters depending on the experiments being clustered. Download the file NEW_HS+EtOH+H2O2+aa+starvation.txt. This file contains the heat shock and ethanol data from above plus corresponding microarray data from timecourses measuring the response to hydrogen peroxide, amino acid starvation, and nutrient starvation (labeled ‘YPD’). Open the file in Cluster, select genes that are affected >2X (1 in log2 space) in at least 4 experiments, and cluster the data using Pearson correlation (uncentered) and centroid linkage.
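The fold-change filter and the similarity metrics used in the steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Cluster 3.0 implementation: the function names and the toy log2 profiles are hypothetical, and Cluster 3.0 may scale its Euclidean distance differently from the textbook form used here.

```python
import math

def passes_filter(values, min_obs=2, min_abs=1.0):
    """True if at least min_obs non-missing values have |value| >= min_abs."""
    hits = sum(1 for v in values if v is not None and abs(v) >= min_abs)
    return hits >= min_obs

def uncentered_correlation(x, y):
    """Pearson correlation computed without subtracting the means."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return dot / norm

def absolute_uncentered_correlation(x, y):
    """Same as above, but the sign (induced vs. repressed) is discarded."""
    return abs(uncentered_correlation(x, y))

def euclidean_distance(x, y):
    """Textbook Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Toy log2 fold-change profiles (made-up values, 4 samples each).
induced_weak = [1.0, 2.0, 1.0, 2.0]
induced_strong = [2.0, 4.0, 2.0, 4.0]   # same pattern, twice the magnitude
repressed = [-1.0, -2.0, -1.0, -2.0]    # mirror image of induced_weak

r_same = uncentered_correlation(induced_weak, induced_strong)
r_opposite = uncentered_correlation(induced_weak, repressed)
r_abs = absolute_uncentered_correlation(induced_weak, repressed)
d_same = euclidean_distance(induced_weak, induced_strong)

print(r_same, r_opposite, r_abs, round(d_same, 2))
```

In this toy example the two induced profiles are perfectly correlated even though their Euclidean distance is about 3.16, and absolute correlation scores the induced and repressed profiles as identical. For the combined dataset in section 3, the corresponding filter call would be passes_filter(values, min_obs=4, min_abs=1.0).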
View the data in a New Window of Treeview (in your existing browser, go to “Window” and select “New Window”) and compare to the initial clustering of Heat + Ethanol data. Describe briefly how the two clustered files (HS+EtOH vs HS+EtOH+H2O2+aa+starvation) compare and describe why.

The second analysis has a more detailed hierarchical tree, with finer resolution and generally more, smaller clusters. This is because the algorithm has more (differential) data with which to distinguish patterns.

4. Using clustering to assess reproducibility and confidence.

Find the gene SHM2 (Analysis -> Find genes...; search for and then click on the gene of interest to display it). If you hold your mouse over the thumbnail/zoomed-out view, you can use the up and down arrow keys to expand the node according to the hierarchical clustering tree. Based on the FDR cutoff (shown in the annotation), do you believe this gene was differentially expressed in response to heat shock?

The FDR for this gene is >5%; with nothing else to go on, we probably would not call this gene significant.

Now look at the cluster to which SHM2 belongs. Does this information change your opinion? Why or why not?

When you consider other genes in the cluster you should see that a) many genes have expression patterns similar to SHM2 and b) many of those genes are functionally related to SHM2, suggesting that the whole biological pathway is likely similarly (and significantly) affected.

5. Functional prediction based on clustering.

Based on the clustered datasets, you will assign a hypothetical function to the gene YIL165C. Return to the Heat+EtOH_RNAseq_FDR0.05 dataset and cluster all the data (*without filtering based on fold-cutoff). Search for YIL165C and expand the cluster to see what genes it clusters with: move your mouse to the far-left dendrogram panel, then use the up arrow key to expand the cluster to a reasonable cutoff. Scan the gene annotations and look for common features in the gene set, if you can.
Next, use the GO tools webpage at Princeton to look for functional enrichment:

1. Retrieve the gene names in your selected cluster by going to ‘Export’ and ‘Save List’ in Treeview. Copy the unique identifiers to paste into the website below.
2. Go to http://go.princeton.edu/
3. Select ‘Generic GO Term Finder’.
4. Enter the gene list into the appropriate field and enter your email address.
5. Try the functional enrichment based on all 3 GO classifications.
6. Go back and repeat the process based on the clustered NEW_HS+EtOH+H2O2+aa+starvation.txt dataset.

What cellular process would you predict this gene functions in, and why? Which clustering was more effective in assigning the hypothetical function?

First analysis: most people got different answers, some related to membrane proteins, heat response, and other functions, and many people did not find a significant enrichment. The differences likely reflect the fact that it is harder to define the cluster (different people pulled different gene sets depending on how they ‘defined’ the cluster).

For the second analysis, on the more detailed stress dataset, most people found that the gene belongs to a cluster linked to amino acid metabolism. Most people find it easier to identify the boundaries of the cluster in this analysis, and the pattern is obvious by eye as well as from the hierarchical tree.
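To make the idea of “functional enrichment” concrete, the sketch below computes a hypergeometric tail probability: the chance of seeing at least k genes with a given GO annotation in a cluster of n genes, when K of the N background genes carry that annotation. It is a minimal illustration rather than the GO Term Finder's own code; the function name and all of the counts are hypothetical, and real tools also correct for testing many GO terms at once, which is omitted here.

```python
from math import comb

def hypergeom_enrichment_p(k, n, K, N):
    """P(X >= k) for a cluster of n genes drawn from N background genes,
    K of which carry the annotation of interest."""
    total = comb(N, n)
    upper = min(n, K)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / total

# Hypothetical counts: a 20-gene cluster in which 8 genes share an annotation
# carried by 100 of 6000 background genes.
p_enriched = hypergeom_enrichment_p(8, 20, 100, 6000)

# The same cluster with only 1 annotated gene is unremarkable.
p_not_enriched = hypergeom_enrichment_p(1, 20, 100, 6000)

print(p_enriched, p_not_enriched)
```

A small, tightly defined cluster with many shared annotations gives a very small probability, whereas a loosely drawn cluster dilutes the signal, which is consistent with the different results people obtained from the two clusterings.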