Project 1. Gene expression data analysis and interpretation

PROJECT OBJECTIVES
1. Gain basic skills of data preprocessing in Linux.
2. Learn to perform differential expression analysis using the t-test in R.
3. Learn to perform clustering analysis in R.
4. Learn to perform pathway- and network-level interpretation of the differential expression analysis results.
5. Answer the following biological questions:
   a. Which genes are differentially expressed between human colorectal adenomas and normal controls? Are these differentially expressed genes reproducible between studies?
   b. Is unsupervised clustering able to distinguish adenomas from normal controls?
   c. Which Gene Ontology biological processes, molecular functions, and cellular components, and which KEGG pathways, are enriched among the differentially expressed genes? Are these pathways consistent between studies?
   d. How do the differentially expressed genes interact with each other? Are they enriched in certain regions of the human protein-protein interaction network?

FINAL REPORT
The final report should include a brief introduction section, a methods section, and a results section. The introduction section should demonstrate a basic understanding of the biological questions to be pursued. The methods section should demonstrate a good understanding of the bioinformatics tools and methods used, and the method description should allow others to reproduce your results. The results section should provide answers to the biological questions listed above. Figures, tables, and references can be included as necessary but have to fit into the 5-page limit. You don’t have to use all 5 pages, but content beyond the page limit will be ignored. Evaluation will be based primarily on the completeness and soundness of the methods and results and on the quality of writing.

DATA SETS
Sabates-Bellver J, Van der Flier LG, de Palo M, Cattaneo E et al. Transcriptome profile of human colorectal adenomas. Mol Cancer Res 2007 Dec;5(12):1263-75.
PMID: 18171984
   o Data: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE8671. The series matrix file at the bottom of the page contains the normalized data matrix, which has no missing values. Normalization was based on the MAS5 algorithm, which uses a modified version of global normalization. The data remain skewed, so log transformation is necessary for downstream analysis.
   o Array: Affymetrix Human Genome U133 Plus 2.0 Array (probes for >47,000 transcripts), http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL570
   o Samples: 32 colorectal adenomas vs 32 normal mucosa (paired)

Galamb O, Györffy B, Sipos F, Spisák S et al. Inflammation, adenoma and cancer: objective classification of colon biopsy specimens with gene expression signature. Dis Markers 2008;25(1):1-16. PMID: 18776587
   o Data: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE4183. The series matrix file at the bottom of the page contains the normalized data matrix, which has no missing values. Normalization was based on the MAS5 algorithm, which uses a modified version of global normalization. The data remain skewed, so log transformation is necessary for downstream analysis.
   o Array: Affymetrix Human Genome U133 Plus 2.0 Array (probes for >47,000 transcripts), http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL570
   o Samples: there are 53 samples in this study, but we will use only the data from the 8 colon_normal samples and the 15 colon_adenoma samples (note that the normal samples were collected from healthy individuals and are not paired with the adenoma samples).

PROCEDURE

I. Data downloading, preprocessing, and differential expression analysis for GSE8671

1. What do these commands do?
   $ cd ~
   $ mkdir project1
   $ cd project1
2. Download the dataset from GEO using the command wget
   $ wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE8nnn/GSE8671/matrix/GSE8671_series_matrix.txt.gz
   $ ls
3. Unzip the file using the command gunzip
   $ gunzip GSE8671_series_matrix.txt.gz
   $ ls
4. What does this command do? Why is this step included in data preprocessing?
   $ tail -n +71 GSE8671_series_matrix.txt | head -n -1 > GSE8671_exp.txt
   $ ls
5. The file GSE8671_exp.txt will be used as an input file for the read.table function in R (see below). When an input file contains the column names in its first line, that line is called the header. For read.table to use the first column as row names, the header must have one fewer field than the number of columns in the file. To meet this criterion, use nano or another text editor to remove the first field in the first row of the file GSE8671_exp.txt (both the text “ID_REF” and the tab character right after it), save the change, and double check that it was done correctly. You may use the head command to check. The first field of the first row should be “GSM215051” before you start the R analysis.
   $ head GSE8671_exp.txt
6. Use R to perform differential expression analysis. R code for the data analysis is provided below. Complete the “Task” column to help you better understand the R code. You may use the question mark in R to get information about functions you are not familiar with (e.g. ?t.test). The lines in grey are not required to generate the results, but they can help you understand the data objects and the functions.
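The header edit in step 5 can also be done without a text editor. The sketch below is one possible approach, not part of the original procedure. It builds a small stand-in file (toy_exp.txt, a hypothetical name, with made-up values and only two samples) and writes a copy with the first header field dropped, using head, cut, and tail. It assumes the file is tab-delimited, as the sep="\t" argument in the R code implies.

```shell
# Toy stand-in for GSE8671_exp.txt (hypothetical values; the real file has 64 sample columns).
printf '"ID_REF"\t"GSM215051"\t"GSM215052"\n' > toy_exp.txt
printf '"1007_s_at"\t10.5\t20.1\n' >> toy_exp.txt
printf '"1053_at"\t3.2\t4.8\n' >> toy_exp.txt

# Line 1: keep fields 2 onward (cut splits on tabs by default).
# Lines 2 to end: copy unchanged.
{
  head -n 1 toy_exp.txt | cut -f 2-
  tail -n +2 toy_exp.txt
} > toy_exp_fixed.txt

# Check: the header should now start with the first sample ID and have
# one fewer field than the data rows.
head -n 3 toy_exp_fixed.txt
```

To try the same idea on the real data, replace toy_exp.txt with GSE8671_exp.txt and write to a new file name rather than overwriting the input, then confirm the result with head as in step 5.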
Task    R code
        data<-read.table("GSE8671_exp.txt",header=TRUE,sep="\t")
        dim(data)
        class(data)
        data0<-as.matrix(data)
        class(data0)
        data1<-log2(data0)
        data0[1:3,1:3]
        data1[1:3,1:3]
        summary(data0)
        summary(data1)
        data1["1438_at",]
        gene1_normal<-data1[1,1:32]
        gene1_adenoma<-data1[1,33:64]
        mean(gene1_normal)
        mean(gene1_adenoma)
        gene1_log_ratio<-mean(gene1_adenoma)-mean(gene1_normal)
        gene1_log_ratio
        ?t.test
        t.test(x=gene1_adenoma,y=gene1_normal)
        t.test(x=gene1_adenoma,y=gene1_normal,paired=TRUE)
        t.test(x=gene1_adenoma,y=gene1_normal,paired=TRUE)$statistic
        t.test(x=gene1_adenoma,y=gene1_normal,paired=TRUE)$p.value
        log_ratios <- apply(data1, 1, function(gene) {mean(gene[33:64])-mean(gene[1:32])})
        t_statistics <- apply(data1, 1, function(gene) {t.test(x = gene[33:64], y = gene[1:32], paired=TRUE)$statistic})
        p_values <- apply(data1, 1, function(gene) {t.test(x = gene[33:64], y = gene[1:32], paired=TRUE)$p.value})
        nrow(data1)
        adjp_values<-ifelse(p_values*nrow(data1)>1,1,p_values*nrow(data1))
        p_values[1:10]
        adjp_values[1:10]
        sum(adjp_values<0.01)
        sum(adjp_values<0.01 & log_ratios>1)
        sigup_list<-names(adjp_values)[adjp_values<0.01 & log_ratios>1]
        sigdown_list<-names(adjp_values)[adjp_values<0.01 & log_ratios<(-1)]
        results<-cbind(data1,log_ratios,t_statistics,p_values,adjp_values)
        write.table(results,"GSE8671_sigtest_results.txt",quote=F,sep="\t")
        write.table(sigup_list,"GSE8671_sigup.txt",quote=F,col.names=F,row.names=F)
        write.table(sigdown_list,"GSE8671_sigdown.txt",quote=F,col.names=F,row.names=F)
        q()

II. Data downloading, preprocessing, and differential expression analysis for GSE4183

Follow the procedure described above with the necessary modifications. Use only the 8 colon_normal samples and the 15 colon_adenoma samples. You can use either the cut command in Linux or matrix subsetting in R to extract this subset of the data. Compare the lists of differentially expressed genes (strictly speaking, probe sets) between the two studies.

III. Clustering analysis for GSE8671

R code for clustering analysis is provided below. Complete the “Task” column to help you better understand the R code. Based on the clustering analysis result, answer the question: is unsupervised clustering of gene expression data able to distinguish adenomas from normal controls?

Note:
1. The analysis requires two R packages that are not included in the R installation, “heatmap.plus” and “gplots”. Both packages are included in the CRAN package repository and need to be installed before data analysis. You only need to do this once.
2. There are more than 50,000 probe sets representing more than 47,000 transcripts in the data set. Transcripts that are not expressed or that vary little across the samples are poor candidates for distinguishing the samples. Therefore, the code includes two filtering steps to remove probe sets representing such transcripts. First, the mean expression level of each probe set is calculated, and the 50% of probe sets with the lowest mean expression values are removed to discard transcripts that are lowly expressed or not expressed in these samples. Second, the variance of each remaining probe set is calculated, and only the top 1% with the largest variance are selected for the clustering analysis.
3. Clustering analysis does not use sample annotations. However, sample annotations can be color coded in the heat map for visualization purposes.
Task    R code
        install.packages("heatmap.plus")
        install.packages("gplots")
        data<-read.table("GSE8671_exp.txt",header=TRUE,sep="\t")
        data0<-as.matrix(data)
        data1<-log2(data0)
        means<-rowMeans(data1)
        mean.cutoff<-quantile(means,0.5)
        data2<-data1[means>mean.cutoff,]
        vars<-apply(data2,1,var)
        var.cutoff<-quantile(vars,0.99)
        data3<-data2[vars>var.cutoff,]
        dim(data1)
        dim(data2)
        dim(data3)
        hc<-hclust(as.dist(1-cor(data3)),"average")
        rhc<-hclust(as.dist(1-cor(t(data3))),"average")
        hc
        rhc
        ann<-matrix(c(rep("black",32), rep("red",32)), nrow = 64, ncol = 2)
        ann
        library("gplots")
        library("heatmap.plus")
        pdf("gse8671_clustering.pdf", width=10, height=15)
        heatmap.plus(data3,Rowv=as.dendrogram(rhc),Colv=as.dendrogram(hc),ColSideColors=ann,cexRow=0.5,cexCol=0.5,col=greenred(256))
        dev.off()
        q()

IV. Clustering analysis for GSE4183

Follow the procedure described above with the necessary modifications. Use only the 8 colon_normal samples and the 15 colon_adenoma samples. Based on the analysis of this data set, is unsupervised clustering of gene expression data able to distinguish adenomas from normal controls?

V. Pathway and Gene Ontology (GO) analysis

Perform pathway and GO analysis for GSE8671. You are encouraged to perform pathway and GO analysis for GSE4183 as well, but this is NOT required for the final report.

1. WebGestalt analysis
   A. WebGestalt is available at http://www.webgestalt.org
   B. The WebGestalt manual is available from the website and can also be found in the reading folder in Google Drive.
   C. The significant up and down probe set lists can each be used as input to a separate WebGestalt analysis.
   D. Both GSE8671 and GSE4183 were generated using the Affymetrix Human Genome U133 Plus 2.0 Array (hsapiens_affy_hg_u133_plus_2).
   E. Follow the manual to perform the following analyses for each probe set list:
      a. GO analysis (including biological process, molecular function, and cellular component)
      b. KEGG analysis
      c. Transcription factor analysis (optional)
      d. Protein interaction network module analysis (optional)
   F. Report significant GO biological processes, GO molecular functions, GO cellular components, and KEGG pathways (FDR, i.e. adjusted p value, < 0.05). If there are more than 10 significant identifications in any of the four categories, you may report only the top 10.

2. GSEA analysis
   A. Download the GSEA software from http://www.broadinstitute.org/gsea/index.jsp. The easiest option is to launch the javaGSEA Desktop Application.
   B. An online tutorial for GSEA is available at http://www.broadinstitute.org/gsea/doc/desktop_tutorial.jsp. The GSEA User Guide is available as a pdf file in the reading folder in Google Drive.
   C. Prepare the gene expression data in the .gct format and the phenotype data in the .cls format.
      a. For GSE8671, if you followed the code provided, the log-transformed expression data can be found in the GSE8671_sigtest_results.txt file. Note that the header row has to be shifted right by one field because the column name for the first column is not included in the file. You also have to remove the last four columns, which contain the statistical results.
      b. Information about the GSEA input file formats can be found at http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats
      c. You may use Excel or a text editor to prepare the expression data in the .gct format.
      d. Phenotype information for the samples (Normal or Adenoma) can be found in the GEO pages of the two studies (see the DATA SETS section). You may use Excel or a text editor to prepare the .cls files.
   D. Follow the tutorial to perform GSEA analysis based on the following gene set databases available in GSEA:
      a. c5.bp.v5.0.symbols.gmt [Gene ontology] (GO biological process)
      b. c5.cc.v5.0.symbols.gmt [Gene ontology] (GO cellular component)
      c. c5.mf.v5.0.symbols.gmt [Gene ontology] (GO molecular function)
      d. c2.cp.kegg.v.0.symbols.gmt [Curated] (KEGG pathway)
   E. Report significant GO biological processes, GO molecular functions, GO cellular components, and KEGG pathways (FDR q-value < 0.05). If there are more than 10 significant identifications in any of the four categories, you may report only the top 10.
   F. Compare the results from WebGestalt and GSEA. Computationally, what is the fundamental difference between the two tools? Do you get comparable results from these tools?
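The file preparation described in step 2.C can also be scripted instead of done in Excel. The following is a minimal sketch under stated assumptions, not the required method. The file toy_results.txt is a hypothetical stand-in for GSE8671_sigtest_results.txt with four samples and made-up values, the first two samples are assumed to be normal and the last two adenoma, and the Description column is filled with the placeholder "na". The layout follows the .gct and .cls formats described on the GSEA Data formats page.

```shell
# Toy stand-in for GSE8671_sigtest_results.txt (hypothetical values).
# As written by write.table, the header has one fewer field than the data rows.
printf 'GSM1\tGSM2\tGSM3\tGSM4\tlog_ratios\tt_statistics\tp_values\tadjp_values\n' > toy_results.txt
printf 'probe_a\t5.1\t5.3\t7.9\t8.2\t2.85\t12.1\t0.001\t0.002\n' >> toy_results.txt
printf 'probe_b\t6.0\t6.2\t6.1\t5.9\t-0.1\t-0.7\t0.55\t1\n' >> toy_results.txt

n_stats=4   # trailing statistics columns to drop
n_fields=$(awk -F'\t' 'NR == 1 { print NF; exit }' toy_results.txt)
n_samples=$(( n_fields - n_stats ))
n_probes=$(( $(wc -l < toy_results.txt) - 1 ))

# .gct: version line, dimensions line, shifted header, then data rows
# with a Description column inserted and the statistics columns removed.
{
  printf '#1.2\n'
  printf '%s\t%s\n' "$n_probes" "$n_samples"
  awk -F'\t' -v n="$n_samples" '
    NR == 1 {
      printf "NAME\tDescription"
      for (i = 1; i <= n; i++) printf "\t%s", $i
      printf "\n"
      next
    }
    {
      printf "%s\tna", $1
      for (i = 2; i <= n + 1; i++) printf "\t%s", $i
      printf "\n"
    }
  ' toy_results.txt
} > toy.gct

# .cls: sample count, class count, and the constant 1; then the class names;
# then one label per sample in column order (assumed order, see lead-in).
{
  printf '%s 2 1\n' "$n_samples"
  printf '# Normal Adenoma\n'
  printf 'Normal Normal Adenoma Adenoma\n'
} > toy.cls

cat toy.gct
cat toy.cls
```

For the real data, the label line of the .cls file must list one label per sample in the same order as the columns of the .gct file, so check the sample order against the GEO pages before running GSEA.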