Project 1. Gene expression data analysis and interpretation
PROJECT OBJECTIVES
1. Gain basic skills of data preprocessing in Linux.
2. Learn to perform differential expression analysis using t-test in R.
3. Learn to perform clustering analysis in R.
4. Learn to perform pathway and network-level interpretation of the differential
expression analysis results.
5. Answer the following biological questions:
a. Which genes are differentially expressed between human colorectal
adenomas and normal controls? Are these differentially expressed genes
reproducible between studies?
b. Is unsupervised clustering able to distinguish adenomas from normal
controls?
c. Which Gene Ontology biological processes, molecular functions, and
cellular components, and KEGG pathways are enriched among the
differentially expressed genes? Are these pathways consistent between
studies?
d. How do the differentially expressed genes interact with each other? Are they
enriched in certain regions of the human protein-protein interaction
network?
FINAL REPORT
The final report should include a brief introduction section, a methods section, and a
results section. The introduction section should demonstrate a basic understanding of the
biological questions to be pursued. The methods section should demonstrate a good
understanding of the bioinformatics tools and methods used, and the method description
should allow others to reproduce your results. The results section should provide answers
to the biological questions listed above. Figures, tables, and references can be
included as necessary but must fit within the 5-page limit. You don’t have to use all 5
pages, but content beyond the page limit will be ignored. Evaluation will be based
primarily on the completeness and soundness of the methods and results and on the quality of
writing.
DATA SETS
• Sabates-Bellver J, Van der Flier LG, de Palo M, Cattaneo E et al. Transcriptome
profile of human colorectal adenomas. Mol Cancer Res 2007 Dec;5(12):1263-75.
PMID: 18171984
o Data: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE8671. The
series matrix file at the bottom of the page includes the normalized data matrix,
and there are no missing values in the data matrix. Normalization was based on
the MAS5 algorithm, which uses a modified version of global normalization. The
data remain skewed, and log transformation is necessary for downstream
analysis.
o Array: Affymetrix Human Genome U133 Plus 2.0 Array (probes for >47,000
transcripts), http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL570
o Samples: 32 colorectal adenomas vs 32 normal mucosa (paired)
• Galamb O, Györffy B, Sipos F, Spisák S et al. Inflammation, adenoma and cancer:
objective classification of colon biopsy specimens with gene expression
signature. Dis Markers 2008;25(1):1-16. PMID: 18776587
o Data: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE4183. The series
matrix file at the bottom of the page includes the normalized data matrix, and
there are no missing values in the data matrix. Normalization was based on the
MAS5 algorithm, which uses a modified version of global normalization. The
data remain skewed, and log transformation is necessary for downstream
analysis.
o Array: Affymetrix Human Genome U133 Plus 2.0 Array (probes for >47,000
transcripts), http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL570
o Samples: there are 53 samples in this study, but we will only use data from the
8 colon_normal samples and the 15 colon_adenoma samples (note that the normal
samples were collected from healthy individuals and are not paired with the adenoma
samples).

PROCEDURE
I. Data downloading, preprocessing, and differential expression analysis for
GSE8671
1. What do these commands do?
$ cd ~
$ mkdir project1
$ cd project1
2. Download the dataset from GEO using the command wget
$ wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE8nnn/GSE8671/matrix/GSE8671_series_matrix.txt.gz
$ ls
3. Unzip the file using the command gunzip
$ gunzip GSE8671_series_matrix.txt.gz
$ ls
4. What does this command do? Why is this step included in data preprocessing?
$ tail -n +71 GSE8671_series_matrix.txt | head -n -1 >GSE8671_exp.txt
$ ls
5. The file GSE8671_exp.txt will be used as the input file for the read.table function in R
(see below). When an input file contains the names of the columns in its first line, the
first line is called the header. If the header has one fewer field than the data rows,
read.table uses the first column of the file as row names, which is what we want for
the probe set IDs. To meet this criterion, use
nano or another text editor to remove the first field in the first row of the file
GSE8671_exp.txt (both the text “ID_REF” and the tab right after the text), save the
change, and double check that it was done correctly. You may use the head command
to check. The first field of the first row should be “GSM215051” before you start the
R analysis.
$ head GSE8671_exp.txt
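The header rule in step 5 can be checked on a toy file before touching the real data. The sketch below is only an illustration: the probe IDs, sample names, and values are made up, and the files are written to a temporary location. It also shows one possible alternative, keeping the ID column name in the header and passing row.names = 1, which should give the same result.

```r
# Toy tab-separated file whose header has one fewer field than the data rows.
# All IDs and values are hypothetical and used for illustration only.
tmp <- tempfile(fileext = ".txt")
writeLines(c("S1\tS2\tS3",
             "probe_a\t10\t20\t30",
             "probe_b\t40\t50\t60"), tmp)

toy <- read.table(tmp, header = TRUE, sep = "\t")
dim(toy)        # the first field of each data row became the row name
rownames(toy)
colnames(toy)

# Alternative: keep the ID column name and tell read.table which column
# holds the row names.
tmp2 <- tempfile(fileext = ".txt")
writeLines(c("ID_REF\tS1\tS2\tS3",
             "probe_a\t10\t20\t30",
             "probe_b\t40\t50\t60"), tmp2)
toy2 <- read.table(tmp2, header = TRUE, sep = "\t", row.names = 1)

# Both approaches should yield the same matrix of values
all.equal(as.matrix(toy), as.matrix(toy2))
```

Either route works for this project; the manual edit in step 5 is the one the rest of the handout assumes.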
6. Use R to perform differential expression analysis.
R code for data analysis is provided below. Complete the “Task” column to help you
better understand the R code. You may use the question mark in R to get information about
functions you are not familiar with (e.g. ?t.test). The lines in grey are not required to
generate the results, but they can help you understand the data objects and the functions.
Task
R code
data<-read.table("GSE8671_exp.txt",header=TRUE,sep="\t")
dim(data)
class(data)
data0<-as.matrix(data)
class(data0)
data1<-log2(data0)
data0[1:3,1:3]
data1[1:3,1:3]
summary(data0)
summary(data1)
data1["1438_at",]
gene1_normal<-data1[1,1:32]
gene1_adenoma<-data1[1,33:64]
mean(gene1_normal)
mean(gene1_adenoma)
gene1_log_ratio<-mean(gene1_adenoma)-mean(gene1_normal)
gene1_log_ratio
?t.test
t.test(x=gene1_adenoma,y=gene1_normal)
t.test(x=gene1_adenoma,y=gene1_normal,paired=TRUE)
t.test(x=gene1_adenoma,y=gene1_normal,paired=TRUE)$statistic
t.test(x=gene1_adenoma,y=gene1_normal,paired=TRUE)$p.value
log_ratios <- apply(data1, 1, function(gene) {mean(gene[33:64]) - mean(gene[1:32])})
t_statistics <- apply(data1, 1, function(gene) {t.test(x = gene[33:64], y = gene[1:32], paired=TRUE)$statistic})
p_values <- apply(data1, 1, function(gene) {t.test(x = gene[33:64], y = gene[1:32], paired=TRUE)$p.value})
nrow(data1)
adjp_values<-ifelse(p_values*nrow(data1)>1,1,p_values*nrow(data1))
p_values[1:10]
adjp_values[1:10]
sum(adjp_values<0.01)
sum(adjp_values<0.01 & log_ratios>1)
sigup_list<-names(adjp_values)[adjp_values<0.01 & log_ratios>1]
sigdown_list<-names(adjp_values)[adjp_values<0.01 & log_ratios<(-1)]
results<-cbind(data1,log_ratios,t_statistics,p_values,adjp_values)
write.table(results,"GSE8671_sigtest_results.txt",quote=F,sep="\t")
write.table(sigup_list,"GSE8671_sigup.txt",quote=F,col.names=F,row.names=F)
write.table(sigdown_list,"GSE8671_sigdown.txt",quote=F,col.names=F,row.names=F)
q()
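Two steps in the code above can be sanity-checked on simulated data. The sketch below is a hedged illustration, not part of the required analysis: the matrix is random, and the probe names and the 4 + 4 sample layout are assumptions chosen to keep it small. It checks that a paired t-test matches a one-sample t-test on the per-pair differences, and that the manual adjustment used for adjp_values matches R's built-in Bonferroni correction.

```r
set.seed(1)
# Simulated log2 data: 5 hypothetical probe sets; columns 1-4 play the role
# of normal samples and columns 5-8 the paired adenoma samples.
toy <- matrix(rnorm(40, mean = 8), nrow = 5,
              dimnames = list(paste0("probe", 1:5), NULL))
normal_cols  <- 1:4
adenoma_cols <- 5:8

# Paired t-test per probe set, as in the handout code
p_paired <- apply(toy, 1, function(g)
  t.test(x = g[adenoma_cols], y = g[normal_cols], paired = TRUE)$p.value)

# One-sample t-test on the within-pair differences
p_diff <- apply(toy, 1, function(g)
  t.test(g[adenoma_cols] - g[normal_cols], mu = 0)$p.value)

all.equal(p_paired, p_diff)

# Manual adjustment from the handout versus p.adjust
bonf_manual  <- ifelse(p_paired * nrow(toy) > 1, 1, p_paired * nrow(toy))
bonf_builtin <- p.adjust(p_paired, method = "bonferroni")
all.equal(bonf_manual, bonf_builtin)
```

The same comparison can be run on the real p_values vector if you want to confirm what the ifelse line is doing.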
II. Data downloading, preprocessing, and differential expression analysis for
GSE4183
• Follow the procedure described above with necessary modifications.
• Only use the 8 colon_normal samples and the 15 colon_adenoma samples. You
can use either the cut command in Linux or the matrix subsetting method in R to
get the subset of data.
• Compare the lists of differentially expressed genes (actually probe sets) between
these two studies.
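The subsetting and list comparison described above can be sketched in R as follows. This is a minimal illustration under stated assumptions: the sample names (N1, A1, X1, ...) and probe set IDs are placeholders, not the real GSM or Affymetrix identifiers, and the two "significant" lists are invented.

```r
# Hypothetical expression matrix: 4 probe sets x 6 samples, of which only
# the N* (normal) and A* (adenoma) columns are wanted.
toy <- matrix(1:24, nrow = 4,
              dimnames = list(paste0("probe", 1:4),
                              c("N1", "N2", "A1", "A2", "X1", "X2")))
normal_ids  <- c("N1", "N2")
adenoma_ids <- c("A1", "A2")

# Matrix subsetting by column name; normals first, then adenomas
subset_data <- toy[, c(normal_ids, adenoma_ids)]
dim(subset_data)

# Hypothetical lists of significantly up-regulated probe sets in two studies
sigup_study1 <- c("probe1", "probe2", "probe3")
sigup_study2 <- c("probe2", "probe3", "probe4")

shared      <- intersect(sigup_study1, sigup_study2)
only_study1 <- setdiff(sigup_study1, sigup_study2)
only_study2 <- setdiff(sigup_study2, sigup_study1)

# Fraction of the combined list found in both studies (Jaccard index)
overlap_fraction <- length(shared) / length(union(sigup_study1, sigup_study2))
overlap_fraction
```

With the real data, the lists could be loaded with readLines("GSE8671_sigup.txt") and the corresponding file you write for GSE4183 (that second file name is up to you). Because the unpaired design changes the test, remember that paired=TRUE no longer applies to GSE4183.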
III. Clustering analysis for GSE8671
R code for clustering analysis is provided below. Complete the “Task” column to help
you better understand the R code. Based on the clustering analysis result, answer the
question: is unsupervised clustering of gene expression data able to distinguish
adenomas from normal controls?
Note:
1. The analysis will require two R packages that are not included in the R installation,
“heatmap.plus” and “gplots”. These two packages are both included in the CRAN
package repository and will need to be installed before data analysis. You only need to do
this once.
2. There are more than 50,000 probe sets representing more than 47,000 transcripts in the
data set. Transcripts that are not expressed or have low variation across the samples are
poor candidates for distinguishing the samples. Therefore, two filtering steps are included
in this code to filter out probe sets representing such transcripts. First, the mean expression
level of each probe set is calculated, and the 50% of probe sets
with the lowest mean expression values are removed to get rid of transcripts that are
lowly expressed or not expressed in these samples. Second, the variance of each remaining
probe set is calculated, and only the top 1% with the largest variance are
selected for the clustering analysis.
3. Clustering analysis does not use sample annotations. However, sample annotations can
be color coded in the heat map for visualization purposes.
Task
R code
install.packages("heatmap.plus")
install.packages("gplots")
data<-read.table("GSE8671_exp.txt",header=TRUE,sep="\t")
data0<-as.matrix(data)
data1<-log2(data0)
means<-rowMeans(data1)
mean.cutoff<-quantile(means,0.5)
data2<-data1[means>mean.cutoff,]
vars<-apply(data2,1,var)
var.cutoff<-quantile(vars,0.99)
data3<-data2[vars>var.cutoff,]
dim(data1)
dim(data2)
dim(data3)
hc<-hclust(as.dist(1-cor(data3)),"average")
rhc<-hclust(as.dist(1-cor(t(data3))),"average")
hc
rhc
ann<-matrix(c(rep("black",32), rep("red",32)), nrow = 64, ncol =2)
ann
library("gplots")
library("heatmap.plus")
pdf("gse8671_clustering.pdf", width=10, height=15)
heatmap.plus(data3,Rowv=as.dendrogram(rhc),Colv=as.dendrogram(hc),ColSideColors=ann,cexRow=0.5,cexCol=0.5,col=greenred(256))
dev.off()
q()
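Besides inspecting the heat map, one way to judge whether the sample dendrogram separates the two groups is to cut it into two clusters and cross-tabulate the clusters against the known labels. The sketch below does this on simulated data; the matrix, the size of the group effect, and the 4 + 4 sample layout are all assumptions made for illustration, so the table it prints says nothing about the real data sets.

```r
set.seed(2)
# Simulated log2 data: 200 hypothetical probe sets x 8 samples.
# Each probe set has its own baseline level; the last 4 samples
# ("adenoma") are shifted up for the first 100 probe sets.
n_probes <- 200
baseline <- rnorm(n_probes, mean = 8, sd = 2)
toy <- matrix(rnorm(n_probes * 8), nrow = n_probes) + baseline
toy[1:100, 5:8] <- toy[1:100, 5:8] + 4
colnames(toy) <- paste0("S", 1:8)
group <- rep(c("normal", "adenoma"), each = 4)

# Same distance and linkage as the handout: 1 - Pearson correlation, average
hc_toy <- hclust(as.dist(1 - cor(toy)), "average")

# Cut the sample dendrogram into two clusters and compare with the labels
clusters  <- cutree(hc_toy, k = 2)
agreement <- table(cluster = clusters, group = group)
agreement
```

For GSE8671 the same idea would be cutree(hc, k = 2) tabulated against rep(c("normal", "adenoma"), each = 32), assuming the column order used in the code above. A table with all counts on one diagonal indicates clean separation.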
IV. Clustering analysis for GSE4183
• Follow the procedure described above with necessary modifications.
• Only use the 8 colon_normal samples and the 15 colon_adenoma samples.
• Based on the analysis of this data set, is unsupervised clustering of gene
expression data able to distinguish adenomas from normal controls?
V. Pathway and Gene Ontology (GO) analysis
Perform Pathway and GO analysis for GSE8671. You are encouraged to perform
Pathway and GO analysis for GSE4183, but this is NOT required for the final report.
1. WebGestalt analysis
A. WebGestalt is available at http://www.webgestalt.org
B. WebGestalt manual is available from the website and can also be found in the
reading folder in Google Drive.
C. Significant up and down probe set lists can be used as input to WebGestalt
analysis, respectively.
D. Both GSE8671 and GSE4183 were generated using the Affymetrix Human
Genome U133 Plus 2.0 Array (hsapiens_affy_hg_u133_plus_2)
E. Follow the manual to perform the following analysis for each probe set list
a. GO analysis (including biological process, molecular function, and
cellular component)
b. KEGG analysis
c. Transcription Factor analysis (optional)
d. Protein interaction network module analysis (optional)
F. Report significant GO Biological processes, GO molecular functions, GO
cellular components, and KEGG pathways (FDR, i.e. adjusted p value < 0.05).
If there are more than 10 significant identifications in any of the four
categories, you may report only the top 10 for that category.
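To make the enrichment p-values and FDR values in step F less of a black box, the sketch below works through one over-representation test by hand. It assumes a hypergeometric model, which is a common choice for tools of this kind, and Benjamini-Hochberg adjustment; check the WebGestalt manual for the settings you actually used. All counts and the extra p-values are hypothetical.

```r
# Hypothetical counts for one category
N <- 20000  # genes in the reference set
K <- 200    # reference genes annotated to the category
n <- 500    # genes in the input list
k <- 15     # input genes annotated to the category

# Number of category genes expected in the list by chance
expected <- n * K / N
expected

# P(X >= k) under the hypergeometric distribution
p_enrich <- phyper(k - 1, K, N - K, n, lower.tail = FALSE)
p_enrich

# Cross-check with a one-sided Fisher's exact test on the 2x2 table
tab <- matrix(c(k, K - k, n - k, N - K - n + k), nrow = 2)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value
all.equal(p_enrich, p_fisher)

# Adjusting several category p-values together (the other three are made up)
p_raw <- c(p_enrich, 0.02, 0.2, 0.6)
p_bh  <- p.adjust(p_raw, method = "BH")
p_bh
```

Note that this test only sees which genes are in the list, not how strongly each one changed.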
2. GSEA analysis
A. Download the GSEA software from
http://www.broadinstitute.org/gsea/index.jsp. It is easiest to launch the
javaGSEA Desktop Application.
B. Online tutorial of GSEA is available at
http://www.broadinstitute.org/gsea/doc/desktop_tutorial.jsp. GSEA User
Guide is available as a pdf file in the reading folder in Google Drive
C. Prepare gene expression data in the .gct format and phenotype data in the
.cls format
a. For GSE8671, if you followed the code provided, log-transformed
expression data can be found in the GSE8671_sigtest_results.txt file.
Note that the header row has to be shifted right by one column because the column name for
the first column is not included in the file. You also have to remove the
last four columns of statistical results.
b. Information about the GSEA input file formats can be found at
http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats
c. You may use Excel or a text editor to prepare the expression data
in the .gct format
d. Phenotype information for the samples (Normal or Adenoma) can be
found in the GEO pages of the two studies (see the DATA SETS
section). You may use Excel or a text editor to prepare the .cls
files.
D. Follow the tutorial to perform GSEA analysis based on the following gene set
databases available in GSEA:
a. c5.bp.v5.0.symbols.gmt [Gene ontology] (GO biological process)
b. c5.cc.v5.0.symbols.gmt [Gene ontology] (GO cellular component)
c. c5.mf.v5.0.symbols.gmt [Gene ontology] (GO molecular function)
d. c2.cp.kegg.v5.0.symbols.gmt [Curated] (KEGG pathway)
E. Report significant GO Biological processes, GO molecular functions, GO
cellular components, and KEGG pathways (FDR q-value < 0.05). If there are
more than 10 significant identifications in any of the four categories, you
may report only the top 10 for that category.
F. Compare results from WebGestalt and GSEA. Computationally, what is the
fundamental difference between the two tools? Do you get comparable
results from these tools?
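As an alternative to preparing the GSEA input files of step 2C by hand, they can be written from R. The sketch below is a minimal example on toy data: the probe IDs, sample names, and values are hypothetical, "na" is used as a placeholder description, and the files go to a temporary location. The layout follows the GCT (version 1.2) and CLS formats as described on the GSEA data formats page; verify against that page before loading your files.

```r
# Hypothetical log2 expression matrix: 2 probe sets x 4 samples
expr <- matrix(c(7.1, 8.2, 6.5, 9.0,
                 5.5, 5.7, 8.8, 9.1), nrow = 2, byrow = TRUE,
               dimnames = list(c("probe1", "probe2"),
                               c("S1", "S2", "S3", "S4")))
labels <- c("Normal", "Normal", "Adenoma", "Adenoma")

# GCT: version line, dimensions line, header line, then one line per row
gct_lines <- c(
  "#1.2",
  paste(nrow(expr), ncol(expr), sep = "\t"),
  paste(c("NAME", "Description", colnames(expr)), collapse = "\t"),
  paste(rownames(expr), "na",
        apply(expr, 1, paste, collapse = "\t"), sep = "\t"))
gct_file <- tempfile(fileext = ".gct")
writeLines(gct_lines, gct_file)

# CLS: "<samples> <classes> 1", then class names, then one label per sample
cls_lines <- c(
  paste(length(labels), length(unique(labels)), 1),
  paste("#", paste(unique(labels), collapse = " ")),
  paste(labels, collapse = " "))
cls_file <- tempfile(fileext = ".cls")
writeLines(cls_lines, cls_file)

readLines(gct_file)
readLines(cls_file)
```

For GSE8671, expr would be the 64 expression columns of the log-transformed matrix (data1 in the code of section I), and labels would follow the same sample order.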