STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology

Yang Li
Lin Liu
Feb 3 & Feb 4, 2014
Microarray Analysis
In simple terms, the objective of this paper is to compare the gene
expression profiles of female Drosophila in different tissues, with
and without a mutation in the p24 gene logjam (loj).
sample name   tissue   experiment
GSM277408     abd      control
GSM277409     abd      control
GSM277410     abd      control
GSM277411     ht       control
GSM277412     ht       control
GSM277413     ht       control
GSM277414     abd      treatment
GSM277415     abd      treatment
GSM277416     abd      treatment
GSM277417     ht       treatment
GSM277418     ht       treatment
GSM277419     ht       treatment
Exercise 3 – Microarray Analysis
(a) Read the raw microarray data and visualize it.
(b) Normalize the microarray data using RMA
(c) Normalize the microarray data using GCRMA
(d) Find DE genes
Microarray data analysis
Affymetrix data
• Each gene (or portion of a gene) is represented by 11 to
20 oligonucleotides of 25 base-pairs.
• Probe: an oligonucleotide of 25 base-pairs, i.e., a 25mer.
Affymetrix data
• Perfect match (PM): a 25-mer complementary to a reference sequence of
interest (e.g., part of a gene).
• Mismatch (MM): same as PM but with a single homomeric base change for the
middle (13th) base (transversion purine <-> pyrimidine, G <-> C, A <-> T).
– The purpose of the MM probe design is to measure non-specific binding
and background noise.
• Probe-pair: a (PM, MM) pair.
• Probe-pair set: a collection of probe-pairs (11 to 20) related to a common gene
or fraction of a gene.
• Affy ID: an identifier for a probe-pair set.
Affymetrix Data Flow
[Flow diagram: hybridized GeneChip -> scan chip -> DAT file -> process image -> CEL file (with CDF file) -> MAS4 / MAS5 / RMA / quantile -> high-level analysis]
Microarray analysis
• Go to
and download the data set:
– GSE10940
• Run R from the directory that contains the .CEL files
(ReadAffy() reads the CEL files in the current working directory)
• The data set contains 12 .CEL files
• bsub -Is -n 1 -q interact R
– library(affy)
– data.affy=ReadAffy()
• What is the name of the CDF file?
• How many genes are considered on the arrays?
• What is the annotation version?
Looking at RAW data
Low-level analysis
• MA plot
– MAplot(data.affy, pairs = TRUE, which=c(1,2,3,4), plot.method = "smoothScatter")
• Image of an array
– image(data.affy)
• Density of the log intensities of the arrays
– hist(data.affy)
• Boxplot of the data
– boxplot(data.affy, col=seq(2,7,by=1))
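For intuition about what an MA plot displays, here is a minimal base-R sketch on simulated intensities. The two arrays and all their values are made up for illustration; the MAplot function from affy handles the array pairing and plotting details for you.

```r
# Simulated raw intensities for two arrays (hypothetical data, not GSE10940)
set.seed(1)
array1 <- 2^rnorm(1000, mean = 8, sd = 1.5)
array2 <- array1 * 2^rnorm(1000, mean = 0.3, sd = 0.4)  # array2 is systematically brighter

# M = log ratio between the arrays, A = average log intensity
M <- log2(array2) - log2(array1)
A <- (log2(array2) + log2(array1)) / 2

# For comparable arrays the cloud should be centred on M = 0;
# here it sits above zero because of the simulated intensity shift
plot(A, M, pch = ".", main = "MA plot (simulated)")
abline(h = 0, col = "red")
```

A cloud that is shifted or curved away from M = 0 is the kind of systematic difference that normalization is meant to remove.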
Normalization
– data.rma=rma(data.affy)
• Install the package affyPLM (along with its dependencies) to view the MA
plot after normalization
– library(affyPLM)
– MAplot(data.rma, pairs = TRUE, which=c(1,2,3,4), plot.method = "smoothScatter")
– expr.rma=exprs(data.rma) # matrix of expression values (probe sets x arrays)
– boxplot(data.frame(expr.rma), col=seq(2,7,by=1))
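RMA combines background correction, quantile normalization and a probe-set summarization step. The sketch below illustrates only the quantile normalization idea in base R on a toy matrix; the function name quantile_normalize and the numbers are made up, ties are handled naively, and this is not the code that rma() runs.

```r
# Quantile normalization of a probes-by-arrays matrix (illustrative only)
quantile_normalize <- function(x) {
  ranks  <- apply(x, 2, rank, ties.method = "first")  # rank of each value within its array
  sorted <- apply(x, 2, sort)                         # each column sorted ascending
  target <- rowMeans(sorted)                          # reference distribution: mean of the k-th smallest values
  normalized <- apply(ranks, 2, function(r) target[r])
  dimnames(normalized) <- dimnames(x)
  normalized
}

# Toy example: 3 arrays with different overall brightness (made-up numbers)
x  <- cbind(a = c(5, 2, 3, 4), b = c(4, 1, 4.5, 2), c = c(3, 4, 6, 8))
xn <- quantile_normalize(x)
xn
```

After this step every array has exactly the same set of values, which is why the boxplots of normalized arrays line up.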
Before moving forward…
affy probeset names
• rownames(expr.rma)[1:100]
• Suffixes are meaningful, for example:
• _at : hybridizes to unique antisense transcript for this
chip
• _s_at: all probes cross hybridize to a specified set of
sequences
• _a_at: all probes cross hybridize to a specified gene
family
• _x_at: at least some probes cross hybridize with other
target sequences for this chip
• _r_at: rules dropped
• and many more…
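To see how many probe sets fall into each suffix class, the suffix can be pulled out with a regular expression. This is a small sketch; the IDs below are hypothetical examples in the Affymetrix naming style, and in practice you would pass rownames(expr.rma) instead.

```r
# Hypothetical probe-set IDs in the style of an Affymetrix chip
ids <- c("1616608_a_at", "1622892_s_at", "1622893_at", "1622894_x_at")

# Pull out the suffix: an optional one-letter class followed by "_at" at the end
suffix <- sub("^.*?((_[a-z])?_at)$", "\\1", ids, perl = TRUE)

# Count probe sets per suffix class
table(suffix)
```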
Custom CDF files
• The most popular platform for genome-wide
expression profiling is the Affymetrix GeneChip.
• However, its selection of probes relied on earlier
genome and transcriptome annotation which is
significantly different from current knowledge.
• The resultant informatics problems have a
profound impact on the analysis and interpretation of the
data.
Custom CDF files
• One solution: Dai, M. et al. (2005)
• They reorganized probes on more than a
dozen popular 30 GeneChips
• Comparing analysis results between the
original and the redefined probe sets
– Reveals ~ 30–50% discrepancy in the genes
previously identified as differentially expressed,
regardless of analysis method.
Custom CDF files
• Go to:
– http://brainarray.mbni.med.umich.edu/Brainarray/
Database/CustomCDF/13.0.0/refseq.asp
– Download the Drosophila melanogaster RefSeq
CDF annotation corresponding to the Affy array
analyzed
– Install and load it in R
• R CMD INSTALL…
– data.affy@cdfName="drosophila2dmrefseqcdf"
– data.rma.refseq=rma(data.affy)
– expr.rma.refseq=exprs(data.rma.refseq)
High-level analysis
• Perform a comparison between the control
group and the experimental group
– Objective: Obtain the most significant genes
with an FDR of 5% and with a fold change of 1
– Information provided in “SamplePhenotype.csv”
to obtain controls and mutant ids
sample.ids=read.csv("doe.csv",header=F)
control=grep("Control",sample.ids[,2])
mutants=grep("Logjam",sample.ids[,2])
– Obtain just the RefSeq ids
genes.refseq=sub("_at$","",rownames(expr.rma.refseq)) # strip only the trailing "_at"
• Calculating the fold change for every gene (RMA values are on the log2 scale, so this
difference of means is a log2 fold change; a fold change above 1 corresponds to foldchange > 0)
– foldchange=apply(expr.rma.refseq, 1, function(x) mean(x[mutants]) - mean(x[control]))
• Perform a t-test and obtain the p-values
– T.p.value=apply(expr.rma.refseq, 1, function(x) t.test(x[mutants], x[control], var.equal=TRUE)$p.value)
• Calculating the FDR
– fdr=p.adjust(T.p.value, method="fdr")
• THE GENES
– genes.up=genes.refseq[which(fdr < 0.05 & foldchange > 0)]
– genes.down=genes.refseq[which(fdr < 0.05 & foldchange < 0)]
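The same fold change, t-test and FDR steps can be tried end to end without any CEL files. The sketch below runs them on a simulated log2 expression matrix; the gene names, the matrix size and the ten spiked-in genes are all made up for illustration.

```r
# Simulated log2 expression: 200 genes x 12 arrays (6 control, then 6 mutant)
set.seed(115)
expr <- matrix(rnorm(200 * 12, mean = 8, sd = 0.3), nrow = 200,
               dimnames = list(paste0("gene", 1:200), NULL))
control <- 1:6
mutants <- 7:12
expr[1:10, mutants] <- expr[1:10, mutants] + 3   # spike in 10 truly up-regulated genes

# Difference of log2 means = log2 fold change
log2fc <- rowMeans(expr[, mutants]) - rowMeans(expr[, control])

# Per-gene two-sample t-test, then Benjamini-Hochberg adjustment
# ("fdr" is an alias for "BH" in p.adjust)
pval <- apply(expr, 1, function(x) t.test(x[mutants], x[control], var.equal = TRUE)$p.value)
fdr  <- p.adjust(pval, method = "BH")

# Genes called up-regulated at an FDR of 5%
up <- names(which(fdr < 0.05 & log2fc > 0))
up
```

Because the truth is known in a simulation, this is a quick way to check that the indexing of control and mutant columns is doing what you expect before running it on the real data.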
Differential gene expression
• Apart from the t-test, you can use the limma and
samr packages
• library(limma)
• design <- c(0,0,0,0,0,0,1,1,1,1,1,1)
• design.mat <- model.matrix(~design)
• fit <- lmFit(expr.rma.refseq, design.mat)
• fit <- eBayes(fit)
• topTable(fit, coef = "design")
Check out samr on your own
• library(samr)
• …
Results
• Provide a .csv file with the list of significant
genes with an FDR of 5% and with a fold
change of 1
• Provide a heatmap with the significant genes
– genes.ids=c(which( fdr < 0.05 & foldchange > 0 ),which( fdr <
0.05 & foldchange <0 ))
– colnames(expr.rma.refseq)=c(rep("Control",6),rep("Mutant",6))
– heatmap(expr.rma.refseq[genes.ids,],margins=c(5,10))
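For the .csv deliverable, one possible layout is a table with one row per significant gene. The sketch below assumes hypothetical gene IDs, values and output path; in practice the columns would come from genes.refseq, foldchange and fdr.

```r
# Made-up results for five genes (in practice these come from the steps above)
results <- data.frame(
  refseq     = c("NM_0001", "NM_0002", "NM_0003", "NM_0004", "NM_0005"),  # hypothetical IDs
  foldchange = c(1.8, -0.2, -2.1, 0.4, 1.1),
  fdr        = c(0.001, 0.60, 0.01, 0.30, 0.04)
)

# Keep genes passing the FDR cutoff, most significant first
significant <- results[results$fdr < 0.05, ]
significant <- significant[order(significant$fdr), ]

# Write the table without R's row numbers
out_file <- file.path(tempdir(), "significant_genes.csv")  # hypothetical output path
write.csv(significant, out_file, row.names = FALSE)
```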
Time-series of gene expression
• Two different algorithms with a similar aim
will be used together to get a more robust
result
• Here: COSOPT + Fisher’s G-test
Fisher’s G-Test
• Not the likelihood-ratio G-test, but rather
“Fisher's exact test of Gaussian white noise in a time series”
• Fisher, R.A. (1929) "Tests of significance
in harmonic analysis" Proceedings of the
Royal Society of London. Series A,
Volume 125, Issue 796, pp. 54-59
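As a rough illustration of the test, Fisher's statistic is g = max_k I(w_k) / sum_k I(w_k), the share of the periodogram taken by its largest peak, and under Gaussian white noise P(g > x) = sum_{j=1}^{floor(1/x)} (-1)^(j-1) C(m,j) (1 - j x)^(m-1), where m is the number of Fourier frequencies. The base-R sketch below implements this for a single series; the function name fisher_g and the simulated series are made up, and fisher.g.test in GeneCycle may differ in details.

```r
# Fisher's g statistic and exact p-value for one time series (illustrative sketch)
fisher_g <- function(y) {
  n <- length(y)
  m <- floor((n - 1) / 2)                             # number of Fourier frequencies used
  pgram <- (Mod(fft(y - mean(y)))^2 / n)[2:(m + 1)]   # periodogram ordinates
  g <- max(pgram) / sum(pgram)                        # share of the largest peak
  j <- seq_len(min(floor(1 / g), m))
  p <- sum((-1)^(j - 1) * choose(m, j) * (1 - j * g)^(m - 1))
  list(g = g, p.value = min(max(p, 0), 1))            # clamp against rounding error
}

# A noisy rhythm with a period of 8 time points, sampled 24 times (simulated)
set.seed(1)
tp <- 1:24
rhythmic <- cos(2 * pi * tp / 8) + rnorm(24, sd = 0.1)
fisher_g(rhythmic)
```

A g close to 1 means a single frequency dominates the series, which is what a strongly periodic gene looks like.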
Fisher’s G-Test Implementation
• library(GeneCycle)
• pval.fisher <- fisher.g.test(t(exp.RMA))
• fdr.pval.fisher <- fdrtool(pval.fisher,
statistic="pvalue")
• qval.fisher <- fdr.pval.fisher$qval
Instead of COSOPT, use JTK Cycle in the homework
• Acknowledgement: Professor John B. Hogenesch, UPenn
• http://bioinf.itmat.upenn.edu/hogeneschlab/, which could be a good
resource for circadian clock expression data for your
future project
• JTK Cycle is much faster and more
accurate than COSOPT
How to implement JTK Cycle
• Formatting your data: in our case, ignore
this step, just use the output of
RMA/GCRMA
• Change periods information by yourself:
– periods <- 1:48
• Run JTK Cycle: this could take some
time, but don’t worry. The running
time increases only linearly with the number of
samples and the number of time points
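JTK Cycle is meant to be used as a black box here, but the underlying idea can be sketched: it is a rank-based method that compares each series with reference rhythms over a range of periods and phases. The base-R code below is only a toy version of that idea using Kendall's tau against cosine templates; the function scan_rhythm, the sampling scheme and the candidate periods are assumptions for illustration, and this is not the JTK Cycle software and does not compute its p-values.

```r
# Rank-based rhythm scan in the spirit of JTK (NOT the JTK Cycle code itself)
scan_rhythm <- function(y, times, periods) {
  best <- list(tau = -Inf, period = NA, phase = NA)
  for (per in periods) {
    for (phase in seq(0, per - 1, by = 1)) {
      template <- cos(2 * pi * (times - phase) / per)   # reference rhythm
      tau <- cor(y, template, method = "kendall")       # rank agreement with the template
      if (!is.na(tau) && tau > best$tau) {
        best <- list(tau = tau, period = per, phase = phase)
      }
    }
  }
  best
}

# Simulated series: 24 samples taken every 2 hours, true period 24 h, peak at 6 h
set.seed(2)
times <- seq(0, 46, by = 2)
y <- cos(2 * pi * (times - 6) / 24) + rnorm(24, sd = 0.2)
scan_rhythm(y, times, periods = c(12, 20, 24, 28))
```

The double loop over periods and phases also shows why the choice of periods matters for the running time.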
How to implement JTK Cycle
• We also provide the documents for JTK
Cycle software in
/n/stat115/hws/2/jtk_cycle/
• Very short document, use it as a black box