Yang Li, Lin Liu
Feb 3 & Feb 4, 2014
STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology

Microarray Analysis

In simple words, this paper’s objective is to compare the gene expression profiles of Drosophila females in different tissues, with and without a mutation in the p24 gene logjam (loj).

sample name   tissue   experiment
GSM277408     abd      control
GSM277409     abd      control
GSM277410     abd      control
GSM277411     ht       control
GSM277412     ht       control
GSM277413     ht       control
GSM277414     abd      treatment
GSM277415     abd      treatment
GSM277416     abd      treatment
GSM277417     ht       treatment
GSM277418     ht       treatment
GSM277419     ht       treatment

Exercise 3 – Microarray Analysis
(a) Read the raw microarray data and visualize it.
(b) Normalize the microarray data using RMA.
(c) Normalize the microarray data using GCRMA.
(d) Find DE genes.

Microarray data analysis

Affymetrix data
• Each gene (or portion of a gene) is represented by 11 to 20 oligonucleotides of 25 base-pairs.
• Probe: an oligonucleotide of 25 base-pairs, i.e., a 25-mer.

Affymetrix data
• Perfect match (PM): a 25-mer complementary to a reference sequence of interest (e.g., part of a gene).
• Mismatch (MM): same as PM but with a single homomeric base change at the middle (13th) base (transversion purine <-> pyrimidine, G <-> C, A <-> T).
  – The purpose of the MM probe design is to measure non-specific binding and background noise.
• Probe-pair: a (PM, MM) pair.
• Probe-pair set: a collection of probe-pairs (11 to 20) related to a common gene or fraction of a gene.
• Affy ID: an identifier for a probe-pair set.
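The PM/MM definition above can be illustrated with a small base-R sketch. It is not part of the original slides: the function name make_mm and the example sequence are hypothetical, and the code simply applies the stated rule (swap the 13th base, G <-> C and A <-> T) to a 25-mer.

```r
# Sketch only: derive the mismatch (MM) probe from a perfect-match (PM) 25-mer
# by swapping the middle (13th) base: A <-> T, G <-> C.
# make_mm and the sequence below are made-up illustrations, not real probe data.
make_mm <- function(pm) {
  stopifnot(nchar(pm) == 25)                  # Affymetrix probes are 25-mers
  middle <- substr(pm, 13, 13)                # extract the 13th base
  substr(pm, 13, 13) <- chartr("ACGT", "TGCA", middle)  # homomeric swap
  pm
}

pm <- "ATGGCTAGCTAGGATCCGATCGTAA"             # hypothetical PM probe
mm <- make_mm(pm)
print(c(PM = pm, MM = mm))
```

Applying make_mm twice returns the original PM sequence, which is a quick sanity check that only the middle base changes.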
Affymetrix Data Flow
[Figure: flow diagram] Hybridized GeneChip → Scan chip → DAT file → Process image → CEL file; CEL file + CDF file → MAS4, MAS5, RMA, Quantile → High-level analysis

Microarray analysis
• Download the data set (GEO accession):
  – GSE10940
• The R script has to be in the same directory as the .CEL files
• The data set contains 12 .CEL files
• bsub -Is -n 1 -q interact R
  – library(affy)
  – data.affy=ReadAffy()
• What is the name of the CDF file?
• How many genes are considered on the arrays?
• What is the annotation version?

Looking at RAW data: low-level analysis
• MA plot
    MAplot(data.affy, pairs = TRUE, which=c(1,2,3,4), plot.method = "smoothScatter")
• Image of an array
    image(data.affy)
• Density of the log intensities of the arrays
    hist(data.affy)
• Boxplot of the data
    boxplot(data.affy, col=seq(2,7,by=1))

Normalization
    data.rma=rma(data.affy)
• Install the package affyPLM (along with its dependencies) to view the MA plot after normalization
    MAplot(data.rma, pairs = TRUE, which=c(1,2,3,4), plot.method = "smoothScatter")
    expr.rma=exprs(data.rma) # Puts the expression values in a matrix (probe sets x samples)
    boxplot(data.frame(expr.rma), col=seq(2,7,by=1))

Before moving forward… affy probeset names
• rownames(expr.rma)[1:100]
• Suffixes are meaningful, for example:
  • _at: hybridizes to a unique antisense transcript for this chip
  • _s_at: all probes cross-hybridize to a specified set of sequences
  • _a_at: all probes cross-hybridize to a specified gene family
  • _x_at: at least some probes cross-hybridize with other target sequences for this chip
  • _r_at: rules dropped
  • and many more…

Custom CDF files
• The most popular platform for genome-wide expression profiling is the
Affymetrix GeneChip.
• However, its selection of probes relied on earlier genome and transcriptome annotation, which is significantly different from current knowledge.
• The resulting informatics problems have a profound impact on the analysis and interpretation of the data.

Custom CDF files
• One solution: Dai, M. et al. (2005)
• They reorganized probes on more than a dozen popular 3′ GeneChips
• Comparing analysis results between the original and the redefined probe sets
  – Reveals ~30–50% discrepancy in the genes previously identified as differentially expressed, regardless of analysis method.

Custom CDF files
• Go to:
  – http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/13.0.0/refseq.asp
  – Download the Drosophila melanogaster RefSeq CDF annotation corresponding to the Affy array analyzed
  – Install and load it in R
    • R CMD INSTALL…
    data.affy@cdfName="drosophila2dmrefseqcdf"
    data.rma.refseq=rma(data.affy)
    expr.rma.refseq=exprs(data.rma.refseq)

High-level analysis
• Perform a comparison between the control group and the experimental group
  – Objective: obtain the most significant genes with an FDR of 5% and with a fold change of 1
  – Information provided in “SamplePhenotype.csv” to obtain the control and mutant ids
    sample.ids=read.csv("doe.csv",header=F)
    control=grep("Control",sample.ids[,2])
    mutants=grep("Logjam",sample.ids[,2])
  – Obtain just the RefSeq ids
    genes_t=matrix(rownames(expr.rma.refseq))
    genes.refseq=apply(genes_t,1,function(x) sub("_at","",x))

• Calculating the fold change for every gene (a difference of means on the log2 scale; computed on the RefSeq-level matrix so that row indices match genes.refseq)
  – foldchange=apply(expr.rma.refseq, 1, function(x) mean( x[mutants] ) - mean( x[control] ) )
• Perform a t-test and obtain the p-values
  – T.p.value=apply(expr.rma.refseq, 1, function(x) t.test( x[mutants], x[control], var.equal=TRUE
)$p.value )
• Calculating the FDR
  – fdr=p.adjust(T.p.value, method="fdr")
• THE GENES
  – genes.up=genes.refseq[ which( fdr < 0.05 & foldchange > 0 ) ]
  – genes.down=genes.refseq[ which( fdr < 0.05 & foldchange < 0 ) ]

Differential gene expression
• Apart from the t-test, you can use the limma and samr packages
    library(limma)
    design <- c(0,0,0,0,0,0,1,1,1,1,1,1)
    design.mat <- model.matrix(~design)
    fit <- lmFit(expr.rma.refseq, design.mat)
    fit <- eBayes(fit)
    topTable(fit, coef = "design")

Check out samr on your own
• library(samr)
• …

Results
• Provide a .csv file with the list of significant genes with an FDR of 5% and with a fold change of 1
• Provide a heatmap with the significant genes
  – genes.ids=c(which( fdr < 0.05 & foldchange > 0 ),which( fdr < 0.05 & foldchange < 0 ))
  – colnames(expr.rma.refseq)=c(rep("Control",6),rep("Mutant",6))
  – heatmap(expr.rma.refseq[genes.ids,],margins=c(5,10))

Time-series of gene expression
• Two different algorithms with a similar aim are used together to get more robust results
• Here: COSOPT + Fisher’s G-test

Fisher’s G-Test
• Not really the likelihood-ratio G-test, but rather “Fisher's exact test of Gaussian white noise in a time series”
• Fisher, R.A. (1929) "Tests of significance in harmonic analysis" Proceedings of the Royal Society of London. Series A, Volume 125, Issue 796, pp.
54-59

Fisher’s G-Test Implementation
    library(GeneCycle)
    pval.fisher <- fisher.g.test(t(exp.RMA))
    fdr.pval.fisher <- fdrtool(pval.fisher, statistic="pvalue")
    qval.fisher <- fdr.pval.fisher$qval

Instead of COSOPT, use JTK Cycle in the homework
• Acknowledgement: Professor John B. Hogenesch, UPenn
• http://bioinf.itmat.upenn.edu/hogeneschlab/, which could be a good resource for circadian clock expression data for your future project
• JTK Cycle is much faster and more accurate than COSOPT

How to implement JTK Cycle
• Formatting your data: in our case, ignore this step and just use the output of RMA/GCRMA
• Change the periods information yourself:
  – periods <- 1:48
• Run JTK Cycle: this could take some time, but don’t be alarmed. The running time increases only linearly with the sample size and the number of time points

How to implement JTK Cycle
• We also provide the documentation for the JTK Cycle software in /n/stat115/hws/2/jtk_cycle/
• It is a very short document; use JTK Cycle as a black box
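
To make the Fisher’s G-test slides above more concrete, here is a minimal base-R sketch of the statistic and its exact p-value for a single time series. It is an illustration under stated assumptions, not the GeneCycle implementation: the function name fisher_g and the simulated series are hypothetical, and details such as the handling of the periodogram may differ from fisher.g.test. The statistic g is the largest periodogram ordinate divided by the sum of all ordinates, and the p-value uses Fisher's alternating-sum formula.

```r
# Sketch only: Fisher's g statistic and exact p-value for one time series.
# fisher_g is a hypothetical helper, not a function from GeneCycle.
fisher_g <- function(x) {
  n <- length(x)
  stopifnot(n >= 5)
  m <- floor((n - 1) / 2)                       # number of Fourier frequencies used
  # Periodogram ordinates at frequencies 1..m (mean removed, Nyquist excluded)
  pgram <- Mod(fft(x - mean(x)))[2:(m + 1)]^2
  g <- max(pgram) / sum(pgram)                  # share of variance in the top peak
  # Exact null distribution under Gaussian white noise (Fisher, 1929)
  j <- seq_len(floor(1 / g))
  pval <- sum((-1)^(j - 1) * choose(m, j) * (1 - j * g)^(m - 1))
  c(g = g, p.value = min(max(pval, 0), 1))      # clamp against rounding error
}

# Simulated example (assumed data): 24 time points with a period of 12
time <- 1:24
periodic <- sin(2 * pi * time / 12)
set.seed(1)
noise <- rnorm(24)
print(rbind(periodic = fisher_g(periodic), noise = fisher_g(noise)))
```

A strongly periodic series gives g close to 1 and a very small p-value. The alternating sum can lose precision for long series, which is one reason to prefer the packaged implementation in practice.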