Implementation in R - Mechanism-Anchored Biomodule Derived from Gene Expression Profiling

Xinan Yang1, Yves A Lussier1,2,3

1 Center for Biomedical Informatics and Section of Genetic Medicine, Department of Medicine, The University of Chicago, Chicago, IL 60637 USA
2 The University of Chicago Cancer Research Center, and The Ludwig Center for Metastasis Research, The University of Chicago, Chicago, IL 60637 USA
3 The Institute for Genomics and Systems Biology, and the Computational Institute, The University of Chicago, Chicago, IL 60637 USA

May 14th, 2009

Abstract

This is the vignette for the computational implementation of the PGnet algorithm. We introduce the Bioconductor packages it relies on and their use in R, and provide several new functions: VEO, to run cross-validation using the PGnet algorithm; robustBiomodule, to identify robust mechanism-anchored biomodules; and Cytoscape, to generate the input tables for visualizing the network.

Contents

Introduction
Dependent Packages
Expression Data
Seed Genes
Vector of Differentially Expressed genes in One Phenotype
Vector of Genes Co-expressed with the Seed Genes
Vectorial Enrichment Optimization (VEO) & GEMs
Network Visualization
Cross-validation
Robust Seed Genes and GEMs
Reference

Introduction

PGnet is a simple algorithm to identify the association between mechanism-anchored seed genes and clinical phenotypes of interest, together with the significantly enriched genes contributing to the association. The algorithm is based on expression profiling and prior knowledge. In this vignette we demonstrate the implementation of PGnet on known DNA methylation anchored genes and leukemia outcome. However, PGnet can be applied to other molecular mechanisms, such as transcriptional networks, microRNA regulation or Gene Ontology classes, and to other diseases.

Dependent Packages

The implementation makes use of functions in the following R and Bioconductor packages:

> library(stats)
> library(Biobase)
> library(hgu133a.db)
> library(multtest)
> library(twilight)
> library(OrderedList)
> library(e1071)
> library(pamr)

Expression Data

For illustration, we analyze a subset of the leukemia expression data published by Ross et al.[1]. This data set contains 132 ALL arrays normalized with the variance stabilization and calibration normalization (vsn) method[2]. An additional inter-quartile range (IQR) filter[3] was applied to eliminate the genes without sufficient variation across the samples in the dataset (Suppl. Methods), resulting in 7,256 probe-sets.
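The IQR filter mentioned above can be sketched in base R as follows. This is a minimal illustration on simulated data: the matrix `toy.mat`, the helper name `iqr.filter`, and the cutoff value are assumptions made for this example, not the settings that produced the filtered leukemia data.

```r
# Hypothetical toy data: 200 probe-sets x 20 samples (not the leukemia data)
set.seed(1)
toy.mat <- matrix(rnorm(200 * 20), nrow = 200,
                  dimnames = list(paste0("probe", 1:200), paste0("s", 1:20)))
# Give the first 50 probe-sets a larger spread so that they tend to pass the filter
toy.mat[1:50, ] <- toy.mat[1:50, ] * 5

# Keep the rows whose inter-quartile range across samples exceeds 'cutoff'
iqr.filter <- function(x, cutoff) {
  row.iqr <- apply(x, 1, IQR)
  x[row.iqr > cutoff, , drop = FALSE]
}

filtered <- iqr.filter(toy.mat, cutoff = 3)  # cutoff = 3 is an arbitrary choice
dim(filtered)
```

The `drop = FALSE` argument keeps the result a matrix even if only one probe-set survives the filter.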
> load("vignette/eset.rdata")
> eset
ExpressionSet (storageMode: lockedEnvironment)
assayData: 7256 features, 132 samples
  element names: exprs
phenoData
  sampleNames: JD-ALD318-v5-U133A, JD-ALD066-v5-U133A, ..., JD-ALD054-v5-U133A (132 total)
  varLabels and varMetadata description:
    Sample_name: sample ID
    Type: subtype
    Flow_up: outcome
    Novel_group: novel group
featureData
  featureNames: 1405_i_at, 200000_s_at, ..., AFFX-HUMGAPDH/M33197_5_at (7256 total)
  fvarLabels and fvarMetadata description: none
experimentData: use 'experimentData(object)'
Annotation: Ross et al. Classification of pediatric acute lymphoblastic leukemia by gene expression profiling. Blood 2003;102: 2951-9.
> expr.mat = exprs(eset)
> phenotype = pData(eset)

Seed Genes

As an example, we use the symbols of genes that encode DNA methyltransferases and methyl-binding proteins, and then look up the corresponding probe-sets in the array data. In general, the seed genes are a set of genes anchored to one known mechanism; any regulatory mechanism, including DNA methylation, histone modification, transcriptional regulation, or microRNA regulation, can be used.
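The seed lookup in this section relies on a helper, probegrep, whose source is not listed in this vignette. The sketch below shows one way such a lookup could work, assuming it matches gene symbols by prefix in a probe-set-to-symbol list. The function name `probegrep.sketch`, the toy mapping `toy.map`, and the prefix-matching rule are assumptions for illustration; the real probegrep.r may behave differently.

```r
# Hypothetical stand-in for as.list(hgu133aSYMBOL): probe-set ID -> gene symbol.
# The last two entries are made-up IDs used only to show non-matching cases.
toy.map <- list("201697_s_at" = "DNMT1",
                "218457_s_at" = "DNMT3A",
                "202616_s_at" = "MECP2",
                "214631_at"   = "ZBTB33",
                "fake1_at"    = "GAPDH",
                "fake2_at"    = NA)

# Return the probe-sets whose symbol starts with 'pattern'
# (assumed behaviour, not the authors' implementation)
probegrep.sketch <- function(pattern, symbol.list) {
  symbols <- unlist(symbol.list)               # named character vector
  hits <- grep(paste0("^", pattern), symbols)  # NA symbols never match
  symbols[hits]
}

hits.dnmt <- probegrep.sketch("DNMT", toy.map)
hits.dnmt
```

Because the match is by prefix, a query such as "DNMT" returns every family member present in the mapping, which is why the resulting probe list should still be verified manually.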
> SeedSymbol = c("DNMT","MBD","MECP2","ZBTB33")
>
> xxU133a = as.list(hgu133aSYMBOL)
> source("vignette/probegrep.r")
> probes = NULL
>
> for (i in seq_along(SeedSymbol)) {
+   probes = c(probes, probegrep(SeedSymbol[i], xxU133a))
+ }
>
> probes = unlist(probes)
> probes ## need to be manually verified
201697_s_at   220139_at 220668_s_at 218457_s_at    41160_at 214396_s_at
    "DNMT1"    "DNMT3L"    "DNMT3B"    "DNMT3A"      "MBD3"      "MBD2"
208595_s_at   220195_at 202463_s_at 202485_s_at 209579_s_at 202484_s_at
     "MBD1"      "MBD5"      "MBD3"      "MBD2"      "MBD4"      "MBD2"
  214048_at 203353_s_at 214047_s_at   214397_at 209580_s_at 202616_s_at
     "MBD4"      "MBD1"      "MBD4"      "MBD2"      "MBD4"     "MECP2"
202618_s_at 202617_s_at   214631_at
    "MECP2"     "MECP2"    "ZBTB33"
>
> ESP = names(probes)[which(names(probes) %in% featureNames(eset))]
> length(ESP)
[1] 8

Based on the Bioconductor hgu133a_2.2.0 annotation for the Affymetrix HG-U133A array, 21 probe-sets are annotated as DNA methylation anchored genes. Of these, eight Epigenetic Seed Genes (ESGs) are present in the expression data after the IQR filter.

Vector of Differentially Expressed genes in One Phenotype

As an illustration, we focus on the comparison between two sample groups (n=87): relapse after treatment and continuous complete remission (CCR).

> ALLtype = as.factor(phenotype[, "Flow_up"])
> table(ALLtype)
ALLtype
 2nd_AML      CCR  Relapse subgroup
       6       71       16       39
> outcomeSamples = which(ALLtype %in% c("Relapse", "CCR"))
> clinical = phenotype[outcomeSamples, ]
> expr.mat = expr.mat[, outcomeSamples]
> dim(expr.mat)
[1] 7256 87
> ALLtype = as.factor(clinical[, "Flow_up"])
> yin = as.numeric(ALLtype)
> names(yin) = colnames(expr.mat)

Next, we measure every gene's differential expression between these two sub-groups of leukemia outcome.
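To make the idea of a per-gene differential expression score concrete, the following base-R sketch computes a Welch t-statistic row by row, with the higher class label treated as the case group. The toy matrix `toy.expr`, the labels `toy.yin`, and the helper `row.t` are assumptions for illustration only; this vignette uses package functions for this step, and the sketch is not a substitute for them.

```r
# Hypothetical toy data: 100 genes x 12 samples
set.seed(2)
toy.expr <- matrix(rnorm(100 * 12), nrow = 100,
                   dimnames = list(paste0("g", 1:100), paste0("s", 1:12)))
toy.yin <- rep(c(1, 2), each = 6)  # 1 = control, 2 = case (higher label = case)

# Shift the first five genes upward in the case group
toy.expr[1:5, toy.yin == 2] <- toy.expr[1:5, toy.yin == 2] + 3

# Welch t-statistic per row: case (highest label) versus control (lowest label)
row.t <- function(x, y) {
  apply(x, 1, function(g) {
    unname(t.test(g[y == max(y)], g[y == min(y)])$statistic)
  })
}

toy.score <- row.t(toy.expr, toy.yin)
head(sort(toy.score, decreasing = TRUE))
```

A positive score means higher expression in the case group; sorting the scores gives the ranked gene vector that the later enrichment step works on.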
The code below uses the function twilight.pval in the package twilight[4]. Other functions, such as sam in the package siggenes[5] or mt.teststat in the package multtest, should also work in this step to measure the level of differential expression.

The inputs are:
expr.mat: The normalized and filtered expression data; rows are genes and columns are samples;
yin: A numerical vector, of length equal to the number of samples, containing the class labels to test case vs. control; the higher label denotes the case samples and the lower label the control samples;
method: The method to calculate the differential expression.

The outputs are:
score: A vector of differential expression measurements for every gene;
pval: A vector of p-values of differential expression for every gene; an optional output when 'method' is 'fc', 'z' or 't.twilight' using the package 'twilight';
qval: A vector of q-values of differential expression for every gene; an optional output when 'method' is 'fc', 'z' or 't.twilight' using the package 'twilight'.

> B = 1000
> S = twilight.pval(expr.mat, yin, method="t", B=B)
No complete enumeration. Prepare permutation matrix.
Compute vector of observed statistics.
Compute expected scores and p-values. This will take approx. 40 seconds.
Compute q-values.
Compute values for confidence lines.
> index = S$result$index
> ## restore the original gene order, then label each vector by probe-set
> score = S$result$observed[order(index)]
> pval = S$result$pval[order(index)]
> qval = S$result$qval[order(index)]
> names(score) = names(pval) = names(qval) = rownames(expr.mat)

Vector of Genes Co-expressed with the Seed Genes

In parallel, we measure every gene's co-expression with each of the eight seed genes, using the Pearson correlation test on the expression profiles of these 87 samples. The following code calculates every gene's co-expression with each of the eight ESGs in turn. As a result, we obtain eight vectors of Pearson correlation coefficients.
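The co-expression vectors described above are ordinary Pearson correlations between every gene's profile and a seed gene's profile. As a minimal sketch, they can be computed for all seeds at once with the base-R function cor. The toy matrix `toy.expr` and the seed names in `toy.seeds` are assumptions for illustration; unlike the package-based code in this section, this sketch yields coefficients only, without p-values or q-values.

```r
# Hypothetical toy data: 50 genes x 30 samples
set.seed(3)
toy.expr <- matrix(rnorm(50 * 30), nrow = 50,
                   dimnames = list(paste0("g", 1:50), paste0("s", 1:30)))
toy.seeds <- c("g1", "g2")  # pretend these two rows are the seed genes

# Make gene g3 closely track seed g1 so that one strong correlation exists
toy.expr["g3", ] <- toy.expr["g1", ] + rnorm(30, sd = 0.1)

# cor() correlates columns, so transpose: samples in rows, genes in columns
toy.corrx <- cor(t(toy.expr),
                 t(toy.expr[toy.seeds, , drop = FALSE]),
                 method = "pearson")
dim(toy.corrx)  # genes in rows, seed genes in columns
```

Each column of `toy.corrx` plays the role of one co-expression vector: sorting it ranks all genes by their correlation with that seed.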
The inputs are:
expr.mat: Normalized and filtered expression data; rows are genes and columns are samples;
ESP: The probe-sets of the mechanism-anchored seed genes (ESGs);
method: The method to calculate the correlation coefficient.

The outputs are:
corrx: A matrix of co-expression measurements for every gene; rows correspond to genes and columns correspond to seed genes;
corrp: A matrix of the corresponding p-values of the correlation;
corrq: A matrix of the corresponding q-values of the correlation.

> corrx = corrp = corrq = matrix(1, nrow=nrow(expr.mat), ncol=length(ESP))
> for (j in seq_along(ESP)) {
+   toCompare = which(rownames(expr.mat) %in% ESP[j])
+   ## expression profile of the j-th seed gene; a separate name is used so
+   ## that the class labels in 'yin' are not overwritten
+   seed.expr = expr.mat[toCompare,]
+ # corrx[,j] = cor(t(expr.mat), expr.mat[toCompare,], method = "pearson" )
+   S = twilight.pval(expr.mat, seed.expr, method="pearson")
+   index = S$result$index
+   corrx[,j] = S$result$observed[order(index)]
+   corrp[,j] = S$result$pval[order(index)]
+   corrq[,j] = S$result$qval[order(index)]
+ }
Compute vector of observed statistics.
Compute expected scores and p-values. This will take approx. 10 seconds.
Compute q-values.
Compute values for confidence lines.
Compute vector of observed statistics.
Compute expected scores and p-values. This will take approx. 20 seconds.
Compute q-values.
Compute values for confidence lines.
Compute vector of observed statistics.
Compute expected scores and p-values. This will take approx. 40 seconds.
Compute q-values.
Compute values for confidence lines.
Compute vector of observed statistics.
Compute expected scores and p-values. This will take approx. 30 seconds.
Compute q-values.
Compute values for confidence lines.
Compute vector of observed statistics.
Compute expected scores and p-values. This will take approx. 20 seconds.
Compute q-values.
Compute values for confidence lines.
Compute vector of observed statistics.
Compute expected scores and p-values. This will take approx. 30 seconds.
Compute q-values.
Compute values for confidence lines.
Compute vector of observed statistics.
Compute expected scores and p-values. This will take approx. 30 seconds.
Compute q-values.
Compute values for confidence lines.
Compute vector of observed statistics.
Compute expected scores and p-values. This will take approx. 20 seconds.
Compute q-values.
Compute values for confidence lines.
>
> rownames(corrx) = rownames(corrp) = rownames(corrq) = rownames(expr.mat)
> colnames(corrx) = colnames(corrp) = colnames(corrq) = ESP
>
> dim(corrx)
[1] 7256 8

Vectorial Enrichment Optimization (VEO) & GEMs

GEMs are Genes that are not only co-Expressed with Mechanism-anchored seed genes but also differentially expressed in the phenotype of interest. This part makes use of the algorithm that we previously proposed: the function compareLists in the package OrderedList[6] estimates the significance of the similarity of two vectors of gene ranks, and the function getOverlap identifies the optimal enriched genes from two corresponding significantly enriched ranks.

In this example, we do not know how the DNA methylation anchored genes regulate leukemia outcomes, so we consider the comparison of both ends of the two vectors: the straight comparison (one ordered vector against the other) and the reversed comparison (one ordered vector against the flipped version of the other).

The inputs are:
corrx: The matrix of correlation coefficients; rows are genes and columns are seed genes;
score: The vector of differential expression for every gene in the matrix 'corrx';
T.sig: The threshold of significance; the default value is 0.05.

The outputs are:
n.pos: The optimal length of ranks to be observed in each straight comparison of a pair of vectors.
n.rev: The optimal length of ranks to be observed in each reversed comparison of a pair of vectors, i.e. one vector against the flipped second vector.
SIMx: A matrix, with columns corresponding to the significantly similar vectors and rows corresponding to the enriched genes.
revSIMx: A matrix, with columns corresponding to the significantly reversed similar vectors and rows corresponding to the enriched genes.

> T.sig = 0.05   ## threshold of significance (default value)
> n.LP = ncol(corrx)
> SIMx = matrix(0, nrow = nrow(expr.mat), ncol = n.LP)
> rownames(SIMx) = rownames(expr.mat)
> revSIMx = SIMx
> lab = NULL
> n.pos = n.rev = array(dim = n.LP)
> p.pos = p.rev = array(dim = n.LP)
> LP.lab = paste(levels(ALLtype)[1], levels(ALLtype)[2], sep = "_")
>
> for (i in 1:n.LP) {
+   lab = c(lab, paste(colnames(corrx)[i], LP.lab, sep = "~") )
+   x = compareLists(order(corrx[,i]), order(score), alphas = NULL)
+   res = getOverlap(x)
+   p.pos[i] = min(x$pvalues)
+   p.rev[i] = min(x$revPvalues)
+   ## which.min() returns a single index even when the minimum is tied
+   n.pos[i] = x$nn[which.min(x$pvalues)]
+   n.rev[i] = x$nn[which.min(x$revPvalues)]
+   if(p.pos[i] < T.sig) SIMx[res$intersect, i] = 1
+   if(p.rev[i] < T.sig) revSIMx[res$intersect, i] = 1
+ }
Simulating random scores...
0%.......:.........:.........:.........:......100%
-------------------------------------------------
Simulating random scores...
0%.......:.........:.........:.........:......100%
-------------------------------------------------
Simulating random scores...
0%.......:.........:.........:.........:......100%
-------------------------------------------------
Simulating random scores...
0%.......:.........:.........:.........:......100%
-------------------------------------------------
Simulating random scores...
0%.......:.........:.........:.........:......100%
-------------------------------------------------
Simulating random scores...
0%.......:.........:.........:.........:......100%
-------------------------------------------------
Simulating random scores...
0%.......:.........:.........:.........:......100%
-------------------------------------------------
Simulating random scores...
0%.......:.........:.........:.........:......100%
-------------------------------------------------
> length(n.pos)
[1] 8
> names(p.pos) = names(p.rev) = names(n.pos) = names(n.rev) = lab
> colnames(SIMx) = colnames(revSIMx) = lab
>
> SIMx = SIMx[which(apply(SIMx,1,sum)>0), , drop = FALSE]
> revSIMx = revSIMx[which(apply(revSIMx,1,sum)>0), , drop = FALSE]
>
> SIMx = SIMx[, which(apply(SIMx,2,sum)>0), drop = FALSE]
> revSIMx = revSIMx[, which(apply(revSIMx,2,sum)>0), drop = FALSE]
>
> dim(SIMx)
[1] 0 0
> dim(revSIMx)
[1] 1214 2
> table(yin, ALLtype) # the higher label denotes the case, while the lower label the control samples
   ALLtype
yin CCR Relapse
  1  71       0
  2   0      16

Consequently, we obtain two seed genes associated with dys-regulation of leukemia outcome. Both vectors of differential expression were tested as case (Relapse) vs. control (CCR), and the vectorial enrichment is significant in the reversed direction; we can therefore infer that the two identified seed genes were both down-regulated in the leukemia relapse group.

Additionally, based on the above results, we can see which seed genes are associated with leukemia relapse compared with CCR:

> xxU133a["201697_s_at"]
$`201697_s_at`
[1] "DNMT1"
> xxU133a["218457_s_at"]
$`218457_s_at`
[1] "DNMT3A"

We can also identify the genes that are co-expressed with the seed genes and differentially expressed in the phenotype (GEMs). In the resulting matrix, the probe-sets of enriched genes are labeled 1, and otherwise 0, in the corresponding column. Each column is named in the form "A~B", where A is the probe ID of the seed gene and B is the two subgroups in the form control_case.
> revSIMx
            201697_s_at~CCR_Relapse 218457_s_at~CCR_Relapse
200000_s_at                       1                       1
200007_at                         1                       0
200026_at                         1                       1
200043_at                         1                       1
200052_s_at                       1                       1
200061_s_at                       0                       1
200062_s_at                       0                       1
200072_s_at                       1                       1
...

Importantly, the output also reports the optimal maximal ranks to be counted for the observed significance:

> n.rev[colnames(revSIMx)]
201697_s_at~CCR_Relapse 218457_s_at~CCR_Relapse
                    750                     500

Network Visualization

The visualization of the ESG-GEM-LP network can be performed using the R package Rgraphviz or the software Cytoscape. To use R, please refer to the vignette of the package Rgraphviz. The code below generates the tables of edges and nodes used as inputs for Cytoscape to visualize the resulting network.

The inputs are:
SIMx: The resulting matrix of PGnet from the straight VEO or the reversed VEO;
score: The resulting matrix of differential expression measurements for every examined gene in the dataset between the two sub-groups;
qval: The significance of differential expression for every examined gene in the dataset between the two sub-groups, measured as a q-value;
T.DE: A threshold of significance for differential expression;
corrx: The resulting matrix of co-expression measurements between every examined gene in the dataset and the seed genes;
corrq: The significance of co-expression between every examined gene in the dataset and the seed genes, measured as a q-value;
T.CE: A threshold of significance for co-expression;
GeneSymbolList: A list of the symbols of every examined gene;
Dir: The direction of vectorial enrichment, "+" for straight significant and "-" for reversed significant;
ShowSig: A Boolean value indicating whether to show only the significantly (T.DE) differentially expressed and significantly (T.CE) co-expressed genes among the resulting significantly (Sig.T) enriched ranks;
Sig.T: A threshold of significance of vectorial enrichment.
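To make the shape of an edge table and a node table concrete, the sketch below derives both from a small 0/1 membership matrix of the same form as the enrichment matrices above. This is not the Cytoscape function of this vignette: the toy matrix `toy.SIMx`, the column names `source`, `target`, `id` and `type`, and the whole construction are assumptions made for illustration.

```r
# Hypothetical 0/1 matrix: rows are genes, columns are seed~phenotype labels
toy.SIMx <- matrix(c(1, 1, 0,
                     1, 0, 1), nrow = 3,
                   dimnames = list(c("geneA", "geneB", "geneC"),
                                   c("seed1~CCR_Relapse", "seed2~CCR_Relapse")))

# One edge per (gene, seed) pair flagged with 1
hits <- which(toy.SIMx == 1, arr.ind = TRUE)
toy.edges <- data.frame(source = colnames(toy.SIMx)[hits[, "col"]],
                        target = rownames(toy.SIMx)[hits[, "row"]],
                        stringsAsFactors = FALSE)

# One node per distinct seed label or gene, tagged by its role
toy.nodes <- data.frame(id = unique(c(toy.edges$source, toy.edges$target)),
                        stringsAsFactors = FALSE)
toy.nodes$type <- ifelse(toy.nodes$id %in% toy.edges$source, "seed", "GEM")

toy.edges
toy.nodes
```

Tables of this kind can be written with write.csv and imported into a network viewer; further columns (for example a score or a q-value per node) would be added in the same way.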
> ## 'score' and 'qval' are vectors at this point; convert them to one-column
> ## matrices so that row and column names can be set
> score = matrix(score, ncol = 1, dimnames = list(rownames(expr.mat), "CCR_Relapse"))
> qval = matrix(qval, ncol = 1, dimnames = list(rownames(expr.mat), "CCR_Relapse"))
> res2 = Cytoscape(revSIMx, score, qval, T.DE=0.02, corrx, corrq, T.CE=0.05, xxU133a, "", ShowSig=FALSE, sig.T=0.05)
> Edges <- res2$edges
> Nodes <- res2$nodes
> dup.Nodes <- which(duplicated(rownames(Nodes)))
> if(length(dup.Nodes)>0) Nodes = Nodes[-dup.Nodes, ]
> dim(Edges)
[1] 1750 5
> dim(Nodes)
[1] 875 4
> write.csv(Edges, file="sig_Edges_Cytoscape.csv")
> write.csv(Nodes, file="sig_Nodes_Cytoscape.csv")

Cross-validation

A robust molecular signature is one that repeatedly appears under random sampling[7]. As a simple solution, we propose that the robust ESGs refer to those GEMs that were among the top 5% of frequencies in one hundred iterations of three-fold cross-validation.

The inputs here are:
expr.mat: The normalized and filtered expression data;
ALLtype: A character vector of the phenotypes for every sample;
yin: A numerical vector, of length equal to the number of samples, containing the class labels to test case vs. control; the higher label denotes the case samples and the lower label the control samples. The names of 'yin' should be the same as the sample names;
seedProbe: A character vector of the candidate seed genes;
max.rank: The maximum rank to be considered, either a single value for all seed genes or distinct values for every seed gene;
cross.repeat: The number of iterations of cross-validation; the default value is 100;
cross.out: The number of folds into which the samples are split for cross-validation; the default value is 3. If the number 1 is used, a leave-one-out cross-validation is performed;
stratified: A Boolean value indicating whether a stratified strategy of randomly splitting the samples is used for the cross-validation; the default value is TRUE;
min.weight: The minimal weight to be considered in the algorithm of OrderedList[6, 8];
CE.method: The distance measure to be used for correlation between genes and the seed gene.
This must be one of "pearson", "kendall", "spearman".
Nbin: The number of bins used to calculate discrete probabilities when 'CE.method' is 'MI'; the default is 10;
DE.method: A character string specifying the method to measure the differential expression. This must be one of "fc", "z", "t.twilight", "t", "t.equalvar", "wilcoxon", "f", "pairt" or "blockf";
pred.method: A character string specifying the prediction model to be applied in the cross-validation, which must be one of "SVM" or "PAM".

The outputs are:
CoCEPG: A list of matrices holding the straight significant results of the 'cross.repeat' iterations, with GEMs in rows and seed genes in columns;
RevCEPG: A list of matrices holding the reversed significant results of the 'cross.repeat' iterations, with GEMs in rows and seed genes in columns;
model: A list of the resulting models for the training set of each iteration;
pred.res: A list of the resulting predictions for the test set of each iteration;
vote: A matrix of votes resulting from the cross-validation; rows correspond to iterations and columns correspond to sample names;
TP: The true positive rate for every iteration of cross-validation, i.e. a sample is correctly predicted as case when it is a case;
FP: The false positive rate for every iteration, i.e. a sample is erroneously predicted as case when it is a control;
TN: The true negative rate for every iteration, i.e. a sample is correctly predicted as control when it is a control;
FN: The false negative rate for every iteration, i.e. a sample is erroneously predicted as control when it is a case;
phenotype.anno: A table reporting the codes used for the phenotypes of the input samples, with sample sizes;
vote.tb: A table of the resulting votes, in which rows are the sample names and columns are the probes of the seed genes;
block: A matrix of Boolean values indicating whether a sample is randomly selected into the training set for cross-validation, where rows are corresponding to iterations and columns are
corresponding to the samples.

As an example, we run 4 iterations of cross-validation using the following code. Only the reversed similarities are considered, because there is no significant result for the straight similarity:

> source("vignette/VEO.r")
> CV = VEO(expr.mat,
+          yin,
+          seedProbe = ESP,
+          max.rank = n.rev,
+          cross.repeat = 4,
+          cross.outer = 3,
+          stratify = TRUE,
+          CE.method = "pearson",
+          DE.method = "t",
+          pred.method = "PAM")
12121212iteration : 1
Simulating random scores...
0%.......:.........:.........:.........:......100%
-------------------------------------------------
Simulating random scores...
0%.......:.........:.........:.........:......100%
-------------------------------------------------
iteration : 2
Simulating random scores...
0%.......:.........:.........:.........:......100%
-------------------------------------------------
Simulating random scores...
0%.......:.........:.........:.........:......100%
-------------------------------------------------
iteration : 3
Simulating random scores...
0%.......:.........:.........:.........:......100%
-------------------------------------------------
Simulating random scores...
0%.......:.........:.........:.........:......100%
-------------------------------------------------
iteration : 4
Simulating random scores...
0%.......:.........:.........:.........:......100%
-------------------------------------------------
Simulating random scores...
0%.......:.........:.........:.........:......100%
-------------------------------------------------
>
> names(CV)
 [1] "CoCEPG"         "RevCEPG"        "model"          "pred.res"
 [5] "vote"           "TP"             "FP"             "TN"
 [9] "FN"             "phenotype.anno" "vote.tb"        "block"
> length(CV$CEPG)
[1] 0
> length(CV$RevCEPG)
[1] 4

The above results suggest that no seed gene shows a straight significant correlation with the phenotype of interest, whereas reversed significant results were recorded in all four iterations.
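The stratified three-fold splitting used by the cross-validation can be sketched in base R as follows. The class sizes mirror the 71 CCR and 16 Relapse samples of this vignette, but the helper `stratified.folds` and the toy labels `toy.yin` are assumptions for illustration and are not taken from VEO.r.

```r
set.seed(4)
# Hypothetical labels with the same class sizes as CCR (1) and Relapse (2)
toy.yin <- c(rep(1, 71), rep(2, 16))
names(toy.yin) <- paste0("s", seq_along(toy.yin))

# Assign each sample to one of k folds, separately within each class,
# so that every fold keeps roughly the original class proportions
stratified.folds <- function(y, k) {
  fold <- integer(length(y))
  for (cl in unique(y)) {
    idx <- which(y == cl)
    fold[idx] <- sample(rep(seq_len(k), length.out = length(idx)))
  }
  fold
}

toy.fold <- stratified.folds(toy.yin, k = 3)
fold.table <- table(fold = toy.fold, class = toy.yin)
fold.table
```

In each round, one fold would serve as the test set and the remaining two as the training set; repeating the random assignment gives the repeated iterations described above.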
We can inspect the precision-recall plot and the boxplot of the total accuracy:

> recall = CV$TN/(CV$TN + CV$FP)
> precision = CV$TN/(CV$TN + CV$FN)
> plot(precision, recall, main = "Prediction of CCR")
> accuracy = (CV$TP + CV$TN)/(CV$TP + CV$TN + CV$FP + CV$FN)
> boxplot(accuracy, main = "Total prediction accuracy for Relapse")
>
> recall = CV$TP/(CV$TP + CV$FN)
> precision = CV$TP/(CV$TP + CV$FP)
> plot(precision, recall, main = "Prediction of Relapse")

Figure 1. The box-and-whisker plot of the total prediction accuracy (the lower quartile, median and upper quartile, and the largest observation; whiskers are the 95% and 5% intervals).

Figure 2. The precision-recall graph resulting from the four iterations of cross-validation.

Robust Seed Genes and GEMs

A robust molecular signature is one that repeatedly appears under random sampling[7]. We can identify the robust biomodule from the seed genes resulting from the cross-validation and their corresponding GEMs, as the following code demonstrates.

The inputs are:
eset: An ExpressionSet object or a matrix of the expression data to be studied;
phenotype: If 'eset' is an ExpressionSet object, a string indicating the phenotype of interest for the samples; otherwise, a character vector giving the value of the phenotype of interest for every sample;
CV: An object resulting from the 'VEO' function;
T.robust: The threshold to call a signature "significant" from the 'vote' matrix; the default is the 95% quantile of the highest frequency;
array.Anno: A character vector giving the annotation for the corresponding probe-sets in 'eset'.

The outputs are:
RSeed: A matrix with the robust seed genes associated with the phenotype of interest in rows, and the sign of their direction of correlation ('+' for straight and '-' for reversed) in columns;
RGEM: A list of matrices with the names of the genes contributing to one distinct 'RSeed'. The name of each list element gives the associated seed gene and the direction of correlation (1 for straight and 2 for reversed).
seed.fre: A list of the calling frequencies for each identified 'RSeed';
GEM.fre: A matrix of the calling frequencies for every observed gene in the array corresponding to each 'RSeed'.

> source("vignette/robustBiomodule.r")
>
> PL = clinical[,"Flow_up"]
> names(PL) = rownames(clinical)
> res <- robustBiomodule(expr.mat, phenotype=PL,
+                        CV, T.robust = 0.95,
+                        array.Anno = unlist(xxU133a[rownames(expr.mat)]))
> names(res)
[1] "RSeed"    "RGEM"     "seed.fre" "GEM.fre"
> res$RSeed
            Seed     sign
201697_s_at "DNMT1"  "-"
218457_s_at "DNMT3A" "-"
> res$RGEM
$`201697_s_at~2`
200052_s_at 200072_s_at 200709_at 200783_s_at "STMN1" "ILF2" "HNRPM" "FKBP1A"
201292_at 201589_at 201664_at "TOP2A" "SMC1A" 202107_s_at "MCM2" 202705_at "CCNB2"
203422_at "HNRNPA0" 201970_s_at "SMC4" "CKS1B" 202462_s_at 202503_s_at 202580_x_at
"DDX46" "KIAA0101" "FOXM1" 202748_at 203046_s_at "GBP2" 203432_at 203755_at
"TMPO" "BUB1B" 204126_s_at 204146_at 204162_at "RAD51AP1" "NASP" 202626_s_at "LYN"
203145_at "TIMELESS" "POLD1" "CDC45L" 201897_s_at 201054_at 203213_at "SPAG5"
203976_s_at "CDC2" 204026_s_at "CHAF1A" "ZWINT" 204240_s_at "NDC80" 204444_at
"SMC2" "KIF11" 204494_s_at 204531_s_at 204641_at 204649_at 204695_at
"C15orf39" "BRCA1" "NEK2" "TROAP" "CDC25A" 204768_s_at 204825_at 204887_s_at
205024_s_at 206102_at "FEN1" "MELK" "PLK4" 206120_at 206364_at "CD33" "KIF14"
208939_at 209026_x_at 209053_s_at 209408_at 209642_at "TUBB" "WHSC1" "KIF2C" "BUB1"
210001_s_at 210052_s_at 210225_x_at 210911_at "LILRB3" "ID2B" "SEPHS1" 209735_at
"ABCG2" 212017_at "LOC130074" 213110_s_at "COL4A5" "RAD51" "GINS1" 207828_s_at
208808_s_at 208881_x_at "CENPF" "HMGB2" "SOCS1" 212063_at "PTGES3" "TPX2"
212949_at 213007_at "NCAPH" "IDI1" 213083_at "FANCI" "SLC35D2" 213599_at
213762_x_at 214710_s_at 214743_at "OIP5" "RBMX" "CCNB1" "CUX1" 216761_at
216874_at 217025_s_at 215239_x_at 216026_s_at "ZNF273" "POLE" 217118_s_at
217547_x_at "C22orf9" "ZNF675" "PRC1" "ASF1B" "NFYB" 218349_s_at 218600_at
218741_at 218755_at 218782_s_at
"ZWILCH" 219306_at "KIF15" "LIMD2" "RAB33A" "DKFZp686O1327" 218009_s_at "CENPM" 218115_at "KIF20A" 219620_x_at 219789_at 221004_s_at "C9orf167" "NPR3" "ITM2C" 19 "DBN1" 218127_at "ATAD2" 221392_at "TAS2R4" 221505_at 221685_s_at "ANP32E" 222077_s_at "CCDC99" "RACGAP1" $`218457_s_at~2` 200052_s_at 200072_s_at 200709_at 200783_s_at "STMN1" "ILF2" "HNRPM" "FKBP1A" 201292_at 201589_at 201664_at "TOP2A" "SMC1A" 202107_s_at "MCM2" 202705_at "CCNB2" 203422_at "HNRNPA0" 201970_s_at "SMC4" "CKS1B" 202462_s_at 202503_s_at 202580_x_at "DDX46" "KIAA0101" "FOXM1" 202748_at 203046_s_at "GBP2" 203432_at 203755_at "TMPO" "BUB1B" 204126_s_at 204146_at 204162_at "RAD51AP1" "NASP" 202626_s_at "LYN" 203145_at "TIMELESS" "POLD1" "CDC45L" 201897_s_at 201054_at 203213_at "SPAG5" 203976_s_at "CDC2" 204026_s_at "CHAF1A" "ZWINT" 204240_s_at "NDC80" 204444_at "SMC2" "KIF11" 204494_s_at 204531_s_at 204641_at 204649_at 204695_at "C15orf39" "BRCA1" "NEK2" "TROAP" "CDC25A" 204768_s_at 204825_at 204887_s_at 205024_s_at 206102_at "FEN1" "MELK" "PLK4" 206120_at 206364_at "CD33" "KIF14" 208939_at 209026_x_at 209053_s_at 209408_at 209642_at "TUBB" "WHSC1" "KIF2C" "BUB1" 210001_s_at 210052_s_at 210225_x_at 210911_at "LILRB3" "ID2B" "SEPHS1" 209735_at "ABCG2" 212017_at "LOC130074" 213110_s_at "COL4A5" "RAD51" "GINS1" 207828_s_at 208808_s_at 208881_x_at "CENPF" "HMGB2" "SOCS1" 212063_at "PTGES3" "TPX2" 212949_at 213007_at "NCAPH" "IDI1" 213083_at "FANCI" "SLC35D2" 213599_at 213762_x_at 214710_s_at "OIP5" "RBMX" "CCNB1" "CUX1" 216761_at 216874_at 217025_s_at 215239_x_at 216026_s_at "ZNF273" "POLE" 217118_s_at 217547_x_at "C22orf9" "ZNF675" 214743_at "RAB33A" "DKFZp686O1327" 218009_s_at "PRC1" 20 218115_at "ASF1B" "DBN1" 218127_at "NFYB" 218349_s_at 218600_at "ZWILCH" 219306_at "LIMD2" 218741_at 218755_at "CENPM" "KIF20A" 219620_x_at 219789_at 221004_s_at "KIF15" "C9orf167" "NPR3" "ITM2C" 221505_at 221685_s_at 222077_s_at "ANP32E" "CCDC99" 218782_s_at "ATAD2" 221392_at "TAS2R4" "RACGAP1" > 
> hist(res$GEM.fre)

Figure 3. The histogram of the calling frequency for GEMs. This plot is based on the 4 iterations of cross-validation run above in this vignette as an example.

Reference

[1] Ross ME, Zhou X, Song G, Shurtleff SA, Girtman K, Williams WK, Liu HC, Mahfouz R, Raimondi SC, Lenny N, Patel A, Downing JR. Classification of pediatric acute lymphoblastic leukemia by gene expression profiling. Blood 2003;102: 2951-9.
[2] Huber W, von Heydebreck A, Sultmann H, Poustka A, Vingron M. Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics 2002;18 Suppl 1: S96-104.
[3] von Heydebreck A, Huber W, Gentleman R. Differential expression with the Bioconductor Project. Bioconductor Project Working Papers 2004.
[4] Scheid S, Spang R. twilight; a Bioconductor package for estimating the local false discovery rate. Bioinformatics 2005;21: 2921-2.
[5] Tusher VG, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A 2001;98: 5116-21.
[6] Yang X, Bentink S, Scheid S, Spang R. Similarities of ordered gene lists. J Bioinform Comput Biol 2006;4: 693-708.
[7] Michiels S, Koscielny S, Hill C. Prediction of cancer outcome with microarrays: a multiple random validation strategy. Lancet 2005;365: 488-92.
[8] Lottaz C, Yang X, Scheid S, Spang R. OrderedList--a bioconductor package for detecting similarity in ordered gene lists. Bioinformatics 2006;22: 2315-6.