Implementation in R - Mechanism-Anchored Biomodule Derived from
Gene Expression Profiling
Xinan Yang1, Yves A Lussier1,2,3
1 Center for Biomedical Informatics and Section of Genetic Medicine, Department of Medicine, The
University of Chicago, Chicago, IL 60637 USA
2 The University of Chicago Cancer Research Center, and The Ludwig Center for Metastasis
Research, The University of Chicago, Chicago, IL 60637 USA
3 The Institute for Genomics and Systems Biology, and the Computational Institute, The University of
Chicago, Chicago, IL 60637 USA
May 14th, 2009
Abstract
This vignette describes the computational implementation of the PGnet algorithm. We introduce the
Bioconductor packages it relies on and their use in R, and provide several new functions: VEO, to run
cross-validation using the PGnet algorithm; robustBiomodule, to identify robust mechanism-anchored
biomodules; and Cytoscape, to generate the input tables for visualizing the network.
Contents
Introduction
Dependent Packages
Expression Data
Seed Genes
Vector of Differentially Expressed genes in One Phenotype
Vector of Genes Co-expressed with the Seed Genes
Vectorial Enrichment Optimization (VEO) & GEMs
Network Visualization
Cross-validation
Robust Seed Genes and GEMs
Reference
Introduction
PGnet is a simple algorithm to identify associations between mechanism-anchored seed genes and
clinical phenotypes of interest, together with the significantly enriched genes contributing to each
association. The algorithm is based on expression profiling and prior knowledge. In this vignette we
demonstrate the implementation of PGnet on known DNA methylation anchored genes and leukemia
outcome. However, PGnet can be applied to other molecular mechanisms, such as transcriptional
networks, microRNA regulation or Gene Ontology classes, and to other diseases.
Dependent Packages
The implementation makes use of functions in the following R and Bioconductor packages:
> library(stats)
> library(Biobase)
> library(hgu133a.db)
> library(multtest)
> library(twilight)
> library(OrderedList)
> library(e1071)
> library(pamr)
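Before running the vignette it can help to confirm that these packages are installed. The following is a minimal sketch using only base R; the variable names `needed`, `installed` and `missing.pkgs` are chosen here for illustration and are not part of the vignette code.

```r
# Packages loaded by this vignette.
needed <- c("stats", "Biobase", "hgu133a.db", "multtest",
            "twilight", "OrderedList", "e1071", "pamr")

# requireNamespace() returns TRUE/FALSE without attaching the package.
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

# Report anything that still has to be installed.
missing.pkgs <- names(installed)[!installed]
if (length(missing.pkgs) > 0) {
  message("Not installed: ", paste(missing.pkgs, collapse = ", "))
}
```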
Expression Data
For illustration, we analyze a subset of the leukemia expression data published by Ross et al.[1].
This data set contains 132 ALL arrays normalized with the variance stabilization and calibration
normalization (vsn) method[2]. An additional inter-quartile range (IQR) filter[3] was applied to
eliminate the genes without sufficient variation across the samples in the data set (Suppl.
Methods), resulting in 7,256 probe-sets.
> load("vignette/eset.rdata")
> eset
ExpressionSet (storageMode: lockedEnvironment)
assayData: 7256 features, 132 samples
element names: exprs
phenoData
sampleNames: JD-ALD318-v5-U133A, JD-ALD066-v5-U133A, ..., JD-ALD054-v5-U133A
(132 total)
varLabels and varMetadata description:
Sample_name: sample ID
Type: subtype
Flow_up: outcome
Novel_group: novel group
featureData
featureNames: 1405_i_at, 200000_s_at, ..., AFFX-HUMGAPDH/M33197_5_at (7256 total)
fvarLabels and fvarMetadata description: none
experimentData: use 'experimentData(object)'
Annotation: Ross et al. Classification of pediatric acute lymphoblastic leukemia by gene
expression profiling. Blood 2003;102: 2951-9.
> expr.mat = exprs(eset)
> phenotype = pData(eset)
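The IQR filter described in this section can be sketched on a toy matrix as follows. This is only an illustration under stated assumptions: the matrix `toy`, the gene and sample names, and the cutoff `iqr.cutoff` are made up here, and the cutoff actually used for the leukemia data is not given in this vignette.

```r
set.seed(1)
# Toy expression matrix: 6 genes (rows) x 10 samples (columns).
toy <- matrix(rnorm(60), nrow = 6,
              dimnames = list(paste0("gene", 1:6), paste0("s", 1:10)))
toy["gene3", ] <- 5          # a gene with no variation across samples

# Inter-quartile range of every gene across the samples.
gene.iqr <- apply(toy, 1, IQR)

# Hypothetical cutoff; genes at or below it are dropped.
iqr.cutoff <- 0.5
toy.filtered <- toy[gene.iqr > iqr.cutoff, , drop = FALSE]
```

The constant gene has an IQR of zero and is therefore removed, whatever positive cutoff is chosen.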
Seed Genes
As an example, we use the symbols of genes that code for DNA methyltransferases and
methyl-binding proteins, and then look up the corresponding probe-sets in the array data. In general,
the seed genes are a set of genes anchored to one known mechanism; any regulatory mechanism,
including DNA methylation, histone modification, transcriptional regulation, or microRNA
regulation, can be used.
> SeedSymbol = c("DNMT","MBD","MECP2","ZBTB33")
>
> xxU133a = as.list(hgu133aSYMBOL)
> source("vignette/probegrep.r")
> probes = NULL
>
> for (i in 1:length(SeedSymbol)) {
+   probes = c(probes, probegrep(SeedSymbol[i], xxU133a))
+ }
>
> probes = unlist(probes)
> probes ## need to be manually verified
201697_s_at   220139_at 220668_s_at 218457_s_at    41160_at 214396_s_at
    "DNMT1"    "DNMT3L"    "DNMT3B"    "DNMT3A"      "MBD3"      "MBD2"
208595_s_at   220195_at 202463_s_at 202485_s_at 209579_s_at 202484_s_at
     "MBD1"      "MBD5"      "MBD3"      "MBD2"      "MBD4"      "MBD2"
  214048_at 203353_s_at 214047_s_at   214397_at 209580_s_at 202616_s_at
     "MBD4"      "MBD1"      "MBD4"      "MBD2"      "MBD4"     "MECP2"
202618_s_at 202617_s_at   214631_at
    "MECP2"     "MECP2"    "ZBTB33"
>
> ESP = names(probes)[which(names(probes) %in% featureNames(eset))]
> length(ESP)
[1] 8
Based on the Bioconductor hgu133a_2.2.0 annotation for the Affymetrix HG-U133A array, 21 probe-sets
are annotated as DNA methylation anchored genes; of these, eight Epigenetic Seed Genes (ESGs) in the
expression data meet the IQR filter criteria.
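The helper probegrep is sourced from a separate file and is not shown in this vignette. As a rough idea of what such a symbol lookup might do, here is a minimal sketch that matches symbols by prefix in a toy annotation list. The function `probegrep_sketch`, the list `toy.map` and the IDs `p1` to `p5` are hypothetical and only stand in for the real helper and for as.list(hgu133aSYMBOL).

```r
# Toy stand-in for as.list(hgu133aSYMBOL): probe-set ID -> gene symbol.
# The IDs "p1".."p5" are made up for illustration.
toy.map <- list(p1 = "DNMT1", p2 = "MBD2", p3 = "TP53", p4 = "DNMT3A", p5 = NA)

# Hypothetical helper: return the entries whose symbol starts with `pattern`.
probegrep_sketch <- function(pattern, map) {
  symbols <- unlist(map)                       # named character vector
  hit <- grepl(paste0("^", pattern), symbols)  # NA symbols give FALSE
  symbols[hit]
}

probegrep_sketch("DNMT", toy.map)
```

Because a prefix such as "MBD" can also match unrelated symbols, hits from any pattern-based lookup should be checked by hand, as the comment in the session above notes.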
Vector of Differentially Expressed genes in One Phenotype
As an illustration, we here focus on the comparison between two sample groups (n=87): relapse after
treatment and continuous complete remission (CCR).
> ALLtype = as.factor(phenotype[, "Flow_up"])
> table(ALLtype)
ALLtype
 2nd_AML      CCR  Relapse subgroup
       6       71       16       39
> outcomeSamples = which(ALLtype %in% c("Relapse", "CCR"))
> clinical = phenotype[outcomeSamples, ]
> expr.mat = expr.mat[, outcomeSamples]
> dim(expr.mat)
[1] 7256 87
> ALLtype = as.factor(clinical[, "Flow_up"])
> yin = as.numeric(ALLtype)
> names(yin) = colnames(expr.mat)
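The numeric class labels follow the alphabetical order of the factor levels, which is why CCR becomes the control (1) and Relapse the case (2). A small sketch with made-up labels shows the encoding; the vector `outcome` is illustrative only.

```r
# Toy outcome labels for five samples.
outcome <- c("Relapse", "CCR", "CCR", "Relapse", "CCR")

# Factor levels are sorted alphabetically, so "CCR" comes before "Relapse".
f <- as.factor(outcome)
levels(f)

# Numeric codes follow the level order: CCR -> 1 (control), Relapse -> 2 (case).
codes <- as.numeric(f)
table(codes, f)
```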
Subsequently, we measure every gene’s differential expression between these two sub-groups of
leukemia outcome. The code below uses the function twilight.pval in the package twilight[4]; other
functions, such as sam in the package siggenes[5] or mt.teststat in the package multtest, should also
work in this step to measure the level of differential expression.
The inputs are:
- expr.mat: The normalized and filtered expression data; rows are genes and columns are samples;
- yin: A numerical vector, of length equal to the number of samples, containing the class labels to
test case vs. control; the higher label denotes the case samples and the lower label the control
samples;
- method: The method to calculate the differential expression.
The outputs are:
- score: A list of differential expression measurements for every gene;
- pval: A list of p-values of differential expression for every gene, which is an optional output when
‘method’ is ‘fc’, ‘z’ or ‘t.twilight’ using the package ‘twilight’;
- qval: A list of q-values of differential expression for every gene, which is an optional output when
‘method’ is ‘fc’, ‘z’ or ‘t.twilight’ using the package ‘twilight’.
> B = 1000
> S = twilight.pval(expr.mat, yin, method="t", B=B)
No complete enumeration. Prepare permutation matrix.
Compute vector of observed statistics.
Compute expected scores and p-values. This will take approx. 40 seconds.
Compute q-values.
Compute values for confidence lines.
> index = S$result$index
> score = S$result$observed[order(index)]
> pval = S$result$pval[order(index)]
> qval = S$result$qval[order(index)]
> names(score) = names(pval) = names(qval) = rownames(expr.mat)
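The indexing with order(index) puts the results back into the row order of the input matrix. The sketch below illustrates the idea on toy numbers, assuming (as the code above implies) that the result table comes back sorted and that `index` stores each row's position in the original data. The vectors `stat`, `observed` and `restored` are made up for this illustration.

```r
# Toy per-gene statistics in the original row order.
stat <- c(geneA = 0.3, geneB = -2.1, geneC = 1.7, geneD = 0.9)

# Mimic a result table sorted by decreasing absolute statistic;
# `index` records each row's position in the original data.
index    <- order(abs(stat), decreasing = TRUE)
observed <- unname(stat[index])

# order(index) is the inverse permutation, so this restores the original order.
restored <- observed[order(index)]
names(restored) <- names(stat)
```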
Vector of Genes Co-expressed with the Seed Genes
Next, we measure each gene’s co-expression with the eight seed genes, using the Pearson correlation
test on the expression profiles of the 87 samples. The following code calculates every gene’s
co-expression with each of the eight ESGs in turn. As a result, we obtain eight vectors of Pearson
correlation coefficients.
The inputs are:
- expr.mat: The normalized and filtered expression data; rows are genes and columns are samples;
- ESP: The probe-sets of the mechanism-anchored seed genes (ESGs);
- method: The method to calculate the correlation coefficient.
The outputs are:
- corrx: A matrix of co-expression measurements for every gene; rows correspond to genes and
columns correspond to seed genes;
- corrp: A matrix of the corresponding p-values of the correlation;
- corrq: A matrix of the corresponding q-values of the correlation.
> corrx = corrp = corrq = matrix(1, nrow = nrow(expr.mat), ncol = length(ESP))
> for (j in 1:length(ESP)) {
+   toCompare = which(rownames(expr.mat) %in% ESP[j])
+   # profile of the j-th seed probe-set (kept separate from the class labels in yin)
+   seedExpr = expr.mat[toCompare, ]
+   # corrx[, j] = cor(t(expr.mat), seedExpr, method = "pearson")
+   S = twilight.pval(expr.mat, seedExpr, method = "pearson")
+   index = S$result$index
+   corrx[, j] = S$result$observed[order(index)]
+   corrp[, j] = S$result$pval[order(index)]
+   corrq[, j] = S$result$qval[order(index)]
+ }
Compute vector of observed statistics.
Compute expected scores and p-values. This will take approx. 10 seconds.
Compute q-values.
Compute values for confidence lines.
Compute vector of observed statistics.
Compute expected scores and p-values. This will take approx. 20 seconds.
Compute q-values.
Compute values for confidence lines.
Compute vector of observed statistics.
Compute expected scores and p-values. This will take approx. 40 seconds.
Compute q-values.
Compute values for confidence lines.
Compute vector of observed statistics.
Compute expected scores and p-values. This will take approx. 30 seconds.
Compute q-values.
Compute values for confidence lines.
Compute vector of observed statistics.
Compute expected scores and p-values. This will take approx. 20 seconds.
Compute q-values.
Compute values for confidence lines.
Compute vector of observed statistics.
Compute expected scores and p-values. This will take approx. 30 seconds.
Compute q-values.
Compute values for confidence lines.
Compute vector of observed statistics.
Compute expected scores and p-values. This will take approx. 30 seconds.
Compute q-values.
Compute values for confidence lines.
Compute vector of observed statistics.
Compute expected scores and p-values. This will take approx. 20 seconds.
Compute q-values.
Compute values for confidence lines.
>
> rownames(corrx) = rownames(corrp) = rownames(corrq) = rownames(expr.mat)
> colnames(corrx) = colnames(corrp) = colnames(corrq) = ESP
>
> dim(corrx)
[1] 7256 8
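For readers who only need the correlation coefficients, the same kind of vector can be obtained with base R alone. The following is a minimal sketch on a toy matrix; it gives the coefficients but none of the permutation-based p-values or q-values that twilight.pval returns. The names `toy` and `seed.id` are assumptions made for the example.

```r
set.seed(2)
# Toy expression matrix: 5 genes (rows) x 12 samples (columns).
toy <- matrix(rnorm(60), nrow = 5,
              dimnames = list(paste0("gene", 1:5), paste0("s", 1:12)))

seed.id <- "gene2"   # hypothetical seed gene

# cor() works column-wise, so genes are moved to columns with t().
r <- cor(t(toy), toy[seed.id, ], method = "pearson")[, 1]
```

The seed gene correlates perfectly with itself, which is a quick sanity check on the orientation of the matrix.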
Vectorial Enrichment Optimization (VEO) & GEMs
GEMs are the Genes that are not only co-Expressed with the Mechanism-anchored seed genes but also
differentially expressed in the phenotype of interest. This part makes use of the algorithm that we
previously proposed: the function compareLists in the package OrderedList[6] estimates the
significance of the similarity of two vectors of gene ranks, and the function getOverlap identifies
the optimal enriched genes from two corresponding significantly enriched ranks.
In this example, we do not know how the DNA methylation anchored genes regulate leukemia outcomes,
so we consider the comparison of both ends of the two vectors: the straight comparison (one ordered
vector with the other) and the reversed comparison (one ordered vector with the other one flipped).
The inputs are:
- corrx: The matrix of co-expression measurements; rows are genes and columns are seed genes;
- score: The vector of differential expression for every gene in the matrix ‘corrx’;
- T.sig: The threshold of significance; the default value is 0.05.
The outputs are:
- n.pos: The optimal length of ranks to be observed in each straightly compared pair of vectors;
- n.rev: The optimal length of ranks to be observed in each reversely compared pair of vectors, i.e.
one vector with the flipped second vector;
- SIMx: A matrix, with columns corresponding to the significantly similar vectors and rows
corresponding to the enriched genes;
- revSIMx: A matrix, with columns corresponding to the significantly reversed-similar vectors and
rows corresponding to the enriched genes.
> n.LP = ncol(corrx)
> SIMx = matrix(0, nrow = nrow(expr.mat), ncol = n.LP)
> rownames(SIMx) = rownames(expr.mat)
> revSIMx = SIMx
> T.sig = 0.05   # threshold of significance (the default value listed above)
> lab = NULL
> n.pos = n.rev = array(dim = n.LP)
> p.pos = p.rev = array(dim = n.LP)
> LP.lab = paste(levels(ALLtype)[1], levels(ALLtype)[2], sep = "_")
>
> for (i in 1:n.LP) {
+   lab = c(lab, paste(colnames(corrx)[i], LP.lab, sep = "~"))
+   x = compareLists(order(corrx[, i]), order(score), alphas = NULL)
+   res = getOverlap(x)
+   p.pos[i] = min(x$pvalues)
+   p.rev[i] = min(x$revPvalues)
+   n.pos[i] = x$nn[which(x$pvalues == p.pos[i])]
+   n.rev[i] = x$nn[which(x$revPvalues == p.rev[i])]
+   if (p.pos[i] < T.sig) SIMx[res$intersect, i] = 1
+   if (p.rev[i] < T.sig) revSIMx[res$intersect, i] = 1
+ }
Simulating random scores...
0%.......:.........:.........:.........:......100%
-------------------------------------------------
Simulating random scores...
0%.......:.........:.........:.........:......100%
-------------------------------------------------
Simulating random scores...
0%.......:.........:.........:.........:......100%
-------------------------------------------------
Simulating random scores...
0%.......:.........:.........:.........:......100%
-------------------------------------------------
Simulating random scores...
0%.......:.........:.........:.........:......100%
-------------------------------------------------
Simulating random scores...
0%.......:.........:.........:.........:......100%
-------------------------------------------------
Simulating random scores...
0%.......:.........:.........:.........:......100%
-------------------------------------------------
Simulating random scores...
0%.......:.........:.........:.........:......100%
-------------------------------------------------
> length(n.pos)
[1] 8
> names(p.pos) = names(p.rev) = names(n.pos) = names(n.rev) = lab
> colnames(SIMx) = colnames(revSIMx) = lab
>
> SIMx = SIMx[which(apply(SIMx,1,sum)>0),]
> revSIMx = revSIMx[which(apply(revSIMx,1,sum)>0),]
>
> SIMx = SIMx[,which(apply(SIMx,2,sum)>0)]
> revSIMx = revSIMx[,which(apply(revSIMx,2,sum)>0)]
>
> dim(SIMx)
[1] 0 0
> dim(revSIMx)
[1] 1214 2
> table(yin, ALLtype)  # the higher label denotes the case, while the lower label the control samples
   ALLtype
yin CCR Relapse
  1  71       0
  2   0      16
Consequently, we obtain two seed genes associated with the dys-regulation of leukemia outcome. Both
vectors of differential expression were tested as case (Relapse) vs. control (CCR), and the vectorial
enrichments are significant in the reversed direction; we can therefore infer that the two identified
seed genes were both down-regulated in the leukemia relapse group.
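The difference between the straight and the reversed comparison can be illustrated with a deliberately simplified sketch that only counts shared genes among the top ranks. This is not the weighted similarity score or the permutation test of OrderedList; the gene names, the helper `top.overlap` and the rank depth `n` are all made up for the illustration.

```r
# Two toy rankings of the same eight genes (best-ranked first).
list1 <- c("g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8")
list2 <- c("g8", "g7", "g6", "g2", "g5", "g3", "g1", "g4")

# Number of genes shared by the top n of both lists.
top.overlap <- function(a, b, n) length(intersect(head(a, n), head(b, n)))

n <- 3
straight <- top.overlap(list1, list2, n)        # top of list1 vs top of list2
reversed <- top.overlap(list1, rev(list2), n)   # top of list1 vs bottom of list2
```

Here the straight comparison finds no shared genes while the reversed one finds two, the toy analogue of a seed gene whose co-expression ranking runs opposite to the differential-expression ranking.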
Additionally, based on the above results, we can see which seed genes are associated with leukemia
relapse as compared with CCR:
> xxU133a["201697_s_at"]
$`201697_s_at`
[1] "DNMT1"
> xxU133a["218457_s_at"]
$`218457_s_at`
[1] "DNMT3A"
We can also identify the genes that are co-expressed with the seed genes and differentially expressed
in the phenotype (GEMs). In the resulting matrix, the probe-sets of enriched genes are labeled 1, and
0 otherwise, in the corresponding column. Each column is named in the form “A~B”, where A is the
probe ID of the seed gene and B names the two subgroups in the form control_case.
> revSIMx
            201697_s_at~CCR_Relapse 218457_s_at~CCR_Relapse
200000_s_at                       1                       1
200007_at                         1                       0
200026_at                         1                       1
200043_at                         1                       1
200052_s_at                       1                       1
200061_s_at                       0                       1
200062_s_at                       0                       1
200072_s_at                       1                       1
……
Importantly, the output also reports the optimal maximal ranks to be counted for the observed significance:
> n.rev[colnames(revSIMx)]
201697_s_at~CCR_Relapse 218457_s_at~CCR_Relapse
                    750                     500
Network Visualization
The visualization of the ESG-GEM-LP network can be performed using the R package Rgraphviz or the
software Cytoscape. To use R, please refer to the vignette of the package Rgraphviz. The code below
generates the tables of edges and nodes used as inputs for Cytoscape to visualize the resulting network.
The inputs are:
- SIMx: The resulting matrix of PGnet from the straight VEO or the reversed VEO;
- score: The resulting matrix of measurements of differential expression for every examined gene in
the dataset between the two sub-groups;
- qval: The significance of differential expression for every examined gene in the dataset between
the two sub-groups, using the q-value measurement;
- T.DE: A threshold of significance for differential expression;
- corrx: The resulting matrix of measurements of co-expression between every examined gene in the
dataset and the seed genes;
- corrq: The significance of co-expression between every examined gene in the dataset and the seed
genes, using the q-value measurement;
- T.CE: A threshold of significance for co-expression;
- GeneSymbolList: A list of the symbols of every examined gene;
- Dir: The direction of vectorial enrichment, “+” for straight significant and “-” for reversed significant;
- ShowSig: A Boolean value indicating whether to show only the significantly differentially expressed
(T.DE) and significantly co-expressed (T.CE) genes among the resulting significantly enriched (Sig.T)
ranks;
- Sig.T: A threshold of significance of vectorial enrichment.
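> score = as.matrix(score)   # score and qval are named vectors; convert them to one-column matrices
> qval = as.matrix(qval)
> rownames(qval) <- rownames(score)
> colnames(score) = colnames(qval) = "CCR_Relapse"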
> res2 = Cytoscape(revSIMx, score,qval, T.DE=0.02, corrx, corrq, T.CE=0.05, xxU133a, "", ShowSig=FALSE, sig.T=0.05)
> Edges <- res2$edges
> Nodes <- res2$nodes
> dup.Nodes <- which(duplicated(rownames(Nodes)))
> if(length(dup.Nodes)>0) Nodes = Nodes[-dup.Nodes, ]
> dim(Edges)
[1] 1750 5
> dim(Nodes)
[1] 875 4
> write.csv(Edges , file="sig_Edges_Cytoscape.csv")
> write.csv(Nodes , file="sig_Nodes_Cytoscape.csv")
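The Cytoscape function itself is not listed in this vignette. To give an idea of how edge and node tables can be derived from a 0/1 enrichment matrix, here is a minimal base-R sketch. It is not the Cytoscape function: the matrix `sim`, all identifiers, and the column names `source`, `target`, `id` and `type` are assumptions made for the illustration.

```r
# Toy indicator matrix: rows are enriched probe-sets, columns are seed~phenotype pairs.
# All identifiers below are made up for illustration.
sim <- matrix(c(1, 1,
                1, 0,
                0, 1),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("probeA", "probeB", "probeC"),
                              c("seed1~CCR_Relapse", "seed2~CCR_Relapse")))

# One edge per non-zero cell: seed (column) -> enriched gene (row).
hits  <- which(sim == 1, arr.ind = TRUE)
edges <- data.frame(source = sub("~.*$", "", colnames(sim)[hits[, "col"]]),
                    target = rownames(sim)[hits[, "row"]],
                    stringsAsFactors = FALSE)

# One node per distinct identifier, flagged as seed or enriched gene.
ids   <- unique(c(edges$source, edges$target))
nodes <- data.frame(id = ids,
                    type = ifelse(ids %in% edges$source, "seed", "GEM"),
                    stringsAsFactors = FALSE)
```

Tables of this shape can be written with write.csv and imported into Cytoscape as a network table and a node attribute table.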
Cross-validation
A robust molecular signature is one that repeatedly appears under random sampling[7]. As a simple
solution, we propose that robust signatures are those ESGs and GEMs that are among the top 5% of
frequencies over one hundred iterations of three-fold cross-validation.
The inputs here are:
- expr.mat: The normalized and filtered expression data;
- ALLtype: A character vector of the phenotype of every sample;
- yin: A numerical vector, of length equal to the number of samples, containing the class labels to
test case vs. control; the higher label denotes the case samples and the lower label the control
samples. The names of ‘yin’ should be the same as the sample names;
- seedProbe: A character vector of the candidate probe-sets of the seed genes;
- max.rank: The maximum rank to be considered, either a single value for all seed genes or
a distinct value for every seed gene;
- cross.repeat: The number of iterations of cross-validation; the default value is 100;
- cross.out: The number of folds into which the samples are split for cross-validation; the default
value is 3. If the number 1 is used, a leave-one-out cross-validation is performed;
- stratified: A Boolean value indicating whether a stratified strategy of randomly splitting the
samples is used for the cross-validation; the default value is TRUE;
- min.weight: The minimal weight to be considered in the algorithm of OrderedList[6, 8];
- CE.method: The distance measure to be used for the correlation between genes and the seed gene.
This must be one of "pearson", "kendall", "spearman";
- Nbin: The number of bins used to calculate discrete probabilities when ‘CE.method’ is ‘MI’; the
default is 10;
- DE.method: A character string specifying the method to measure the differential expression. This
must be one of "fc", "z", "t.twilight", "t", "t.equalvar", "wilcoxon", "f", "pairt" or "blockf";
- pred.method: A character string specifying the prediction model to be applied in the
cross-validations, which must be one of "SVM" or "PAM".
The outputs are:
- CoCEPG: A list of matrices holding the straight significant results of the ‘cross.repeat’
iterations, with GEMs in rows and seed genes in columns;
- RevCEPG: A list of matrices holding the reversed significant results of the ‘cross.repeat’
iterations, with GEMs in rows and seed genes in columns;
- model: A list of the resulting models for the training set of each iteration;
- pred.res: A list of the resulting predictions for the test set of each iteration;
- vote: A matrix of votes resulting from the cross-validation; rows correspond to iterations
and columns correspond to sample names;
- TP: The true positive rate for every iteration of cross-validation, i.e. a result is correctly
predicted as case when the situation is case;
- FP: The false positive rate for every iteration, i.e. a result is erroneously predicted as case
when the situation is control;
- TN: The true negative rate for every iteration, i.e. a result is correctly predicted as control
when the situation is control;
- FN: The false negative rate for every iteration, i.e. a result is erroneously predicted as control
when the situation is case;
- phenotype.anno: A table reporting the codes used for the phenotypes of the input samples, with
sample sizes;
- vote.tb: A table of the resulting votes, in which rows are the sample names and columns are the
probes of the seed genes;
- block: A matrix of Boolean values indicating whether a sample is randomly selected into the training
set for cross-validation, where rows correspond to iterations and columns correspond to the samples.
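The stratified splitting mentioned among the inputs can be sketched as follows. This is a minimal base-R illustration of stratified three-fold assignment, not the code inside VEO; the vectors `labels` and `fold` and the class sizes are made up.

```r
set.seed(3)
# Toy class labels: 9 controls (1) and 6 cases (2).
labels <- c(rep(1, 9), rep(2, 6))
n.fold <- 3

# Assign fold numbers within each class so every fold keeps the class ratio.
fold <- integer(length(labels))
for (cl in unique(labels)) {
  idx <- which(labels == cl)
  fold[idx] <- sample(rep(seq_len(n.fold), length.out = length(idx)))
}

# In one round, fold k is the test set and the remaining folds are the training set.
table(fold, labels)
```

With stratification every fold receives three controls and two cases, so a small case group such as Relapse is represented in each test set.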
As an example, we run 4 iterations of cross-validation using the following code, for the reversed
similarities only, because the straight similarity gave no significant result:
> source("vignette/VEO.r")
> CV = VEO(expr.mat,
+          yin,
+          seedProbe = ESP,
+          max.rank = n.rev,
+          cross.repeat = 4,
+          cross.outer = 3,
+          stratify = TRUE,
+          CE.method = "pearson",
+          DE.method = "t",
+          pred.method = "PAM")
12121212iteration : 1
Simulating random scores...
0%.......:.........:.........:.........:......100%
-------------------------------------------------
Simulating random scores...
0%.......:.........:.........:.........:......100%
-------------------------------------------------
iteration : 2
Simulating random scores...
0%.......:.........:.........:.........:......100%
-------------------------------------------------
Simulating random scores...
0%.......:.........:.........:.........:......100%
-------------------------------------------------
iteration : 3
Simulating random scores...
0%.......:.........:.........:.........:......100%
-------------------------------------------------
Simulating random scores...
0%.......:.........:.........:.........:......100%
-------------------------------------------------
iteration : 4
Simulating random scores...
0%.......:.........:.........:.........:......100%
-------------------------------------------------
Simulating random scores...
0%.......:.........:.........:.........:......100%
-------------------------------------------------
> names(CV)
 [1] "CoCEPG"         "RevCEPG"        "model"          "pred.res"
 [5] "vote"           "TP"             "FP"             "TN"
 [9] "FN"             "phenotype.anno" "vote.tb"        "block"
> length(CV$CEPG)
[1] 0
> length(CV$RevCEPG)
[1] 4
The above results suggest that no seed gene shows a significant straight correlation with the
phenotype of interest, whereas reversed significant results were recorded for the four iterations (one
matrix per iteration in RevCEPG). We can observe the recall-precision plot and the boxplot of the total accuracy:
> recall = CV$TN/(CV$TN + CV$FP)
> precision = CV$TN/(CV$TN + CV$FN)
> plot(precision, recall, main = "Prediction of CCR")
> accuracy = (CV$TP + CV$TN)/(CV$TP + CV$TN + CV$FP + CV$FN)
> boxplot(accuracy, main = "Total prediction accuracy for Relapse")
>
> recall = CV$TP/(CV$TP + CV$FN)
> precision = CV$TP/(CV$TP + CV$FP)
> plot(precision, recall, main = "Prediction of Relapse")
Figure 1. The box-and-whisker plot of the total prediction accuracy (showing the lower quartile,
median, upper quartile, and largest observation; the whiskers are the 95% and 5% intervals).
Figure 2. The precision-recall graph resulting from the four iterations of cross-validation.
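The recall, precision and accuracy formulas used in the code above can be checked on a single set of made-up confusion counts. The numbers below are illustrative only and do not come from the leukemia data.

```r
# Toy confusion counts for one iteration (case = Relapse, control = CCR).
TP <- 4; FN <- 2   # cases predicted as case / as control
TN <- 20; FP <- 3  # controls predicted as control / as case

recall.case    <- TP / (TP + FN)   # share of cases that were found
precision.case <- TP / (TP + FP)   # share of case calls that were right
accuracy       <- (TP + TN) / (TP + TN + FP + FN)
```

Swapping TP with TN and FP with FN gives the corresponding recall and precision for the control group, which is what the first plot above does for CCR.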
Robust Seed Genes and GEMs
A robust molecular signature is one that repeatedly appears under random sampling[7]. We can identify
the robust biomodule from the seed genes resulting from cross-validation and their corresponding
GEMs, as the following code demonstrates.
The inputs are:
- eset: An ExpressionSet object or a matrix of the expression data to be studied;
- phenotype: A string naming the phenotype of interest for the samples if ‘eset’ is an
ExpressionSet object; otherwise, a vector giving the value of the phenotype of interest for every
sample;
- CV: The result object of the ‘VEO’ function;
- T.robust: The threshold to call a signature “significant” from the ‘vote’ matrix; the default is
the 95% quantile of the highest frequency;
- array.Anno: A character vector giving the annotation for the corresponding probe-sets in ‘eset’.
The outputs are:
- RSeed: A matrix with the robust seed genes associated with the phenotype of interest in rows, and
the sign of their direction of correlation (‘+’ for straight and ‘-’ for reversed) in columns;
- RGEM: A list of matrices with the gene names contributing to each distinct ‘RSeed’. The name of
each list element gives the associated seed gene and the direction of correlation (1 for straight
and 2 for reversed);
- seed.fre: A list of the calling frequencies for each identified ‘RSeed’;
- GEM.fre: A matrix of the calling frequencies for every observed gene in the array corresponding
to each ‘RSeed’.
> source("vignette/robustBiomodule.r")
> PL = clinical[, "Flow_up"]
> names(PL) = rownames(clinical)
> res <- robustBiomodule(expr.mat, phenotype = PL,
+                        CV, T.robust = 0.95,
+                        array.Anno = unlist(xxU133a[rownames(expr.mat)]))
> names(res)
[1] "RSeed"    "RGEM"     "seed.fre" "GEM.fre"
> res$RSeed
            Seed     sign
201697_s_at "DNMT1"  "-"
218457_s_at "DNMT3A" "-"
> res$RGEM
$`201697_s_at~2`
200052_s_at
200072_s_at
200709_at
200783_s_at
"STMN1"
"ILF2"
"HNRPM"
"FKBP1A"
201292_at
201589_at
201664_at
"TOP2A"
"SMC1A"
202107_s_at
"MCM2"
202705_at
"CCNB2"
203422_at
"HNRNPA0"
201970_s_at
"SMC4"
"CKS1B"
202462_s_at
202503_s_at
202580_x_at
"DDX46"
"KIAA0101"
"FOXM1"
202748_at
203046_s_at
"GBP2"
203432_at
203755_at
"TMPO"
"BUB1B"
204126_s_at
204146_at
204162_at
"RAD51AP1"
"NASP"
202626_s_at
"LYN"
203145_at
"TIMELESS"
"POLD1"
"CDC45L"
201897_s_at
201054_at
203213_at
"SPAG5"
203976_s_at
"CDC2"
204026_s_at
"CHAF1A"
"ZWINT"
204240_s_at
"NDC80"
204444_at
"SMC2"
"KIF11"
204494_s_at
204531_s_at
204641_at
204649_at
204695_at
"C15orf39"
"BRCA1"
"NEK2"
"TROAP"
"CDC25A"
204768_s_at
204825_at
204887_s_at
205024_s_at
206102_at
"FEN1"
"MELK"
"PLK4"
206120_at
206364_at
"CD33"
"KIF14"
208939_at
209026_x_at
209053_s_at
209408_at
209642_at
"TUBB"
"WHSC1"
"KIF2C"
"BUB1"
210001_s_at
210052_s_at
210225_x_at
210911_at
"LILRB3"
"ID2B"
"SEPHS1"
209735_at
"ABCG2"
212017_at
"LOC130074"
213110_s_at
"COL4A5"
"RAD51"
"GINS1"
207828_s_at
208808_s_at
208881_x_at
"CENPF"
"HMGB2"
"SOCS1"
212063_at
"PTGES3"
"TPX2"
212949_at
213007_at
"NCAPH"
"IDI1"
213083_at
"FANCI"
"SLC35D2"
213599_at
213762_x_at
214710_s_at
214743_at
"OIP5"
"RBMX"
"CCNB1"
"CUX1"
216761_at
216874_at
217025_s_at
215239_x_at
216026_s_at
"ZNF273"
"POLE"
217118_s_at
217547_x_at
"C22orf9"
"ZNF675"
"PRC1"
"ASF1B"
"NFYB"
218349_s_at
218600_at
218741_at
218755_at
218782_s_at
"ZWILCH"
219306_at
"KIF15"
"LIMD2"
"RAB33A" "DKFZp686O1327"
218009_s_at
"CENPM"
218115_at
"KIF20A"
219620_x_at
219789_at
221004_s_at
"C9orf167"
"NPR3"
"ITM2C"
"DBN1"
218127_at
"ATAD2"
221392_at
"TAS2R4"
221505_at
221685_s_at
"ANP32E"
222077_s_at
"CCDC99"
"RACGAP1"
$`218457_s_at~2`
200052_s_at
200072_s_at
200709_at
200783_s_at
"STMN1"
"ILF2"
"HNRPM"
"FKBP1A"
201292_at
201589_at
201664_at
"TOP2A"
"SMC1A"
202107_s_at
"MCM2"
202705_at
"CCNB2"
203422_at
"HNRNPA0"
201970_s_at
"SMC4"
"CKS1B"
202462_s_at
202503_s_at
202580_x_at
"DDX46"
"KIAA0101"
"FOXM1"
202748_at
203046_s_at
"GBP2"
203432_at
203755_at
"TMPO"
"BUB1B"
204126_s_at
204146_at
204162_at
"RAD51AP1"
"NASP"
202626_s_at
"LYN"
203145_at
"TIMELESS"
"POLD1"
"CDC45L"
201897_s_at
201054_at
203213_at
"SPAG5"
203976_s_at
"CDC2"
204026_s_at
"CHAF1A"
"ZWINT"
204240_s_at
"NDC80"
204444_at
"SMC2"
"KIF11"
204494_s_at
204531_s_at
204641_at
204649_at
204695_at
"C15orf39"
"BRCA1"
"NEK2"
"TROAP"
"CDC25A"
204768_s_at
204825_at
204887_s_at
205024_s_at
206102_at
"FEN1"
"MELK"
"PLK4"
206120_at
206364_at
"CD33"
"KIF14"
208939_at
209026_x_at
209053_s_at
209408_at
209642_at
"TUBB"
"WHSC1"
"KIF2C"
"BUB1"
210001_s_at
210052_s_at
210225_x_at
210911_at
"LILRB3"
"ID2B"
"SEPHS1"
209735_at
"ABCG2"
212017_at
"LOC130074"
213110_s_at
"COL4A5"
"RAD51"
"GINS1"
207828_s_at
208808_s_at
208881_x_at
"CENPF"
"HMGB2"
"SOCS1"
212063_at
"PTGES3"
"TPX2"
212949_at
213007_at
"NCAPH"
"IDI1"
213083_at
"FANCI"
"SLC35D2"
213599_at
213762_x_at
214710_s_at
"OIP5"
"RBMX"
"CCNB1"
"CUX1"
216761_at
216874_at
217025_s_at
215239_x_at
216026_s_at
"ZNF273"
"POLE"
217118_s_at
217547_x_at
"C22orf9"
"ZNF675"
214743_at
"RAB33A" "DKFZp686O1327"
218009_s_at
"PRC1"
218115_at
"ASF1B"
"DBN1"
218127_at
"NFYB"
218349_s_at
218600_at
"ZWILCH"
219306_at
"LIMD2"
218741_at
218755_at
"CENPM"
"KIF20A"
219620_x_at
219789_at
221004_s_at
"KIF15"
"C9orf167"
"NPR3"
"ITM2C"
221505_at
221685_s_at
222077_s_at
"ANP32E"
"CCDC99"
218782_s_at
"ATAD2"
221392_at
"TAS2R4"
"RACGAP1"
> hist(res$GEM.fre)
Figure 3. The histogram of the calling frequency for GEMs. As an example, this plot is based on the
4 iterations of cross-validation run above in this vignette.
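Selecting robust genes from calling frequencies can be sketched with a quantile cutoff. This is one possible reading of the T.robust = 0.95 setting, not the code of robustBiomodule; the frequency vector `freq` is made up.

```r
# Toy calling frequencies: how often each gene was selected over the iterations.
freq <- c(g1 = 100, g2 = 97, g3 = 60, g4 = 12, g5 = 3,
          g6 = 0,   g7 = 45, g8 = 88, g9 = 5,  g10 = 1)

# Cutoff at the 95% quantile of the observed frequencies
# (one possible reading of the T.robust = 0.95 setting).
cutoff <- quantile(freq, probs = 0.95)

robust <- names(freq)[freq >= cutoff]
```

With only four iterations, as in this vignette's example, many genes tie at the highest frequency, so a meaningful cutoff needs a larger number of iterations such as the one hundred suggested earlier.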
Reference
[1] Ross ME, Zhou X, Song G, Shurtleff SA, Girtman K, Williams WK, Liu HC, Mahfouz R, Raimondi SC,
Lenny N, Patel A, Downing JR. Classification of pediatric acute lymphoblastic leukemia by gene
expression profiling. Blood 2003;102: 2951-9.
[2] Huber W, von Heydebreck A, Sultmann H, Poustka A, Vingron M. Variance stabilization applied
to microarray data calibration and to the quantification of differential expression. Bioinformatics
2002;18 Suppl 1: S96-104.
[3] von Heydebreck A, Huber W, Gentleman R. Differential expression with the Bioconductor
Project. Bioconductor Project Working Papers 2004.
[4] Scheid S, Spang R. twilight; a Bioconductor package for estimating the local false discovery rate.
Bioinformatics 2005;21: 2921-2.
[5] Tusher VG, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing
radiation response. Proc Natl Acad Sci U S A 2001;98: 5116-21.
[6] Yang X, Bentink S, Scheid S, Spang R. Similarities of ordered gene lists. J Bioinform Comput Biol
2006;4: 693-708.
[7] Michiels S, Koscielny S, Hill C. Prediction of cancer outcome with microarrays: a multiple random
validation strategy. Lancet 2005;365: 488-92.
[8] Lottaz C, Yang X, Scheid S, Spang R. OrderedList--a bioconductor package for detecting similarity
in ordered gene lists. Bioinformatics 2006;22: 2315-6.