Chromosomal Clustering of Periodically Expressed Genes
in Plasmodium falciparum
Pingzhao Hu1, Celia M.T. Greenwood1,2, Cyr Emile M’lan3 and Joseph Beyene1,2
1Hospital for Sick Children Research Institute
2Department of Public Health Sciences, University of Toronto
3Department of Statistics, University of Connecticut, Storrs, CT
555 University Avenue, Toronto ON, M5G 1X8
(416) 813-7654 x2302
joseph@utstat.toronto.edu
ABSTRACT
Identification of periodically expressed genes has been widely
studied, but understanding how periodically expressed genes are
distributed along chromosomes is largely unexplored. In this
study we focused on the detection of chromosomal clusters of
periodically expressed genes across the stages of the intraerythrocytic
developmental cycle (IDC) of Plasmodium falciparum. The
DNA microarray data were provided by the organizers of the
Critical Assessment of Microarray Data Analysis (CAMDA)
2004 competition. To this end, we first applied a multiple linear
regression model containing sinusoidal curves to identify
periodically expressed oligonucleotides. Setting the proportion
of variance explained (PVE) at ≥ 0.7, a list of 2949 periodically
expressed oligonucleotides (2204 genes) with a false discovery
rate (FDR) of 3 × 10^-5 was selected. Subsequently, a supervised
support vector machine (SVM) method was used to assign these
oligonucleotides to four IDC stages with a confidence level of at
least 80%. Furthermore, genes in each stage were mapped onto
the 14 chromosomes of the Plasmodium falciparum genome. A
total of 312 chromosomal clusters were identified. Finally, we
performed a brief analysis of gene functions in these clusters.
Our findings revealed that the expression of periodically
regulated genes is coordinated locally on chromosomes where
there are clusters of genes within the same stage, suggesting
cis-regulation.
Keywords
Asexual intraerythrocytic development cycle, multiple linear
regression model, support vector machine, class probability,
chromosomal clusters
1. INTRODUCTION
Plasmodium falciparum is the organism which causes human
malaria. The 22.8 Mb genome of P.falciparum is comprised of
14 linear chromosomes. Understanding the genome of
P.falciparum will hopefully provide a foundation for prevention
and treatment of the disease. The complete P.falciparum life
cycle includes three major developmental stages: the mosquito,
liver and blood stages. The periodic nature of genes expressed in
one of these stages, which has been called the asexual
intraerythrocytic development cycle (IDC), has been
investigated in detail by Bozdech et al. [1]. Genes sharing this
periodicity are likely to be co-regulated. Previous studies on
Saccharomyces cerevisiae [2], Homo sapiens [3] and
Caenorhabditis elegans [4] have demonstrated that co-regulated
genes were clustered together on chromosomes. Proteomic
analysis of the three developmental stages of P.falciparum also
revealed the presence of chromosomal clusters encoding
co-expressed proteins [5]. The focus of this study is on the
association between chromosomal location and the periodic
nature of genes expressed in IDC using the dataset of Bozdech
et al. [1].
2. METHODS
2.1 Data Source and Preprocessing
The organizers of CAMDA 2004 provided three datasets: the
complete raw data set, a quality controlled data set and an
overview data set. In this study we used the quality-controlled
data set to simplify the preprocessing and to facilitate
comparisons with the original work on this dataset [1]. The data
set includes 5080 oligonucleotides measured at 46 time points
spanning 48 hours. The data was originally normalized using the
NOMAD (Normalization of MicroArray Data) database system.
243 of the oligonucleotides had a missing value at one or more
time points. We imputed missing values in the dataset using the
10-nearest neighbor averaging method [6]. This imputation
method can be summarized as follows: if oligonucleotide x has
one missing value at time point j, the approach first finds 10
other oligonucleotides that have a value measured at time point
j, with expression most similar to x at all other 45 time points.
Then the weighted average of expression values for time point j
from these 10 similar oligonucleotides is used as an estimate of
the missing intensity value in oligonucleotide x. The inverse of
the Euclidean distance was used to weight the average.
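The imputation step described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in the study: the function name `knn_impute` is hypothetical, missing values are assumed to be encoded as NaN, and neighbours are restricted to rows that are measured at every time point observed for the row being imputed.

```python
import numpy as np

def knn_impute(X, k=10):
    """Fill NaNs in X (rows: oligonucleotides, columns: time points)."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    for i, j in zip(*np.where(np.isnan(X))):
        observed = ~np.isnan(X[i])  # time points measured for row i
        # Candidate neighbours: measured at j and at every time point
        # observed in row i (an assumption made for this sketch).
        usable = ~np.isnan(X[:, j]) & ~np.isnan(X[:, observed]).any(axis=1)
        cand = np.where(usable)[0]
        dist = np.sqrt(((X[cand][:, observed] - X[i, observed]) ** 2).sum(axis=1))
        nearest = np.argsort(dist)[:k]
        # Inverse Euclidean distance weights; the floor avoids division by zero.
        weights = 1.0 / np.maximum(dist[nearest], 1e-12)
        filled[i, j] = np.sum(weights * X[cand[nearest], j]) / np.sum(weights)
    return filled
```

The small floor on the distance is a choice made here so that an exact duplicate profile does not produce an infinite weight; the original text does not say how such ties were handled.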
2.2 Identification of Periodically Expressed
Oligonucleotides
In order to objectively analyze periodic gene expression
measurements, several studies [7][8] calculated a numerical
score for quantifying the periodicity of the expression profile of
each gene based on Fourier analysis. Here we applied standard
statistical methods [9], consisting of multiple linear regression,
$R^2$ scores and F-statistics, to identify periodically expressed
genes. Since many genes were measured by more than one
oligonucleotide, we fitted a linear model for each
oligonucleotide. For oligonucleotide i at time point j, the
variation in log expression ratios over the course of the study
was modeled as a linear combination of sine-cosine waves as
follows:

$$ y_{ij} = b_{0i} + b_{1i} \cos(2\pi t_j / T) + b_{2i} \sin(2\pi t_j / T) + e_{ij}, \qquad (1) $$

where T is the period for the cyclically expressed
oligonucleotides. We estimated the period by minimizing the
sum of squared errors (SSE) of least squares fits of known
periodically expressed oligonucleotide profiles to model (1), for
different values of T.
Equation (1) is a standard multiple linear regression model, so
the regression parameters $b_{0i}$, $b_{1i}$, $b_{2i}$ can be estimated using the
least squares method, for fixed T. In order to evaluate whether
an oligonucleotide is periodically expressed in the
intraerythrocytic development cycle, the goodness-of-fit of the
linear model for each oligonucleotide’s expression profile was
measured by $R^2$. The $R^2$ value quantifies the “proportion of
variance explained (PVE)” by the periodicity. The PVE falls
between zero and one, and values close to one indicate greater
periodicity for a given T. The statistical significance of each
$R^2$ can be determined by the F-statistic [9],

$$ F = \frac{(J - p) R^2}{(p - 1)(1 - R^2)}. $$

Here J is the number of time points and p=3 is the number of
parameters in the linear model.
Selecting periodically expressed oligonucleotides based on
F-statistics involves multiple testing as described by Dudoit et al.
[10]. The false discovery rate (FDR) [11] has become a popular
error measure for controlling the false positive and false
negative errors in this situation. We applied Taylor et al.’s
algorithm [12], a column-wise permutation-based method (that
is, we permuted the times in the data), to calculate the FDR. In
their method, T-statistics were used, but here we used our
F-statistics. The details of this method are as follows:
1. Create B column-wise permutations, producing F-statistics
$F_{1,b}, \ldots, F_{I,b}$, for oligonucleotides $i = 1, 2, \ldots, I$ and
permutations $b = 1, 2, \ldots, B$.
2. Let $F_{i,0}$ be the F-statistic for oligonucleotide i in
the original data. For a cutpoint $F_c$, let
$$ \hat{R} = \sum_{i=1}^{I} I(|F_{i,0}| > F_c), $$
and let
$$ \hat{V} = \frac{1}{B} \sum_{b=1}^{B} \sum_{i=1}^{I} I(|F_{i,b}| > F_c). $$
3. Estimate the FDR by $\pi_0 \hat{V} / \hat{R}$,
where $\pi_0$ is the true proportion of oligonucleotides without
periodicity among all the oligonucleotides I, as suggested
by Efron et al. [13] and Storey [14]. We followed Storey
[14] and Taylor et al.’s methods [12] to calculate $\pi_0$.
Statistically significant oligonucleotides were chosen by
comparing the F-statistic $F_{i,0}$ with a given cutpoint $F_c$ at
the estimated FDR.
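The regression fit for model (1) and the permutation estimate of the FDR described in this section can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' code: the function names are hypothetical, `Y` is assumed to be an array with one row per oligonucleotide and one column per time point, and the proportion of non-periodic oligonucleotides is passed in as `pi0` (defaulting to the conservative value 1) rather than estimated as in [12] and [14].

```python
import numpy as np

def f_statistics(Y, t, T):
    """Return (R^2, F) for model (1), one value per row of Y."""
    J, p = len(t), 3
    X = np.column_stack([np.ones(J),
                         np.cos(2 * np.pi * t / T),
                         np.sin(2 * np.pi * t / T)])
    coef, *_ = np.linalg.lstsq(X, Y.T, rcond=None)  # least squares fit per row
    sse = ((Y.T - X @ coef) ** 2).sum(axis=0)
    sst = ((Y - Y.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)
    r2 = 1.0 - sse / sst
    F = (J - p) * r2 / ((p - 1) * (1.0 - r2))
    return r2, F

def permutation_fdr(Y, t, T, f_cut, B=100, pi0=1.0, seed=0):
    """Estimate the FDR at cutpoint f_cut by permuting the time columns."""
    rng = np.random.default_rng(seed)
    _, F0 = f_statistics(Y, t, T)
    R = np.sum(F0 > f_cut)                 # calls in the original data
    V = 0.0
    for _ in range(B):
        perm = rng.permutation(Y.shape[1])  # same column shuffle for all rows
        _, Fb = f_statistics(Y[:, perm], t, T)
        V += np.sum(Fb > f_cut)
    V /= B                                  # average false calls per permutation
    return pi0 * V / R if R > 0 else float("nan")
```

Permuting whole columns keeps the dependence between oligonucleotides intact within each permuted data set, which is the reason for using a single shuffle per permutation rather than shuffling each row separately.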
2.3 Classification of Periodically Expressed
Oligonucleotides
Many studies have used clustering methods to classify genes
into cell cycle phase [7][8]. However, unsupervised
classification methods require an arbitrary specification of the
number of clusters in a dataset, and furthermore cannot use prior
information. In this study, the IDC contains 4 stages, namely,
ring/early trophozoite, trophozoite/early schizont, schizont and
early ring, and a total of 472 oligonucleotides (351 genes) were
known to be expressed in one of these stages [1]. Based on
Table S2 and Figure 2 of the study by Bozdech et al. [1], there are
183, 75, 69 and 24 periodically expressed genes in these four stages,
respectively. Therefore, to classify the oligonucleotides
identified in Section 2.2 into these stages with high confidence
level, we used a pairwise coupling method to solve this multiclass classification problem [15]. This involves estimating class
probabilities for each pair of classes, and then coupling the
estimates together for each oligonucleotide.
We employed SVM with a radial basis function (RBF) kernel as
our base classifier for each pair of classes. SVM is a core
machine learning technique with a strong theoretical basis and
excellent empirical success [16]. It has been widely applied in
handwritten digit recognition [16] and text classification.
Generally speaking, given a periodically expressed
oligonucleotide x, the SVM outputs a decision value f kl for
each pair of classes k and l. While the sign and magnitude of
f kl can be used to determine the class prediction and the
confidence level of that prediction, the SVM decision
value f kl is an uncalibrated value that does not always translate
directly to a probability value useful for estimating confidence.
Platt [17] proposed a parametric model for calibration in which
the class probability $r_{kl}$ for each pair of classes k and l was
estimated based on

$$ \hat{r}_{kl} = \frac{1}{1 + e^{A f_{kl} + B}}, $$

where A and B are estimated by minimizing the negative
log-likelihood function.
A common way to combine pairwise comparison scores $r_{kl}$ is
through a majority voting method described by Friedman [18].
The voting method selects the class label with the most winning
two-class decisions. In our study, however, a confidence level is
required in order to assign a periodically expressed
oligonucleotide into a stage. Hastie and Tibshirani [15]
proposed an algorithm to calculate coupled class probabilities
for this task. For the periodically expressed oligonucleotide x,
the pairwise calibrated SVM computes estimates $\hat{r}_{kl}$ for
classes $k, l = 1, \ldots, 4$, $k \neq l$. Assume that $n_{kl}$ is the
number of oligonucleotides in the training set for the classifier
trained on classes k and l. We wish to estimate $\{p_k\}_{k=1}^{4}$,
where $p_k = p(\mathrm{class} = k \mid x)$. The algorithm of Hastie and
Tibshirani works as follows:
(1) Start with some initial $\hat{p}_k > 0$, and corresponding
$\hat{u}_{kl} = \hat{p}_k / (\hat{p}_k + \hat{p}_l)$.
(2) Repeat ($k = 1, \ldots, 4, 1, \ldots$) until convergence:
$$ \hat{p}_k \leftarrow \hat{p}_k \cdot \frac{\sum_{l \neq k} n_{kl} \hat{r}_{kl}}{\sum_{l \neq k} n_{kl} \hat{u}_{kl}}, $$
renormalize $\hat{p} \leftarrow \hat{p} / \sum_{k=1}^{4} \hat{p}_k$, where
$\hat{p} = (\hat{p}_1, \hat{p}_2, \hat{p}_3, \hat{p}_4)$, and recompute the $\hat{u}_{kl}$.
(3) The final class prediction y is based on the maximum,
$y = \arg\max_k (\hat{p}_k)$, and so we assign $\hat{p}_y$ as
the probability that the oligonucleotide x falls into the
predicted stage $y \in \{1, 2, 3, 4\}$.
A total of 472 oligonucleotides with known stages were used as
the training data for this algorithm. After training, class
predictions were estimated for all periodically expressed
oligonucleotides identified using the methods described in
Section 2.2 that were not included in the training data. We
assigned the periodically expressed oligonucleotide x to stage y
if the maximum probability $\hat{p}_y$ was greater than or equal to 0.8.
When different oligonucleotides from the same gene were
assigned to more than one stage, we assigned the gene to the
stage with the highest confidence estimate $\hat{p}_y$.
2.4 Clustering of Periodically Expressed
Genes on Chromosomes
We used www.PlasmoDB.org to obtain the physical locations
and ordering of all genes, and marked the stage assigned to each
gene (if any). Then we examined the patterns of periodically
expressed, stage-assigned genes along the 14 chromosomes.
Using the chromosomal positions obtained above, we defined a
cluster as two or more consecutive loci whose expression
patterns were matched to the same stage. Based on this
definition, we can identify chromosomal clusters for each stage
for a given cluster size.
3. RESULTS
3.1 Estimation of the Cycle of Periodically
Expressed Oligonucleotides
We used the 472 oligonucleotides (351 genes) whose staging is
known to estimate the period T by fitting equation (1). Bozdech
et al. [1] found that the majority of gene profiles exhibited an
overall expression period of 0.75-1.5 cycles per 48h. For this
reason we fitted equation (1) over a range of 100 T values
evenly spaced from 1 hour to 100 hours. As can be seen in
Figure 1, the sum of squared errors over the 351 genes was
minimized at 50 hours. Therefore, we selected $\hat{T} = 50$ for
subsequent analysis.
Figure 1. The relationship between the SSE and period
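The period estimate of Section 3.1, a grid search over T that minimizes the total SSE of least squares fits of equation (1), can be sketched as below. The function names are hypothetical, and the grid `np.arange(1, 101)` is one reading of "100 T values evenly spaced from 1 hour to 100 hours"; `Y` is assumed to hold one profile per row for the oligonucleotides of known stage.

```python
import numpy as np

def sse_for_period(Y, t, T):
    """Total SSE of least squares fits of equation (1) to every row of Y."""
    X = np.column_stack([np.ones(len(t)),
                         np.cos(2 * np.pi * t / T),
                         np.sin(2 * np.pi * t / T)])
    coef, *_ = np.linalg.lstsq(X, Y.T, rcond=None)
    return float(((Y.T - X @ coef) ** 2).sum())

def estimate_period(Y, t, periods):
    """Return the period on the grid with the smallest total SSE, plus the SSE curve."""
    sse = np.array([sse_for_period(Y, t, T) for T in periods])
    return periods[int(np.argmin(sse))], sse

# Example grid, assuming hourly steps from 1 to 100 hours.
periods = np.arange(1, 101)
```

Plotting the returned `sse` array against `periods` would give a curve of the kind summarized in Figure 1.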
3.2 Identification of Periodically Expressed
Oligonucleotides
For all the remaining oligonucleotides whose staging is
unknown, we fit equation (1) using the least-squares method,
and calculated the PVE and corresponding F-statistics. We
defined an oligonucleotide as periodically expressed if its PVE
was at least 0.7, which corresponds to an F-statistic of 50.2. There
were 2949 oligonucleotides (2204 genes) which passed this
filtering criterion and demonstrated periodicity. Figure 2 shows
examples of expression profiles for 4 genes, PFL2355w,
PFA0285c, PFC0185w and PF11_0231. These genes were
selected because they represent four distinct sine-cosine wave
profiles in the dataset. The first peaks of the sine-cosine wave
forms of these four genes were about 15 hours, 36 hours, 43
hours and 5 hours, respectively.
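The F-statistic cutoff quoted above can be checked against the formula in Section 2.2, taking J = 46 time points, p = 3 parameters and a PVE of 0.7:

```latex
F = \frac{(J - p)\,R^2}{(p - 1)(1 - R^2)}
  = \frac{43 \times 0.7}{2 \times 0.3}
  = \frac{30.1}{0.6}
  \approx 50.2
```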
We observed that most of the oligonucleotides which passed the
PVE filtering criterion had one of these four profiles. This
suggested that there were four dominant expression patterns in
the selected periodically expressed oligonucleotides.
Figure 2. Example expression profiles for 4 genes, shown
with a least-squares fit of the data (curved line)
In order to verify whether random variation can produce marked
systematic patterns of expression, we performed 10,000
permutations of the data over the time points, and refit equation
(1) to the permuted datasets. The estimated FDR was only
0.00003, strongly suggesting that the randomized datasets do
not demonstrate periodicity.
3.3 Classification of Stage Group for
Periodically Expressed Oligonucleotides
As we stated before, there are 472 oligonucleotides (351 genes)
whose staging was known. These were used as the training
samples in the SVM. Excluding these oligonucleotides, we had
2545 oligonucleotides (1918 genes) for testing. (It should be
noted that some of the oligonucleotides in the training sample
had PVE values less than 0.7, which explains why the number
of oligonucleotides in the combined training and testing samples
does not equal the number of periodically expressed
oligonucleotides selected). We built pairwise binary SVM
classifiers with the RBF kernel for the four stages, and generated
6 predictors. A 10-fold cross-validation scheme was used to
evaluate each binary predictor, and the overall cross validation
error was 3.4%. For the 1918 periodically expressed genes of
unknown stage, we assigned 718 genes (923 oligonucleotides)
into ring/early trophozoite stage, 624 genes (835
oligonucleotides) into trophozoite/early schizont stage, 141
genes (186 oligonucleotides) into schizont stage and 167 genes
(199 oligonucleotides) into early ring stage, each with an
estimated class probability
p̂ y
of at least 0.8. Another 268
genes that had class probabilities less than 0.8 were not assigned
into any one of these four stages.
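The calibration and coupling steps that produce the class probabilities used for these assignments can be illustrated with a short NumPy sketch. It is a simplified rendering of the Platt sigmoid [17] and the pairwise coupling iteration [15] as described in Section 2.3, under the assumptions that `r[k, l]` holds the calibrated estimate for class k against class l, that `n[k, l]` holds the corresponding training set sizes, and that the function names are hypothetical. Training the SVMs and fitting A and B are not shown.

```python
import numpy as np

def platt_probability(f, A, B):
    """Map an SVM decision value f to a probability with fitted A and B."""
    return 1.0 / (1.0 + np.exp(A * f + B))

def pairwise_coupling(r, n, tol=1e-10, max_iter=1000):
    """Couple pairwise estimates r[k, l] into class probabilities p[k]."""
    K = r.shape[0]
    p = np.full(K, 1.0 / K)              # step (1): positive starting values
    off = ~np.eye(K, dtype=bool)         # mask selecting l != k
    for _ in range(max_iter):
        p_old = p.copy()
        for k in range(K):               # step (2): cycle through the classes
            u = p[k] / (p[k] + p)        # u_kl for the current estimates
            num = np.sum(n[k, off[k]] * r[k, off[k]])
            den = np.sum(n[k, off[k]] * u[off[k]])
            p[k] *= num / den
            p /= p.sum()                 # renormalize after each update
        if np.max(np.abs(p - p_old)) < tol:
            break
    return p                             # step (3): predict argmax of p
```

An oligonucleotide would then be assigned to stage `np.argmax(p)` only when `p.max()` is at least 0.8, matching the threshold used in the text.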
Figure 3. Heat map of periodically expressed genes predicted
in four stages of IDC
Figure 3 shows the stageogram of the IDC transcriptome based
on the 1650 classified genes which had class probability at least
0.8 and the 351 training set genes for which stage was known
(class probability 1). First, the genes were ordered by predicted
stage; from top to bottom the ordering is ring/early trophozoite,
trophozoite/early schizont, schizont and early ring, respectively.
Secondly, within each stage genes were sorted by probability in
descending order.
Our IDC stageogram demonstrates clear boundaries among these
four stages, unlike Bozdech’s study [1] where the stageogram
showed a cascade of continuous expression. By not classifying
genes with low PVE or low class probabilities into the four
stages, the genes in our stageogram were highly selected for
clear and consistent periodic signatures.
We calculated a meta-expression profile for each stage over the
46 time points by averaging the expression values of all genes
predicted to be in the stage. Sine-cosine curves were then fitted
to the meta-expression profiles using equation (1). As can be
seen in Figure 4, the meta-expression profiles of each stage are
very similar to the profiles of the 4 representative genes shown
in Figure 2. Our proposed method clearly identifies
stage-specific patterns.
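The meta-expression profile and its fitted curve can be sketched as follows, together with the time of the first peak implied by the fitted coefficients. The peak formula comes from rewriting b1 cos(wt) + b2 sin(wt) as a single cosine with phase atan2(b2, b1); the function names are hypothetical and `Y` is assumed to hold the profiles of all genes predicted to be in one stage.

```python
import numpy as np

def meta_profile(Y):
    """Average expression over all genes (rows) at each time point."""
    return np.asarray(Y, dtype=float).mean(axis=0)

def fit_harmonic(y, t, T):
    """Least squares estimates (b0, b1, b2) of equation (1) for one profile."""
    X = np.column_stack([np.ones(len(t)),
                         np.cos(2 * np.pi * t / T),
                         np.sin(2 * np.pi * t / T)])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b

def peak_time(b, T):
    """Time in [0, T) at which the fitted sine-cosine curve is largest."""
    phase = np.arctan2(b[2], b[1])
    return (phase * T / (2 * np.pi)) % T
```

Applied to each stage in turn with T = 50, this gives one fitted curve and one peak time per stage, which is the kind of summary shown in Figure 4.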
3.4 Chromosomal Clustering
In the remaining analysis, we focused on the 351 genes with
known staging, together with the 1650 genes whose estimated
class probabilities were at least 0.8, for a total of 2001 genes.
Figure 4. Meta-gene expression profiles of 4 stages. Panels:
Average Gene Expression Profile of Ring/Early Trophozoite Stage,
Trophozoite/Early Schizont Stage, Schizont Stage and Early Ring
Stage; x-axis: Hours, y-axis: log2(Cy5/Cy3).
As stated before, a chromosomal cluster is defined as two or
more adjacent loci that are classified to the same stage. In order
to determine whether gene clustering exists in the P.falciparum
genome, we mapped the periodically expressed genes onto the
14 chromosomes in a stage-dependent manner. Table 1 shows
the number of clusters on each chromosome of different cluster
sizes. A total of 238 clusters containing 2 loci, 55 clusters
containing 3 loci, 15 clusters containing 4 loci and 4 clusters
containing 5 loci were identified. It should be noted that since
the chromosomal clusters were defined in a stage-dependent
way, the number of clusters for each chromosome and cluster
size in Table 1 is the total number of clusters over all four IDC
stages. For example, on chromosome 1 for cluster size 2, we
identified 2 clusters at trophozoite/early schizont stage, 1 cluster
at schizont stage and 1 cluster at the early ring stage, so the
total number of clusters on this chromosome is 4.
Table 1. Number of clusters on each chromosome with
different cluster sizes

              # of adjacent loci predicted to belong to the
              same stage in a cluster
Chromosome      2      3      4      5
Chr-1           4      1
Chr-2          15      2      1      1
Chr-3          14      2      2      1
Chr-4           9      3      2      1
Chr-5          19      1      1
Chr-6          13
Chr-7          15      5      1
Chr-8          14      1
Chr-9          16      2
Chr-10         12      5      1
Chr-11         13      7      1
Chr-12         18      3      1
Chr-13         33     15      2
Chr-14         43      8      3
total         238     55     15      4

Total number of clusters: 312
(The size-5 column shows three of its four clusters; the chromosome
of the fourth could not be placed in a row.)
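The cluster definition used in this section (two or more adjacent loci assigned to the same stage) amounts to finding runs of equal stage labels along each chromosome. A minimal sketch follows, assuming that `stages` lists every gene on one chromosome in physical order, that genes without a stage assignment are marked `None` and break a run, and that each maximal run is counted once at its full size. These conventions and the function names are assumptions made for illustration.

```python
from collections import Counter
from itertools import groupby

def find_clusters(stages, min_size=2):
    """Return (stage, start_index, size) for each maximal run of equal stages."""
    clusters = []
    pos = 0
    for stage, run in groupby(stages):
        size = len(list(run))
        if stage is not None and size >= min_size:
            clusters.append((stage, pos, size))
        pos += size
    return clusters

def cluster_size_table(clusters):
    """Count clusters by size, summed over all stages (one row of Table 1)."""
    return dict(Counter(size for _, _, size in clusters))
```

For example, `find_clusters([1, 1, None, 2, 2, 2])` reports one stage-1 cluster of size 2 starting at index 0 and one stage-2 cluster of size 3 starting at index 3.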
Figure 5 shows a whole genome view of the 74 large clusters
(where 3 or more adjacent genes were mapped to the same
stage). Blue, yellow, green and red colors represent clusters
identified at ring/early trophozoite, trophozoite/early schizont,
schizont and early ring stages, respectively; circle, diamond and
triangle symbols denote cluster sizes from 3 to 5, respectively.
It can be seen that most large clusters were identified at
ring/early trophozoite and trophozoite/early schizont stages with
cluster size 3.
In order to evaluate whether patterns of clustering similar to
those observed in Table 1 could occur by chance, we performed
a permutation analysis. For each chromosome, we randomly
permuted the order of all the genes. Holding the number of
periodically-expressed genes fixed, together with the number of
genes assigned to each of the four stages, we randomly assigned
these outcomes to the re-ordered genes and counted the number
of clusters observed. Figure 6 illustrates how the permutations
were performed.
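The permutation procedure described above can be sketched as follows. Shuffling the list of stage labels over the gene positions keeps the number of genes in each stage (and the number of unassigned genes) fixed, which is how the description is read here. The function names are hypothetical, `None` marks genes without a stage, and the empirical significance is taken as the fraction of permutations with at least as many clusters as observed, which is an assumption about how the counts were compared.

```python
import random
from itertools import groupby

def count_clusters(stages, stage, size):
    """Number of maximal runs of exactly `size` genes assigned to `stage`."""
    return sum(1 for s, run in groupby(stages)
               if s == stage and len(list(run)) == size)

def permutation_counts(stages, stage, size, n_perm=10000, seed=0):
    """Cluster counts for `stage` and `size` in shuffled copies of the labels."""
    rng = random.Random(seed)
    labels = list(stages)
    counts = []
    for _ in range(n_perm):
        rng.shuffle(labels)  # reassign the same labels to random positions
        counts.append(count_clusters(labels, stage, size))
    return counts

def empirical_p(observed, counts):
    """Fraction of permutations with at least `observed` clusters."""
    return sum(c >= observed for c in counts) / len(counts)
```

With three permutations of which one shows a cluster, `empirical_p(1, [0, 0, 1])` gives 1/3, the same kind of ratio quoted in the caption of Figure 6.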
Figure 5. Whole chromosome view of 74 large clusters
distributed on 14 chromosomes
For chromosome 14 (the longest chromosome), there are 787
genes, 160, 107, 24 and 35 of which were assigned to stages
1-4, respectively. We found a total of 26, 17, 0 and 0 clusters of
size 2 for stages 1-4 in the 10,000 permuted data sets, compared
to 21, 19, 1 and 2 clusters of size 2 for stages 1-4 in the original data.
No single permutation had more than 5 clusters of size 2 for all
four stages. For larger clusters, the difference between the
original and permuted data is even more dramatic. For other
chromosomes, results were qualitatively similar.
Figure 6. Assessment of significance of chromosomal clustering
(panels: Original Data; Permuted Data Sets). On
this fictitious chromosome, there are 30 genes of which 20 are
periodically expressed. Solid blue, yellow, green and red colors
represent periodically expressed genes assigned to stages 1-4 (ring/early
trophozoite, trophozoite/early schizont, schizont and early ring stages,
respectively), solid black symbols represent genes that are periodically
expressed but were not assigned to a particular stage, and open circles
are genes that were not periodically expressed. In the original data
there is one cluster of size 3 in stage “blue” and one yellow cluster of
size 2. Three sample permutations are shown, and one of the
permutations gives a blue cluster of size 2. Hence the empirical
significance would be 0/3 for yellow clusters of size 2 and 1/3 for blue
clusters of size 2.
Our study identified many more clusters than the study of
Bozdech et al. [1]. They defined a chromosomal cluster
as one in which the correlation of 70% of the possible pairs of
adjacent genes on the same chromosome was greater than or
equal to 0.75. Based on this criterion, they found only 37
clusters consisting of 3 genes and 14 clusters consisting of more
than 3 genes. In our study, there were 55 clusters with 3 genes
and 19 clusters consisting of more than 3 genes. Many clusters
detected in their study were also found in our study. For
example, 34 of 51 large clusters (cluster size is 3 or larger)
identified in their study were also found in the 74 large clusters
we detected. The seven genes of the SERA family found on
chromosome 2 [19] were observed in two clusters. The first
SERA gene cluster contained two genes at trophozoite/early
schizont stage and another SERA gene cluster contained 5 genes
at schizont stage.
Based on our study and that of Bozdech et al. [1], it seems that
clusters with 3 or fewer periodically expressed genes within the
same stage were the prevalent form in the P.falciparum genome.
Such clusters make up about 94% (293 of 312) of the chromosomal
clusters detected in our study. It is also interesting to note that
there was no obvious difference in cluster distribution across the
chromosomes; for example, approximately 33% of the clusters
were on the two longest chromosomes, 13 and 14, and these
chromosomes form approximately 35% of the total genome
length.
3.5 Gene Functional Analysis of
Chromosomal clusters
We downloaded genes with GO terms and EC for P.falciparum
strain 3D7 from www.PlasmoDB.org. A total of 3119 loci have
been annotated to 2074 functions. Of 312 clusters that contain
721 loci, 126 of them (40.4%) contained at least two adjacent
loci that have been functionally annotated. More than 90% of
the loci in these 126 clusters have been assigned to at least 2
functions. For the large clusters, where there are 3 or more
adjacent genes in a cluster, only genes in two (SERA gene
cluster and ribosomal protein gene cluster) of the 51 large
clusters were shown to have functional relationship (within
cluster) in Bozdech et al.’s study [1]. However, we found 11
(including the above two) of 74 large clusters contain at least
two loci whose annotation clearly indicates that the genes are
functionally related. For example, we identified an energy gene
cluster (PF10_0121, PF10_0122 and PF10_0123) at stage 1 on
chromosome 10. An RNA processing gene cluster
(MAL13P1.322, MAL13P1.323 and PF13_0340) and an ATP
binding gene cluster (PF13_0177, PF13_0178, PF13_0179 and
PF13_0180) were also found at stage 1 on chromosome 13.
4. DISCUSSION
In this study we proposed a comprehensive procedure with a solid
statistical basis to identify periodically expressed
oligonucleotides, classify these oligonucleotides into different
stages of the intraerythrocytic developmental cycle of
P.falciparum and map them to chromosomes to detect
chromosomal clusters. This method provides a chromosomal
viewpoint of the higher order organization of the genome. We
found that around 60% of the oligonucleotides were periodically
expressed by our definition. Most of them were highly expressed
in ring/early trophozoite and trophozoite/early schizont stages of
IDC. Our study demonstrated that many of the periodically
expressed genes were arranged in clusters with 3 or fewer
periodically expressed genes within the same stage. In addition, our
primary analysis showed that some periodically expressed genes
with similar functions are clustered together. This information
may be useful when annotating the function of the many
unknown gene products in the P.falciparum genome.
It should be noted that there are some concerns in this analysis.
The first concern is that our estimate of the FDR for identifying
periodically expressed oligonucleotides was very small, which
gives rise to concern about underestimation. One possible
reason for a downward bias in FDR is that there were significant
serial correlations in the expression levels of a given
oligonucleotide over time due to the slowly varying nature of
the cell culture. Anderson and Legendre [20] pointed out that
permutation of raw data under the full model will not maintain
type I error close to a nominal level α when there is collinearity
among the independent variables. They suggested that
permutation of residuals under a reduced model is a better
choice in this case. The second concern is that the permutation
analysis to evaluate the significance of the number of
stage-specific chromosomal clusters in our study is still relatively
rough. Some studies explored the use of the cumulative
binomial distribution [2] or the χ² distribution [5] to evaluate
the statistical significance of the number of chromosome-specific
clusters for given cluster sizes. A detailed consideration
of methods for assessing the significance of the number of
stage-specific chromosomal clusters would be an interesting topic for
further investigation.
5. REFERENCES
[1] Bozdech, Z. et al. The transcriptome of the intraerythrocytic
developmental cycle of Plasmodium falciparum. PLoS
Biology, 1, 1-16, 2003.
[2] Cohen, B.A., Mitra, R.D., Hughes, J.D. and Church, G.M.
A computational analysis of whole-genome expression data
reveals chromosomal domains of gene expression. Nature
Genetics, 26, 183-186, 2000.
[3] Caron, H. et al. The human transcriptome map: clustering
of highly expressed genes in chromosomal domains.
Science, 291, 1289-1292, 2001.
[4] Roy, P.J. et al. Chromosomal clustering of
muscle-expressed genes in Caenorhabditis elegans. Nature, 418,
975-979, 2002.
[5] Florens, L. et al. A proteomic view of the Plasmodium
falciparum life cycle. Nature, 419, 520-526, 2002.
[6] Troyanskaya, O. et al. Missing value estimation methods
for DNA microarrays. Bioinformatics, 17, 520-525, 2001.
[7] Spellman, P.T. et al. Comprehensive identification of
cell cycle-regulated genes of the yeast Saccharomyces
cerevisiae by microarray hybridization. Molecular Biology
of the Cell, 9, 3273-3297, 1998.
[8] Whitfield, M.L. et al. Identification of genes periodically
expressed in the human cell cycle and their expression in
tumors. Molecular Biology of the Cell, 13, 1977-2000,
2002.
[9] Booth, J.G. Clustering periodically expressed genes using
microarray data: a statistical analysis of the yeast cell
cycle data. University of Florida, Statistics Department
Technical Report. 2003.
[10] Dudoit, S., Shaffer, J.P. and Boldrick, J.C. Multiple
hypothesis testing in microarray experiments. Statistical
Science, 18, 71-103, 2003.
[11] Benjamini, Y. and Hochberg, Y. Controlling the false
discovery rate: a practical and powerful approach to
multiple testing. Journal of the Royal Statistical Society B,
57, 289-300, 1995.
[12] Taylor, J., Tibshirani, R. and Efron, B. The “Miss rate” for
the analysis of gene expression data. Technical Report,
Department of Statistics, Stanford University,
http://www-stat.stanford.edu/~tibs/ftp/miss.pdf, 2004.
[13] Efron, B., Tibshirani, R. and Tusher, V. Empirical Bayes
analysis of a microarray experiment. Journal of the
American Statistical Association, 96, 1151-1160, 2001.
[14] Storey, J. A direct approach to false discovery rate. Journal
of the Royal Statistical Society B, 64, 479-498, 2002.
[15] Hastie, T. and Tibshirani, R. Classification by pairwise
coupling. The Annals of Statistics, 26, 451-471, 1998.
[16] Vapnik, V. Statistical learning theory. Wiley, 1998.
[17] Platt, J. Probabilistic outputs for support vector machines
and comparison to regularized likelihood methods.
Advances in Large Margin Classifiers, A. Smola, P.
Bartlett, B. Schoelkopf and D. Schuurmans, Eds.
Cambridge, MA: MIT Press, 2000.
[18] Friedman, J. Another approach to polychotomous
classification. Stanford University, Statistics Department
Technical Report. 1996.
[19] Miller, S.K. et al. A subset of Plasmodium falciparum
SERA genes are expressed and appear to play an important
role in the erythrocytic cycle. Journal of Biological
Chemistry, 277, 47524-47532, 2002.
[20] Anderson, M.J. and Legendre, P. An empirical comparison
of permutation methods for tests of partial regression
coefficients in a linear model. Journal of Statistical
Computation and Simulation, 62, 271-303, 1999.