Supplemental Material

Supplementary Methods
SBIME first performs an ANOVA (analysis of variance) on each gene and computes the associated
p-value. The variables are the expression data for each gene, such as the log(ratio), the signal
intensity, or any other value that represents expression levels. The factor studied is a biological
criterion defined by the user, for instance the time points of a kinetics study or the subtypes of a
particular cancer. Genes showing the lowest variance within each biological group and the most
significant difference of means between groups have the lowest ANOVA p-values.
Second, SBIME retrieves each gene present in the data set from the annotation sources and stores
its p-value. These sources, which are automatically updated on a regular basis, are annotation files
from Gene Ontology, regulatory networks from BioCarta, metabolic pathways from KEGG,
chromosomal localization downloaded from [ftp://ftp1.nci.nih.gov/pub/CGAP/], and protein
domains from PFAM. Any category (pathway, GO annotation, domain, etc.) that has no
corresponding gene on the chip is eliminated from the rest of the study. For each category, SBIME
then counts the number of genes found on the chip and the number of those genes with a p-value
lower than the threshold fixed by the user, and expresses the latter as a percentage of the former
(Ps). With the proportion of differentially expressed genes calculated for each category, the next
step is to assess the significance of these results. There are two ways of doing this:
The Z-score approach: For each functional category containing X genes on the chip, SBIME
randomly selects X genes from the entire data set and computes the percentage of these X genes
with a p-value lower than the established threshold (Pr). This operation is repeated N times (the
number of iterations N is set by the user), and a Z-score is finally computed as follows:
$$Z = \frac{P_s - \overline{P_r}}{\sigma_{P_r}}$$
Under the null hypothesis H0: $Z \sim \mathcal{N}(0, 1)$
where Ps is the percentage of significant genes found in the data set for a given functional category;
Pr is the percentage of significant genes found randomly; $\overline{P_r}$ is the mean of the N values of Pr;
and $\sigma_{P_r}$ is the square root of the variance of the N values of Pr.
A Z-score is thus computed for each category. Categories displaying the highest Z-scores
(typically Z > 3) can be considered to be of special interest in the data set studied. Finally, a
p-value is associated with each Z-score in order to facilitate comparison with the results obtained
from the second option, described below.
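The Z-score procedure can be illustrated with a minimal Python sketch. It is not SBIME's code: the function names, the dictionary mapping gene identifiers to ANOVA p-values, the fixed random seed, sampling without replacement, the use of the sample standard deviation, and the one-sided conversion of Z to a p-value are all assumptions made for illustration, since the text does not specify them.

```python
import random
import statistics


def percent_significant(pvalues, threshold):
    """Percentage of p-values strictly below the threshold."""
    return 100.0 * sum(p < threshold for p in pvalues) / len(pvalues)


def category_z_score(all_pvalues, category_genes, threshold=0.05,
                     n_iter=1000, seed=0):
    """Z = (Ps - mean(Pr)) / sd(Pr) for one functional category.

    all_pvalues    : dict gene id -> ANOVA p-value (hypothetical input format)
    category_genes : gene ids of the category that are present on the chip
    n_iter         : number of random draws (N in the text), at least 2
    """
    rng = random.Random(seed)          # fixed seed: an assumption, for reproducibility
    genes = list(all_pvalues)
    x = len(category_genes)            # X in the text

    # Ps: percentage of significant genes in the category itself.
    ps = percent_significant([all_pvalues[g] for g in category_genes], threshold)

    # Pr: the same percentage for N random sets of X genes
    # (drawn without replacement, which is an assumption).
    pr = []
    for _ in range(n_iter):
        sample = rng.sample(genes, x)
        pr.append(percent_significant([all_pvalues[g] for g in sample], threshold))

    mean_pr = statistics.fmean(pr)
    sd_pr = statistics.stdev(pr)       # sample standard deviation (assumption)
    if sd_pr == 0:
        raise ValueError("all random draws gave the same Pr; Z is undefined")
    return (ps - mean_pr) / sd_pr


def upper_tail_p(z):
    """One-sided p-value of Z under N(0, 1); the sidedness is an assumption."""
    return 1.0 - statistics.NormalDist().cdf(z)
```

Calling `category_z_score(pvalues, genes_in_category, threshold=0.05, n_iter=1000)` once per category and passing the result to `upper_tail_p` gives one Z-score and one p-value per category. Because the draws are random, the value of Z varies slightly with the seed and with N.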
Fisher’s exact test: For each functional category, the proportion of significant genes is compared
to the proportion of significant genes on the array. The corresponding p-value is calculated using
the following formula:
$$p = \frac{(a + b)!\,(c + d)!\,(a + c)!\,(b + d)!}{(a + b + c + d)!\,a!\,b!\,c!\,d!}$$
where a is the number of significant genes on the array; b is the number of all the genes on the
array; c is the number of significant genes in a given functional category; and d is the number of all
the genes in the same category.
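As a numerical check of the formula, the following Python sketch evaluates it for four counts a, b, c and d. The function names are hypothetical, and the sketch takes the counts as plain arguments without making any claim about how SBIME fills them. The formula gives the probability of one 2x2 table with fixed margins. The second function, which sums these probabilities over tables with a first cell at least as large, is the standard one-sided Fisher test and is included as an assumption about how a tail p-value could be obtained.

```python
from math import comb, factorial


def table_probability(a, b, c, d):
    """Evaluate the factorial formula for one 2x2 table with cells a, b, c, d."""
    numerator = (factorial(a + b) * factorial(c + d)
                 * factorial(a + c) * factorial(b + d))
    denominator = (factorial(a + b + c + d)
                   * factorial(a) * factorial(b) * factorial(c) * factorial(d))
    return numerator / denominator


def one_sided_p(a, b, c, d):
    """Sum the table probabilities over first-cell values k >= a, margins fixed.

    Uses the equivalent hypergeometric form
    C(a+b, k) * C(c+d, a+c-k) / C(n, a+c), with n = a + b + c + d.
    """
    row1 = a + b
    col1 = a + c
    n = a + b + c + d
    largest_k = min(row1, col1)        # the first cell cannot exceed either margin
    total = sum(comb(row1, k) * comb(n - row1, col1 - k)
                for k in range(a, largest_k + 1))
    return total / comb(n, col1)
```

For example, `table_probability(3, 1, 1, 3)` evaluates 4! 4! 4! 4! / (8! 3! 1! 1! 3!) = 16/70, and `one_sided_p(3, 1, 1, 3)` adds the single more extreme table (4, 0, 0, 4) to give 17/70.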