Integration of Genetic and Genomic Approaches for the Analysis of Chronic
Fatigue Syndrome Implicates Forkhead Box N1
Anja Presson1,2, Eric Sobel1, Jeanette Papp1, Aldons J. Lusis1, Steve Horvath1,3*
Depts. of Human Genetics1, Statistics2, Biostatistics3, University of California, Los Angeles
*Corresponding author: shorvath@mednet.ucla.edu
Abstract
Chronic fatigue syndrome (CFS) is a difficult disorder to characterize due to its combined physical and
psychological effects. Studies on physiological manifestations have identified an abnormal immune response in
CFS patients, and it is hypothesized that a viral infection triggers CFS in some individuals. In this report we
integrate microarray, SNP marker, and clinical trait data to identify the key drivers of CFS. Our novel weighted
network analysis identifies a set of eight correlated gene candidates for CFS. Among them, the FOXN1 gene is
particularly interesting, as mutations in it are known to cause a depleted immune system in both humans and mice.
In this article we define a novel systems biology approach for identifying complex disease genes. As an
illustration we show that FOXN1 is a candidate gene for CFS. Our evidence includes the following four
findings. First, FOXN1 is part of a gene co-expression network module that is related to the CFS severity trait
(clinical variable “CLUSTER”). Second, it has high gene network connectivity. Third, it is significantly
associated with SNP hCV245410 (p-value = 0.0179), which in turn is significantly associated with the CFS
severity trait (p-value = 0.010). And fourth, it is correlated with the CFS severity trait. We note a commercially
available biomarker for FOXN1, and two for CFS.
Background
Chronic fatigue syndrome (CFS) is a complex disease whose spectrum of physiological and
psychological characteristics challenges modern genetics. In 1994, CFS experts convened to outline rigorous
guidelines for diagnosing the disease (Fukuda et al., 1994). They determined that a CFS diagnosis should
require a minimum of six months of medically unexplained, debilitating fatigue, and several additional
symptoms. In 2003, CFS classification was further refined, and mental disorders became exclusionary for CFS
status (Reeves et al., 2003; Carruthers et al., 2003). It has been shown that some CFS patients have elevated
levels of cortisol, due to an overactive hypothalamic–pituitary–adrenal (HPA) axis (Demitrack et al., 1991; Fries
et al., 2004), an altered immune response substantiated by high T-cell counts (Rasmussen et al., 1994; Maher et
al., 2005), and skeletal muscle dysfunction (Jammes et al., 2005; Nijs et al., 2005). In some patients, CFS
appears to be triggered by viral infection (Mawle et al., 1994; Coyle et al., 1990). Gene expression studies have
lent support to this theory by identifying overexpression of immune response genes (Kaushik et al., 2005).
Recently, several groups have proposed using microarray data to construct gene expression networks
and to identify network modules (sets of tightly correlated genes) and highly connected “hub” genes. Here, we
use microarray and genetic marker data on CFS patients to identify key genetic drivers of CFS in humans. First,
a weighted gene co-expression network is constructed and its modules are identified. In gene co-expression
networks, Pearson correlations in mRNA levels are normally used to define connectivity and to group genes
with similar expression profiles into modules, thereby illuminating higher-order properties of the transcriptome.
We sought to identify functional, disease-related modules of co-expressed genes. Characterizing the modules in
terms of clinical traits, gene ontology information (EASE analysis), and SNP marker correlations allowed us to
identify clinically relevant gene modules while simultaneously identifying candidate genes that may contribute
to the CFS trait.
Weighted gene co-expression network construction
We considered the 165 patients who were genotyped and assayed for gene expression levels. We
narrowed this set of samples further by excluding two outlier arrays (based on their very high mean expression
levels) and by limiting our analysis to patients classified as fatigued in the surveillance study. The fatigued
group was defined as all categories except the control category.
Figure 1. Visualizing the gene network. A. Average linkage hierarchical clustering reveals branches (gene
modules) of correlated genes. Grey is used to denote genes that are not part of any module. The remaining
colors are used for the four modules. B. Classical multidimensional scaling plot of the gene network. Note that
modules form distinct point clouds of genes. C. Heat maps of gene expression values (rows) versus samples
(columns). The vertical bands show that the genes within a module are highly correlated.
The microarray expression data consisted of approximately 20,000 probes from glass-slide arrays.
ArrayVision software read the slides and normalized the data by subtracting background intensity from the spot
intensity values. We analyzed these “sARMDens” values.
Since our particular interest was in identifying gene modules related to the CFS trait, we restricted the
analysis to genes that showed some sign of differential expression between CFS patients and controls (absolute
value of the T-statistic larger than 1.15). Further, we removed outlying probe sets, e.g., on the basis of extreme
skewness. These data pre-processing steps resulted in a set of 2,824 genes.
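The differential-expression filter described above can be sketched as follows. This is a minimal illustration, not the authors' code: the text does not say which form of the T-statistic was used, so the unequal-variance (Welch) form is an assumption, and the function names and toy data are hypothetical.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Two-sample t-statistic, unequal-variance (Welch) form (an assumption)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

def filter_genes(expr, is_case, threshold=1.15):
    """Keep genes whose |t| between cases and controls exceeds the threshold.

    expr:    dict mapping gene id -> list of expression values, one per sample
    is_case: list of booleans aligned with the samples (True = CFS patient)
    """
    kept = []
    for gene, values in expr.items():
        cases = [v for v, c in zip(values, is_case) if c]
        controls = [v for v, c in zip(values, is_case) if not c]
        if abs(welch_t(cases, controls)) > threshold:
            kept.append(gene)
    return kept

# Hypothetical toy data: geneA differs between groups, geneB does not.
is_case = [True, True, True, False, False, False]
expr = {
    "geneA": [5.0, 5.2, 5.1, 4.0, 4.1, 3.9],
    "geneB": [2.0, 2.1, 1.9, 2.0, 2.1, 1.9],
}
selected = filter_genes(expr, is_case)
```

A threshold as permissive as 1.15 keeps any gene with a hint of differential expression, which is consistent with the stated aim of a loose pre-filter rather than a formal test.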
Weighted network construction was performed using R as described in Zhang and Horvath (2005).
Briefly, the absolute value of the Pearson correlation coefficient was calculated for all pairwise comparisons of
gene expression values across all microarray samples. The correlation matrix was then
transformed into a matrix of connection strengths using a power function (connection strength = |correlation|^β),
which resulted in a weighted network. The use of weighted networks represents an improvement over
unweighted networks based on dichotomizing the correlation matrix, since a) the continuous nature of the gene
co-expression information is preserved and b) the results of weighted network analyses are highly robust with
respect to the choice of the parameter β, whereas unweighted networks display sensitivity to the choice of the
cutoff. The scale-free topology criterion led us to choose β = 14, but our findings are highly robust
with respect to the choice of β. The overall connectivity for each gene (k) is the sum of the connection
strengths (|correlation|^β) between that gene and all other 2,823 genes in the network. The intramodular
connectivity for each gene is the sum of the connection strengths between that gene and all other genes in its
module. R software code, a tutorial, and a technical report for generating weighted gene co-expression networks
can be found at http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork.
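As a rough illustration of the construction just described, the sketch below computes connection strengths |correlation|^β, overall connectivity k, and intramodular connectivity in plain Python. It is not the R code referenced above; the function names, the toy expression profiles, and the module labels are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def weighted_adjacency(expr, beta):
    """adj[i][j] = |cor(x_i, x_j)| ** beta; the diagonal is left at 0."""
    n = len(expr)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            adj[i][j] = adj[j][i] = abs(pearson(expr[i], expr[j])) ** beta
    return adj

def connectivity(adj):
    """Overall connectivity k: sum of connection strengths to all other genes."""
    return [sum(row) for row in adj]

def intramodular_connectivity(adj, modules):
    """Sum of connection strengths to the other genes in the same module."""
    return [
        sum(a for j, a in enumerate(row) if j != i and modules[j] == modules[i])
        for i, row in enumerate(adj)
    ]

# Hypothetical toy data: genes 0 and 1 are perfectly correlated, gene 2 is not.
expr = [[1, 2, 3, 4], [2, 4, 6, 8], [1, -1, 1, -1]]
adj = weighted_adjacency(expr, beta=14)
k = connectivity(adj)
k_in = intramodular_connectivity(adj, ["green", "green", "grey"])
```

With β = 14, a moderate correlation of about 0.45 is pushed down to roughly 0.00001, while a correlation of 1 stays at 1. This is how the power function emphasizes strong correlations without imposing a hard cutoff.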
Using the clinical trait significance measures to characterize the gene modules
Gene co-expression modules in the network were identified using average linkage hierarchical
clustering to group genes based upon the topological overlap of their connectivity (Ravasz et al., 2002; Zhang
and Horvath, 2005). In Figure 1A, modules correspond to branches of the dendrogram. Modules were also
visualized by classical multi-dimensional scaling in three dimensions as seen in Figure 1B. For each module, a
“heat map” (Figure 1C) was produced with rows corresponding to genes and columns corresponding to
samples. Next, the resulting gene modules were related to the clinical traits to identify clinically interesting
modules. For a given clinical trait, one can define a measure of trait significance by forming the absolute value
of the correlation between trait and gene expression values. For example, the CFS severity score (CLUSTER)
can be used to define the trait significance of the i-th gene expression profile x(i):
Trait-Significance(i) = |cor(x(i), CLUSTER)|.    [1]
Table 1 shows that the modules can be characterized by different clinical traits that are correlated with the
respective module genes. We were particularly interested in the CLUSTER trait since it measures severity. Note
that the green module stands out because it had the highest mean correlation (0.216) with the CLUSTER trait.
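To make equation [1] and the module averages concrete, here is a small sketch that computes trait significance per gene and then averages it within each module. The gene names, module assignments, and trait values are invented for illustration only.

```python
import math
from collections import defaultdict

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def trait_significance(expr, trait):
    """Equation [1]: |cor(x(i), trait)| for every gene i."""
    return {gene: abs(pearson(values, trait)) for gene, values in expr.items()}

def module_means(significance, module_of):
    """Average a per-gene significance measure within each module."""
    grouped = defaultdict(list)
    for gene, value in significance.items():
        grouped[module_of[gene]].append(value)
    return {module: sum(v) / len(v) for module, v in grouped.items()}

# Hypothetical toy data: a severity score for four samples and three genes.
cluster = [1, 2, 3, 4]
expr = {"g1": [4, 3, 2, 1], "g2": [1, 2, 3, 4], "g3": [1, -1, 1, -1]}
module_of = {"g1": "green", "g2": "green", "g3": "grey"}

ts = trait_significance(expr, cluster)
means = module_means(ts, module_of)
```

Because the absolute value is taken, negatively correlated genes (g1 here) count as much as positively correlated ones when modules are ranked by mean trait significance.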
Table 1. Average trait significance for different clinical traits and average SNP gene significance within each
module.
Functional enrichment analysis of the green module
Using the EASE gene ontology enrichment analysis software, we found that the green module was
significantly enriched with nitrogen metabolism genes (Fisher exact p-value 0.00040) and muscle development
genes (p-value 0.0016).
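For readers unfamiliar with this kind of enrichment test, a one-sided Fisher exact p-value can be computed from the hypergeometric distribution as sketched below. This is a generic illustration with invented counts. EASE may apply further adjustments that are not reproduced here, and the function name and arguments are hypothetical.

```python
from math import comb

def fisher_enrichment_p(k, n, K, N):
    """One-sided Fisher exact p-value for over-representation.

    N: number of genes in the background list
    K: background genes annotated with the category
    n: number of genes in the module
    k: module genes annotated with the category
    Returns P(X >= k) for X ~ Hypergeometric(N, K, n).
    """
    total = comb(N, n)
    tail = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, min(n, K) + 1))
    return tail / total

# Hypothetical counts: 2 of 4 module genes fall in a category that
# contains 3 of the 10 background genes.
p = fisher_enrichment_p(k=2, n=4, K=3, N=10)
```

In this toy case the tail probability is (3*21 + 1*7) / 210 = 1/3, so the overlap would not be considered enriched.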
Relating the CFS Severity Score (CLUSTER) to the SNPs
We narrowed the list of SNPs from more than 40 to two by requiring an association with the CLUSTER
trait at p-value ≤ 0.01. This identified two SNPs: hCV245410 (chromosome 12, p-value = 0.010) and
hCV7911132 (chromosome 17, p-value = 0.001).
Using the SNP gene significance measures to characterize the gene modules
Next, we related the SNP markers to the module gene expression profiles to identify genetic loci (referred to
as module loci) that drive the module gene expression profiles. These module loci overlap in some cases with
the clinical trait loci. Analogous to the notion of trait significance, each SNP also gives rise to a measure of SNP
gene significance as follows. First, use additive marker coding (0,1,2) to code genotypes AA, AB, and BB,
respectively. Next, define the SNP gene significance measure of the i-th gene expression x(i) by the absolute
value of its correlation with the SNP under consideration. In Figure 2 we make use of the measure of SNP gene
significance
SNP-Gene-Significance(i) = |cor(x(i), hCV245410)|.    [2]
For the SNPs that were significantly associated with the CLUSTER trait, Table 1 reports the average SNP gene
significance measures, i.e., the mean absolute correlations between the module gene expression profiles and
each of the two significant SNPs. Note that SNP hCV245410 on chromosome 12 leads to an elevated SNP gene
significance measure in the green module.
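The additive coding and equation [2] can be illustrated with a short sketch. The genotype labels follow the AA/AB/BB convention used in the text, while the function names and the toy expression values are hypothetical.

```python
import math

# Additive marker coding as described in the text: AA -> 0, AB -> 1, BB -> 2.
GENOTYPE_CODE = {"AA": 0, "AB": 1, "BB": 2}

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def additive_code(genotypes):
    """Translate genotype labels into 0/1/2 allele counts."""
    return [GENOTYPE_CODE[g] for g in genotypes]

def snp_gene_significance(expression, genotypes):
    """Equation [2]: |cor(x(i), SNP)| with the SNP in additive coding."""
    return abs(pearson(expression, additive_code(genotypes)))

# Hypothetical toy data: expression rises with each copy of allele B.
genotypes = ["AA", "AB", "BB", "AB"]
expression = [1.0, 2.0, 3.0, 2.0]
coded = additive_code(genotypes)
sig = snp_gene_significance(expression, genotypes)
```

Additive coding assumes that each extra copy of the B allele shifts expression by the same amount. A dominant or recessive effect would be only partly captured by this correlation.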
Network-based screening strategy
It is biologically plausible to assume that genes associated with SNP hCV245410 and with the
CLUSTER trait may be related to CFS. Furthermore, genes that have high connectivity are likely to play a
prominent role in the gene network. We selected candidate genes by imposing the following gene screening
criteria (see Figure 2):
i) Trait significance greater than 0.2 (using the CLUSTER trait in equation [1]);
ii) SNP gene significance greater than 0.2 (using SNP hCV245410 in equation [2], since this SNP is
associated with the CLUSTER trait);
iii) Weighted network connectivity greater than 2, to ensure that the gene is highly connected;
iv) Membership in the green module, since it had high trait significance with the CLUSTER trait and is
enriched with plausible gene ontologies.
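The four screening criteria amount to a simple conjunctive filter, sketched below. The thresholds are those stated in the criteria, whereas the record layout, gene identifiers, and values are hypothetical.

```python
def screen_candidates(genes, trait_cut=0.2, snp_cut=0.2, k_cut=2.0, module="green"):
    """Return ids of genes that satisfy all four screening criteria.

    Each gene is a dict with keys 'id', 'trait_sig', 'snp_sig', 'k', 'module'
    (a hypothetical layout chosen for this illustration).
    """
    return [
        g["id"]
        for g in genes
        if g["trait_sig"] > trait_cut      # i) trait significance
        and g["snp_sig"] > snp_cut         # ii) SNP gene significance
        and g["k"] > k_cut                 # iii) network connectivity
        and g["module"] == module          # iv) module membership
    ]

# Hypothetical toy records; only the first passes every criterion.
genes = [
    {"id": "gene1", "trait_sig": 0.25, "snp_sig": 0.22, "k": 3.1, "module": "green"},
    {"id": "gene2", "trait_sig": 0.25, "snp_sig": 0.10, "k": 3.5, "module": "green"},
    {"id": "gene3", "trait_sig": 0.30, "snp_sig": 0.25, "k": 1.2, "module": "green"},
    {"id": "gene4", "trait_sig": 0.30, "snp_sig": 0.25, "k": 4.0, "module": "grey"},
]
candidates = screen_candidates(genes)
```

Because the criteria are combined with a logical AND, a gene with strong trait and SNP associations is still excluded if it is weakly connected or falls outside the green module.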
Figure 2. A. Trait significance (based on the trait CLUSTER) versus network connectivity.
B. Trait significance versus SNP gene significance (hCV245410). The eight candidate genes in the green
module are numbered in each plot. Note the 8 genes have high trait significance, high network connectivity and
high SNP gene significance.
Results and discussion of the FOXN1 gene
Our eight candidate genes are briefly described in Table 2. We consider FOXN1 as a plausible
candidate for a biomarker of CFS because our network analysis demonstrates its significant role and because it
has been shown to affect the immune system and T-cell production, symptoms that are consistent with CFS.
From public data, it is known that FOXN1 is highly expressed in thymic epithelial cells, and the thymus gland
plays a pivotal role in the immune system. The thymus is where lymphocyte precursors mature into T-cells, and
it releases functional T-cells to combat infection. Mutations in the FOXN1 gene have been shown to cause the
nude phenotype (hairlessness) and a depleted immune system due to dysfunctional T-cells (Nehls et al., 1994;
Pignata et al., 1996; Adriani et al., 2004). In chronic fatigue syndrome, significantly reduced levels of the
protein perforin, which is found in cytotoxic lymphocytes, have been reported (Maher et al., 2005), along with
elevated T-cell activation. Given the role of FOXN1 in T-cell impairment and the T-cell abnormalities observed
in CFS patients, the FOXN1 gene is a biologically plausible biomarker for CFS.
Our gene screening strategy is related to the “genetical genomics” approaches (for examples see Hubner
et al., 2005 and Schadt et al., 2003). Our gene network strategy may not be suitable for identifying genes that
only interact with a few other genes. However, the network concepts of module and connectivity are clearly
biologically meaningful tools for screening for biomarkers in the current application. Moreover, even without
our network-based evidence, the FOXN1 gene is an excellent candidate gene as it shows significant association
with the hCV245410 SNP (p-value = 0.0179), a SNP that is significantly associated with the CFS severity trait
(p-value = 0.010). There are both genomic and antibody biomarkers available for the FOXN1 gene, as well as a
FOXN1 deficient mouse model. Chronic fatigue syndrome has also been studied in mice (Chao et al., 1992) and
it would be interesting to study the effect of knocking out the FOXN1 gene in CFS mice.
Table 2. Top eight genes identified by our network-based screening strategy.
Accession | Gene Symbol (Name) and Information | Locus | CLUSTER P-Value (Correlation) | SNP P-Value (Correlation) | Commercial Biomarker
NM_003593 | FOXN1 (forkhead box N1): Functions in defense response, T-cell immunodeficiency, and known to cause nudity in mice and humans. Present in thymus. | 17q11-q12 | 0.055 (-0.21) | 0.018 (0.21) | YES
AB051077 | PEX6 (peroxisomal biogenesis factor 6): Absence results in Zellweger syndrome (ZWS), neurological and metabolic defects. | 6p21.1 | 0.032 (-0.23) | 0.013 (0.22) | YES
AF118073 | PRDX3 (peroxiredoxin 3): Regulates cell proliferation, differentiation, and antioxidant functions. | 10q25-q26 | 0.017 (-0.26) | 0.020 (0.21) | YES
AF106685 | MYEF2 (myelin expression factor 2): Myoblast cell differentiation and transcription. | 15q21.1 | 0.050 (-0.21) | 0.012 (0.22) | YES
AF111802 | CRNKL1 (Crn, crooked neck-like 1 (Drosophila)): Expressed in testes, involved in mRNA splicing. | 20p11.2 | 0.012 (-0.27) | 0.013 (0.22) | YES
BC010019 | MED8 (mediator of RNA polymerase II transcription, subunit 8 homolog (yeast)): Regulates transcription. | 1p34.2 | 0.007 (-0.29) | 0.015 (0.22) | YES
XM_067644 | Similar to polynucleotide phosphorylase-like protein and 3'-5' RNA exonuclease. | not listed | 0.007 (-0.29) | 0.002 (0.27) | NO
BC004179 | Unknown (protein for mgc:2780) | not listed | 0.032 (-0.23) | 0.007 (0.24) | NO
References
1. Adriani, M., Martinez-Mir, A., Fusco, F., Busiello, R., Frank, J., Telese, S., Matrecano, E., Ursini, M.V.,
Christiano, A.M., Pignata, C. (2004). Ann Hum Genet 68, 265–268.
2. Zhang, B. and Horvath, S. (2005). Statistical Applications in Genetics and Molecular Biology 4, 17.
3. Carruthers, B.M., Jain, A.K., De Meirleir, K.L., et al. (2003). J Chronic Fatigue Syndr 11, 7–15.
4. Chao, C.C., DeLaHunt, M., Hu, S., Close, K., and Peterson, P.K. (1992). Clin Immun Immunopath 64, 161–165.
5. Coyle, P.K., and Krupp, L. (1990). Ann Neurol 28, 243–244.
6. Demitrack, M.A., Dale, J.K., Straus, S.E., Laue, L., Listwak, S.J., Kruesi, M.J., Chrousos, G.P., and Gold,
P.W. (1991). J Clin Endocrinol Metab 73, 1224–1234.
7. Fries, E., Hesse, J., Hellhammer, J., and Hellhammer, D.H. (2004). Psychoneuroendocrinology 30, 1010–1016.
8. Fukuda, K., Straus, S.E., Hickie, I., et al. (1994). Ann Intern Med 121, 953–959.
9. Hubner, N., Wallace, C.A., Zimdahl, H., Petretto, E., Schulz, H., et al. (2005). Nat Genet 37, 243–253.
10. Jammes, Y., Steinberg, J.G., Mambrini, O., Bregeon, F., Delliaux, S. (2005). J Intern Med 257, 299–310.
11. Kaushik, N., Fear, D., Richards, S.C.M., McDermott, C.R., Nuwaysir, E.F., Kellam, P., Harrison, T.J.,
Wilkinson, R.J., Tyrrell, D. A.J., Holgate, S.T. and Kerr, J.R. (2005). J Clinical Path 58, 826–832.
12. Mawle, A.C., Reyes, M., and Schmid, C.S. (1994). Infect Agents Dis 2, 333–341.
13. Nehls, M., Pfeifer, D., Schorpp, M., Hedrich, H., and Boehm, T. (1994). Nature 372, 103–107.
14. Nijs, J., Meeus, M., McGregor, N.R., Meeusen, R., de Schutter, G., van Hoof, E., and de Meirleir, K. (2005).
Med Sci Sports Exerc 37, 1647–1654.
15. Maher, K.J., Klimas, N.G., and Fletcher, M.A. (2005). Clin Exp Immunol 142, 505–511.
16. Pignata, C., Fiore, M., Guzzetta, V., Castaldo, A., Sebastio, G., Porta, F., and Guarino, A. (1996). Am J Med
Genet 65, 167–170.
17. Rasmussen, A.K., Nielsen, H., Andersen, V., Barington, T., Bendtzen, K., Hansen, M.B., Nielsen, L.,
Pedersen, B.K., and Wiik, A. (1994). J Rheumatol 21, 1527–1531.
18. Ravasz, E., Somera, A.L., Mongru, D.A., Oltvai, Z.N., and Barabasi, A.L. (2002). Science 297, 1551–1555.
19. Reeves, W.C., Lloyd, A., Vernon, S.D., Klimas, N., Jason, L.A., Bleijenberg, G., Evengard, B., White, P.D.,
Nisenbaum, R., Unger, E.R., International Chronic Fatigue Syndrome Study Group (2003). BMC Health
Serv Res 3, 25.
20. Schadt, E.E., Monks, S.A., Drake, T.A., Lusis, A.J., Che, N., et al. (2003). Nature 422, 297–302.