Integration of Genetic and Genomic Approaches for the Analysis of Chronic Fatigue Syndrome Implicates Forkhead Box N1

Anja Presson1,2, Eric Sobel1, Jeanette Papp1, Aldons J. Lusis1, Steve Horvath1,3*
Depts. of Human Genetics1, Statistics2, Biostatistics3, University of California, Los Angeles
*Corresponding author: shorvath@mednet.ucla.edu

Abstract

Chronic fatigue syndrome (CFS) is a difficult disorder to characterize due to its combined physical and psychological effects. Studies on physiological manifestations have identified an abnormal immune response in CFS patients, and it is hypothesized that a viral infection triggers CFS in some individuals. In this report we integrate microarray, SNP marker, and clinical trait data to identify the key drivers of CFS. Our novel weighted network analysis identifies a set of eight correlated gene candidates for CFS. Among them, the FOXN1 gene is particularly interesting, as it is known to cause a depleted immune system in both humans and mice.

In this article we define a novel systems biology approach for identifying complex disease genes. As an illustration we show that FOXN1 is a candidate gene for CFS. Our evidence includes the following four findings. First, FOXN1 is part of a gene co-expression network module that is related to the CFS severity trait (clinical variable “CLUSTER”). Second, it has high gene network connectivity. Third, it is significantly associated with SNP hCV245410 (p-value = 0.0179), which in turn is significantly associated with the CFS severity trait (p-value = 0.010). And fourth, it is correlated with the CFS severity trait. We note a commercially available biomarker for FOXN1, and two for CFS.

Background

Chronic fatigue syndrome (CFS) is a complex disease whose spectrum of physiological and psychological characteristics challenges modern genetics. In 1994, CFS experts convened to outline rigorous guidelines for diagnosing the disease (Fukuda et al., 1994).
They determined that a CFS diagnosis should require a minimum of six months of medically unexplained, debilitating fatigue and several additional symptoms. In 2003, CFS classification was further refined, and mental disorders became exclusionary for CFS status (Reeves et al., 2003; Carruthers et al., 2003). It has been shown that some CFS patients have elevated levels of cortisol, due to an overactive hypothalamic–pituitary–adrenal (HPA) axis (Demitrack et al., 1991; Fries et al., 2004), an altered immune response substantiated by high T-cell counts (Rasmussen et al., 1994; Maher et al., 2005), and skeletal muscle dysfunction (Jammes et al., 2005; Nijs et al., 2005). In some patients, CFS appears to be triggered by viral infection (Mawle et al., 1994; Coyle et al., 1990). Gene expression studies have lent support to this theory by identifying overexpression of immune response genes (Kaushik et al., 2005).

Recently, several groups have proposed using microarray data to construct gene expression networks and to identify network modules (sets of tightly correlated genes) and highly connected “hub” genes. Here, we use microarray and genetic marker data on CFS patients to identify key genetic drivers of CFS in humans. First, a weighted gene co-expression network is constructed and its modules are identified. In gene co-expression networks, Pearson correlations between mRNA levels are normally used to define connectivity and to group genes with similar expression profiles into modules, thereby illuminating higher-order properties of the transcriptome. We sought to identify functional, disease-related modules of co-expressed genes and to characterize them in terms of the clinical traits. Characterizing the modules by their clinical traits, gene ontology information (EASE analysis), and SNP marker correlations allowed us to identify clinically relevant gene modules and, at the same time, candidate genes that may contribute to the CFS trait.

Weighted gene co-expression network construction

We considered the 165 patients who were genotyped and assayed for gene expression levels. We narrowed this set of samples further by excluding two outlier arrays (based on their very high mean expression levels) and limiting our analysis to patients classified as fatigued in the surveillance study. Fatigued patients were considered to include all but the control category.

Figure 1. Visualizing the gene network. A. Average linkage hierarchical clustering reveals branches (gene modules) of correlated genes. Grey is used to denote genes that are not part of any module. The remaining colors are used for the four modules. B. Classical multidimensional scaling plot of the gene network. Note that modules form distinct point clouds of genes. C. Heat maps of gene expression values (rows) versus samples (columns). The vertical bands show that the genes within a module are highly correlated.

The microarray expression data consisted of approximately 20,000 probes from glass-slide arrays. ArrayVision software read the slides and normalized the data by subtracting background intensity from the spot intensity values. We analyzed these “sARMDens” values. Since our particular interest was in identifying gene modules related to the CFS trait, we restricted the analysis to genes that showed some sign of differential expression between CFS patients and controls (absolute value of the T-statistic larger than 1.15). Further, we removed outlying probe sets, e.g., on the basis of extreme skewness. These data pre-processing steps resulted in a set of 2,824 genes.

Weighted network construction was performed using R as described in Zhang and Horvath (2005). Briefly, the absolute value of the Pearson correlation coefficient was calculated for all pair-wise comparisons of gene-expression values across all microarray samples.
The correlation matrix was then transformed into a matrix of connection strengths using a power function (connection strength = |correlation|^β), which resulted in a weighted network. The use of weighted networks represents an improvement over unweighted networks based on dichotomizing the correlation matrix, since a) the continuous nature of the gene co-expression information is preserved and b) the results of weighted network analyses are highly robust with respect to the choice of the parameter β, whereas unweighted networks are sensitive to the choice of the cutoff. The scale-free topology criterion led us to choose β = 14, but our findings are highly robust with respect to the choice of β. The overall connectivity for each gene (k) is the sum of the connection strengths (|correlation|^β) between that gene and all other 2,823 genes in the network. The intramodular connectivity for each gene is the sum of the connection strengths between that gene and all other genes in its module. R software code, a tutorial, and a technical report for generating weighted gene co-expression networks can be found at http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork.

Using the clinical trait significance measures to characterize the gene modules

Gene co-expression modules in the network were identified using average linkage hierarchical clustering to group genes based upon the topological overlap of their connectivity (Ravasz et al., 2002; Zhang and Horvath, 2005). In Figure 1A, modules correspond to branches of the dendrogram. Modules were also visualized by classical multi-dimensional scaling in three dimensions, as seen in Figure 1B. For each module, a “heat map” (Figure 1C) was produced with rows corresponding to genes and columns corresponding to samples. Next, the resulting gene modules were related to the clinical traits to identify clinically interesting modules.
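
The network quantities described above (connection strength, overall and intramodular connectivity, and topological overlap) can be sketched in a few lines. The following is a minimal illustration in plain Python, not the authors' R code: the function names, the toy expression matrix, and the toy power of 2 are all assumptions made for the example, and the topological overlap is written in the form given by Zhang and Horvath (2005). The sketch also assumes that no gene has constant expression, since the correlation would then be undefined.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def adjacency(expr, beta):
    """Connection strengths a[i][j] = |cor(x_i, x_j)| ** beta, with a[i][i] = 0."""
    n = len(expr)
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            a[i][j] = a[j][i] = abs(pearson(expr[i], expr[j])) ** beta
    return a


def connectivity(a, members=None):
    """Sum of connection strengths of each gene to the other genes in `members`.

    With members=None this is the overall connectivity k; passing the indices
    of a single module gives the intramodular connectivity instead.
    """
    idx = list(range(len(a))) if members is None else list(members)
    return {i: sum(a[i][j] for j in idx if j != i) for i in idx}


def topological_overlap(a):
    """Overlap w_ij = (sum_u a_iu * a_uj + a_ij) / (min(k_i, k_j) + 1 - a_ij)."""
    n = len(a)
    k = connectivity(a)
    tom = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            shared = sum(a[i][u] * a[u][j] for u in range(n))
            tom[i][j] = tom[j][i] = (shared + a[i][j]) / (min(k[i], k[j]) + 1 - a[i][j])
    return tom


# Hypothetical toy data: 4 genes (rows) measured on 5 samples (columns).
toy_expr = [
    [1, 2, 3, 4, 5],
    [2, 4, 6, 8, 10],
    [5, 4, 3, 2, 1],
    [1, 3, 2, 5, 4],
]
toy_beta = 2  # small power for the toy example only; the text above uses 14

a = adjacency(toy_expr, toy_beta)
k = connectivity(a)                            # overall connectivity
k_module = connectivity(a, members=[0, 1, 2])  # connectivity within a toy module
tom = topological_overlap(a)
print(k)
```

Because the absolute correlation is used, the perfectly anti-correlated genes 0 and 2 receive the same connection strength as the perfectly correlated genes 0 and 1. Module detection would then cluster genes on the dissimilarity 1 - tom[i][j]; that step is omitted here.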
For a given clinical trait, one can define a measure of trait significance by taking the absolute value of the correlation between the trait and the gene expression values. For example, the CFS severity score (CLUSTER) can be used to define the trait significance of the i-th gene expression profile x(i):

Trait-Significance(i) = |cor(x(i), CLUSTER)|. [1]

Table 1 shows that the modules can be characterized by different clinical traits that are correlated with the respective module genes. We were particularly interested in the CLUSTER trait since it measures severity. Note that the green module stands out because it had the highest mean correlation (0.216) with the CLUSTER trait.

Table 1. Average trait significance for different clinical traits and average SNP gene significance within each module.

Functional enrichment analysis of the green module

Using the EASE gene ontology enrichment analysis software, we found that the green module was significantly enriched with nitrogen metabolism genes (Fisher exact p-value 0.00040) and muscle development genes (p-value 0.0016).

Relating the CFS Severity Score (CLUSTER) to the SNPs

We narrowed the list of SNPs from more than 40 to two by requiring an association with the CLUSTER trait at p-value ≤ 0.01. This identified two SNPs: hCV245410 (chromosome 12, p-value = 0.010) and hCV7911132 (chromosome 17, p-value = 0.001).

Using the SNP gene significance measures to characterize the gene modules

Next we related the SNP markers to the module gene expression profiles to identify genetic loci (referred to as module loci) that drive the module gene expression profiles. These module loci overlap in some cases with the clinical trait loci. Analogous to the notion of trait significance, each SNP also gives rise to a measure of SNP gene significance as follows. First, use additive marker coding (0, 1, 2) to code genotypes AA, AB, and BB, respectively.
Next, define the SNP gene significance measure of the i-th gene expression profile x(i) as the absolute value of its correlation with the SNP under consideration. In Figure 2 we make use of the SNP gene significance measure

SNP-Gene-Significance(i) = |cor(x(i), hCV245410)|. [2]

For the SNPs that were significantly associated with the CLUSTER trait, Table 1 reports the average SNP gene significance measures, i.e., the mean absolute correlations between the module gene expression profiles and each of the two significant SNPs. Note that SNP hCV245410 on chromosome 12 leads to an elevated SNP gene significance measure in the green module.

Network-based screening strategy

It is biologically plausible to assume that genes associated with SNP hCV245410 and with the CLUSTER trait may be related to CFS. Furthermore, genes that have high connectivity are likely to play a prominent role in the gene network. We selected candidate genes by imposing the following gene screening criteria (see Figure 2):

i) Trait significance greater than 0.2 (using the CLUSTER trait in equation [1]);
ii) SNP gene significance greater than 0.2 (using SNP hCV245410 in equation [2], since this SNP is associated with the CLUSTER trait);
iii) Weighted network connectivity greater than 2, to ensure that the gene is highly connected;
iv) Membership in the green module, since it had high trait significance with the CLUSTER trait and is enriched with plausible gene ontologies.

Figure 2. A. Trait significance (based on the trait CLUSTER) versus network connectivity. B. Trait significance versus SNP gene significance (hCV245410). The eight candidate genes in the green module are numbered in each plot. Note that the eight genes have high trait significance, high network connectivity, and high SNP gene significance.

Results and discussion of the FOXN1 gene

Our eight candidate genes are briefly described in Table 2.
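
The screening that yields these candidates combines equations [1] and [2] with the connectivity and module-membership criteria. The sketch below is a hedged illustration in plain Python rather than the authors' implementation: the gene names, expression values, severity scores, genotypes, connectivities, and module labels are hypothetical toy inputs, and the helper names (`significance`, `screen`) are assumptions introduced for the example. Only the thresholds (0.2, 0.2, 2) and the green module come from the criteria listed in the text.

```python
import math

# Additive marker coding for genotypes, as described in the text.
ADDITIVE_CODE = {"AA": 0, "AB": 1, "BB": 2}


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def significance(expr, covariate):
    """|cor(x(i), covariate)| per gene; covers equations [1] and [2]."""
    return [abs(pearson(x, covariate)) for x in expr]


def screen(names, trait_sig, snp_sig, k, module,
           min_sig=0.2, min_k=2.0, target_module="green"):
    """Keep genes that satisfy all four screening criteria i) to iv)."""
    return [
        name
        for name, ts, ss, ki, m in zip(names, trait_sig, snp_sig, k, module)
        if ts > min_sig and ss > min_sig and ki > min_k and m == target_module
    ]


# Hypothetical toy inputs: 3 genes measured on 6 samples.
names = ["geneA", "geneB", "geneC"]
expr = [
    [5, 4, 4, 3, 2, 1],
    [1, 2, 1, 2, 1, 2],
    [2, 1, 3, 2, 4, 3],
]
cluster = [1, 1, 2, 2, 3, 3]                      # toy severity scores
genotypes = ["AA", "AB", "AA", "AB", "BB", "BB"]  # toy genotypes at one SNP
snp = [ADDITIVE_CODE[g] for g in genotypes]

toy_k = [3.1, 2.5, 1.2]                   # assumed connectivities
toy_module = ["green", "green", "green"]  # assumed module labels

trait_sig = significance(expr, cluster)   # equation [1]
snp_sig = significance(expr, snp)         # equation [2]
candidates = screen(names, trait_sig, snp_sig, toy_k, toy_module)
print(candidates)
```

In this toy example geneB is dropped because its trait significance is zero even though its SNP gene significance is above the threshold, and geneC is dropped because its assumed connectivity is below 2. Requiring all four criteria at once is what makes the screen conservative.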
We consider FOXN1 a plausible candidate biomarker for CFS because our network analysis highlights its prominent role and because it has been shown to affect the immune system and T-cell production, effects that are consistent with CFS. From public data, it is known that FOXN1 is highly expressed in thymic epithelial cells, and the thymus gland plays a pivotal role in the immune system. The thymus gland converts lymphocytes to T-cells and releases functional T-cells to combat infection. Mutations in the FOXN1 gene have been shown to cause nudity (congenital hairlessness) and a depleted immune system due to dysfunctional T-cells (Nehls et al., 1994; Pignata et al., 1996; Adriani et al., 2004). In chronic fatigue syndrome, significantly reduced levels of perforin, a protein found in cytotoxic T-cells, and elevated T-cell activation have been reported (Maher et al., 2005). Given the role of FOXN1 in T-cell impairment and the T-cell abnormalities observed in CFS patients, the FOXN1 gene is a biologically plausible biomarker for CFS.

Our gene screening strategy is related to the “genetical genomics” approaches (for examples see Hubner et al., 2005 and Schadt et al., 2003). Our gene network strategy may not be suitable for identifying genes that interact with only a few other genes. However, the network concepts of module and connectivity are biologically meaningful tools for screening for biomarkers in the current application. Moreover, even without our network-based evidence, the FOXN1 gene is an excellent candidate gene, as it shows significant association with the hCV245410 SNP (p-value = 0.0179), a SNP that is significantly associated with the CFS severity trait (p-value = 0.010).

There are both genomic and antibody biomarkers available for the FOXN1 gene, as well as a FOXN1-deficient mouse model. Chronic fatigue syndrome has also been studied in mice (Chao et al., 1992), and it would be interesting to study the effect of knocking out the FOXN1 gene in CFS mice.

Table 2. Top eight genes identified by our network-based screening strategy.

Accession | Gene symbol (name) and information | Locus | CLUSTER p-value (correlation) | SNP p-value (correlation) | Commercial biomarker
NM_003593 | FOXN1 (forkhead box N1): functions in defense response, T-cell immunodeficiency, and known to cause nudity in mice and humans. Present in thymus. | 17q11-q12 | 0.055 (-0.21) | 0.018 (0.21) | YES
AB051077 | PEX6 (peroxisomal biogenesis factor 6): absence results in Zellweger syndrome (ZWS), neurological and metabolic defects. | 6p21.1 | 0.032 (-0.23) | 0.013 (0.22) | YES
AF118073 | PRDX3 (peroxiredoxin 3): regulates cell proliferation, differentiation, and antioxidant functions. | 10q25-q26 | 0.017 (-0.26) | 0.020 (0.21) | YES
AF106685 | MYEF2 (myelin expression factor 2): myoblast cell differentiation and transcription. | 15q21.1 | 0.050 (-0.21) | 0.012 (0.22) | YES
AF111802 | CRNKL1 (Crn, crooked neck-like 1 (Drosophila)): expressed in testes, involved in mRNA splicing. | 20p11.2 | 0.012 (-0.27) | 0.013 (0.22) | YES
BC010019 | MED8 (mediator of RNA polymerase II transcription, subunit 8 homolog (yeast)): regulates transcription. | 1p34.2 | 0.007 (-0.29) | 0.015 (0.22) | YES
XM_067644 | Similar to polynucleotide phosphorylase-like protein and 3'-5' RNA exonuclease. | | 0.007 (-0.29) | 0.002 (0.27) | NO
BC004179 | Unknown (protein for mgc:2780) | | 0.032 (-0.23) | 0.007 (0.24) | NO

References

1. Adriani, M., Martinez-Mir, A., Fusco, F., Busiello, R., Frank, J., Telese, S., Matrecano, E., Ursini, M.V., Christiano, A.M., and Pignata, C. (2004). Ann Hum Genet 68, 265–268.
2. Zhang, B. and Horvath, S. (2005). Statistical Applications in Genetics and Molecular Biology 4, 17.
3. Carruthers, B.M., Jain, A.K., De Meirleir, K.L., et al. (2003). J Chronic Fatigue Syndr 11, 7–15.
4. Chao, C.C., DeLaHunt, M., Hu, S., Close, K., and Peterson, P.K. (1992). Clin Immunol Immunopathol 64, 161–165.
5. Coyle, P.K. and Krupp, L. (1990). Ann Neurol 28, 243–244.
6. Demitrack, M.A., Dale, J.K., Straus, S.E., Laue, L., Listwak, S.J., Kruesi, M.J., Chrousos, G.P., and Gold, P.W. (1991). J Clin Endocrinol Metab 73, 1224–1234.
7. Fries, E., Hesse, J., Hellhammer, J., and Hellhammer, D.H. (2004). Psychoneuroendocrinology 30, 1010–1016.
8. Fukuda, K., Straus, S.E., Hickie, I., et al. (1994). Ann Intern Med 121, 953–959.
9. Hubner, N., Wallace, C.A., Zimdahl, H., Petretto, E., Schulz, H., et al. (2005). Nat Genet 37, 243–253.
10. Jammes, Y., Steinberg, J.G., Mambrini, O., Bregeon, F., and Delliaux, S. (2005). J Intern Med 257, 299–310.
11. Kaushik, N., Fear, D., Richards, S.C.M., McDermott, C.R., Nuwaysir, E.F., Kellam, P., Harrison, T.J., Wilkinson, R.J., Tyrrell, D.A.J., Holgate, S.T., and Kerr, J.R. (2005). J Clin Pathol 58, 826–832.
12. Mawle, A.C., Reyes, M., and Schmid, C.S. (1994). Infect Agents Dis 2, 333–341.
13. Nehls, M., Pfeifer, D., Schorpp, M., Hedrich, H., and Boehm, T. (1994). Nature 372, 103–107.
14. Nijs, J., Meeus, M., McGregor, N.R., Meeusen, R., de Schutter, G., van Hoof, E., and de Meirleir, K. (2005). Med Sci Sports Exerc 37, 1647–1654.
15. Maher, K.J., Klimas, N.G., and Fletcher, M.A. (2005). Clin Exp Immunol 142, 505–511.
16. Pignata, C., Fiore, M., Guzzetta, V., Castaldo, A., Sebastio, G., Porta, F., and Guarino, A. (1996). Am J Med Genet 65, 167–170.
17. Rasmussen, A.K., Nielsen, H., Andersen, V., Barington, T., Bendtzen, K., Hansen, M.B., Nielsen, L., Pedersen, B.K., and Wiik, A. (1994). J Rheumatol 21, 1527–1531.
18. Ravasz, E., Somera, A.L., Mongru, D.A., Oltvai, Z.N., and Barabasi, A.L. (2002). Science 297, 1551–1555.
19. Reeves, W.C., Lloyd, A., Vernon, S.D., Klimas, N., Jason, L.A., Bleijenberg, G., Evengard, B., White, P.D., Nisenbaum, R., Unger, E.R., International Chronic Fatigue Syndrome Study Group (2003). BMC Health Serv Res 3, 25.
20. Schadt, E.E., Monks, S.A., Drake, T.A., Lusis, A.J., Che, N., et al. (2003). Nature 422, 297–302.