COMPARING QUANTITATIVE TRAIT LOCI AND GENE EXPRESSION DATA ASSOCIATED WITH A COMPLEX TRAIT Bing Han*1, Naomi S. Altman*1, Jessica A. Mong2,4, Laura Cousino Klein3, Donald W. Pfaff4, and David J. Vandenbergh3,5 1Department 2Department of Statistics, The Pennsylvania State University, University Park, PA, US of Pharmacology & Experimental Therapeutics, University of Maryland School of Medicine 3 Department of Biobehavioral Health, The Pennsylvania State University, University Park, PA, US 4The 5Center Laboratory of Neurobiology and Behavior, Rockefeller University, NY, NY, US for Developmental and Health Genetics, The Pennsylvania State University, University Park, PA, US * To whom correspondence should be addressed. Abstract We develop methods to compare the positions of quantitative trait loci (QTLs) with a set of genes selected by other methods from a sequenced genome. We apply our methods to QTLs for addictive behavior in mouse, and sets of genes associated in microarray studies with the nucleus accumbens (NA) region of the brain. The association between the QTLs and NA genes is moderately stronger than expected by chance. Statistical methodology developed for this study can be applied to similar studies to assess the mutual information in microarray and QTL analyses. 1 Introduction The association between a complex phenotypic trait and genetic markers on the chromosome can be detected through statistical analysis, leading to the identification of QTLs – regions of the chromosome that appear to be associated with the phenotype. QTLs are expected to be associated with the genes controlling some aspect of the phenotype. One mechanism by which a gene might be associated with the trait is through altered transcription which is easily measured by microarray analysis. Microarrays have the ability to measure a large percentage of the genes in the genome, and this assessment parallels the genome-wide scan performed by QTL methods. Several investigators have considered combining QTL and microarray data for studying a genetic trait. For example, Wayne and Mclntyre (2002) proposed a way of identifying candidate genes based on both QTL mapping and microarray data, where loci for an interested quantitative trait were primarily used for prescreening genes. A parallel microarray study focused on the filtered gene list and identified differential expressed genes related to the same trait. When the type I error is particularly emphasized, a QTL analysis prescreen can "bring the number of genes to be validated to a reasonable size". Fischer et al. (2003) developed a web-based software tool for combined visualization and exploration of gene expression data and QTLs. The methodology 1 developed in this work is complementary to the analyses that can be performed on the GeneNetwork website (WebQTL, www.genenetwork.org), which allows assessment of the relationship between gene expression and QTLs in Recombinant Inbred mice (Wang et al., 2003). The major advantage of a dual approach is that it can identify genes of interest with higher reliability in order to focus further work, which is laborious and expensive. However, comparing QTL and microarray data is not completely straightforward. First, the estimated range of QTL positions is generally wide, containing thousands of putative genes. However, QTL analysis may also miss some interesting genes (Wayne and Mclntyre, 2002). Second, the high level of experimental error and limitations of analysis in microarray data introduce mistakes in the identification of relevant genes. Finally, QTL studies include the entire genome including noncoding regions, while microarray studies seldom include the entire genome. Further problems arise when we try to associate phenotypes with gene expression in specific tissues. While the association is direct if the tissue from which transcription is assayed defines the phenotype, unanticipated associations can arise if the tissue indirectly regulates the phenotype – for example, bone strength may be regulated through physical activities regulated by the brain. Alternatively, association can arise through pleiotropic expression of the gene in a tissue not included in the expression study but in which the gene plays a role in the phenotype. In addition, the association between a phenotype and a tissue may depend on ephemeral conditions that may not be present when the tissue was collected for the microarray study or on a small percentage of cells in the organism, which may be masked by bulk tissue preparation. In this paper, we suggest several methods to examine the strength of association between a group of QTLs and a set of genes. The methods identify QTLs with unusually high numbers of differentially expressed genes, and hence genes and QTLs which are more likely to be associated with the condition of interest. In this paper, the QTLs were selected using a literature search that attempted to identify all QTL studies for the topic of interest, and the genes were selected from a single microarray experiment. However, the methodology is not tied to the data source. We are testing whether a set of selected regions on the chromosomes have a higher association with a selected set of genes than would be expected by chance – the method by which the regions and genes are selected affect the biological interpretation of the results, but not the statistical assessment of association. We apply our methods to a set of mouse QTLs identified from the literature and a set of mouse genes identified from a microarray study. First, we identified a set of 120 QTLs associated with drug abuse behaviors in mice (Jung, 2003) from the Mouse Genome Informatics database (http://www.informatics.jax.org). A set of 166 genes that are preferentially expressed in the nucleus accumbens (NA) region of the mouse brain (the NA genes) were determined from a microarray study of brain tissues (Vandenbergh et al., 2006) using Affymetrix® MG-U74Av2 arrays. This array contains about 1/3 of the coding genes in the mouse genome. Briefly, the NA genes were identified as being expressed at least 1.5-fold higher in the nucleus accumbens compared to two other brain regions, the medial basal hypothalamus and preoptic area, in one-day-old C57BL/6J mouse pups. The study did not include replication, so the statistical significance of the expression differences cannot readily be assessed; association of these genes 2 with QTLs therefore becomes an important tool to help assess the biological significance of the observed differential expression. The two regions used for comparison are physically close to the NA on the ventral surface of the brain but are largely derived from a different embryonic region of the brain (diencephalon compared to telencephalon for the NA). The NA plays an important role in drug abuse-related behavior. Our primary objective was to determine if the QTLs and gene expression study are detecting a set of genes in common that might be related to drug abuse behaviors. However, because the NA genes are selected based on their expression in a selected region of the brain, rather than their direct involvement in drug abuse behaviors, we need to be cautious about the biological interpretation of an association between the QTLs and the gene set. 2 Exploratory data analysis and quantification of association Figure 1 shows the correspondence between the set of QTLs and the set of NA genes. The long horizontal dashed lines are numbered to represent the mouse chromosomes. Note that the Y chromosome is apparently shorter than others and no data were available regarding gene expression or QTLs on it. The positions of the NA genes were determined using the Affymetrix® metadata for the MG-U74Av2 (version 1.10.0) array provided in the Bioconductor suite in R (Gentleman et al., 2004). QTL and NA genes Y X 19 18 17 16 Chromosome 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0.0 e+00 5.0 e+07 1.0 e+08 Basepairs 1.5 e+08 2.0 e+08 Fig. 1. Combined visualization of QTLs and NA genes. The short discrete horizontal segments are the spans of the QTLs defined as +/- 5 centiMorgans (cM) from the peak QTL position. The small circles in the center of every segment are the peak positions of the QTLs. Finally the vertical lines are the NA genes. 3 QTLs are reported in centiMorgans (cM), which measure recombination frequencies between markers on a chromosome. Gene locations are usually measured by the physical distance in base pairs (bp) or megabase pairs (1 Mb =106 bp). Empirically, on average 2 Mb = 1 cM in the mouse chromosome, although there are a few more accurate methods to translate cM into Mb (e.g. Silver, 1995, Fischer et al, 2003, and Voigt et al, 2004). To match QTL sets and gene sets, we need to measure locations on the same scale. We adopted the embedded conversion tool in Expressionview (Fischer et al, 2003) to estimate physical distances from cM. The “smoothing window” technique used in Expressionview essentially applies piecewise regression. However at the edge of chromosomes and some central regions possibly near to the cut points of the “smoothing windows”, Expressionview appeared to give poor estimates. In those cases we used polynomial regression to estimate physical distance from cM by using genes for which both measures are available. This method also has good performance except at the ends of some chromosomes. Any QTL with a span that extends beyond the end of a chromosome is truncated. No obvious matches between the QTL set and the NA genes can be seen in Figure 1. The visual impression does not support a strong association between QTLs and genes. We consider two approaches to quantify the strength of association. For convenience, we denote a set of QTLs, such as drug abuse QTLs, by Q and a set of genes, such as the NA genes, by G. A natural first approach is to consider the percentage of genes in G covered by the whole span of Q. The association between Q and G is strong if this number is big. This quantification reflects the “completeness” of Q in terms of covering G. This suggestion is supported by data from Drosophila, in which co-regulated genes are found in clusters (Spellman et al., 2002). A second approach is to consider whether each QTL in Q covers at least one gene in G. If a QTL in Q covers no genes in G, it is called “empty”; otherwise it is “non-empty”. The association between Q and G is strong when the percentage of empty QTLs is small. This quantification reflects the “accuracy” of Q in terms of covering G. If Q is strongly associated with G, we expect both completeness and accuracy to be high. However the two methods do not necessarily give the same result because they are measuring complementary aspects of an association. In other words, an accurate QTL set could be incomplete, and vice versa. While each method can answer the question if the association is strong in terms of completeness or accuracy, we want to develop a unique measure on both completeness and accuracy together to answer the question: is the association strong? This is defined by a weighted average of completeness and accuracy. Firstly we need to introduce some notation. Let N be the number of genes in G, M be the number of QTLs in Q, n be the number genes in G covered by Q, and m be the non-empty QTLs in Q covering genes in G. It is straightforward to define completeness C = n / N, and accuracy A = m / M. We define the combined measure of an association as C A (1) S M N The weight is chosen to diminish the effect of matching by chance. When M increases, more of the genome will be covered by Q, and the completeness, C, will also increase regardless of the strength of the underlying association. To penalize for the effect of large M, we use 1/M as the 4 weight of completeness. We weight accuracy by 1/N to penalize for increasing the size of G. The limiting behaviors of the combined measure S satisfies the need to differentiate a strong association from a noisy one in which matching results by chance. Let s be the number of genes in G really having matching relationships with some QTL in Q. Correspondingly let r be the number of QTLs in Q that really match some genes in G. Note r is not necessarily equal to s. Besides the true matching relationship, every gene has a probability p = p(M) to be covered by Q. On the other hand every QTL has a probability 1-q = 1-q(N) of being empty with respect to G, so that it has probability q of being "non-empty". By introducing the new notation, the completeness can be written as s C I{gene is matched } genes w/o true match . N Thus, the expectation of C is straightforward: s ( N s) p . EC N Similarly r A (2) (3) I{QTL is non - empty } QTLs w/o true match , M EA r ( M r )q . M (4) (5) r s ( N s ) p ( M r )q (6) MN Consider the following limiting circumstances: 1. (perfect match) when s → N and r → M, ES monotonically increases to the limit (M+N) / MN; 2. (totally random) when s → 0 and r → 0, ES monotonically decreases to the limit (Np+Mq) / MN; 3. (G has spurious genes) when N → ∞ and M is fixed, notice q=q(N) → 1 in this case, ES will converge to p / M; 4. (Q has spurious QTLs) when M → ∞ and N is fixed, notice p=p(M) → 1 in this case, ES will converge to q / N. From the above it can be concluded that the combined measure S will approach its maximum when a perfect match arises and decrease when the association weakens in some respect. Then ES 3 Statistical tests for accuracy and completeness Until the biology is fully understood, we cannot be certain if the association between Q and G is due to chance. In this section, we determine the statistical significance of the observed levels of completeness and accuracy compared to random association, by comparing with a null distribution determined by simulation. Random selection of QTLs is not readily done as selection of random intervals along the chromosomes is unlikely to model the true distribution of QTLs. However, since the physical locations of all genes on the microarray are known, random sets of genes are readily created by choosing genes at random, and considering the completeness or 5 accuracy of the QTL sets with respect to these genes. To assess the strength of association between a QTL set Q and a gene set G of size N, we compute the completeness and accuracy of Q. We then select genes at random from all the genes represented on the microarray. The simplest way to do this is to select N genes at random from the array. However, since there is considerable variability in the percentage of tissue-specific genes on each chromosome, and since the QTLs may not be randomly distributed among chromosomes, we can also consider selecting Ni genes from the ith chromosome, where Ni is the number of genes in the gene set on the chromosome. We call the latter method, the conditional method, because the number of genes selected from each chromosome is conditional on the number of genes in G on the chromosome. By contrast, the former method which selects genes completely at random is called the unconditional method. By repeatedly selecting gene sets at random and computing the completeness and accuracy for Q, a null distribution (unconditional or conditional) is computed. The p-value for the observed completeness or accuracy is the percentage of simulated data sets for which the completeness (accuracy) is as strong as or stronger than the observed value. The estimated p-values are displayed in Table 1 based on 1,000 sets of randomly selected genes. Since the p-values are based on count data, 3 continuity corrections are considered which differ in whether or not the rejection region includes the observed counts. As part of the simulation, we can also compute the distribution of number of genes covered by each QTL. Percentiles of this distribution can be used to identify QTLs that have unusually large coverage on the observed data, and are thus more likely to be associated with genes implicated in the QTL condition. Measure C (Completeness) A (Accuracy) S (Combined) Def. of p-value Conditional Unconditional p (# >observed) 0.085 ** 0.045 *** p (1/2 # observed + # >observed) 0.103 * 0.053 ** p (>= observed) 0.120 0.060 ** p (# >observed) 0.192 0.151 p (1/2 # observed + # >observed) 0.216 0.168 p (>= observed) 0.240 0.185 p (# >observed) 0.140 * 0.098 ** p (1/2 # observed + # >observed) 0.150 * 0.106 * p (>= observed) 0.159 0.113 * ***: significant at 5% level; **: significant at 10% level; * significant at 15% level Table 1. Simulated one-sided p-value for the hypothesis H0: the association is not stronger than expected by chance. The simulation result moderately supports the claim that the hypothesized association is stronger than expected by chance. The p-values for completeness are around 0.10 under both random sampling schemes. It seems the association is not significantly more accurate than expected by chance with p-values around 0.20. The observed completeness is C = 24.1%, the accuracy is A = 44.2%, and the observed S is .0047, compared with the theoretical maximum for 6 S of 0.014. Moreover P(M) and q(N) can be estimated from the simulation and hence we can estimate the three local minimum under limiting circumstances 2, 3 and 4 discussed in the end of section 2. Table 2 displays the comparison of S values under both randomization and limiting circumstances. Although it is far from the maximum, the observed S is above all the estimated local minima for genes selected at random. Limiting case defined in Section Conditional Unconditional random selection 4.06E-3 3.89E-3 spurious genes 1.69E-3 1.60E-3 spurious QTLs 2.37E-3 2.29E-3 2 1 (Theoretical maximum) 14.4E-3 Observed 4.67E-3 Table 2. Estimated limiting extrema of combined measure S The count of non-empty QTLs and covered genes can be used to construct a chi-square type of test for either accuracy or completeness. Using chromosome as the natural category, we define the test statistic as T 2 i 1 ( X i - EX i ) 2 ~ T21 under H 0 : the link is no different from random, EX i (7) X i n i , mi where EXi under H0 can be estimated by random sampling genes, and T is the total number of chromosomes. Under H0: the link is no different from random, EXi is the same as expected counts when genes are selected at random. For a given set of QTLs, we can repeatedly sample random genes. The average of observed counts of non-empty QTLs or covered genes is a consistent estimator for EXi and is quite accurate given that we repeat sampling many times. A true association between QTLs and genes will either strengthen or weaken the link which results in a larger chi-square test statistic. Moreover, the chi-square test provides us with additional insight in identifying potential candidate differentially expressed genes, if QTLs are mapping the same or very similar quantitative traits as the partner microarray study. On the one hand, a chromosome that has a large positive value suggests a region of strong association with the gene set, and hence lends support to the hypotheses that the QTLs and the covered genes are associated with the trait of interest Conversely, a chromosome for which Xi-EXi is a small positive value or a larger negative value suggests that the QTLs and the covered genes may not be associated with the trait. Thus, association between the Q and G can also be used to select QTLs and genes which are more likely to be of interest. The p-values for our study are in Table 3. Again, they support the hypothesis that completeness is higher than expected by chance, but accuracy is only marginally higher than expected by chance – i.e. more genes are covered than expected, but the number of QTLs containing genes in G is about what is expected by chance. For example, chromosome 18 and 16 7 have much higher positive values than average in both completeness and accuracy (under conditional resampling and considering the contribution in ( X i - EX i )/ EX i , chromosome 18 has 3.14 in accuracy and 6.48 in completeness; chromosome 16 has 2.10 in accuracy and 1.84 in completeness; the genome average is .34 and .51 respectively), which suggests that the genes on these two chromosomes are more likely to be associated with the drug abuse trait, and the QTLs on these two chromosomes are more likely to be associated with the NA region than genome-wide average. Conditional Unconditional ni (Completeness) <.001 *** 0.120 * mi (Accuracy) 0.097 ** 0.245 ***: significant at 5% level; **: significant at 10% level; * significant at 15% level, d.f. is 17 Table 3. p-value from the chi-square test for the hypothesis H0: the association is not different from expected by chance. A third test approach is based on the risky assumption that chromosomes are random samples from the same population when measuring the strength of an association. Then the three measures we used can be seen as random samples from two populations: one for the hypothesized association between QTL and NA genes, the other for the random association representing background strength. The measures are paired on each chromosome. A paired two-sample t-test or Wilcoxon sign-rank test (Myles et al, 1999) can then be applied. The results are in Table 4. The data and R code can be accessed from http://www.stat.psu.edu/~hanbing/qtlpaper/. Test C (Completeness) A (Accuracy) S (Combined) Conditional Unconditional 0.106 * 0.100 ** Wilcoxon 0.196 0.209 Paired t 0.199 0.316 Wilcoxon 0.261 0.290 Paired t 0.191 0.180 Wilcoxon 0.275 0.275 Paired t ***: significant at 5% level; **: significant at 10% level; * significant at 15% level Table 4. p-value from the paired t and Wilcoxon signed-rank test for the hypothesis Ho: the association is not stronger than expected by chance. 4 Conclusion and discussion The association between the QTL set and NA genes appears to be stronger than expected by 8 chance in terms of completeness under both randomization schemes. However, the statistical significance of accuracy is weaker. Using the simulated one-sided p-value in Table 1 and the chi-square test on the counts, we conclude that NA genes are significantly more complete in the QTL spans than expected by chance at minimum at the 15% significance level. However, it seems that the QTL set is not accurate in terms of matching NA genes, i.e. most tests fail to reject null hypothesis even at 15% level. The combined measure S strikes a balance between completeness and accuracy. The simulated one-sided p-values reject the null hypothesis in most cases. The p-values from the paired tests including both t-test and Wilcoxon test in Table 4 should be used cautiously. The assumption that chromosomes are an i.i.d. sample from a population is dubious. From Figure 1 at least three features differ distinctly among chromosomes: length, number and location of NA genes, and number and location of QTLs. We noticed that the paired tests produce p-values with similar patterns to other tests but larger values. In summary, there appears to be moderate evidence that the association between QTL and NA genes is stronger than expected by chance. In particular, the QTLs cover NA genes more completely than expected by chance, and there appear to be an excess of QTLs that are not associated with NA genes. A strong association was expected between the NA genes and the drug abuse QTLs. However, this association is only moderately stronger than expected by chance. A possible reason is that the randomly selected genes were selected from those represented on the Affymetrix® array U74Av2, which consists of about one third of the whole genome, so that QTLs may be empty because the associated gene is not on the array. A second possibility is that there are a number of QTLs without association with NA genes. Finally, because the genes were selected from an unreplicated study based on fold-change, we can expect large false detection and nondetection rates. The false detection rate implies that the NA genes likely include several genes from the array which are not associated with the NA which will decrease the statistical significance of completeness somewhat, as the false discoveries similar to genes selected at random. We can expect a much larger number of false nondetections which should induce a correspondingly larger impact on the power to detect the accuracy of the QTLs. This is consistent with high completeness and low accuracy of the QTL set. Completeness, accuracy, and the combined measure, have been proposed as methods to determine whether a set of QTLs and a set of genes are associated. The statistical significance of the association can be estimated by selecting sets of genes at random from the population of genes from which the gene set was determined. When p-values or other measures of strength of association between the trait of interest and the QTL are available, we might consider weighting the measures of accuracy so that a penalty is incurred if a QTL highly associated with the trait is empty, and a gain is incurred if a gene is covered by a highly associated QTL. Weights on the QTLs can readily be incorporated in the simulations required to estimate the p-values, because the QTLs are fixed in the simulation. Weights on the genes are less readily handled, because weights are not available for genes selected at random. The chi-squared test provides a simple method to identify QTLs and genes from the gene set that are most highly associated. A QTL that covers more genes than expected by chance is likely to include a cluster of genes from the gene set, which lends credence to the hypotheses that 9 the QTL and the covered genes are associated with the trait of interest. References Carelli RM, and Wightman RM (2004) Functional microcircuitry in the accumbens underlying drug addiction: insights from realtime signaling during behavior, Curr Opin Neurobiol. 14, 763-768. Fischer, G, Ibrahim, SM, Brockmann, GA, Pahnke, J, Bartocci, E, Thiesen, H, Serrano-Fernandez, P, and Molle, S. (2003) Expressionview: visualization of quantitative trait loc and geneexpression data in Ensembl. Genome Biology, 4: R77. Gentleman, RC, Carey, VJ, Bates, DM, Bolstad, B, Dettling, M, Dudoit, S, Ellis, B, Gautier, L, Ge, Y, and Gentry, J. (2004) Bioconductor: Open software development for computational biology and bioinformatics. Genome Biology 5: R80. Hollander, M., Wolfe DA. (1999) Nonparametric statistical inference 2nd ed. John Wiley & Sons. New York, US. Jung, M. (2003) unpublished honors BS thesis, The Pennsylvania State University. Silver, LM. (1995) Mouse genetics: concepts and applications. Oxford University Press, Oxford, UK. Spellman PT, Rubin GM. (2002) Evidence for large domains of similarly expressed genes in the Drosophila genome. Journal of Biology 1:5.1-5. Voigt C, Moller S, Ibrahim SM, Serrano-Fernandez P. (2004). Non-linear conversion between genetic and physical chromosomal distances. Bioinformatics. 20:1966-1967. Wang J, Williams RW, Manly KF. (2003) WebQTL: Web-based complex trait analysis. Neuroinformatics 1: 299-308. Wayne, ML and Mclntyre, LM (2002) Combining mapping and arraying: an approach to candidate gene identification. PNAS:Genetics, 99, 14903-14906. Vandenbergh, DJ, Mong, JA, Klein, LC, Vogler, GP, Callegari, F, Chiaromonte, F, Stine, MM, Peterson, R, and Pfaff, DW (in preparation) Prenatal Nicotine Exposure Regulates Gene Expression in a Sex-Dependent Manner 10