Research

Gene body methylation shows distinct patterns associated with different gene origins and duplication modes and has a heterogeneous relationship with gene expression in Oryza sativa (rice)

Yupeng Wang1,2, Xiyin Wang1,3, Tae-Ho Lee1, Shahid Mansoor1 and Andrew H. Paterson1

1Plant Genome Mapping Laboratory, University of Georgia, Athens, GA 30602, USA; 2Computational Biology Service Unit, Cornell University, Ithaca, NY 14853, USA; 3Center for Genomics and Computational Biology, School of Life Sciences, School of Sciences, Hebei United University, Tangshan, Hebei, 063000, China

Author for correspondence: Andrew H. Paterson
Tel: +1 706 583 0162
Email: paterson@plantbio.uga.edu

Received: 31 October 2012
Accepted: 6 December 2012

New Phytologist (2013)
doi: 10.1111/nph.12137

Key words: correlation analysis, DNA methylation, gene body, gene duplication, gene origin, Ks, rice (Oryza sativa).

Summary

Whole-genome duplication (WGD) has been recurring and single-gene duplication is also widespread in angiosperms. Recent whole-genome DNA methylation maps indicate that gene body methylation (i.e. of coding regions) has a functional role. However, whether gene body methylation is related to gene origins and duplication modes has yet to be reported. In rice (Oryza sativa), we computed a body methylation level (proportion of methylated CpG within coding regions) for each gene in five tissues. Body methylation levels follow a bimodal distribution, but show distinct patterns associated with transposable element-related genes; WGD, tandem, proximal and transposed duplicates; and singleton genes. For pairs of duplicated genes, divergence in body methylation levels increases with physical distance and synonymous (Ks) substitution rates, and WGDs show lower divergence than single-gene duplications of similar Ks levels.
Intermediate body methylation tends to be associated with high levels of gene expression, whereas heavy body methylation is associated with lower levels of gene expression. The biological trends revealed here are consistent across five rice tissues, indicating that genes of different origins and duplication modes have distinct body methylation patterns, and body methylation has a heterogeneous relationship with gene expression and may be related to survivorship of duplicated genes.

© 2013 The Authors. New Phytologist © 2013 New Phytologist Trust

Introduction

Gene duplication is a primary mechanism for the evolution of novelty and complexity in higher organisms (Ohno, 1970; Flagel & Wendel, 2009; Innan & Kondrashov, 2010). It is now known that genes may be duplicated by various modes, generally referred to as large-scale and small-scale duplications (Maere et al., 2005; Casneuf et al., 2006; Ganko et al., 2007; Freeling, 2009; Wang et al., 2012). The most frequent consequence of gene duplication is reversion to single-copy (singleton) status (Freeling & Thomas, 2006; Freeling, 2009); however, genes retained in duplicate offer the potential for the evolution of novelty (Ohno, 1970; Flagel & Wendel, 2009; Innan & Kondrashov, 2010). Thus, the study of mechanisms for gene retention and evolution in view of different gene duplication modes is very important (Wang et al., 2012).

Oryza sativa (rice) is a good model to elucidate the genetic mechanisms and evolutionary features of different gene duplication modes (Wang et al., 2007, 2011; Li et al., 2009). Rice has experienced at least two whole-genome duplications (WGDs), one shared with most if not all cereals (ρ), and another more ancient event (σ) (Paterson et al., 2004; Tang et al., 2010). In angiosperm species, most duplicated chromosomal segments are thought to arise from WGDs (Tang et al., 2008a,b).
Small-scale gene duplications, often referred to as single-gene duplications, are also widespread in rice (Wang et al., 2007, 2011; Li et al., 2009). According to the physical distance between duplicates, single-gene duplications can be further classified into local and transposed gene duplications (Ganko et al., 2007; Wang et al., 2011, 2012). Local duplications may occur as tandem duplications (i.e. duplicated genes are consecutive in the genome), which may be caused by illegitimate chromosomal recombination (Freeling, 2009), or proximal duplications (i.e. separated by one or more genes), which may be caused by localized transposon activities (Zhao et al., 1998; Wang et al., 2011, 2012). Transposable element (TE)-related genes comprise a significant portion of rice protein-coding genes (Yuan et al., 2005; Jiao & Deng, 2007). TE-related genes have normal gene structures with coding capacity and transcriptional activity, but share significant sequence similarity with known TEs (Jiao & Deng, 2007). Transposed duplications that create two gene copies far away from each other are widespread in plants (Freeling et al., 2008; Freeling, 2009; Woodhouse et al., 2010, 2011; Wang et al., 2011, 2012), suggesting that many non-TE-related genes are also mobile, via either DNA- or RNA-mediated transposition (Cusack & Wolfe, 2007). Transposed duplicates may also occur by intrachromosomal recombination (Woodhouse et al., 2011).

Divergence between duplicated genes increases with time, but the rate/extent of divergence is affected by gene duplication modes (Casneuf et al., 2006; Arabidopsis Interactome Mapping Consortium, 2011; Wang et al., 2011). Generally, WGD duplicates are less divergent than other duplicates (Casneuf et al., 2006; Ganko et al., 2007; Li et al., 2009; Wang et al., 2011).
Moreover, singletons show higher interspecies conservation than duplicates based on cross-species comparison of genomic and expression data (Ha et al., 2009; Wang et al., 2011). Indeed, the distinct evolutionary effects of gene duplication modes may, in turn, affect the rates of gene retention, depending on functional category-specific selection pressures on neo-functionalization, functional buffering or high expression (Freeling, 2009; Innan & Kondrashov, 2010; Wang et al., 2012). Under-explored and controversial in the current literature are the roles of epigenetic marks in gene duplication, evolution and retention. DNA methylation is one of the most important epigenetic marks, and high-resolution whole-genome DNA methylation maps based on bisulfite sequencing have been made for rice (Feng et al., 2010; Zemach et al., 2010a,b). Previous analyses of whole-genome DNA methylation data have suggested that rice DNA methylation occurs predominantly at cytosine followed by guanine, that is, ‘CpG’ dinucleotides (Feng et al., 2010; Zemach et al., 2010b). Gene body methylation (DNA methylation of coding regions) is conserved across eukaryotic lineages (Lee et al., 2010; Su et al., 2011). Although it is broadly accepted that promoter methylation is generally associated with the repression of plant gene expression (Zhang et al., 2006; Su et al., 2011), the functional roles of gene body methylation are controversial (Lee et al., 2010; Su et al., 2011). To date, gene body methylation has been suggested to enhance accurate splicing of primary transcripts (Lorincz et al., 2004; Kolasinska-Zwierz et al., 2009; Schwartz et al., 2009; Luco et al., 2010) and/or prevent ‘leaky’ expression from intragenic cryptic promoters (Zilberman et al., 2007; Maunakea et al., 2010). In Arabidopsis and rice, association of gene body methylation with active transcription has been proposed (Zhang et al., 2006; Zilberman et al., 2007; Zemach et al., 2010b; Takuno & Gaut, 2012). 
By contrast, several studies in rice have suggested that the major effect of body methylation on gene expression is repression (Li et al., 2008; He et al., 2010). From the point of view of evolution, body-methylated genes have been suggested to be functionally important and to evolve slowly (Sarda et al., 2012; Takuno & Gaut, 2012). However, the interplay between gene body methylation and gene duplication, as well as the evolution of duplicate genes, has been little explored.

Study of the potential interplay between gene body methylation and gene origins and duplications may help us to understand the roles of epigenetic factors in shaping current genomes, as well as the mechanisms underlying gene duplications and evolution. In rice, we analyzed single-base resolution, whole-genome DNA methylation maps of five tissues (Zemach et al., 2010a,b). For each gene, we computed a body methylation level (proportion of methylated CpG dinucleotides within coding regions) in each tissue. We classified rice genes into different origins and duplication modes, including TE-related genes, singletons, and WGD, tandem, proximal and transposed duplicates, and compared the body methylation levels among different categories of genes. For duplicated genes, we examined divergence in body methylation levels and its relationship with coding sequence divergence. Furthermore, we studied the potential relationships between body methylation and duplicate gene retention. Finally, we investigated the complicated relationships between body methylation and gene expression levels.

Materials and Methods

Sequence sources

The rice gene set was retrieved from the Rice Genome Annotation Project (TIGR5, http://rice.plantbiology.msu.edu/). The gene sets of outgroups, including Sorghum bicolor, Brachypodium and Zea mays, were retrieved from Phytozome (http://www.phytozome.net/).
For each gene, only the first transcript in the genome annotation (transcript name suffixed by ‘.1’) was used for analysis.

Identification of genes of different origins

Rice genes were first divided into TE-related and non-TE-related genes, according to TIGR5. The non-TE-related genes were further classified into WGD duplicates, singletons, tandem, proximal, transposed and dispersed duplicates. To this end, the population of potential gene duplications in rice was identified using BLASTP (Altschul et al., 1990) (TE-related genes were not considered for BLASTP). For each gene, only the top five nonself BLASTP matches that met a threshold of E < 10⁻¹⁰ were considered as potential gene duplication relationships. The genes without any BLASTP hit were deemed singletons. WGD duplicates were obtained from a previous study (Tang et al., 2010). We then derived single-gene duplications by excluding pairs of WGD duplicates from the population of gene duplications. Tandem duplicates were adjacent homologs and proximal duplicates were not adjacent, but within 10 annotated genes of each other on the same chromosomes and without any paralog between them. The remaining single-gene duplications, that is, after deduction of the tandem and proximal duplications, were searched for transposed duplications. To accomplish this aim, genes at ancestral (i.e. interspecies collinear) chromosomal positions were discerned by aligning syntenic blocks within rice and between rice and its outgroups, including Sorghum bicolor, Brachypodium and Zea mays. For a pair of transposed duplicates, we required that one duplicate was at its ancestral locus and the other was at a nonancestral locus, named the parental duplicate and transposed duplicate, respectively. For a transposed duplicate, there may be multiple ancestral paralogs, and we regarded the ancestral paralog with highest sequence identity as its parental duplicate.
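As a rough illustration of the tandem and proximal rules given above, the following Python sketch classifies one single-gene duplicate pair from the ordinal positions of the two genes along a chromosome. The function name, its arguments and the reading of ‘within 10 annotated genes’ as a rank difference of at most 10 are assumptions made for this example; they are not taken from the authors' pipeline.

```python
def classify_local_pair(idx_a, idx_b, chrom_a, chrom_b, paralog_indices, max_gap=10):
    """Label a single-gene duplicate pair as 'tandem', 'proximal' or 'other'.

    idx_a, idx_b     -- ordinal gene positions along the chromosome (hypothetical input)
    paralog_indices  -- positions of other paralogs of the pair on that chromosome
    max_gap          -- assumed reading of 'within 10 annotated genes'
    """
    if idx_a == idx_b and chrom_a == chrom_b:
        raise ValueError("a gene cannot be paired with itself")
    if chrom_a != chrom_b:
        return "other"
    lo, hi = sorted((idx_a, idx_b))
    gap = hi - lo
    if gap == 1:
        # Adjacent homologs
        return "tandem"
    if gap <= max_gap and not any(lo < p < hi for p in paralog_indices):
        # Not adjacent, close by, and no paralog lying between the two genes
        return "proximal"
    # Left over for the transposed / dispersed search
    return "other"
```

Pairs labelled ‘other’ here would be the ones passed on to the search for transposed duplications.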
The remaining duplicates, which did not belong to any of the WGD, tandem, proximal or transposed categories, were simply denoted as dispersed duplicates.

Rice whole-genome DNA methylation data

Rice single-base resolution DNA methylation data of embryo, endosperm, leaf, root and shoot tissues, generated by bisulfite sequencing technology, were obtained from two previous studies (Zemach et al., 2010a,b). We used the processed data provided by the authors, available at the Gene Expression Omnibus database (accession numbers: GSM497260, GSM560562, GSM560563, GSM560564 and GSM560565). In the processed data, the likelihood of methylation was shown for each CpG, CHG and CHH site, whose chromosomal position was annotated according to TIGR5. Only CpG methylation was considered in this study. The likelihood of CpG methylation showed a strong bimodal distribution, and we regarded a value of > 0.5 as methylation of CpG dinucleotides.

Comparing the distributions of body methylation levels

As body methylation levels tend to be bimodally distributed, it is not reasonable to compute a single mean and standard deviation of body methylation levels for a gene group. To compare the distributions of body methylation levels of different gene groups, we used both parametric and nonparametric tests: (1) parametric test: we counted the numbers of genes with low methylation (body methylation level < 0.1), intermediate methylation (0.1 ≤ body methylation level ≤ 0.9) and high methylation (body methylation level > 0.9) for each gene group, and then compared these counts between gene groups using a χ² test; and (2) nonparametric test: the comparison of the distributions of body methylation levels between two gene groups was modeled as testing whether one gene group had more outliers (highly body-methylated genes) than the other group.
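The count-based comparison in (1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are invented for the example, every category is assumed to contain at least one gene across the groups (so that no expected count is zero), and the closing comment uses the fact that a χ² variable with two degrees of freedom has the upper-tail probability exp(−x/2).

```python
import math

def bin_counts(levels):
    """Count genes with low (< 0.1), intermediate and high (> 0.9) body methylation."""
    low = sum(1 for x in levels if x < 0.1)
    high = sum(1 for x in levels if x > 0.9)
    return [low, len(levels) - low - high, high]

def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table.

    table -- list of rows, one row of category counts per gene group
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

# Two gene groups x three categories gives df = 2, where P = exp(-stat / 2).
stat, df = chi_square([[10, 20, 30], [30, 20, 10]])
p_value = math.exp(-stat / 2) if df == 2 else None
```

For comparisons with more than two gene groups the degrees of freedom exceed two, and the P value would have to come from a general χ² distribution function rather than the shortcut above.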
The Outlier-Sum statistic (Tibshirani & Hastie, 2007) was adopted. P values were assessed based on 10⁴ permutations of the pooled body methylation levels of the two gene groups for comparison.

Ks calculation

Protein sequences of duplicated genes were aligned using ClustalW (Thompson et al., 1994) with default parameters. Then, the protein alignment was converted to a coding sequence alignment using the ‘Bio::Align::Utilities’ module in the BioPerl package (http://www.bioperl.org/). Ks was calculated using the methods of Nei & Gojobori (1986) and Yang & Nielsen (2000), via the ‘Bio::Align::DNAStatistics’ and ‘Bio::Tools::Run::Phylo::PAML::Yn00’ modules, respectively, in the BioPerl package. It should be noted that extremely high levels of sequence divergence between duplicated genes may cause the ‘Bio::Align::DNAStatistics’ module to generate invalid Ks values, which were then ruled out from the related analysis. Following a previous study in rice (Tang et al., 2010), we excluded Ks values for gene pairs with average third-codon-position GC content (GC3) > 75% from related statistical analyses because there are two distinct groups of genes with significantly different GC3. Ks values > 3.0 were also excluded because of saturated substitutions at synonymous positions.

Gene expression data

Processed rice expression data over 508 tissues and physiological conditions, generated by the Affymetrix GeneChip Rice Genome Array, were obtained from previous studies (Ficklin et al., 2010; Wang et al., 2011). In the data, the numbers of columns that sampled embryo, endosperm, leaf, root and shoot were 3, 4, 50, 99 and 84, respectively. For some genes, there are multiple probe sets on the array to measure their expression.
Inclusion or exclusion of ‘suboptimal’ probe sets with suffix ‘_s_at’ or ‘_x_at’, which were suspected of potential cross-hybridization, has been shown previously to have only trivial effects (Wang et al., 2011). In this study, all types of probe sets were considered and, for a gene with multiple probe sets, the first probe set according to alphabetic sorting was used to represent its expression profile.

Correlation analysis and smoothing spline regression

In this study, correlations were measured by Spearman’s correlation coefficients. Smoothing spline regression was performed via the ‘smooth.spline’ function of the R language. To avoid overfitting in smoothing spline regression, three values for the degrees of freedom (2, 4 and 6) were tested.

Results

Gene origins in rice

Like many other eukaryotic species, the rice genome has been shaped and dynamically reconstructed by multiple evolutionary forces and events, so that its genes have different origins (International Rice Genome Sequencing Project, 2005). TE-related genes are classified on the basis of sharing significant sequence similarity with TEs (Jiao & Deng, 2007). Among non-TE-related genes, those present in only single copies were deemed to be singletons, whereas others were deemed to be duplicated. Duplicated genes were further classified in terms of duplication modes, with those at collinear positions of intraspecies syntenic blocks deemed to be WGD duplicates (Tang et al., 2010). All other duplicates were assumed to have occurred by single-gene duplications, further classified into tandem, proximal and dispersed, as described above. The mechanisms underlying dispersed duplications are very complicated (Wang et al., 2012). However, if one member of a pair of dispersed duplications was at its ancestral locus and the other was at a nonancestral locus, such gene duplications were deemed to be transposed (Wang et al., 2011, 2012).
Summary statistics on rice gene origins are shown in Table 1, and the classification of duplicated genes is shown in Supporting Information Table S1.

Table 1 Statistics on rice (Oryza sativa) genes of different origins and duplication modes

Gene origin         Number of gene pairs    Number of distinct genes
Non-TE-related      N/A                     41 046
  Singletons        N/A                     12 618
  Duplicates        N/A                     28 428
    WGD             3087                    5061
    Tandem          2008                    3529
    Proximal        2484                    3728
    Transposed      6269                    6269
    Dispersed       N/A                     12 957
TE-related          N/A                     15 232

N/A, not applicable; TE, transposable element.

Body methylation levels show different distributions associated with gene origins and duplication modes

To investigate the patterns of gene body methylation in view of different gene origins and duplication modes, we computed the body methylation level for each gene, defined as the proportion of methylated CpG dinucleotides relative to all CpG dinucleotides within its coding region, in embryo, endosperm, leaf, root and shoot. To test the consistency of body methylation levels across tissues, we visualized the body methylation levels of all genes between all pairs of tissues via scatter plots (Fig. S1). Although endosperm tissue shows higher variations than other tissues, body methylation levels are much more likely to be consistent (rather than different) across tissues, that is, points (genes) are densely distributed along the ‘y = x’ diagonal line in the scatter plots. This analysis indicates that it is feasible to study the evolutionary characteristics of body methylation for large groups of genes with the acknowledgement of the existence of tissue-specific body methylation for specific genes.

A recent study has suggested that gene bodies cluster into two groups corresponding to high and low levels of DNA methylation, respectively, in honeybee, silkworm, sea squirt and sea anemone (Sarda et al., 2012).
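The body methylation level defined above reduces to a simple proportion. A minimal Python sketch follows, assuming that the per-CpG methylation likelihoods for one gene's coding region in one tissue have already been collected into a list; the function name and the choice to return None for a gene without any CpG site are assumptions of this example.

```python
def body_methylation_level(cpg_likelihoods, threshold=0.5):
    """Proportion of CpG sites in a coding region called methylated.

    cpg_likelihoods -- methylation likelihoods (0..1), one per CpG site (hypothetical input)
    threshold       -- a site counts as methylated when its likelihood is > threshold
    """
    if not cpg_likelihoods:
        # No CpG site in the coding region: the level is undefined
        return None
    methylated = sum(1 for p in cpg_likelihoods if p > threshold)
    return methylated / len(cpg_likelihoods)
```

For example, a gene with likelihoods 0.9, 0.8, 0.1 and 0.5 has two of four sites called methylated (0.5 itself does not exceed the threshold), giving a level of 0.5.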
We plotted the distribution of body methylation levels for all rice genes (Fig. 1a), finding a clear bimodal distribution peaking at ‘0’ or ‘1’, suggesting that gene bodies tend to be either highly methylated or little methylated in rice.

We found that different gene origins differ in the distributions of body methylation levels. First, we compared the distributions of body methylation levels between TE-related and non-TE-related genes, and found that the two distributions were significantly different (P < 2.2 × 10⁻¹⁶, χ²; P < 10⁻⁴, Outlier-Sum statistic; see the Materials and Methods section) (Fig. 1b). Specifically, most TE-related genes are highly body-methylated (body methylation level > 0.9), consistent with previous studies (Zilberman et al., 2007; Li et al., 2008; Feng et al., 2010; He et al., 2010; Zemach et al., 2010b), whereas non-TE-related genes are bimodally distributed, with more genes little body-methylated (body methylation level < 0.1). As noted previously, TE-related genes exhibit much lower transcriptional activities than non-TE-related genes (Jiao & Deng, 2007), suggesting that high levels of body methylation may be associated with reduced transcription, and conflicting with the hypothesis that body methylation has only minor, but positive, effects on the levels of gene expression (Zhang et al., 2006; Zilberman et al., 2007; Zemach et al., 2010b; Takuno & Gaut, 2012).

We compared the distributions of body methylation levels between different origins within non-TE-related genes. Singletons show a higher frequency of high body methylation than do duplicates (Fig. 1c; P < 2.2 × 10⁻¹⁶, χ²; P < 10⁻⁴, Outlier-Sum statistic; see the Materials and Methods section). Tandem, proximal and transposed duplicates show an obvious frequency peak of high body methylation (Fig. 1d), whereas WGD duplicates do not (P < 2.2 × 10⁻¹⁶, χ²; P < 10⁻⁴, Outlier-Sum statistic; see the Materials and Methods section).
Moreover, the likelihood of a duplicated gene being highly body-methylated follows the tendency: transposed > proximal > tandem > WGD (P < 2.2 × 10⁻¹⁶, χ²; P < 10⁻⁴, Outlier-Sum statistic; see the Materials and Methods section). In partial summary, body methylation levels show different distributions associated with gene origins and duplication modes, suggesting that genes of different origins tend to have distinct epigenetic features.

Divergence in body methylation levels between duplicated genes

Genes duplicated by different modes differ in the extent of expression divergence and the rewiring of protein–protein networks (De Smet & Van de Peer, 2012; Wang et al., 2012). Here, we examined whether duplicated genes of different modes also differ significantly in divergence in body methylation levels. Divergence in body methylation levels among gene pairs duplicated by different modes (Fig. 2a) showed the following trend: random gene pairs > transposed duplicates > proximal duplicates > tandem duplicates > WGD duplicates (both an ANOVA model involving all duplication modes and Tukey’s honestly significant difference (HSD) test between adjacent duplication modes were significant at α = 0.05), indicating that different modes of gene duplication tend to result in different extents of divergence in body methylation levels.

The physical distance between single-gene duplicates (in terms of number of genes apart) also followed a trend: transposed duplicates > proximal duplicates > tandem duplicates. We hypothesized that there may be position effects that affect body methylation levels, for example, genes that are closer to each other on chromosomes tend to have more similar body methylation levels. To test this hypothesis, we randomly selected 20 000 gene pairs on the same chromosomes and computed the correlations between divergence in body methylation levels and physical distance.
These correlations ranged from 0.053 to 0.061 (P < 4.2 × 10⁻¹⁴), indicating that there exist weak position effects that affect body methylation levels for all rice genes. For single-gene duplicates, these correlations ranged from 0.111 to 0.137 (P < 2.2 × 10⁻¹⁶), indicating that the position effects increase slightly for single-gene duplicate pairs relative to random gene pairs. At the same physical distance, single-gene duplicates diverge less in body methylation levels than do random gene pairs (Fig. 2b), suggesting that body methylation patterns are either copied or recapitulated following gene duplication.

Fig. 1 Gene body methylation shows different patterns associated with gene origins and duplication modes. Each column represents one tissue. (a) Distribution of body methylation levels for all rice genes. (b) Comparison of distributions of body methylation levels between transposable element (TE)-related and non-TE-related genes. (c) Comparison of distributions of body methylation levels between singleton and duplicate genes. (d) Comparison of distributions of body methylation levels among whole-genome duplication (WGD), tandem, proximal and transposed duplicates.

Relationship between body methylation patterns and Ks for pairs of duplicated genes

To understand how gene body methylation evolves following gene duplication, it may be helpful to relate patterns of body methylation of duplicated genes to the divergence of their coding sequence. Synonymous (Ks) substitution rates largely reflect the neutral mutation rates of coding sequences, suggested to increase approximately linearly with time for relatively low levels of sequence divergence (Li, 1997). We first related divergence in body methylation levels between duplicated genes to Ks using linear regression (Fig. 3a).
Positive correlations were found for all duplication modes (0.113 ≤ r ≤ 0.175, P < 2.2 × 10⁻¹⁶). For single-gene duplicates, these correlations ranged from 0.112 to 0.185 (P ≤ 1.081 × 10⁻⁹). However, as we have shown that, for single-gene duplicates, there is a weak correlation between divergence in body methylation levels and physical distance, the position effects could be a nuisance factor for the correlation between divergence in body methylation levels and Ks. To remove the effect of physical distance on these correlations for single-gene duplicates, we computed the partial correlations between divergence in body methylation levels and Ks. These partial correlations ranged from 0.101 to 0.159 (P ≤ 3.794 × 10⁻⁸), declining by 0.01–0.03 from their corresponding correlations, indicating that physical distance has a very weak effect on the correlation between divergence in body methylation levels and Ks. Thus, divergence in body methylation levels between duplicated genes tends to increase with Ks. Moreover, at similar Ks levels, WGDs tend to have smaller divergence in body methylation levels between duplicates than do tandem, proximal or transposed duplications.

Fig. 2 Divergence in body methylation levels between duplicated genes. Each column represents one tissue. (a) Comparison of divergence in body methylation levels among different modes of gene duplication. Whiskers correspond to the minimum and maximum values in the data. (b) Linear regressions between divergence in body methylation levels and physical distance for random gene pairs and single-gene duplicate pairs.
The different extent of divergence in body methylation levels between gene duplication modes may be explained by the hypothesis that WGDs generate duplicated chromosomal segments in which collinear duplicates are more likely to have similar chromatin environments, whereas single-gene, especially transposed, duplications re-locate to new chromosomal positions which often have different chromatin environments.

Next, we related the body methylation levels of duplicated genes to Ks using linear regression (Fig. 3b). The direction of the correlations differs among different modes of gene duplication: body methylation of WGD duplicates is positively correlated with Ks (0.051 ≤ r ≤ 0.084, P < 0.05), whereas body methylation of single-gene duplicates decreases with Ks (−0.212 ≤ r ≤ −0.082, P < 9.4 × 10⁻⁴). Some duplicated genes are highly methylated, particularly those generated by single-gene duplications. It is well known that single-gene duplicates have a shorter half-life than WGD-generated duplicates (Lynch & Conery, 2000). Different rates of nonrandom gene loss shortly after WGD and single-gene duplication may contribute to the contrasting directions of the correlations between body methylation levels and Ks. In the first few million years following single-gene duplication, many duplicates become nonfunctionalized and are lost (Innan & Kondrashov, 2010). Biases among these genes may mitigate the long-term tendency towards increased body methylation, as in WGD duplicates, for example if highly body-methylated duplicates are preferentially lost. Thus, there could be links between body methylation patterns and the probability of long-term survival of duplicated genes.
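The partial correlations used in this section, which relate divergence in body methylation levels to Ks while controlling for physical distance, can be illustrated with a small standard-library Python sketch. The text does not state how the partial correlations were computed; the version below applies the usual first-order partial correlation formula to Spearman coefficients, and that choice, like the function names, is an assumption of this example.

```python
import math

def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman(x, y):
    """Spearman's coefficient: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def partial_spearman(x, y, z):
    """Correlation of x and y after removing the linear effect of z on the ranks."""
    rxy, rxz, ryz = spearman(x, y), spearman(x, z), spearman(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))
```

Here x would hold the divergence in body methylation levels for each duplicate pair, y the corresponding Ks values and z the physical distances; a partial coefficient close to the plain Spearman coefficient indicates that distance contributes little to the association.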
Relationship between gene body methylation and gene expression

The observation that TE-related genes are highly body-methylated, but little expressed, appears to conflict with the observation that body methylation has a positive effect on the levels of gene expression (Zhang et al., 2006; Zilberman et al., 2007; Zemach et al., 2010b; Takuno & Gaut, 2012). However, these two observations might be reconciled if gene body methylation has heterogeneous effects on gene expression, that is, if gene body methylation affects gene expression in different ways under different conditions. We plotted the regression lines between gene expression levels and body methylation levels for all non-TE-related genes in each tissue, using smooth splines with different degrees of freedom (Fig. 4); this showed that intermediate body methylation tends to be associated with higher gene expression levels than either low or high body methylation. To test this observation statistically, we computed the correlations between body methylation levels and expression levels separately for genes with body methylation levels of < 0.5 and ≥ 0.5. These correlations ranged from 0.223 to 0.284 (P < 2.2 × 10⁻¹⁶) when the body methylation level was < 0.5, and from −0.182 to −0.101 (P ≤ 1.648 × 10⁻⁹) when the body methylation level was ≥ 0.5. This result suggests that intermediate body methylation may indeed have positive effects on transcription, possibly through the enhancement of accurate splicing of primary transcripts, whereas high body methylation is more likely to repress gene expression, which may lead to pseudofunctionalization or gene loss.

We related gene expression to variances of body methylation levels across tissues. Based on Fig. S1, we inferred that TE-related genes tend to have more uniform body methylation levels (closer to the ‘y = x’ diagonal line) than do non-TE-related genes, which was then confirmed statistically by a two-sample t-test on the variances of body methylation levels between TE-related and non-TE-related genes (P < 2.2 × 10⁻¹⁶). This observation indicates that the ‘repressive’ TE-related body methylation tends to be uniform across tissues. For non-TE-related genes, we found a significant positive correlation (r = 0.173, P < 2.2 × 10⁻¹⁶) between the average expression levels and variances of body methylation levels, indicating that non-TE-related genes with high expression tend to vary in body methylation across tissues.

Fig. 3 Relationships between patterns of body methylation and Ks for duplicated genes. Each column represents one tissue. (a) Linear regressions between divergence in body methylation levels and Ks for different modes of gene duplication. (b) Linear regressions between body methylation levels and Ks for different modes of gene duplication.

Discussion

We have related gene body methylation to gene origins and duplication modes in rice. Our results suggest that genes of different origins and duplication modes are associated with different patterns of gene body methylation, and that highly body-methylated genes are preferentially lost following gene duplication. Although it is known that natural variations in DNA methylation exist among individuals of a species (Becker et al., 2011; Bell et al., 2011; Fraser et al., 2012) and that, within an individual, many cytosines may be differentially methylated among different tissues (Zemach et al., 2010a; Zhang et al., 2011; Vining et al., 2012) or developmental stages (Alisch et al., 2012), or between normal and stress conditions (Chinnusamy & Zhu, 2009), our analyses of body methylation patterns based on five different tissues reveal highly consistent evolutionary trends. We summarized a body methylation level for each gene that may involve hundreds of CpG dinucleotides. Further, we compared body methylation levels among large groups of genes, with each group consisting of several thousand genes. Thus, by mitigating the effect of dynamic changes in methylation status that may occur at some cytosine nucleotides, our computational procedure is reliable for large-scale evolutionary analyses.

Fig. 4 Gene body methylation has heterogeneous effects on gene expression. Smooth spline curves are fitted between gene expression levels and body methylation levels for all non-transposable element (TE)-related genes, based on different degrees of freedom. A body methylation level of 0.5 appears to be a point dividing the up- and down-regulation of gene expression levels.

DNA methylation is an important epigenetic mark and can affect the nucleotide composition of DNA sequences. DNA methylation can trigger the spontaneous deamination of methylcytosine to thymine (Bird, 1980; Jones et al., 1987; Pfeifer, 2006), which makes DNA methylation levels and GC levels interdependent. The data of this study showed strong negative correlations (−0.514 ≤ r ≤ −0.458, P < 2.2 × 10⁻¹⁶) between body methylation levels and the GC content at the third codon position (GC3) for rice genes. The evolution of DNA methylation patterns and DNA sequences can be intermingled, and the study of DNA methylation evolution may facilitate the understanding of mechanisms for DNA sequence evolution.
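As a minimal sketch of the two per-gene quantities correlated in the preceding paragraph, the following Python computes a body methylation level (the proportion of methylated CpG sites within a coding sequence) and GC3 (GC content at third codon positions), together with a plain Pearson coefficient. The function names, the representation of methylation calls as 0-based cytosine positions on one strand, and the assumption that the sequence is in frame are all illustrative choices; this is not the authors' pipeline.

```python
import math
from typing import Iterable, List, Sequence


def cpg_sites(cds: str) -> List[int]:
    """Return 0-based positions of the cytosine in each CpG dinucleotide."""
    seq = cds.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]


def body_methylation_level(cds: str, methylated: Iterable[int]) -> float:
    """Proportion of CpG cytosines in the coding sequence called methylated.

    `methylated` is a hypothetical input: positions of cytosines that an
    upstream methylation-calling step (not shown) has marked as methylated.
    """
    sites = cpg_sites(cds)
    if not sites:
        return math.nan  # undefined when the sequence contains no CpG
    called = set(methylated)
    return sum(1 for i in sites if i in called) / len(sites)


def gc3(cds: str) -> float:
    """GC content at third codon positions (sequence assumed to be in frame)."""
    third = cds.upper()[2::3]  # bases at indices 2, 5, 8, ...
    if not third:
        return math.nan
    return sum(1 for base in third if base in "GC") / len(third)


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient for two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

Given one body methylation level and one GC3 value per gene, `pearson_r(levels, gc3_values)` would yield a coefficient of the kind reported above; the associated P value would require a separate significance test, which is not shown here.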
In eukaryotic genomes, there are multiple epigenetic marks, including DNA methylation, histone modifications, nucleosome positioning and others, all of which may contribute to the regulation of gene expression (Henderson & Jacobsen, 2007). Among these epigenetic marks, DNA methylation has been studied extensively for its role in the regulation of gene expression. In rice, Li et al. (2008) showed an interplay between DNA methylation, histone methylation and gene expression, and that gene expression appeared to be repressed by DNA methylation, but to be rescued by the concurrence of DNA and H3K4 methylation. He et al. (2010) found a weak negative correlation between DNA methylation and transcript levels, and that TE-related genes are highly methylated and little transcribed. In Populus trichocarpa, gene body methylation is suggested to have a more repressive effect than promoter methylation on transcription (Vining et al., 2012). By contrast, in Arabidopsis, many studies have suggested that gene body methylation is associated with active transcription (Zhang et al., 2006; Zilberman et al., 2007; Takuno & Gaut, 2012). The conflicting conclusions on the direction of the relationship between body methylation and gene expression in previous studies may be because an overall correlation pattern has often been sought, overlooking the possibility that body methylation may have heterogeneous effects on gene expression. In conclusion, in rice, using the proportion of methylated CpG dinucleotides within coding regions to measure the level of gene body methylation, we found that body methylation levels follow a bimodal distribution peaking at ‘0’ or ‘1’, and display distinct patterns associated with different gene origins and duplication modes. For pairs of duplicated genes, divergence in body methylation levels increases with physical distance and Ks, and WGDs show lower divergence than single-gene duplications at similar Ks levels. 
Body methylation of WGD duplicates tends to increase with Ks, whereas the body methylation levels of single-gene duplicates decrease with Ks, indicating that highly body-methylated genes are preferentially lost following gene duplication. Moderate body methylation tends to enhance gene expression, whereas light or heavy body methylation tends to repress gene expression. This study suggests that genes of different origins and duplication modes have distinct body methylation patterns, and body methylation evolves with DNA sequence evolution, has heterogeneous effects on gene expression and might be related to survivorship of duplicated genes.

Acknowledgements

We thank Barry Marler for IT support, Xinyu Liu for statistical consulting and Haibao Tang for providing Python scripts. A.H.P. appreciates funding from the National Science Foundation (NSF: DBI 0849896, MCB 0821096, MCB 1021718). This study was supported in part by resources and technical expertise from the Georgia Advanced Computing Resource Center, a partnership between the Office of the Vice President for Research and the Office of the Chief Information Officer.

References

Alisch RS, Barwick BG, Chopra P, Myrick LK, Satten GA, Conneely KN, Warren ST. 2012. Age-associated DNA methylation in pediatric populations. Genome Research 22: 623–632.
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. 1990. Basic local alignment search tool. Journal of Molecular Biology 215: 403–410.
Arabidopsis Interactome Mapping Consortium. 2011. Evidence for network evolution in an Arabidopsis interactome map. Science 333: 601–607.
Becker C, Hagmann J, Muller J, Koenig D, Stegle O, Borgwardt K, Weigel D. 2011. Spontaneous epigenetic variation in the Arabidopsis thaliana methylome. Nature 480: 245–249.
Bell JT, Pai AA, Pickrell JK, Gaffney DJ, Pique-Regi R, Degner JF, Gilad Y, Pritchard JK. 2011. DNA methylation patterns associate with genetic and gene expression variation in HapMap cell lines. Genome Biology 12: R10.
Bird AP. 1980. DNA methylation and the frequency of CpG in animal DNA. Nucleic Acids Research 8: 1499–1504.
Casneuf T, De Bodt S, Raes J, Maere S, Van de Peer Y. 2006. Nonrandom divergence of gene expression following gene and genome duplications in the flowering plant Arabidopsis thaliana. Genome Biology 7: R13.
Chinnusamy V, Zhu JK. 2009. Epigenetic regulation of stress responses in plants. Current Opinion in Plant Biology 12: 133–139.
Cusack BP, Wolfe KH. 2007. Not born equal: increased rate asymmetry in relocated and retrotransposed rodent gene duplicates. Molecular Biology and Evolution 24: 679–686.
De Smet R, Van de Peer Y. 2012. Redundancy and rewiring of genetic networks following genome-wide duplication events. Current Opinion in Plant Biology 15: 168–176.
Feng S, Cokus SJ, Zhang X, Chen PY, Bostick M, Goll MG, Hetzel J, Jain J, Strauss SH, Halpern ME et al. 2010. Conservation and divergence of methylation patterning in plants and animals. Proceedings of the National Academy of Sciences, USA 107: 8689–8694.
Ficklin SP, Luo F, Feltus FA. 2010. The association of multiple interacting genes with specific phenotypes in rice using gene coexpression networks. Plant Physiology 154: 13–24.
Flagel LE, Wendel JF. 2009. Gene duplication and evolutionary novelty in plants. New Phytologist 183: 557–564.
Fraser HB, Lam LL, Neumann SM, Kobor MS. 2012. Population-specificity of human DNA methylation. Genome Biology 13: R8.
Freeling M. 2009. Bias in plant gene content following different sorts of duplication: tandem, whole-genome, segmental, or by transposition. Annual Review of Plant Biology 60: 433–453.
Freeling M, Lyons E, Pedersen B, Alam M, Ming R, Lisch D. 2008. Many or most genes in Arabidopsis transposed after the origin of the order Brassicales. Genome Research 18: 1924–1937.
Freeling M, Thomas BC. 2006. Gene-balanced duplications, like tetraploidy, provide predictable drive to increase morphological complexity. Genome Research 16: 805–814.
Ganko EW, Meyers BC, Vision TJ. 2007. Divergence in expression between duplicated genes in Arabidopsis. Molecular Biology and Evolution 24: 2298–2309.
Ha M, Kim ED, Chen ZJ. 2009. Duplicate genes increase expression diversity in closely related species and allopolyploids. Proceedings of the National Academy of Sciences, USA 106: 2295–2300.
He G, Zhu X, Elling AA, Chen L, Wang X, Guo L, Liang M, He H, Zhang H, Chen F et al. 2010. Global epigenetic and transcriptional trends among two rice subspecies and their reciprocal hybrids. Plant Cell 22: 17–33.
Henderson IR, Jacobsen SE. 2007. Epigenetic inheritance in plants. Nature 447: 418–424.
Innan H, Kondrashov F. 2010. The evolution of gene duplications: classifying and distinguishing between models. Nature Reviews Genetics 11: 97–108.
International Rice Genome Sequencing Project. 2005. The map-based sequence of the rice genome. Nature 436: 793–800.
Jiao Y, Deng XW. 2007. A genome-wide transcriptional activity survey of rice transposable element-related genes. Genome Biology 8: R28.
Jones M, Wagner R, Radman M. 1987. Mismatch repair of deaminated 5-methyl-cytosine. Journal of Molecular Biology 194: 155–159.
Kolasinska-Zwierz P, Down T, Latorre I, Liu T, Liu XS, Ahringer J. 2009. Differential chromatin marking of introns and expressed exons by H3K36me3. Nature Genetics 41: 376–381.
Lee TF, Zhai J, Meyers BC. 2010. Conservation and divergence in eukaryotic DNA methylation. Proceedings of the National Academy of Sciences, USA 107: 9027–9028.
Li WH. 1997. Molecular evolution. Sunderland, MA, USA: Sinauer Associates.
Li X, Wang X, He K, Ma Y, Su N, He H, Stolc V, Tongprasit W, Jin W, Jiang J et al. 2008. High-resolution mapping of epigenetic modifications of the rice genome uncovers interplay between DNA methylation, histone methylation, and gene expression. Plant Cell 20: 259–276.
Li Z, Zhang H, Ge S, Gu X, Gao G, Luo J. 2009. Expression pattern divergence of duplicated genes in rice. BMC Bioinformatics 10(Suppl 6): S8.
Lorincz MC, Dickerson DR, Schmitt M, Groudine M. 2004. Intragenic DNA methylation alters chromatin structure and elongation efficiency in mammalian cells. Nature Structural & Molecular Biology 11: 1068–1075.
Luco RF, Pan Q, Tominaga K, Blencowe BJ, Pereira-Smith OM, Misteli T. 2010. Regulation of alternative splicing by histone modifications. Science 327: 996–1000.
Lynch M, Conery JS. 2000. The evolutionary fate and consequences of duplicate genes. Science 290: 1151–1155.
Maere S, De Bodt S, Raes J, Casneuf T, Van Montagu M, Kuiper M, Van de Peer Y. 2005. Modeling gene and genome duplications in eukaryotes. Proceedings of the National Academy of Sciences, USA 102: 5454–5459.
Maunakea AK, Nagarajan RP, Bilenky M, Ballinger TJ, D’Souza C, Fouse SD, Johnson BE, Hong C, Nielsen C, Zhao Y et al. 2010. Conserved role of intragenic DNA methylation in regulating alternative promoters. Nature 466: 253–257.
Nei M, Gojobori T. 1986. Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Molecular Biology and Evolution 3: 418–426.
Ohno S. 1970. Evolution by gene duplication. New York, NY, USA: Springer.
Paterson AH, Bowers JE, Chapman BA. 2004. Ancient polyploidization predating divergence of the cereals, and its consequences for comparative genomics. Proceedings of the National Academy of Sciences, USA 101: 9903–9908.
Pfeifer GP. 2006. Mutagenesis at methylated CpG sequences. DNA Methylation: Basic Mechanisms 301: 259–281.
Sarda S, Zeng J, Hunt BG, Yi SV. 2012. The evolution of invertebrate gene body methylation. Molecular Biology and Evolution 29: 1907–1916.
Schwartz S, Meshorer E, Ast G. 2009. Chromatin organization marks exon–intron structure. Nature Structural & Molecular Biology 16: 990–995.
Su Z, Han L, Zhao Z. 2011. Conservation and divergence of DNA methylation in eukaryotes: new insights from single base-resolution DNA methylomes. Epigenetics 6: 134–140.
Takuno S, Gaut BS. 2012. Body-methylated genes in Arabidopsis thaliana are functionally important and evolve slowly. Molecular Biology and Evolution 29: 219–227.
Tang H, Bowers JE, Wang X, Ming R, Alam M, Paterson AH. 2008a. Synteny and collinearity in plant genomes. Science 320: 486–488.
Tang H, Bowers JE, Wang X, Paterson AH. 2010. Angiosperm genome comparisons reveal early polyploidy in the monocot lineage. Proceedings of the National Academy of Sciences, USA 107: 472–477.
Tang H, Wang X, Bowers JE, Ming R, Alam M, Paterson AH. 2008b. Unraveling ancient hexaploidy through multiply-aligned angiosperm gene maps. Genome Research 18: 1944–1954.
Thompson JD, Higgins DG, Gibson TJ. 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research 22: 4673–4680.
Tibshirani R, Hastie T. 2007. Outlier sums for differential gene expression analysis. Biostatistics 8: 2–8.
Vining KJ, Pomraning KR, Wilhelm LJ, Priest HD, Pellegrini M, Mockler TC, Freitag M, Strauss SH. 2012. Dynamic DNA cytosine methylation in the Populus trichocarpa genome: tissue-level variation and relationship to gene expression. BMC Genomics 13: 27.
Wang X, Tang H, Bowers JE, Feltus FA, Paterson AH. 2007. Extensive concerted evolution of rice paralogs and the road to regaining independence. Genetics 177: 1753–1763.
Wang Y, Wang X, Paterson AH. 2012. Genome and gene duplications and gene expression divergence: a view from plants. Annals of the New York Academy of Sciences 1256: 1–14.
Wang Y, Wang X, Tang H, Tan X, Ficklin SP, Feltus FA, Paterson AH. 2011. Modes of gene duplication contribute differently to genetic novelty and redundancy, but show parallels across divergent angiosperms. PLoS ONE 6: e28150.
Woodhouse MR, Pedersen B, Freeling M. 2010. Transposed genes in Arabidopsis are often associated with flanking repeats. PLoS Genetics 6: e1000949.
Woodhouse MR, Tang H, Freeling M. 2011. Different gene families in Arabidopsis thaliana transposed in different epochs and at different frequencies throughout the rosids. Plant Cell 23: 4241–4253.
Yang Z, Nielsen R. 2000. Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models. Molecular Biology and Evolution 17: 32–43.
Yuan Q, Ouyang S, Wang A, Zhu W, Maiti R, Lin H, Hamilton J, Haas B, Sultana R, Cheung F et al. 2005. The institute for genomic research Osa1 rice genome annotation database. Plant Physiology 138: 18–26.
Zemach A, Kim MY, Silva P, Rodrigues JA, Dotson B, Brooks MD, Zilberman D. 2010a. Local DNA hypomethylation activates genes in rice endosperm. Proceedings of the National Academy of Sciences, USA 107: 18729–18734.
Zemach A, McDaniel IE, Silva P, Zilberman D. 2010b. Genome-wide evolutionary analysis of eukaryotic DNA methylation. Science 328: 916–919.
Zhang M, Xu C, von Wettstein D, Liu B. 2011. Tissue-specific differences in cytosine methylation and their association with differential gene expression in sorghum. Plant Physiology 156: 1955–1966.
Zhang X, Yazaki J, Sundaresan A, Cokus S, Chan SW, Chen H, Henderson IR, Shinn P, Pellegrini M, Jacobsen SE et al. 2006. Genome-wide high-resolution mapping and functional analysis of DNA methylation in Arabidopsis. Cell 126: 1189–1201.
Zhao XP, Si Y, Hanson RE, Crane CF, Price HJ, Stelly DM, Wendel JF, Paterson AH. 1998. Dispersed repetitive DNA has spread to new genomes since polyploid formation in cotton. Genome Research 8: 479–492.
Zilberman D, Gehring M, Tran RK, Ballinger T, Henikoff S. 2007. Genomewide analysis of Arabidopsis thaliana DNA methylation uncovers an interdependence between methylation and transcription. Nature Genetics 39: 61–69.

Supporting Information

Additional supporting information may be found in the online version of this article.

Fig. S1 Comparison of body methylation levels of all genes between all pairs of tissues.

Table S1 Classification of rice duplicated genes

Please note: Wiley-Blackwell are not responsible for the content or functionality of any supporting information supplied by the authors. Any queries (other than missing material) should be directed to the New Phytologist Central Office.