Category

Feature pattern discovery to predict nucleosome occupancy in yeast and human

Co-responding Author1,*, Co-author2 and Co-Author2
1 Department of XXXXXXX, Address XXXX etc.
2 Department of XXXXXXX, Address XXXX etc.

Received on XXXXX; revised on XXXXX; accepted on XXXXX
Associate Editor: XXXXXXX

ABSTRACT
Motivation: Different sequence-based and structural features have been reported to be predictive of nucleosome positioning. Which features are important, and whether the important features are the same in different species (human and yeast), is not yet clear.
Results: By dividing the genome into different regions (gene region, promoter region, intergenic region) and by transcription level (highly expressed, lowly expressed), we found that different feature subsets are predictive for different genomic regions or for genes with different transcriptional activities. Some of these feature subsets are conserved between yeast and human. We also identified different patterns of nucleosome positioning that are not explainable by local sequence-based features. We show that these patterns are likely to be functional results of different chromatin remodelers, and that including linker length together with effective feature subsets improves nucleosome positioning prediction.
Availability:
Contact:

1. INTRODUCTION
Eukaryotic genomic DNA is packaged with histones and other proteins into a chromatin complex. The basic unit of DNA packaging is the nucleosome, which comprises a histone octamer with a ~147 bp DNA segment wrapped around it (Kornberg and Lorch, 1999; Luger, et al., 1997). The DNA segments wrapped around nucleosomes are often called nucleosomal DNA, and those between two adjacent nucleosomes are often called linker DNA. Linker DNA has variable length, often ranging from 20 to 80 bp. The amount of histones occupying a unit length of a DNA segment in a population of cells is called the nucleosome occupancy level of that DNA segment.
It is well known that nucleosome occupancy is not uniformly distributed across the genome and that different nucleosome occupancies in different DNA regions can have important functional meanings (Daenen, et al., 2008; Tillo, et al., 2010). For example, highly occupied DNA regions are often difficult for regulatory proteins such as transcription factors to access, while nucleosome-depleted regions are more accessible to DNA-binding proteins functioning in gene regulation (Sekinger, et al., 2005; Zhu and Thiele, 1996). Nucleosomes have also been reported to play important roles in epigenetic gene regulation (Jiang and Pugh, 2009; Li, et al., 2007). Therefore, understanding what causes nucleosomes to form on different genomic regions is important for understanding gene regulation and uncovering cellular mechanisms.

Experimental methods to measure nucleosome occupancy have been advanced by recent next-generation sequencing technology (Shendure and Ji, 2008), especially the parallel sequencing of nucleosomal DNA (Albert, et al., 2007; Barski, et al., 2007; Field, et al., 2008; Kaplan, et al., 2009; Valouev, et al., 2008; Zhang, et al., 2009). A number of datasets on genome-scale measurement of nucleosome occupancy have become publicly available for multiple species, including human and yeast (Lee, et al., 2007; Schones, et al., 2008; Wang, et al., 2008; Yuan, et al., 2005). Based on these datasets, dozens of studies have reported that particular types of DNA sequences tend to have higher nucleosome occupancy levels in vivo and that there is potentially a genomic code for nucleosome-forming sequences (Segal, et al., 2006). Accordingly, a number of computational methods have recently been developed to model and predict nucleosome occupancy based on the underlying features of the primary DNA sequences (Field, et al., 2008; Gupta, et al., 2008; Lee, et al., 2007; Peckham, et al., 2007; Reynolds, et al., 2010; Segal, et al., 2006).
For example, Peckham et al. incorporated variable-length k-mers into an SVM model to predict nucleosomal DNA in yeast. Lee et al. applied a lasso regression model to a set of DNA sequence and structural features for nucleosomal DNA prediction. Reynolds et al. combined the nucleosomal DNA and the adjacent linker DNA sequences to characterize the nucleosome. The prediction accuracy of these methods is in general not sufficiently high, which has led to the common understanding that factors other than primary sequence may also influence nucleosome forming. Nevertheless, these methods are able to identify a number of sequence features that do contribute to nucleosome occupancy prediction. The selected features, however, are not always consistent with each other. For example, several studies have confirmed a ~10 bp periodicity of certain dinucleotides as predictive of nucleosome forming (Field, et al., 2008; Ioshikhes, et al., 2006; Kaplan, et al., 2009; Segal, et al., 2006), whereas Lee et al. report that structural features are the most effective and identified tip, tilt and propeller twist as important structural features for nucleosome occupancy prediction in yeast.

*To whom correspondence should be addressed.
© Oxford University Press 2005. Yiyu Zheng et al.

The diversity of effective features identified in nucleosome-prediction studies leads to our hypothesis that multiple DNA features can simultaneously or combinatorially influence nucleosome occupancy. In this paper, we develop a computational method, FFN (Finding Features for Nucleosomes), to identify features and feature combinations that are important for nucleosome occupancy prediction. By applying FFN to genome-wide nucleosome occupancy measurement data in yeast (Lee, et al., 2007) and human (Schones, et al., 2008), we found that a number of different features, when combined, have high predictive power for nucleosome-forming or nucleosome-depleted sequences.
The predictive power of combined features can be affected by factors beyond the static DNA sequence, such as gene transcriptional activity. We also show that certain structural features frequently appear in feature patterns in nucleosome-forming sequences.

2. METHODS

2.1 Data source
Nucleosome positioning data in yeast were obtained from Lee et al., and the data in human from Schones et al. for both resting and activated human T cells (Lee, et al., 2007; Schones, et al., 2008). The Laplacian of Gaussian (LOG) method (Ozsolak, et al., 2007; Zhang, et al., 2008) was then applied to the raw data to identify enriched regions. The parameters were chosen for best consistency between the LOG results and the HMM results (Lee, et al., 2007). These enriched regions are defined as nucleosome-containing sequences (NCS), and the sequences between two NCSs are defined as linker-containing sequences (LCS). For a given gene and its promoter region (1,000 bp upstream of the Transcription Start Site (TSS)), we define a nucleosome as the 0th nucleosome if it overlaps with the TSS, the nucleosomes upstream of the TSS as the -1st nucleosome, -2nd nucleosome and so on, and the nucleosomes downstream of the TSS as the 1st, 2nd nucleosomes and so on.

The gene expression data in yeast were obtained from David et al. (David, et al., 2006). 5736 annotated genes with gene expression measurements were kept for further identification of high-confidence transcripts. The yeast gene annotation is based on the annotation resource at UCSC. High-confidence transcripts are defined as those transcript segments that overlap by more than 50% with a non-dubious annotated coding region at the 5' end (Lee, et al., 2007). In total, 5300 out of the 5736 genes are defined as high-confidence transcripts.
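The TSS-relative nucleosome numbering defined in this section can be illustrated with a minimal sketch. The function name `label_nucleosomes`, the inclusive `(start, end)` coordinate convention and the handling of minus-strand genes are assumptions made for illustration; the text does not specify them.

```python
def label_nucleosomes(nucleosomes, tss, strand="+"):
    """Map each (start, end) nucleosome interval to its index relative to the TSS.

    Assumptions (not from the paper): inclusive coordinates, non-overlapping
    intervals, and mirrored numbering for genes on the minus strand.
    """
    # A nucleosome that overlaps the TSS is the 0th nucleosome.
    labels = {n: 0 for n in nucleosomes if n[0] <= tss <= n[1]}
    left = sorted((n for n in nucleosomes if n[1] < tss),
                  key=lambda n: n[1], reverse=True)   # nearest to the TSS first
    right = sorted((n for n in nucleosomes if n[0] > tss),
                   key=lambda n: n[0])                # nearest to the TSS first
    upstream, downstream = (left, right) if strand == "+" else (right, left)
    # Upstream nucleosomes are numbered -1, -2, ...; downstream ones 1, 2, ...
    labels.update({n: -(i + 1) for i, n in enumerate(upstream)})
    labels.update({n: i + 1 for i, n in enumerate(downstream)})
    return labels


nucs = [(100, 246), (300, 446), (480, 626), (700, 846)]
print(label_nucleosomes(nucs, tss=500))
```

With the toy intervals above and a plus-strand TSS at position 500, the interval (480, 626) becomes the 0th nucleosome, the two intervals to its left become -1 and -2, and the interval to its right becomes 1.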
The gene expression data in human were obtained from the GEO database (GSE10437), which measures whole-genome gene expression under two T-cell resting conditions and two activated conditions (Schones, et al., 2008). 19049 genes with gene expression measurements were obtained using gene annotation resources at UCSC (human hg18 build). Genes that are absent under both of the resting conditions and present under both of the activated conditions are defined as induced genes. Similarly, genes that are present under both of the resting conditions and absent under both of the activated conditions are defined as repressed genes (Schones, et al., 2008). In total, 299 genes are defined as induced genes and 393 genes as repressed genes. We refer to the induced and repressed genes together as disturbed genes.

2.2 Feature compilation and sequence representation
Features that are relevant to nucleosome occupancy were compiled from the literature. The compiled features fall into two classes: (1) sequence features such as DNA k-mer frequency (Peckham, et al., 2007), poly(A) tracts, transcription factor binding sites and sequence repeats (Lee, et al., 2007); (2) DNA structural features from Lee et al. and Abeel et al. (Abeel, et al., 2008; Lee, et al., 2007). The structural feature values are computed from conversion tables (Supplementary Table 1). Structural features that are highly correlated with other structural features (absolute Pearson correlation greater than 0.9) were removed, leaving 23 structural features (Supplementary Table 1). Finally, 766 features, including 694 k-mer frequency features, 4 poly tracts, 40 poly(dA/dT) tracts, 2 sequence repeat features, 23 structural features and 3 motifs (yeast only), were kept for further analysis (Supplementary Table 2).

We then performed LogitBoost (Friedman, et al., 2000) to further select the most relevant features.
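The correlation filter applied to the structural features (absolute Pearson correlation greater than 0.9) can be sketched as follows. This is only an illustration: the text does not say which member of a correlated pair is removed, so the greedy keep-first order used here is an assumption, as are the names `pearson` and `prune_correlated` and the toy feature vectors.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def prune_correlated(features, threshold=0.9):
    """Keep a feature only if |r| <= threshold against every feature kept so far.

    Assumption: features are visited in insertion order and the first member
    of a highly correlated pair is the one retained.
    """
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical conversion-table columns, one value per k-mer.
toy = {
    "f1": [1.0, 2.0, 3.0, 4.0],
    "f2": [2.0, 4.0, 6.0, 8.1],   # nearly proportional to f1, so it is dropped
    "f3": [4.0, 1.0, 3.0, 2.0],
}
print(prune_correlated(toy))
```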
LogitBoost is a boosting algorithm for classification that uses logistic regression as its cost function, and it assigns a weight to each feature selected in the model so that the impact of the feature on the model can be estimated (Friedman, 2001). We used the implementation of LogitBoost in the software WEKA (Hall et al., 2009). As training sets, we chose the top 1000 nucleosomes with the highest profile scores and the top 1000 linker sequences with the lowest scores in the yeast, human resting and human activated T-cell data, respectively. We then ran LogitBoost for 100 iterations using all the features. We kept the top 10 features in each of the yeast, human resting and human activated T-cell datasets, as they generally account for 60% of the total weight of all selected features (Supplementary Table 3). We collected the top 10 features selected in each of the three datasets, and also included the top 10 features of Peckham, et al., 2007 and three structural features from Lee, et al., 2007. Finally, after removing duplicated features, we included 30 features in our consideration (Table 1).

We then used the selected features to represent the NCSs and LCSs as follows. We computed all the feature values for every 147 bp-long subsequence in all the NCSs/LCSs. As the feature values are approximately normally distributed, we discretized each feature into m levels using the cutoffs μ - (m/2-1)σ, …, μ - σ, μ, μ + σ, …, μ + (m/2-1)σ. For m=4, level 0 is (-∞, μ - σ), level 1 is [μ - σ, μ), level 2 is [μ, μ + σ), and level 3 is [μ + σ, +∞). With the discretized feature values, every 147 bp subsequence in an NCS/LCS was replaced by the combination of its discretized feature values, called its feature profile.

Table 1.
Top 30 features

 #   Feature                  #   Feature         #   Feature
 1   Tip                      11  AAT/ATT         21  GAC/GTC
 2   Minor groove mobility    12  ATTA/TAAT       22  CCCC/GGGG
 3   Tilt                     13  TAA/TTA         23  ACAC/GTGT
 4   Z-DNA free energy        14  TAATA/TATTA     24  AATTA/TAATT
 5   Persistence length 1     15  AAAA/TTTT       25  ATAT
 6   Slide                    16  CGCC/GGCG       26  A/T
 7   Major groove mobility    17  AAG/CTT         27  TA
 8   Propeller twist          18  ACA/TGT         28  AAA/TTT
 9   ATA/TAT                  19  AAATA/TATTT     29  AT
 10  ATAA/TTAT                20  CCGCC/GGCGG     30  AATA/TATT

30 features used in pattern discovery

2.3 Scoring of a potential feature pattern
To determine the discriminating power of a potential feature pattern, we define the z-score of a specific feature pattern in nucleosome-forming sequences as Zn, and in linker-forming sequences as Zl. The discriminative score is then D-Score = Zn - Zl. For a pattern such as a1b2c3, the z-score is

$$ \mathrm{zscore} = \frac{\mathrm{profileCount}(a_1 b_2 c_3) - E(a_1 b_2 c_3)}{\sqrt{E\big((a_1 b_2 c_3)^2\big) - \big(E(a_1 b_2 c_3)\big)^2}} \qquad (1) $$

The expectation of pattern a1b2c3 is calculated using the following formula:

$$ E(a_1 b_2 c_3) = \pi_{a_1}\,\pi_{b_2}\,\pi_{c_3} + \sum_{r=2}^{h_i} \Big(\sum_{x=1}^{k} \pi_{a_x} (T_a^{\,r-1})_{a_x a_1}\Big) \Big(\sum_{x'=1}^{k} \pi_{b_{x'}} (T_b^{\,r-1})_{b_{x'} b_2}\Big) \Big(\sum_{x''=1}^{k} \pi_{c_{x''}} (T_c^{\,r-1})_{c_{x''} c_3}\Big) \qquad (2) $$

$$ E\big((a_1 b_2 c_3)^2\big) = E(a_1 b_2 c_3) + 2 \sum_{r=1}^{h_i-1} \sum_{j=r+1}^{h_i} \Big(\sum_{x=1}^{k} \pi_{a_x} (T_a^{\,r-1})_{a_x a_1} (T_a^{\,j-r})_{a_1 a_1}\Big) \Big(\sum_{x'=1}^{k} \pi_{b_{x'}} (T_b^{\,r-1})_{b_{x'} b_2} (T_b^{\,j-r})_{b_2 b_2}\Big) \Big(\sum_{x''=1}^{k} \pi_{c_{x''}} (T_c^{\,r-1})_{c_{x''} c_3} (T_c^{\,j-r})_{c_3 c_3}\Big) \qquad (3) $$

where T_a is the transition matrix of the Markov chain that models feature a transitioning between its categories across all the windows in the m sequences, and π_a is calculated as follows:

$$ \pi_a T_a = \pi_a \qquad (4) $$

where π_{a_x} lies in [0, 1] for all x and meets the requirement that

$$ \sum_{x=1}^{k} \pi_{a_x} = 1 \qquad (5) $$

In the end we keep the patterns with a score above 3√2 as discriminative patterns. As the ranges of the D-Score differ between species, in each situation we normalize the D-Score of each pattern by dividing it by the highest D-Score of all patterns, so that the scores lie in the range [0, 1].

3. EXPERIMENTS AND RESULTS
To identify potential features that combinatorially characterize nucleosome/linker-containing sequences, we developed the FFN algorithm (Algorithm 1). We then applied FFN to discover features and feature combinations in yeast and human T-cell data.

3.1 The FFN algorithm
The FFN algorithm aims to discover combinations of features that are relevant to NCS/LCS forming and can be used to distinguish NCSs from LCSs. Given all of the 147 bp-long subsequences in the obtained NCSs and LCSs, FFN starts by enumerating all the possible two-feature combinations, called di-feature patterns. Only those di-feature patterns whose occurrence is larger than α percent of the total number of NCS subsequences (e.g. α=20) are kept. The algorithm then searches for frequently co-occurring di-feature patterns using frequent pattern mining techniques (cite FIM). Next, FFN performs an extension step, in which it investigates whether some of these di-feature patterns in the same cluster can be further extended into tri-feature or even longer patterns. This extension step is implemented by a pattern merging procedure. For example, given four di-feature patterns a1b1, a1c3, b1c3 and c3d4 in one cluster, meaning they frequently co-occur in the input sequences, we observe that they can in fact be merged into three tri-feature patterns a1b1c3, b1c3d4, a1c3d4 and one tetra-feature pattern a1b1c3d4. Note that the occurrence frequencies of these merged patterns may vary, but should all be greater than α% of the total NCS subsequences. After the pattern extension step, FFN evaluates the statistical significance of the patterns' enrichment in the NCS and LCS subsequences using z-scores (see Methods section). The D-score for a given pattern is then computed to determine whether it has sufficient discriminative power to distinguish NCSs from LCSs. In this way, we can identify NCS-specific feature patterns that are frequently found in NCSs but less frequently found in LCSs. Similarly, we can start FFN from LCS data to identify LCS-specific feature patterns.

Algorithm 1. FFN algorithm

  Input: NCS profile, LCS profile
  Output: a set of discriminative patterns
  Ck: candidate pattern set of length k
  Lk: frequent pattern set of length k

  Initial discovered pattern set: R is Ø
  Start from length 1 pattern set L1: {all features}
  WHILE Lk is not null
      Determine candidate pattern set Ck+1 by merging patterns in Lk
      FOR each profile item p in Input
          FOR each candidate pattern c in Ck+1
              IF p contains pattern c
                  Increment support(c)
      Generate Lk+1 with all candidate patterns in Ck+1 with support > alpha
      FOR each candidate pattern P in Lk
          Calculate Zn: zScore(P) based on formulas (1-5)
          Calculate Zl: zScore(P) in LCS profiles
          D-Score = Zn - Zl
          IF D-Score > cutoff (3√2)
              put the pattern in the result set R
  Return discovered pattern set R

3.2 Feature patterns identified in yeast data
Applying the FFN algorithm to the yeast nucleosome data, we identified 88 NCS-specific patterns (NLPs) with a D-score larger than 3√2 (Table 2, Supplementary Table 4). For example, "Minor groove mobility" level 1, "Z-DNA free energy" level 2, "Persistence length 1" level 1 and "A/T" level 2 form the pattern with the highest D-score, 80.56. This pattern occurs 228845 times in NCSs with a z-score of 526.30 and 162726 times in LCSs with a z-score of 445.74. A length-3 subpattern of this pattern, containing "Z-DNA free energy" level 2, "Persistence length 1" level 1 and "A/T" level 2, is also identified as a discriminative pattern, with D-score 57.04. We also identified several patterns with the same feature combination but different levels, for example "Tip" level 2, "CCGCC/GGCGG" level 0, "TA" level 2 (D-score=35.09) and "Tip" level 1, "CCGCC/GGCGG" level 0, "TA" level 1 (D-score=13.23).
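The discretization of Section 2.2, the level-wise search of Algorithm 1 and the D-score arithmetic quoted above can be illustrated with a minimal sketch. It is not the authors' implementation: candidate generation here is a standard Apriori-style join, which may differ from the cluster-based merging described in Section 3.1, the z-scores are taken as given rather than computed from formulas (1-5), and the names `discretize`, `support`, `mine_patterns` and `d_score` are hypothetical. A feature profile is represented as a frozenset of (feature, level) pairs.

```python
from bisect import bisect_right
from itertools import combinations
from math import sqrt

D_CUTOFF = 3 * sqrt(2)  # D-score cutoff quoted in the text


def discretize(value, mu, sigma, m=4):
    """Level in 0..m-1, with cutoffs at mu + j*sigma, j = -(m/2-1), ..., (m/2-1)."""
    half = m // 2 - 1
    cutoffs = [mu + j * sigma for j in range(-half, half + 1)]
    return bisect_right(cutoffs, value)  # intervals are closed on the left


def support(profiles, pattern):
    """Number of profiles that contain every (feature, level) item of the pattern."""
    return sum(1 for p in profiles if pattern <= p)


def mine_patterns(profiles, alpha):
    """Return patterns of length >= 2 found in more than alpha percent of profiles."""
    min_count = alpha / 100.0 * len(profiles)
    items = {item for p in profiles for item in p}
    level = {frozenset([i]) for i in items}
    level = {s for s in level if support(profiles, s) > min_count}
    frequent = set()
    k = 1
    while level:
        # Join step (assumed): union two length-k patterns into a length-(k+1) candidate.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == k + 1}
        level = {c for c in candidates if support(profiles, c) > min_count}
        frequent |= level
        k += 1
    return frequent


def d_score(z_ncs, z_lcs):
    """D-Score = Zn - Zl."""
    return z_ncs - z_lcs


# Toy profiles over three hypothetical features a, b, c.
profiles = [
    frozenset({("a", 1), ("b", 1), ("c", 3)}),
    frozenset({("a", 1), ("b", 1), ("c", 3)}),
    frozenset({("a", 1), ("b", 2), ("c", 3)}),
    frozenset({("a", 0), ("b", 1), ("c", 0)}),
]
patterns = mine_patterns(profiles, alpha=40)
top = d_score(526.30, 445.74)  # z-scores reported for the top yeast pattern
print(len(patterns), round(top, 2), top > D_CUTOFF)
```

On the toy profiles the search keeps three di-feature patterns and one tri-feature pattern; the reported z-scores of the top yeast pattern reproduce its D-score of 80.56, well above the 3√2 cutoff.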
These identified patterns suggest that different combinations of features at various levels can be used for NCS prediction. We found that the feature combinations in NCSs are not specific to individual genes or promoters, since nucleosomes in the same promoter region often contain different feature combinations. For example, the promoter region of gene YAL047W-A contains 5 nucleosomes with very different patterns (Figure 1-A).

We also identified 15 LCS-specific feature patterns in yeast (Supplementary Table 4). For example, "ATA/TAT" level 1, "TAATA/TATTA" level 0 and "CCGCC/GGCGG" level 0 form a high D-score pattern (D-score = 10.38). It occurs 191297 times in LCSs with a z-score of 13.44 and 355422 times in NCSs with a z-score of 3.06. "CCGCC/GGCGG" level 0, "AATTA/TAATT" level 0 and "AATA/TATT" level 1 also form a pattern (D-score = 4.99).

Table 2. Top 10 Yeast Patterns Identified

 Rank  Pattern
 1     "Minor groove mobility" level 1, "Z-DNA free energy" level 2, "Persistence length 1" level 1, "A/T" level 2
 2     "Z-DNA free energy" level 2, "Persistence length 1" level 1, "CCCC/GGGG" level 0, "A/T" level 2
 3     "Z-DNA free energy" level 2, "Persistence length 1" level 1, "CCGCC/GGCGG" level 0, "A/T" level 2
 4     "Z-DNA free energy" level 2, "Persistence length 1" level 1, "A/T" level 2
 5     "Minor groove mobility" level 1, "Z-DNA free energy" level 2, "CCCC/GGGG" level 0, "A/T" level 2
 6     "Minor groove mobility" level 1, "Persistence length 1" level 1, "CCGCC/GGCGG" level 0, "A/T" level 2
 7     "Minor groove mobility" level 1, "Z-DNA free energy" level 2, "Persistence length 1" level 1, "CCGCC/GGCGG" level 0
 8     "Minor groove mobility" level 1, "Z-DNA free energy" level 2, "CCGCC/GGCGG" level 0, "A/T" level 2
 9     "Z-DNA free energy" level 2, "CGCC/GGCG" level 0, "CCGCC/GGCGG" level 0, "A/T" level 2
 10    "Minor groove mobility" level 1, "Persistence length 1" level 1, "A/T" level 2

By applying the FFN algorithm we identified 88 NCS-specific patterns (NLPs). Three structural features frequently occur in the top 10 patterns.

Fig. 1. Pattern distribution curve. (A) In the promoter region of YAL047W-A, the 5 nucleosomes contain different patterns: ("Z-DNA free energy" level 2, "CGCC/GGCG" level 0, "CCGCC/GGCGG" level 0, "A/T" level 2), ("Tip" level 2, "CCGCC/GGCGG" level 0, "TA" level 2), ("Tip" level 2, "CCGCC/GGCGG" level 0, "TA" level 2), ("Tip" level 1, "CGCC/GGCG" level 0, "CCGCC/GGCGG" level 0, "TA" level 1) and ("Z-DNA free energy" level 2, "Persistence length 1" level 1, "CGCC/GGCG" level 0, "CCGCC/GGCGG" level 0), respectively (ordered from the 5' end of the region [-1000, 0] towards the TSS). (B) In the promoter region of YOR202W, one nucleosome does not contain any pattern. The other three nucleosomes contain ("Z-DNA free energy" level 1, "CGCC/GGCG" level 0, "CCGCC/GGCGG" level 0, "A/T" level 1), ("ATTA/TAAT" level 0, "TAATA/TATTA" level 0, "AATTA/TAATT" level 0) and ("Slide" level 1, "Propeller twist" level 1, "CCGCC/GGCGG" level 0), respectively.

We also compared the occurrences of features in NCS-specific patterns and LCS-specific patterns and found different feature preferences. Of the 30 features, 17 occur in NCS-specific patterns and 11 occur in LCS-specific patterns. Thirteen features appear exclusively in NCS-specific patterns and 7 features only in LCS-specific patterns. Among the 13 NCS-exclusive features, the feature "A/T" is found in 40 of the 88 NCS-specific patterns but never in LCS-specific patterns. In particular, "A/T" at level 2 frequently appears in the top-scored feature patterns (9 of the top 10 patterns contain this feature, Table 2). This feature has been identified as the most useful feature for distinguishing NCSs from LCSs in Peckham et al.
It frequently forms patterns with structural features such as Minor groove mobility (11 times), Z-DNA free energy (12 times) and Persistence length 1 (11 times). All top 20 patterns contain at least one of these four features. This observation is consistent with the finding that structural features help nucleosome occupancy prediction in yeast (Lee, et al., 2007). "Z-DNA free energy", which is related to the free energy required for the transition from B-DNA to Z-DNA (Ho et al., 1990), has been identified 39 times (10 times at level 1 and 29 times at level 2). This feature is one of the structural features that are most negatively correlated with nucleosome occupancy (Gan et al., 2012). It usually co-occurs with at least one of the other three features mentioned above in the discovered patterns (33 out of 39).

3.3 The distribution of patterns in yeast promoters
Since NCS forming in the promoter regions can affect transcriptional activities, we investigated the relationship between the distribution of the identified feature patterns and NCSs in the yeast promoter regions. To each NCS/LCS in the promoter regions, we assigned a pattern score equal to the largest D-score received by the feature patterns exhibited by the subsequences it contains. We found that, in general, the pattern scores are on average higher in NCSs/LCSs closer to the TSS than in NCSs/LCSs farther away from the TSS (Figure 2-A). For the -1st and the 0th NCSs and the neighboring LCSs, the average score of the NCSs is slightly higher than that of the LCSs. This is consistent with studies reporting that the nucleosomes near the TSSs in yeast are more determined by sequence features, while the nucleosomes away from the TSS are more determined by other factors (cite xxx).
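The pattern score described above can be sketched as follows: D-scores are first normalized by the highest D-score (Section 2.3), and each NCS/LCS then receives the largest normalized score among the patterns matched by any of its 147 bp windows. The function names, the profile representation (a frozenset of (feature, level) pairs per window) and the score of 0.0 for a region that matches no pattern are assumptions made for illustration.

```python
def normalise(d_scores):
    """Divide every D-score by the highest D-score so that scores lie in [0, 1]."""
    top = max(d_scores.values())
    return {pattern: d / top for pattern, d in d_scores.items()}


def pattern_score(window_profiles, scores):
    """Largest score among patterns contained in any window profile of a region.

    Assumption: a region whose windows match no pattern scores 0.0.
    """
    best = 0.0
    for profile in window_profiles:
        for pattern, score in scores.items():
            if pattern <= profile and score > best:
                best = score
    return best


# Two yeast patterns and their D-scores as reported in Section 3.2.
top_pattern = frozenset({("Minor groove mobility", 1), ("Z-DNA free energy", 2),
                         ("Persistence length 1", 1), ("A/T", 2)})
tip_pattern = frozenset({("Tip", 2), ("CCGCC/GGCGG", 0), ("TA", 2)})
scores = normalise({top_pattern: 80.56, tip_pattern: 35.09})

# A hypothetical region with a single window that matches only the second pattern.
region = [frozenset({("Tip", 2), ("CCGCC/GGCGG", 0), ("TA", 2), ("A/T", 1)})]
print(round(pattern_score(region, scores), 3))
```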
To investigate whether the distribution of feature patterns is influenced by genes' transcriptional activities, we further divided genes based on their expression levels. We investigated the pattern scores of the NCSs/LCSs in the promoters of the 1000 most highly expressed genes (expression level between 4.56 and 2.79) and the 1000 most lowly expressed genes (expression level between 0.0040 and 0.956). We found that for every nucleosomal location, the score is on average larger for the NCSs/LCSs in the lowly expressed genes than in the highly expressed genes (Figure 2-B). For example, for the lowly expressed gene YOR258W (Figure 2-C), nearly all the NCSs in its promoter region contain top-scored patterns while the neighboring LCSs contain low-scored patterns. The lowly expressed gene YMR126C (Figure 2-D) shows the same trend. On the other hand, highly expressed genes such as YDR002W often do not contain many high-scoring patterns in the promoter region (Figure 2-E). These observations imply that nucleosomes in promoters tend to be less sequence-predictable when there is significant transcriptional activity (Lee, et al., 2007).

We also investigated whether the distribution of feature patterns is related to the nucleosome density in the promoter region. We grouped the promoters by the number of nucleosomes they contain. We found that near the TSS, the average score for the 0th nucleosome of promoters containing six nucleosomes (0.445) is much lower than the score for the 0th nucleosome in other promoter regions (above 0.5) (Figure 3-A).
When gene transcriptional activity is taken into account, we found that for highly expressed genes whose promoter regions contain 6 nucleosomes, the average scores of the nucleosomes near the TSS are much smaller than in the other groups: the scores are 0.306, 0.362 and 0.336 for the -1st, 0th and 1st nucleosomes, while in the other groups the scores are generally above 0.45 and higher than those of nucleosomes farther from the TSS (Figure 3-B). For the lowly expressed genes, the average scores are similar across groups and relatively higher than those of farther nucleosomes (Figure 3-C).

Fig. 2. Pattern distribution in yeast. (A) Average pattern score for nucleosomes and linkers around the TSS (0th, 1st, …). The scores for the -2nd, -1st, 0th, 1st, 2nd NCSs are 0.458, 0.515, 0.524, 0.429, 0.444 respectively, and the scores for the -2nd, -1st, 0th, 1st, 2nd LCSs are 0.481, 0.479, 0.521, 0.477, 0.482 respectively. Scores for NCSs/LCSs near the TSS are higher. (B) Average pattern score for nucleosomes in the top 1000 highly expressed genes and the top 1000 lowly expressed genes. (C-E) Pattern score curves for the promoter regions of genes YOR258W, YMR126C and YDR002W.

Fig. 3. Average pattern scores of nucleosomes vary with nucleosome density in the promoter region. (A-C) Average pattern score of nucleosomes in all genes, the top 1000 highly expressed genes and the top 1000 lowly expressed genes.

3.4 Identified features and feature patterns in human T cells
Applying the FFN algorithm to the human T-cell resting and activated data separately, we identified 2328 NCS-specific patterns in the resting data and 589 NCS-specific patterns in the activated data (Supplementary Table 5). For example, "Minor groove mobility" level 1, "Z-DNA free energy" level 2, "Persistence length 1" level 1 and "A/T" level 2 form a pattern (D-score=302.71) in human T cells that has also been identified in yeast. "Tip" level 1, "AATTA/TAATT" level 0, "TA" level 1 is another pattern (D-score=39.9) identified in both human and yeast.
As in yeast, nucleosomes in the same promoter can exhibit different feature patterns in the human T-cell data. For example, the four nucleosomes in the promoter region of gene TARDBP contain very different patterns (Fig. 4).

Fig. 4. Pattern distribution curve in the promoter region of TARDBP in human resting status. There are 4 nucleosomes, each containing a different pattern: "Minor groove mobility" level 1, "Z-DNA free energy" level 2, "Persistence length 1" level 1, "Major groove mobility" level 2, "CCGCC/GGCGG" level 0, "A/T" level 2 (rank 1); "Minor groove mobility" level 2, "Z-DNA free energy" level 1, "Persistence length 1" level 2, "Major groove mobility" level 1, "AATTA/TAATT" level 0 (rank 39); "Minor groove mobility" level 1, "Z-DNA free energy" level 2, "Persistence length 1" level 1, "Major groove mobility" level 2, "CCGCC/GGCGG" level 0, "A/T" level 2 (rank 2); "Z-DNA free energy" level 1, "Persistence length 1" level 2, "Major groove mobility" level 1, "AATTA/TAATT" level 0 (rank 93), respectively (ordered from the 5' end of the region [-1000, 0] towards the TSS).

Comparing the features found in human resting and activated T cells, we found that the patterns contain nearly all features except AAG/CTT, GAC/GTC and ACAC/GTGT. The four features highlighted in yeast also frequently co-occur in top-ranked patterns, and another feature, "AAA/TTT", is found in 647 of the 2328 NCS-specific patterns in human resting status and in 183 of the 589 patterns in human activated status. This feature was identified by Peckham, et al. as a significant 3-mer feature for distinguishing nucleosomes from linkers. We noticed that the top patterns discovered in human resting and activated status are similar.
The top 2 patterns are the same ("Minor groove mobility" level 1, "Z-DNA free energy" level 2, "Persistence length 1" level 1, "Major groove mobility" level 2, "CCGCC/GGCGG" level 0, "A/T" level 2, and "Minor groove mobility" level 1, "Z-DNA free energy" level 2, "Persistence length 1" level 1, "Major groove mobility" level 2, "A/T" level 2), and nearly all of the top 100 patterns contain the feature "A/T" and are formed by combinations of "Minor groove mobility", "Z-DNA free energy", "Persistence length 1" and "Major groove mobility". We also noticed that the top patterns discovered in resting status generally contain "Minor groove mobility" level 1, while patterns from activated status contain both level 1 and level 2. Of the 589 patterns discovered in activated status, 576 are also discovered in resting status; if levels are ignored, 580 patterns are formed from the same features. Although there are slight differences in the discretization of feature values between resting and activated status, this demonstrates that our method robustly discovers many useful patterns. One notable pattern that is discovered in activated status but not in resting status is "Minor groove mobility" level 2, "Z-DNA free energy" level 1, "Persistence length 1" level 2, "A/T" level 1 (rank 6, with z-score 538). In resting status the z-score for this pattern is -6.24.

3.5 The distribution of patterns in human
We analyzed the distribution of patterns in the NCSs and LCSs in the human promoters. We observed that in both resting and activated status, the patterns in both NCSs and LCSs have lower scores closer to the TSS than farther away from it, which differs from the trend in yeast (Figure 5-A).
We also observed that the average scores of the NCSs are higher than those of their neighboring LCSs near the TSS and in the gene body, while the average scores of the LCSs in the core promoter region (the -1st and 0th nucleosome locations) are higher than those of the paired nucleosomes. In general, in human T cells the -1st and 1st nucleosomes near a TSS without a 0th nucleosome have a low score compared to those near a TSS with a 0th nucleosome. Additionally, the 0th nucleosomes in human score the lowest among all nucleosomes (Figure 7, Supplementary Figure 3-H).

When gene expression levels are taken into account, in both resting and activated status the patterns in NCSs close to the TSSs of lowly expressed genes have on average higher scores than those in NCSs of highly expressed genes, while the difference is less pronounced for nucleosomes farther from the TSS (Figure 5-B for resting and Supplementary Figure 3-B for activated). Also, the score differences between NCSs and LCSs are larger in the lowly expressed genes than in the highly expressed genes (See supplementary figure). This is consistent with the hypothesis that the lack of transcriptional activity can lead to sequence-determined nucleosome-forming events (Segal et al., 2006). Figure 5 C and D show two examples of pattern distribution around the TSS.

Fig. 5. Pattern distribution in human resting status. (A) Average pattern score for nucleosomes and linkers around the TSS. For example, the D-scores for the -2nd, -1st, 0th, 1st, 2nd NCSs are 0.920, 0.872, 0.799, 0.799 and 0.863 respectively. The scores for the -2nd, -1st, 0th, 1st, 2nd LCSs are 0.921, 0.894, 0.810, 0.741, 0.803 respectively. (B) Average pattern score for nucleosomes in the top 1000 highly expressed genes and the top 1000 lowly expressed genes. (C-D) Pattern score curves for the promoter regions of genes FCRLA and RPL13 in resting status. (C) Lowly expressed gene NM_032738/FCRLA, with average expression level 4.3 in human resting status; the nucleosomes in the promoter region from nucleosome 4 to nucleosome 2 contain patterns ranked (1, 1, 4, 3, 1, 16, 4, 311). (D) Highly expressed gene NM_033251/RPL13, with expression level 169504 in resting status; the nucleosomes in the promoter region from the -4th nucleosome to the +2nd nucleosome contain patterns ranked (1, 1, 2, 1406, 1109, 1413, 39).

To investigate the pattern distribution in disturbed genes and the pattern changes before and after T-cell activation, we assigned pattern scores to all of the NCSs/LCSs in the promoter regions of the 299 induced genes and 393 repressed genes. For both the repressed and the induced genes in resting status, the scores for the -1st and 0th NCSs are generally lower than those of their neighboring LCSs (Figure 6-A, 6-B), suggesting that "+1 nucleosome may already be depleted and prepared for gene activation before TCR signaling" (Schones et al., 2008). For example, the induced gene HPDL (Figure 6-C, 6-D) displays similar nucleosome occupancy in the promoter region in resting and activated status, and a similar pattern distribution curve. The scores for induced and repressed genes in activated status are both relatively low compared to the average score in highly expressed genes. The score for the 0th nucleosome is 0.702 for induced genes, 0.677 for repressed genes and 0.750 for highly expressed genes. It seems that for disturbed genes in activated status, nucleosome forming is more likely to be predicted by factors related to gene activity and TCR signaling than by sequence features. Take the repressed gene NM_173485 (Figure 6-E, 6-F) as an example: unlike for HPDL, although the pattern distribution curves in resting and activated status are still similar, the nucleosome occupancy near the TSS has changed. In resting status, the -1st nucleosome is positioned in a region with a high pattern score, with a low-score region upstream of it. We found that after TCR signaling activation, that nucleosome has moved into that low-score region.

Fig. 6. Pattern distribution varies with gene status in human. (A) Average pattern score for nucleosomes and linkers in repressed genes in human resting status. (B) Average pattern score for nucleosomes and linkers in induced genes in human resting status. (C-D) Pattern score curves for the promoter region of gene HPDL in resting status and activated status. (E-F) Pattern score curves for gene NM_173485.

We also grouped the promoters by nucleosome density, based on the number of nucleosomes in the promoter region. Unlike in yeast, where only the promoters with 6 nucleosomes have low scores, we found that the more nucleosomes there are in the promoter region, the lower the score of each nucleosome at the respective position (Figure 7). In activated status, we found that the scores of the i-th nucleosome increase with density in the region near the TSS (Supplementary Figure 3-E). When the top 1000 highly expressed genes are grouped by nucleosome density (Supplementary Figure 3-F), the trend is similar to that for all promoters, while among the top 1000 lowly expressed genes (Supplementary Figure 3-G), the genes that contain only 1 nucleosome have higher scores than the genes with other nucleosome densities.

Fig. 7. Pattern distribution varies with nucleosome density in human. (consider to remove)

3.6 Yeast patterns compared with human patterns
Thirty-five identical patterns are conserved across yeast, human resting and human activated T cells (Supplementary Table 6).
For example, "Minor groove mobility" level 1, "Z-DNA free energy" level 2, "Persistence length 1" level 1 and "A/T" level 2 form a conserved pattern that ranks highly in both yeast and human (1st in yeast, 9th in human resting and 20th in human activated), indicating that these three structural features, together with the sequence feature "A/T", are very important factors influencing nucleosome formation in both species. Note that because the feature value distributions differ between the yeast and human genomes, the discretization levels of these features can differ between yeast and human. Considering only the features and not their levels, we discovered 41 conserved feature combinations across yeast, human resting and human activated T cells (Supplementary Table 6). For example, the feature combination of "Z-DNA free energy", "CGCC/GGCG", "CCGCC/GGCGG" and "A/T" is conserved, but at different feature levels in yeast and human. In detail, the pattern discovered in yeast is "Z-DNA free energy" at level 1, "CGCC/GGCG" at level 0, "CCGCC/GGCGG" at level 0 and "A/T" at level 1, while the pattern discovered in human resting and activated T cells is "Z-DNA free energy" at level 2, "CGCC/GGCG" at level 0, "CCGCC/GGCGG" at level 0 and "A/T" at level 2. This might be caused by the different feature value ranges in the two species. The yeast genome has a lower GC content (38%) than the human genome (41%). The GC content of the promoter region [-1000, 0] of yeast and human genes is 38.32% and 53.19%, respectively. In addition, 53 patterns discovered in yeast (Supplementary Table 6) were not discovered in human. For example, the yeast pattern "minor groove mobility" level 1, "Z-DNA free energy" level 2, "CCCC/GGGG" level 0, "A/T" level 2 (rank 6) is not discovered in the human data. This suggests that different feature combinations help nucleosome occupancy prediction in different species.

4. DISCUSSIONS AND CONCLUSIONS

Understanding the interaction between DNA sequence and nucleosome occupancy is important for uncovering gene regulatory mechanisms. Whether DNA sequence directly determines nucleosome occupancy and nucleosome positioning, and if so to what extent, is still under debate. We have developed an efficient method to study DNA features and their combinations that are useful for nucleosome occupancy prediction. Applying our method to the yeast and human T-cell data, we discovered thousands of feature combination patterns that are differentially enriched between NCSs and LCSs. These feature patterns involve both DNA structural features and sequence features and provide multiple possibilities for nucleosome formation. Comparing the feature patterns between yeast and human, we found that different patterns might prevail in different species.

One important observation is that the accuracy of nucleosome occupancy prediction is location-dependent: in human, the farther a region is from the TSS, the more accurate the sequence-based prediction. A related observation is that nucleosome occupancy tends to be hard to predict from the discovered feature patterns when the genes containing the nucleosomes are transcriptionally active.

Our results also show that feature levels can be important indicators of NCS/LCS formation. In the yeast patterns, most features appear at two levels (level 1 and level 2). For example, "Z-DNA free energy" occurs 39 times in the 88 patterns, 10 times at level 1 and 29 times at level 2. All patterns containing "Z-DNA free energy" at level 1 rank below 44th, while most of the patterns (25 out of 29) containing "Z-DNA free energy" at level 2 rank among the top patterns. We also noticed that "Z-DNA free energy" level 2 and "A/T" level 2 co-occur 12 times across all the patterns, and "Z-DNA free energy" level 1 and "A/T" level 1 co-occur 18 times.
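The level tallies and co-occurrence counts described above can be obtained with a single pass over the ranked pattern list. The following is a minimal sketch, assuming each pattern is stored as a set of (feature, level) items; the three toy patterns and the helper names `level_counts`, `cooccurrence` and `ranks_with` are hypothetical and serve only as an illustration, not as the actual data or implementation.

```python
from collections import Counter
from itertools import combinations

# Toy ranked pattern list (hypothetical; index 0 corresponds to rank 1).
# Each pattern is a frozenset of (feature, level) items.
patterns = [
    frozenset({("Z-DNA free energy", 2), ("A/T", 2), ("Minor groove mobility", 1)}),
    frozenset({("Z-DNA free energy", 2), ("A/T", 2), ("CGCC/GGCG", 0)}),
    frozenset({("Z-DNA free energy", 1), ("A/T", 1), ("AAAA/TTTT", 1)}),
]


def level_counts(patterns, feature):
    """Count how many patterns contain `feature` at each level."""
    counts = Counter()
    for pattern in patterns:
        for name, level in pattern:
            if name == feature:
                counts[level] += 1
    return counts


def cooccurrence(patterns):
    """Count how many patterns contain each unordered pair of items."""
    pairs = Counter()
    for pattern in patterns:
        # Sorting gives every pair a canonical (smaller, larger) order.
        for a, b in combinations(sorted(pattern), 2):
            pairs[(a, b)] += 1
    return pairs


def ranks_with(patterns, item):
    """Return the 1-based ranks of the patterns containing `item`."""
    return [rank for rank, p in enumerate(patterns, start=1) if item in p]


z_levels = level_counts(patterns, "Z-DNA free energy")
pair_counts = cooccurrence(patterns)

print(dict(z_levels))
print(pair_counts[(("A/T", 2), ("Z-DNA free energy", 2))])
print(ranks_with(patterns, ("Z-DNA free energy", 1)))
```

Applied to a full ranked list, `level_counts` would give the per-level occurrence of a feature, `ranks_with` would show where patterns containing a given level fall in the ranking, and `cooccurrence` would give the pairwise counts.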
The number of discovered patterns depends on the frequency parameter alpha (we used alpha = 20% in this paper). In the frequent-pattern finding step of the algorithm, a smaller alpha admits more patterns, whereas a larger alpha keeps only the patterns that occur more frequently. For example, in the yeast data, with alpha = 30% only 39 frequent patterns remain, and none of them is discriminative. The pattern with the largest D-Score is "ATAA/TTAT" level 1, "CGCC/GGCG" level 0, "CCGCC/GGCGG" level 0, with a score of 3.407, which does not meet our criteria for discriminative patterns (see 2.4). Moreover, none of these patterns contains a structural feature, and each contains at most one valid feature. This is because the k-mer frequency features, especially the 4-mer and 5-mer ones, have a high probability of being absent from a sequence, which makes these patterns more frequent than the patterns with valid features. With alpha = 10%, more patterns are included: we obtain 2203 patterns (Supplementary Table 7), compared with the 88 patterns discovered using alpha = 20%. All of the previous 88 patterns are included among the newly discovered patterns, and we also discovered some new patterns such as "minor groove mobility" level 3, "Z-DNA free energy" level 0, "persistence length 1" level 3, "A/T" level 0. This pattern contains the same feature combination as the rank-1 pattern in yeast, but with the feature values at different levels. Thus, by changing alpha we can discover more or fewer patterns. The discretization method also affects pattern discovery, because a different binning changes the frequency of each pattern.
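The dependence on alpha and on the binning can be illustrated with a small sketch. It discretizes synthetic, normally distributed feature values into four levels using cutoffs {μ − kσ, μ, μ + kσ} and then applies a frequency threshold alpha. This is a simplified illustration under stated assumptions, not the actual pipeline: the random data, the function names `discretize`, `level_fractions` and `frequent_levels`, the level-assignment rule, and the use of single-feature levels in place of multi-feature patterns are all assumptions made for brevity.

```python
import random
import statistics
from bisect import bisect_right
from collections import Counter


def discretize(values, k):
    """Map each value to a level 0-3 using cutoffs {mu - k*sigma, mu, mu + k*sigma}.

    Assumption: the level is the number of cutoffs that are <= the value.
    """
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    cutoffs = [mu - k * sigma, mu, mu + k * sigma]
    return [bisect_right(cutoffs, v) for v in values]


def level_fractions(levels):
    """Fraction of sequences that fall into each of the four levels."""
    counts = Counter(levels)
    n = len(levels)
    return {level: counts.get(level, 0) / n for level in range(4)}


def frequent_levels(fractions, alpha):
    """Keep the levels whose frequency is at least alpha."""
    return sorted(level for level, f in fractions.items() if f >= alpha)


random.seed(0)
# Synthetic stand-in for one structural feature measured on 10,000 sequences.
values = [random.gauss(0.0, 1.0) for _ in range(10_000)]

frac_1sigma = level_fractions(discretize(values, k=1))
frac_2sigma = level_fractions(discretize(values, k=2))

for alpha in (0.2, 0.4):
    print(alpha,
          frequent_levels(frac_1sigma, alpha),
          frequent_levels(frac_2sigma, alpha))
```

With the wider cutoffs, a larger share of the values falls into levels 1 and 2, so items at those levels can pass a threshold that they fail under the narrower cutoffs, while raising alpha removes items under either binning.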
If we use {μ – 2σ, μ, μ + 2σ} as the three discretization cutoffs in yeast, the discovered patterns differ, because feature values are more likely to fall into the level 1 and level 2 bins, which makes the patterns with level 1 and level 2 items more frequent. With the same parameter alpha = 20%, we discovered 2447 patterns (Supplementary Table 7) using the new cutoffs, compared with the 88 patterns discovered using the cutoffs {μ – σ, μ, μ + σ}. The pattern with the highest D-Score is "minor groove mobility" level 1, "Z-DNA free energy" level 2, "persistence length 1" level 1, "slide" level 1, "major groove mobility" level 2, "propeller twist" level 1, "CCGCC/GGCGG" level 0, "A/T" level 2. This pattern is discovered 220920 times (20.15%) in the NCS profile and 226873 times (27.04%) in the LCS profile. It is composed mostly of structural features, because the distributions of structural features conform more closely to a normal distribution than those of the k-mer frequency features; "A/T" also conforms well to a normal distribution compared with the other k-mer frequency features. Of the 88 previous patterns, 82 are included among the new patterns, and for the other 6 we can find corresponding patterns that contain the same features at different levels. For example, for the previous pattern "Z-DNA free energy" level 1, "AAAA/TTTT" level 1, "A/T" level 1, we discovered a pattern with the same features but all at level 2 ("Z-DNA free energy" level 2, "AAAA/TTTT" level 2, "A/T" level 2). We also discovered many long patterns (combinations of more than 4 features), whereas the previous method yields only patterns of length 3 and 4. These long patterns are generally composed of the frequent features mentioned in 3.2 together with some other features.
For example, we discovered a new pattern "minor groove mobility" level 1, "Z-DNA free energy" level 2, "persistence length 1" level 1, "major groove mobility" level 2, "CGCC/GGCG" level 0, "A/T" level 2, which contains the same features at the same levels as the previously discovered pattern "minor groove mobility" level 1, "Z-DNA free energy" level 2, "persistence length 1" level 1, "A/T" level 2, plus two more features, "major groove mobility" level 2 and "CGCC/GGCG" level 0.

ACKNOWLEDGEMENTS

Funding:

REFERENCES

Gan, Y. et al. (2012) Structural features based genome-wide characterization and prediction of nucleosome organization, BMC bioinformatics, 13, 49.
Hall, M. et al. The WEKA Data Mining Software: An Update, 11, 10-18.
Ho, P.S. et al. (1990) Polarized electronic spectra of Z-DNA single crystals, Biopolymers, 30, 151-63.
Abeel, T., Saeys, Y., Bonnet, E., Rouze, P. and Van de Peer, Y. (2008) Generic eukaryotic core promoter prediction using structural features of DNA, Genome research, 18, 310-323.
Albert, I., Mavrich, T.N., Tomsho, L.P., Qi, J., Zanton, S.J., Schuster, S.C. and Pugh, B.F. (2007) Translational and rotational settings of H2A.Z nucleosomes across the Saccharomyces cerevisiae genome, Nature, 446, 572-576.
Barski, A., Cuddapah, S., Cui, K., Roh, T.Y., Schones, D.E., Wang, Z., Wei, G., Chepelev, I. and Zhao, K. (2007) High-resolution profiling of histone methylations in the human genome, Cell, 129, 823-837.
Daenen, F., van Roy, F. and De Bleser, P.J. (2008) Low nucleosome occupancy is encoded around functional human transcription factor binding sites, BMC genomics, 9, 332.
David, L., Huber, W., Granovskaia, M., Toedling, J., Palm, C.J., Bofkin, L., Jones, T., Davis, R.W. and Steinmetz, L.M. (2006) A high-resolution map of transcription in the yeast genome, Proceedings of the National Academy of Sciences of the United States of America, 103, 5320-5325.
Field, Y., Kaplan, N., Fondufe-Mittendorf, Y., Moore, I.K., Sharon, E., Lubling, Y., Widom, J. and Segal, E.
(2008) Distinct modes of regulation by chromatin encoded through nucleosome positioning signals, PLoS computational biology, 4, e1000216.
Friedman, J., Hastie, T. and Tibshirani, R. (2000) Additive logistic regression: a statistical view of boosting, The Annals of Statistics, 28, 337-407.
Friedman, J.H. (2001) Greedy function approximation: A gradient boosting machine, The Annals of Statistics, 29, 1189-1232.
Gupta, S., Dennis, J., Thurman, R.E., Kingston, R., Stamatoyannopoulos, J.A. and Noble, W.S. (2008) Predicting human nucleosome occupancy from primary sequence, PLoS computational biology, 4, e1000134.
Ioshikhes, I.P., Albert, I., Zanton, S.J. and Pugh, B.F. (2006) Nucleosome positions predicted through comparative genomics, Nature genetics, 38, 1210-1215.
Jiang, C. and Pugh, B.F. (2009) Nucleosome positioning and gene regulation: advances through genomics, Nature reviews. Genetics, 10, 161-172.
Kaplan, N., Moore, I.K., Fondufe-Mittendorf, Y., Gossett, A.J., Tillo, D., Field, Y., LeProust, E.M., Hughes, T.R., Lieb, J.D., Widom, J. and Segal, E. (2009) The DNA-encoded nucleosome organization of a eukaryotic genome, Nature, 458, 362-366.
Kornberg, R.D. and Lorch, Y. (1999) Twenty-five years of the nucleosome, fundamental particle of the eukaryote chromosome, Cell, 98, 285-294.
Lee, W., Tillo, D., Bray, N., Morse, R.H., Davis, R.W., Hughes, T.R. and Nislow, C. (2007) A high-resolution atlas of nucleosome occupancy in yeast, Nature genetics, 39, 1235-1244.
Li, B., Carey, M. and Workman, J.L. (2007) The role of chromatin during transcription, Cell, 128, 707-719.
Luger, K., Mader, A.W., Richmond, R.K., Sargent, D.F. and Richmond, T.J. (1997) Crystal structure of the nucleosome core particle at 2.8 Å resolution, Nature, 389, 251-260.
Ozsolak, F., Song, J.S., Liu, X.S. and Fisher, D.E. (2007) High-throughput mapping of the chromatin structure of human promoters, Nature biotechnology, 25, 244-248.
Peckham, H.E., Thurman, R.E., Fu, Y., Stamatoyannopoulos, J.A., Noble, W.S., Struhl, K. and Weng, Z. (2007) Nucleosome positioning signals in genomic DNA, Genome research, 17, 1170-1177.
Reynolds, S.M., Bilmes, J.A. and Noble, W.S. (2010) Learning a weighted sequence model of the nucleosome core and linker yields more accurate predictions in Saccharomyces cerevisiae and Homo sapiens, PLoS computational biology, 6, e1000834.
Schones, D.E., Cui, K., Cuddapah, S., Roh, T.Y., Barski, A., Wang, Z., Wei, G. and Zhao, K. (2008) Dynamic regulation of nucleosome positioning in the human genome, Cell, 132, 887-898.
Segal, E., Fondufe-Mittendorf, Y., Chen, L., Thastrom, A., Field, Y., Moore, I.K., Wang, J.P. and Widom, J. (2006) A genomic code for nucleosome positioning, Nature, 442, 772-778.
Sekinger, E.A., Moqtaderi, Z. and Struhl, K. (2005) Intrinsic histone-DNA interactions and low nucleosome density are important for preferential accessibility of promoter regions in yeast, Molecular cell, 18, 735-748.
Shendure, J. and Ji, H. (2008) Next-generation DNA sequencing, Nature biotechnology, 26, 1135-1145.
Tillo, D., Kaplan, N., Moore, I.K., Fondufe-Mittendorf, Y., Gossett, A.J., Field, Y., Lieb, J.D., Widom, J., Segal, E. and Hughes, T.R. (2010) High nucleosome occupancy is encoded at human regulatory sequences, PloS one, 5, e9129.
Valouev, A., Ichikawa, J., Tonthat, T., Stuart, J., Ranade, S., Peckham, H., Zeng, K., Malek, J.A., Costa, G., McKernan, K., Sidow, A., Fire, A. and Johnson, S.M. (2008) A high-resolution, nucleosome position map of C. elegans reveals a lack of universal sequence-dictated positioning, Genome research, 18, 1051-1063.
Wang, Z., Zang, C., Rosenfeld, J.A., Schones, D.E., Barski, A., Cuddapah, S., Cui, K., Roh, T.Y., Peng, W., Zhang, M.Q. and Zhao, K. (2008) Combinatorial patterns of histone acetylations and methylations in the human genome, Nature genetics, 40, 897-903.
Yuan, G.C., Liu, Y.J., Dion, M.F., Slack, M.D., Wu, L.F., Altschuler, S.J. and Rando, O.J. (2005) Genome-scale identification of nucleosome positions in S. cerevisiae, Science, 309, 626-630.
Zhang, Y., Moqtaderi, Z., Rattner, B.P., Euskirchen, G., Snyder, M., Kadonaga, J.T., Liu, X.S. and Struhl, K. (2009) Intrinsic histone-DNA interactions are not the major determinant of nucleosome positions in vivo, Nature structural & molecular biology, 16, 847-852.
Zhang, Y., Shin, H., Song, J.S., Lei, Y. and Liu, X.S. (2008) Identifying positioned nucleosomes with epigenetic marks in human from ChIP-Seq, BMC genomics, 9, 537.
Zhu, Z. and Thiele, D.J. (1996) A specialized nucleosome modulates transcription factor access to a C. glabrata metal responsive promoter, Cell, 87, 459-470.