Category

Feature pattern discovery to predict nucleosome occupancy in yeast and human

Corresponding Author1,*, Co-author2 and Co-Author2
1 Department of XXXXXXX, Address XXXX etc.
2 Department of XXXXXXX, Address XXXX etc.
Received on XXXXX; revised on XXXXX; accepted on XXXXX
Associate Editor: XXXXXXX

ABSTRACT
Motivation: Different sequence-based and structural features have been reported to be predictive of nucleosome positioning. Which features are important, and whether the important features are the same in different species (human and yeast), is not yet clear.
Results: By dividing the genome into different regions (gene region, promoter region, intergenic region) and by transcription level (highly expressed, lowly expressed), we found that different feature subsets are predictive for different genomic regions and for genes with different transcriptional activities. Some of these feature subsets are conserved between yeast and human. We also identified different patterns of nucleosome positioning that are not explainable by local sequence-based features. We show that these different patterns are likely to be the functional results of different chromatin remodelers, and that including linker length together with effective feature subsets improves nucleosome positioning prediction.
Availability:
Contact:

1. INTRODUCTION
Eukaryotic genomic DNA is packaged with histones and other proteins into a chromatin complex. The basic unit of DNA packaging is the nucleosome, which comprises a histone octamer with a ~147 bp DNA segment wrapped around it [1, 2]. The DNA segments wrapped around histone octamers are called nucleosomal DNA, and those between two adjacent nucleosomes are called linker DNA. Linker DNA has variable length, often ranging from 20 to 80 bp. The amount of histones occupying a unit length of a DNA segment in a population of cells is called the nucleosome occupancy level of that DNA segment.
It is well known that nucleosome occupancy is not uniformly distributed across the genome and that differences in nucleosome occupancy between DNA regions can have important functional meaning [3, 4]. For example, highly occupied DNA regions are often difficult for regulatory proteins such as transcription factors to access, while nucleosome-depleted regions are more accessible to DNA-binding proteins that function in gene regulation [5, 6]. Nucleosomes have also been reported to play important roles in epigenetic gene regulation [7, 8]. Therefore, understanding what causes nucleosomes to form on different genomic regions is important for understanding gene regulation and uncovering cellular mechanisms.

Experimental methods to measure nucleosome occupancy have been advanced by recent next-generation sequencing technology [9], especially the parallel sequencing of nucleosomal DNA [10-15]. A number of datasets with genome-scale measurements of nucleosome occupancy have become publicly available for multiple species, including human and yeast [16-19]. Based on these datasets, dozens of studies have reported that particular types of DNA sequences tend to have higher nucleosome occupancy levels in vivo and that there is potentially a genomic code for nucleosome-forming sequences [20]. Accordingly, a number of computational methods have recently been developed to model and predict nucleosome occupancy based on the underlying features of the primary DNA sequence [10, 18, 20-23]. For example, Peckham et al. incorporated variable-length k-mers into an SVM model to predict nucleosomal DNA in yeast. Lee et al. applied a lasso regression model to a set of DNA sequence and structural features for nucleosomal DNA prediction. Reynolds et al. combined the nucleosomal DNA and the adjacent linker DNA sequences to characterize the nucleosome.

Although the prediction accuracy is in general not high, which has led to the common understanding that factors other than primary sequence may also influence nucleosome formation, these methods have identified a number of sequence features that do contribute to nucleosome occupancy prediction. The selected features, however, are not always consistent across studies. For example, several studies have confirmed a ~10 bp periodicity of certain dinucleotides as predictive of nucleosome formation [10, 13, 20, 24], whereas Lee et al. reported that structural features such as tilt and propeller twist are the most effective, and identified "Tip, Tilt, and Propeller Twist" as important structural features for nucleosome occupancy prediction in yeast. The diversity of effective features identified in nucleosome-prediction studies leads to our hypothesis that multiple DNA features can simultaneously or combinatorially influence nucleosome occupancy.

In this paper, we develop a computational method, FFN (Finding Features for Nucleosomes), to identify features and feature combinations that are important for nucleosome occupancy prediction. By applying FFN to genome-wide nucleosome occupancy measurement data in yeast [18] and human [16], we found that a number of different features, when combined, have high predictive power for nucleosome-forming or nucleosome-depleted sequences. The predictive power of combined features can be affected by factors beyond the static DNA sequence, such as gene transcriptional activity. We also show that certain structural features frequently appear in feature patterns in nucleosome-forming sequences.

*To whom correspondence should be addressed.
© Oxford University Press 2005. Yiyu Zheng et al.

2. METHODS
2.1 Data source
Nucleosome positioning data in yeast are obtained from Lee et al., and the data in human are from Schones et al. for both resting and activated human T-cells [16, 18].
The Laplacian of Gaussian (LOG) method [25, 26] is then applied to the raw data to identify enriched regions. The parameters are chosen for best consistency between the LOG results and the HMM results [18]. These enriched regions are defined as nucleosome-containing sequences (NCSs), and the sequences between two NCSs are defined as linker-containing sequences (LCSs). For a given gene and its promoter region (1,000 bp upstream of the Transcription Start Site (TSS)), we define a nucleosome as the 0th nucleosome if it overlaps with the TSS, the nucleosomes upstream of the TSS as the -1st nucleosome, -2nd nucleosome and so on, and the nucleosomes downstream of the TSS as the 1st nucleosome, 2nd nucleosome and so on.

The gene expression data in yeast are obtained from David et al. [27]. The 5736 annotated genes with gene expression measurements are kept for further identification of high-confidence transcripts. The yeast gene annotation is based on the annotation resource at UCSC. High-confidence transcripts are defined as those transcript segments that overlap by more than 50% with a non-dubious annotated coding region at the 5' end [18]. In total, 5300 of the 5736 genes are defined as high-confidence transcripts.

The gene expression data in human are obtained from the GEO database (GSE10437), which measures whole-genome gene expression under two T-cell resting conditions and two activated conditions [16]. In total, 19049 genes with gene expression measurements are obtained using gene annotation resources at UCSC (human hg18 build). Genes that are absent under both of the resting conditions and present under both of the activated conditions are defined as induced genes. Similarly, genes that are present under both of the resting conditions and absent under both of the activated conditions are defined as repressed genes [16]. In total, 299 genes are defined as induced genes and 393 genes as repressed genes.
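The induced/repressed rule above can be written as a small function. This is only a minimal sketch: the function name `classify_gene` and the representation of the present/absent calls as booleans are illustrative assumptions, not part of the original analysis.

```python
def classify_gene(resting_calls, activated_calls):
    """Classify a gene from present/absent calls (True = present).

    resting_calls and activated_calls each hold the calls for the two
    resting and the two activated conditions (hypothetical input format).
    """
    if not any(resting_calls) and all(activated_calls):
        return "induced"      # absent in both resting, present in both activated
    if all(resting_calls) and not any(activated_calls):
        return "repressed"    # present in both resting, absent in both activated
    return "other"            # mixed calls are left unclassified
```

Genes with mixed calls within a condition fall into neither class, which matches the requirement that both replicates agree.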
We define these induced genes and repressed genes together as disturbed genes.

2.2 Feature compilation and sequence representation
Features that are relevant to nucleosome occupancy are compiled from the literature. The compiled features fall into two classes: (1) sequence features such as DNA k-mer frequency [21], poly(A) tracts, transcription factor binding sites and sequence repeats [18]; (2) DNA structural features from Lee et al. and Abeel et al. [18, 28]. The structural feature values are computed from the conversion tables (Supplementary Table 1). Structural features that are highly correlated with other structural features (absolute value of the Pearson correlation greater than 0.9) are removed, resulting in 23 structural features (Supplementary Table 1). Finally, 766 features, including 694 k-mer frequency features, 4 poly tracts, 40 poly(dA/dT) tracts, 2 sequence repeat features, 23 structural features and 3 motifs (yeast only), are kept for further analysis (Supplementary Table 2).

We then applied LogitBoost [29] to further select the most relevant features. LogitBoost is a boosting algorithm for classification that uses logistic regression as the cost function, and it assigns weights to the features selected in the model, so that the impact of each feature on the model can be estimated [30]. We used the implementation of LogitBoost in the software "WEKA" (Hall et al., 2009). We chose the top 1000 nucleosomes with the highest profile scores and the top 1000 linker sequences with the lowest scores as training sets in the yeast, human resting T-cell and human activated T-cell data, respectively. We then ran 100 iterations of LogitBoost using all the features. We kept the top 10 features in each of the yeast, human resting and human activated T-cell datasets, as they generally account for 60% of the weight of all the selected features (Supplementary Table 3).
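The correlation filter applied to the structural features earlier in this section can be sketched as follows. This is a minimal illustration under stated assumptions: the paper does not say which feature of a correlated pair is dropped, so the greedy "keep the first one seen" order, the function name `prune_correlated`, and the input layout (rows are sequences, columns are features) are all hypothetical.

```python
import numpy as np

def prune_correlated(values, names, threshold=0.9):
    """Keep features whose |Pearson r| with every kept feature is <= threshold.

    values: 2-D array, one row per sequence and one column per feature.
    names:  feature names, in column order.
    Assumption: features are scanned in column order and the earlier one wins.
    """
    corr = np.corrcoef(values, rowvar=False)  # feature-by-feature correlation
    kept = []
    for j in range(values.shape[1]):
        if all(abs(corr[j, i]) <= threshold for i in kept):
            kept.append(j)
    return [names[j] for j in kept]
```

A different tie-breaking order would keep a different, equally valid subset, so the 23 features reported above need not be reproduced by this particular ordering.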
We then collected the top 10 features selected in each of the three datasets, and also included the top 10 features reported by [21] and three structural features from [18]. Finally, after removing duplicated features, we included 30 features in our consideration (Table 1).

We then used the selected features to represent the NCSs and LCSs as follows. We computed all the feature values for every 147 bp-long subsequence in all the NCSs/LCSs. As the feature values generally follow a normal distribution, we discretized each feature into m levels (m=4) using the cutoffs μ – (m/2-1)σ, μ – (m/2-2)σ, …, μ, …, μ + (m/2-1)σ, which for m=4 are {μ – σ, μ, μ + σ}. With the discretized feature values, every 147 bp subsequence in an NCS/LCS was replaced by the combination of its discretized feature values, called its feature profile.

Table 1. Top 30 features (the 30 features used in pattern discovery)

 1  Tip                      11  AAT/ATT         21  GAC/GTC
 2  Minor groove mobility    12  ATTA/TAAT       22  CCCC/GGGG
 3  Tilt                     13  TAA/TTA         23  ACAC/GTGT
 4  Z-DNA free energy        14  TAATA/TATTA     24  AATTA/TAATT
 5  Persistence length 1     15  AAAA/TTTT       25  ATAT
 6  Slide                    16  CGCC/GGCG       26  A/T
 7  Major groove mobility    17  AAG/CTT         27  TA
 8  Propeller twist          18  ACA/TGT         28  AAA/TTT
 9  ATA/TAT                  19  AAATA/TATTT     29  AT
10  ATAA/TTAT                20  CCGCC/GGCGG     30  AATA/TATT

2.3 Scoring of a potential feature pattern
To determine the discriminating power of a potential feature pattern, we define Zn as the z-score of the pattern in nucleosome-forming sequences and Zl as its z-score in linker-forming sequences. The discriminative score is then D-Score = Zn – Zl. For a tri-feature pattern a1b2c3 (feature a at level 1, feature b at level 2 and feature c at level 3), the z-score is

$$ \text{z-score}(a_1 b_2 c_3) = \frac{\mathrm{profileCount}(a_1 b_2 c_3) - E(a_1 b_2 c_3)}{\sqrt{E\big((a_1 b_2 c_3)^2\big) - \big(E(a_1 b_2 c_3)\big)^2}} \tag{1} $$

The expectation of pattern a1b2c3 is calculated using the following formulas:

$$ E(a_1 b_2 c_3) = \pi_{a_1}\,\pi_{b_2}\,\pi_{c_3} + \sum_{r=2}^{h_i} \Big(\sum_{x=1}^{k} \pi_{a_x} (T_a^{\,r-1})_{a_x a_1}\Big) \Big(\sum_{x'=1}^{k} \pi_{b_{x'}} (T_b^{\,r-1})_{b_{x'} b_2}\Big) \Big(\sum_{x''=1}^{k} \pi_{c_{x''}} (T_c^{\,r-1})_{c_{x''} c_3}\Big) \tag{2} $$

$$ E\big((a_1 b_2 c_3)^2\big) = E(a_1 b_2 c_3) + 2 \sum_{r=1}^{h_i-1} \sum_{j=r+1}^{h_i} \Big(\sum_{x=1}^{k} \pi_{a_x} (T_a^{\,r-1})_{a_x a_1} (T_a^{\,j-r})_{a_1 a_1}\Big) \Big(\sum_{x'=1}^{k} \pi_{b_{x'}} (T_b^{\,r-1})_{b_{x'} b_2} (T_b^{\,j-r})_{b_2 b_2}\Big) \Big(\sum_{x''=1}^{k} \pi_{c_{x''}} (T_c^{\,r-1})_{c_{x''} c_3} (T_c^{\,j-r})_{c_3 c_3}\Big) \tag{3} $$

where T_a is the transition matrix of the Markov chain that models feature a transitioning between its categories across all the windows in the m sequences, and π_a is calculated as follows:

$$ \pi_a T_a = \pi_a \tag{4} $$

where π_{a_x} lies in [0, 1] for all x and meets the requirement that

$$ \sum_{x=1}^{k} \pi_{a_x} = 1 \tag{5} $$

In the end we keep the patterns with a D-Score above 3√2 as discriminative patterns. As the ranges of the D-Score differ between species, in each case we normalize the D-Score of each pattern by dividing it by the highest D-Score of all patterns, which brings the scores into the range [0, 1].

3. EXPERIMENTS AND RESULTS
To identify potential features that combinatorially characterize nucleosome/linker-containing sequences, we developed the FFN algorithm (Algorithm 1). We then applied FFN to discover features and feature combinations in yeast and human T-cell data.

3.1 The FFN algorithm
The FFN algorithm aims to discover combinations of features that are relevant to NCS/LCS formation and can be used to distinguish NCSs from LCSs. Given all of the 147 bp-long subsequences in the obtained NCSs and LCSs, FFN starts by enumerating all the possible two-feature combinations, called di-feature patterns. Only those di-feature patterns whose occurrence is larger than α percent of the total number of NCS subsequences (e.g. α=20) are kept. The algorithm then searches for frequently co-occurring di-feature patterns using frequent pattern mining techniques (cite FIM). Next, FFN performs an extension step, in which it investigates whether some of the di-feature patterns in the same cluster can be further extended into tri-feature or even longer patterns. This extension step is implemented by a pattern merging procedure. For example, given four di-feature patterns a1b1, a1c3, b1c3 and c3d4 in one cluster, meaning that they frequently co-occur in the input sequences, we observe that they can in fact be merged into three tri-feature patterns a1b1c3, b1c3d4, a1c3d4 and one tetra-feature pattern a1b1c3d4. Note that the occurrence frequencies of these merged patterns may vary, but should all be greater than α% of the total NCS subsequences.

After the pattern extension step, FFN evaluates the statistical significance of the patterns' enrichment in the NCS and LCS subsequences using z-scores (see Methods section). The D-score of a given pattern is then computed to determine whether it has sufficient discriminative power to distinguish NCSs from LCSs. In this way, we can identify NCS-specific feature patterns that are frequently found in NCSs but less frequently found in LCSs. Similarly, we can start FFN from the LCS data to identify LCS-specific feature patterns.

Algorithm 1. FFN algorithm
Input: NCS profile, LCS profile
Output: a set of discriminative patterns
Ck: candidate pattern set of length k
Lk: frequent pattern set of length k

    Initialize the discovered pattern set R as Ø
    Start from the length-1 pattern set L1: {all features}
    WHILE Lk is not empty
        Determine the candidate pattern set Ck+1 by merging patterns in Lk
        FOR each profile item p in Input
            FOR each candidate pattern c in Ck+1
                IF p contains pattern c
                    Increment support(c)
        Generate Lk+1 from all candidate patterns in Ck+1 with support > alpha
        FOR each pattern P in Lk
            Calculate Zn: zScore(P) in NCS profiles based on formulas (1)-(5)
            Calculate Zl: zScore(P) in LCS profiles
            D-Score = Zn - Zl
            IF D-Score > cutoff (3√2)
                Put the pattern in the result set R
    RETURN the discovered pattern set R

3.2 Feature patterns identified in yeast data
Applying the FFN algorithm to the yeast nucleosome data, we identified 88 NCS-specific patterns with a D-score larger than 3√2 (Table 2, Supplementary Table 4).
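The z-scores behind the D-score threshold of 3√2, as defined by formulas (1)-(5), can be sketched numerically as follows. This is a hedged illustration, not the authors' implementation: it handles a single sequence of `h` consecutive windows, treats features as independent (as the products in the formulas suggest), and the names `stationary`, `pattern_moments`, `z_score` and `d_score` are hypothetical.

```python
import numpy as np

def stationary(T):
    """Solve pi @ T = pi with sum(pi) = 1 (formulas 4 and 5) by least squares."""
    k = T.shape[0]
    A = np.vstack([T.T - np.eye(k), np.ones((1, k))])
    b = np.zeros(k + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

def pattern_moments(Ts, levels, h):
    """Return E(count) and E(count^2) of a pattern over h windows (formulas 2 and 3).

    Ts:     one transition matrix per feature in the pattern.
    levels: the level index each feature takes in the pattern.
    """
    powers = []  # powers[f][n] is T_f raised to the n-th power, n = 0..h-1
    for T in Ts:
        P = [np.eye(T.shape[0])]
        for _ in range(h - 1):
            P.append(P[-1] @ T)
        powers.append(P)
    pis = [stationary(T) for T in Ts]

    def p_at(r):
        # probability that the pattern occurs in window r (0-based)
        return float(np.prod([(pi @ P[r])[lv] for pi, P, lv in zip(pis, powers, levels)]))

    def recur(n):
        # probability that each feature is at its level again n windows later
        return float(np.prod([P[n][lv, lv] for P, lv in zip(powers, levels)]))

    occ = [p_at(r) for r in range(h)]
    mean = sum(occ)
    pairs = sum(occ[r] * recur(j - r) for r in range(h - 1) for j in range(r + 1, h))
    return mean, mean + 2.0 * pairs

def z_score(count, Ts, levels, h):
    """Formula (1): observed count standardized by the model mean and variance."""
    mean, second = pattern_moments(Ts, levels, h)
    return (count - mean) / np.sqrt(second - mean ** 2)

def d_score(z_n, z_l):
    """D-score = Zn - Zl; a pattern is kept when this exceeds 3 * sqrt(2)."""
    return z_n - z_l
```

When every row of a transition matrix equals the stationary distribution, the windows are independent and the moments reduce to those of a binomial count, which gives a quick sanity check. The double sum over window pairs is quadratic in `h`, so this sketch is only practical for short sequences.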
For example, "Minor groove mobility" level 1, "Z-DNA free energy" level 2, "Persistence length 1" level 1 and "A/T" level 2 form the pattern with the highest D-score, 80.56. This pattern occurs 228845 times in NCSs with a z-score of 526.30 and 162726 times in LCSs with a z-score of 445.74. A length-3 subpattern of this pattern containing "Z-DNA free energy" level 2, "Persistence length 1" level 1 and "A/T" level 2 is also identified as a discriminative pattern, with D-score 57.04. We also identified several patterns with the same feature combination but different feature levels, for example the pattern "Tip" level 2, "CCGCC/GGCGG" level 0, "TA" level 2 (D-score=35.09) and the pattern "Tip" level 1, "CCGCC/GGCGG" level 0, "TA" level 1 (D-score=13.23). These identified patterns suggest that different combinations of features at various levels can be considered for NCS prediction.

Table 2. Top 10 Yeast Patterns Identified

Rank  Pattern
 1    "Minor groove mobility" level 1, "Z-DNA free energy" level 2, "Persistence length 1" level 1, "A/T" level 2
 2    "Z-DNA free energy" level 2, "Persistence length 1" level 1, "CCCC/GGGG" level 0, "A/T" level 2
 3    "Z-DNA free energy" level 2, "Persistence length 1" level 1, "CCGCC/GGCGG" level 0, "A/T" level 2
 4    "Z-DNA free energy" level 2, "Persistence length 1" level 1, "A/T" level 2
 5    "Minor groove mobility" level 1, "Z-DNA free energy" level 2, "CCCC/GGGG" level 0, "A/T" level 2
 6    "Minor groove mobility" level 1, "Persistence length 1" level 1, "CCGCC/GGCGG" level 0, "A/T" level 2
 7    "Minor groove mobility" level 1, "Z-DNA free energy" level 2, "Persistence length 1" level 1, "CCGCC/GGCGG" level 0
 8    "Minor groove mobility" level 1, "Z-DNA free energy" level 2, "CCGCC/GGCGG" level 0, "A/T" level 2
 9    "Z-DNA free energy" level 2, "CGCC/GGCG" level 0, "CCGCC/GGCGG" level 0, "A/T" level 2
10    "Minor groove mobility" level 1, "Persistence length 1" level 1, "A/T" level 2

By applying the FFN algorithm we identified 88 NCS-specific patterns.
Three structural features frequently occur in the top 10 patterns.

We also compared the occurrences of features in NCS-specific patterns and LCS-specific patterns and found different feature preferences. Of the 30 features, 17 appear in NCS-specific patterns and 11 appear in LCS-specific patterns. There are 13 features that appear exclusively in NCS-specific patterns and 7 features that appear only in LCS-specific patterns. Among the 13 NCS-exclusive features, the feature "A/T" is found in 40 of the 88 NCS-specific patterns but never in LCS-specific patterns. In particular, the feature "A/T" at level 2 frequently appears in the top-scored feature patterns (9 of the top 10 patterns contain this feature, Table 2). This feature was identified as the most useful feature for distinguishing NCSs from LCSs in Peckham et al. It frequently forms patterns with structural features such as Minor groove mobility (11 times), Z-DNA free energy (12 times) and Persistence length 1 (11 times). All top 20 patterns contain at least one of these four features. This observation is consistent with the finding that structural features help nucleosome occupancy prediction in yeast [18]. "Z-DNA free energy", which is related to the free energy required for the transition from B-DNA to Z-DNA (Ho et al., 1990), is identified 39 times (10 times at level 1 and 29 times at level 2). This feature is one of the structural features that are most negatively correlated with nucleosome occupancy (Gan et al., 2012). It usually co-occurs with at least one of the other three features mentioned above in the discovered patterns (33 out of 39).

3.3 The distribution of patterns in yeast promoters
Since NCS formation in promoter regions can affect transcriptional activity, we investigated the relationship between the distribution of the identified feature patterns and the NCSs in the yeast promoter regions.
For each NCS/LCS in the promoter regions, we assigned a pattern score equal to the largest D-score among the feature patterns exhibited by the subsequences it contains. We found that, in general, the pattern scores are on average higher in NCSs/LCSs closer to the TSS than in those farther away from the TSS (Fig. 1-A). For the -1st and the 0th NCSs and the neighboring LCSs, the average score of the NCSs is slightly higher than that of the LCSs.

To investigate whether the distribution of feature patterns is influenced by the transcriptional activity of genes, we further divided genes based on their expression levels. We investigated the pattern scores of the NCSs/LCSs in the promoters of the 1000 most highly expressed genes (expression level between 4.56 and 2.79) and the 1000 most lowly expressed genes (expression level between 0.0040 and 0.956). We found that, for every nucleosomal location, the score is on average larger for the NCSs in the low-expressed genes than for those in the highly expressed genes (Fig. 1-B). For example, for the low-expressed gene "YOR258W" (Fig. 1-C), nearly all the NCSs in its promoter region contain top-scored patterns, while the neighboring LCSs contain low-scored patterns. The low-expressed gene "YMR126C" (Fig. 1-D) shows the same trend. On the other hand, highly expressed genes such as "YDR002W" often do not contain many high-scoring patterns in the promoter region (Fig. 1-E). These observations imply that nucleosomes in promoters tend to be less sequence-predictable when there is significant gene transcriptional activity [18].

We found that the feature combinations in NCSs are not specific to individual genes or promoters, since nucleosomes in the same promoter region often contain different feature combinations. For example, the promoter region of gene YDR002W contains 4 nucleosomes with different patterns (Fig. 1-E).

Fig. 1. Distribution of patterns in yeast. (A) Average pattern score for nucleosomes and linkers around the TSS.
(0th, 1st, …). The scores for the -2nd, -1st, 0th, 1st and 2nd NCSs are 0.458, 0.515, 0.524, 0.429 and 0.444, respectively, and the scores for NCSs/LCSs near the TSS are higher. (B) Average pattern score for nucleosomes in the top 1000 highly expressed genes and the top 1000 low-expressed genes. (C-E) Pattern score distribution for the promoter regions of genes YOR258W (C), YMR126C (D) and YDR002W (E).

3.4 Identified feature and feature patterns in human T-cell
Applying the FFN algorithm to the human T-cell resting and activated data separately, we identified 2328 NCS-specific patterns in the human T-cell resting data and 589 NCS-specific patterns in the human T-cell activated data (Supplementary Table 5). For example, "minor groove mobility" level 1, "Z-DNA free energy" level 2, "persistence length 1" level 1 and "A/T" level 2 form a pattern (D-score=302.71) in human T-cells that has also been identified in yeast. The pattern "Tip" level 1, "AATTA/TAATT" level 0, "TA" level 1 is another pattern (D-score=39.9) identified in both human and yeast.

Of the 589 patterns identified in the activated status, 576 are conserved between the resting and activated status. This number becomes 580 if differences in feature levels are not considered. For example, the pattern comprising "minor groove mobility" level 1, "Z-DNA free energy" level 2, "persistence length 1" level 1, "major groove mobility" level 2, "CCGCC/GGCGG" level 0 and "A/T" level 2 is identified in both conditions with a high D-score. Almost all the patterns ranked in the top 100 contain the feature "A/T", which frequently appears together with features such as "minor groove mobility", "Z-DNA free energy", "persistence length 1" and "major groove mobility".
One example of a difference between the patterns in the two conditions is the pattern comprising "minor groove mobility" level 2, "Z-DNA free energy" level 1, "persistence length 1" level 2 and "A/T" level 1 (rank 6, with z-score 538), which is discovered in the activated status only. In the resting status the z-score of this pattern is -6.24.

Comparing the features in human resting and activated T cells, we found that the patterns contain nearly all of the features, the exceptions being AAG/CTT, GAC/GTC and ACAC/GTGT. The four features mentioned for yeast also frequently co-occur in top-ranked patterns. Another feature, "AAA/TTT", is found in 647 of the 2328 NCS-specific patterns in the human resting status and in 183 of the 589 patterns in the human activated status. This feature was identified by [21] as a significant 3-mer feature for distinguishing nucleosomes from linkers.

3.5 The distribution of patterns in human
We analyzed the distribution of patterns in the NCSs and LCSs in human promoters. We observed that, in both the resting and activated status, the patterns in both NCSs and LCSs have lower scores closer to the TSS than farther away from the TSS (Fig. 2-A for resting, Supplementary Fig. 1-A for activated), which differs from the trend in yeast. We also observed that the average scores of the NCSs are higher than those of their neighboring LCSs near the TSS and in the gene body, while the average scores of the LCSs in the core promoter region (the -1st and 0th nucleosome locations) are higher than those of the paired nucleosomes. In general, in human T cells the -1st and 1st nucleosomes near a TSS without a 0th nucleosome have low scores compared with those near a TSS with a 0th nucleosome. Additionally, the 0th nucleosomes in human score the lowest among all nucleosomes (Fig. 2-A, Supplementary Fig. 1-A).
When gene expression levels are taken into account, in both the resting and activated status the patterns in NCSs closer to the TSSs of low-expressed genes have on average higher scores than those in NCSs of highly expressed genes, while the difference is less pronounced for nucleosomes farther from the TSS (Fig. 2-B for resting and Supplementary Fig. 1-B for activated). Also, the score differences between NCSs and LCSs are larger in the low-expressed genes than in the highly expressed genes (see supplementary figure). This is consistent with the hypothesis that a lack of transcriptional activity can lead to sequence-determined nucleosome formation (Segal et al., 2006). Fig. 2 C and D show two examples of the pattern distribution around the TSS.

Similar to yeast, nucleosomes in the same promoter can exhibit different feature patterns in the human T-cell data. For example, the four nucleosomes in the promoter region of gene TARDBP contain very different patterns (Fig. 2-E).

Fig. 2. Distribution of patterns in human resting status. (A) Average pattern score for nucleosomes and linkers around the TSS. (B) Average pattern score for nucleosomes in the top 1000 highly expressed genes and the top 1000 low-expressed genes. (C) Pattern score distribution curve for the lowly expressed gene NM_032738/FCRLA in human resting status, showing that the -4th to +2nd nucleosomes contain patterns ranked (1, 1, 4, 3, 1, 16, 4, 311). (D) Pattern score distribution curve for the highly expressed gene NM_033251/RPL13 in resting status, showing that the -4th to +2nd nucleosomes contain patterns ranked (1, 1, 2, 1406, 1109, 1413, 39). (E) The 4 nucleosomes in the promoter region of TARDBP in human resting status contain different patterns.
To investigate the pattern distribution in disturbed genes and the pattern changes before and after T-cell activation, we assigned pattern scores to all of the NCSs/LCSs in the promoter regions of the 299 induced genes and 393 repressed genes. For the repressed genes and induced genes in the resting status, the scores of the -1st and 0th NCSs are generally lower than those of their neighboring LCSs (Fig. 3). For both induced and repressed genes in the activated status, the average scores of the -2nd, -1st, 0th, 1st and 2nd NCSs are relatively low compared with the average scores in highly expressed genes (Supplementary Fig. 2). Take the repressed gene NM_173485 (Fig. 3) as an example: the nucleosome occupancy near the TSS changed after activation, while the pattern distribution did not change much. In the resting status, the -1st nucleosome is positioned in a region with a high pattern score, with a low-score region upstream of it. We found that after TCR signaling activation, that nucleosome moved into the low-score region. For disturbed genes in the activated status, nucleosome formation is more likely to be predicted by factors related to gene activity and TCR signaling than by sequence features.

Fig. 3. Pattern distribution of disturbed genes. (A) Average pattern score for nucleosomes and linkers in repressed genes in human resting status. (B) Average pattern score for nucleosomes and linkers in induced genes in human resting status. (C-D) Pattern score curve for gene NM_173485 TSHZ2 in resting status and activated status.

3.6 Yeast Patterns compare with Human Patterns
There are 35 identical patterns conserved across yeast, human resting and human activated T cells (Supplementary Table 6). For example, "Minor groove mobility" level 1, "Z-DNA free energy" level 2, "Persistence length 1" level 1 and "A/T" level 2 form a conserved pattern with high ranks in both yeast and human (rank 1 in yeast, 9 in human resting and 20 in human activated), indicating that the three structural features together with the sequence feature "A/T" are very important factors influencing nucleosome formation in both species.

Note that because the feature value distributions differ between the yeast and human genomes, the discretization levels of these features in yeast and human can differ. Considering only the features and not their levels, we discovered 41 conserved feature combinations across yeast, human resting and human activated T cells (Supplementary Table 6). For example, the feature combination of "Z-DNA free energy", "CGCC/GGCG", "CCGCC/GGCGG" and "A/T" is conserved but at different feature levels in yeast and human. In detail, "Z-DNA free energy" at level 1, "CGCC/GGCG" at level 0, "CCGCC/GGCGG" at level 0 and "A/T" at level 1 is discovered in yeast, while in human resting and activated T cells the pattern discovered is "Z-DNA free energy" at level 2, "CGCC/GGCG" at level 0, "CCGCC/GGCGG" at level 0 and "A/T" at level 2. This might be caused by the different feature value ranges in different species. The yeast genome has a lower GC content (38%) than the human genome (41%). The GC content of the promoter [-1000, 0] of yeast and human genes is 38.32% and 53.19%, respectively.

Also, there are 53 patterns discovered in yeast (Supplementary Table 6) that are not discovered in human. For example, the yeast pattern "minor groove mobility" level 1, "Z-DNA free energy" level 2, "CCCC/GGGG" level 0, "A/T" level 2 (rank 6) is not discovered in the human data. This demonstrates that different feature combinations help nucleosome occupancy prediction in different species.

4. DISCUSSIONS AND CONCLUSIONS
Understanding the interaction between DNA sequence and nucleosome occupancy is important for uncovering gene regulatory mechanisms. Whether DNA sequence directly determines nucleosome occupancy and nucleosome positioning, and if so to what extent, is still under debate.
We have developed an efficient method to study DNA features and their combinations that are useful for nucleosome occupancy prediction. Applying our method to the yeast and human T-cell data, we discovered thousands of feature combination patterns that are differently enriched in NCSs and LCSs. These discovered feature patterns involve both DNA structural features and sequence features and provide multiple possibilities for nucleosome formation. Comparing feature patterns between yeast and human, we found that different patterns might prevail in different species.

One important observation is that nucleosome occupancy prediction accuracy is location-dependent. In human, the farther away from the TSS, the more accurate the sequence-based prediction. Another related observation is that nucleosome occupancy tends to be hard to predict from the discovered feature patterns when the containing genes are transcriptionally active.

Our results also show that feature levels can be important indicators of NCS/LCS formation. In the yeast patterns, most features appear at two levels (levels 1 and 2). For example, "Z-DNA free energy" is identified 39 times in the 88 patterns, 10 times at level 1 and 29 times at level 2. We observe that all patterns containing "Z-DNA free energy" at level 1 rank lower than 44, while most of the patterns (25 out of 29) containing "Z-DNA free energy" at level 2 rank in the top. We also noticed that the two features "Z-DNA free energy" level 2 and "A/T" level 2 co-occur 12 times in all the patterns, and "Z-DNA free energy" level 1 and "A/T" level 1 co-occur 18 times.

By changing the frequency parameter alpha (we used alpha = 20% in this paper) we obtain different numbers of patterns: in the frequent pattern finding step of the algorithm, a smaller alpha includes more patterns, while a larger alpha keeps only the patterns that occur more frequently.
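The effect of alpha on the frequent pattern finding step can be illustrated with a small Apriori-style sketch that follows the outline of Algorithm 1. It is a simplified illustration under assumptions: profiles are represented as dictionaries from feature name to level, alpha is given as a fraction rather than a percentage, the clustering of co-occurring di-feature patterns is omitted, and the function name `frequent_patterns` is hypothetical.

```python
from itertools import combinations

def frequent_patterns(profiles, alpha):
    """Return {pattern: support} for patterns with support > alpha * len(profiles).

    profiles: list of dicts mapping feature name -> discretized level.
    alpha:    support threshold as a fraction in (0, 1).
    A pattern is a frozenset of (feature, level) items.
    """
    min_support = alpha * len(profiles)
    item_sets = [frozenset(p.items()) for p in profiles]

    def support(pattern):
        # number of profiles that contain every item of the pattern
        return sum(1 for items in item_sets if pattern <= items)

    singles = {frozenset([item]) for items in item_sets for item in items}
    level = {p for p in singles if support(p) > min_support}
    frequent = {}
    k = 1
    while level:
        for p in level:
            frequent[p] = support(p)
        # merge two frequent length-k patterns that differ in exactly one item
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == k + 1}
        level = {c for c in candidates if support(c) > min_support}
        k += 1
    return frequent
```

Raising alpha can only shrink the result, because every pattern that passes a higher threshold also passes a lower one.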
For example, in the yeast data, alpha = 30% leaves only 39 frequent patterns, none of which is discriminative. The pattern with the largest D-Score is "ATAA/TTAT" level 1, "CGCC/GGCG" level 0, "CCGCC/GGCGG" level 0, with a score of 3.407, which does not meet our criteria for discriminative patterns (see 2.4). Moreover, none of these patterns contains a structural feature, and each contains at most one valid feature. This is because k-mer frequency features, especially the 4-mer and 5-mer ones, are frequently absent from a sequence, which makes patterns dominated by absent k-mers more frequent than patterns with valid features. With alpha = 10%, more patterns are included: we obtain 2203 patterns (Supplementary Table 7), compared with the 88 patterns discovered using alpha = 20%. All 88 previous patterns are included among the newly discovered patterns, and we also discovered new patterns such as "minor groove mobility" level 3, "Z-DNA free energy" level 0, "persistence length 1" level 3, "A/T" level 0. This pattern contains the same feature combination as the rank-1 pattern in yeast, but with the feature values at different levels. Thus, changing alpha yields more or fewer patterns.

The discretization method also affects pattern discovery, because a different binning changes the pattern frequencies. If we use {μ - 2σ, μ, μ + 2σ} as the three cutoffs for discretization in yeast, the discovered patterns differ: feature values are more likely to fall into the level 1 and level 2 bins, which makes patterns involving those levels more frequent. With the same parameter alpha = 20%, we discovered 2447 patterns (Supplementary Table 7) using the new cutoffs, compared with the 88 patterns discovered using the cutoffs {μ - σ, μ, μ + σ}.
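The effect of the cutoffs on level frequencies can be checked with a small sketch. It assumes that the three cutoffs split values into four levels numbered 0 to 3 from lowest to highest and that a value equal to a cutoff goes to the higher level; both conventions, the function names and the simulated normally distributed feature values are assumptions for illustration only.

```python
import random
import statistics


def discretize(values, width=1.0):
    """Map each value to a level 0-3 using cutoffs {mu - w*sigma, mu, mu + w*sigma}."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    cutoffs = (mu - width * sigma, mu, mu + width * sigma)
    # The level is the number of cutoffs that the value reaches or exceeds.
    return [sum(v >= c for c in cutoffs) for v in values]


def middle_share(levels):
    """Fraction of values assigned to level 1 or level 2."""
    return sum(lv in (1, 2) for lv in levels) / len(levels)


random.seed(0)
# Simulated feature values (hypothetical; stands in for a structural feature).
values = [random.gauss(0.0, 1.0) for _ in range(10000)]

for width in (1.0, 2.0):
    print(width, round(middle_share(discretize(values, width)), 3))
```

For roughly normal features, widening the cutoffs from one to two standard deviations moves most of the mass from levels 0 and 3 into levels 1 and 2, which is consistent with patterns at those levels becoming more frequent.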
With the new cutoffs, the pattern with the highest D-Score is "minor groove mobility" level 1, "Z-DNA free energy" level 2, "persistence length 1" level 1, "slide" level 1, "major groove mobility" level 2, "propeller twist" level 1, "CCGCC/GGCGG" level 0, "A/T" level 2. This pattern is found 220920 times (20.15%) in the NCS profile and 226873 times (27.04%) in the LCS profile. It consists mostly of structural features, because the distributions of structural features conform more closely to a normal distribution than those of k-mer frequency features; "A/T" also conforms well to a normal distribution compared with the other k-mer frequency features. Of the 88 previous patterns, 82 are included among the new patterns, and for the other 6 we can find corresponding patterns that contain the same features at different levels. For example, for the previous pattern "Z-DNA free energy" level 1, "AAAA/TTTT" level 1, "A/T" level 1, we discovered a pattern with the same features but all at level 2 ("Z-DNA free energy" level 2, "AAAA/TTTT" level 2, "A/T" level 2). We also discovered many long patterns (combinations of more than 4 features), whereas the previous method yielded only patterns of length 3 and 4. These long patterns generally consist of the frequent features mentioned in 3.2 plus some other features. For example, the new pattern "minor groove mobility" level 1, "Z-DNA free energy" level 2, "persistence length 1" level 1, "major groove mobility" level 2, "CGCC/GGCG" level 0, "A/T" level 2 contains the same features at the same levels as the previously discovered pattern "minor groove mobility" level 1, "Z-DNA free energy" level 2, "persistence length 1" level 1, "A/T" level 2, with two additional features, "major groove mobility" level 2 and "CGCC/GGCG" level 0.

Funding:

REFERENCES

Gan, Y. et al. (2012) Structural features based genome-wide characterization and prediction of nucleosome organization. BMC Bioinformatics, 13, 49.
Hall, M. et al. The WEKA Data Mining Software: An Update. 11, 10–18.
Ho, P.S. et al.
(1990) Polarized electronic spectra of Z-DNA single crystals. Biopolymers, 30, 151–63.
[1] Luger, K., Mader, A.W., Richmond, R.K., Sargent, D.F. and Richmond, T.J. (1997) Crystal structure of the nucleosome core particle at 2.8 A resolution, Nature, 389(6648), 251-260.
[2] Kornberg, R.D. and Lorch, Y. (1999) Twenty-five years of the nucleosome, fundamental particle of the eukaryote chromosome, Cell, 98(3), 285-294.
[3] Tillo, D., Kaplan, N., Moore, I.K., Fondufe-Mittendorf, Y., Gossett, A.J., Field, Y., Lieb, J.D., Widom, J., Segal, E. and Hughes, T.R. (2010) High nucleosome occupancy is encoded at human regulatory sequences, PLoS ONE, 5(2), e9129.
[4] Daenen, F., van Roy, F. and De Bleser, P.J. (2008) Low nucleosome occupancy is encoded around functional human transcription factor binding sites, BMC Genomics, 9, 332.
[5] Sekinger, E.A., Moqtaderi, Z. and Struhl, K. (2005) Intrinsic histone-DNA interactions and low nucleosome density are important for preferential accessibility of promoter regions in yeast, Molecular Cell, 18(6), 735-748.
[6] Zhu, Z. and Thiele, D.J. (1996) A specialized nucleosome modulates transcription factor access to a C. glabrata metal responsive promoter, Cell, 87(3), 459-470.
[7] Jiang, C. and Pugh, B.F. (2009) Nucleosome positioning and gene regulation: advances through genomics, Nature Reviews Genetics, 10(3), 161-172.
[8] Li, B., Carey, M. and Workman, J.L. (2007) The role of chromatin during transcription, Cell, 128(4), 707-719.
[9] Shendure, J. and Ji, H. (2008) Next-generation DNA sequencing, Nature Biotechnology, 26(10), 1135-1145.
[10] Field, Y., Kaplan, N., Fondufe-Mittendorf, Y., Moore, I.K., Sharon, E., Lubling, Y., Widom, J. and Segal, E. (2008) Distinct modes of regulation by chromatin encoded through nucleosome positioning signals, PLoS Computational Biology, 4(11), e1000216.
[11] Albert, I., Mavrich, T.N., Tomsho, L.P., Qi, J., Zanton, S.J., Schuster, S.C. and Pugh, B.F. (2007) Translational and rotational settings of H2A.Z nucleosomes across the Saccharomyces cerevisiae genome, Nature, 446(7135), 572-576.
[12] Zhang, Y., Moqtaderi, Z., Rattner, B.P., Euskirchen, G., Snyder, M., Kadonaga, J.T., Liu, X.S. and Struhl, K. (2009) Intrinsic histone-DNA interactions are not the major determinant of nucleosome positions in vivo, Nature Structural & Molecular Biology, 16(8), 847-852.
[13] Kaplan, N., Moore, I.K., Fondufe-Mittendorf, Y., Gossett, A.J., Tillo, D., Field, Y., LeProust, E.M., Hughes, T.R., Lieb, J.D., Widom, J. and Segal, E. (2009) The DNA-encoded nucleosome organization of a eukaryotic genome, Nature, 458(7236), 362-366.
[14] Valouev, A., Ichikawa, J., Tonthat, T., Stuart, J., Ranade, S., Peckham, H., Zeng, K., Malek, J.A., Costa, G., McKernan, K., Sidow, A., Fire, A. and Johnson, S.M. (2008) A high-resolution, nucleosome position map of C. elegans reveals a lack of universal sequence-dictated positioning, Genome Research, 18(7), 1051-1063.
[15] Barski, A., Cuddapah, S., Cui, K., Roh, T.Y., Schones, D.E., Wang, Z., Wei, G., Chepelev, I. and Zhao, K. (2007) High-resolution profiling of histone methylations in the human genome, Cell, 129(4), 823-837.
[16] Schones, D.E., Cui, K., Cuddapah, S., Roh, T.Y., Barski, A., Wang, Z., Wei, G. and Zhao, K. (2008) Dynamic regulation of nucleosome positioning in the human genome, Cell, 132(5), 887-898.
[17] Yuan, G.C., Liu, Y.J., Dion, M.F., Slack, M.D., Wu, L.F., Altschuler, S.J. and Rando, O.J. (2005) Genome-scale identification of nucleosome positions in S. cerevisiae, Science, 309(5734), 626-630.
[18] Lee, W., Tillo, D., Bray, N., Morse, R.H., Davis, R.W., Hughes, T.R. and Nislow, C. (2007) A high-resolution atlas of nucleosome occupancy in yeast, Nature Genetics, 39(10), 1235-1244.
[19] Wang, Z., Zang, C., Rosenfeld, J.A., Schones, D.E., Barski, A., Cuddapah, S., Cui, K., Roh, T.Y., Peng, W., Zhang, M.Q. and Zhao, K. (2008) Combinatorial patterns of histone acetylations and methylations in the human genome, Nature Genetics, 40(7), 897-903.
[20] Segal, E., Fondufe-Mittendorf, Y., Chen, L., Thastrom, A., Field, Y., Moore, I.K., Wang, J.P. and Widom, J. (2006) A genomic code for nucleosome positioning, Nature, 442(7104), 772-778.
[21] Peckham, H.E., Thurman, R.E., Fu, Y., Stamatoyannopoulos, J.A., Noble, W.S., Struhl, K. and Weng, Z. (2007) Nucleosome positioning signals in genomic DNA, Genome Research, 17(8), 1170-1177.
[22] Reynolds, S.M., Bilmes, J.A. and Noble, W.S. (2010) Learning a weighted sequence model of the nucleosome core and linker yields more accurate predictions in Saccharomyces cerevisiae and Homo sapiens, PLoS Computational Biology, 6(7), e1000834.
[23] Gupta, S., Dennis, J., Thurman, R.E., Kingston, R., Stamatoyannopoulos, J.A. and Noble, W.S. (2008) Predicting human nucleosome occupancy from primary sequence, PLoS Computational Biology, 4(8), e1000134.
[24] Ioshikhes, I.P., Albert, I., Zanton, S.J. and Pugh, B.F. (2006) Nucleosome positions predicted through comparative genomics, Nature Genetics, 38(10), 1210-1215.
[25] Ozsolak, F., Song, J.S., Liu, X.S. and Fisher, D.E. (2007) High-throughput mapping of the chromatin structure of human promoters, Nature Biotechnology, 25(2), 244-248.
[26] Zhang, Y., Shin, H., Song, J.S., Lei, Y. and Liu, X.S. (2008) Identifying positioned nucleosomes with epigenetic marks in human from ChIP-Seq, BMC Genomics, 9, 537.
[27] David, L., Huber, W., Granovskaia, M., Toedling, J., Palm, C.J., Bofkin, L., Jones, T., Davis, R.W. and Steinmetz, L.M. (2006) A high-resolution map of transcription in the yeast genome, Proceedings of the National Academy of Sciences of the United States of America, 103(14), 5320-5325.
[28] Abeel, T., Saeys, Y., Bonnet, E., Rouze, P. and Van de Peer, Y. (2008) Generic eukaryotic core promoter prediction using structural features of DNA, Genome Research, 18(2), 310-323.
[29] Friedman, J., Hastie, T. and Tibshirani, R. (2000) Additive logistic regression: a statistical view of boosting, The Annals of Statistics, 28, 337–407.
[30] Friedman, J.H. (2001) Greedy function approximation: a gradient boosting machine, The Annals of Statistics, 29, 1189–1232.