Category
Feature pattern discovery to predict nucleosome occupancy in
yeast and human
Co-responding Author1,*, Co-author2 and Co-Author2
1 Department of XXXXXXX, Address XXXX etc.
2 Department of XXXXXXX, Address XXXX etc.
Received on XXXXX; revised on XXXXX; accepted on XXXXX
Associate Editor: XXXXXXX
ABSTRACT
Motivation: Various sequence-based and structural features have been
reported to be predictive of nucleosome positioning. Which features are
important, and whether the important features are the same in different
species (human and yeast), is not yet clear.
Results: By dividing the genome into different regions (gene region,
promoter region, intergenic region) and by transcription level (highly
expressed, lowly expressed), we found that different feature subsets are
predictive for different genomic regions and for genes with different
transcriptional activities. Some of these feature subsets are conserved
between yeast and human. We also identified different patterns of
nucleosome positioning that cannot be explained by local sequence-based
features. We show that these patterns are likely to be functional results
of different chromatin remodelers, and that including linker length
together with effective feature subsets improves nucleosome positioning
prediction.
Availability:
Contact:
1. INTRODUCTION
Eukaryotic genomic DNA is packaged with histones and other proteins into
a chromatin complex. The basic unit of DNA packaging is the nucleosome,
which comprises a histone octamer with a ~147 bp DNA segment wrapped
around it (Kornberg and Lorch, 1999; Luger, et al., 1997). The DNA
segments wrapped around nucleosomes are often called nucleosomal DNA, and
those between two adjacent nucleosomes are often called linker DNA.
Linker DNA has variable length, often ranging from 20 to 80 bp. The
amount of histones occupying a unit length of DNA in a population of
cells is called the nucleosome occupancy level of that DNA segment. It is
well known that nucleosome occupancy is not uniformly distributed across
the genome and that different nucleosome occupancies in different DNA
regions can have important functional meanings (Daenen, et al., 2008;
Tillo, et al., 2010).
For example, highly occupied DNA regions are often difficult for
regulatory proteins such as transcription factors to access, while
nucleosome-depleted regions are more accessible to DNA-binding proteins
that function in gene regulation (Sekinger, et al., 2005; Zhu and Thiele,
1996). Nucleosomes have also been reported to play important roles in
epigenetic gene regulation (Jiang and Pugh, 2009; Li, et al., 2007).
Therefore, understanding what causes nucleosomes to form on different
genomic regions is important for understanding gene regulation and
uncovering cellular mechanisms.
Experimental methods for measuring nucleosome occupancy have been
advanced by next-generation sequencing technology (Shendure and Ji,
2008), especially the parallel sequencing of nucleosomal DNA (Albert, et
al., 2007; Barski, et al., 2007; Field, et al., 2008; Kaplan, et al.,
2009; Valouev, et al., 2008; Zhang, et al., 2009). A number of
genome-scale nucleosome occupancy datasets have become publicly available
for multiple species, including human and yeast (Lee, et al., 2007;
Schones, et al., 2008; Wang, et al., 2008; Yuan, et al., 2005). Based on
these datasets, dozens of studies have reported that particular types of
DNA sequences tend to have higher nucleosome occupancy levels in vivo and
that there is potentially a genomic code for nucleosome-forming sequences
(Segal, et al., 2006). Accordingly, a number of computational methods
have recently been developed to model and predict nucleosome occupancy
from the underlying features of the primary DNA sequence (Field, et al.,
2008; Gupta, et al., 2008; Lee, et al., 2007; Peckham, et al., 2007;
Reynolds, et al., 2010; Segal, et al., 2006). For example, Peckham et al.
incorporated variable-length k-mers into an SVM model to predict
nucleosomal DNA in yeast. Lee et al. applied a lasso regression model to
a set of DNA sequence and structural features for nucleosomal DNA
prediction. Reynolds et al. combined the nucleosomal DNA and the adjacent
linker DNA sequences to characterize the nucleosome. Although the
prediction accuracy is in general not high, which has led to the common
understanding that factors other than primary sequence may also influence
nucleosome formation, these methods have identified a number of sequence
features that do contribute to nucleosome occupancy prediction. The
selected features, however, are not always consistent across studies. For
example, several studies have confirmed that the ~10 bp periodicity of
certain dinucleotides is predictive of nucleosome formation (Field, et
al., 2008; Ioshikhes, et al., 2006; Kaplan, et al., 2009; Segal, et al.,
2006), whereas Lee et al. claim that structural features such as tilt and
propeller twist are the most effective features, and identified “Tip,
Tilt, and Propeller Twist” as important structural features for
nucleosome occupancy prediction in yeast. The diversity of effective
features identified in nucleosome-prediction studies leads to our
hypothesis that multiple DNA features can simultaneously or
combinatorially influence nucleosome occupancy.
*To whom correspondence should be addressed.
© Oxford University Press 2005
Yiyu Zheng et al.
In this paper, we develop a computational method, FFN (Finding Features
for Nucleosomes), to identify features and feature combinations that are
important for nucleosome occupancy prediction. By applying FFN to
genome-wide nucleosome occupancy data in yeast (Lee, et al., 2007) and
human (Schones, et al., 2008), we found that a number of different
features, when combined, have high predictive power for
nucleosome-forming or nucleosome-depleted sequences. The predictive power
of combined features can be affected by factors beyond the static DNA
sequence, such as gene transcriptional activity. We also show that
certain structural features frequently appear in feature patterns in
nucleosome-forming sequences.
2. METHODS
2.1 Data source
Nucleosome positioning data for yeast were obtained from Lee et al., and
the data for human from Schones et al., for both resting and activated
human T cells (Lee, et al., 2007; Schones, et al., 2008). The Laplacian
of Gaussian (LOG) method (Ozsolak, et al., 2007; Zhang, et al., 2008) was
then applied to the raw data to identify enriched regions. The parameters
were chosen for best consistency between the LOG results and the HMM
results (Lee, et al., 2007). These enriched regions are defined as
nucleosome-containing sequences (NCS), and the sequences between two NCSs
are defined as linker-containing sequences (LCS).
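As an illustration of how a Laplacian of Gaussian filter can turn an occupancy signal into NCS and LCS intervals, the following is a minimal sketch only. The kernel width `sigma`, the kernel `radius`, the calling `threshold` and the function names are all assumptions made for this example; the parameters actually used were tuned against the HMM results and are not reproduced here.

```python
import math

def log_kernel(sigma, radius):
    """Sampled second derivative of a Gaussian (1-D Laplacian of Gaussian)."""
    kernel = []
    for x in range(-radius, radius + 1):
        gauss = math.exp(-x * x / (2.0 * sigma ** 2))
        kernel.append((x * x / sigma ** 4 - 1.0 / sigma ** 2) * gauss)
    mean = sum(kernel) / len(kernel)
    # Shift to zero sum so that a flat signal produces no response.
    return [w - mean for w in kernel]

def convolve_same(signal, kernel):
    """Apply a symmetric kernel, clamping indices at the signal edges."""
    radius = len(kernel) // 2
    last = len(signal) - 1
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(kernel):
            acc += w * signal[min(max(i + j - radius, 0), last)]
        out.append(acc)
    return out

def call_ncs(signal, sigma=20, radius=60, threshold=1e-6):
    """Half-open (start, end) runs where the LoG response is below -threshold.

    Occupancy peaks have negative curvature, so they give a negative response.
    All three parameters are illustrative assumptions.
    """
    response = convolve_same(signal, log_kernel(sigma, radius))
    regions, start = [], None
    for i, value in enumerate(response):
        inside = value < -threshold
        if inside and start is None:
            start = i
        elif not inside and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(signal)))
    return regions

def linkers_between(ncs_regions):
    """LCSs are the gaps between consecutive NCSs."""
    pairs = zip(ncs_regions, ncs_regions[1:])
    return [(a[1], b[0]) for a, b in pairs if b[0] > a[1]]

# Toy occupancy signal with two peaks centred at positions 150 and 400.
signal = [math.exp(-(i - 150) ** 2 / 1800.0) + math.exp(-(i - 400) ** 2 / 1800.0)
          for i in range(550)]
ncs = call_ncs(signal)
lcs = linkers_between(ncs)
```

On the toy signal, each peak is called as one NCS and the gap between them as one LCS.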
For a given gene and its promoter region (1,000 bp upstream of the
Transcription Start Site (TSS)), we define a nucleosome as the 0th
nucleosome if it overlaps with the TSS, the nucleosomes upstream of the
TSS as the -1st nucleosome, -2nd nucleosome and so on, and the
nucleosomes downstream of the TSS as the 1st nucleosome, 2nd nucleosome
and so on.
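The indexing convention can be sketched as follows. This is an illustrative helper, not part of the published pipeline: the function name is hypothetical, and the sketch assumes a plus-strand gene and non-overlapping nucleosome intervals given as (start, end) coordinates.

```python
def index_nucleosomes(nucleosomes, tss):
    """Map TSS-relative indices (..., -2, -1, 0, 1, 2, ...) to (start, end).

    Assumes a plus-strand gene and non-overlapping nucleosomes; both are
    simplifying assumptions of this sketch.
    """
    ordered = sorted(nucleosomes)
    upstream = [n for n in ordered if n[1] < tss]
    overlapping = [n for n in ordered if n[0] <= tss <= n[1]]
    downstream = [n for n in ordered if n[0] > tss]

    indexed = {}
    if overlapping:
        indexed[0] = overlapping[0]          # 0th nucleosome covers the TSS
    for i, nuc in enumerate(reversed(upstream), start=1):
        indexed[-i] = nuc                    # -1st, -2nd, ... moving upstream
    for i, nuc in enumerate(downstream, start=1):
        indexed[i] = nuc                     # 1st, 2nd, ... moving downstream
    return indexed

nucs = [(700, 846), (100, 246), (500, 646), (300, 446)]
with_zero = index_nucleosomes(nucs, tss=520)
without_zero = index_nucleosomes(nucs, tss=470)
```

When no nucleosome overlaps the TSS, the result has no 0th entry, and the flanking nucleosomes are still labelled -1st and 1st.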
The gene expression data for yeast were obtained from David et al.
(David, et al., 2006). The 5736 annotated genes with gene expression
measurements were kept for further identification of high-confidence
transcripts. The yeast gene annotation is based on the annotation
resource at UCSC. High-confidence transcripts are defined as those
transcript segments that overlap by more than 50% with a non-dubious
annotated coding region at the 5’ end (Lee, et al., 2007). In total, 5300
of the 5736 genes are defined as high-confidence transcripts.
The gene expression data for human were obtained from the GEO database
(GSE10437), which measures whole-genome gene expression under two T-cell
resting conditions and two activated conditions (Schones, et al., 2008).
A total of 19049 genes with gene expression measurements were obtained
using the gene annotation resources at UCSC (human hg18 build). Genes
that are absent under both resting conditions and present under both
activated conditions are defined as induced genes. Similarly, genes that
are present under both resting conditions and absent under both activated
conditions are defined as repressed genes (Schones, et al., 2008). In
total, 299 genes are defined as induced genes and 393 genes as repressed
genes. We refer to the induced and repressed genes together as disturbed
genes.
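The induced/repressed rule can be written down directly. The sketch below is illustrative only: the function name and the encoding of the present/absent detection calls as booleans are assumptions, and each gene is assumed to have one call per condition.

```python
def classify_gene(resting_present, activated_present):
    """Classify a gene from present/absent calls (True = present).

    The boolean encoding of the calls is an assumption of this sketch.
    """
    resting = list(resting_present)
    activated = list(activated_present)
    if not any(resting) and all(activated):
        return "induced"      # absent in all resting, present in all activated
    if all(resting) and not any(activated):
        return "repressed"    # present in all resting, absent in all activated
    return "other"

labels = [
    classify_gene([False, False], [True, True]),
    classify_gene([True, True], [False, False]),
    classify_gene([True, False], [True, True]),
]
```

Genes labelled "induced" or "repressed" together form the disturbed set.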
2.2 Feature compilation and sequence representation
Features that are relevant to nucleosome occupancy were compiled from the
literature. The compiled features fall into two classes: (1) sequence
features such as DNA k-mer frequency (Peckham, et al., 2007), poly(A)
tracts, transcription factor binding sites and sequence repeats (Lee, et
al., 2007); (2) DNA structural features from Lee et al. and Abeel et al.
(Abeel, et al., 2008; Lee, et al., 2007). The structural feature values
are computed from the conversion tables (Supplementary Table 1).
Structural features that are highly correlated with other structural
features (absolute value of the Pearson correlation greater than 0.9)
were removed, resulting in 23 structural features (Supplementary Table
1). Finally, 766 features, including 694 k-mer frequency features, 4 Poly
tracts, 40 poly(dA/dT) tracts, 2 sequence repeat features, 23 structural
features and 3 motifs (yeast only), were kept for further analysis
(Supplementary Table 2).
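The correlation filter for structural features can be sketched as below. The text does not say which member of a correlated pair is dropped, so the greedy, order-dependent rule here is an assumption, as are the function names and the toy values. Each feature is represented by a vector of values, for example its entries in a conversion table.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def drop_correlated(features, threshold=0.9):
    """Keep a feature only if |r| <= threshold against every feature kept so far.

    Greedy and order-dependent: an assumption of this sketch.
    """
    kept = {}
    for name, values in features.items():
        if all(abs(pearson(values, other)) <= threshold for other in kept.values()):
            kept[name] = values
    return list(kept)

# Toy values, not taken from the supplementary conversion tables.
toy = {
    "tilt": [1.0, 2.0, 3.0, 4.0],
    "tilt_rescaled": [2.0, 4.0, 6.0, 8.1],
    "slide": [1.0, -1.0, 1.0, -1.0],
}
kept = drop_correlated(toy)
```

Here "tilt_rescaled" is dropped because it is almost perfectly correlated with "tilt", while "slide" is kept.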
We then applied LogitBoost (Friedman, et al., 2000) to further select the
most relevant features. LogitBoost is a boosting algorithm for
classification that uses the logistic loss as its cost function, and it
assigns a weight to each feature selected in the model, which allows the
impact of the features on the model to be estimated (Friedman, 2001). We
used the implementation of LogitBoost in the software “WEKA” (Hall et
al., 2009). As training sets we chose the top 1000 nucleosomes with the
highest profile scores and the top 1000 linker sequences with the lowest
scores, separately for the yeast, human resting T-cell and human
activated T-cell data. We then ran 100 boosting iterations using all the
features. We kept the top 10 features in each of the yeast, human resting
and human activated T-cell datasets, as they generally account for 60% of
the weight of all the selected features (Supplementary Table 3). We then
pooled these top 10 features, and also included the top 10 features of
Peckham, et al., 2007 and three structural features from Lee, et al.,
2007. Finally, after removing duplicated features, we included 30
features in our consideration (Table 1).
We then used the selected features to represent the NCSs and LCSs as
follows. We computed all the feature values for every 147 bp-long
subsequence in all the NCSs/LCSs. As the feature values generally follow
a normal distribution, we discretized each feature into m levels using
the m-1 cutoffs μ – (m/2-1)σ, μ – (m/2-2)σ, …, μ, …, μ + (m/2-1)σ, where
μ and σ are the mean and standard deviation of the feature values. For
m=4, level 0 is (-∞, μ – σ), level 1 is [μ – σ, μ), level 2 is [μ, μ +
σ), and level 3 is [μ + σ, +∞). With the discretized feature values,
every 147 bp subsequence in an NCS/LCS is replaced by the combination of
its discretized feature values, called its feature profile.
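The discretization step can be illustrated with a short sketch. The function names and the toy mean and standard deviation are assumptions made for the example; m is assumed to be even, as in the m=4 case above.

```python
from bisect import bisect_right

def cutoffs(mu, sigma, m):
    """The m-1 cutoffs mu + j*sigma for j = -(m/2-1), ..., m/2-1 (m even)."""
    half = m // 2
    return [mu + j * sigma for j in range(-(half - 1), half)]

def discretize(value, cuts):
    """Level = number of cutoffs <= value, so intervals are closed on the left."""
    return bisect_right(cuts, value)

def feature_profile(values, stats, m=4):
    """Replace raw feature values of one 147 bp window by discretized levels.

    values: dict feature -> raw value for the window
    stats:  dict feature -> (mean, standard deviation) over all windows
    """
    return {name: discretize(v, cutoffs(*stats[name], m)) for name, v in values.items()}

def windows(sequence, width=147):
    """All width-long subsequences of a sequence, in order."""
    return [sequence[i:i + width] for i in range(len(sequence) - width + 1)]

cuts = cutoffs(10.0, 2.0, 4)          # mu - sigma, mu, mu + sigma
levels = [discretize(v, cuts) for v in (7.9, 8.0, 9.9, 10.0, 11.9, 12.0)]
profile = feature_profile({"Tilt": 11.0, "A/T": 0.2},
                          {"Tilt": (10.0, 2.0), "A/T": (0.5, 0.1)})
```

For m=4 this reproduces the four intervals listed above: values below μ – σ map to level 0 and values at or above μ + σ map to level 3.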
Table 1. Top 30 features
 1  Tip                      11  AAT/ATT         21  GAC/GTC
 2  Minor groove mobility    12  ATTA/TAAT       22  CCCC/GGGG
 3  Tilt                     13  TAA/TTA         23  ACAC/GTGT
 4  Z-DNA free energy        14  TAATA/TATTA     24  AATTA/TAATT
 5  Persistence length 1     15  AAAA/TTTT       25  ATAT
 6  Slide                    16  CGCC/GGCG       26  A/T
 7  Major groove mobility    17  AAG/CTT         27  TA
 8  Propeller twist          18  ACA/TGT         28  AAA/TTT
 9  ATA/TAT                  19  AAATA/TATTT     29  AT
10  ATAA/TTAT                20  CCGCC/GGCGG     30  AATA/TATT
30 features used in pattern discovery
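The k-mer entries in Table 1 pair each k-mer with its reverse complement (for example AAT/ATT), while self-complementary k-mers such as TA or ATAT appear alone. The sketch below shows one way such a feature could be named and counted in a window. Normalizing by the number of k-mer positions is an assumption of this sketch, and the function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(kmer):
    """Reverse complement of a DNA string over the alphabet A, C, G, T."""
    return "".join(COMPLEMENT[base] for base in reversed(kmer))

def feature_name(kmer):
    """Name in the style of Table 1: 'AAT/ATT', or just 'TA' if self-complementary."""
    rc = reverse_complement(kmer)
    return kmer if rc == kmer else "/".join(sorted((kmer, rc)))

def kmer_frequency(window, kmer):
    """Fraction of k-mer positions in the window matching the k-mer or its
    reverse complement (the normalization is an assumption of this sketch)."""
    k = len(kmer)
    positions = len(window) - k + 1
    if positions <= 0:
        return 0.0
    targets = {kmer, reverse_complement(kmer)}
    hits = sum(1 for i in range(positions) if window[i:i + k] in targets)
    return hits / positions

names = [feature_name(k) for k in ("ATT", "GTC", "TA", "A", "TATTT")]
freq = kmer_frequency("AATTA", "AAT")   # AAT and ATT match, TTA does not
```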
2.3 Scoring of a potential feature pattern
To determine the discriminating power of a potential feature pattern, we
defined the z-score of a specific feature pattern in nucleosome-forming
sequences as Zn, and Zl in linker-forming sequences. So the
discriminative score D-Score = Zn – Zl. For a pattern such as a1b2c3, the
z-score is computed as:

$$ \mathrm{zscore} = \frac{\mathrm{profileCount}(a_1 b_2 c_3) - E(a_1 b_2 c_3)}{\sqrt{E\big((a_1 b_2 c_3)^2\big) - \big(E(a_1 b_2 c_3)\big)^2}} \qquad (1) $$

The expectation of pattern a1b2c3 is calculated using the following
formula:

$$ E(a_1 b_2 c_3) = \pi_{a_1}\,\pi_{b_2}\,\pi_{c_3} + \sum_{r=2}^{h_i} \Big( \sum_{x=1}^{k} \pi_{a_x} (T_a^{\,r-1})_{a_x a_1} \Big) \Big( \sum_{x'=1}^{k} \pi_{b_{x'}} (T_b^{\,r-1})_{b_{x'} b_2} \Big) \Big( \sum_{x''=1}^{k} \pi_{c_{x''}} (T_c^{\,r-1})_{c_{x''} c_3} \Big) \qquad (2) $$

$$ E\big((a_1 b_2 c_3)^2\big) = E(a_1 b_2 c_3) + 2 \sum_{r=1}^{h_i - 1} \sum_{j=r+1}^{h_i} \Big( \sum_{x=1}^{k} \pi_{a_x} (T_a^{\,r-1})_{a_x a_1} (T_a^{\,j-r})_{a_1 a_1} \Big) \Big( \sum_{x'=1}^{k} \pi_{b_{x'}} (T_b^{\,r-1})_{b_{x'} b_2} (T_b^{\,j-r})_{b_2 b_2} \Big) \Big( \sum_{x''=1}^{k} \pi_{c_{x''}} (T_c^{\,r-1})_{c_{x''} c_3} (T_c^{\,j-r})_{c_3 c_3} \Big) \qquad (3) $$

While Ta is the transition matrix for the Markov chain modeled feature a
transitioning between its categories across all the windows in the m
sequences, and π is calculated as following:

$$ \pi_a T_a = \pi_a \qquad (4) $$

While πax is between [0, 1] for all x, and meets the requirement that:

$$ \sum_{x=1}^{k} \pi_{a_x} = 1 \qquad (5) $$

In the end we keep the patterns with score above 3√2 as discriminative
patterns. As the ranges of D-Score in different species are different, in
each situation, we normalize the D-Score of each pattern by dividing the
original D-Score by the highest D-Score of all patterns to make the score
in range [0, 1].
3. EXPERIMENTS AND RESULTS
To identify potential features that combinatorially characterize
nucleosome/linker-containing sequences, we developed FFN algorithm
(Algorithm 1. FFN algorithm). We then applied the FFN to discover feature
and feature combinations in yeast and human T-cell data.
3.1 The FFN algorithm
The FFN algorithm aims to discover combinations of features that are
relevant to NCS/LCS-forming and can be used to distinguish NCSs from
LCSs. Given all of the 147 bp-long subsequences in the obtained NCSs and
LCSs, the FFN starts from enumerating all the possible two-feature
combinations, called di-feature patterns. Only those di-feature patterns
whose occurrence is larger than α percent of the total number of NCS
subsequences (e.g. α=20) are kept. The algorithm then searches for
frequently co-occurred di-feature patterns using frequent pattern mining
techniques (cite FIM). Next, FFN performs an extension step, in which FFN
investigates whether some of these di-feature patterns in the same
cluster can be further extended into tri-feature or even longer patterns.
This extension step is implemented by a pattern merging procedure. For
example, given four di-feature patterns a1b1, a1c3, b1c3 and c3d4 in one
cluster, meaning they are frequently co-occurring in the input sequences,
we observe that they can in fact be merged into three tri-feature
patterns a1b1c3, b1c3d4, a1c3d4 and one tetra-feature pattern a1b1c3d4.
Note that the occurrence frequencies of these merged patterns may vary,
but should be all greater than α% of the total NCS subsequences. After
the pattern extension step, FFN evaluates the statistical significance of
the patterns’ enrichment in the NCS and LCS subsequences using z-scores
(see Methods section). The D-score for a given pattern is then computed
to determine whether it has sufficient discriminative power to
distinguish NCSs from LCSs. In this way, we can identify NCS-specific
feature patterns that are frequently found in NCSs but less frequently
found in LCSs. Similarly, we can start FFN from LCS data to identify
LCS-specific feature patterns.
Algorithm 1. FFN algorithm
Input: NCS profile, LCS profile
Output: a set of discriminative patterns
Ck: candidate pattern set of length k
Lk: frequent pattern set of length k
Initial discovered pattern set: R is Ø
Start from length 1 pattern set L1: {all features}
While Lk is not null
    Determine candidate pattern set Ck+1 by merging patterns in Lk
    FOR each profile item p in Input
        FOR each candidate pattern c in Ck+1
            IF p contains pattern c
                Increment support(c)
    Generate Lk+1 with all candidate patterns in Ck+1 with support > alpha
    FOR each candidate pattern P in Lk
        Calculate Zn: zScore(P) in NCS profiles based on formulas (1-5)
        Calculate Zl: zScore(P) in LCS profiles
        D-Score = Zn - Zl
        IF D-Score > cutoff (3√2)
            put the pattern in the result set R
Return discovered pattern set R.
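The merge-and-count loop of Algorithm 1 can be sketched in a few lines. This is an Apriori-style illustration under stated assumptions, not the FFN implementation. Profiles are assumed to be dictionaries mapping feature names to levels, α is given as a fraction rather than a percentage, single-feature patterns are skipped when scoring, and `zscore` is a caller-supplied stand-in for formulas (1)-(5). All function names are hypothetical.

```python
import math
from itertools import combinations

def frequent_patterns(profiles, alpha):
    """Return {pattern: support} for patterns found in more than
    alpha * len(profiles) profiles. A pattern is a frozenset of
    (feature, level) pairs; profiles are dicts feature -> level."""
    min_support = alpha * len(profiles)
    itemsets = [frozenset(p.items()) for p in profiles]

    counts = {}
    for items in itemsets:
        for item in items:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {p: n for p, n in counts.items() if n > min_support}

    frequent = dict(current)
    k = 1
    while current:
        # Merge step: two frequent length-k patterns sharing k-1 items
        # give one length-(k+1) candidate.
        candidates = {a | b for a, b in combinations(current, 2)
                      if len(a | b) == k + 1}
        counts = {c: sum(1 for items in itemsets if c <= items)
                  for c in candidates}
        current = {p: n for p, n in counts.items() if n > min_support}
        frequent.update(current)
        k += 1
    return frequent

def discriminative_patterns(ncs_profiles, lcs_profiles, zscore,
                            alpha=0.2, cutoff=3 * math.sqrt(2)):
    """Keep frequent NCS patterns whose D-score (Zn - Zl) exceeds the cutoff."""
    result = {}
    for pattern in frequent_patterns(ncs_profiles, alpha):
        if len(pattern) < 2:
            continue  # assumption: scoring starts from di-feature patterns
        d_score = zscore(pattern, ncs_profiles) - zscore(pattern, lcs_profiles)
        if d_score > cutoff:
            result[pattern] = d_score
    return result

# Toy data; the support count is used as a placeholder score, NOT formula (1).
ncs = [{"a": 1, "b": 1, "c": 3}] * 6 + [{"a": 0, "b": 1, "c": 0}] * 2
lcs = [{"a": 0, "b": 1, "c": 0}] * 8
toy_z = lambda pat, profs: sum(1 for p in profs if pat <= frozenset(p.items()))
freq = frequent_patterns(ncs, alpha=0.5)
found = discriminative_patterns(ncs, lcs, toy_z, alpha=0.5)
```

Because every profile carries exactly one level per feature, a candidate that combines two levels of the same feature has zero support and is discarded in the counting step. Swapping the roles of the NCS and LCS profiles gives LCS-specific patterns.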
3.2 Feature patterns identified in yeast data
Applying the FFN algorithm to the yeast nucleosome data, we identified 88
NCS-specific patterns with a D-score larger than 3√2 (Table 2,
Supplementary Table 4). For example, “Minor groove mobility” level 1,
“Z-DNA free energy” level 2, “Persistence length 1” level 1 and “A/T”
level 2 form the pattern with the highest D-score, 80.56. This pattern
occurs 228845 times in NCSs with a z-score of 526.30 and 162726 times in
LCSs with a z-score of 445.74. A length-3 subpattern of this pattern,
containing “Z-DNA free energy” level 2, “Persistence length 1” level 1
and “A/T” level 2, is also identified as a discriminative pattern, with a
D-score of 57.04. We also identified several patterns with the same
feature combination but different levels, for example the pattern “Tip”
level 2, “CCGCC/GGCGG” level 0, “TA” level 2 (D-score = 35.09) and the
pattern “Tip” level 1, “CCGCC/GGCGG” level 0, “TA” level 1 (D-score =
13.23). These patterns suggest that different combinations of features at
various levels can be used for NCS prediction. We found that the feature
combinations in NCSs are not specific to individual genes or promoters,
since nucleosomes in the same promoter region often contain different
feature combinations. For example, the promoter region of gene YAL047W-A
contains 5 nucleosomes with very different patterns (Figure 1-A). We also
identified 15 LCS-specific feature patterns in yeast (Supplementary Table
4). For example, “ATA/TAT” level 1, “TAATA/TATTA” level 0 and
“CCGCC/GGCGG” level 0 form a high D-score pattern (D-score = 10.38). It
occurs 191297 times in LCSs with a z-score of 13.44 and 355422 times in
NCSs with a z-score of 3.06. “CCGCC/GGCGG” level 0, “AATTA/TAATT” level 0
and “AATA/TATT” level 1 also form a pattern (D-score = 4.99).
Table 2. Top 10 yeast patterns identified
Rank  Pattern
 1    "Minor groove mobility" level 1, "Z-DNA free energy" level 2, "Persistence length 1" level 1, "A/T" level 2
 2    "Z-DNA free energy" level 2, "Persistence length 1" level 1, "CCCC/GGGG" level 0, "A/T" level 2
 3    "Z-DNA free energy" level 2, "Persistence length 1" level 1, "CCGCC/GGCGG" level 0, "A/T" level 2
 4    "Z-DNA free energy" level 2, "Persistence length 1" level 1, "A/T" level 2
 5    "Minor groove mobility" level 1, "Z-DNA free energy" level 2, "CCCC/GGGG" level 0, "A/T" level 2
 6    "Minor groove mobility" level 1, "Persistence length 1" level 1, "CCGCC/GGCGG" level 0, "A/T" level 2
 7    "Minor groove mobility" level 1, "Z-DNA free energy" level 2, "Persistence length 1" level 1, "CCGCC/GGCGG" level 0
 8    "Minor groove mobility" level 1, "Z-DNA free energy" level 2, "CCGCC/GGCGG" level 0, "A/T" level 2
 9    "Z-DNA free energy" level 2, "CGCC/GGCG" level 0, "CCGCC/GGCGG" level 0, "A/T" level 2
10    "Minor groove mobility" level 1, "Persistence length 1" level 1, "A/T" level 2
By applying the FFN algorithm we identified 88 NCS-specific patterns. Three
structural features frequently occur in the top 10 patterns.
Fig. 1. Pattern distribution curve. (A) In the promoter region of
YAL047W-A, the 5 nucleosomes contain different patterns: ("Z-DNA free
energy" level 2, "CGCC/GGCG" level 0, "CCGCC/GGCGG" level 0, "A/T" level
2), ("tip" level 2, "CCGCC/GGCGG" level 0, "TA" level 2), ("tip" level 2,
"CCGCC/GGCGG" level 0, "TA" level 2), ("tip" level 1, "CGCC/GGCG" level
0, "CCGCC/GGCGG" level 0, "TA" level 1), ("Z-DNA free energy" level 2,
"persistence length 1" level 1, "CGCC/GGCG" level 0, "CCGCC/GGCGG" level
0) respectively (listed from the 5’ end towards the TSS over the region
[-1000, 0]). (B) In the promoter region of YOR202W, one nucleosome does
not contain any pattern. The other three nucleosomes contain ("Z-DNA free
energy" level 1, "CGCC/GGCG" level 0, "CCGCC/GGCGG" level 0, "A/T" level
1), ("ATTA/TAAT" level 0, "TAATA/TATTA" level 0, "AATTA/TAATT" level 0),
("Slide" level 1, "Propeller twist" level 1, "CCGCC/GGCGG" level 0)
respectively.
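The D-scores quoted for yeast follow directly from the two z-scores of each pattern. The short sketch below reproduces that arithmetic, the 3√2 cutoff, and the normalization to [0, 1]. The function names are illustrative, and only numbers already reported in the text are used.

```python
import math

CUTOFF = 3 * math.sqrt(2)   # about 4.243

def d_score(z_own, z_other):
    """D-score = z-score in the class of interest minus z-score in the other class."""
    return z_own - z_other

def normalize(scores):
    """Divide every D-score by the largest one so that scores lie in [0, 1]."""
    top = max(scores)
    return [s / top for s in scores]

# Top NCS-specific pattern: z-score 526.30 in NCSs, 445.74 in LCSs.
top_ncs = d_score(526.30, 445.74)
# LCS-specific pattern: z-score 13.44 in LCSs, 3.06 in NCSs (roles swapped).
top_lcs = d_score(13.44, 3.06)
# D-scores quoted above for four NCS-specific yeast patterns.
scaled = normalize([80.56, 57.04, 35.09, 13.23])
```

Both differences match the reported values (80.56 and 10.38), and both exceed the cutoff of about 4.243.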
We also compared the occurrences of features in NCS-specific patterns and
LCS-specific patterns and found different feature preferences. Of the 30
features, 17 appear in NCS-specific patterns and 11 appear in
LCS-specific patterns. Thirteen features appear exclusively in
NCS-specific patterns and 7 only in LCS-specific patterns. Among the 13
NCS-exclusive features, “A/T” is found in 40 of the 88 NCS-specific
patterns but never in LCS-specific patterns. In particular, “A/T” at
level 2 frequently appears in the top-scoring feature patterns (9 of the
top 10 patterns contain this feature, Table 2). This feature was
identified as the most useful feature for distinguishing NCSs from LCSs
by Peckham et al. It frequently forms patterns with structural features
such as Minor groove mobility (11 times), Z-DNA free energy (12 times)
and Persistence length 1 (11 times). All top 20 patterns contain at least
one of these four features. This observation is consistent with the
finding that structural features help nucleosome occupancy prediction in
yeast (Lee, et al., 2007). “Z-DNA free energy”, which is related to the
free energy required for the transition from B-DNA to Z-DNA (Ho et al.,
1990), is identified 39 times (10 times at level 1 and 29 times at level
2). This feature is one of the structural features that are most
negatively correlated with nucleosome occupancy (Gan et al., 2012). It
usually co-occurs with at least one of the other three features mentioned
above in the discovered patterns (33 out of 39).
3.3 The distribution of patterns in yeast promoters
Since NCS formation in promoter regions can affect transcriptional
activity, we investigated the relationship between the distribution of
the identified feature patterns and the NCSs in yeast promoter regions.
To each NCS/LCS in the promoter regions we assigned a pattern score equal
to the largest D-score among the feature patterns exhibited by the
subsequences it contains. We found that, on average, the pattern scores
are higher in NCSs/LCSs closer to the TSS than in those farther away from
the TSS (Figure 2-A). For the -1st and 0th NCSs and their neighboring
LCSs, the average score of the NCSs is slightly higher than that of the
LCSs. This is consistent with studies reporting that nucleosomes near the
TSS in yeast are determined more by sequence features, while nucleosomes
away from the TSS are determined more by other factors (cite xxx).
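The pattern score of a region can be sketched as a maximum over the patterns matched by any of its 147 bp subsequences. The data layout (profiles as dictionaries, patterns as sets of feature-level pairs), the function name, and the choice of 0.0 for a region that matches no pattern are assumptions of this sketch. The two scores in the example are the yeast D-scores 80.56 and 35.09 after normalization.

```python
def pattern_score(region_profiles, scored_patterns):
    """Largest (normalized) D-score among patterns shown by any subsequence.

    region_profiles: list of dicts feature -> level, one per 147 bp window
    scored_patterns: dict frozenset of (feature, level) pairs -> score
    Returns 0.0 if no pattern matches (an assumption of this sketch).
    """
    best = 0.0
    for profile in region_profiles:
        items = set(profile.items())
        for pattern, score in scored_patterns.items():
            if pattern <= items and score > best:
                best = score
    return best

scored = {
    frozenset({("Z-DNA free energy", 2), ("A/T", 2)}): 1.0,
    frozenset({("Tip", 2), ("TA", 2)}): 35.09 / 80.56,
}
region = [
    {"Z-DNA free energy", "A/T", "Tip", "TA"} and
    {"Z-DNA free energy": 1, "A/T": 2, "Tip": 2, "TA": 2},
    {"Z-DNA free energy": 2, "A/T": 2, "Tip": 1, "TA": 1},
]
score = pattern_score(region, scored)
empty = pattern_score([{"Z-DNA free energy": 0, "A/T": 0, "Tip": 0, "TA": 0}], scored)
```

The first window matches only the lower-scoring pattern and the second matches the top pattern, so the region receives the top score.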
To investigate whether the distribution of feature patterns is influenced
by the genes’ transcriptional activities, we further divided the genes by
expression level. We investigated the pattern scores of the NCSs/LCSs in
the promoters of the 1000 most highly expressed genes (expression level
between 2.79 and 4.56) and the 1000 least expressed genes (expression
level between 0.0040 and 0.956). We found that, at every nucleosomal
location, the score is on average larger for the NCSs/LCSs of the
low-expressed genes than for those of the highly expressed genes (Figure
2-B). For example, for the low-expressed gene “YOR258W” (Figure 2-C),
nearly all the NCSs in its promoter region contain top-scoring patterns,
while the neighboring LCSs contain low-scoring patterns. The
low-expressed gene “YMR126C” (Figure 2-D) shows the same trend. On the
other hand, highly expressed genes such as “YDR002W” often do not contain
many high-scoring patterns in the promoter region (Figure 2-E). These
observations imply that nucleosomes in promoters tend to be less
sequence-predictable when there is significant transcriptional activity
(Lee, et al., 2007).
We also investigated whether the distribution of feature patterns is
related to the nucleosome density in the promoter region. We grouped the
promoters by the number of nucleosomes they contain. We found that, near
the TSS, the average score for the 0th nucleosome of promoters containing
six nucleosomes (0.445) is much lower than the score for the 0th
nucleosome in the other promoter groups (above 0.5) (Figure 3-A). When
gene transcriptional activity is taken into account, the average scores
of the nucleosomes near the TSS in highly expressed promoters containing
6 nucleosomes are much lower than in the other groups: the scores are
0.306, 0.362 and 0.336 for the -1st, 0th and 1st nucleosomes, whereas the
other groups generally score above 0.45 and show higher scores near the
TSS than at nucleosomes farther away (Figure 3-B). For the lowly
expressed genes, the average scores are similar across groups and are
higher near the TSS than at nucleosomes farther away (Figure 3-C).
Fig. 2. Pattern distribution in yeast. (A) Average pattern score for
nucleosomes and linkers around the TSS (0th, 1st, …). The scores for the
-2nd, -1st, 0th, 1st, 2nd NCSs are 0.458, 0.515, 0.524, 0.429, 0.444
respectively, and the scores for the -2nd, -1st, 0th, 1st, 2nd LCSs are
0.481, 0.479, 0.521, 0.477, 0.482 respectively. Scores for NCSs/LCSs near
the TSS are higher. (B) Average pattern score for nucleosomes in the top
1000 highly expressed genes and the top 1000 lowly expressed genes. (C-E)
Pattern score curves for the promoter regions of genes YOR258W, YMR126C
and YDR002W.
Fig. 3. Average pattern scores of nucleosomes vary with nucleosome
density in the promoter region. (A-C) Average pattern score of
nucleosomes in all the genes, the top 1000 highly expressed genes and the
top 1000 lowly expressed genes.
3.4 Identified features and feature patterns in human T cells
Applying the FFN algorithm to the human T-cell resting and activated data
separately, we identified 2328 NCS-specific patterns in the resting data
and 589 NCS-specific patterns in the activated data (Supplementary Table
5). For example, "minor groove mobility" level 1, "Z-DNA free energy"
level 2, "persistence length 1" level 1 and "A/T" level 2 form a pattern
(D-score = 302.71) in human T cells that was also identified in yeast.
"Tip" level 1, "AATTA/TAATT" level 0, "TA" level 1 is another pattern
(D-score = 39.9) identified in both human and yeast. As in yeast,
nucleosomes in the same promoter can exhibit different feature patterns
in the human T-cell data. For example, the four nucleosomes in the
promoter region of gene TARDBP contain very different patterns (Fig 4).
Fig. 4. Pattern distribution curve in the promoter region of TARDBP in
human resting status. There are 4 nucleosomes, and each of them contains
different patterns: ("Minor groove mobility" level 1, "Z-DNA free energy"
level 2, "Persistence length 1" level 1, "Major groove mobility" level 2,
"CCGCC/GGCGG" level 0, "A/T" level 2) (rank 1); ("Minor groove mobility"
level 2, "Z-DNA free energy" level 1, "Persistence length 1" level 2,
"Major groove mobility" level 1, "AATTA/TAATT" level 0) (rank 39);
("Minor groove mobility" level 1, "Z-DNA free energy" level 2,
"Persistence length 1" level 1, "Major groove mobility" level 2,
"CCGCC/GGCGG" level 0, "A/T" level 2) (rank 2); ("Z-DNA free energy"
level 1, "Persistence length 1" level 2, "Major groove mobility" level 1,
"AATTA/TAATT" level 0) (rank 93) respectively (listed from the 5’ end
towards the TSS over the region [-1000, 0]).
Comparing the features found in human resting and activated T cells, we
found that their patterns contain nearly all of the features; the
exceptions are AAG/CTT, GAC/GTC and ACAC/GTGT. The four features
highlighted in yeast also frequently co-occur in the top-ranked patterns,
and another feature, “AAA/TTT”, is found in 647 of the 2328 NCS-specific
patterns in resting status and in 183 of the 589 patterns in activated
status. This feature was identified by Peckham, et al. as a significant
3-mer feature for distinguishing nucleosomes from linkers. We noticed
that the top patterns discovered in resting and activated status are
similar. The top 2 patterns are the same ("minor groove mobility" level
1, "Z-DNA free energy" level 2, "persistence length 1" level 1, "major
groove mobility" level 2, "CCGCC/GGCGG" level 0, "A/T" level 2, and
"minor groove mobility" level 1, "Z-DNA free energy" level 2,
"persistence length 1" level 1, "major groove mobility" level 2, "A/T"
level 2), and nearly all of the top 100 patterns contain the feature
“A/T” and are formed by combinations of "minor groove mobility", "Z-DNA
free energy", "persistence length 1" and "major groove mobility". We also
noticed that the top patterns discovered in resting status generally
contain "minor groove mobility" at level 1, while patterns from activated
status contain it at both level 1 and level 2.
Of the 589 patterns discovered in activated status, 576 are also
discovered in resting status; if the levels are not considered, 580
patterns are formed from the same features. Although the discretization
of feature values differs slightly between the resting and activated
data, this demonstrates that our method can robustly discover many useful
patterns. One notable pattern that is discovered in activated status but
not in resting status is "minor groove mobility" level 2, "Z-DNA free
energy" level 1, "persistence length 1" level 2, "A/T" level 1 (rank 6,
with a z-score of 538). In resting status the z-score for this pattern is
-6.24.
3.5 The distribution of patterns in human
We analyzed the distribution of patterns in the NCSs and LCSs in the
human promoters. We observed that, in both resting and activated status,
the patterns in both NCSs and LCSs have lower scores closer to the TSS
than farther away from it, which is different from the trend in yeast
(Figure 5-A). We also observed that the average scores of the NCSs are
higher than those of their neighboring LCSs near the TSS and in the gene
body, while the average scores of the LCSs in the core promoter region
(the -1st and 0th nucleosome locations) are higher than those of the
paired nucleosomes. In general, in human T cells the -1st and 1st
nucleosomes near a TSS without a 0th nucleosome have lower scores than
those near a TSS with a 0th nucleosome. Additionally, the 0th nucleosomes
in human score the lowest among all nucleosomes (Figure 7, Supplementary
Figure 3-H).
When the gene expression levels are taken into account, in both resting
and activated status the patterns in NCSs closer to the TSSs of
low-expressed genes have on average higher scores than those in the NCSs
of highly expressed genes, while the difference is less pronounced for
nucleosomes farther from the TSS (Figure 5-B for resting and
Supplementary Figure 3-B for activated). Also, the score differences
between NCSs and LCSs are larger in the low-expressed genes than in the
highly expressed genes (see supplementary figure). This is consistent
with the hypothesis that the lack of transcriptional activity can lead to
sequence-determined nucleosome-forming events (Segal et al., 2006).
Figures 5-C and 5-D show two examples of the pattern distribution around
the TSS.
Fig. 5. Pattern distribution in human resting status. (A) Average pattern
score for nucleosomes and linkers around the TSS. For example, the
D-scores for the -2nd, -1st, 0th, 1st, 2nd NCSs are 0.920, 0.872, 0.799,
0.799 and 0.863 respectively. The scores for the -2nd, -1st, 0th, 1st,
2nd LCSs are 0.921, 0.894, 0.810, 0.741, 0.803 respectively. (B) Average
pattern score for nucleosomes in the top 1000 highly expressed genes and
the top 1000 lowly expressed genes. (C-D) Pattern score curves for the
promoter regions of genes FCRLA and RPL13 in resting status. (C) Lowly
expressed gene NM_032738/FCRLA, with average expression level 4.3 in
human resting status; the nucleosomes in the promoter region from the
-4th nucleosome to the 2nd nucleosome contain patterns ranked (1, 1, 4,
3, 1, 16, 4, 311). (D) Highly expressed gene NM_033251/RPL13, with
expression level 169504 in resting status; the nucleosomes in the
promoter region from the -4th nucleosome to the +2nd nucleosome contain
patterns ranked (1, 1, 2, 1406, 1109, 1413, 39).
To investigate the pattern distribution in disturbed genes and the
pattern changes before and after T-cell activation, we assigned pattern
scores to all of the NCSs/LCSs in the promoter regions of the 299 induced
genes and 393 repressed genes. For both the repressed and the induced
genes in resting status, the scores for the -1st and 0th NCSs are
generally lower than those of their neighboring LCSs (Figure 6-A, 6-B),
suggesting that “+1 nucleosome may already be depleted and prepared for
gene activation before TCR signaling” (Schones et al., 2008). For
example, the induced gene HPDL (Figure 6-C, 6-D) displays similar
nucleosome occupancy in the promoter region in resting and activated
status, and a similar pattern distribution curve.
The scores for induced and repressed genes in activated status are both
relatively lower than the average score in highly expressed genes. The
score for the 0th nucleosome is 0.702 for induced genes, 0.677 for
repressed genes and 0.750 for highly expressed genes. It seems that for
perturbed genes in activated status, nucleosome formation is more likely
to be predicted by factors related to gene activity and TCR signaling
than by sequence features. Take the repressed gene NM_173485 (Figure 6-E,
6-F) for example: unlike for the gene HPDL, although the pattern
distribution curves in resting and activated status are still similar,
the nucleosome occupancy near the TSS has changed. In resting status, the
-1st nucleosome is positioned in a region with a high pattern score, with
a low-score region upstream of it. We found that after TCR signaling
activation, that nucleosome has moved into that low-score region.
Fig. 6. Pattern distribution varies with gene status in human. (A) Average pattern score for nucleosomes and linkers in repressed genes in human resting status. (B) Average pattern score for nucleosomes and linkers in induced genes in human resting status. (C-D) Pattern score curves for the promoter region of gene HPDL in resting status and activated status. (E-F) Pattern score curves for gene NM_173485.
We also grouped the promoters by nucleosome density, that is, by the number of nucleosomes in the promoter region. Unlike in yeast, where only the promoters with 6 nucleosomes have low scores, we found that in human the more nucleosomes a promoter region contains, the lower the score of the nucleosome at each respective position (Figure 7). In activated status, we found that the score of the i-th nucleosome simply increases with density in the region near the TSS (Supplementary Figure 3-E). When the top 1000 highly expressed genes are grouped by nucleosome density (Supplementary Figure 3-F), they keep a trend similar to that of all promoters, whereas among the top 1000 lowly expressed genes (Supplementary Figure 3-G), the genes that contain only 1 nucleosome have higher scores than the genes with other nucleosome densities.
Fig. 7. Pattern distribution varies with nucleosome density in human.
3.6 Comparison of Yeast and Human Patterns
There are 35 patterns that are exactly conserved among yeast, human resting T cells and human activated T cells (Supplementary Table 6). For example, "Minor groove mobility" level 1, "Z-DNA free energy" level 2, "Persistence length 1" level 1 and "A/T" level 2 form a conserved pattern with high ranks in both yeast and human (ranked 1st in yeast, 9th in human resting and 20th in human activated), indicating that these three structural features, together with the sequence feature "A/T", are very important factors influencing nucleosome formation in both species. Note that because the feature value distributions differ between the yeast and human genomes, the same discretization level can correspond to different feature values in yeast and human.
Considering only the features and not their levels, we discovered 41 feature combinations conserved across yeast, human resting and human activated T cells (Supplementary Table 6). For example, the feature combination of "Z-DNA free energy", "CGCC/GGCG", "CCGCC/GGCGG" and "A/T" is conserved, but at different feature levels in yeast and human. In detail, the pattern discovered in yeast is "Z-DNA free energy" at level 1, "CGCC/GGCG" at level 0, "CCGCC/GGCGG" at level 0 and "A/T" at level 1, while the pattern discovered in human resting and activated T cells is "Z-DNA free energy" at level 2, "CGCC/GGCG" at level 0, "CCGCC/GGCGG" at level 0 and "A/T" at level 2. This might be caused by the different feature value ranges in the two species. The yeast genome has a lower GC content (38%) than the human genome (41%). The GC content of the promoter region [-1000, 0] of yeast and human genes is 38.32% and 53.19%, respectively.
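As an illustration of how such GC content values are obtained, the following minimal sketch computes the GC fraction of a sequence and of a promoter window upstream of a TSS. The helper names, the 0-based forward-strand coordinate convention and the exclusion of ambiguous bases are assumptions made for this example only.

```python
def gc_content(sequence):
    """Fraction of G/C among unambiguous bases (A, C, G, T)."""
    counted = [base for base in sequence.upper() if base in "ACGT"]
    if not counted:
        return 0.0
    return sum(base in "GC" for base in counted) / len(counted)


def promoter_window(chrom_seq, tss, upstream=1000):
    """Bases in [-upstream, 0) relative to the TSS.

    Assumes a forward-strand gene and a 0-based TSS index; a
    reverse-strand gene would need the reverse complement instead.
    """
    return chrom_seq[max(0, tss - upstream):tss]


# Hypothetical toy sequence, not genomic data.
toy = "ATATATGCGC" + "GGCC"
window = promoter_window(toy, tss=10, upstream=4)
window_gc = gc_content(window)
```

Here the window is the four bases immediately upstream of index 10, so the toy promoter is fully G/C while the whole toy sequence is not.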
Also, 53 patterns discovered in yeast (Supplementary Table 6) are not discovered in human. For example, the yeast pattern "minor groove mobility" level 1, "Z-DNA free energy" level 2, "CCCC/GGGG" level 0, "A/T" level 2 (rank 6) is not discovered in the human data. This suggests that different feature combinations help nucleosome occupancy prediction in different species.
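The three comparisons in this section (exactly conserved patterns, conserved feature combinations regardless of level, and yeast-specific patterns) can be sketched with set operations. This is only an illustration: representing a pattern as a frozenset of (feature, level) pairs and treating "not discovered in human" as absent from both human pattern sets are assumptions, and the three small sets contain only the example patterns quoted in the text.

```python
def features_only(pattern):
    """Drop the levels and keep only the feature names of a pattern."""
    return frozenset(feature for feature, _level in pattern)


def P(*pairs):
    """Build a pattern from (feature, level) pairs."""
    return frozenset(pairs)


conserved = P(("minor groove mobility", 1), ("Z-DNA free energy", 2),
              ("persistence length 1", 1), ("A/T", 2))
yeast_levels = P(("Z-DNA free energy", 1), ("CGCC/GGCG", 0),
                 ("CCGCC/GGCGG", 0), ("A/T", 1))
human_levels = P(("Z-DNA free energy", 2), ("CGCC/GGCG", 0),
                 ("CCGCC/GGCGG", 0), ("A/T", 2))
yeast_specific = P(("minor groove mobility", 1), ("Z-DNA free energy", 2),
                   ("CCCC/GGGG", 0), ("A/T", 2))

yeast = {conserved, yeast_levels, yeast_specific}
human_resting = {conserved, human_levels}
human_activated = {conserved, human_levels}

# Same features at the same levels in all three data sets.
exact = yeast & human_resting & human_activated

# Same feature combination, levels ignored.
by_features = ({features_only(p) for p in yeast}
               & {features_only(p) for p in human_resting}
               & {features_only(p) for p in human_activated})

# Yeast patterns that appear in neither human set.
yeast_only = yeast - human_resting - human_activated
```

With these inputs, the level-shifted "Z-DNA free energy" pattern counts as conserved only when levels are ignored, which is the distinction drawn between the 35 exact patterns and the 41 feature combinations.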
4. DISCUSSIONS AND CONCLUSIONS
Understanding the interaction between DNA sequence and nucleosome occupancy is important for uncovering gene regulatory mechanisms. Whether DNA sequence directly determines nucleosome occupancy and nucleosome positioning and, if so, to what extent is still under debate. We have developed an efficient method to study DNA features and their combinations that are useful for nucleosome occupancy prediction. Applying our method to the yeast and human T-cell data, we discovered thousands of feature combination patterns that are differentially enriched between NCSs and LCSs. These feature patterns involve both DNA structural features and sequence features and provide multiple possibilities for nucleosome formation. Comparing the feature patterns of yeast and human, we found that different patterns might prevail in different species.
One important observation is that nucleosome occupancy prediction accuracy is location-dependent: in human, the farther a region is from the TSS, the more accurate the sequence-based prediction. A related observation is that nucleosome occupancy tends to be hard to predict from the discovered feature patterns when the containing genes are transcriptionally active.
Our results also show that feature levels can be important indicators for NCS/LCS formation. In the yeast patterns, most features appear at two levels (level 1 and level 2). For example, "Z-DNA free energy" appears 39 times in the 88 patterns, 10 times at level 1 and 29 times at level 2. All patterns containing "Z-DNA free energy" at level 1 rank lower than 44, while most of the patterns (25 out of 29) containing "Z-DNA free energy" at level 2 rank among the top patterns. We also noticed that the two features "Z-DNA free energy" level 2 and "A/T" level 2 co-occur 12 times across all the patterns, and "Z-DNA free energy" level 1 co-occurs with "A/T" level 1 18 times.
Changing the frequency parameter alpha (we used alpha = 20% in this paper) changes the number of patterns discovered: in the frequent-pattern-finding step of the algorithm, a smaller alpha admits more patterns, whereas a larger alpha keeps only the patterns that occur more frequently. For example, in the yeast data, with alpha = 30% only 39 frequent patterns remain, and none of them is discriminative. The pattern with the largest D-Score is "ATAA/TTAT" level 1, "CGCC/GGCG" level 0, "CCGCC/GGCGG" level 0, with a score of 3.407, which does not meet our criteria for discriminative patterns (see 2.4). Moreover, none of these patterns contains a structural feature, and each contains at most one valid feature. This is because the k-mer frequency features, especially the 4-mer and 5-mer ones, have a high probability of being absent from a sequence, which makes these patterns more frequent than patterns with valid features. With alpha = 10%, more patterns are included: we obtain 2203 patterns (Supplementary Table 7), compared to the 88 patterns discovered with alpha = 20%. All of the previous 88 patterns are included in the newly discovered patterns, and we also discovered some new patterns, such as "minor groove mobility" level 3, "Z-DNA free energy" level 0, "persistence length 1" level 3, "A/T" level 0. This pattern contains the same feature combination as the rank-1 pattern in yeast, but with the feature values at different levels.
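The effect of alpha can be illustrated with a brute-force support filter. This is a sketch under stated assumptions, not the frequent-pattern algorithm used in this work: each segment is assumed to be a mapping from feature name to discretized level, support is assumed to be the fraction of segments containing a combination, and the function name, size limits and toy segments are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_patterns(segments, alpha, min_size=2, max_size=3):
    """Return {pattern: support} for patterns with support >= alpha.

    segments: list of dicts mapping feature name -> discretized level.
    A pattern is a tuple of (feature, level) pairs sorted by feature.
    """
    counts = Counter()
    for segment in segments:
        items = sorted(segment.items())
        for size in range(min_size, max_size + 1):
            for combo in combinations(items, size):
                counts[combo] += 1
    total = len(segments)
    return {p: c / total for p, c in counts.items() if c / total >= alpha}


# Hypothetical toy segments (levels are illustrative, not real data).
segments = (
    [{"Z-DNA free energy": 2, "A/T": 2, "CGCC/GGCG": 0}] * 3
    + [{"Z-DNA free energy": 1, "A/T": 1, "CGCC/GGCG": 0}] * 2
    + [{"Z-DNA free energy": 1, "A/T": 2, "CGCC/GGCG": 1}] * 1
    + [{"Z-DNA free energy": 3, "A/T": 0, "CGCC/GGCG": 0}] * 4
)

at_30 = frequent_patterns(segments, alpha=0.3)
at_20 = frequent_patterns(segments, alpha=0.2)
at_10 = frequent_patterns(segments, alpha=0.1)
```

Every pattern kept at a larger alpha is also kept at a smaller one, so the three result sets are nested, in the same way that the 88 patterns found at alpha = 20% are all contained in the set found at alpha = 10%.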
The discretization method also affects pattern discovery, because a different binning changes the frequency of each pattern. If we use {μ – 2σ, μ, μ + 2σ} as the three cutoffs in yeast, the discovered patterns differ, because feature values are more likely to fall into the level 1 and level 2 bins, which makes patterns with level 1 and level 2 features more frequent. With the same parameter alpha = 20%, we discovered 2447 patterns (Supplementary Table 7) using the new cutoffs, compared to the 88 patterns discovered using the cutoffs {μ – σ, μ, μ + σ}. The pattern with the highest D-Score is "minor groove mobility" level 1, "Z-DNA free energy" level 2, "persistence length 1" level 1, "slide" level 1, "major groove mobility" level 2, "propellertwist" level 1, "CCGCC/GGCGG" level 0, "A/T" level 2. This pattern is found 220920 times (20.15%) in the NCS profile and 226873 times (27.04%) in the LCS profile. It is composed mostly of structural features because the distributions of structural features conform more closely to a normal distribution than those of k-mer frequency features; "A/T" also conforms well to a normal distribution compared to the other k-mer frequency features. Of the 88 previous patterns, 82 are included in the new patterns, and for the other 6 we can find corresponding patterns that contain the same features at different levels. For example, for the previous pattern "Z-DNA free energy" level 1, "AAAA/TTTT" level 1, "A/T" level 1, we discovered a pattern with the same features but all at level 2 ("Z-DNA free energy" level 2, "AAAA/TTTT" level 2, "A/T" level 2).
We also discovered many long patterns (combinations of more than 4 features), whereas the previous method yields only patterns of length 3 and 4. These long patterns are generally composed of the frequent features mentioned in 3.2 together with some other features. For example, the new pattern "minor groove mobility" level 1, "Z-DNA free energy" level 2, "persistence length 1" level 1, "major groove mobility" level 2, "CGCC/GGCG" level 0, "A/T" level 2 contains the same features and levels as the previously discovered pattern "minor groove mobility" level 1, "Z-DNA free energy" level 2, "persistence length 1" level 1, "A/T" level 2, with two more features: "major groove mobility" level 2 and "CGCC/GGCG" level 0.
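A minimal sketch of the two discretization schemes follows. It assumes that the three cutoffs split values into levels 0 to 3, that a value equal to a cutoff falls into the upper bin, and that σ is the population standard deviation; these conventions, the function name and the toy values are assumptions made for illustration.

```python
from bisect import bisect_right
from statistics import mean, pstdev


def discretize(values, width=1.0):
    """Map each value to a level 0..3 using cutoffs
    {mu - width*sigma, mu, mu + width*sigma}."""
    mu = mean(values)
    sigma = pstdev(values)
    cutoffs = [mu - width * sigma, mu, mu + width * sigma]
    # bisect_right counts how many cutoffs are <= the value.
    return [bisect_right(cutoffs, v) for v in values]


# Hypothetical feature values, not real data.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]

levels_1sigma = discretize(values, width=1.0)
levels_2sigma = discretize(values, width=2.0)

# Share of values falling into the two middle bins (levels 1 and 2).
middle_1sigma = sum(lv in (1, 2) for lv in levels_1sigma) / len(values)
middle_2sigma = sum(lv in (1, 2) for lv in levels_2sigma) / len(values)
```

Widening the outer cutoffs from σ to 2σ moves values out of the extreme bins into levels 1 and 2, which is why patterns built from level 1 and level 2 features become more frequent under the new cutoffs.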
ACKNOWLEDGEMENTS
Funding:
REFERENCES
Gan,Y. et al. (2012) Structural features based genome-wide characterization and
prediction of nucleosome organization. BMC bioinformatics, 13, 49.
Hall,M. et al. The WEKA Data Mining Software: An Update. 11, 10–18.
Ho,P.S. et al. (1990) Polarized electronic spectra of Z-DNA single crystals. Biopolymers, 30, 151–63.
Abeel, T., Saeys, Y., Bonnet, E., Rouze, P. and Van de Peer, Y. (2008) Generic
eukaryotic core promoter prediction using structural features of DNA, Genome
research, 18, 310-323.
Albert, I., Mavrich, T.N., Tomsho, L.P., Qi, J., Zanton, S.J., Schuster, S.C. and
Pugh, B.F. (2007) Translational and rotational settings of H2A.Z nucleosomes
across the Saccharomyces cerevisiae genome, Nature, 446, 572-576.
Barski, A., Cuddapah, S., Cui, K., Roh, T.Y., Schones, D.E., Wang, Z., Wei, G.,
Chepelev, I. and Zhao, K. (2007) High-resolution profiling of histone methylations in the human genome, Cell, 129, 823-837.
Daenen, F., van Roy, F. and De Bleser, P.J. (2008) Low nucleosome occupancy is
encoded around functional human transcription factor binding sites, BMC genomics, 9, 332.
David, L., Huber, W., Granovskaia, M., Toedling, J., Palm, C.J., Bofkin, L., Jones,
T., Davis, R.W. and Steinmetz, L.M. (2006) A high-resolution map of transcription in the yeast genome, Proceedings of the National Academy of Sciences of the United States of America, 103, 5320-5325.
Field, Y., Kaplan, N., Fondufe-Mittendorf, Y., Moore, I.K., Sharon, E., Lubling,
Y., Widom, J. and Segal, E. (2008) Distinct modes of regulation by chromatin
encoded through nucleosome positioning signals, PLoS computational biology,
4, e1000216.
Friedman, J., Hastie, T. and Tibshirani, R. (2000) Additive logistic regression: a
statistical view of boosting, The Annals of Statistics, 28, 337–407.
Friedman, J.H. (2001) Greedy function approximation: A gradient boosting machine, The Annals of Statistics, 29, 1189–1232.
Gupta, S., Dennis, J., Thurman, R.E., Kingston, R., Stamatoyannopoulos, J.A. and
Noble, W.S. (2008) Predicting human nucleosome occupancy from primary
sequence, PLoS computational biology, 4, e1000134.
Ioshikhes, I.P., Albert, I., Zanton, S.J. and Pugh, B.F. (2006) Nucleosome positions
predicted through comparative genomics, Nature genetics, 38, 1210-1215.
Jiang, C. and Pugh, B.F. (2009) Nucleosome positioning and gene regulation:
advances through genomics, Nature reviews, 10, 161-172.
Kaplan, N., Moore, I.K., Fondufe-Mittendorf, Y., Gossett, A.J., Tillo, D., Field, Y.,
LeProust, E.M., Hughes, T.R., Lieb, J.D., Widom, J. and Segal, E. (2009) The
DNA-encoded nucleosome organization of a eukaryotic genome, Nature, 458,
362-366.
Kornberg, R.D. and Lorch, Y. (1999) Twenty-five years of the nucleosome, fundamental particle of the eukaryote chromosome, Cell, 98, 285-294.
Lee, W., Tillo, D., Bray, N., Morse, R.H., Davis, R.W., Hughes, T.R. and Nislow,
C. (2007) A high-resolution atlas of nucleosome occupancy in yeast, Nature
genetics, 39, 1235-1244.
Li, B., Carey, M. and Workman, J.L. (2007) The role of chromatin during transcription, Cell, 128, 707-719.
Luger, K., Mader, A.W., Richmond, R.K., Sargent, D.F. and Richmond, T.J.
(1997) Crystal structure of the nucleosome core particle at 2.8 Å resolution,
Nature, 389, 251-260.
Ozsolak, F., Song, J.S., Liu, X.S. and Fisher, D.E. (2007) High-throughput mapping of the chromatin structure of human promoters, Nature biotechnology, 25,
244-248.
Peckham, H.E., Thurman, R.E., Fu, Y., Stamatoyannopoulos, J.A., Noble, W.S.,
Struhl, K. and Weng, Z. (2007) Nucleosome positioning signals in genomic
DNA, Genome research, 17, 1170-1177.
Reynolds, S.M., Bilmes, J.A. and Noble, W.S. (2010) Learning a weighted sequence model of the nucleosome core and linker yields more accurate predictions in Saccharomyces cerevisiae and Homo sapiens, PLoS computational biology, 6, e1000834.
Schones, D.E., Cui, K., Cuddapah, S., Roh, T.Y., Barski, A., Wang, Z., Wei, G.
and Zhao, K. (2008) Dynamic regulation of nucleosome positioning in the human genome, Cell, 132, 887-898.
Segal, E., Fondufe-Mittendorf, Y., Chen, L., Thastrom, A., Field, Y., Moore, I.K.,
Wang, J.P. and Widom, J. (2006) A genomic code for nucleosome positioning,
Nature, 442, 772-778.
Sekinger, E.A., Moqtaderi, Z. and Struhl, K. (2005) Intrinsic histone-DNA interactions and low nucleosome density are important for preferential accessibility of
promoter regions in yeast, Molecular cell, 18, 735-748.
Shendure, J. and Ji, H. (2008) Next-generation DNA sequencing, Nature biotechnology, 26, 1135-1145.
Tillo, D., Kaplan, N., Moore, I.K., Fondufe-Mittendorf, Y., Gossett, A.J., Field, Y.,
Lieb, J.D., Widom, J., Segal, E. and Hughes, T.R. (2010) High nucleosome occupancy is encoded at human regulatory sequences, PloS one, 5, e9129.
Valouev, A., Ichikawa, J., Tonthat, T., Stuart, J., Ranade, S., Peckham, H., Zeng,
K., Malek, J.A., Costa, G., McKernan, K., Sidow, A., Fire, A. and Johnson,
S.M. (2008) A high-resolution, nucleosome position map of C. elegans reveals
a lack of universal sequence-dictated positioning, Genome research, 18, 1051-1063.
Wang, Z., Zang, C., Rosenfeld, J.A., Schones, D.E., Barski, A., Cuddapah, S., Cui,
K., Roh, T.Y., Peng, W., Zhang, M.Q. and Zhao, K. (2008) Combinatorial patterns of histone acetylations and methylations in the human genome, Nature
genetics, 40, 897-903.
Yuan, G.C., Liu, Y.J., Dion, M.F., Slack, M.D., Wu, L.F., Altschuler, S.J. and
Rando, O.J. (2005) Genome-scale identification of nucleosome positions in S.
cerevisiae, Science, 309, 626-630.
Zhang, Y., Moqtaderi, Z., Rattner, B.P., Euskirchen, G., Snyder, M., Kadonaga,
J.T., Liu, X.S. and Struhl, K. (2009) Intrinsic histone-DNA interactions are not
the major determinant of nucleosome positions in vivo, Nature structural &
molecular biology, 16, 847-852.
Zhang, Y., Shin, H., Song, J.S., Lei, Y. and Liu, X.S. (2008) Identifying positioned
nucleosomes with epigenetic marks in human from ChIP-Seq, BMC genomics,
9, 537.
Zhu, Z. and Thiele, D.J. (1996) A specialized nucleosome modulates transcription
factor access to a C. glabrata metal responsive promoter, Cell, 87, 459-470.