nucleoFeature_0128Yiyu

Feature pattern discovery to predict nucleosome occupancy in
yeast and human
Co-responding Author1,*, Co-author2 and Co-Author2
1 Department of XXXXXXX, Address XXXX etc.
2 Department of XXXXXXX, Address XXXX etc.
Received on XXXXX; revised on XXXXX; accepted on XXXXX
Associate Editor: XXXXXXX
ABSTRACT
Motivation: Various sequence-based and structural features
have been reported to be predictive of nucleosome positioning.
Which features are important, and whether the important features
are the same in different species (human and yeast), is not yet clear.
Results: By dividing the genome into different regions (gene region,
promoter region, intergenic region) and by transcription level
(highly expressed, lowly expressed), we found that different feature
subsets are predictive for different genomic regions or for genes with
different transcriptional activities. Some of these feature subsets
are conserved between yeast and human. We also identified
different patterns of nucleosome positioning that are not explainable
by local sequence-based features. We show that these patterns are
likely to be functional results of different chromatin remodelers,
and that including linker length together with effective feature
subsets improves nucleosome positioning prediction.
Availability:
Contact:
1. INTRODUCTION
Eukaryotic genomic DNA is packaged with histones and other
proteins into a chromatin complex. The basic unit of DNA packaging
is the nucleosome, which comprises a histone octamer with a
~147 bp DNA segment wrapped around it [1, 2]. The DNA
segments wrapped around nucleosomes are often called nucleosomal
DNA, and those between two adjacent nucleosomes are
often called linker DNA. Linker DNA has variable
length, often ranging from 20 to 80 bp. The amount of histones
occupying a unit length of DNA in a population of cells
is called the nucleosome occupancy level of that DNA segment.
It is well known that nucleosome occupancy is not uniformly
distributed across the genome and that different nucleosome
occupancies in different DNA regions can have important
functional meanings [3, 4]. For example, highly occupied DNA
regions are often difficult for regulatory proteins such as
transcription factors to access, while nucleosome-depleted
regions are more accessible to DNA-binding proteins functioning
in gene regulation [5, 6]. Nucleosomes have also been reported
to play important roles in epigenetic gene regulation [7, 8].
Therefore, understanding what causes nucleosomes to form on
different genomic regions is important for understanding gene
regulation and uncovering cellular mechanisms.
Experimental methods to measure nucleosome occupancy
have been advanced by recent next-generation sequencing
technology [9], especially the parallel sequencing of nucleosomal
DNA [10-15]. A number of genome-scale nucleosome occupancy
datasets have become publicly available for multiple species,
including human and yeast [16-19]. Based on these datasets,
dozens of studies have reported that particular types of DNA
sequences tend to have higher nucleosome occupancy levels in vivo
and that there is potentially a genomic code for nucleosome-forming
sequences [20]. Accordingly, a number of computational methods
have recently been developed to model and predict nucleosome
occupancy based on the underlying features of the primary DNA
sequence [10, 18, 20-23]. For example, Peckham et al. incorporated
variable-length k-mers into an SVM model to predict nucleosomal
DNA in yeast. Lee et al. applied a lasso regression model to a set of
DNA sequence and structural features for nucleosomal DNA
prediction. Reynolds et al. combined the nucleosomal DNA and
the adjacent linker DNA sequences to characterize the nucleosome.
The prediction accuracy is in general not sufficiently high, which
has led to the common understanding that factors other than
primary sequence may also influence nucleosome formation;
nevertheless, these methods have identified a number of sequence
features that do contribute to nucleosome occupancy prediction.
The selected features, however, are not always consistent with
each other. For example, several studies have confirmed the ~10 bp
periodicity of certain dinucleotides as predictive of nucleosome
formation [10, 13, 20, 24], whereas Lee et al. identified the
structural features “Tip”, “Tilt” and “Propeller twist” as the most
effective features for nucleosome occupancy prediction in yeast.
The diversity of effective features identified in nucleosome-prediction
studies leads to our hypothesis that multiple DNA features can
simultaneously or combinatorially influence nucleosome occupancy.
*To whom correspondence should be addressed.
© Oxford University Press 2005
Yiyu Zheng et al.
In this paper, we develop a computational method, FFN (Finding
Features for Nucleosomes), to identify features and feature
combinations that are important for nucleosome occupancy
prediction. By applying FFN to genome-wide nucleosome occupancy
data in yeast [18] and human [16], we found that a number of
different features, when combined, have high predictive power for
nucleosome-forming or nucleosome-depleted sequences. The
predictive power of combined features can be affected by factors
beyond the static DNA sequence, such as gene transcriptional
activity. We also show that certain structural features frequently
appear in feature patterns in nucleosome-forming sequences.
2. METHODS
2.1 Data source
Nucleosome positioning data for yeast were obtained from Lee et al., and the
data for human were obtained from Schones et al. for both resting and activated
human T cells [16, 18]. The Laplacian of Gaussian (LoG) method [25, 26] was
then applied to the raw data to identify enriched regions. The parameters were
chosen for best consistency between the LoG results and the HMM
results [18]. These enriched regions are defined as nucleosome-containing
sequences (NCSs), and the sequences between two NCSs are
defined as linker-containing sequences (LCSs).
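The enriched-region step can be illustrated with a minimal one-dimensional sketch. It is not the authors' implementation: the kernel truncation at 4σ, the zero-sum shift, the `threshold` default and the function names `log_kernel` and `enriched_regions` are all assumptions made for illustration, and the real parameters were tuned against the HMM results as described above.

```python
from math import exp

def log_kernel(sigma):
    """Discrete 1-D Laplacian-of-Gaussian kernel, truncated at 4*sigma (assumed)."""
    radius = int(4 * sigma)
    kernel = []
    for x in range(-radius, radius + 1):
        gauss = exp(-(x * x) / (2.0 * sigma * sigma))
        kernel.append((x * x / sigma**4 - 1.0 / sigma**2) * gauss)
    # Shift to zero sum so that a flat background gives (near) zero response.
    mean = sum(kernel) / len(kernel)
    return [k - mean for k in kernel]

def enriched_regions(signal, sigma, threshold=1e-6):
    """Return (start, end) index pairs where the negated LoG response
    exceeds the (hypothetical) threshold."""
    kernel = log_kernel(sigma)
    radius = len(kernel) // 2
    n = len(signal)
    regions, start = [], None
    for i in range(n):
        acc = 0.0
        for j, k in enumerate(kernel):
            idx = min(max(i + j - radius, 0), n - 1)  # clamp at the ends
            acc += k * signal[idx]
        # A bump in the signal gives a negative LoG response, so negate it.
        peak_like = -acc > threshold
        if peak_like and start is None:
            start = i
        elif not peak_like and start is not None:
            regions.append((start, i - 1))
            start = None
    if start is not None:
        regions.append((start, n - 1))
    return regions

# Toy occupancy profile: one 20 bp bump on a zero background.
signal = [0.0] * 50 + [10.0] * 20 + [0.0] * 50
regions = enriched_regions(signal, sigma=10)
```

In this toy profile the single detected region covers the bump; the gaps between consecutive detected regions would then play the role of LCSs.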
For a given gene and its promoter region (1,000 bp upstream of the
transcription start site (TSS)), we define a nucleosome as the 0th nucleosome
if it overlaps the TSS, the nucleosomes upstream of the TSS as the -1st,
-2nd nucleosomes and so on, and the nucleosomes downstream of the TSS
as the 1st, 2nd nucleosomes and so on.
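The indexing convention above can be sketched as follows. The function name `index_nucleosomes`, the representation of nucleosomes as `(start, end)` intervals and the handling of the minus strand are assumptions for illustration only.

```python
def index_nucleosomes(nucleosomes, tss, strand="+"):
    """Label nucleosome intervals relative to a TSS.

    nucleosomes: list of (start, end) genomic intervals (assumed format).
    Returns a dict mapping each interval to 0, -1, -2, ... or 1, 2, ...
    """
    labels, upstream, downstream = {}, [], []
    for start, end in nucleosomes:
        if start <= tss <= end:
            labels[(start, end)] = 0  # overlaps the TSS
        elif (end < tss) == (strand == "+"):
            upstream.append((start, end))
        else:
            downstream.append((start, end))

    def distance(interval):
        # Distance from the TSS to the nearest edge of the interval.
        return min(abs(interval[0] - tss), abs(interval[1] - tss))

    for rank, interval in enumerate(sorted(upstream, key=distance), start=1):
        labels[interval] = -rank
    for rank, interval in enumerate(sorted(downstream, key=distance), start=1):
        labels[interval] = rank
    return labels

# Toy coordinates, not taken from the data.
nucs = [(100, 246), (300, 446), (480, 626), (700, 846)]
plus_labels = index_nucleosomes(nucs, tss=500)
minus_labels = index_nucleosomes(nucs, tss=500, strand="-")
```

When no interval overlaps the TSS, no 0th nucleosome is assigned, which matches the later distinction between promoters with and without a 0th nucleosome.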
The gene expression data for yeast were obtained from David et al. [27].
The 5736 annotated genes with gene expression measurements were kept for
further identification of high-confidence transcripts. The yeast gene
annotation is based on the annotation resource at UCSC. High-confidence
transcripts are defined as those transcript segments that overlap by more than
50% with a non-dubious annotated coding region at the 5’ end [18]. In total,
5300 of the 5736 genes are defined as high-confidence transcripts.
The gene expression data for human were obtained from the GEO database
(GSE10437), which measures whole-genome gene expression under two
resting and two activated T-cell conditions [16].
A total of 19049 genes with gene expression measurements were obtained using
gene annotation resources at UCSC (human hg18 build). Genes that are
absent under both resting conditions and present under both
activated conditions are defined as induced genes. Similarly, genes that
are present under both resting conditions and absent under both
activated conditions are defined as repressed genes [16]. In total, 299
genes are defined as induced and 393 genes as repressed.
We refer to the induced and repressed genes together as disturbed
genes.
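The induced/repressed definition can be written as a small rule. This is only a sketch: the function name `classify_gene`, the boolean present/absent encoding and the label "other" for genes that are neither induced nor repressed are assumptions.

```python
def classify_gene(resting_present, activated_present):
    """Classify a gene from present/absent calls (True = present).

    resting_present, activated_present: one boolean per replicate condition
    (two resting and two activated conditions in the data described above).
    """
    if not any(resting_present) and all(activated_present):
        return "induced"    # absent in both resting, present in both activated
    if all(resting_present) and not any(activated_present):
        return "repressed"  # present in both resting, absent in both activated
    return "other"          # hypothetical label for all remaining genes

induced = classify_gene([False, False], [True, True])
repressed = classify_gene([True, True], [False, False])
mixed = classify_gene([True, False], [True, True])
```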
2.2 Feature compilation and sequence representation
Features that are relevant to nucleosome occupancy were compiled from
the literature. These features fall into two classes:
(1) sequence features such as DNA k-mer frequency [21], poly(A) tracts,
transcription factor binding sites and sequence repeats [18]; (2) DNA
structural features from Lee et al. and Abeel et al. [18, 28]. The structural
feature values are computed from conversion tables (Supplementary
Table 1). Structural features that are highly correlated with
other structural features (absolute Pearson correlation greater than 0.9)
were removed, leaving 23 structural features (Supplementary Table 1).
Finally, 766 features, comprising 694 k-mer frequency features, 4 poly
tracts, 40 poly(dA/dT) tracts, 2 sequence repeat features, 23 structural
features and 3 motifs (yeast only), were kept
for further analysis (Supplementary Table 2).
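A minimal sketch of the correlation filter is given below. The text does not say which member of a highly correlated pair is removed, so the greedy keep-first rule here is an assumption, as are the function names and the toy values (which are not taken from the conversion tables).

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)  # undefined for constant features

def drop_correlated(features, threshold=0.9):
    """Greedily keep features whose |r| with every kept feature is <= threshold.

    features: dict mapping feature name -> list of values (assumed layout).
    """
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Toy values only, for illustration.
toy = {
    "tilt": [1.0, 2.0, 3.0, 4.0],
    "tilt_scaled": [2.0, 4.0, 6.0, 8.1],  # almost perfectly correlated with "tilt"
    "slide": [4.0, 1.0, 3.0, 2.0],
}
kept = drop_correlated(toy)
```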
We then applied LogitBoost [29] to further select the most relevant
features. LogitBoost is a boosting algorithm for classification that uses
logistic regression as the cost function and assigns weights to the features
selected in the model, so that the impact of each feature on the
model can be estimated [30]. We used the implementation of LogitBoost in the
software WEKA (Hall et al., 2009). For each of the yeast, human resting T-cell
and human activated T-cell datasets, we chose the 1000 nucleosomes with the
highest profile scores and the 1000 linker sequences with the lowest scores as
the training set. We then ran 100 iterations of LogitBoost using all the
features. We kept the top 10 features in each of the yeast, human resting and
human activated T-cell datasets, as they generally account for 60% of the weight
of all the features selected (Supplementary Table 3). We then pooled the
top 10 features selected in each dataset, and also included the top 10 features
from [21] and three structural features from [18]. Finally, after removing
duplicated features, we retained 30 features for consideration (Table
1).
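The "top 10 features account for about 60% of the weight" criterion can be checked with a few lines of code. This sketch assumes the per-feature weights have already been exported from the boosting model as a name-to-weight mapping; the function name `top_k_weight_share` and the toy weights are hypothetical and do not call WEKA.

```python
def top_k_weight_share(weights, k=10):
    """Return the k highest-weight features and their share of the total weight.

    weights: dict mapping feature name -> weight assigned by the model
    (assumed to be exported beforehand).
    """
    ranked = sorted(weights.items(), key=lambda item: abs(item[1]), reverse=True)
    total = sum(abs(w) for _, w in ranked)
    top = ranked[:k]
    share = sum(abs(w) for _, w in top) / total
    return [name for name, _ in top], share

# Toy weights for illustration only.
toy_weights = {"f1": 6.0, "f2": 3.0, "f3": 1.0}
top_names, share = top_k_weight_share(toy_weights, k=1)
```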
We then used the selected features to represent the NCSs and LCSs as
follows. We computed all the feature values for every 147 bp-long
subsequence in all the NCSs/LCSs. As the feature values generally conform
to a normal distribution, we discretized each feature into m levels (m=4)
using the cutoffs μ – (m/2-1)σ, …, μ, …, μ + (m/2-1)σ, i.e. μ – σ, μ and
μ + σ for m=4.
With the discretized feature values, every 147 bp subsequence in an
NCS/LCS was replaced by the combination of its discretized feature
values, called its feature profile.
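The discretization can be sketched as below, assuming levels are numbered 0 to m-1 (levels 0 to 3 appear in the reported patterns) and that a value equal to a cutoff falls into the higher level; both conventions, and the function names, are assumptions.

```python
from bisect import bisect_right
from statistics import mean, pstdev

def level_cutoffs(values, m=4):
    """Cutoffs mu - (m/2-1)*sigma, ..., mu, ..., mu + (m/2-1)*sigma."""
    mu = mean(values)
    sigma = pstdev(values)
    half = m // 2
    return [mu + step * sigma for step in range(-(half - 1), half)]

def discretize(value, cutoffs):
    """Map a feature value to a level in 0..len(cutoffs)."""
    return bisect_right(cutoffs, value)

# Toy feature values for five windows.
values = [1.0, 2.0, 3.0, 4.0, 5.0]
cutoffs = level_cutoffs(values)          # three cutoffs for m = 4
levels = [discretize(v, cutoffs) for v in values]
```

A feature profile for one 147 bp window would then be the collection of (feature, level) pairs obtained by applying this to every selected feature.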
Table 1. Top 30 features
No.  Feature                 No.  Feature         No.  Feature
1    Tip                     11   AAT/ATT         21   GAC/GTC
2    Minor groove mobility   12   ATTA/TAAT       22   CCCC/GGGG
3    Tilt                    13   TAA/TTA         23   ACAC/GTGT
4    Z-DNA free energy       14   TAATA/TATTA     24   AATTA/TAATT
5    Persistence length 1    15   AAAA/TTTT       25   ATAT
6    Slide                   16   CGCC/GGCG       26   A/T
7    Major groove mobility   17   AAG/CTT         27   TA
8    Propeller twist         18   ACA/TGT         28   AAA/TTT
9    ATA/TAT                 19   AAATA/TATTT     29   AT
10   ATAA/TTAT               20   CCGCC/GGCGG     30   AATA/TATT
30 features used in pattern discovery
2.3 Scoring of a potential feature pattern
To determine the discriminating power of a potential feature pattern, we
define the z-score of the pattern in nucleosome-forming sequences as Zn
and its z-score in linker-forming sequences as Zl. The discriminative
score is then D-Score = Zn – Zl. For a pattern a1b2c3, the z-score is

$$ \text{z-score} = \frac{\mathrm{profileCount}(a_1 b_2 c_3) - E(a_1 b_2 c_3)}{\sqrt{E\big((a_1 b_2 c_3)^2\big) - \big(E(a_1 b_2 c_3)\big)^2}} \qquad (1) $$

The expectation of pattern a1b2c3 is calculated using the following
formulas:

$$ E(a_1 b_2 c_3) = \pi_{a_1}\,\pi_{b_2}\,\pi_{c_3} + \sum_{r=2}^{h_i} \Big(\sum_{x=1}^{k} \pi_{a_x} (T_a^{\,r-1})_{a_x a_1}\Big) \Big(\sum_{x'=1}^{k} \pi_{b_{x'}} (T_b^{\,r-1})_{b_{x'} b_2}\Big) \Big(\sum_{x''=1}^{k} \pi_{c_{x''}} (T_c^{\,r-1})_{c_{x''} c_3}\Big) \qquad (2) $$

$$ E\big((a_1 b_2 c_3)^2\big) = E(a_1 b_2 c_3) + 2 \sum_{r=1}^{h_i - 1} \sum_{j=r+1}^{h_i} \Big(\sum_{x=1}^{k} \pi_{a_x} (T_a^{\,r-1})_{a_x a_1} (T_a^{\,j-r})_{a_1 a_1}\Big) \Big(\sum_{x'=1}^{k} \pi_{b_{x'}} (T_b^{\,r-1})_{b_{x'} b_2} (T_b^{\,j-r})_{b_2 b_2}\Big) \Big(\sum_{x''=1}^{k} \pi_{c_{x''}} (T_c^{\,r-1})_{c_{x''} c_3} (T_c^{\,j-r})_{c_3 c_3}\Big) \qquad (3) $$

Here T_a is the transition matrix of the Markov chain that models feature a
transitioning between its categories across all the windows in the
m sequences, k is the number of categories of a feature, and π_a is
calculated as follows:

$$ \pi_a \, T_a = \pi_a \qquad (4) $$

where π_{a_x} lies in [0, 1] for all x and satisfies

$$ \sum_{x=1}^{k} \pi_{a_x} = 1 \qquad (5) $$

In the end we keep the patterns with a D-Score above 3√2 as discriminative
patterns. As the ranges of the D-Score differ between species,
in each situation we normalize the D-Score of each pattern by dividing
the original D-Score by the highest D-Score of all patterns, so that the
score lies in the range [0, 1].
3. EXPERIMENTS AND RESULTS
To identify potential features that combinatorially characterize
nucleosome/linker-containing sequences, we developed the FFN
algorithm (Algorithm 1). We then applied
FFN to discover features and feature combinations in the yeast and
human T-cell data.
3.1 The FFN algorithm
The FFN algorithm aims to discover combinations of features
that are relevant to NCS/LCS formation and can be used to distinguish
NCSs from LCSs. Given all of the 147 bp-long subsequences in the
obtained NCSs and LCSs, FFN starts by enumerating all the possible
two-feature combinations, called di-feature patterns. Only those
di-feature patterns whose occurrence is larger than α percent of the
total number of NCS subsequences (e.g. α=20) are kept. The algorithm
then searches for frequently co-occurring di-feature patterns using
frequent pattern mining techniques (cite FIM). Next, FFN performs an
extension step, in which it investigates whether some of these di-feature
patterns in the same cluster can be further extended into tri-feature
or even longer patterns. This extension step is implemented by a pattern
merging procedure. For example, given four
di-feature patterns a1b1, a1c3, b1c3 and c3d4 in one cluster,
meaning they are frequently co-occurring in the input sequences,
we observe that they can in fact be merged into three tri-feature
patterns a1b1c3, b1c3d4, a1c3d4 and one tetra-feature pattern
a1b1c3d4. Note that the occurrence frequencies of these merged
patterns may vary, but should all be greater than α% of the total
NCS subsequences. After the pattern extension step, FFN evaluates
the statistical significance of the patterns’ enrichment in the
NCS and LCS subsequences using z-scores (see Methods section).
The D-score for a given pattern is then computed to determine
whether it has sufficient discriminative power to distinguish
NCSs from LCSs. In this way, we can identify NCS-specific
feature patterns that are frequently found in NCSs but
less frequently found in LCSs. Similarly, we can start FFN from
LCS data to identify LCS-specific feature patterns.
Algorithm 1. FFN algorithm
Input: NCS profile, LCS profile
Output: a set of discriminative patterns
Ck: candidate pattern set of length k
Lk: frequent pattern set of length k
Initial discovered pattern set: R is Ø
Start from length 1 pattern set L1 : {all features}
While Lk is not empty
    Determine candidate pattern set Ck+1 by merging patterns in Lk
    For each profile item p in Input
        For each candidate pattern c in Ck+1
            If p contains pattern c
                Increment support(c)
    Generate Lk+1 with all candidate patterns in Ck+1 with support > alpha
For each pattern p in each Lk
    Calculate Zn: zScore(p) in NCS profiles based on formulas (1-5)
    Calculate Zl: zScore(p) in LCS profiles
    D-Score = Zn - Zl
    If D-Score > cutoff (3√2)
        Put p in the result set R
Return discovered pattern set R.
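The level-wise search in Algorithm 1 can be sketched in a few lines of Python. This is an Apriori-style illustration under stated assumptions, not the authors' code: profiles are modelled as sets of (feature, level) pairs, candidates are formed by joining two frequent patterns that differ in one item, and the z-score functions `z_ncs` and `z_lcs` are hypothetical callables supplied by the caller (the Markov-chain z-score of formulas (1-5) is not implemented here).

```python
from itertools import combinations
from math import sqrt

def support(pattern, profiles):
    """Number of profiles that contain every item of the pattern."""
    return sum(1 for profile in profiles if pattern <= profile)

def mine_frequent_patterns(profiles, alpha):
    """Level-wise search for patterns present in more than alpha of the profiles."""
    min_count = alpha * len(profiles)
    items = {item for profile in profiles for item in profile}
    level = {frozenset([item]) for item in items}
    level = {c for c in level if support(c, profiles) > min_count}
    frequent = []
    while level:
        frequent.extend(level)
        size = len(next(iter(level))) + 1
        # Merge pairs of frequent patterns into candidates one item longer.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == size}
        level = {c for c in candidates if support(c, profiles) > min_count}
    return frequent

def discriminative_patterns(frequent, z_ncs, z_lcs, cutoff=3 * sqrt(2)):
    """Keep patterns whose D-score = Zn - Zl exceeds the cutoff."""
    result = {}
    for pattern in frequent:
        d_score = z_ncs(pattern) - z_lcs(pattern)
        if d_score > cutoff:
            result[pattern] = d_score
    return result

# Toy NCS profiles: each is a set of (feature, level) pairs.
profiles = [
    frozenset({("A/T", 2), ("Tip", 1), ("TA", 0)}),
    frozenset({("A/T", 2), ("Tip", 1), ("TA", 1)}),
    frozenset({("A/T", 2), ("Tip", 1), ("TA", 0)}),
    frozenset({("A/T", 1), ("Tip", 2), ("TA", 0)}),
]
frequent = mine_frequent_patterns(profiles, alpha=0.5)
# Placeholder z-scores, purely to exercise the D-score filter.
selected = discriminative_patterns(frequent, lambda p: 5.0 * len(p), lambda p: 2.0)
```

Because each profile holds exactly one level per feature, candidates that combine two levels of the same feature have zero support and are discarded automatically.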
3.2 Feature patterns identified in yeast data
Applying the FFN algorithm to the yeast nucleosome data, we
identified 88 NCS-specific patterns with D-score larger than 3√2
(Table 2, Supplementary Table 4). For example, “Minor groove
mobility” level 1, “Z-DNA free energy” level 2, “Persistence
length 1” level 1 and “A/T” level 2 form the pattern with the highest
D-score, 80.56. This pattern occurs 228845 times in NCSs with a
z-score of 526.30 and 162726 times in LCSs with a z-score of 445.74.
A length-3 subpattern of this pattern, containing “Z-DNA free
energy” level 2, “Persistence length 1” level 1 and “A/T” level 2,
is also identified as a discriminative pattern with D-score 57.04.
We also identified several patterns with the same feature combination
but different feature levels, for example, “Tip” level
2, “CCGCC/GGCGG” level 0, “TA” level 2 (D-score=35.09)
and “Tip” level 1, “CCGCC/GGCGG” level 0, “TA” level 1
(D-score=13.23). These patterns suggest that different
combinations of features at various levels can be considered for
NCS prediction.
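As a quick arithmetic check of the figures above, the D-score of the top pattern follows directly from its two z-scores, and the normalization from Section 2.3 divides every D-score by the largest one. The helper names below are illustrative only.

```python
from math import sqrt

CUTOFF = 3 * sqrt(2)  # about 4.243

def d_score(z_ncs, z_lcs):
    """D-score = Zn - Zl."""
    return z_ncs - z_lcs

def normalize(d_scores):
    """Divide every D-score by the highest one so the maximum becomes 1."""
    top = max(d_scores)
    return [d / top for d in d_scores]

# z-scores of the top yeast pattern reported in the text.
top_d = d_score(526.30, 445.74)           # 80.56 up to floating-point rounding
# The four yeast D-scores quoted in this section.
yeast_examples = [80.56, 57.04, 35.09, 13.23]
normalized = normalize(yeast_examples)
```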
Table 2. Top 10 Yeast Patterns Identified
Rank  Pattern
1     "Minor groove mobility" level 1, "Z-DNA free energy" level 2, "Persistence length 1" level 1, "A/T" level 2
2     "Z-DNA free energy" level 2, "Persistence length 1" level 1, "CCCC/GGGG" level 0, "A/T" level 2
3     "Z-DNA free energy" level 2, "Persistence length 1" level 1, "CCGCC/GGCGG" level 0, "A/T" level 2
4     "Z-DNA free energy" level 2, "Persistence length 1" level 1, "A/T" level 2
5     "Minor groove mobility" level 1, "Z-DNA free energy" level 2, "CCCC/GGGG" level 0, "A/T" level 2
6     "Minor groove mobility" level 1, "Persistence length 1" level 1, "CCGCC/GGCGG" level 0, "A/T" level 2
7     "Minor groove mobility" level 1, "Z-DNA free energy" level 2, "Persistence length 1" level 1, "CCGCC/GGCGG" level 0
8     "Minor groove mobility" level 1, "Z-DNA free energy" level 2, "CCGCC/GGCGG" level 0, "A/T" level 2
9     "Z-DNA free energy" level 2, "CGCC/GGCG" level 0, "CCGCC/GGCGG" level 0, "A/T" level 2
10    "Minor groove mobility" level 1, "Persistence length 1" level 1, "A/T" level 2
By applying the FFN algorithm we identified 88 NCS-specific patterns (NLPs). Three
structural features frequently occur in the top 10 patterns.
We also compared the occurrences of features in NCS-specific
patterns and LCS-specific patterns and found different
feature preferences. Of the 30 features, 17 occur in
NCS-specific patterns and 11 occur in LCS-specific patterns;
13 features appear exclusively in NCS-specific patterns and
7 features only in LCS-specific patterns.
Among the 13 NCS-exclusive features, “A/T” is found in 40 of
the 88 NCS-specific patterns but never in LCS-specific patterns.
In particular, “A/T” at level 2 frequently appears in the top-scoring
feature patterns (9 of the top 10 patterns contain this feature, Table 2).
This feature was identified as the most useful feature for
distinguishing NCSs from LCSs by Peckham et al. It frequently
forms patterns with structural features such as “Minor groove
mobility” (11 times), “Z-DNA free energy” (12 times) and “Persistence
length 1” (11 times). All top 20 patterns contain at least one
of these four features. This observation is consistent with the
finding that structural features help nucleosome occupancy
prediction in yeast [18]. “Z-DNA free energy”, which is related to
the free energy required for the transition from B-DNA to
Z-DNA (Ho et al., 1990), appears 39 times
(10 times at level 1 and 29 times at level 2). This feature is one
of the structural features most negatively correlated
with nucleosome occupancy (Gan et al., 2012). It usually co-occurs
with at least one of the other three features mentioned
above in the discovered patterns (33 out of 39).
3.3 The distribution of patterns in yeast promoters
Since NCS formation in promoter regions can affect transcriptional
activity, we investigated the relationship between the
distribution of the identified feature patterns and NCSs in the
yeast promoter regions. For each NCS/LCS in the promoter
regions, we assigned a pattern score equal to the largest D-score
among the feature patterns exhibited by the subsequences it contains.
We found that, in general, the pattern scores are on average
higher in NCSs/LCSs closer to TSSs than in those
farther away from TSSs (Fig. 1-A). For the -1st and the 0th NCSs
and the neighboring LCSs, the average score of the NCSs is slightly
higher than that of the LCSs.
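The pattern score of a region, as defined above, is the maximum D-score over all patterns matched by any of its 147 bp windows. A minimal sketch, assuming each window is a set of (feature, level) pairs and that a region matching no pattern scores 0.0 (an assumption, since the text does not state this case):

```python
def pattern_score(region_profiles, scored_patterns):
    """Largest D-score among patterns exhibited by any window of a region.

    region_profiles: list of feature profiles (sets of (feature, level) pairs),
        one per 147 bp subsequence of the NCS/LCS.
    scored_patterns: dict mapping pattern (frozenset of pairs) -> D-score.
    """
    best = 0.0  # assumed score when no discriminative pattern matches
    for profile in region_profiles:
        for pattern, score in scored_patterns.items():
            if pattern <= profile and score > best:
                best = score
    return best

# Toy patterns with normalized D-scores, for illustration only.
scored = {
    frozenset({("A/T", 2), ("Tip", 1)}): 0.9,
    frozenset({("TA", 1)}): 0.3,
}
region = [
    frozenset({("A/T", 2), ("Tip", 2), ("TA", 1)}),
    frozenset({("A/T", 2), ("Tip", 1), ("TA", 0)}),
]
score = pattern_score(region, scored)
```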
To investigate whether the distribution of feature patterns is
influenced by genes’ transcriptional activities, we further divided
genes by expression level. We examined the
pattern scores of the NCSs/LCSs in the promoters of the 1000
most highly expressed genes (expression level between 2.79 and
4.56) and the 1000 most lowly expressed genes (expression level
between 0.0040 and 0.956). We found that at every nucleosomal
location, the score is on average larger for the NCSs in the
lowly expressed genes than for those in the highly expressed genes (Fig.
1-B). For example, for the lowly expressed gene “YOR258W”
(Fig. 1-C), nearly all the NCSs in its promoter region contain
top-scoring patterns while the neighboring LCSs contain low-scoring
patterns. The lowly expressed gene “YMR126C”
(Fig. 1-D) shows the same trend. On the other hand, highly expressed
genes such as “YDR002W” often do not contain many
high-scoring patterns in the promoter region (Fig. 1-E). These
observations imply that nucleosomes in promoters tend to be less
sequence-predictable when there is significant gene transcriptional
activity [18].
We found that the feature combinations in NCSs are not specific
to individual genes or promoters, since nucleosomes in the
same promoter region often contain different feature combinations.
For example, the promoter region of gene YDR002W contains 4
nucleosomes with different patterns (Fig. 1-E).
Fig. 1. Distribution of patterns in yeast. (A) Average pattern score for
nucleosomes and linkers around the TSS (0th, 1st, …). The scores for the
-2nd, -1st, 0th, 1st and 2nd NCSs are 0.458, 0.515, 0.524, 0.429 and 0.444,
respectively; scores for NCSs/LCSs near the TSS are higher. (B) Average
pattern score for nucleosomes in the top 1000 highly expressed genes and the
top 1000 lowly expressed genes. (C-E) Pattern score distribution for the
promoter regions of genes YOR258W (C), YMR126C (D) and YDR002W
(E).
3.4 Identified feature and feature patterns in human T-cell
Applying the FFN algorithm to the human resting and activated
T-cell data separately, we identified 2328 NCS-specific patterns
in the resting data and 589 NCS-specific patterns in
the activated data (Supplementary Table 5). For example,
"minor groove mobility" level 1, "Z-DNA
free energy" level 2, "persistence length 1" level 1 and "A/T" level 2
form a pattern (D-score=302.71) in human T cells that was
also identified in yeast. The pattern "Tip" level 1,
"AATTA/TAATT" level 0, "TA" level 1 (D-score=39.9) is another
pattern identified in both human and yeast.
There are 576 patterns conserved between the resting and activated
states (out of the 589 patterns in the activated state). For example,
the pattern comprising "minor groove mobility" level 1, "Z-DNA free
energy" level 2, "persistence length 1" level 1, "major groove mobility"
level 2, "CCGCC/GGCGG" level 0 and "A/T" level 2 is identified in both
conditions with a high D-score. The number of conserved patterns becomes
580 if differences in feature levels are not considered. Almost all of the
top 100 patterns contain the feature “A/T”, which
frequently appears together with features such as "minor
groove mobility", "Z-DNA free energy", "persistence length 1"
and "major groove mobility". One example of a difference between
the two conditions is the pattern comprising "minor groove mobility"
level 2, "Z-DNA free energy"
level 1, "persistence length 1" level 2 and "A/T" level 1 (rank 6,
z-score 538), which is discovered in the activated state only; in the
resting state the z-score of this pattern is -6.24. Comparing the
features found in human resting and activated T cells, we found that the
patterns contain nearly all features except AAG/CTT, GAC/GTC and
ACAC/GTGT. The four features discussed for yeast also
frequently co-occur in top-ranked patterns, and another feature,
“AAA/TTT”, appears in 647 of the 2328 NCS-specific patterns in the
resting state and 183 of the 589
patterns in the activated state. This feature was identified by
[21] as a significant 3-mer feature for distinguishing nucleosomes from linkers.
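The two ways of counting conserved patterns (exact match versus match on features only, 576 versus 580 in the data above) correspond to two set intersections. A minimal sketch, with patterns modelled as frozensets of (feature, level) pairs; the function name and the toy patterns are assumptions.

```python
def conserved(patterns_a, patterns_b, ignore_levels=False):
    """Patterns shared by two pattern sets, optionally ignoring feature levels."""
    if ignore_levels:
        def features_only(pattern):
            return frozenset(feature for feature, _level in pattern)
        return ({features_only(p) for p in patterns_a}
                & {features_only(p) for p in patterns_b})
    return set(patterns_a) & set(patterns_b)

# Toy pattern sets for illustration only.
resting = {
    frozenset({("A/T", 2), ("Tip", 1)}),
    frozenset({("A/T", 1), ("TA", 1)}),
}
activated = {
    frozenset({("A/T", 2), ("Tip", 1)}),
    frozenset({("A/T", 2), ("TA", 1)}),
}
exact = conserved(resting, activated)
by_feature = conserved(resting, activated, ignore_levels=True)
```

In the toy data one pattern is shared exactly, while two feature combinations are shared once levels are ignored, mirroring how the conserved count can only grow when levels are dropped.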
3.5 The distribution of patterns in human
We analyzed the distribution of patterns in the NCSs and LCSs
in human promoters. We observed that, in both the resting and
activated states, the patterns in both NCSs and LCSs have lower
scores closer to TSSs than farther away
from TSSs (Fig. 2-A for resting, Supplementary Fig. 1-A for
activated), which is different from the trend in yeast. We also
observed that the average scores of the NCSs are higher than those of
their neighboring LCSs near the TSS and in the gene body, while the
average scores of LCSs in the core promoter region (the -1st and
0th nucleosome locations) are higher than those of the paired
nucleosomes. In general, the -1st and 1st nucleosomes
near a TSS without a 0th nucleosome have lower scores than
those near a TSS with a 0th nucleosome in human T cells. Additionally, the
0th nucleosomes score the lowest among all nucleosomes in human
(Fig. 2-A, Supplementary Fig. 1-A).
When gene expression levels are taken into account, in
both the resting and activated states the patterns in NCSs closer to
the TSSs of lowly expressed genes have on average higher scores than
those in NCSs of highly expressed genes, while the
difference is less pronounced for nucleosomes farther
from the TSS (Fig. 2-B for resting and Supplementary Fig. 1-B for
activated). Also, the score differences between NCSs and LCSs
are larger in the lowly expressed genes than in
the highly expressed genes (see supplementary figure). This is
consistent with the hypothesis that a lack of transcriptional
activity can lead to sequence-determined nucleosome formation
(Segal et al., 2006). Fig. 2-C and 2-D show two examples of
pattern distribution around the TSS.
Similar to yeast, nucleosomes in the same promoter can exhibit
different feature patterns in the human T-cell data. For example,
the four nucleosomes in the promoter region of gene
TARDBP contain very different patterns (Fig. 2-E).
Fig. 2. Distribution of patterns in human resting status. (A) Average
pattern score for nucleosomes and linkers around the TSS. (B) Average
pattern score for nucleosomes in the top 1000 highly expressed genes and the
top 1000 lowly expressed genes. (C) Pattern score distribution curve for the
lowly expressed gene NM_032738/FCRLA in human resting status,
showing that the -4th to +2nd nucleosomes contain patterns ranked
(1, 1, 4, 3, 1, 16, 4, 311). (D) Pattern score distribution curve for the
highly expressed gene NM_033251/RPL13 in resting status, showing that the
-4th to +2nd nucleosomes contain patterns ranked (1, 1, 2,
1406, 1109, 1413, 39). (E) The 4 nucleosomes in the promoter region of
TARDBP in human resting status contain different patterns.
To investigate the pattern distribution in disturbed genes and
the pattern changes before and after T-cell activation, we
assigned pattern scores to all of the NCSs/LCSs in the promoter
regions of the 299 induced genes and 393 repressed genes.
For the repressed and induced genes in the resting state, the
scores of the -1st and 0th NCSs are generally lower than those of their
neighboring LCSs (Fig. 3). For induced and repressed genes in the
activated state, the average scores of the -2nd, -1st, 0th, 1st and 2nd
nucleosomes are both relatively low compared to the average scores in
highly expressed genes (Supplementary Fig. 2). Take the repressed gene
NM_173485 (Fig. 3) as an example: the nucleosome occupancy
near the TSS changed after activation while the pattern
distribution did not change much. In the resting state, the -1st
nucleosome is positioned in a region with a high pattern score, with a
low-score region upstream of it. After TCR
signaling activation, that nucleosome has moved into the low-score
region. For perturbed genes in the activated state, nucleosome
formation is thus more likely to be predicted by factors related to
gene activity and TCR signaling than by sequence features.
Fig. 3. Pattern distribution of perturbed genes. (A) Average pattern
score for nucleosomes and linkers in repressed genes in human resting
status. (B) Average pattern score for nucleosomes and linkers in induced
genes in human resting status. (C-D) Pattern score curve for gene
NM_173485 TSHZ2 in resting status and activated status.
3.6 Comparison of yeast patterns with human patterns
There are 35 identical patterns conserved across yeast, human
resting and human activated T cells (Supplementary Table 6). For
example, "Minor groove mobility" level 1, "Z-DNA free energy"
level 2, "Persistence length 1" level 1 and "A/T" level 2 form a
conserved pattern with high ranks in both yeast and human (rank
1st in yeast, 9th in human resting and 20th in human activated),
indicating that the three structural features together with the
sequence feature “A/T” are very important factors influencing
nucleosome formation in both species. Note that because of the
different feature value distributions in the yeast and human genomes,
the discretization levels of these features in yeast and human can
be different.
Considering only the features and not their levels, we discovered
41 conserved feature combinations across yeast, human
resting and human activated T cells (Supplementary Table 6).
For example, the feature combination of "Z-DNA free energy",
"CGCC/GGCG", "CCGCC/GGCGG" and "A/T" is conserved
but at different feature levels in yeast and human. In detail,
"Z-DNA free energy" at level 1, "CGCC/GGCG" at level 0,
"CCGCC/GGCGG" at level 0 and "A/T" at level 1 is discovered
in yeast, while in human resting and activated T cells the pattern
discovered is "Z-DNA free energy" at level 2, "CGCC/GGCG" at
level 0, "CCGCC/GGCGG" at level 0 and "A/T" at level 2. This
might be caused by the different feature value ranges in the two
species. The yeast genome has a lower GC content (38%) than
the human genome (41%). The GC content of the promoter region
[-1000, 0] of yeast and human genes is 38.32% and 53.19%, respectively.
Also, there are 53 patterns discovered in yeast (Supplementary
Table 6) that are not discovered in human. For example,
the yeast pattern "minor groove mobility" level 1, "Z-DNA free
energy" level 2, "CCCC/GGGG" level 0, "A/T" level 2 (rank 6)
is not discovered in the human data. This demonstrates that different
feature combinations help nucleosome occupancy prediction in
different species.
4. DISCUSSIONS AND CONCLUSIONS
Understanding the interaction between DNA sequence and nucleosome
occupancy is important for uncovering gene regulatory
mechanisms. Whether DNA sequence directly determines nucleosome
occupancy and nucleosome positioning, and if so to what extent,
is still under debate. We have developed an efficient
method to study DNA features and their combinations that are
useful for nucleosome occupancy prediction. Applying our
method to the yeast and human T-cell data, we discovered thousands
of feature combination patterns that are differentially enriched
between NCSs and LCSs. These feature
patterns involve both DNA structural features and sequence
features and provide multiple possibilities for nucleosome
formation. Comparing feature patterns between yeast
and human, we found that different patterns might prevail in
different species.
One important observation is that nucleosome occupancy
prediction accuracy is location-dependent. In human, the farther a
region is from the TSS, the more accurate the sequence-based
prediction. Another related observation is that nucleosome
occupancy tends to be hard to predict from the discovered feature
patterns when the containing genes are transcriptionally active.
Our results also show that feature levels can be important indicators
of NCS/LCS formation. In the yeast patterns, most features
appear at two levels (levels 1 and 2). For example, “Z-DNA
free energy” appears 39 times in the 88 patterns, 10
times at level 1 and 29 times at level 2. We observe that all patterns
containing “Z-DNA free energy” at level 1 rank lower than
44, while most of the patterns (25 out of 29) containing “Z-DNA
free energy” at level 2 rank in the top. We also noticed that
"Z-DNA free energy" level 2 and "A/T" level 2 co-occur 12
times across all the patterns, and "Z-DNA free energy" level 1 and
"A/T" level 1 co-occur 18 times.
Changing the frequency parameter alpha (we used alpha =
20% in this paper) changes the number of patterns found. In the
frequent pattern finding step of the algorithm, a smaller alpha admits more patterns, whereas a larger alpha
keeps only the patterns that occur more frequently. For
example, in the yeast data, alpha = 30% leaves only
39 frequent patterns, none of which is discriminative.
The pattern with the largest D-Score is “ATAA/TTAT” level 1,
“CGCC/GGCG” level 0, “CCGCC/GGCGG” level 0, with a score of
3.407, which does not meet our criteria for discriminative
patterns (see 2.4). In addition, none of these patterns contains a structural
feature, and each contains at most one valid feature. This is because
k-mer frequency features, especially the 4-mer and 5-mer ones,
have a high probability of not occurring in a sequence at all, which
makes these patterns more frequent than patterns with valid
features. With alpha = 10%, more patterns are included:
we obtain 2203 patterns (Supplementary
Table 7) compared with the 88 patterns discovered using alpha =
20%. All of the previous 88 patterns are included in the newly
discovered set, and we also discovered some new patterns,
such as "minor groove mobility" level 3, "Z-DNA free energy"
level 0, "persistence length 1" level 3, "A/T" level 0. This pattern
contains the same feature combination as the rank-1 pattern in
yeast, but with the feature values at different levels.
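The effect of alpha described above can be illustrated with a minimal sketch. It assumes each sequence is represented as a set of (feature, level) items and uses brute-force enumeration in place of the frequent pattern finding step, whose actual implementation is not shown in this section. The function name, the `max_len` limit and the toy profiles are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_patterns(profiles, alpha, max_len=3):
    """Return {pattern: support} for patterns with support >= alpha.

    profiles: list of sets of (feature, level) items, one set per sequence.
    alpha:    minimum fraction of sequences that must contain a pattern.
    max_len:  largest pattern size enumerated (hypothetical limit).
    """
    counts = Counter()
    for items in profiles:
        ordered = sorted(items)
        for k in range(1, max_len + 1):
            for combo in combinations(ordered, k):
                counts[frozenset(combo)] += 1
    n = len(profiles)
    return {p: c / n for p, c in counts.items() if c / n >= alpha}


# Toy profiles, invented for illustration only.
profiles = [
    {("Z-DNA free energy", 2), ("A/T", 2), ("CGCC/GGCG", 0)},
    {("Z-DNA free energy", 2), ("A/T", 2)},
    {("Z-DNA free energy", 1), ("A/T", 1), ("CGCC/GGCG", 0)},
    {("Z-DNA free energy", 2), ("CGCC/GGCG", 0)},
    {("A/T", 1), ("CGCC/GGCG", 0)},
]

loose = frequent_patterns(profiles, alpha=0.2)
strict = frequent_patterns(profiles, alpha=0.6)

# Every pattern kept at the larger alpha is also kept at the smaller one.
print(set(strict) <= set(loose))  # True
```

Because support is compared against a single threshold, the pattern set found at a larger alpha is always a subset of the set found at a smaller alpha, which is consistent with all 88 patterns reappearing among the 2203.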
The discretization method also affects pattern
discovery, because a different binning changes the frequency of each pattern. If we use {μ – 2σ, μ, μ + 2σ} as
the three cutoffs in yeast, the
discovered patterns differ: feature
values are more likely to fall in the level 1 and level 2 bins,
which makes patterns involving level 1 and level 2 more frequent. With the same parameter alpha = 0.2, we discovered 2447
patterns (Supplementary Table 7) using the new discretization
cutoffs, compared with the 88 patterns discovered using the cutoffs {μ – σ, μ, μ + σ}. The pattern with the highest
D-Score is "minor groove mobility" level 1, "Z-DNA free energy" level 2, "persistence length 1" level 1, "slide" level 1, "major
groove mobility" level 2, "propellertwist" level 1,
"CCGCC/GGCGG" level 0, "A/T" level 2. This pattern occurs 220920 times (20.15%) in the NCS profile and 226873
times (27.04%) in the LCS profile. It is composed mostly
of structural features, because the distributions of structural features
conform more closely to a normal distribution than those of k-mer frequency
features; “A/T” also conforms well to a normal distribution
compared with other k-mer frequency features. Of the 88 previous patterns, 82 are included in the new patterns, and for the other 6 we can
find corresponding patterns that contain the same features at different levels. For example, for the previous pattern "Z-DNA free
energy" level 1, "AAAA/TTTT" level 1, "A/T" level 1, we discovered a pattern with the same features but all at level 2 ("Z-DNA
free energy" level 2, "AAAA/TTTT" level 2, "A/T" level 2).
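The shift toward level 1 and level 2 under the wider cutoffs can be checked with a short sketch. It assumes a feature that is exactly normally distributed and a level coding of 0 to 3 across the three cutoffs; the boundary convention (a value equal to a cutoff goes to the higher level) and the function names are assumptions, not taken from the paper.

```python
import math


def normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def level(value, mu, sigma, k=1.0):
    """Assumed coding: 0 below mu - k*sigma, 1 up to mu,
    2 up to mu + k*sigma, 3 at or above mu + k*sigma."""
    cutoffs = (mu - k * sigma, mu, mu + k * sigma)
    return sum(value >= c for c in cutoffs)


def middle_mass(k):
    """Fraction of a normal feature that falls in levels 1 and 2 combined."""
    return normal_cdf(k) - normal_cdf(-k)


print(round(middle_mass(1.0), 4))  # 0.6827 with cutoffs at mu +/- sigma
print(round(middle_mass(2.0), 4))  # 0.9545 with cutoffs at mu +/- 2*sigma
```

For such a feature, about 68% of values fall in levels 1 and 2 with the original cutoffs and about 95% with the wider ones, so items at these two levels become more frequent and more combinations of them pass the same alpha.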
We also discovered many long patterns (combinations of more
than 4 features), whereas the previous method yielded only patterns of length 3 and 4.
These long patterns are generally
composed of the frequent features mentioned in 3.2 together with some
other features. For example, the new pattern "minor
groove mobility" level 1, "Z-DNA free energy" level 2, "persistence length 1" level 1, "major groove mobility" level 2,
"CGCC/GGCG" level 0, "A/T" level 2 contains the same
features at the same levels as the previously discovered pattern "minor groove mobility" level 1, "Z-DNA free energy" level 2,
"persistence length 1" level 1, "A/T" level 2, plus two more features: "major groove mobility" level 2 and "CGCC/GGCG" level
0.
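The two relationships used above to compare pattern sets (a new pattern that extends a previous one, and a new pattern with the same features at different levels) can be expressed as simple set operations. This is only a sketch: representing a pattern as a frozenset of (feature, level) pairs and the helper names are assumptions made for illustration.

```python
def extends(new, old):
    """True if `new` contains every (feature, level) item of `old` plus more."""
    return old < new


def same_features_different_levels(p, q):
    """True if p and q use the same features but are not identical patterns."""
    return {f for f, _ in p} == {f for f, _ in q} and p != q


# Patterns quoted in the text, written as sets of (feature, level) pairs.
previous = frozenset({
    ("minor groove mobility", 1), ("Z-DNA free energy", 2),
    ("persistence length 1", 1), ("A/T", 2),
})
longer = previous | {("major groove mobility", 2), ("CGCC/GGCG", 0)}

level1 = frozenset({("Z-DNA free energy", 1), ("AAAA/TTTT", 1), ("A/T", 1)})
level2 = frozenset({("Z-DNA free energy", 2), ("AAAA/TTTT", 2), ("A/T", 2)})

print(extends(longer, previous))                       # True
print(same_features_different_levels(level1, level2))  # True
```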
Funding:
REFERENCES
Gan,Y. et al. (2012) Structural features based genome-wide characterization and
prediction of nucleosome organization. BMC bioinformatics, 13, 49.
Hall,M. et al. The WEKA Data Mining Software: An Update. 11, 10–18.
Ho,P.S. et al. (1990) Polarized electronic spectra of Z-DNA single crystals. Biopolymers, 30, 151–63.
Abeel, T., Saeys, Y., Bonnet, E., Rouze, P. and Van de Peer, Y. (2008) Generic
eukaryotic core promoter prediction using structural features of DNA, Genome
research, 18, 310-323.
Albert, I., Mavrich, T.N., Tomsho, L.P., Qi, J., Zanton, S.J., Schuster, S.C. and
Pugh, B.F. (2007) Translational and rotational settings of H2A.Z nucleosomes
across the Saccharomyces cerevisiae genome, Nature, 446, 572-576.
Barski, A., Cuddapah, S., Cui, K., Roh, T.Y., Schones, D.E., Wang, Z., Wei, G.,
Chepelev, I. and Zhao, K. (2007) High-resolution profiling of histone methylations in the human genome, Cell, 129, 823-837.
Daenen, F., van Roy, F. and De Bleser, P.J. (2008) Low nucleosome occupancy is
encoded around functional human transcription factor binding sites, BMC genomics, 9, 332.
David, L., Huber, W., Granovskaia, M., Toedling, J., Palm, C.J., Bofkin, L., Jones,
T., Davis, R.W. and Steinmetz, L.M. (2006) A high-resolution map of transcription in the yeast genome, Proceedings of the National Academy of Sciences of the United States of America, 103, 5320-5325.
Field, Y., Kaplan, N., Fondufe-Mittendorf, Y., Moore, I.K., Sharon, E., Lubling,
Y., Widom, J. and Segal, E. (2008) Distinct modes of regulation by chromatin
encoded through nucleosome positioning signals, PLoS computational biology,
4, e1000216.
Friedman, J., Hastie, T. and Tibshirani, R. (2000) Additive logistic regression: a
statistical view of boosting, The Annals of Statistics, 28, 337–407.
Friedman, J.H. (2001) Greedy function approximation: A gradient boosting machine, The Annals of Statistics, 29, 1189–1232.
Gupta, S., Dennis, J., Thurman, R.E., Kingston, R., Stamatoyannopoulos, J.A. and
Noble, W.S. (2008) Predicting human nucleosome occupancy from primary
sequence, PLoS computational biology, 4, e1000134.
Ioshikhes, I.P., Albert, I., Zanton, S.J. and Pugh, B.F. (2006) Nucleosome positions
predicted through comparative genomics, Nature genetics, 38, 1210-1215.
Jiang, C. and Pugh, B.F. (2009) Nucleosome positioning and gene regulation:
advances through genomics, Nature reviews genetics, 10, 161-172.
Kaplan, N., Moore, I.K., Fondufe-Mittendorf, Y., Gossett, A.J., Tillo, D., Field, Y.,
LeProust, E.M., Hughes, T.R., Lieb, J.D., Widom, J. and Segal, E. (2009) The
DNA-encoded nucleosome organization of a eukaryotic genome, Nature, 458,
362-366.
Kornberg, R.D. and Lorch, Y. (1999) Twenty-five years of the nucleosome, fundamental particle of the eukaryote chromosome, Cell, 98, 285-294.
Lee, W., Tillo, D., Bray, N., Morse, R.H., Davis, R.W., Hughes, T.R. and Nislow,
C. (2007) A high-resolution atlas of nucleosome occupancy in yeast, Nature
genetics, 39, 1235-1244.
Li, B., Carey, M. and Workman, J.L. (2007) The role of chromatin during transcription, Cell, 128, 707-719.
Luger, K., Mader, A.W., Richmond, R.K., Sargent, D.F. and Richmond, T.J.
(1997) Crystal structure of the nucleosome core particle at 2.8 A resolution,
Nature, 389, 251-260.
Ozsolak, F., Song, J.S., Liu, X.S. and Fisher, D.E. (2007) High-throughput mapping of the chromatin structure of human promoters, Nature biotechnology, 25,
244-248.
Peckham, H.E., Thurman, R.E., Fu, Y., Stamatoyannopoulos, J.A., Noble, W.S.,
Struhl, K. and Weng, Z. (2007) Nucleosome positioning signals in genomic
DNA, Genome research, 17, 1170-1177.
Reynolds, S.M., Bilmes, J.A. and Noble, W.S. (2010) Learning a weighted sequence model of the nucleosome core and linker yields more accurate predictions in Saccharomyces cerevisiae and Homo sapiens, PLoS computational biology, 6, e1000834.
Schones, D.E., Cui, K., Cuddapah, S., Roh, T.Y., Barski, A., Wang, Z., Wei, G.
and Zhao, K. (2008) Dynamic regulation of nucleosome positioning in the human genome, Cell, 132, 887-898.
ACKNOWLEDGEMENTS
Yiyu Zheng et al.
Segal, E., Fondufe-Mittendorf, Y., Chen, L., Thastrom, A., Field, Y., Moore, I.K.,
Wang, J.P. and Widom, J. (2006) A genomic code for nucleosome positioning,
Nature, 442, 772-778.
Sekinger, E.A., Moqtaderi, Z. and Struhl, K. (2005) Intrinsic histone-DNA interactions and low nucleosome density are important for preferential accessibility of
promoter regions in yeast, Molecular cell, 18, 735-748.
Shendure, J. and Ji, H. (2008) Next-generation DNA sequencing, Nature biotechnology, 26, 1135-1145.
Tillo, D., Kaplan, N., Moore, I.K., Fondufe-Mittendorf, Y., Gossett, A.J., Field, Y.,
Lieb, J.D., Widom, J., Segal, E. and Hughes, T.R. (2010) High nucleosome occupancy is encoded at human regulatory sequences, PloS one, 5, e9129.
Valouev, A., Ichikawa, J., Tonthat, T., Stuart, J., Ranade, S., Peckham, H., Zeng,
K., Malek, J.A., Costa, G., McKernan, K., Sidow, A., Fire, A. and Johnson,
S.M. (2008) A high-resolution, nucleosome position map of C. elegans reveals
a lack of universal sequence-dictated positioning, Genome research, 18, 1051-1063.
Wang, Z., Zang, C., Rosenfeld, J.A., Schones, D.E., Barski, A., Cuddapah, S., Cui,
K., Roh, T.Y., Peng, W., Zhang, M.Q. and Zhao, K. (2008) Combinatorial patterns of histone acetylations and methylations in the human genome, Nature
genetics, 40, 897-903.
Yuan, G.C., Liu, Y.J., Dion, M.F., Slack, M.D., Wu, L.F., Altschuler, S.J. and
Rando, O.J. (2005) Genome-scale identification of nucleosome positions in S.
cerevisiae, Science, 309, 626-630.
Zhang, Y., Moqtaderi, Z., Rattner, B.P., Euskirchen, G., Snyder, M., Kadonaga,
J.T., Liu, X.S. and Struhl, K. (2009) Intrinsic histone-DNA interactions are not
the major determinant of nucleosome positions in vivo, Nature structural &
molecular biology, 16, 847-852.
Zhang, Y., Shin, H., Song, J.S., Lei, Y. and Liu, X.S. (2008) Identifying positioned
nucleosomes with epigenetic marks in human from ChIP-Seq, BMC genomics,
9, 537.
Zhu, Z. and Thiele, D.J. (1996) A specialized nucleosome modulates transcription
factor access to a C. glabrata metal responsive promoter, Cell, 87, 459-470.
[1] Luger, K., et al., Crystal structure of the nucleosome core particle at 2.8 A resolution. Nature, 1997. 389(6648): p. 251-60.
[2] Kornberg, R.D. and Y. Lorch, Twenty-five years of the nucleosome, fundamental particle of the eukaryote chromosome. Cell, 1999. 98(3): p. 285-94.
[3] Tillo, D., et al., High nucleosome occupancy is encoded at human regulatory sequences. PLoS One, 2010. 5(2): p. e9129.
[4] Daenen, F., F. van Roy, and P.J. De Bleser, Low nucleosome occupancy is encoded around functional human transcription factor binding sites. BMC Genomics, 2008. 9: p. 332.
[5] Sekinger, E.A., Z. Moqtaderi, and K. Struhl, Intrinsic histone-DNA interactions and low nucleosome density are important for preferential accessibility of promoter regions in yeast. Mol Cell, 2005. 18(6): p. 735-48.
[6] Zhu, Z. and D.J. Thiele, A specialized nucleosome modulates transcription factor access to a C. glabrata metal responsive promoter. Cell, 1996. 87(3): p. 459-70.
[7] Jiang, C. and B.F. Pugh, Nucleosome positioning and gene regulation: advances through genomics. Nat Rev Genet, 2009. 10(3): p. 161-72.
[8] Li, B., M. Carey, and J.L. Workman, The role of chromatin during transcription. Cell, 2007. 128(4): p. 707-19.
[9] Shendure, J. and H. Ji, Next-generation DNA sequencing. Nat Biotechnol, 2008. 26(10): p. 1135-45.
[10] Field, Y., et al., Distinct modes of regulation by chromatin encoded through nucleosome positioning signals. PLoS Comput Biol, 2008. 4(11): p. e1000216.
[11] Albert, I., et al., Translational and rotational settings of H2A.Z nucleosomes across the Saccharomyces cerevisiae genome. Nature, 2007. 446(7135): p. 572-6.
[12] Zhang, Y., et al., Intrinsic histone-DNA interactions are not the major determinant of nucleosome positions in vivo. Nat Struct Mol Biol, 2009. 16(8): p. 847-52.
[13] Kaplan, N., et al., The DNA-encoded nucleosome organization of a eukaryotic genome. Nature, 2009. 458(7236): p. 362-6.
[14] Valouev, A., et al., A high-resolution, nucleosome position map of C. elegans reveals a lack of universal sequence-dictated positioning. Genome Res, 2008. 18(7): p. 1051-63.
[15] Barski, A., et al., High-resolution profiling of histone methylations in the human genome. Cell, 2007. 129(4): p. 823-37.
[16] Schones, D.E., et al., Dynamic regulation of nucleosome positioning in the human genome. Cell, 2008. 132(5): p. 887-98.
[17] Yuan, G.C., et al., Genome-scale identification of nucleosome positions in S. cerevisiae. Science, 2005. 309(5734): p. 626-30.
[18] Lee, W., et al., A high-resolution atlas of nucleosome occupancy in yeast. Nat Genet, 2007. 39(10): p. 1235-44.
[19] Wang, Z., et al., Combinatorial patterns of histone acetylations and methylations in the human genome. Nat Genet, 2008. 40(7): p. 897-903.
[20] Segal, E., et al., A genomic code for nucleosome positioning. Nature, 2006. 442(7104): p. 772-8.
[21] Peckham, H.E., et al., Nucleosome positioning signals in genomic DNA. Genome Res, 2007. 17(8): p. 1170-7.
[22] Reynolds, S.M., J.A. Bilmes, and W.S. Noble, Learning a weighted sequence model of the nucleosome core and linker yields more accurate predictions in Saccharomyces cerevisiae and Homo sapiens. PLoS Comput Biol, 2010. 6(7): p. e1000834.
[23] Gupta, S., et al., Predicting human nucleosome occupancy from primary sequence. PLoS Comput Biol, 2008. 4(8): p. e1000134.
[24] Ioshikhes, I.P., et al., Nucleosome positions predicted through comparative genomics. Nat Genet, 2006. 38(10): p. 1210-5.
[25] Ozsolak, F., et al., High-throughput mapping of the chromatin structure of human promoters. Nat Biotechnol, 2007. 25(2): p. 244-8.
[26] Zhang, Y., et al., Identifying positioned nucleosomes with epigenetic marks in human from ChIP-Seq. BMC Genomics, 2008. 9: p. 537.
[27] David, L., et al., A high-resolution map of transcription in the yeast genome. Proc Natl Acad Sci U S A, 2006. 103(14): p. 5320-5.
[28] Abeel, T., et al., Generic eukaryotic core promoter prediction using structural features of DNA. Genome Res, 2008. 18(2): p. 310-23.
[29] Friedman, J., T. Hastie, and R. Tibshirani, Additive logistic regression: a statistical view of boosting. The Annals of Statistics, 2000. 28: p. 337–407.
[30] Friedman, J.H., Greedy function approximation: a gradient boosting machine. The Annals of Statistics, 2001. 29: p. 1189–1232.