A continuous-index Bayesian hidden Markov model for

advertisement
Biostatistics (2010), 1, 1, pp. 1–24
doi:10.1093/biostatistics/bio000
A continuous-index Bayesian hidden Markov model for
prediction of nucleosome positioning in genomic DNA
RITEN MITRA
Department of Biostatistics,
University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, U.S.A.
rmitra@bios.unc.edu
MAYETRI GUPTA∗
Department of Biostatistics,
Boston University, MA 02118, U.S.A.
gupta@bu.edu
S UMMARY
Nucleosomes are units of chromatin structure, consisting of DNA sequence wrapped around proteins called histones. Nucleosomes occur at variable intervals throughout genomic DNA, and prevent transcription factor (TF) binding by blocking TF access to the DNA. A map of nucleosomal
locations would enable researchers to detect TF binding sites with greater efficiency. Our objective
is to construct an accurate genomic map of nucleosome-free regions (NFRs), based on data from
high-throughput genomic tiling arrays in yeast. These high volume data typically have a complex
structure, in the form of dependence on neighboring probes as well as underlying DNA sequence,
∗ To
whom correspondence should be addressed.
c The Author 2010. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org.
2
R. M ITRA AND M. G UPTA
variable sized gaps and missing data. We propose a novel continuous index model appropriate for
non-equispaced tiling array data, that simultaneously incorporates DNA sequence features relevant to nucleosome formation. Simulation studies and an application to a yeast nucleosomal assay
demonstrate the advantages of using the new modeling framework, as well as its robustness to distributional misspecifications. Our results reinforce the previous biological hypothesis that higher
order nucleotide combinations are important in distinguishing nucleosomal regions from NFRs.
Keywords: data augmentation; chromatin structure; tiling arrays; FAIRE
1. I NTRODUCTION
Genomic DNA in the cell nucleus exists as a complex called chromatin, where DNA is folded
and compacted through a series of interactions with proteins called histones. At the first level, a
stretch of DNA of about 147 base-pairs (bp), is wrapped around an octamer of histones, yielding
a “nucleosome”. Nucleosome structure has been determined by X-ray crystallography (Richmond
and Davey, 2003). The presence of nucleosomes prevents transcription factors (TFs) from binding
to the DNA at those sites (Widom, 1992), resulting in interruption of gene regulatory processes.
Thus, a well-defined map of nucleosome positions could be used to improve the predictive power
of current TF binding site (TFBS) search algorithms. Most commonly used TFBS or motif search
algorithms use the sequence conservation at binding sites as the only criterion for TFBS presence
(Liu et al., 1995; Gupta and Liu, 2003). However, in complex genomes, these methods often suffer
from poor predictability and high false positive rates. Recent methods have used gene expression
measurements through regression-type models (Conlon et al., 2003; Gupta and Ibrahim, 2007) or
ChIP-chip (Chromatin Immunoprecipitation) experimental data (Buck and Lieb, 2004) to improve
predictions, but weak relationships in the first approach often lead to missing important TFBSs,
and the second approach is limited in applicability to a small set of TFs. The connection of nucleosome positioning to TF-binding has motivated some researchers to make use of information
Bayesian HMMs for nucleosome positioning prediction
3
from nucleosomal mapping for motif search. Narlikar et al. (2007) have used nucleosome mapping
information to create a motif occurrence prior. However, these methods will not work well unless
nucleosome positions are based on high-quality predictions from experimental data. Next, we discuss the various experimental assays for nucleosome detection and statistical analysis methods.
1.1 Genomic assays for nucleosome position detection
Tiling arrays are microarrays of short overlapping probes, covering a genomic region of interest, used to measure positions of genomic-level events, such as TF binding and histone methylation. In Yuan et al. (2005), based on the nucleosome-free DNA’s susceptibility to micrococcal
nuclease (MNase), nucleosomal DNA from yeast was isolated and labeled with green (cy3) fluorescent dye while the total genomic DNA was labeled with red (cy5) dye. These segments were
hybridized to microarrays printed with 50-mer oligonucleotide probes tiled at 20 bp overlaps. After
pre-processing and normalization, the ratio of green-to-red measurement intensity values indicated
how likely that locus was to contain a nucleosome. A more recent experimental technology is
Formaldehyde assisted isolation of regulatory elements or FAIRE (Hogan et al., 2006). FAIRE is
initiated by fixing whole yeast cells in a growth medium by formaldehyde, harvesting them by centrifugation, sonicating the extracts, labeling the purified DNA and hybridizing them to microarrays.
In FAIRE, regions of the genome cross-linked with histones (nucleosomes) are less susceptible
to TF binding, and get retained at the organic-aqueous interphase, while the NFRs get enriched.
Hogan et al. (2006) used 50-mer probes overlapping every 20bp, to tile yeast chromosome III using
four microarrays (3 biological and 1 technical replicate). The measured enrichment throughout the
genome, in terms of median logarithmic intensity ratios, represented NFR positioning. In Yuan et
al. (2005), the enrichment represents nucleosomes– hence the raw data roughly exhibits a negative
correlation with FAIRE enrichment values. Both used identical probes in the same region of the
genome and can be used for comparison. However, since FAIRE enriches NFRs, the exact model
framework used in Yuan et al. (2005) is not appropriate for FAIRE.
4
R. M ITRA AND M. G UPTA
1.2 Computational approaches for nucleosome detection
We now discuss computational methods for detection of nucleosome positioning from genomic assays. Most methods are adapted for MNase digestion data (Yuan et al., 2005). In Yuan et al. (2005),
a hidden Markov model (HMM) was implemented for classifying the genomic regions into nucleosomal and non-nucleosomal regions; a fully Bayesian approach was developed by Gupta (2007).
High-throughput tiling array data has complications such as amplification bias, inaccurate reads,
and varying degrees of non-random noise and measurement error. Relying solely on signal intensities for classification could propagate errors if the nucleosome predictions are used as input data
for motif discovery. Recent work has linked DNA sequence features with nucleosome formation
(Sekinger et al., 2005). Segal et al. (2006) constructed a nucleosomal map of yeast from the distribution of dinucleotides in nucleosomal DNA. However, neglecting NFR sequence information
weakened their classification approach. Other methods include Peckham et al. (2007), who used
support vector machines; and Lee et al. (2007) who used Lasso regression on sequence features.
Yuan and Liu (2008) used wavelet decomposition to represent the sequence signal as covariates
in a logistic model. However, in all these approaches, restricting the sequence features to dinucleotides is a potential reason for high misclassification rates, in spite of clear biological evidence
that sequence composition strongly affects nucleosome positioning (Sekinger et al., 2005). Also,
sequence features were not used in prediction– the relationship was explored only by obtaining
nucleosome predictions from intensity data and inferring potential sequence influences.
In FAIRE, employing an HMM with equal probe spacings would misspecify the model and
give erroneous state predictions due to gaps and missing probes. In ChIP-chip assays to detect
TFBSs, missing probes are not as serious a problem as TF-bound enriched probes constitute a tiny
fraction of the total. The conformability of DNA structure has been shown to depend on a large
set of nucleotide combinations (Sekinger et al., 2005). Here we propose a novel model framework
for FAIRE data extending HMMs in two directions: (i) an underlying continuous-index Markov
Bayesian HMMs for nucleosome positioning prediction
5
process allowing gaps and missing probes, and (ii) incorporation of sequence features, not limited to dinucleotides, into nucleosomal state prediction, through dimension reduction techniques.
This unified model uses signal intensities and sequence features to predict nucleosome locations.
Continuous-index Markov processes have been applied to genomic problems such as copy-number
variation detection (Stjernqvist et al., 2007) and genetic linkage analysis (Lander and Green, 1987),
but the present problem presents some unique characteristics. In Sections 2 and 3, we describe the
new model framework and estimation. In Section 4 we show that including sequence features of a
higher order leads to superior predictive power, through real and simulated data applications.
2. M ODEL FRAMEWORK
Next, we propose a model to extract positional estimates of NFRs from FAIRE enrichment data
(Hogan et al., 2006). Although our model is developed with the FAIRE data structure in mind, it is
a general model which can be used for a variety of high-throughput microarray settings. Also, from
a biological standpoint, since the FAIRE technology is simpler to use than the MNase procedure
(Yuan et al., 2005), it is likely to be useful in nucleosomal studies for a wide variety of organisms.
2.1 Data structure
Due to experimental and imaging errors, the data often has missing probe measurements, with the
intervals between measurements varying in length. For each fully observed probe, the following
data are present: (i) Yij = log-ratio of the intensity of the replicate j of probe i ; (ii) ti = Nucleotide
index of the center of probe i; and (iii) X i = d-dimensional vector of covariate measurements for
probe i; for i = 1, . . . , P (probes), and j = 1, . . . , R (replicates). We define ta,b = tb − ta as the
gap between the centers of probes indexed a and b, which is the genomic distance between two
probes (measured at a base pair scale). If all probes are of equal length and equispaced (which is
typically the case), the gaps are solely due to the existence of missing probes, leading to t(a, b)
being a constant multiple of the number of missing probes in between. In this case, it is equivalent
to work with the metric of the number of missing probes as a distance measure.
6
R. M ITRA AND M. G UPTA
2.2 Continuous index hidden Markov process
Due to missing probes, the underlying nucleosome occupancy status can be considered an unobserved stochastic process evolving over a continuous index, denoted by [Z(t), t > 0]. We assume
Z(t) to be a two-state Markov process; Z(t) = 1 (0) represents an NFR (non-NFR) state, i.e.,
P [Z(ti )|Z(ti−1 ), . . . , Z(t1 ), Y i−1 , . . . , Y 1 ] = P [Z(ti )|Z(ti−1 )] = Pzi−1 zi (ti − ti−1 ), (2.1)
where Pzi−1 zi (·) denotes the transition probability that state zi was occupied at index ti given that
the process was in state zi−1 at index ti−1 . First, let us assume that the transition probabilities of Z
are stationary. Also, conditional on the state of Z at index ti , let the observed measurement Y i be
independent of the hidden process and all measurements prior to index ti , that is,
P [Y i |Z(ti ), . . . , Z(t1 ), Y i−1 , . . . , Y 1 ] = P [Y i |Z(ti )] = f (Y i |zi ).
(2.2)
Expressions (2.1) and (2.2) constitute a continuous index hidden Markov process. For notational
simplicity, we shall henceforth refer to Z(ti ) as simply Zi . Next, assume that the probability of
staying in a state is linear with respect to index for an infinitesimal interval. The two-state hidden
process can be parameterized by the transition rate from a nucleosomal state to a NFR (λ), and the
λ
.
reverse transition rate (µ). The matrix of instantaneous transition rates is given by Q = −λ
µ −µ
The matrix of transition probabilities P (t) over an interval t is generated by the matrix exponential
of Q, that is, P (t) = exp(Qt). Indexing the nucleosomal (non-NFR) state by 0 and NFR state by
1, we then have the following transition probabilities:
P00 [t] =
µ
λ −(λ+µ)t
+
e
,
λ+µ λ+µ
P11 [t] =
λ
µ −(λ+µ)t
+
e
.
λ+µ λ+µ
(2.3)
For sampling parameters under the MCMC framework, a parameterization in terms of the logintensities is useful, i.e. we take log λ = θl0 and log µ = θm0 . The initial distributions of the hidden
states can be taken to be uniform, that is, π = (π0 , π1 ) = (.5, .5).
Bayesian HMMs for nucleosome positioning prediction
7
We next specify the emission densities. Since the mean intensity may vary between probe replicates in the same state, we implemented a hierarchical Gaussian model with separate means for
each probe, centered around a state-specific mean with a state-specific variance:
Yij |Zi = k ∼ N (νki , σk2 );
and
νki ∼ N (νk0 , τk2 ),
(2.4)
where k ∈ {0, 1}. A more general form of the model would involve probe-specific variances. Although it is not possible to estimate the variances belonging to probes in different states (since the
states are unobserved), we found that (i) variances within probes were almost uniform throughout
the data set, (the variance of probe-specific variances was less than 0.01), and (ii) for regions in the
FAIRE data corresponding to strongly predicted nucleosomes or NFRs in the Yuan et al. (2005)
data, the variability of probe variances was very small. So we settled for a less complex model
which accounted for diversity among probes only in terms of the mean structure.
2.3 Adding covariate effects to the model
Underlying characteristics of the DNA sequence that affect chromatin rigidity may affect positioning of nucleosomes. In previous studies, the prevalence of polynucleotide sequences such as
poly-A (repetitions of “A” ), poly-T have been seen to affect nucleosome positioning. Hence sequence features may affect the correct prediction of nucleosomal state. We use three levels of
models, relating the covariates to the transition rates, or emission probabilities, via link functions.
• Model M0: the original or “base” model (Eqns. 2.3 and 2.4 ) assuming that the covariates do
not affect either the state transitions or emissions.
• Model M1 (“transition model”): Here a multiplicative intensity model associates the covariates
to the transition rates, assuming that the transition rate during the interval between two points
depends on the value of the covariates at the end of the interval, i.e. the last probe. Let Xij
denote covariate j (j = 1, . . . , d) for probe i. The transition rates over the interval [ti−1 , ti ] are:
log λi (X i ) = θl0 +
d
X
j=1
θlj Xij ;
log µi (X i ) = θm0 +
d
X
j=1
θmj Xij .
(2.5)
8
R. M ITRA AND M. G UPTA
In the later sections, to simplify notation, we shall denote λi (X i ) and µi (X i ) as λi and µi .
• Model M2 (“emission model”): Here the probability distribution of the observations conditional
on the hidden state can be affected by the covariate measurements. We assume that the statespecific probe measurement mean can be modeled as a linear function of the covariates, i.e.,
d
X
2
νki ∼ N νk0 +
βkj Xij , τk .
(2.6)
j=1
• Model M3 (“full model”): Here we assume both the transition intensities and state emission
probabilities are affected by the covariates, that is, both expressions (2.5) and (2.6) hold.
None of these models lead to a closed-form analytical expression for the likelihood due to the
hidden state variable Z, although this can be computed numerically through recursive techniques,
as discussed in Section 3. Before the model estimation procedure, we briefly discuss identifiability
in the proposed models, as it relates to the validity of inference under minimally informative priors.
2.4 Identifiability of the nucleosomal prediction models
Non-identifiability of a model implies that more than one set of parameter values lead to the same
likelihood. In a Bayesian framework, using an informative prior usually ensures identifiability.
When using flat priors to minimize prior effects, non-identifiability can lead to serious convergence
problems in Markov chain Monte Carlo (MCMC) and bias inference. To establish identifiability for
our proposed models, we extend the results of Teicher (1967), after stating the following results.
Definition. Let fφ (y) be a parametric family of densities of Y with respect to a common dominating measure µ and parameter φ in some set Φ. If π is a probability measure on Φ, then the density
R
fπ (y) = Φ fφ (y)π(dφ) is called a mixture density. We say that the class of (all) mixtures of fφ is
identifiable if fπ = fπ′ ,
µ − a.e. iff π = π ′ . Further, we say that the class of finite mixtures of fφ
is identifiable if for all measures π and π ′ with finite support, fπ = fπ′ µ a.e iff π = π ′ .
Proposition 1 (Teicher, 1967). The class of joint finite normal mixtures is identifiable.
Proposition 2 (Teicher, 1967). Assume that the class of finite mixtures of the family fφ of densi-
Bayesian HMMs for nucleosome positioning prediction
9
ties of Y with parameter φ ∈ Φ is identifiable. Then the class of finite mixtures of n-fold product
(n)
densities fφ (y) = fφ1 (y1 ) . . . fφn (yn ) with φ = (φ1 , . . . , φn ) ∈ Φn is identifiable.
Proposition 2 was proved by induction on n (Teicher, 1967). Since any HMM is a finite mixture
of n-fold product densities, where the the weights of the mixture are functions of the transition
probabilities, we applied the above results to prove the identifiability of models M0-M3.
Theorem 1 Models M0, M1, M2 and M3 are identifiable. In other words, if η denotes the total
set of all parameters in any of the four models, and L(η; y) denotes the likelihood of η, there does
not exist any η ′ 6= η such that L(η; y) = L(η ′ ; y) for all y.
Details for the proof of Theorem 1 are provided in the online Supplementary Materials.
3. M ODEL FITTING AND ESTIMATION PROCEDURE
Let η = (θ, β, ν, σ 2 ) denote the set of all parameters. (For model M0, θ = (λ, µ), ν = (ν00 , ν10 ),
and for all models except M3, at least one parameter becomes void.) The likelihood for N probes,
conditional on the covariates X = (X 1 , . . . , X N ) and locations T = (t1 , . . . , tN ), is
L(η)=P (Y 1 , . . . , Y N |X, T , η)
= [π0 P (Y 1 |z1 = 0)+π1 P (Y 1 |z1 = 1)]
N
X Y
Pzi−1 ,zi |xi (ti −ti−1 )f (Y i |X i , zi )
(3.1)
z1 ,...,zN i=2
Direct evaluation of (3.1) would involve summing over all possible sequences of hidden states
z1 , . . . , zN . Under the Markovian assumption, this can be computed recursively by a forward sum.
Expectation-maximization or EM (Dempster et al., 1977) is often used to estimate HMM
parameters. However, due to the complex transition probability expressions here (the complete
data likelihood consists of terms arising from the transition equations (2.3)), no closed form Mstep exists, necessitating computationally intensive numerical optimization. Although analytically
tractable EM steps exist for some continuous time HMMs (Roberts and Ephraim, 2008), we chose
to avoid this due to model complexity and the tendency of EM to converge to local optima. A
Bayesian MCMC sampling-based approach is feasible due to the log-concave forms of many of
10
R. M ITRA AND M. G UPTA
the conditional distributions. MCMC also allows us to obtain the full posterior probability landscape of the parameters rather than a single maximum likelihood-based point estimate. Next, we
discuss prior elicitation and the recursion-based MCMC algorithm.
3.1 Prior elicitation and sensitivity analysis
It can be deduced from the likelihood that the conditional posteriors assume proper densities with
a flat prior. The emission densities are normal, so marginalizing the joint density along the axes
of emission parameters is equivalent to integrating out normal densities, eventually yielding a
weighted summation of finite terms. The transition rate expressions are a finite summation of
terms of the type f (a) exp(−g(a)) where f is a function bounded by [0, 1] and g is a positive scalar
multiple in model M0 and a monotonic function in M1. For all such functions, the integral is finite,
leading to a finite summation of finite terms. However in two cases, if all probes are classified into
either the 0 or 1 state, integrating out with a flat prior would yield infinity, leading to an improper
posterior where Gibbs sampling may fail. This can be avoided through a constrained hidden state
model with the state space excluding these two cases, yielding a proper posterior. When sampling
from the constrained model, the conditional probabilities of the hidden states (excluding the all-0
or all-1 case) would differ from the true conditional probabilities (including the extreme states)
by a negligible margin, because the probabilities of these two conditions is infinitesimally low.
In our genomic data application, the log intensities are so distributed that the sampler can almost
never assign zeros to all the positions simultaneously (except when we initialize one transition
rate to be unrealistically small so that the corresponding transition probability is 0). In practical
implementation, our sampling scheme can reach one of these two states with a low probability– if
so, we can ignore the sample and re-sample. This is a valid step under the constrained sampling setup– using a Gibbs sampler as a proposal for a Metropolis-Hastings step that rejects the proposed
move if the state sequence is constant.
Based on the above, we assumed non-informative (uniform) priors for the logarithm of the tran-
Bayesian HMMs for nucleosome positioning prediction
11
sition rates and the emission mean and variance (on the logarithmic scale) parameters of our model.
We also fit the models assuming a minimally informative normal prior (with a variance of 5) on the
log-transition rates. Then we sequentially reduced the variance by .25 units, to assess the change.
In the first case, there was no identifiable difference in the results compared to the flat prior. When
the prior variance was reduced beyond 1, the log transition rates were pushed towards 0. Hence
we chose the emission and transition parameter priors to be minimally informative, specifically:
p(log(λ)) ∝ 1; p(log(µ)) ∝ 1; p(νk0 ) ∝ 1 (k = 0, 1); p(σk2 ) ∝ σk−2 (k = 0, 1); p(τk2 ) ∝ τk−2 ;
p(θmj ) ∝ 1, p(θlj ) ∝ 1 (j = 1, . . . , d); and p(βkj ) ∝ 1 (k = 0, 1; j = 1, . . . , d).
3.2 The MCMC sampling algorithm
Let us denote Pjk (ti+1 |ti ) as the transition probability from state j to state k, for adjacent probes at
positions ti+1 and ti , which is implicitly dependent on the transition rates λ and µ. The following
steps constitute one iteration of the MCMC sampler.
1. Use the forward algorithm (Rabiner, 1989) to recursively calculate the likelihood. This algorithm iteratively computes the likelihood of the data till a step in the sequence, with the
last step being in a particular state. The first application of this algorithm for HMMs was
by Baum et al. (1970) and later extended for a variety of models (Rabiner, 1989). Let Fji
denote the “partial” likelihood until position i of the sequence, where position i is in state
j, i.e., Fji = P (Zi = j, Y1 , . . . , Yi |X1 , . . . , Xi ). In our model, this is equivalent to: Fji =
P
Qi
l=2 Pzl−1 ,zl |xl (tl |tl−1 )f (Y l |X l , zl , η), j = 0, 1 representing the nucleoz1 ,...,zi−1 P (Y 1 |z1 )
somal and NFR states. The recursive procedure to calculate the likelihood is given by
Fji=[F0i−1 P0j (ti |ti−1 ) + F1i−1 P1j (ti |ti−1 )]f (Y i |X i , Zi = j, η),
i = 2, . . . , N .
(3.2)
The initial conditions are: Fj1 = πj f (Y 1 |X 1 , Z1 = j, η), (j = 0, 1), and the full likelihood of
the entire sequence is given by F0N + F1N .
2. Next, we employ a backward sampling procedure (Chib, 1996; Scott, 2002) to get a sample of
12
R. M ITRA AND M. G UPTA
the hidden states. Conditional on the sampled states at positions ti+1 , . . . , tN , the probability
that position i is in state j is P (Zi = j|Zi+1 , . . . , ZN , Y , X, η) ∝ Fji Pj,Zi+1 (ti+1 |ti ). The
probability that the last position is in state j is proportional to FjN . In application, the above
procedures are reformulated in terms of logarithms to avoid computational underflow.
3. Conditional on the other parameters, hidden states, and data, update the transition parameters
using a Metropolis-Hastings procedure. For models M0 and M2, this is done directly for θl0 and
θm0 , while for models M1 and M3 this is done for each component in turn.
4. Conditional on the other parameters, hidden state path, and observed data, update the emission
parameters νk0 (k = 0, 1) for M0 and M1 and additionally, β k , for M2 and M3.
Details of the MCMC steps are provided in the online Supplementary Materials.
4. A PPLICATION TO YEAST NUCLEOSOME ARRAY DATA
The estimation procedure for the base, transition and emission models was applied to the FAIRE
data set from Hogan et al. (2006), that comprised a total of 13947 probes. The data structure
consists of two elements:
• Signal: This gives the log of the hybridization ratios obtained from the microarrays. For each
probe we have 3 replicates each giving a measure of the intensity data at the particular probe.
The probes are evenly distributed (i.e. with constant spacing) in the data set. The higher the
intensity, the greater the chance for the probe to be in a nucleosome free region. The data
was preprocessed by a z-score standardization to remove skewness (Hogan et al., 2006). The
missingness pattern in the probes was random, and the maximum length of a missing block was
13. The non-missing signal values numbered 12760, about 88% of the total data set.
• Sequence: Each probe was of length 50 nucleotides, with an overlap of 20 nucleotides with
the adjacent probe. As covariates which may potentially influence the nucleosomal signal, we
first extracted an initial feature set consisting of counts of all oligomers upto length 5. For
example, a measurement of a covariate corresponding to the dinucleotide TA would be the
Bayesian HMMs for nucleosome positioning prediction
13
number of occurrences of this dinucleotide throughout the length of the probe. Each probe thus
corresponds to a point in a 1364-dimensional Euclidean space, the coordinate for each probe
given by the corresponding oligomer word counts. These covariates, especially for longer size
oligomers, have very sparse counts, and tend to be highly correlated with counts of words which
are smaller segments. To avoid collinearity and also reduce dimensionality of the covariate
space, we performed principal components analysis (PCA) on these 1364 covariates for all the
probes in our data set. The first five principal components (PCs) explained about 95% of the
variability in the covariate space, and were retained as the final collection of model covariates.
4.1 Model-fitting using three models
Next, we compared the performance of the models by applying them to the nucleosomal array
data. All analyses were run in the statistical software R (http://www.r-project.org/). The MCMC
algorithm was initialized at values from a N(0,1) distribution for log(λ), log(µ), θ, β, ν00 and
ν10 , while the variance parameters were initialized to 1. Multiple starting points made no significant difference in the results. Autocorrelation plots indicated that convergence was attained for all
models within 1000-5000 iterations, which was taken as the burn-in period. 10000 iterations (after
burn-in) were used for inference. We present a subset of the analyses that indicate most strongly
the power of the new approach; more details are presented in the online Supplement.
Table 1 shows that only the second PC was significant at a 5% level, in models M1 and M2.
For M3, MCMC convergence was very slow, hence we did not use this model further. In M0,
the λ and µ estimates imply that if a particular probe was covered by an NFR, the probability of remaining in the state is very high (0.98). This supports the hypothesis that nucleosomes
are separated by long NFRs. The predicted nucleosomal states also appear to be long; which is
natural as the low resolution of the data leads to detecting enrichment of long nucleosome-free
regulatory regions, but not of short nucleosome-free linker regions (< 10 bp) between adjacent
nucleosomes. In M1, the second PC was significant in both nucleosomal and NFR categories.
14
R. M ITRA AND M. G UPTA
The opposing signs of the estimates imply that the AT oligomers associated with this covariate help in continuation of the NFR subsequences and are detrimental to nucleosome formation.
Computing the weights of the dinucleotide counts for this PC show that oligomers that are important in differentiating between nucleosomal and NFR regions are: A,T, AA, TT, AT,TA,
AAT,ATA,ATT,TTA,TAT,TAA,TTTTA,TTTAT (all have weights > .1: AT and TA were highest; tri-nucleotides were intermediate). In M2, the second PC was significant only in the NFR state.
We re-fitted models M1 and M2 using the 14 oligomer counts associated with the second PC
(Tables 3 and 4 in the online Supplement). AA and TT were the most significant variables for the
NFR state in M2, while none of the nucleosome parameters were significant. This suggests that
the intensity data within the nucleosomes does not depend on local DNA sequence– however, in
NFRs, the intensity mean is a variable function of the the AA and TT dinucleotide counts. We
observed that the effect of the other oligomer combinations, which were a part of the second PC,
vanish when we re-fit the model with oligomer counts. This implies that the effect of the second
PC was borne primarily by the counts of these two dinucleotide categories. These results reiterate
the major importance of the AA and TT dinucleotides in influencing the state lengths. Here, as
in the original M1, these two features were important for both nucleosomal and the NFR states.
Figure 1 shows the empirical CDF of the posterior probabilities of probes being classified into the
nucleosomal state in the three models. The flatness of the center, with sharp rises at the two ends,
demonstrate that most probes have a very high or low chance of being classified as an NFR, and are
robust to the actual cutoff. The qq-plots of the nucleosomal and NFR state (Supplementary Figure
1) demonstrate no significant departures from the fitted model.
Model comparison. An important question is which model is most appropriate for the data.
Computing the Bayes factor analytically would be difficult due to model complexity. A simpler
alternative (approximation) is the Bayesian Information Criterion (BIC). It is straightforward to
compute the BIC using BIC = −2 log(L̂) + k log(n), where L̂ is the modal posterior likelihood,
Bayesian HMMs for nucleosome positioning prediction
15
k is the number of parameters, and n is the number of data points. The BIC for the three models
are 17154.27 (M0), 14386.92 (M1) and 13337.19 (M2), showing that incorporation of sequence
features is an essential part in determining the structural classification. Model M2 gave the best fit,
indicating that local sequence features indeed influence the measured intensities in nucleosomal
regions and NFRs; however, the sequence does not exhibit as strong an effect in determining the
lengths of the state of neighboring regions.
4.2 Biological validation with known NFR regions
For biological validation, we extracted a set of yeast regulatory regions from the UCSC database
(http://genome.ucsc.edu/). Although the set of all validated NFRs is not available, the presence of
active TFBSs is a useful indicator that the region is likely to be nucleosome-free. We compared
the overlap of predicted NFRs with known TFBS locations (Figure 2, and Table 12 in the online
Supplement). The continuous-index models perform excellently compared with the Yuan et al.
(2005) method: M0 and M1 have a >90% correct classification rate, M2 has 70%. Since we do
not have a corresponding database of nucleosome regions, the false positive rate is underestimated.
This may be why M2 has a lower sensitivity than M0 in this comparison, although it fits the data
better. Figure 3 shows the Receiver Operator Characteristic (ROC) for all 3 models, using the
UCSC genome database as our reference, assuming the known TFBSs to be the only NFRs. The
ROCs show that all models perform equivalently well. Figure 3 shows that M1 has the highest
specificity in prediction. M1 allows the sequence information to locally influence the length of the
nucleosomal and non-nucleosomal states. The biological significance of this mechanism needs to
be explored further to elucidate the relationship between nucleosomes and sequence signals.
5. S IMULATION STUDIES
We next performed simulation studies to determine the (i) consistency of model estimates under
different parameter settings, (ii) detection of the correct model under model misspecification and
(iii) importance of the continuous index assumption in modeling gapped tiling array data.
16
R. M ITRA AND M. G UPTA
5.1 Consistency of model parameter estimation
Data sets were generated first under models M0, M1 and M2. Ten data sets of size 5000 probes
were generated for each parameter setting for each model, which were fit using MCMC (Section 3).
In each case, the estimated bias and MSE showed consistent estimation of the model parameters.
Here we describe one study in detail. For M0, the emission distributions were assumed to be normal
with means -0.4 and 0.4, λ and µ were set to be -3, and emission variances were set to be 0.35. For
M1 and M2, the intercept parameter for both states were set as 1. Each regression coefficient (θ for
M1, and β for M2) for the 5 covariates was fixed either as 1 or 0, for both nucleosomal and NFR
states. Thus a typical β (or θ) vector would be, for example, (1, 0, 1, 0, 0, 0) for the nucleosomal
and (1, 0, 0, 0, 0, 0) for the NFR state. For all 3 models, 2500 iterations were run for each parameter
setting, with a burn-in of 500. For evaluating misclassification rates, probes having an average state
membership probability greater than 0.5 were rounded off to 1, while others were set at 0. For all
data sets under model M0, the maximum MSE of the emission (transition) parameters were less
than 0.0001 (0.03), and the maximum misclassification rate was 0, indicating the model was fit
accurately in each case. For M1, the maximum MSE for the corresponding parameters was also
0.03, with the maximum misclassification rate being 0.1; the MSE for θ ranged between 0.05 and
0.09. For M2, the maximum MSE for emission and transition parameters was slightly higher (0.08
for µ), the maximum misclassification rate was 0.1, and the range of MSE for β was .001 to .004.
5.2 Cross comparison under different models
Next, we simulated data under each model, and estimated parameters and states from all models,
in order to judge whether in each case the correct model was most accurate. Five data sets were
simulated under each model. The variances were fixed to 0.48 and 0.64 for the nucleosomal and
NFR states. For M0, the transition rates were -3 and -3. For models M1 and M2, the coefficient
for the first PC in the NFR state was fixed to 1, while the others were set to 0. The simulated data
sets were of size 5000; the sequence features of the first 5000 probes of the FAIRE data set were
Bayesian HMMs for nucleosome positioning prediction
17
used as covariates. When the simulation and estimation models matched, the method performed
very accurately, with the maximum MSE of the emission parameters µ and σ being less than 0.01,
and for λ and µ, less than 0.05; and the averaged misclassification rates also the lowest (Table 2).
In each case, the BIC chose the correct model. Also for a simulation model, say MA , when we
estimated it by a model which contains it, say MB , we obtained a similar value of the maximal
log likelihood. In an ideal scenario, the extra components in the bigger model would turn out
to be 0, and we would get the same likelihood– however, this is typically not achieved in most
data sets, necessitating the use of criteria like the BIC. However, in our case, we see that the log
likelihood values are very close (the error margin is less than .01) and the expected log-likelihood,
obtained from the averages of the MCMC iterations are equal. One possible reason is the large
size of the data set, leading to the errors in estimation being smoothened to a considerable degree,
and the parameters equaling the true simulation parameters with high probability. Due to MCMC
convergence problems, model M3 could not be tested. However we generated 5 data sets from
this model (with similar parameter settings as above), and ran the other three estimation models
on these data sets. The percent of correctly classified probes under models M0, M1 and M2 were
81%, 92% and 94% respectively. In 2 of the 5 datasets, model M1 achieved the lowest BIC and
MSE, while in the other 3, model M2 was best; model M0 was weakest.
5.3 Importance of the continuous index model
In order to show that the continuous index HMM was an essential improvement required to fit
the gapped probe data, we simulated a data set of length 5000, where the gap distribution was the
same as in the FAIRE data HMM. We also simulated a second data set of 5000 measurements
where the gap structure was assigned randomly. Both data sets were analyzed using both discrete
and continuous index HMMs (Table 3). Comparing the number of predicted matches with the
simulated state set, it is clearly seen that the discrete-index HMM fails to achieve as high a correct
classification rate as the continuous-index model, in the gapped data framework.
18
R. M ITRA AND M. G UPTA
6. D ISCUSSION
We propose a novel extension of the nucleosomal region prediction method with two major improvements: (i) a continuous-index hidden Markov model to accommodate missing or variablegapped probes; and (ii) using underlying DNA sequence features that influence transition rates
and emission densities. We have demonstrated the model advantages on FAIRE data (Hogan et al.,
2006), now widely used for prediction of NFRs. Our methods are efficient, given their complexity–
the time taken (in seconds) for 1000 MCMC iterations for models M0, M1, and M2 are 15343s,
47007s, and 18401s respectively. We proved identifiability conditions that allow construction of
a straightforward MCMC procedure with minimally informative priors. In applications, models
M0 and M1 performed similarly, and were efficient at predicting regulatory regions, indicating
that a non-homogeneous transition rate is probably most accurate for predicting NFRs. Estimates
from models M1 and M2 show that DNA nucleotide combinations play a significant role in determining nucleosome positions: Adenosine and Thymine (A and T) appear most influential. Of
the different combinations, the dinucleotides (AT and TA) carry the maximum weight, followed by
tri-nucleotides. These results support the biological hypothesis that the presence of A and T-based
nucleotide combinations contributes to the rigidity of the DNA, favoring positioning of NFRs
(Sekinger et al., 2005). Our models successfully extend the HMM methodology to gapped data
sets, opening up the possibility for synergistically modeling multiple complex data sets having
varying probe intervals. Recent efforts have focused on combining data from multiple probe replicates into a hidden Markov framework (Johnson et al., 2009), and to develop sophisticated technologies to overcome major problems of ChIP-on-chip technology by performing massive parallel
sequencing (Park, 2009). However, such technologies are in preliminary stages of development
and critical refinement is necessary before fully realizing their potential and limitations.
S UPPLEMENTARY M ATERIALS
The on-line Supplementary Materials provide technical appendices and additional simulations.
Bayesian HMMs for nucleosome positioning prediction
19
ACKNOWLEDGMENTS
This work was partially supported by the NIH (NHGRI) award HG004946 to M. G. The authors
are grateful to Jason Lieb and Paul Giresi for many helpful discussions and the FAIRE data, and to
two anonymous reviewers whose suggestions made significant improvements to the article.
R EFERENCES
BAUM , L. E., P ETRIE , T., S OULES , G. AND W EISS , N. (1970). A maximization technique occurring in the statistical analysis of
probabilistic functions of markov chains. Ann. Math. Statist. 41, 164–170.
B UCK , M.J. AND L IEB , J.D. (2004). ChIP-chip: considerations for the design, analysis, and application of genome-wide chromatin
immunoprecipitation experiments. Genomics 83(3), 349–60.
C HIB , S. (1996). Calculating posterior distributions and modal estimates in Markov mixture models. J. Econometrics 75(1), 79–97.
C ONLON , E. M., L IU , X. S., L IEB , J. D.
AND
L IU , J. S. (2003). Integrating regulatory motif discovery and genome-wide
expression analysis. Proc. Natl Acad. Sci. USA 100(6), 3339–3344.
D EMPSTER , A. P., L AIRD , N. M
AND
RUBIN , D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm.
J. Roy. Stat. Soc. B 39(1), 1–38.
G UPTA , M. (2007). Generalized hierarchical Markov models for the discovery of length-constrained sequence features from
genome tiling arrays. Biometrics 63(3), 797–805.
G UPTA , M
AND I BRAHIM ,
J G. (2007). Variable selection in regression mixture modeling for the discovery of gene regulatory
networks. J. Am. Stat. Assoc. 102(479), 867–880.
G UPTA , M.
AND
L IU , J.S. (2003).
Discovery of conserved sequence patterns using a stochastic dictionary model.
J.Am.Stat.Assoc 98, 55–56.
H OGAN , G. J., L EE , C.-K. AND L IEB , J. D. (2006). Cell cycle-specified fluctuation of nucleosome occupancy at gene promoters.
PLoS Genet 2(9), e158.
J OHNSON , W., L IU , X. AND L IU , J.S. (2009). Doubly-stochastic continuous-time hidden Markov approach for analyzing genome
tiling arrays. Ann. Appl. Stat. 3, 1183–1203.
L ANDER , E. S.
AND
G REEN , P. (1987). Construction of multilocus genetic linkage maps in humans. Proc. Natl. Acad. Sci.
U.S.A. 84, 2363–2367.
L EE , W., T ILLO , D., B RAY, N., M ORSE , R. H., DAVIS , R. W., H UGHES , T. R.
AND
N ISLOW, C. (2007). A high-resolution
20
R. M ITRA AND M. G UPTA
atlas of nucleosome occupancy in yeast. Nat. Genet. 39, 1235–1244.
L IU , J. S., N EUWALD , A. F.
AND
L AWRENCE , C. E. (1995). Bayesian models for multiple local sequence alignment and Gibbs
sampling strategies. J. Am. Stat. Assoc. 90, 1156–1170.
NARLIKAR, L., G ORDAN , R.
AND
H ARTEMINK , A. J. (2007). A nucleosome-guided map of transcription factor binding sites in
yeast. PLoS Comput. Biol. 3, e215.
PARK , P. J. (2009). ChIP-seq: advantages and challenges of a maturing technology. Nat. Rev. Genet. 10, 669–680.
P ECKHAM, H. E., T HURMAN , R. E., F U , Y., S TAMATOYANNOPOULOS , J. A., N OBLE , W. S., S TRUHL , K.
AND
W ENG , Z.
(2007). Nucleosome positioning signals in genomic DNA. Genome Res. 17, 1170–1177.
R ABINER , L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77(2),
257–286.
R ICHMOND , T. J. AND DAVEY, C. A. (2003). The structure of DNA in the nucleosome core. Nature 423, 145–150.
ROBERTS , W.J.
AND
E PHRAIM , Y. (2008). An EM algorithm for ion-channel current estimation. IEEE Trans. Sig. Proc. 56,
26–33.
S COTT, S. L. (2002). Bayesian methods for hidden Markov models. J. Am. Stat. Assoc. 97, 337–351.
S EGAL , E., F ONDUFE -M ITTENDORF, Y., C HEN , L., T HASTROM , A., F IELD , Y., M OORE , I. K., WANG , J. P. AND W IDOM , J.
(2006). A genomic code for nucleosome positioning. Nature 442, 772–778.
S EKINGER, E. A., M OQTADERI , Z.
AND
S TRUHL , K. (2005). Intrinsic histone-DNA interactions and low nucleosome density
are important for preferential accessibility of promoter regions in yeast. Mol. Cell 18, 735–748.
S TJERNQVIST, S., RYDEN , T., S KOLD , M.
AND
S TAAF, J. (2007). Continuous-index hidden Markov modelling of array CGH
copy number data. Bioinformatics 23, 1006–1014.
T EICHER , H. (1967). Identifiability of mixtures of product measures . Ann. Math. Stat. 38, 1300–1302.
W IDOM , J. (1992). A relationship between the helical twist of DNA and the ordered positioning of nucleosomes in all eukaryotic
cells. Proc. Natl. Acad. Sci. U.S.A. 89, 1095–1099.
Y UAN , G. C. AND L IU , J. S. (2008). Genomic sequence is highly predictive of local nucleosome depletion. PLoS Comput. Biol. 4,
e13.
Y UAN , G. C., L IU , Y. J., D ION , M. F., S LACK , M. D., W U , L. F., A LTSCHULER , S. J.
scale identification of nucleosome positions in S. cerevisiae. Science 309, 626–630.
AND
R ANDO , O. J. (2005). Genome-
Bayesian HMMs for nucleosome positioning prediction
21
Table 1. 95% Credible Intervals (CI) for parameter estimates for models M0 (base), M1 (transition), and M2 (emission). The indices l (or 0) and m (or 1) indicate the nucleosomal and NFR states; for instance, the term θl0 refers to
the intercept term for the nucleosomal state in the transition model M1, ν10 refers to the intercept term for the NFR
state in the emission model M2. The CIs for parameters significant at a 95% level are given in bold fonts.
Parameter
log(λ)
log(µ)
M0
ν00
ν10
σ0
σ1
τ0
τ1
θl0
θm0
θl1
θm1
θl2
θm2
M1
θl3
θm3
θl4
θm4
θl5
θm5
ν00
ν10
σ0
σ1
τ0
τ1
ν00
ν10
β01
β11
β02
M2
β12
β03
β13
β04
β14
β05
β15
log(λ)
log(µ)
σ0
σ1
τ0
τ1
95% CI
(-3.52,-3.75)
(-3.77,-4.19)
(-.52,-.67)
(.85,.99)
(.25,.30)
(.42,.48)
(.45,.50)
(.72,.79)
(-3.63,-3.75)
(-3.91,-.4.07)
(.-.01,.02)
(.-.05,.05)
(-.82,-.54)
(.78,.80)
(-.04,.05)
(-.08,.07)
(-.03,.02)
(-.05,.04)
(-.02,.02)
(-.03,.03)
(-.54,-.63)
(.88,.95)
(.28,.33)
(.41,.46)
(.42,.51)
(.72,.80)
(-.68,-..72)
(.81,.85)
(-.05,.05)
(-.06,.05)
(-.03,.03)
(.20,.22)
(-.02,.01)
(-.03,.03)
(-.01,.01)
(-.04,.04)
(-.01,.01)
(-.02,.01)
(-3.56,-3.72)
(-3.93-4.15)
(.31,.33)
(.29,.31)
(.40,.44)
(.71,.75)
SE
.5
.38
.08
.07
.02
.03
.02
.03
.05
.05
.01
.04
.68
.1
.03
.07
.02
.04
.02
.03
.05
.03
.02
.02
.04
.04
.02
.02
.05
.05
.02
.05
.01
.02
.01
.04
.01
.01
.08
.09
.01
.01
.02
.01
22
R. M ITRA AND M. G UPTA
Table 2. Cross-tabulation of average correct state classification percentage under the three models.
Simulation model
Base Transition Emission
(M0)
(M1)
(M2)
Base (M0)
Estimation Model Transition (M1)
Emission (M2)
100
96
98
85
88
88
86
80
91
Table 3. Simulation study to compare the performance of discrete-index and continuous-index HMMs under two gap
scenarios. The numbers “1” and “2” in the column “Set” refer to results obtained from data sets simulated under the
FAIRE gap structure and the arbitrary gap structure respectively. “Match” refers to the percentage of probes classified
into their correct state.
1.0
Model Set M SE(λ) M SE(µ) AveM SE (ν00 , ν10 ) AveM SE (σ0 , σ1 ) Match (%)
Continuous 1
.01
.01
.01
.01
99
Discrete 1
.03
.01
.01
.01
85
Continuous 2
.01
.01
.01
.01
99
Discrete 2
.08
.09
.05
.03
68
0.0
0.2
0.4
cdf
0.6
0.8
base
transition
emission
0.0
0.2
0.4
0.6
0.8
1.0
x
Fig. 1. Empirical CDF of the posterior probabilities of probes being classified into the nucleosomal state for the base model (M0)
transition model (M1), and emission model (M2).
Bayesian HMMs for nucleosome positioning prediction
0.2
0.3
0.4
0.5
base
transition
emission
yuan
0.0
0.1
Cumulative Proportion Matched
23
0
20
40
60
80
100
120
Index
Fig. 2. Comparison between Yuan et al. (2005) and FAIRE predictions with the three continuous-index models. On the vertical
axis, the proportion of a genomic region “matched” with the database is given by the ratio
(Number of probes predicted to be in NFR state in region)
Proportion matched =
. A cutoff of 0.5 was used for the conditional
(Total number of probes in all transcriptionally active regions)
probabilities for state classification.
24
R. M ITRA AND M. G UPTA
1.0
ROC curve
0.6
0.4
0.0
0.2
1−specificity
0.8
base
transition
emission
0.0
0.2
0.4
0.6
0.8
1.0
sensitivity
Fig. 3. ROC curves for base model (M0), transition model (M1), and emission model (M2). Each point in the ROC curve corresponded to a particular cutoff of the posterior probabilities under which the classification was done. The cutoff ranges from 0 to 1
with an increment of .01. Sensitivity is given by the ratio of the number of true NFRs found, to the total number of “true” NFRs;
specificity equals the ratio of the number of true NFRs found to the number of predicted NFRs. In terms of the ROC, all three
continuous-index models appear to perform equally well.
Download