Biostatistics (2010), 1, 1, pp. 1–24 doi:10.1093/biostatistics/bio000 A continuous-index Bayesian hidden Markov model for prediction of nucleosome positioning in genomic DNA RITEN MITRA Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, U.S.A. rmitra@bios.unc.edu MAYETRI GUPTA∗ Department of Biostatistics, Boston University, MA 02118, U.S.A. gupta@bu.edu S UMMARY Nucleosomes are units of chromatin structure, consisting of DNA sequence wrapped around proteins called histones. Nucleosomes occur at variable intervals throughout genomic DNA, and prevent transcription factor (TF) binding by blocking TF access to the DNA. A map of nucleosomal locations would enable researchers to detect TF binding sites with greater efficiency. Our objective is to construct an accurate genomic map of nucleosome-free regions (NFRs), based on data from high-throughput genomic tiling arrays in yeast. These high volume data typically have a complex structure, in the form of dependence on neighboring probes as well as underlying DNA sequence, ∗ To whom correspondence should be addressed. c The Author 2010. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org. 2 R. M ITRA AND M. G UPTA variable sized gaps and missing data. We propose a novel continuous index model appropriate for non-equispaced tiling array data, that simultaneously incorporates DNA sequence features relevant to nucleosome formation. Simulation studies and an application to a yeast nucleosomal assay demonstrate the advantages of using the new modeling framework, as well as its robustness to distributional misspecifications. Our results reinforce the previous biological hypothesis that higher order nucleotide combinations are important in distinguishing nucleosomal regions from NFRs. Keywords: data augmentation; chromatin structure; tiling arrays; FAIRE 1. I NTRODUCTION Genomic DNA in the cell nucleus exists as a complex called chromatin, where DNA is folded and compacted through a series of interactions with proteins called histones. At the first level, a stretch of DNA of about 147 base-pairs (bp), is wrapped around an octamer of histones, yielding a “nucleosome”. Nucleosome structure has been determined by X-ray crystallography (Richmond and Davey, 2003). The presence of nucleosomes prevents transcription factors (TFs) from binding to the DNA at those sites (Widom, 1992), resulting in interruption of gene regulatory processes. Thus, a well-defined map of nucleosome positions could be used to improve the predictive power of current TF binding site (TFBS) search algorithms. Most commonly used TFBS or motif search algorithms use the sequence conservation at binding sites as the only criterion for TFBS presence (Liu et al., 1995; Gupta and Liu, 2003). However, in complex genomes, these methods often suffer from poor predictability and high false positive rates. Recent methods have used gene expression measurements through regression-type models (Conlon et al., 2003; Gupta and Ibrahim, 2007) or ChIP-chip (Chromatin Immunoprecipitation) experimental data (Buck and Lieb, 2004) to improve predictions, but weak relationships in the first approach often lead to missing important TFBSs, and the second approach is limited in applicability to a small set of TFs. The connection of nucleosome positioning to TF-binding has motivated some researchers to make use of information Bayesian HMMs for nucleosome positioning prediction 3 from nucleosomal mapping for motif search. Narlikar et al. (2007) have used nucleosome mapping information to create a motif occurrence prior. However, these methods will not work well unless nucleosome positions are based on high-quality predictions from experimental data. Next, we discuss the various experimental assays for nucleosome detection and statistical analysis methods. 1.1 Genomic assays for nucleosome position detection Tiling arrays are microarrays of short overlapping probes, covering a genomic region of interest, used to measure positions of genomic-level events, such as TF binding and histone methylation. In Yuan et al. (2005), based on the nucleosome-free DNA’s susceptibility to micrococcal nuclease (MNase), nucleosomal DNA from yeast was isolated and labeled with green (cy3) fluorescent dye while the total genomic DNA was labeled with red (cy5) dye. These segments were hybridized to microarrays printed with 50-mer oligonucleotide probes tiled at 20 bp overlaps. After pre-processing and normalization, the ratio of green-to-red measurement intensity values indicated how likely that locus was to contain a nucleosome. A more recent experimental technology is Formaldehyde assisted isolation of regulatory elements or FAIRE (Hogan et al., 2006). FAIRE is initiated by fixing whole yeast cells in a growth medium by formaldehyde, harvesting them by centrifugation, sonicating the extracts, labeling the purified DNA and hybridizing them to microarrays. In FAIRE, regions of the genome cross-linked with histones (nucleosomes) are less susceptible to TF binding, and get retained at the organic-aqueous interphase, while the NFRs get enriched. Hogan et al. (2006) used 50-mer probes overlapping every 20bp, to tile yeast chromosome III using four microarrays (3 biological and 1 technical replicate). The measured enrichment throughout the genome, in terms of median logarithmic intensity ratios, represented NFR positioning. In Yuan et al. (2005), the enrichment represents nucleosomes– hence the raw data roughly exhibits a negative correlation with FAIRE enrichment values. Both used identical probes in the same region of the genome and can be used for comparison. However, since FAIRE enriches NFRs, the exact model framework used in Yuan et al. (2005) is not appropriate for FAIRE. 4 R. M ITRA AND M. G UPTA 1.2 Computational approaches for nucleosome detection We now discuss computational methods for detection of nucleosome positioning from genomic assays. Most methods are adapted for MNase digestion data (Yuan et al., 2005). In Yuan et al. (2005), a hidden Markov model (HMM) was implemented for classifying the genomic regions into nucleosomal and non-nucleosomal regions; a fully Bayesian approach was developed by Gupta (2007). High-throughput tiling array data has complications such as amplification bias, inaccurate reads, and varying degrees of non-random noise and measurement error. Relying solely on signal intensities for classification could propagate errors if the nucleosome predictions are used as input data for motif discovery. Recent work has linked DNA sequence features with nucleosome formation (Sekinger et al., 2005). Segal et al. (2006) constructed a nucleosomal map of yeast from the distribution of dinucleotides in nucleosomal DNA. However, neglecting NFR sequence information weakened their classification approach. Other methods include Peckham et al. (2007), who used support vector machines; and Lee et al. (2007) who used Lasso regression on sequence features. Yuan and Liu (2008) used wavelet decomposition to represent the sequence signal as covariates in a logistic model. However, in all these approaches, restricting the sequence features to dinucleotides is a potential reason for high misclassification rates, in spite of clear biological evidence that sequence composition strongly affects nucleosome positioning (Sekinger et al., 2005). Also, sequence features were not used in prediction– the relationship was explored only by obtaining nucleosome predictions from intensity data and inferring potential sequence influences. In FAIRE, employing an HMM with equal probe spacings would misspecify the model and give erroneous state predictions due to gaps and missing probes. In ChIP-chip assays to detect TFBSs, missing probes are not as serious a problem as TF-bound enriched probes constitute a tiny fraction of the total. The conformability of DNA structure has been shown to depend on a large set of nucleotide combinations (Sekinger et al., 2005). Here we propose a novel model framework for FAIRE data extending HMMs in two directions: (i) an underlying continuous-index Markov Bayesian HMMs for nucleosome positioning prediction 5 process allowing gaps and missing probes, and (ii) incorporation of sequence features, not limited to dinucleotides, into nucleosomal state prediction, through dimension reduction techniques. This unified model uses signal intensities and sequence features to predict nucleosome locations. Continuous-index Markov processes have been applied to genomic problems such as copy-number variation detection (Stjernqvist et al., 2007) and genetic linkage analysis (Lander and Green, 1987), but the present problem presents some unique characteristics. In Sections 2 and 3, we describe the new model framework and estimation. In Section 4 we show that including sequence features of a higher order leads to superior predictive power, through real and simulated data applications. 2. M ODEL FRAMEWORK Next, we propose a model to extract positional estimates of NFRs from FAIRE enrichment data (Hogan et al., 2006). Although our model is developed with the FAIRE data structure in mind, it is a general model which can be used for a variety of high-throughput microarray settings. Also, from a biological standpoint, since the FAIRE technology is simpler to use than the MNase procedure (Yuan et al., 2005), it is likely to be useful in nucleosomal studies for a wide variety of organisms. 2.1 Data structure Due to experimental and imaging errors, the data often has missing probe measurements, with the intervals between measurements varying in length. For each fully observed probe, the following data are present: (i) Yij = log-ratio of the intensity of the replicate j of probe i ; (ii) ti = Nucleotide index of the center of probe i; and (iii) X i = d-dimensional vector of covariate measurements for probe i; for i = 1, . . . , P (probes), and j = 1, . . . , R (replicates). We define ta,b = tb − ta as the gap between the centers of probes indexed a and b, which is the genomic distance between two probes (measured at a base pair scale). If all probes are of equal length and equispaced (which is typically the case), the gaps are solely due to the existence of missing probes, leading to t(a, b) being a constant multiple of the number of missing probes in between. In this case, it is equivalent to work with the metric of the number of missing probes as a distance measure. 6 R. M ITRA AND M. G UPTA 2.2 Continuous index hidden Markov process Due to missing probes, the underlying nucleosome occupancy status can be considered an unobserved stochastic process evolving over a continuous index, denoted by [Z(t), t > 0]. We assume Z(t) to be a two-state Markov process; Z(t) = 1 (0) represents an NFR (non-NFR) state, i.e., P [Z(ti )|Z(ti−1 ), . . . , Z(t1 ), Y i−1 , . . . , Y 1 ] = P [Z(ti )|Z(ti−1 )] = Pzi−1 zi (ti − ti−1 ), (2.1) where Pzi−1 zi (·) denotes the transition probability that state zi was occupied at index ti given that the process was in state zi−1 at index ti−1 . First, let us assume that the transition probabilities of Z are stationary. Also, conditional on the state of Z at index ti , let the observed measurement Y i be independent of the hidden process and all measurements prior to index ti , that is, P [Y i |Z(ti ), . . . , Z(t1 ), Y i−1 , . . . , Y 1 ] = P [Y i |Z(ti )] = f (Y i |zi ). (2.2) Expressions (2.1) and (2.2) constitute a continuous index hidden Markov process. For notational simplicity, we shall henceforth refer to Z(ti ) as simply Zi . Next, assume that the probability of staying in a state is linear with respect to index for an infinitesimal interval. The two-state hidden process can be parameterized by the transition rate from a nucleosomal state to a NFR (λ), and the λ . reverse transition rate (µ). The matrix of instantaneous transition rates is given by Q = −λ µ −µ The matrix of transition probabilities P (t) over an interval t is generated by the matrix exponential of Q, that is, P (t) = exp(Qt). Indexing the nucleosomal (non-NFR) state by 0 and NFR state by 1, we then have the following transition probabilities: P00 [t] = µ λ −(λ+µ)t + e , λ+µ λ+µ P11 [t] = λ µ −(λ+µ)t + e . λ+µ λ+µ (2.3) For sampling parameters under the MCMC framework, a parameterization in terms of the logintensities is useful, i.e. we take log λ = θl0 and log µ = θm0 . The initial distributions of the hidden states can be taken to be uniform, that is, π = (π0 , π1 ) = (.5, .5). Bayesian HMMs for nucleosome positioning prediction 7 We next specify the emission densities. Since the mean intensity may vary between probe replicates in the same state, we implemented a hierarchical Gaussian model with separate means for each probe, centered around a state-specific mean with a state-specific variance: Yij |Zi = k ∼ N (νki , σk2 ); and νki ∼ N (νk0 , τk2 ), (2.4) where k ∈ {0, 1}. A more general form of the model would involve probe-specific variances. Although it is not possible to estimate the variances belonging to probes in different states (since the states are unobserved), we found that (i) variances within probes were almost uniform throughout the data set, (the variance of probe-specific variances was less than 0.01), and (ii) for regions in the FAIRE data corresponding to strongly predicted nucleosomes or NFRs in the Yuan et al. (2005) data, the variability of probe variances was very small. So we settled for a less complex model which accounted for diversity among probes only in terms of the mean structure. 2.3 Adding covariate effects to the model Underlying characteristics of the DNA sequence that affect chromatin rigidity may affect positioning of nucleosomes. In previous studies, the prevalence of polynucleotide sequences such as poly-A (repetitions of “A” ), poly-T have been seen to affect nucleosome positioning. Hence sequence features may affect the correct prediction of nucleosomal state. We use three levels of models, relating the covariates to the transition rates, or emission probabilities, via link functions. • Model M0: the original or “base” model (Eqns. 2.3 and 2.4 ) assuming that the covariates do not affect either the state transitions or emissions. • Model M1 (“transition model”): Here a multiplicative intensity model associates the covariates to the transition rates, assuming that the transition rate during the interval between two points depends on the value of the covariates at the end of the interval, i.e. the last probe. Let Xij denote covariate j (j = 1, . . . , d) for probe i. The transition rates over the interval [ti−1 , ti ] are: log λi (X i ) = θl0 + d X j=1 θlj Xij ; log µi (X i ) = θm0 + d X j=1 θmj Xij . (2.5) 8 R. M ITRA AND M. G UPTA In the later sections, to simplify notation, we shall denote λi (X i ) and µi (X i ) as λi and µi . • Model M2 (“emission model”): Here the probability distribution of the observations conditional on the hidden state can be affected by the covariate measurements. We assume that the statespecific probe measurement mean can be modeled as a linear function of the covariates, i.e., d X 2 νki ∼ N νk0 + βkj Xij , τk . (2.6) j=1 • Model M3 (“full model”): Here we assume both the transition intensities and state emission probabilities are affected by the covariates, that is, both expressions (2.5) and (2.6) hold. None of these models lead to a closed-form analytical expression for the likelihood due to the hidden state variable Z, although this can be computed numerically through recursive techniques, as discussed in Section 3. Before the model estimation procedure, we briefly discuss identifiability in the proposed models, as it relates to the validity of inference under minimally informative priors. 2.4 Identifiability of the nucleosomal prediction models Non-identifiability of a model implies that more than one set of parameter values lead to the same likelihood. In a Bayesian framework, using an informative prior usually ensures identifiability. When using flat priors to minimize prior effects, non-identifiability can lead to serious convergence problems in Markov chain Monte Carlo (MCMC) and bias inference. To establish identifiability for our proposed models, we extend the results of Teicher (1967), after stating the following results. Definition. Let fφ (y) be a parametric family of densities of Y with respect to a common dominating measure µ and parameter φ in some set Φ. If π is a probability measure on Φ, then the density R fπ (y) = Φ fφ (y)π(dφ) is called a mixture density. We say that the class of (all) mixtures of fφ is identifiable if fπ = fπ′ , µ − a.e. iff π = π ′ . Further, we say that the class of finite mixtures of fφ is identifiable if for all measures π and π ′ with finite support, fπ = fπ′ µ a.e iff π = π ′ . Proposition 1 (Teicher, 1967). The class of joint finite normal mixtures is identifiable. Proposition 2 (Teicher, 1967). Assume that the class of finite mixtures of the family fφ of densi- Bayesian HMMs for nucleosome positioning prediction 9 ties of Y with parameter φ ∈ Φ is identifiable. Then the class of finite mixtures of n-fold product (n) densities fφ (y) = fφ1 (y1 ) . . . fφn (yn ) with φ = (φ1 , . . . , φn ) ∈ Φn is identifiable. Proposition 2 was proved by induction on n (Teicher, 1967). Since any HMM is a finite mixture of n-fold product densities, where the the weights of the mixture are functions of the transition probabilities, we applied the above results to prove the identifiability of models M0-M3. Theorem 1 Models M0, M1, M2 and M3 are identifiable. In other words, if η denotes the total set of all parameters in any of the four models, and L(η; y) denotes the likelihood of η, there does not exist any η ′ 6= η such that L(η; y) = L(η ′ ; y) for all y. Details for the proof of Theorem 1 are provided in the online Supplementary Materials. 3. M ODEL FITTING AND ESTIMATION PROCEDURE Let η = (θ, β, ν, σ 2 ) denote the set of all parameters. (For model M0, θ = (λ, µ), ν = (ν00 , ν10 ), and for all models except M3, at least one parameter becomes void.) The likelihood for N probes, conditional on the covariates X = (X 1 , . . . , X N ) and locations T = (t1 , . . . , tN ), is L(η)=P (Y 1 , . . . , Y N |X, T , η) = [π0 P (Y 1 |z1 = 0)+π1 P (Y 1 |z1 = 1)] N X Y Pzi−1 ,zi |xi (ti −ti−1 )f (Y i |X i , zi ) (3.1) z1 ,...,zN i=2 Direct evaluation of (3.1) would involve summing over all possible sequences of hidden states z1 , . . . , zN . Under the Markovian assumption, this can be computed recursively by a forward sum. Expectation-maximization or EM (Dempster et al., 1977) is often used to estimate HMM parameters. However, due to the complex transition probability expressions here (the complete data likelihood consists of terms arising from the transition equations (2.3)), no closed form Mstep exists, necessitating computationally intensive numerical optimization. Although analytically tractable EM steps exist for some continuous time HMMs (Roberts and Ephraim, 2008), we chose to avoid this due to model complexity and the tendency of EM to converge to local optima. A Bayesian MCMC sampling-based approach is feasible due to the log-concave forms of many of 10 R. M ITRA AND M. G UPTA the conditional distributions. MCMC also allows us to obtain the full posterior probability landscape of the parameters rather than a single maximum likelihood-based point estimate. Next, we discuss prior elicitation and the recursion-based MCMC algorithm. 3.1 Prior elicitation and sensitivity analysis It can be deduced from the likelihood that the conditional posteriors assume proper densities with a flat prior. The emission densities are normal, so marginalizing the joint density along the axes of emission parameters is equivalent to integrating out normal densities, eventually yielding a weighted summation of finite terms. The transition rate expressions are a finite summation of terms of the type f (a) exp(−g(a)) where f is a function bounded by [0, 1] and g is a positive scalar multiple in model M0 and a monotonic function in M1. For all such functions, the integral is finite, leading to a finite summation of finite terms. However in two cases, if all probes are classified into either the 0 or 1 state, integrating out with a flat prior would yield infinity, leading to an improper posterior where Gibbs sampling may fail. This can be avoided through a constrained hidden state model with the state space excluding these two cases, yielding a proper posterior. When sampling from the constrained model, the conditional probabilities of the hidden states (excluding the all-0 or all-1 case) would differ from the true conditional probabilities (including the extreme states) by a negligible margin, because the probabilities of these two conditions is infinitesimally low. In our genomic data application, the log intensities are so distributed that the sampler can almost never assign zeros to all the positions simultaneously (except when we initialize one transition rate to be unrealistically small so that the corresponding transition probability is 0). In practical implementation, our sampling scheme can reach one of these two states with a low probability– if so, we can ignore the sample and re-sample. This is a valid step under the constrained sampling setup– using a Gibbs sampler as a proposal for a Metropolis-Hastings step that rejects the proposed move if the state sequence is constant. Based on the above, we assumed non-informative (uniform) priors for the logarithm of the tran- Bayesian HMMs for nucleosome positioning prediction 11 sition rates and the emission mean and variance (on the logarithmic scale) parameters of our model. We also fit the models assuming a minimally informative normal prior (with a variance of 5) on the log-transition rates. Then we sequentially reduced the variance by .25 units, to assess the change. In the first case, there was no identifiable difference in the results compared to the flat prior. When the prior variance was reduced beyond 1, the log transition rates were pushed towards 0. Hence we chose the emission and transition parameter priors to be minimally informative, specifically: p(log(λ)) ∝ 1; p(log(µ)) ∝ 1; p(νk0 ) ∝ 1 (k = 0, 1); p(σk2 ) ∝ σk−2 (k = 0, 1); p(τk2 ) ∝ τk−2 ; p(θmj ) ∝ 1, p(θlj ) ∝ 1 (j = 1, . . . , d); and p(βkj ) ∝ 1 (k = 0, 1; j = 1, . . . , d). 3.2 The MCMC sampling algorithm Let us denote Pjk (ti+1 |ti ) as the transition probability from state j to state k, for adjacent probes at positions ti+1 and ti , which is implicitly dependent on the transition rates λ and µ. The following steps constitute one iteration of the MCMC sampler. 1. Use the forward algorithm (Rabiner, 1989) to recursively calculate the likelihood. This algorithm iteratively computes the likelihood of the data till a step in the sequence, with the last step being in a particular state. The first application of this algorithm for HMMs was by Baum et al. (1970) and later extended for a variety of models (Rabiner, 1989). Let Fji denote the “partial” likelihood until position i of the sequence, where position i is in state j, i.e., Fji = P (Zi = j, Y1 , . . . , Yi |X1 , . . . , Xi ). In our model, this is equivalent to: Fji = P Qi l=2 Pzl−1 ,zl |xl (tl |tl−1 )f (Y l |X l , zl , η), j = 0, 1 representing the nucleoz1 ,...,zi−1 P (Y 1 |z1 ) somal and NFR states. The recursive procedure to calculate the likelihood is given by Fji=[F0i−1 P0j (ti |ti−1 ) + F1i−1 P1j (ti |ti−1 )]f (Y i |X i , Zi = j, η), i = 2, . . . , N . (3.2) The initial conditions are: Fj1 = πj f (Y 1 |X 1 , Z1 = j, η), (j = 0, 1), and the full likelihood of the entire sequence is given by F0N + F1N . 2. Next, we employ a backward sampling procedure (Chib, 1996; Scott, 2002) to get a sample of 12 R. M ITRA AND M. G UPTA the hidden states. Conditional on the sampled states at positions ti+1 , . . . , tN , the probability that position i is in state j is P (Zi = j|Zi+1 , . . . , ZN , Y , X, η) ∝ Fji Pj,Zi+1 (ti+1 |ti ). The probability that the last position is in state j is proportional to FjN . In application, the above procedures are reformulated in terms of logarithms to avoid computational underflow. 3. Conditional on the other parameters, hidden states, and data, update the transition parameters using a Metropolis-Hastings procedure. For models M0 and M2, this is done directly for θl0 and θm0 , while for models M1 and M3 this is done for each component in turn. 4. Conditional on the other parameters, hidden state path, and observed data, update the emission parameters νk0 (k = 0, 1) for M0 and M1 and additionally, β k , for M2 and M3. Details of the MCMC steps are provided in the online Supplementary Materials. 4. A PPLICATION TO YEAST NUCLEOSOME ARRAY DATA The estimation procedure for the base, transition and emission models was applied to the FAIRE data set from Hogan et al. (2006), that comprised a total of 13947 probes. The data structure consists of two elements: • Signal: This gives the log of the hybridization ratios obtained from the microarrays. For each probe we have 3 replicates each giving a measure of the intensity data at the particular probe. The probes are evenly distributed (i.e. with constant spacing) in the data set. The higher the intensity, the greater the chance for the probe to be in a nucleosome free region. The data was preprocessed by a z-score standardization to remove skewness (Hogan et al., 2006). The missingness pattern in the probes was random, and the maximum length of a missing block was 13. The non-missing signal values numbered 12760, about 88% of the total data set. • Sequence: Each probe was of length 50 nucleotides, with an overlap of 20 nucleotides with the adjacent probe. As covariates which may potentially influence the nucleosomal signal, we first extracted an initial feature set consisting of counts of all oligomers upto length 5. For example, a measurement of a covariate corresponding to the dinucleotide TA would be the Bayesian HMMs for nucleosome positioning prediction 13 number of occurrences of this dinucleotide throughout the length of the probe. Each probe thus corresponds to a point in a 1364-dimensional Euclidean space, the coordinate for each probe given by the corresponding oligomer word counts. These covariates, especially for longer size oligomers, have very sparse counts, and tend to be highly correlated with counts of words which are smaller segments. To avoid collinearity and also reduce dimensionality of the covariate space, we performed principal components analysis (PCA) on these 1364 covariates for all the probes in our data set. The first five principal components (PCs) explained about 95% of the variability in the covariate space, and were retained as the final collection of model covariates. 4.1 Model-fitting using three models Next, we compared the performance of the models by applying them to the nucleosomal array data. All analyses were run in the statistical software R (http://www.r-project.org/). The MCMC algorithm was initialized at values from a N(0,1) distribution for log(λ), log(µ), θ, β, ν00 and ν10 , while the variance parameters were initialized to 1. Multiple starting points made no significant difference in the results. Autocorrelation plots indicated that convergence was attained for all models within 1000-5000 iterations, which was taken as the burn-in period. 10000 iterations (after burn-in) were used for inference. We present a subset of the analyses that indicate most strongly the power of the new approach; more details are presented in the online Supplement. Table 1 shows that only the second PC was significant at a 5% level, in models M1 and M2. For M3, MCMC convergence was very slow, hence we did not use this model further. In M0, the λ and µ estimates imply that if a particular probe was covered by an NFR, the probability of remaining in the state is very high (0.98). This supports the hypothesis that nucleosomes are separated by long NFRs. The predicted nucleosomal states also appear to be long; which is natural as the low resolution of the data leads to detecting enrichment of long nucleosome-free regulatory regions, but not of short nucleosome-free linker regions (< 10 bp) between adjacent nucleosomes. In M1, the second PC was significant in both nucleosomal and NFR categories. 14 R. M ITRA AND M. G UPTA The opposing signs of the estimates imply that the AT oligomers associated with this covariate help in continuation of the NFR subsequences and are detrimental to nucleosome formation. Computing the weights of the dinucleotide counts for this PC show that oligomers that are important in differentiating between nucleosomal and NFR regions are: A,T, AA, TT, AT,TA, AAT,ATA,ATT,TTA,TAT,TAA,TTTTA,TTTAT (all have weights > .1: AT and TA were highest; tri-nucleotides were intermediate). In M2, the second PC was significant only in the NFR state. We re-fitted models M1 and M2 using the 14 oligomer counts associated with the second PC (Tables 3 and 4 in the online Supplement). AA and TT were the most significant variables for the NFR state in M2, while none of the nucleosome parameters were significant. This suggests that the intensity data within the nucleosomes does not depend on local DNA sequence– however, in NFRs, the intensity mean is a variable function of the the AA and TT dinucleotide counts. We observed that the effect of the other oligomer combinations, which were a part of the second PC, vanish when we re-fit the model with oligomer counts. This implies that the effect of the second PC was borne primarily by the counts of these two dinucleotide categories. These results reiterate the major importance of the AA and TT dinucleotides in influencing the state lengths. Here, as in the original M1, these two features were important for both nucleosomal and the NFR states. Figure 1 shows the empirical CDF of the posterior probabilities of probes being classified into the nucleosomal state in the three models. The flatness of the center, with sharp rises at the two ends, demonstrate that most probes have a very high or low chance of being classified as an NFR, and are robust to the actual cutoff. The qq-plots of the nucleosomal and NFR state (Supplementary Figure 1) demonstrate no significant departures from the fitted model. Model comparison. An important question is which model is most appropriate for the data. Computing the Bayes factor analytically would be difficult due to model complexity. A simpler alternative (approximation) is the Bayesian Information Criterion (BIC). It is straightforward to compute the BIC using BIC = −2 log(L̂) + k log(n), where L̂ is the modal posterior likelihood, Bayesian HMMs for nucleosome positioning prediction 15 k is the number of parameters, and n is the number of data points. The BIC for the three models are 17154.27 (M0), 14386.92 (M1) and 13337.19 (M2), showing that incorporation of sequence features is an essential part in determining the structural classification. Model M2 gave the best fit, indicating that local sequence features indeed influence the measured intensities in nucleosomal regions and NFRs; however, the sequence does not exhibit as strong an effect in determining the lengths of the state of neighboring regions. 4.2 Biological validation with known NFR regions For biological validation, we extracted a set of yeast regulatory regions from the UCSC database (http://genome.ucsc.edu/). Although the set of all validated NFRs is not available, the presence of active TFBSs is a useful indicator that the region is likely to be nucleosome-free. We compared the overlap of predicted NFRs with known TFBS locations (Figure 2, and Table 12 in the online Supplement). The continuous-index models perform excellently compared with the Yuan et al. (2005) method: M0 and M1 have a >90% correct classification rate, M2 has 70%. Since we do not have a corresponding database of nucleosome regions, the false positive rate is underestimated. This may be why M2 has a lower sensitivity than M0 in this comparison, although it fits the data better. Figure 3 shows the Receiver Operator Characteristic (ROC) for all 3 models, using the UCSC genome database as our reference, assuming the known TFBSs to be the only NFRs. The ROCs show that all models perform equivalently well. Figure 3 shows that M1 has the highest specificity in prediction. M1 allows the sequence information to locally influence the length of the nucleosomal and non-nucleosomal states. The biological significance of this mechanism needs to be explored further to elucidate the relationship between nucleosomes and sequence signals. 5. S IMULATION STUDIES We next performed simulation studies to determine the (i) consistency of model estimates under different parameter settings, (ii) detection of the correct model under model misspecification and (iii) importance of the continuous index assumption in modeling gapped tiling array data. 16 R. M ITRA AND M. G UPTA 5.1 Consistency of model parameter estimation Data sets were generated first under models M0, M1 and M2. Ten data sets of size 5000 probes were generated for each parameter setting for each model, which were fit using MCMC (Section 3). In each case, the estimated bias and MSE showed consistent estimation of the model parameters. Here we describe one study in detail. For M0, the emission distributions were assumed to be normal with means -0.4 and 0.4, λ and µ were set to be -3, and emission variances were set to be 0.35. For M1 and M2, the intercept parameter for both states were set as 1. Each regression coefficient (θ for M1, and β for M2) for the 5 covariates was fixed either as 1 or 0, for both nucleosomal and NFR states. Thus a typical β (or θ) vector would be, for example, (1, 0, 1, 0, 0, 0) for the nucleosomal and (1, 0, 0, 0, 0, 0) for the NFR state. For all 3 models, 2500 iterations were run for each parameter setting, with a burn-in of 500. For evaluating misclassification rates, probes having an average state membership probability greater than 0.5 were rounded off to 1, while others were set at 0. For all data sets under model M0, the maximum MSE of the emission (transition) parameters were less than 0.0001 (0.03), and the maximum misclassification rate was 0, indicating the model was fit accurately in each case. For M1, the maximum MSE for the corresponding parameters was also 0.03, with the maximum misclassification rate being 0.1; the MSE for θ ranged between 0.05 and 0.09. For M2, the maximum MSE for emission and transition parameters was slightly higher (0.08 for µ), the maximum misclassification rate was 0.1, and the range of MSE for β was .001 to .004. 5.2 Cross comparison under different models Next, we simulated data under each model, and estimated parameters and states from all models, in order to judge whether in each case the correct model was most accurate. Five data sets were simulated under each model. The variances were fixed to 0.48 and 0.64 for the nucleosomal and NFR states. For M0, the transition rates were -3 and -3. For models M1 and M2, the coefficient for the first PC in the NFR state was fixed to 1, while the others were set to 0. The simulated data sets were of size 5000; the sequence features of the first 5000 probes of the FAIRE data set were Bayesian HMMs for nucleosome positioning prediction 17 used as covariates. When the simulation and estimation models matched, the method performed very accurately, with the maximum MSE of the emission parameters µ and σ being less than 0.01, and for λ and µ, less than 0.05; and the averaged misclassification rates also the lowest (Table 2). In each case, the BIC chose the correct model. Also for a simulation model, say MA , when we estimated it by a model which contains it, say MB , we obtained a similar value of the maximal log likelihood. In an ideal scenario, the extra components in the bigger model would turn out to be 0, and we would get the same likelihood– however, this is typically not achieved in most data sets, necessitating the use of criteria like the BIC. However, in our case, we see that the log likelihood values are very close (the error margin is less than .01) and the expected log-likelihood, obtained from the averages of the MCMC iterations are equal. One possible reason is the large size of the data set, leading to the errors in estimation being smoothened to a considerable degree, and the parameters equaling the true simulation parameters with high probability. Due to MCMC convergence problems, model M3 could not be tested. However we generated 5 data sets from this model (with similar parameter settings as above), and ran the other three estimation models on these data sets. The percent of correctly classified probes under models M0, M1 and M2 were 81%, 92% and 94% respectively. In 2 of the 5 datasets, model M1 achieved the lowest BIC and MSE, while in the other 3, model M2 was best; model M0 was weakest. 5.3 Importance of the continuous index model In order to show that the continuous index HMM was an essential improvement required to fit the gapped probe data, we simulated a data set of length 5000, where the gap distribution was the same as in the FAIRE data HMM. We also simulated a second data set of 5000 measurements where the gap structure was assigned randomly. Both data sets were analyzed using both discrete and continuous index HMMs (Table 3). Comparing the number of predicted matches with the simulated state set, it is clearly seen that the discrete-index HMM fails to achieve as high a correct classification rate as the continuous-index model, in the gapped data framework. 18 R. M ITRA AND M. G UPTA 6. D ISCUSSION We propose a novel extension of the nucleosomal region prediction method with two major improvements: (i) a continuous-index hidden Markov model to accommodate missing or variablegapped probes; and (ii) using underlying DNA sequence features that influence transition rates and emission densities. We have demonstrated the model advantages on FAIRE data (Hogan et al., 2006), now widely used for prediction of NFRs. Our methods are efficient, given their complexity– the time taken (in seconds) for 1000 MCMC iterations for models M0, M1, and M2 are 15343s, 47007s, and 18401s respectively. We proved identifiability conditions that allow construction of a straightforward MCMC procedure with minimally informative priors. In applications, models M0 and M1 performed similarly, and were efficient at predicting regulatory regions, indicating that a non-homogeneous transition rate is probably most accurate for predicting NFRs. Estimates from models M1 and M2 show that DNA nucleotide combinations play a significant role in determining nucleosome positions: Adenosine and Thymine (A and T) appear most influential. Of the different combinations, the dinucleotides (AT and TA) carry the maximum weight, followed by tri-nucleotides. These results support the biological hypothesis that the presence of A and T-based nucleotide combinations contributes to the rigidity of the DNA, favoring positioning of NFRs (Sekinger et al., 2005). Our models successfully extend the HMM methodology to gapped data sets, opening up the possibility for synergistically modeling multiple complex data sets having varying probe intervals. Recent efforts have focused on combining data from multiple probe replicates into a hidden Markov framework (Johnson et al., 2009), and to develop sophisticated technologies to overcome major problems of ChIP-on-chip technology by performing massive parallel sequencing (Park, 2009). However, such technologies are in preliminary stages of development and critical refinement is necessary before fully realizing their potential and limitations. S UPPLEMENTARY M ATERIALS The on-line Supplementary Materials provide technical appendices and additional simulations. Bayesian HMMs for nucleosome positioning prediction 19 ACKNOWLEDGMENTS This work was partially supported by the NIH (NHGRI) award HG004946 to M. G. The authors are grateful to Jason Lieb and Paul Giresi for many helpful discussions and the FAIRE data, and to two anonymous reviewers whose suggestions made significant improvements to the article. R EFERENCES BAUM , L. E., P ETRIE , T., S OULES , G. AND W EISS , N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of markov chains. Ann. Math. Statist. 41, 164–170. B UCK , M.J. AND L IEB , J.D. (2004). ChIP-chip: considerations for the design, analysis, and application of genome-wide chromatin immunoprecipitation experiments. Genomics 83(3), 349–60. C HIB , S. (1996). Calculating posterior distributions and modal estimates in Markov mixture models. J. Econometrics 75(1), 79–97. C ONLON , E. M., L IU , X. S., L IEB , J. D. AND L IU , J. S. (2003). Integrating regulatory motif discovery and genome-wide expression analysis. Proc. Natl Acad. Sci. USA 100(6), 3339–3344. D EMPSTER , A. P., L AIRD , N. M AND RUBIN , D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc. B 39(1), 1–38. G UPTA , M. (2007). Generalized hierarchical Markov models for the discovery of length-constrained sequence features from genome tiling arrays. Biometrics 63(3), 797–805. G UPTA , M AND I BRAHIM , J G. (2007). Variable selection in regression mixture modeling for the discovery of gene regulatory networks. J. Am. Stat. Assoc. 102(479), 867–880. G UPTA , M. AND L IU , J.S. (2003). Discovery of conserved sequence patterns using a stochastic dictionary model. J.Am.Stat.Assoc 98, 55–56. H OGAN , G. J., L EE , C.-K. AND L IEB , J. D. (2006). Cell cycle-specified fluctuation of nucleosome occupancy at gene promoters. PLoS Genet 2(9), e158. J OHNSON , W., L IU , X. AND L IU , J.S. (2009). Doubly-stochastic continuous-time hidden Markov approach for analyzing genome tiling arrays. Ann. Appl. Stat. 3, 1183–1203. L ANDER , E. S. AND G REEN , P. (1987). Construction of multilocus genetic linkage maps in humans. Proc. Natl. Acad. Sci. U.S.A. 84, 2363–2367. L EE , W., T ILLO , D., B RAY, N., M ORSE , R. H., DAVIS , R. W., H UGHES , T. R. AND N ISLOW, C. (2007). A high-resolution 20 R. M ITRA AND M. G UPTA atlas of nucleosome occupancy in yeast. Nat. Genet. 39, 1235–1244. L IU , J. S., N EUWALD , A. F. AND L AWRENCE , C. E. (1995). Bayesian models for multiple local sequence alignment and Gibbs sampling strategies. J. Am. Stat. Assoc. 90, 1156–1170. NARLIKAR, L., G ORDAN , R. AND H ARTEMINK , A. J. (2007). A nucleosome-guided map of transcription factor binding sites in yeast. PLoS Comput. Biol. 3, e215. PARK , P. J. (2009). ChIP-seq: advantages and challenges of a maturing technology. Nat. Rev. Genet. 10, 669–680. P ECKHAM, H. E., T HURMAN , R. E., F U , Y., S TAMATOYANNOPOULOS , J. A., N OBLE , W. S., S TRUHL , K. AND W ENG , Z. (2007). Nucleosome positioning signals in genomic DNA. Genome Res. 17, 1170–1177. R ABINER , L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77(2), 257–286. R ICHMOND , T. J. AND DAVEY, C. A. (2003). The structure of DNA in the nucleosome core. Nature 423, 145–150. ROBERTS , W.J. AND E PHRAIM , Y. (2008). An EM algorithm for ion-channel current estimation. IEEE Trans. Sig. Proc. 56, 26–33. S COTT, S. L. (2002). Bayesian methods for hidden Markov models. J. Am. Stat. Assoc. 97, 337–351. S EGAL , E., F ONDUFE -M ITTENDORF, Y., C HEN , L., T HASTROM , A., F IELD , Y., M OORE , I. K., WANG , J. P. AND W IDOM , J. (2006). A genomic code for nucleosome positioning. Nature 442, 772–778. S EKINGER, E. A., M OQTADERI , Z. AND S TRUHL , K. (2005). Intrinsic histone-DNA interactions and low nucleosome density are important for preferential accessibility of promoter regions in yeast. Mol. Cell 18, 735–748. S TJERNQVIST, S., RYDEN , T., S KOLD , M. AND S TAAF, J. (2007). Continuous-index hidden Markov modelling of array CGH copy number data. Bioinformatics 23, 1006–1014. T EICHER , H. (1967). Identifiability of mixtures of product measures . Ann. Math. Stat. 38, 1300–1302. W IDOM , J. (1992). A relationship between the helical twist of DNA and the ordered positioning of nucleosomes in all eukaryotic cells. Proc. Natl. Acad. Sci. U.S.A. 89, 1095–1099. Y UAN , G. C. AND L IU , J. S. (2008). Genomic sequence is highly predictive of local nucleosome depletion. PLoS Comput. Biol. 4, e13. Y UAN , G. C., L IU , Y. J., D ION , M. F., S LACK , M. D., W U , L. F., A LTSCHULER , S. J. scale identification of nucleosome positions in S. cerevisiae. Science 309, 626–630. AND R ANDO , O. J. (2005). Genome- Bayesian HMMs for nucleosome positioning prediction 21 Table 1. 95% Credible Intervals (CI) for parameter estimates for models M0 (base), M1 (transition), and M2 (emission). The indices l (or 0) and m (or 1) indicate the nucleosomal and NFR states; for instance, the term θl0 refers to the intercept term for the nucleosomal state in the transition model M1, ν10 refers to the intercept term for the NFR state in the emission model M2. The CIs for parameters significant at a 95% level are given in bold fonts. Parameter log(λ) log(µ) M0 ν00 ν10 σ0 σ1 τ0 τ1 θl0 θm0 θl1 θm1 θl2 θm2 M1 θl3 θm3 θl4 θm4 θl5 θm5 ν00 ν10 σ0 σ1 τ0 τ1 ν00 ν10 β01 β11 β02 M2 β12 β03 β13 β04 β14 β05 β15 log(λ) log(µ) σ0 σ1 τ0 τ1 95% CI (-3.52,-3.75) (-3.77,-4.19) (-.52,-.67) (.85,.99) (.25,.30) (.42,.48) (.45,.50) (.72,.79) (-3.63,-3.75) (-3.91,-.4.07) (.-.01,.02) (.-.05,.05) (-.82,-.54) (.78,.80) (-.04,.05) (-.08,.07) (-.03,.02) (-.05,.04) (-.02,.02) (-.03,.03) (-.54,-.63) (.88,.95) (.28,.33) (.41,.46) (.42,.51) (.72,.80) (-.68,-..72) (.81,.85) (-.05,.05) (-.06,.05) (-.03,.03) (.20,.22) (-.02,.01) (-.03,.03) (-.01,.01) (-.04,.04) (-.01,.01) (-.02,.01) (-3.56,-3.72) (-3.93-4.15) (.31,.33) (.29,.31) (.40,.44) (.71,.75) SE .5 .38 .08 .07 .02 .03 .02 .03 .05 .05 .01 .04 .68 .1 .03 .07 .02 .04 .02 .03 .05 .03 .02 .02 .04 .04 .02 .02 .05 .05 .02 .05 .01 .02 .01 .04 .01 .01 .08 .09 .01 .01 .02 .01 22 R. M ITRA AND M. G UPTA Table 2. Cross-tabulation of average correct state classification percentage under the three models. Simulation model Base Transition Emission (M0) (M1) (M2) Base (M0) Estimation Model Transition (M1) Emission (M2) 100 96 98 85 88 88 86 80 91 Table 3. Simulation study to compare the performance of discrete-index and continuous-index HMMs under two gap scenarios. The numbers “1” and “2” in the column “Set” refer to results obtained from data sets simulated under the FAIRE gap structure and the arbitrary gap structure respectively. “Match” refers to the percentage of probes classified into their correct state. 1.0 Model Set M SE(λ) M SE(µ) AveM SE (ν00 , ν10 ) AveM SE (σ0 , σ1 ) Match (%) Continuous 1 .01 .01 .01 .01 99 Discrete 1 .03 .01 .01 .01 85 Continuous 2 .01 .01 .01 .01 99 Discrete 2 .08 .09 .05 .03 68 0.0 0.2 0.4 cdf 0.6 0.8 base transition emission 0.0 0.2 0.4 0.6 0.8 1.0 x Fig. 1. Empirical CDF of the posterior probabilities of probes being classified into the nucleosomal state for the base model (M0) transition model (M1), and emission model (M2). Bayesian HMMs for nucleosome positioning prediction 0.2 0.3 0.4 0.5 base transition emission yuan 0.0 0.1 Cumulative Proportion Matched 23 0 20 40 60 80 100 120 Index Fig. 2. Comparison between Yuan et al. (2005) and FAIRE predictions with the three continuous-index models. On the vertical axis, the proportion of a genomic region “matched” with the database is given by the ratio (Number of probes predicted to be in NFR state in region) Proportion matched = . A cutoff of 0.5 was used for the conditional (Total number of probes in all transcriptionally active regions) probabilities for state classification. 24 R. M ITRA AND M. G UPTA 1.0 ROC curve 0.6 0.4 0.0 0.2 1−specificity 0.8 base transition emission 0.0 0.2 0.4 0.6 0.8 1.0 sensitivity Fig. 3. ROC curves for base model (M0), transition model (M1), and emission model (M2). Each point in the ROC curve corresponded to a particular cutoff of the posterior probabilities under which the classification was done. The cutoff ranges from 0 to 1 with an increment of .01. Sensitivity is given by the ratio of the number of true NFRs found, to the total number of “true” NFRs; specificity equals the ratio of the number of true NFRs found to the number of predicted NFRs. In terms of the ROC, all three continuous-index models appear to perform equally well.