Determination of DNA sequence features from genome tiling arrays Mayetri Gupta University of North Carolina at Chapel Hill ENAR 2006 gupta@bios.unc.edu Outline of talk Biological problem: inferring gene regulation Whole genome tiling arrays for genomic feature detection Hierarchical generalized HMM framework Recursive data augmentation Application to yeast arrays ENAR 2006 gupta@bios.unc.edu Upstream regulation ↔ downstream expression Fundamental question: how can we understand the biological mechanisms leading to disease? ...gtggtTAGAATagcgactgttttt... gene 1 ...taggTATAATacagtctgacaaaa... gene 2 ...cagcaacattgaTATAATtgccat... gene 3 ...ctaaaacaatTATTATttatcagg... gene 4 |TATAAT| G 1 2 3 4 5 6 bits 0 G 1 CT 2 Co-regulated genes share similar motif patterns ENAR 2006 gupta@bios.unc.edu Mechanisms behind gene expression First step in studying regulatory process is binding of transcription factors (TFs) to binding sites Many transcription factors work in co-ordination to regulate genes ENAR 2006 gupta@bios.unc.edu Transcription regulation: DNA motifs Proteins bind to DNA to activate transcription G 1 2 3 4 5 6 bits 0 |TATAAT| G 1 CT 2 Position specific weight matrix (PSWM) or Motif ENAR 2006 gupta@bios.unc.edu Motif discovery- statistical model Every non-site position multinomial with θ0 = (θ01 , . . . , θ04 ) Every motif position i multinomial with · · · θ0 θ0 θ1 · · · θ 6 θ0 θ0 · · · θi = (θi1 , . . . , θi4 ) EM (Lawrence 1990), Gibbs sampler (Liu 1995), BIOPROSPECTOR (Liu 2001) Dictionary (Bussemaker 2000) Stochastic Dictionary (Gupta, Liu 2003) ENAR 2006 gupta@bios.unc.edu Using auxiliary information in motif discovery Most motif discovery methods typically yield high proportion of false positives in complex genomes Modeling additional genomic characteristics has proved useful to improve motif search Phylogenetic conservation (Wasserman 00; Kellis 03; Li 05) Motif “clustering”: regulatory modules (Thompson 04; Gupta 05) Special structures- palindromes, uni- and bi-modal profiles (Kechris 04; Zhou 04) ENAR 2006 gupta@bios.unc.edu Chromatin structure “Beads on a string” DNA is NOT a linear strand, local chromatin structure may affect binding of proteins ENAR 2006 gupta@bios.unc.edu Nucleosome positioning ↔ TF binding Previous developments assume DNA is a linear strand with equal propensity for binding sites to be located anywhere DNA has been recently proved to be extremely dynamic- nucleosome occupancy varies throughout the genome (Lee et al. Nat. Genet. 2004) Empirical studies suggest TFs more likely to bind to nucleosome free regions (linkers) Unlike other auxiliary data, is potentially a more “universal” mechanism, less dependent on species/TF/gene set being considered ENAR 2006 gupta@bios.unc.edu Genome-wide mapping of nucleosome occupancy High-density DNA microarray that utilizes the differential susceptibility of nucleosomal and linker DNA to micrococcal nuclease Yeast + formaldehyde + MNase genomic DNA ↓ ↓ purify nucleosomal DNA ↓ (Gel electrophoresis) ↓ ↓ ↓ label with Cy3 label with Cy5 & . mix and concentrate, hybridize to microarray (from Yuan et al. 2005) ENAR 2006 gupta@bios.unc.edu Data: Yuan et al. (Science, 2005) Entire sequence fragmented into 50-mers overlapping every 20 bp, 8 replicate measurements for each probe ENAR 2006 gupta@bios.unc.edu Data: Yuan et al. (Science, 2005) Entire sequence fragmented into 50-mers overlapping every 20 bp, 8 replicate measurements for each probe About 146 bp of DNA wrapped around a single nucleosome- corresponds to about 6-8 overlapping probes on microarray ENAR 2006 gupta@bios.unc.edu Data: Yuan et al. (Science, 2005) Entire sequence fragmented into 50-mers overlapping every 20 bp, 8 replicate measurements for each probe About 146 bp of DNA wrapped around a single nucleosome- corresponds to about 6-8 overlapping probes on microarray Measurements derived from a large population of asynchronous cells, so temporally unstable nucleosomes appear “delocalized” (and weaker) than stereotypically positioned ones ENAR 2006 gupta@bios.unc.edu Data: Yuan et al. (Science, 2005) Entire sequence fragmented into 50-mers overlapping every 20 bp, 8 replicate measurements for each probe About 146 bp of DNA wrapped around a single nucleosome- corresponds to about 6-8 overlapping probes on microarray Measurements derived from a large population of asynchronous cells, so temporally unstable nucleosomes appear “delocalized” (and weaker) than stereotypically positioned ones 30 contigs from yeast chromosome III covering about 270 Kbp ENAR 2006 gupta@bios.unc.edu Data: Yuan et al. (Science, 2005) Entire sequence fragmented into 50-mers overlapping every 20 bp, 8 replicate measurements for each probe About 146 bp of DNA wrapped around a single nucleosome- corresponds to about 6-8 overlapping probes on microarray Measurements derived from a large population of asynchronous cells, so temporally unstable nucleosomes appear “delocalized” (and weaker) than stereotypically positioned ones 30 contigs from yeast chromosome III covering about 270 Kbp Data normalized in two steps to account for geographic biases and cross-hybridization ENAR 2006 gupta@bios.unc.edu Hidden Markov model framework Tiling array structure indicates HMM-type model may be desirable ENAR 2006 gupta@bios.unc.edu Hidden Markov model framework Tiling array structure indicates HMM-type model may be desirable Cannot observe the “latent” biological process, only noisy measurements ENAR 2006 gupta@bios.unc.edu Hidden Markov model framework Tiling array structure indicates HMM-type model may be desirable Cannot observe the “latent” biological process, only noisy measurements Spatial dependence between neighboring probes on the array ENAR 2006 gupta@bios.unc.edu Hidden Markov model framework Tiling array structure indicates HMM-type model may be desirable Cannot observe the “latent” biological process, only noisy measurements Spatial dependence between neighboring probes on the array Complications: ENAR 2006 gupta@bios.unc.edu Hidden Markov model framework Tiling array structure indicates HMM-type model may be desirable Cannot observe the “latent” biological process, only noisy measurements Spatial dependence between neighboring probes on the array Complications: biological restrictions on state lengths ENAR 2006 gupta@bios.unc.edu Hidden Markov model framework Tiling array structure indicates HMM-type model may be desirable Cannot observe the “latent” biological process, only noisy measurements Spatial dependence between neighboring probes on the array Complications: biological restrictions on state lengths sequence-specific variability in behavior of probes- median value may be insufficient summary ENAR 2006 gupta@bios.unc.edu Yuan et al (2005) analysis DN1 N1 ... ... DN6 DN7 DN 8 N6 N7 N8 DN 9 L Use a profile Hidden Markov model (PHMM) so that length distributions can vary between states Break up sequence into 40 probe non-overlapping windows Cannot extend PHMM to long sequences, but breaking up sequence arbitrarily can cause loss of information ENAR 2006 gupta@bios.unc.edu Bayesian framework for length-restricted features τ Z1 ... Z2 Y1 Y2 ... ... L2 L1 Yl 1 Y11+1 ... ZN Y1 +l 1 2 ... LN YΣ 1i ... Y Σ 1i +lN Z , L latent random variables, only Y ’s are observed random variables ENAR 2006 gupta@bios.unc.edu Bayesian framework for length-restricted features (1) State duration model Length distribution in state k: (truncated) negative binomial l −1 pk (l) = ck (φk ) (1 − φk )l−rk φrkk , rk − 1 l ∈ Dk = {rk , rk + 1, . . . , sk } Sets Dk based on approximately known physical characteristics ENAR 2006 gupta@bios.unc.edu Bayesian framework for length-restricted features (2) Model for spot measurements Spot measurements: conditional on being in state Zi = k, log-ratios modelled as yi j |Zi = k, µik , σ2ik ∼ N(µik , σ2ik ) Hierarchical framework with 2-level conjugate priors on µik (Normal) and σ2ik (Inv-gamma) ⇒ marginally a non-central t Transition probabilities between states τ = ((τi j )), with all τii = 0. Dirichlet priors on non-zero transitions. ENAR 2006 gupta@bios.unc.edu Model fitting and parameter estimation State length distributions ⇒ computational complexity- Gibbs sampling-based approaches tend to get trapped in local optima ENAR 2006 gupta@bios.unc.edu Model fitting and parameter estimation State length distributions ⇒ computational complexity- Gibbs sampling-based approaches tend to get trapped in local optima Solution: A data augmentation algorithm is formulated with recursive steps to jointly sample state identities and state duration lengths P(Z, L|y) = P(ZN |y)P(LN |ZN , y) . . . P(Lmin |Lmin+1 , Zmin+1 , . . . , ZN , y) ENAR 2006 gupta@bios.unc.edu Model fitting and parameter estimation State length distributions ⇒ computational complexity- Gibbs sampling-based approaches tend to get trapped in local optima Solution: A data augmentation algorithm is formulated with recursive steps to jointly sample state identities and state duration lengths P(Z, L|y) = P(ZN |y)P(LN |ZN , y) . . . P(Lmin |Lmin+1 , Zmin+1 , . . . , ZN , y) Assumes that the length spent in a state and the transition to that state are independent P(l, k|l 0 , k0 ) = pk (l)τk0 k ENAR 2006 gupta@bios.unc.edu Model fitting and parameter estimation State length distributions ⇒ computational complexity- Gibbs sampling-based approaches tend to get trapped in local optima Solution: A data augmentation algorithm is formulated with recursive steps to jointly sample state identities and state duration lengths P(Z, L|y) = P(ZN |y)P(LN |ZN , y) . . . P(Lmin |Lmin+1 , Zmin+1 , . . . , ZN , y) Assumes that the length spent in a state and the transition to that state are independent P(l, k|l 0 , k0 ) = pk (l)τk0 k Posterior estimation of parameters is mostly straightforward. However φk ’s have a non-standard distribution for which we use adaptive rejection Metropolis sampling ENAR 2006 gupta@bios.unc.edu Hierarchical Generalized Hidden Markov Model HGHMM was applied to the longest contiguous mapped region, corresponding to 61 Kbp of yeast chromosome III Data was used from all 8 replicates that had been cleaned and normalized Three states: linker (D1 = 1, 2, . . .) delocalized nucleosome (D2 = {9, . . . , 30}) well-positioned nucleosomes (D3 = {6, 7, 8}) ENAR 2006 gupta@bios.unc.edu ag replacements HGHMM fit Theoretical t Quantiles nucleosome Sample Quantiles delocalized Sample Quantiles Sample Quantiles linker Theoretical t Quantiles Theoretical t Quantiles QQ plots of the three predicted states in nucleosomal array data using a HGHMM with the hierarchical t-distribution ENAR 2006 gupta@bios.unc.edu HGHMM fit Linker Mean µ r/φ Deloc SD Mean Nucleo SD Mean SD -0.833 0.011 0.165 0.005 0.869 0.007 3.969 0.281 12.266 0.245 6.608 0.076 Overall posterior summaries for 3 states for µ: expected probe level measurement r/φ: approximate expected state length ENAR 2006 gupta@bios.unc.edu acements 5 10 20 0.00 0.02 0.04 0.06 0.08 0.10 ρ=5 ρ = 10 ρ = 20 ρ = 30 fraction misclassified 0.00 0.02 0.04 0.06 0.08 0.10 maximum MSE Sensitivity analysis: (1) maximal MSEs and (2) misclassification 30 ρ=5 ρ = 10 ρ = 20 ρ = 30 5 2 30 true ρ fraction misclassified ρ=5 ρ = 20 1 20 0.00 0.02 0.04 0.06 0.08 0.10 0.00 0.02 0.04 0.06 0.08 0.10 maximum MSE true ρ 10 3 ρ=5 ρ = 20 1 2 3 range range Top: Misspecification of inverse gamma hyperparameter ρ. Below: Misspecification of truncated negative binomial range. Range 1: (r2 , s2 ) = (9, 30), (r3 , s3 ) = (6, 8) Range 2: (r2 , s2 ) = (10, 25), (r3 , s3 ) = (5, 9) Range 3: (r2 , s2 ) = (8, 25), (r3 , s3 ) = (5, 7) ENAR 2006 gupta@bios.unc.edu Biological validation of predictions How well do nucleosome-free state predictions correlate with the likely location of TFBSs? ENAR 2006 gupta@bios.unc.edu Biological validation of predictions How well do nucleosome-free state predictions correlate with the likely location of TFBSs? Harbison et al (2004) data: Genomewide location analysis (ChIP-chip) to detemine occupancy of DNA-binding transcription regulators under a variety of conditions ENAR 2006 gupta@bios.unc.edu Biological validation of predictions How well do nucleosome-free state predictions correlate with the likely location of TFBSs? Harbison et al (2004) data: Genomewide location analysis (ChIP-chip) to detemine occupancy of DNA-binding transcription regulators under a variety of conditions First, we took segments for chromosome III corresponding to which binding data were highly significant, and used EMCmodule (Gupta and Liu, PNAS 2005) with a starting set of motif weight matrices from SCPD database to determine motif locations ENAR 2006 gupta@bios.unc.edu uu 2 1 2 1 38600 39000 39400 39800 0 0 29000 iYCL058C 29500 30000 30500 31000 31500 28000 5 HSF1 PHO2 SCB 25000 3 2 1 0 0 1 2 uu 3 4 ABF1 GCN4 HSF1 PHO2 uu 24800 27800 midloci[sel.ind] 5 5 4 3 2 1 0 24600 27600 iYCL064C midloci[sel.ind] ABF1 HSF1 24400 27400 iYCL061C midloci[sel.ind] 24200 27200 4 2 uu 78% of sites in linker or delocalized nucleosomal regions 0 1 Comparison with PHMM 22000 22200 22400 22600 16800 17000 17200 17400 • Nucleosome • Delocalized nucleosome • Linker ENAR 2006 gupta@bios.unc.edu Comparisons based on Harbison et al. data Criteria for state assignment: Estimated posterior state probability b i = k|Y ) = ∑Mj=1 I(Z ( j) = k)/M > 0.8 P(Z i ENAR 2006 gupta@bios.unc.edu Comparisons based on Harbison et al. data Criteria for state assignment: Estimated posterior state probability b i = k|Y ) = ∑Mj=1 I(Z ( j) = k)/M > 0.8 P(Z i Two methods: SDDA (Gupta and Liu, JASA 2003) and BioProspector (Liu et al, PAC SYMP BIOCOMP 2001) were used for de-novo motif discovery in each set of regions ENAR 2006 gupta@bios.unc.edu Comparisons based on Harbison et al. data Criteria for state assignment: Estimated posterior state probability b i = k|Y ) = ∑Mj=1 I(Z ( j) = k)/M > 0.8 P(Z i Two methods: SDDA (Gupta and Liu, JASA 2003) and BioProspector (Liu et al, PAC SYMP BIOCOMP 2001) were used for de-novo motif discovery in each set of regions Motif searches were run separately on regions predicted by (1) HGHMM, and (2) profile HMM (PHMM) ENAR 2006 gupta@bios.unc.edu Comparisons based on Harbison et al. data Criteria for state assignment: Estimated posterior state probability b i = k|Y ) = ∑Mj=1 I(Z ( j) = k)/M > 0.8 P(Z i Two methods: SDDA (Gupta and Liu, JASA 2003) and BioProspector (Liu et al, PAC SYMP BIOCOMP 2001) were used for de-novo motif discovery in each set of regions Motif searches were run separately on regions predicted by (1) HGHMM, and (2) profile HMM (PHMM) Motif predictions were compared by (i) match to known motifs in the SCPD database and (ii) positional overlap with factor-bound regions from the ChIP-chip data ENAR 2006 gupta@bios.unc.edu Specificity and sensitivity of predictions Specificity (Spec): Proportion of regions predicted in a state overlapping with TF-bound regions from Harbison et al Sensitivity (Sens): % of bound regions not missed. Higher proportion of linker regions than nucleosomal overlap with TF-bound regions; HGHMM (with SDDA) has highest sensitivity HGHMM SDDA PHMM BP SDDA BP Spec Sens Spec Sens Spec Sens Spec Sens Linker 0.61 0.7 0.40 0.87 0.38 0.93 0.23 1 Deloc Nucl 0.19 0.8 0.15 0.63 0.16 0.37 0.11 0.33 Nucleosomal 0.16 0.5 0.09 0.43 0.20 0.8 0.15 0.73 ENAR 2006 gupta@bios.unc.edu 0.6 0.4 * *** ** ** * ** * * * * ** ** * *** * ******* **** * *** *** * * * * ** * * * * * ** * * *** *** ** * * 0.0 0.2 pred state 0.8 1.0 Binding site “hotspots” in predicted linker regions 25000 30000 35000 40000 35000 40000 35000 40000 0.6 0.4 0.0 0.2 pred state 0.8 1.0 coord 25000 30000 0.6 0.4 0.0 0.2 pred state 0.8 1.0 coord 25000 30000 Region of chromosome III with posterior probability of states from HGHMM and corresponding predicted motif sites from SDDA, showing certain binding site “hotspots” in predicted linker regions (blue crosses) which are almost completely devoid of well-positioned nucleosomes. ENAR 2006 gupta@bios.unc.edu Motif predictions and matches to known binding TFs HGHMM motif TF match ExptMatch AT TT GA BAS1 ATTTA bits 2 1 A T T TTT T A GAT C AT TT GAA ACA A GAA AG T TTTTC TT T CAAA CA bits 1 0 bits 2 1 0 bits 2 1 C 1 2 3 4 5 6 7 8 9 10 GT A A C T C G C C AT C G A A T T T 1 0 bits 2 1 0 bits 2 1 0 T T T TACA TGG A C A C C 1 2 3 4 5 6 7 8 9 10 G CC C GTT T C GC C AC A T G T 1 2 3 4 5 6 7 8 9 10 bits bits 2 T T AC GG A T C AA 1 2 3 4 5 6 7 8 9 10 0 C A TA AT GC C C A G G A AT 1 2 3 4 5 6 7 8 9 10 1 A AA T T CT A A G C C GGG G G A C 1 2 3 4 5 6 7 8 9 10 2 C A T G AA CA 0 A C G 1 2 3 4 5 6 7 8 9 10 2 T AT T A TG G AC 1 2 3 4 5 6 7 8 9 10 0 Source PHMM state SCPD L FKH1 TTGTTTACC Harbison L GAT1 GATAA Harbison N HSF1 TTCTAGAA Harbison - RAP1 CACCCAnACAn SCPD STE12 TGAAACA - SCPD - SWI6 TTTCG SCPD - LNO2 ATTTCA Harbison - ExptMatch: match in SCPD database States: L, D, N: linker, delocalized nucleosomal, or nucleosomal ENAR 2006 gupta@bios.unc.edu Comparison based on Harbison data: summary In linker regions, 44.8% of found sites overlapped with “significantly bound” (p < 0.005) regions which had sequences conserved in at least one other species of yeast In nucleosomal regions, only 23% of sites overlapped Linker regions are significantly enriched for poly-‘A’ repeats: “AA-rich” segments are known to increase rigidity of DNA and prevent bending Motif-finder BP misses some sites, possibly due to sensitivity to repeat elements ENAR 2006 gupta@bios.unc.edu Using nucleosome occupancy for motif prediction Experiments are expensive- sequence data is available Do intrinsic sequence factors influence nucleosome occupancy? Controversial- but some recent experiments indicate that sequence plays a role (e.g. Sekinger et al., 2005) Two hypotheses: (i) General features of local sequence composition and/or (ii) positioning “elements” ENAR 2006 gupta@bios.unc.edu Hierarchical Generalized HMM: Summary Hierarchical model robust to various sources of probe variability and measurement error State length model gives principled way of modeling state lengths avoiding arbitrary splitting of a long sequence Data augmentation method adapts recursive likelihood computation techniques for controlling computational complexity issues Length-restricted HMM observed to have stable estimates, avoids label-switching problems typical to HMMs. ENAR 2006 gupta@bios.unc.edu Conclusions and some ongoing work Motif discovery tools can be improved by including auxiliary biological knowledge, such as combinatorial regulation, phylogenetic conservation, chromatin structure ENAR 2006 gupta@bios.unc.edu Conclusions and some ongoing work Motif discovery tools can be improved by including auxiliary biological knowledge, such as combinatorial regulation, phylogenetic conservation, chromatin structure Development of high-resolution experimental technology continues to provide huge amounts of valuable data ENAR 2006 gupta@bios.unc.edu Conclusions and some ongoing work Motif discovery tools can be improved by including auxiliary biological knowledge, such as combinatorial regulation, phylogenetic conservation, chromatin structure Development of high-resolution experimental technology continues to provide huge amounts of valuable data Determining underlying sequence-based factors that determine (some) biological characteristics may be feasible ENAR 2006 gupta@bios.unc.edu Conclusions and some ongoing work Motif discovery tools can be improved by including auxiliary biological knowledge, such as combinatorial regulation, phylogenetic conservation, chromatin structure Development of high-resolution experimental technology continues to provide huge amounts of valuable data Determining underlying sequence-based factors that determine (some) biological characteristics may be feasible Joint models for sequence incorporating chromatin structure for more sensitive motif discovery ENAR 2006 gupta@bios.unc.edu Acknowledgements Jason Lieb and Paul Giresi (UNC Biology) Joe Ibrahim and Jonathan Gelfond (UNC Biostatistics) RD-83272001 (EPA) Some references: Gupta M and Liu JS (2005). De-novo cis-regulatory module elicitation for eukaryotic genomes. Proceedings of the National Academy of Sciences of the USA, 102 (20): 7079-84. Giresi, P. G., Gupta, M. and Lieb, J. D. (2006). Regulation of nucleosome stability as a mediator of chromatin function. Curr. Opin. Genet. Dev., 16 (2): 171-6. Gupta, M. (2006). Generalized hierarchical Markov models for discovery of length-constrained sequence features from genome tiling arrays. (under review) ENAR 2006 gupta@bios.unc.edu