Determination of DNA sequence features from genome tiling arrays Mayetri Gupta

advertisement
Determination of DNA sequence
features from genome tiling arrays
Mayetri Gupta
University of North Carolina at Chapel Hill
ENAR 2006
gupta@bios.unc.edu
Outline of talk
Biological problem: inferring gene regulation
Whole genome tiling arrays for genomic feature detection
Hierarchical generalized HMM framework
Recursive data augmentation
Application to yeast arrays
ENAR 2006
gupta@bios.unc.edu
Upstream regulation ↔ downstream expression
Fundamental question: how can we understand the biological
mechanisms leading to disease?
...gtggtTAGAATagcgactgttttt... gene 1
...taggTATAATacagtctgacaaaa... gene 2
...cagcaacattgaTATAATtgccat... gene 3
...ctaaaacaatTATTATttatcagg... gene 4
|TATAAT|
G
1
2
3
4
5
6
bits
0
G
1
CT
2
Co-regulated genes share
similar motif patterns
ENAR 2006
gupta@bios.unc.edu
Mechanisms behind gene expression
First step in studying regulatory process is binding of transcription
factors (TFs) to binding sites
Many transcription factors work in co-ordination to regulate genes
ENAR 2006
gupta@bios.unc.edu
Transcription regulation: DNA motifs
Proteins bind to DNA to activate transcription
G
1
2
3
4
5
6
bits
0
|TATAAT|
G
1
CT
2
Position specific weight
matrix (PSWM)
or Motif
ENAR 2006
gupta@bios.unc.edu
Motif discovery- statistical model
Every non-site position
multinomial with
θ0 = (θ01 , . . . , θ04 )
Every motif position i
multinomial with
· · · θ0
θ0 θ1 · · · θ 6 θ0
θ0 · · ·
θi = (θi1 , . . . , θi4 )
EM (Lawrence 1990), Gibbs sampler (Liu 1995), BIOPROSPECTOR (Liu 2001)
Dictionary (Bussemaker 2000) Stochastic Dictionary (Gupta, Liu 2003)
ENAR 2006
gupta@bios.unc.edu
Using auxiliary information in motif discovery
Most motif discovery methods typically yield high proportion of false
positives in complex genomes
Modeling additional genomic characteristics has proved useful to
improve motif search
Phylogenetic conservation (Wasserman 00; Kellis 03; Li 05)
Motif “clustering”: regulatory modules (Thompson 04; Gupta 05)
Special structures- palindromes, uni- and bi-modal profiles
(Kechris 04; Zhou 04)
ENAR 2006
gupta@bios.unc.edu
Chromatin structure
“Beads on a string”
DNA is NOT a linear
strand, local chromatin
structure may affect binding
of proteins
ENAR 2006
gupta@bios.unc.edu
Nucleosome positioning ↔ TF binding
Previous developments assume DNA is a linear strand with equal propensity for
binding sites to be located anywhere
DNA has been recently proved to be extremely dynamic- nucleosome
occupancy varies throughout the genome (Lee et al. Nat. Genet. 2004)
Empirical studies suggest TFs more likely to bind to nucleosome free
regions (linkers)
Unlike other auxiliary data, is potentially a more “universal” mechanism, less
dependent on species/TF/gene set being considered
ENAR 2006
gupta@bios.unc.edu
Genome-wide mapping of nucleosome occupancy
High-density DNA microarray that utilizes the
differential susceptibility of nucleosomal and
linker DNA to micrococcal nuclease
Yeast + formaldehyde + MNase
genomic DNA
↓
↓
purify nucleosomal DNA
↓
(Gel electrophoresis)
↓
↓
↓
label with Cy3
label with Cy5
&
.
mix and concentrate, hybridize to microarray
(from Yuan et al. 2005)
ENAR 2006
gupta@bios.unc.edu
Data: Yuan et al. (Science, 2005)
Entire sequence fragmented into 50-mers overlapping every 20 bp, 8
replicate measurements for each probe
ENAR 2006
gupta@bios.unc.edu
Data: Yuan et al. (Science, 2005)
Entire sequence fragmented into 50-mers overlapping every 20 bp, 8
replicate measurements for each probe
About 146 bp of DNA wrapped around a single nucleosome- corresponds to
about 6-8 overlapping probes on microarray
ENAR 2006
gupta@bios.unc.edu
Data: Yuan et al. (Science, 2005)
Entire sequence fragmented into 50-mers overlapping every 20 bp, 8
replicate measurements for each probe
About 146 bp of DNA wrapped around a single nucleosome- corresponds to
about 6-8 overlapping probes on microarray
Measurements derived from a large population of asynchronous cells, so
temporally unstable nucleosomes appear “delocalized” (and weaker) than
stereotypically positioned ones
ENAR 2006
gupta@bios.unc.edu
Data: Yuan et al. (Science, 2005)
Entire sequence fragmented into 50-mers overlapping every 20 bp, 8
replicate measurements for each probe
About 146 bp of DNA wrapped around a single nucleosome- corresponds to
about 6-8 overlapping probes on microarray
Measurements derived from a large population of asynchronous cells, so
temporally unstable nucleosomes appear “delocalized” (and weaker) than
stereotypically positioned ones
30 contigs from yeast chromosome III covering about 270 Kbp
ENAR 2006
gupta@bios.unc.edu
Data: Yuan et al. (Science, 2005)
Entire sequence fragmented into 50-mers overlapping every 20 bp, 8
replicate measurements for each probe
About 146 bp of DNA wrapped around a single nucleosome- corresponds to
about 6-8 overlapping probes on microarray
Measurements derived from a large population of asynchronous cells, so
temporally unstable nucleosomes appear “delocalized” (and weaker) than
stereotypically positioned ones
30 contigs from yeast chromosome III covering about 270 Kbp
Data normalized in two steps to account for geographic biases and
cross-hybridization
ENAR 2006
gupta@bios.unc.edu
Hidden Markov model framework
Tiling array structure indicates HMM-type model may be
desirable
ENAR 2006
gupta@bios.unc.edu
Hidden Markov model framework
Tiling array structure indicates HMM-type model may be
desirable
Cannot observe the “latent” biological process, only noisy
measurements
ENAR 2006
gupta@bios.unc.edu
Hidden Markov model framework
Tiling array structure indicates HMM-type model may be
desirable
Cannot observe the “latent” biological process, only noisy
measurements
Spatial dependence between neighboring probes on the
array
ENAR 2006
gupta@bios.unc.edu
Hidden Markov model framework
Tiling array structure indicates HMM-type model may be
desirable
Cannot observe the “latent” biological process, only noisy
measurements
Spatial dependence between neighboring probes on the
array
Complications:
ENAR 2006
gupta@bios.unc.edu
Hidden Markov model framework
Tiling array structure indicates HMM-type model may be
desirable
Cannot observe the “latent” biological process, only noisy
measurements
Spatial dependence between neighboring probes on the
array
Complications:
biological restrictions on state lengths
ENAR 2006
gupta@bios.unc.edu
Hidden Markov model framework
Tiling array structure indicates HMM-type model may be
desirable
Cannot observe the “latent” biological process, only noisy
measurements
Spatial dependence between neighboring probes on the
array
Complications:
biological restrictions on state lengths
sequence-specific variability in behavior of probes- median
value may be insufficient summary
ENAR 2006
gupta@bios.unc.edu
Yuan et al (2005) analysis
DN1
N1
...
...
DN6
DN7
DN 8
N6
N7
N8
DN 9
L
Use a profile Hidden Markov model (PHMM) so that length
distributions can vary between states
Break up sequence into 40 probe non-overlapping windows
Cannot extend PHMM to long sequences, but breaking up sequence
arbitrarily can cause loss of information
ENAR 2006
gupta@bios.unc.edu
Bayesian framework for length-restricted features
τ
Z1
...
Z2
Y1
Y2
...
...
L2
L1
Yl 1
Y11+1
...
ZN
Y1 +l
1
2
...
LN
YΣ
1i
... Y
Σ 1i +lN
Z , L latent random variables, only Y ’s are observed random variables
ENAR 2006
gupta@bios.unc.edu
Bayesian framework for length-restricted features
(1) State duration model
Length distribution in state k: (truncated) negative binomial
l −1
pk (l) = ck (φk )
(1 − φk )l−rk φrkk ,
rk − 1
l ∈ Dk = {rk , rk + 1, . . . , sk }
Sets Dk based on approximately known physical characteristics
ENAR 2006
gupta@bios.unc.edu
Bayesian framework for length-restricted features
(2) Model for spot measurements
Spot measurements: conditional on being in state Zi = k, log-ratios
modelled as
yi j |Zi = k, µik , σ2ik ∼ N(µik , σ2ik )
Hierarchical framework with 2-level conjugate priors on µik (Normal)
and σ2ik (Inv-gamma) ⇒ marginally a non-central t
Transition probabilities between states τ = ((τi j )), with all τii = 0.
Dirichlet priors on non-zero transitions.
ENAR 2006
gupta@bios.unc.edu
Model fitting and parameter estimation
State length distributions ⇒ computational complexity- Gibbs
sampling-based approaches tend to get trapped in local optima
ENAR 2006
gupta@bios.unc.edu
Model fitting and parameter estimation
State length distributions ⇒ computational complexity- Gibbs
sampling-based approaches tend to get trapped in local optima
Solution: A data augmentation algorithm is formulated with recursive
steps to jointly sample state identities and state duration lengths
P(Z, L|y) = P(ZN |y)P(LN |ZN , y) . . . P(Lmin |Lmin+1 , Zmin+1 , . . . , ZN , y)
ENAR 2006
gupta@bios.unc.edu
Model fitting and parameter estimation
State length distributions ⇒ computational complexity- Gibbs
sampling-based approaches tend to get trapped in local optima
Solution: A data augmentation algorithm is formulated with recursive
steps to jointly sample state identities and state duration lengths
P(Z, L|y) = P(ZN |y)P(LN |ZN , y) . . . P(Lmin |Lmin+1 , Zmin+1 , . . . , ZN , y)
Assumes that the length spent in a state and the transition to that
state are independent
P(l, k|l 0 , k0 ) = pk (l)τk0 k
ENAR 2006
gupta@bios.unc.edu
Model fitting and parameter estimation
State length distributions ⇒ computational complexity- Gibbs
sampling-based approaches tend to get trapped in local optima
Solution: A data augmentation algorithm is formulated with recursive
steps to jointly sample state identities and state duration lengths
P(Z, L|y) = P(ZN |y)P(LN |ZN , y) . . . P(Lmin |Lmin+1 , Zmin+1 , . . . , ZN , y)
Assumes that the length spent in a state and the transition to that
state are independent
P(l, k|l 0 , k0 ) = pk (l)τk0 k
Posterior estimation of parameters is mostly straightforward. However
φk ’s have a non-standard distribution for which we use adaptive
rejection Metropolis sampling
ENAR 2006
gupta@bios.unc.edu
Hierarchical Generalized Hidden Markov Model
HGHMM was applied to the longest contiguous mapped region,
corresponding to 61 Kbp of yeast chromosome III
Data was used from all 8 replicates that had been cleaned and
normalized
Three states:
linker (D1 = 1, 2, . . .)
delocalized nucleosome (D2 = {9, . . . , 30})
well-positioned nucleosomes (D3 = {6, 7, 8})
ENAR 2006
gupta@bios.unc.edu
ag replacements
HGHMM fit
Theoretical t Quantiles
nucleosome
Sample Quantiles
delocalized
Sample Quantiles
Sample Quantiles
linker
Theoretical t Quantiles
Theoretical t Quantiles
QQ plots of the three predicted states in nucleosomal array data using a
HGHMM with the hierarchical t-distribution
ENAR 2006
gupta@bios.unc.edu
HGHMM fit
Linker
Mean
µ
r/φ
Deloc
SD
Mean
Nucleo
SD
Mean
SD
-0.833 0.011
0.165 0.005
0.869 0.007
3.969 0.281
12.266 0.245
6.608 0.076
Overall posterior summaries for 3 states for
µ: expected probe level measurement
r/φ: approximate expected state length
ENAR 2006
gupta@bios.unc.edu
acements
5
10
20
0.00 0.02 0.04 0.06 0.08 0.10
ρ=5
ρ = 10
ρ = 20
ρ = 30
fraction misclassified
0.00 0.02 0.04 0.06 0.08 0.10
maximum MSE
Sensitivity analysis: (1) maximal MSEs and (2) misclassification
30
ρ=5
ρ = 10
ρ = 20
ρ = 30
5
2
30
true ρ
fraction misclassified
ρ=5
ρ = 20
1
20
0.00 0.02 0.04 0.06 0.08 0.10
0.00 0.02 0.04 0.06 0.08 0.10
maximum MSE
true ρ
10
3
ρ=5
ρ = 20
1
2
3
range
range
Top: Misspecification of inverse gamma hyperparameter ρ.
Below: Misspecification of truncated negative binomial range.
Range 1: (r2 , s2 ) = (9, 30), (r3 , s3 ) = (6, 8)
Range 2: (r2 , s2 ) = (10, 25), (r3 , s3 ) = (5, 9)
Range 3: (r2 , s2 ) = (8, 25), (r3 , s3 ) = (5, 7)
ENAR 2006
gupta@bios.unc.edu
Biological validation of predictions
How well do nucleosome-free state predictions correlate with the likely
location of TFBSs?
ENAR 2006
gupta@bios.unc.edu
Biological validation of predictions
How well do nucleosome-free state predictions correlate with the likely
location of TFBSs?
Harbison et al (2004) data: Genomewide location analysis (ChIP-chip) to
detemine occupancy of DNA-binding transcription regulators under a variety
of conditions
ENAR 2006
gupta@bios.unc.edu
Biological validation of predictions
How well do nucleosome-free state predictions correlate with the likely
location of TFBSs?
Harbison et al (2004) data: Genomewide location analysis (ChIP-chip) to
detemine occupancy of DNA-binding transcription regulators under a variety
of conditions
First, we took segments for chromosome III corresponding to which binding
data were highly significant, and used EMCmodule (Gupta and Liu, PNAS
2005) with a starting set of motif weight matrices from SCPD database to
determine motif locations
ENAR 2006
gupta@bios.unc.edu
uu
2
1
2
1
38600
39000
39400
39800
0
0
29000
iYCL058C
29500
30000
30500
31000
31500
28000
5
HSF1
PHO2
SCB
25000
3
2
1
0
0
1
2
uu
3
4
ABF1
GCN4
HSF1
PHO2
uu
24800
27800
midloci[sel.ind]
5
5
4
3
2
1
0
24600
27600
iYCL064C
midloci[sel.ind]
ABF1
HSF1
24400
27400
iYCL061C
midloci[sel.ind]
24200
27200
4
2
uu
78% of sites in linker or delocalized nucleosomal regions
0
1
Comparison with PHMM
22000
22200
22400
22600
16800
17000
17200
17400
• Nucleosome
• Delocalized nucleosome
• Linker
ENAR 2006
gupta@bios.unc.edu
Comparisons based on Harbison et al. data
Criteria for state assignment: Estimated posterior state probability
b i = k|Y ) = ∑Mj=1 I(Z ( j) = k)/M > 0.8
P(Z
i
ENAR 2006
gupta@bios.unc.edu
Comparisons based on Harbison et al. data
Criteria for state assignment: Estimated posterior state probability
b i = k|Y ) = ∑Mj=1 I(Z ( j) = k)/M > 0.8
P(Z
i
Two methods: SDDA (Gupta and Liu, JASA 2003) and BioProspector
(Liu et al, PAC SYMP BIOCOMP 2001) were used for de-novo motif
discovery in each set of regions
ENAR 2006
gupta@bios.unc.edu
Comparisons based on Harbison et al. data
Criteria for state assignment: Estimated posterior state probability
b i = k|Y ) = ∑Mj=1 I(Z ( j) = k)/M > 0.8
P(Z
i
Two methods: SDDA (Gupta and Liu, JASA 2003) and BioProspector
(Liu et al, PAC SYMP BIOCOMP 2001) were used for de-novo motif
discovery in each set of regions
Motif searches were run separately on regions predicted by (1)
HGHMM, and (2) profile HMM (PHMM)
ENAR 2006
gupta@bios.unc.edu
Comparisons based on Harbison et al. data
Criteria for state assignment: Estimated posterior state probability
b i = k|Y ) = ∑Mj=1 I(Z ( j) = k)/M > 0.8
P(Z
i
Two methods: SDDA (Gupta and Liu, JASA 2003) and BioProspector
(Liu et al, PAC SYMP BIOCOMP 2001) were used for de-novo motif
discovery in each set of regions
Motif searches were run separately on regions predicted by (1)
HGHMM, and (2) profile HMM (PHMM)
Motif predictions were compared by (i) match to known motifs in the
SCPD database and (ii) positional overlap with factor-bound regions
from the ChIP-chip data
ENAR 2006
gupta@bios.unc.edu
Specificity and sensitivity of predictions
Specificity (Spec): Proportion of regions predicted in a state overlapping with
TF-bound regions from Harbison et al
Sensitivity (Sens): % of bound regions not missed.
Higher proportion of linker regions than nucleosomal overlap with TF-bound
regions; HGHMM (with SDDA) has highest sensitivity
HGHMM
SDDA
PHMM
BP
SDDA
BP
Spec Sens Spec Sens Spec Sens Spec Sens
Linker
0.61
0.7 0.40
0.87
0.38
0.93
0.23
1
Deloc Nucl
0.19
0.8 0.15
0.63
0.16
0.37
0.11
0.33
Nucleosomal
0.16
0.5 0.09
0.43
0.20
0.8
0.15
0.73
ENAR 2006
gupta@bios.unc.edu
0.6
0.4
*
*** ** ** * **
*
*
*
*
** **
*
***
* ******* **** *
***
*** * *
*
*
**
*
*
*
*
*
** * * ***
***
**
*
*
0.0
0.2
pred state
0.8
1.0
Binding site “hotspots” in predicted linker regions
25000
30000
35000
40000
35000
40000
35000
40000
0.6
0.4
0.0
0.2
pred state
0.8
1.0
coord
25000
30000
0.6
0.4
0.0
0.2
pred state
0.8
1.0
coord
25000
30000
Region of chromosome III with posterior probability of states from HGHMM and
corresponding predicted motif sites from SDDA, showing certain binding site
“hotspots” in predicted linker regions (blue crosses) which are almost completely
devoid of well-positioned nucleosomes.
ENAR 2006
gupta@bios.unc.edu
Motif predictions and matches to known binding TFs
HGHMM motif TF match ExptMatch
AT TT GA
BAS1 ATTTA
bits
2
1
A
T T TTT T
A GAT C
AT TT GAA
ACA A
GAA AG T
TTTTC TT
T CAAA
CA
bits
1
0
bits
2
1
0
bits
2
1
C
1
2
3
4
5
6
7
8
9
10
GT
A
A
C
T
C
G
C
C
AT
C
G
A
A
T
T
T
1
0
bits
2
1
0
bits
2
1
0
T
T
T TACA
TGG
A
C
A
C
C
1
2
3
4
5
6
7
8
9
10
G
CC
C
GTT
T
C
GC C
AC
A
T
G
T
1
2
3
4
5
6
7
8
9
10
bits
bits
2
T
T
AC
GG
A
T
C
AA
1
2
3
4
5
6
7
8
9
10
0
C
A
TA
AT
GC
C
C
A
G
G
A
AT
1
2
3
4
5
6
7
8
9
10
1
A
AA T T
CT
A
A
G
C
C
GGG
G
G
A
C
1
2
3
4
5
6
7
8
9
10
2
C
A
T G
AA
CA
0
A
C
G
1
2
3
4
5
6
7
8
9
10
2
T
AT T
A
TG
G
AC
1
2
3
4
5
6
7
8
9
10
0
Source
PHMM state
SCPD
L
FKH1
TTGTTTACC
Harbison
L
GAT1
GATAA
Harbison
N
HSF1
TTCTAGAA
Harbison
-
RAP1
CACCCAnACAn SCPD
STE12 TGAAACA
-
SCPD
-
SWI6
TTTCG
SCPD
-
LNO2
ATTTCA
Harbison
-
ExptMatch: match in SCPD database
States: L, D, N: linker, delocalized nucleosomal, or nucleosomal
ENAR 2006
gupta@bios.unc.edu
Comparison based on Harbison data: summary
In linker regions, 44.8% of found sites overlapped with “significantly
bound” (p < 0.005) regions which had sequences conserved in at
least one other species of yeast
In nucleosomal regions, only 23% of sites overlapped
Linker regions are significantly enriched for poly-‘A’ repeats: “AA-rich”
segments are known to increase rigidity of DNA and prevent bending
Motif-finder BP misses some sites, possibly due to sensitivity to
repeat elements
ENAR 2006
gupta@bios.unc.edu
Using nucleosome occupancy for motif prediction
Experiments are expensive- sequence data is available
Do intrinsic sequence factors influence nucleosome occupancy?
Controversial- but some recent experiments indicate that sequence
plays a role (e.g. Sekinger et al., 2005)
Two hypotheses: (i) General features of local sequence composition
and/or (ii) positioning “elements”
ENAR 2006
gupta@bios.unc.edu
Hierarchical Generalized HMM: Summary
Hierarchical model robust to various sources of probe variability and
measurement error
State length model gives principled way of modeling state lengths
avoiding arbitrary splitting of a long sequence
Data augmentation method adapts recursive likelihood computation
techniques for controlling computational complexity issues
Length-restricted HMM observed to have stable estimates, avoids
label-switching problems typical to HMMs.
ENAR 2006
gupta@bios.unc.edu
Conclusions and some ongoing work
Motif discovery tools can be improved by including auxiliary biological
knowledge, such as combinatorial regulation, phylogenetic
conservation, chromatin structure
ENAR 2006
gupta@bios.unc.edu
Conclusions and some ongoing work
Motif discovery tools can be improved by including auxiliary biological
knowledge, such as combinatorial regulation, phylogenetic
conservation, chromatin structure
Development of high-resolution experimental technology continues to
provide huge amounts of valuable data
ENAR 2006
gupta@bios.unc.edu
Conclusions and some ongoing work
Motif discovery tools can be improved by including auxiliary biological
knowledge, such as combinatorial regulation, phylogenetic
conservation, chromatin structure
Development of high-resolution experimental technology continues to
provide huge amounts of valuable data
Determining underlying sequence-based factors that determine
(some) biological characteristics may be feasible
ENAR 2006
gupta@bios.unc.edu
Conclusions and some ongoing work
Motif discovery tools can be improved by including auxiliary biological
knowledge, such as combinatorial regulation, phylogenetic
conservation, chromatin structure
Development of high-resolution experimental technology continues to
provide huge amounts of valuable data
Determining underlying sequence-based factors that determine
(some) biological characteristics may be feasible
Joint models for sequence incorporating chromatin structure for more
sensitive motif discovery
ENAR 2006
gupta@bios.unc.edu
Acknowledgements
Jason Lieb and Paul Giresi (UNC Biology)
Joe Ibrahim and Jonathan Gelfond (UNC Biostatistics)
RD-83272001 (EPA)
Some references:
Gupta M and Liu JS (2005). De-novo cis-regulatory module elicitation for
eukaryotic genomes. Proceedings of the National Academy of Sciences of
the USA, 102 (20): 7079-84.
Giresi, P. G., Gupta, M. and Lieb, J. D. (2006). Regulation of nucleosome
stability as a mediator of chromatin function. Curr. Opin. Genet. Dev., 16 (2):
171-6.
Gupta, M. (2006). Generalized hierarchical Markov models for discovery of
length-constrained sequence features from genome tiling arrays. (under
review)
ENAR 2006
gupta@bios.unc.edu
Download