Statistical analysis of transcription regulatory modules Mayetri Gupta

Statistical analysis of transcription
regulatory modules
Mayetri Gupta
University of North Carolina at Chapel Hill.
joint work with Jun S. Liu (Harvard University)
gupta@bios.unc.edu -- p.1/24
Upstream regulation ↔ downstream expression
Fundamental question: how can we understand the biological
mechanisms leading to disease?
...gtggtTAGAATagcgactgttttt... gene 1
...taggTATAATacagtctgacaaaa... gene 2
...cagcaacattgaTATAATtgccat... gene 3
...ctaaaacaatTATTATttatcagg... gene 4
[Figure: sequence logo of the shared motif TATAAT; x-axis: motif positions 1 to 6; y-axis: bits (0 to 2)]
Co-regulated genes share
similar motif patterns
Transcription regulation: DNA motifs
Proteins bind to DNA to activate transcription
[Figure: sequence logo of the motif TATAAT; x-axis: motif positions 1 to 6; y-axis: bits (0 to 2)]
Position specific weight
matrix (PSWM)
or Motif
Transcription regulation in eukaryotes
Harder problem: many transcription factors working in
co-ordination
LARGE sequence search space + WEAK motifs: → many
false positives/negatives?
Regulatory motif discovery
Every non-site position: multinomial with θ0 = (θ01, ..., θ04)
Every motif position i: multinomial with θi = (θi1, ..., θi4)
[Figure: a sequence with background positions θ0 on either side of a motif site with positions θ1, ..., θ6]
Product multinomial model
Challenge: Find position of sites and θ’s
MEME (Bailey, ISMB 1994) Gibbs Motif Sampler (Liu, JASA 1995), AlignAce (Roth, Nat. BioTech 1998),
BioProspector (Liu, Pac. Symp. Biocomp. 2001), Stochastic Dictionary (Gupta, JASA 2003)
Motif ≡ PSWM
Position-Specific Weight Matrix
Θ =   Position    1      2      3      4      5      6
      A         0.20   0.00   0.00   0.95   0.00   0.90
      C         0.00   0.00   0.10   0.05   0.99   0.00
      G         0.00   0.00   0.80   0.00   0.01   0.00
      T         0.80   1.00   0.10   0.00   0.00   0.10
TTGACA [PROB = 0.5417]
ATGACT [PROB = 0.0150]
TTTAGT [PROB = 7.6e-05]
Columns assumed independent
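Because the columns are assumed independent, the probability of a site under the PSWM is the product of one entry per column. The sketch below reproduces the three probabilities quoted on this slide; the names `PSWM` and `site_probability` are illustrative, not part of any published software.

```python
# Columns of the PSWM shown on this slide, one dict per motif position.
PSWM = [
    {"A": 0.20, "C": 0.00, "G": 0.00, "T": 0.80},
    {"A": 0.00, "C": 0.00, "G": 0.00, "T": 1.00},
    {"A": 0.00, "C": 0.10, "G": 0.80, "T": 0.10},
    {"A": 0.95, "C": 0.05, "G": 0.00, "T": 0.00},
    {"A": 0.00, "C": 0.99, "G": 0.01, "T": 0.00},
    {"A": 0.90, "C": 0.00, "G": 0.00, "T": 0.10},
]


def site_probability(site, pswm=PSWM):
    """Probability of a candidate site, multiplying independent columns."""
    if len(site) != len(pswm):
        raise ValueError("site length must equal the motif width")
    prob = 1.0
    for base, column in zip(site.upper(), pswm):
        prob *= column[base]
    return prob


for site in ("TTGACA", "ATGACT", "TTTAGT"):
    print(site, site_probability(site))
```

The printed values round to 0.5417, 0.0150 and 7.6e-05, matching the figures above.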
Motif discovery
[Figure: N sequences with unknown motif site positions A1, A2, A3, ..., AN]
Motif site locations in sequences: A = (A1, ..., AN)
Sample A|Θ, then Θ|A
Variants/extensions
ALIGNACE, Roth 1998
BIOPROSPECTOR, Liu 2001
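The alternation "sample A|Θ, then Θ|A" can be sketched as follows. This is a minimal illustration under several assumptions that are not on the slide: exactly one site per sequence, a uniform background θ0, a pseudocount of 0.5, and Θ replaced by its posterior mean given the other sequences' sites rather than drawn from its Dirichlet posterior. All function names are hypothetical.

```python
import random

BASES = "ACGT"


def column_counts(seqs, positions, width, skip, pseudo=0.5):
    """Base counts per motif column from current sites, leaving one sequence out."""
    counts = [{b: pseudo for b in BASES} for _ in range(width)]
    for n, (seq, a) in enumerate(zip(seqs, positions)):
        if n == skip:
            continue
        for i in range(width):
            counts[i][seq[a + i]] += 1
    return counts


def gibbs_site_sampler(seqs, width, n_sweeps=100, seed=0):
    rng = random.Random(seed)
    background = {b: 0.25 for b in BASES}  # assumed uniform background
    positions = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(n_sweeps):
        for n, seq in enumerate(seqs):
            # Theta | A: column frequencies from the sites in the other sequences.
            counts = column_counts(seqs, positions, width, skip=n)
            theta = [{b: c[b] / sum(c.values()) for b in BASES} for c in counts]
            # A_n | Theta: each start is weighted by its motif/background ratio.
            weights = []
            for a in range(len(seq) - width + 1):
                w = 1.0
                for i in range(width):
                    w *= theta[i][seq[a + i]] / background[seq[a + i]]
                weights.append(w)
            positions[n] = rng.choices(range(len(weights)), weights=weights)[0]
    return positions


# The four example sequences from the opening slide, upper-cased.
seqs = [s.upper() for s in (
    "gtggtTAGAATagcgactgttttt",
    "taggTATAATacagtctgacaaaa",
    "cagcaacattgaTATAATtgccat",
    "ctaaaacaatTATTATttatcagg",
)]
positions = gibbs_site_sampler(seqs, width=6)
print(positions)
```

The sampler is stochastic, so the reported positions depend on the seed and number of sweeps; with so few short sequences it is not guaranteed to settle on the TATAAT-like sites.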
Motif → regulatory module
Combinatorial regulation in complex organisms
Spatial “clusters” of motifs: “modules”
Usual motif search methods may
miss weaker sites
detect low-complexity repeats
Sequence search space is much larger
Example: combinatorial regulation in CNS development
Experiments determined Sim:Tango (ACGTG) binding sites in Drosophila midline-expressed genes
At least one more regulatory element is indicated
A model-selection-based approach
Gibbs sampler on B. subtilis data
[Figure: predicted sites for motifs Θ1, ..., Θ15; x-axis: sequence index (1 to 141); y-axis: position of site along sequence, as distance from transcription start site (−123 to 27)]
Determine true module of size K from starting set of {Θ1, ..., Θ15}
(K, Θ unknown)
Sequence model if K is known
K PSWMs with Θ = (Θ1, ..., ΘK)
Occurrence probabilities ρ
Transition probabilities between PSWM types T = (τ_kl)
Missing data: A = (A_ij); A_ij = j-th site in sequence i
A ∼ P(·|λ)
Then the likelihood of sequence S is
P(S \mid \phi) = \sum_{A} \sum_{T} P(S \mid \phi, A, T) \, P(A \mid \phi) \, P(T \mid A, \phi)
where φ = (ρ, Θ, τ, λ)
Hidden Markov model (HMM) for modules
[Figure: HMM state diagram with start state S, motif states M1 and M2, and end state E; transition probabilities τ_S1, τ_S2, τ_11, τ_12, τ_21, τ_22, τ_1E, τ_2E]
Sequence model is a hidden Markov model (HMM) with A
following a Markov chain
“Clustering”: site probability decreases with distance from the nearest site; the inter-site distance A_ij − A_i,j−1 is assumed geometric(λ)
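To make the generative picture concrete, the sketch below simulates site positions and motif types along one sequence: motif types follow a Markov chain and gaps between successive sites are geometric(λ). It is an illustration only. The parameter layout (`start`, and `tau` rows whose last entry is the transition to the end state E), the neglect of motif widths, and the function names are assumptions, not the authors' implementation.

```python
import random


def geometric(rng, lam):
    """Draw d >= 1 with P(d) = lam * (1 - lam) ** (d - 1)."""
    d = 1
    while rng.random() >= lam:
        d += 1
    return d


def simulate_sites(start, tau, lam, seq_len, seed=0):
    """Return a list of (position, motif_type) pairs for one sequence.

    start[k]  : probability that the first site has type k
    tau[l][k] : transition probability from type l to type k, for k < K;
                tau[l][K] is the probability of moving to the end state E
    """
    rng = random.Random(seed)
    K = len(start)
    sites = []
    pos = geometric(rng, lam) - 1  # 0-based position of the first site
    k = rng.choices(range(K), weights=start)[0]
    while pos < seq_len:
        sites.append((pos, k))
        nxt = rng.choices(range(K + 1), weights=tau[k])[0]
        if nxt == K:  # end state reached: no more sites in this module
            break
        pos += geometric(rng, lam)  # geometric gap to the next site
        k = nxt
    return sites


# Two motif types with made-up parameters.
start = [0.5, 0.5]
tau = [[0.2, 0.6, 0.2],
       [0.6, 0.2, 0.2]]
sites = simulate_sites(start, tau, lam=0.05, seq_len=500)
print(sites)
```

A larger λ gives shorter expected gaps (mean 1/λ), so sites cluster more tightly.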
Parameter estimation
Problems are two-fold:
K is unknown; we only have D ≥ K “approximate” starting PSWMs
Estimate φ = (Θ, ρ, λ, τ): dimensionality changes with K
Define u = (u1, ..., uD); uj = 1 if motif j is in the module, 0 otherwise
With K known, Gibbs sampler (e.g. Thompson et al., Genome Res. 2004;
Zhou and Wong, PNAS 2004)
Prior model (K unknown)
Dirichlet, Product Dirichlet, Beta prior distributions on
parameters φ = (ρ, Θ, τ, λ)
u ∼ Bernoulli(·)
Posterior distribution of all parameters:
P(φ, u, A | S) ∝ P(S | u, φ, A) p(φ, A | u) p(u)
Discovery of regulatory modules
CHALLENGES:
(A) Total number of potential motifs D may be LARGE!
(B) Naive posterior computation: exponential order in sequence length
[Figure: iterative update cycle with three steps]
From total of D motifs, choose d for module
Given Θ, ρ, τ, λ, sample A: update motif sites within modules
Given d and A, update ρ, τ, λ; from aligned sites, update parameters Θ
(A) Evolutionary Monte Carlo to update u
Liang and Wong, Stat. Sinica 2000
Update “Population” using 2 kinds of “moves”: mutation and crossover
Starting set: 4 possible module types (“units”)
Motif selection: choose d out of {Θ1, Θ2, ..., ΘD} for module
[Figure: a population of four binary indicator vectors u1, ..., u4; crossover and mutation moves lead to a new configuration]
Metropolis-Hastings-like moves with target distribution:
\int P(u \mid S, A, \phi) \, p(\phi) \, d\phi
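The two kinds of moves can be sketched on a population of binary indicator vectors. This is a simplified illustration, not the method as implemented by the authors: the temperature ladder and exchange moves of Liang and Wong's evolutionary Monte Carlo are omitted (every unit targets the same distribution), and `toy_log_target` is a made-up stand-in for the integrated posterior, which in practice requires the data S and site assignments A.

```python
import math
import random


def accept(rng, log_ratio):
    """Metropolis-Hastings acceptance for a symmetric proposal."""
    return rng.random() < math.exp(min(0.0, log_ratio))


def emc_select(log_target, D, pop_size=4, n_iter=2000, seed=0):
    """Update a population of indicator vectors u by mutation and crossover."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(D)] for _ in range(pop_size)]
    for _ in range(n_iter):
        if rng.random() < 0.5:
            # Mutation: flip one indicator in one randomly chosen unit.
            m, j = rng.randrange(pop_size), rng.randrange(D)
            new = pop[m][:]
            new[j] = 1 - new[j]
            if accept(rng, log_target(new) - log_target(pop[m])):
                pop[m] = new
        else:
            # Crossover: swap the tails of two units at a random cut point.
            a, b = rng.sample(range(pop_size), 2)
            c = rng.randrange(1, D)
            new_a = pop[a][:c] + pop[b][c:]
            new_b = pop[b][:c] + pop[a][c:]
            log_ratio = (log_target(new_a) + log_target(new_b)
                         - log_target(pop[a]) - log_target(pop[b]))
            if accept(rng, log_ratio):
                pop[a], pop[b] = new_a, new_b
    return pop


# Hypothetical target: rewards agreement with a made-up "true" module.
TRUE_MODULE = [1, 0, 1, 0, 0, 1]


def toy_log_target(u):
    return 3.0 * sum(int(x == y) for x, y in zip(u, TRUE_MODULE))


population = emc_select(toy_log_target, D=len(TRUE_MODULE))
print(population)
```

Both proposals are symmetric, so the acceptance ratio reduces to a ratio of target values (for crossover, the product over the two units involved).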
(B) Data augmentation: forward-backward algorithm
Taking advantage of HMM formulation, recursively define a
“forward” probability of occurrence of motif type k at position i:
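F_{i,k} = \sum_{j<i} \sum_{l=1}^{K} F_{j,l} \, p_\lambda(i-j) \, \tau_{lk} \, P(\text{motif } k \text{ at } i)
\sum_{k} F_{N,k} = P(A \mid S, \phi, u)
[Figure: forward recursion; F_{i,k} for motif Θk at position i accumulates F_{j1}, F_{j2}, F_{j3}, ... for motif Θl at earlier positions j, each weighted by p_λ(i−j) τ_lk]
Use “backward” sampling to update
A_N \sim P(A_N \mid S, \phi), \quad A_{N-1} \sim P(A_{N-1} \mid A_N, S, \phi), \quad \ldots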
Given A, u, parameter updating is straightforward (conjugate priors)
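A minimal sketch of the forward recursion and backward sampling follows. Several pieces are assumptions added for illustration: the emission table `emit[i][k]` stands in for P(motif k at i) and is supplied by the caller, `tau_start` handles the first site, motif widths are ignored (as in the recursion as written), and the last site is drawn in proportion to F without an end-state term.

```python
import random


def gap_prob(d, lam):
    """Geometric probability of a gap of d >= 1 positions."""
    return lam * (1.0 - lam) ** (d - 1)


def forward(emit, tau_start, tau, lam):
    """F[i][k]: weight of all site configurations whose latest site is type k at i."""
    N, K = len(emit), len(tau_start)
    F = [[0.0] * K for _ in range(N)]
    for i in range(N):
        for k in range(K):
            total = tau_start[k] * gap_prob(i + 1, lam)  # site at i is the first
            for j in range(i):
                for l in range(K):
                    total += F[j][l] * gap_prob(i - j, lam) * tau[l][k]
            F[i][k] = total * emit[i][k]
    return F


def backward_sample(F, tau_start, tau, lam, rng):
    """Draw sites from last to first using the stored forward quantities."""
    N, K = len(F), len(tau_start)
    cells = [(i, k) for i in range(N) for k in range(K)]
    i, k = rng.choices(cells, weights=[F[i][k] for i, k in cells])[0]
    sites = [(i, k)]
    while True:
        # Either (i, k) was the first site (None) or it followed some (j, l).
        options = [None]
        weights = [tau_start[k] * gap_prob(i + 1, lam)]
        for j in range(i):
            for l in range(K):
                options.append((j, l))
                weights.append(F[j][l] * gap_prob(i - j, lam) * tau[l][k])
        prev = rng.choices(options, weights=weights)[0]
        if prev is None:
            return sites[::-1]
        sites.append(prev)
        i, k = prev


# Tiny example: 2 positions, 1 motif type, made-up emission weights.
F = forward(emit=[[2.0], [3.0]], tau_start=[1.0], tau=[[1.0]], lam=0.5)
print(F)
sampled = backward_sample(F, [1.0], [[1.0]], 0.5, random.Random(0))
print(sampled)
```

The double loop costs O(N²K²) per sequence, compared with the exponential cost of enumerating all site configurations directly.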
Skeletal muscle sequences (Wasserman et al., 2000)
24 aligned sequence pairs from human and mouse genomes
involved in skeletal muscle gene regulation
[Figure: three panels (sequence 1, sequence 2, sequence 3); x-axis: position in sequence; y-axis: probability (0.0 to 1.0)]
Posterior probability of site occurrence in human regulatory
sequences (first three in set)
Red (dotted) line: MEF2 motif
Black (solid) line: MYF motif
Skeletal muscle sequences (Wasserman et al., 2000)
Matches to experimentally derived sites using 5 methods
Method          MEF   MYF   SP1   SRF   Total   SENS   SPEC   MSpec   Time
MEME (EM)         0     1    21     0     161   0.14   0.14    0.20     3927.6
BioProspector     6     1     8     1     155   0.10   0.10    0.36      481.0
CisModule        17     0     6     0     222   0.15   0.10    0.40     2160.0
GMS               6     6     2     1      84   0.10   0.25    0.44    68258.75
GMS p*           14    14     4     6     162   0.25   0.23    0.60   131112.9
EMCModule        12    12     5     7     180   0.23   0.20    0.67    21943.0
EMCModuleJ       17    13     8    10     108   0.31   0.44    0.80     8536.2
True             32    50    44    28     154
EMCModuleJ : with JASPAR matrices
SENS: sensitivity
SPEC: specificity
MSpec: (True predicted motif types)/(True motif types)
Drosophila midline sequences
Available data:
D. melanogaster and D. pseudoobscura for 30 genes
expressed in central nervous system midline cell progenitors.
60 odorant receptor genes (controls)
Experimental determination of sites for breathless,
single-minded, Toll, and rhomboid genes
Drosophila midline sequences: analysis
Starting collection of 30 motifs from sequences weighted by alignment scores (CompareProspector)
EMCmodule detects 3 motifs, including Sim:Tango (ACGTG)
Mann-Whitney tests comparing motif scores to the controls are
highly significant (p < 1e-5)
Analysis: Emma Huang
Error rates compared for 3 data sets
FP: proportion of “spurious” motifs in final module
FN: proportion of motifs missed
Method          B. subtilis      Drosophila       Skeletal muscle
                FN     FP        FN     FP        FN     FP
BioProspector   0      9/11      0.4    88/96     0.4    67/81
MEME            0      45/50     0.2    44/50     0.4    46/50
AlignAce        0.5    4/5       0.6    68/70     0.8    8/8
EMCmodule       0      1/3       0.2    7/16      0.4    1/5
FP rates may be overestimated for motifs of currently unknown function
Summary and further improvements
No prior information necessary for motifs, inter-site distances
Decreases error rates through motif selection and motif updating
Can combine previous knowledge (e.g. use TRANSFAC, JASPAR)
and also reveal new information
Improving inter-motif distance model (long-range
dependence?)
Comparative genomics?
Local chromatin structure
Chromatin immunoprecipitation tiling array data
Gupta, M. and Liu, J. S. (2005). De-novo cis-regulatory module elicitation for
eukaryotic genomes. PNAS 102(20):7079-84.
Acknowledgements
Data: W. Thompson (Wadsworth Lab), S. Crews (UNC),
ENCODE project
J. Lieb (UNC), R. Losick (Harvard)
Emma Huang (UNC): analysis of Midline genes data
For reprints, software, questions, contact gupta@bios.unc.edu