Learning the cis regulatory code by predictive modeling

Learning the cis regulatory code
by predictive modeling of gene regulation
(MEDUSA)
Christina Leslie
Center for Computational Learning Systems
Columbia University, NY, USA
http://www.cs.columbia.edu/compbio/medusa
Transcriptional Regulation
Nuclear membrane
Binding site/motif
CCG__CCG
Genome-wide mRNA
transcript data (e.g.
microarrays)
Learning problems:
• Understand which regulators control which target genes
• Discover motifs representing regulatory elements
Previous work: Clustering
• Cluster-first motif discovery
– Cluster genes by expression profile, annotation, …
to find potentially coregulated genes
– Find overrepresented motifs in promoter
sequences of similar genes (algorithms: MEME,
Consensus, Gibbs sampler, AlignACE, …)
(Spellman et al. 1998)
Previous work: “Structure learning”
• Graphical models (and other methods)
– Learn structure of “regulatory network”, “regulatory
modules”, etc.
– Fit interpretable model to training data
– Model small number of genes or clusters of genes
– Many computational and statistical challenges; often used
for qualitative hypotheses rather than prediction
(Pe’er et al. 2001)
(Segal et al. 2003, 2004)
Our work: “Predictive modeling”
• MEDUSA = Motif Element Discrimination Using
Sequence Agglomeration
What is the prediction problem?
– Predict up/down regulation of target genes under different
experimental conditions
Key ideas:
– Learn motifs and identify regulators that predict differential
expression in different contexts → mechanistic inputs
– Obtain single model for all genes and all experiments:
context-specific, no clusters, no parameter tuning
– Accurate predictions on test data
M. Middendorf, A. Kundaje, M. Shah, Y. Freund, C. Wiggins, C. Leslie. Motif
Discovery through Predictive Modeling of Gene Regulation. RECOMB 2005.
MEDUSA: Different view of training data
Learn regulatory program that makes genomewide, context-specific predictions for differential
(up/down) expression of target genes
MEDUSA – Set up
Target gene
analysis, important
regulators
TPK1, USV1,
AFR1, XBP1, …
Training data – Features
regulator expression
promoter sequence
label
feature vector
Boosting (Freund & Schapire 1995)
distribution over
training data
weak rule
Minimize exponential loss function
$$Z_t = \sum_{ge} w^{t}_{ge} \exp\bigl(-\alpha_t\, y_{ge}\, h_t(x_{ge})\bigr)$$
updated weights
$$w^{t+1}_{ge} = w^{t}_{ge} \exp\bigl(-\alpha_t\, y_{ge}\, h_t(x_{ge})\bigr) / Z_t$$
MEDUSA’s weak learner
…AGCTATGCCATCGACTGCTCCAGTCGCACACACAAAGATTTGAG
GCTATAGCTACTTTATAAAGGGGCTACGGCAAATT…
Regulator expression
k-mers
(k≤7)
AGCTATG
GCTATGC
CTATGCC
dimers (gapped elements)
TTT_AAA
GCTA_GCTA
Is GCTATGC present and USV1 up?
Is AGCTATG present and USV1 up?
Is AGCTATG present and USV1 down?
Is GCTATGC present and USV1 up?
Is GCTATGC present and TPK1 up? …
try all motif-regulator
pairs as weak rules …
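The search over motif-regulator pairs can be sketched as follows. This is a simplified illustration under stated assumptions, not the MEDUSA implementation: only exact k-mers are used (no dimers or PSSMs), a rule abstains when it does not fire, and candidates are scored with the abstaining-rule loss 2·sqrt(W+·W−) + W_abstain from confidence-rated boosting. The names `kmers` and `best_weak_rule` and the data layout are hypothetical.

```python
import math
from itertools import product


def kmers(seq, k):
    """All distinct length-k substrings of a promoter sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def best_weak_rule(examples, weights, k=7):
    """Try every (motif, regulator, state) rule and keep the lowest-loss one.

    examples: list of (promoter_seq, regulator_states, label), where
              regulator_states maps regulator name -> +1 (up) / -1 (down) / 0
              and label is +1 (target up) or -1 (target down).
    weights:  boosting weights, one per example.
    """
    motifs, regulators = set(), set()
    for seq, regs, _ in examples:
        motifs |= kmers(seq, k)
        regulators |= set(regs)
    total = sum(weights)
    best_loss, best_rule = float("inf"), None
    for motif, reg, state in product(sorted(motifs), sorted(regulators), (1, -1)):
        w_pos = w_neg = 0.0
        for (seq, regs, label), w in zip(examples, weights):
            # Rule fires: "is motif present and regulator in this state?"
            if motif in seq and regs.get(reg, 0) == state:
                if label > 0:
                    w_pos += w
                else:
                    w_neg += w
        # Assumed loss for a rule that abstains where it does not fire.
        loss = 2.0 * math.sqrt(w_pos * w_neg) + (total - w_pos - w_neg)
        if loss < best_loss:
            best_loss, best_rule = loss, (motif, reg, state)
    return best_loss, best_rule
```

A rule scores well when it fires on many examples that share one label; rules that fire on a mixed set, or on very few examples, keep a high loss.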
Hierarchical sequence agglomeration
• k-mer weak rules ranked by boosting loss:
– Is GCTATGC present and USV1 up?
– Is GCAATGC present and USV1 up?
– Is TCTATGC present and USV1 up?
– Is GCTTTGC present and USV1 up?
– …
• Agglomerate k-mers into PSSMs: GCTATGC, GCAATGC, GGTATGC, CCTAAGC, GCTATTT, …, GGTATGG
• Optimize over offsets when merging k-mers/PSSMs:
- - GCTATGC
GCTATTT - -
• Candidate PSSM weak rules: Is [PSSM] present and USV1 up? …
• Minimize boosting loss → final weak rule
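Merging two k-mers at the best relative offset can be illustrated with plain count matrices. This is a sketch under assumptions: the alignment score here is a simple sum of matching counts over overlapping columns, which stands in for whatever dissimilarity measure MEDUSA actually uses, and all function names are hypothetical.

```python
def kmer_to_counts(kmer):
    """Represent a k-mer as a count matrix: one {base: count} column per position."""
    return [{base: 1} for base in kmer]


def overlap_score(a, b, offset):
    """Sum of count products over aligned columns, with b shifted right by offset."""
    score = 0
    for i, col_a in enumerate(a):
        j = i - offset
        if 0 <= j < len(b):
            score += sum(n * b[j].get(base, 0) for base, n in col_a.items())
    return score


def merge_with_offset(a, b, offset):
    """Add the counts of a and b column by column after shifting b by offset."""
    start = min(0, offset)
    end = max(len(a), offset + len(b))
    merged = []
    for pos in range(start, end):
        column = {}
        for matrix, shift in ((a, 0), (b, offset)):
            i = pos - shift
            if 0 <= i < len(matrix):
                for base, n in matrix[i].items():
                    column[base] = column.get(base, 0) + n
        merged.append(column)
    return merged


def best_merge(a, b, max_shift=2):
    """Pick the offset with the highest overlap score, then merge at that offset."""
    offset = max(range(-max_shift, max_shift + 1),
                 key=lambda o: overlap_score(a, b, o))
    return offset, merge_with_offset(a, b, offset)
```

Because the merged result is itself a count matrix, the same functions can merge a k-mer into an existing matrix, which is what makes the procedure hierarchical. Normalizing each column to frequencies would turn the counts into a PSSM.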
MEDUSA strong rule
• Combine weak rules into a tree structure
• Alternating decision tree = margin-based generalization of decision trees [Freund & Mason 1999]
• Lower nodes are conditionally dependent on higher nodes → can possibly reveal combinatorial interactions
• Able to reveal motifs specific to subsets of target genes
• Able to learn any boolean function
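How an alternating decision tree scores an example can be sketched generically. The tree below is made up for illustration (the node values, motifs and regulators are hypothetical, not learned MEDUSA output), and the dictionary layout is just one convenient encoding: each prediction node carries a value and any number of splitter children, and the score is the sum of the values on every path the example reaches.

```python
def rule(motif, regulator, state):
    """Weak rule: is the motif present and the regulator in the given state?"""
    return lambda x: motif in x["motifs"] and x["regulators"].get(regulator, 0) == state


def node(value, splitters=()):
    """Prediction node with a real-valued contribution and optional splitters."""
    return {"value": value, "splitters": list(splitters)}


def adt_score(tree, x):
    """Sum prediction values along all reached paths; the sign is the predicted label."""
    total = tree["value"]
    for splitter in tree["splitters"]:
        branch = splitter["yes"] if splitter["test"](x) else splitter["no"]
        total += adt_score(branch, x)
    return total


# Hypothetical two-level tree: the second rule is only evaluated
# when the first one fires, so it is conditional on the node above it.
tree = node(0.1, [{
    "test": rule("AGCTATG", "USV1", 1),
    "yes": node(0.8, [{
        "test": rule("TTT_AAA", "TPK1", -1),
        "yes": node(-0.5),
        "no": node(0.0),
    }]),
    "no": node(-0.2),
}])
```

The nested splitter is where a combinatorial interaction would show up: its contribution only applies to examples that already satisfied the rule above it.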
Yeast Environmental Stress Response
• Gasch et al. (2000) dataset, 173 microarrays,
13 environmental stresses
• ~5500 target genes, 475 regulators (237 TF+ 250 SM)
• 500bp upstream promoter sequences
• Binning into +1/0/-1 expression levels based on wildtype vs. wildtype noise
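The binning step can be sketched as a simple threshold on log expression ratios. The slides do not say how the noise level is derived from the wildtype vs. wildtype comparison, so the rule used here (a multiple of the standard deviation of control log ratios) and the names `threshold_from_controls`, `discretize` and `n_sd` are assumptions for illustration only.

```python
import statistics


def threshold_from_controls(control_log_ratios, n_sd=2.0):
    """Assumed noise cutoff: n_sd standard deviations of control log ratios."""
    return n_sd * statistics.pstdev(control_log_ratios)


def discretize(log_ratio, noise_threshold):
    """Map a log expression ratio to +1 (up), -1 (down) or 0 (within noise)."""
    if log_ratio > noise_threshold:
        return 1
    if log_ratio < -noise_threshold:
        return -1
    return 0
```

Only the +1 and -1 calls are candidates for the up/down prediction problem; values inside the noise band carry no reliable direction.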
Statistical validation
• 10-fold cross-validation (held-out experiments), ~60,000
(gene,experiment) training examples, 700 iterations
• (N_k-mers + N_dimers + N_PSSMs) × N_reg × 2 ≈ 10^7 possible weak rules at every node
• MEDUSA’s motifs give a better prediction accuracy
on held-out experiments than database motifs
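Holding out whole experiments, rather than random (gene, experiment) pairs, can be sketched as below. The fold assignment, the seed and the function names are illustrative assumptions. The point is that every example from a held-out experiment lands in the test set together.

```python
import random


def experiment_folds(experiments, n_folds=10, seed=0):
    """Shuffle experiment ids and deal them into n_folds disjoint groups."""
    shuffled = list(experiments)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_folds] for i in range(n_folds)]


def split_examples(examples, held_out):
    """Split (gene, experiment, label) examples by held-out experiment ids."""
    held = set(held_out)
    train = [ex for ex in examples if ex[1] not in held]
    test = [ex for ex in examples if ex[1] in held]
    return train, test
```

With 173 microarrays and 10 folds, each fold holds out 17 or 18 experiments, so the model is always evaluated on conditions it has never seen during training.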
Yeast ESR: Biological Validation
Universal stress
repressor motif
STRE element
Yeast ESR: Biological Validation
Important regulators identified by MEDUSA
Cellular localization
of MSN2/4
Segal et al. 2003
Universal stress
repressor
Visualizing MEDUSA motifs
[Figure: motif logos numbered 1, 2, 3, 5, 8, 14, 16; labeled sequences AAATTT and TAAGGG]
Biological validation – Context-specific analysis
• Restrict regulatory program to particular target genes T, experimental conditions E → smaller model
• Further statistical pruning of features using margin-based score:
$$\sum_{g \in T,\, e \in E} y_{ge}\,\bigl[F(x_{ge}) - F_{f}(x_{ge})\bigr]$$
• Identify most significant context-specific regulators and motifs for target set
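One way to read the margin-based score is as the total margin lost on the restricted examples when a single feature is taken out of the model. The sketch below makes that reading explicit. It assumes the second term of the score is the model output with feature f removed, which the slide does not define, and it uses a flat additive model instead of a tree. The name `margin_score` and the data layout are hypothetical.

```python
def margin_score(examples, features, drop):
    """Total margin contributed by one feature on a restricted example set.

    examples: list of (x, y) pairs already restricted to targets T and
              conditions E, with y in {-1, +1}.
    features: dict name -> (alpha, h), where h(x) returns 0 or 1.
    drop:     name of the feature whose contribution is being scored.
    """
    def model(x, skip=None):
        # Flat additive score; an assumption standing in for the tree model.
        return sum(alpha * h(x)
                   for name, (alpha, h) in features.items()
                   if name != skip)

    return sum(y * (model(x) - model(x, skip=drop)) for x, y in examples)
```

Features with a large positive score are the ones the restricted predictions depend on most; features scoring near zero can be pruned for that context.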
Biological validation – Context-specific analysis
• Example: oxygen sensing and regulation in yeast
(collaborator: Li Zhang)
Biological validation – Context-specific analysis
• Example: oxygen and heme inducible targets
Biological validation – Network inference
• Regulator-motif associations in nodes can have different meanings:
– Direct binding (figure labels: P, M_P)
– Indirect effect (figure labels: P, TF, M_TF)
– Co-occurrence (figure labels: P, M_P, M)
• Need other data to confirm binding relationship between regulator and target (e.g. ChIP-chip)
• Still, can determine statistically significant
regulator-target relationships from regulation
program
Biological validation – Network inference
• Example: oxygen sensing and regulatory network
Discussion: What does “predictive” mean?
At least 2 usages:
• Makes accurate quantitative predictions
– Can assess predictions statistically, i.e. on test data
– Gives us confidence that model contains biologically
relevant information
vs.
• Generates biological hypotheses
– Without statistical validation, can only evaluate quality of
hypotheses through experiments
– Issues: How much of model is correct? How many false
positives? Is a network “edge” a meaningful prediction?
(Cf. DREAM initiative)
Discussion: “Predictive” modeling
• “Manifesto”
– We’re interested in hypothesis generation, but still must
give statistical validation on test data, i.e. show that
you’re not overfitting
– Not enough to show that model is non-random, e.g. good
p-values for functional enrichment
• Possible goal: move towards making useful
predictions for actual wet-lab experiments (e.g.
fewer input variables in model)
• MEDUSA: statistically predictive model, can still
interpret to extract biological hypotheses
Ongoing MEDUSA-related projects
• Oxygen sensing and regulation in yeast
(collaborator: Li Zhang, Public Health @ Columbia)
• Regulation of and by microRNAs in humans
(collaborators: Sander group, Sloan Kettering)
• Sequence information controlling tissue-specific
alternative splicing (collaborator: Larry Chasin,
Biology @ Columbia)
• Integration of phosphorylation (“kinome”) data to
reconstruct signaling pathways
• New Java MEDUSA software package – soon to be
released
http://www.cs.columbia.edu/compbio/medusa
Thanks
• Manuel Middendorf (Physics)
• Anshul Kundaje (CS)
• David Quigley (DBMI)
• Steve Lianoglou (CS)
• Xuejing Li (Physics)
• Mihir Shah (CS)
• Marta Arias (CCLS)
• Chris Wiggins (APAM)
• Yoav Freund (CS@UCSD)
Funding: NIH (MAGNet NCBC grant)
Visualizing MEDUSA motifs
• Pruning based on feature dependence statistic:
$$\sum_{g,e} y_{ge}\,\bigl[F(x_{ge}) - F_{f}(x_{ge})\bigr]$$
Biological validation – Binding data
• ChIP-chip: genome-wide protein-DNA binding data, i.e. what promoters are bound by TF?
• Investigate regulatory network model: use ChIP-chip data in place of motifs (no motif discovery)
– Features: (regulator, TF-occupancy) pairs
[Figure labels: P1, P2, TF]
Biological validation – Target gene analysis
• Restrict to target genes = protein chaperones;
experiments = heat shock, hypo/hyper-osmolarity
– CMK2 with HSF1 occupancy
(CaMKII mammalian ortholog
interacts with HSF1)
Biological validation – Signaling molecules
• Find all SMs that associate as regulators with a
particular TF’s ChIP occupancy in ADT features
• e.g. SM mRNA: Gac1, Sds22, Gip1 (Glc7 phosphatase complex); TF: Hsf1
• Hypothesis: Glc7 phosphatase complex interacts
with Hsf1 in regulation of Hsf1 targets
(Interaction supported in literature)
Update: Protein fold recognition
• SVM classifiers with string kernels for remote
homology detection, fold recognition
YPNTDIGDPSYPHIGIDIKSVRSKKTAKWNMQNGK (protein sequence)
[Figure: protein sequence → profile → k-mer based kernel computation → SVM → prediction of structural class]
R. Kuang, E. Ie, K. Wang, K. Wang, M. Siddiqi, Y. Freund, C. Leslie. Remote homology
detection and motif extraction using profile-based string kernels. JBCB 2005.
SVM-Fold web server (soon to be deployed)