Computational Biology and Gene Expression

Detection and analysis of
transcriptional control sequences
Wyeth Wasserman
October VanBUG Seminar
Centre for Molecular Medicine and Therapeutics
Children’s and Women’s Hospital
University of British Columbia
Transcription Simplified
CMMT
[Diagram labels: URF, Pol-II, URE, TATA]
Overview of Transcription
in Gene Regulation
• At the most basic level, transcriptional
regulation is defined by binding of TFs to DNA
• Complexity is increased by TF interactions,
chromatin structure and protein modifications
• How can we advance our understanding of
regulation by computational analysis?
A short history lesson…
Representing Binding Sites
for a TF (HNF1)
• A single HNF1 site
» AAGTTAATGATTAAC
• A set of sites represented as a consensus
» VDRTWRWWSHDWVWH
• A matrix describing a set of sites
     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
A   14 16  4  0  1 19 20  1  4 13  4  4 13 12  3
C    3  0  0  0  0  0  0  0  7  3  1  0  3  1 12
G    4  3 17  0  0  2  0  0  9  1  3  0  5  2  2
T    0  2  0 21 20  0  1 20  1  4 13 17  0  6  4
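As a quick check on the matrix representation, the following minimal Python sketch transcribes the HNF1 counts and picks the most frequent base in each column. The names `pfm`, `majority_sequence` and `column_totals` are illustrative, not from the talk. Under these assumptions the majority base per column reproduces the single HNF1 site listed above.

```python
# HNF1 position frequency matrix transcribed from the slide
# (15 columns; every column sums to 21).
pfm = {
    "A": [14, 16, 4, 0, 1, 19, 20, 1, 4, 13, 4, 4, 13, 12, 3],
    "C": [3, 0, 0, 0, 0, 0, 0, 0, 7, 3, 1, 0, 3, 1, 12],
    "G": [4, 3, 17, 0, 0, 2, 0, 0, 9, 1, 3, 0, 5, 2, 2],
    "T": [0, 2, 0, 21, 20, 0, 1, 20, 1, 4, 13, 17, 0, 6, 4],
}


def majority_sequence(pfm):
    """Return the most frequent base per column (ties broken alphabetically)."""
    width = len(pfm["A"])
    return "".join(max("ACGT", key=lambda b: pfm[b][i]) for i in range(width))


def column_totals(pfm):
    """Return the number of sites counted in each column."""
    width = len(pfm["A"])
    return [sum(pfm[b][i] for b in "ACGT") for i in range(width)]


print(majority_sequence(pfm))  # AAGTTAATGATTAAC
```

A majority-base string discards most of the information in the matrix, which is why the slides move on to weighted profiles.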
PFMs to PWMs
One would like to add the following features to the model:
1. Correcting for the base frequencies in DNA
2. Weighting for the confidence (depth) in the pattern
3. Convert to log-scale probability for easy arithmetic
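f matrix (counts, N = 5 sites)
     1     2     3     4     5
A    5     0     1     0     0
C    0     2     2     4     0
G    0     3     1     0     4
T    0     0     1     1     1

$$w(b,i) = \log_2 \frac{\left(f(b,i) + \sqrt{N}/4\right) / \left(N + \sqrt{N}\right)}{p(b)}$$

w matrix (values correspond to N = 5 and p(b) = 0.25)
     1     2     3     4     5
A   1.6  -1.7  -0.2  -1.7  -1.7
C  -1.7   0.5   0.5   1.3  -1.7
G  -1.7   1.0  -0.2  -1.7   1.3
T  -1.7  -1.7  -0.2  -0.2  -0.2

Score of TGCTG = -1.7 + 1.0 + 0.5 - 0.2 + 1.3 = 0.9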
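The count-to-weight conversion can be sketched in Python as follows. This is a minimal sketch under two assumptions: a total pseudocount of sqrt(N) split according to the background, and a uniform background p(b) = 0.25. Both were chosen because they reproduce the w values shown on the slide. The function names are illustrative.

```python
import math

# Count matrix f from the slide: 5 sites, 5 columns.
f = {
    "A": [5, 0, 1, 0, 0],
    "C": [0, 2, 2, 4, 0],
    "G": [0, 3, 1, 0, 4],
    "T": [0, 0, 1, 1, 1],
}


def pfm_to_pwm(f, background=None):
    """Convert counts to log2-odds weights (assumed sqrt(N) pseudocount)."""
    background = background or {b: 0.25 for b in "ACGT"}
    width = len(f["A"])
    n = sum(f[b][0] for b in "ACGT")  # number of sites, same in every column
    pseudo = math.sqrt(n)             # total pseudocount, split by background
    return {
        b: [
            math.log2(
                (f[b][i] + pseudo * background[b]) / (n + pseudo) / background[b]
            )
            for i in range(width)
        ]
        for b in "ACGT"
    }


def score(pwm, site):
    """Sum the weights of the bases in `site`, one per matrix column."""
    return sum(pwm[base][i] for i, base in enumerate(site))


w = pfm_to_pwm(f)
print(round(score(w, "TGCTG"), 1))  # 0.9
```

Because the weights are on a log scale, scoring a candidate site is a simple sum over positions, which is the "easy arithmetic" referred to in point 3.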
Performance of Profiles
• 95% of predicted sites bound in vitro
(Tronche 1997)
• MyoD binding sites predicted about once
every 600 bp (Fickett 1995)
• The Futility Theorem
– Nearly 100% of predicted transcription factor
binding sites have no function in vivo
A 1 kbp promoter screened
with a collection of TF profiles
Phylogenetic Footprinting
70,000,000 years of evolution reveals
most regulatory regions.
Phylogenetic Footprinting
to Identify Functional Segments
[Figure: % identity (y-axis) against 200 bp window start position in the human sequence (x-axis). Actin gene compared between human and mouse with DPB.]
Regulatory sites are usually conserved
between orthologous genes
HUMAN ACGATACGCATCACAGACT.ACAGACTACGGCTAGCA
      -|-|||||||||-|---|--|||-------|-|---|
MOUSE GCAATACGCATCGCGATCAGACATCAGCACG.TGTGA

HUMAN ACATCAGCATACACGCAACTACACAGACTACGACTA
      ---|||||-||||---|-|----||-||-||||--
MOUSE CGTTCAGCTTACAGCTAGCATAGCATACGACGATAC
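A windowed identity profile of the kind plotted for the actin gene can be sketched in Python as below. It assumes the two sequences are already aligned and of equal length, with `.` or `-` marking gaps. The function name and parameters are illustrative and do not describe how DPB or ConSite work internally.

```python
def window_identity(seq_a, seq_b, window=200, step=1, gap_chars=".-"):
    """Percent identity per window for two pre-aligned, equal-length sequences.

    Returns a list of (window_start, percent_identity) tuples. Gap columns
    never count as matches.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    results = []
    for start in range(0, len(seq_a) - window + 1, step):
        a = seq_a[start:start + window]
        b = seq_b[start:start + window]
        matches = sum(1 for x, y in zip(a, b) if x == y and x not in gap_chars)
        results.append((start, 100.0 * matches / window))
    return results


# Toy usage with the first alignment block shown above and a 10 bp window.
human = "ACGATACGCATCACAGACT.ACAGACTACGGCTAGCA"
mouse = "GCAATACGCATCGCGATCAGACATCAGCACG.TGTGA"
for start, identity in window_identity(human, mouse, window=10, step=5):
    print(start, identity)
```

Windows whose identity exceeds a chosen cutoff would then be kept as candidate conserved segments; the cutoff itself is a tuning choice not specified here.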
The 1kbp promoter screen
with footprinting
Choosing the “right” species...
(BONUS: What’s the ultimate sin in bioinformatics?)
[Figure: pairwise comparisons CHICKEN vs. HUMAN, MOUSE vs. HUMAN, and COW vs. HUMAN]
ConSite (www.phylofoot.org)
Performance: Human vs. Mouse
• Testing set: 40 experimentally defined sites in 15
well-studied genes
• 85-95% of defined sites detected with conservation filter,
while only 11-16% of total predictions retained
de novo Discovery
of TF Binding Sites
Unraveling Transcriptional
Control Mechanisms
Given a set of “co-regulated” genes,
define motifs over-represented in the regulatory regions
Pattern Detection Methods
• Exhaustive
– e.g. “Moby Dick” (Bussemaker, Li & Siggia)
– Identify over-represented oligomers in comparison of
“+” and “-” (or complete) promoter collections
• Monte Carlo/Gibbs Sampling
– e.g. AnnSpec (Workman & Stormo)
– Identify strong patterns in “+” promoter collection vs.
background model of expected sequence characteristics
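The exhaustive idea of comparing oligomer frequencies between a "+" and a "-" promoter collection can be illustrated with a small Python sketch. This is a deliberately simple frequency-ratio ranking with an assumed pseudocount, not the "Moby Dick" algorithm or any published method. All names and the toy sequences are hypothetical.

```python
from collections import Counter


def kmer_counts(sequences, k):
    """Count every overlapping k-mer across a collection of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def enriched_kmers(positive, negative, k=6, pseudocount=1.0):
    """Rank k-mers by frequency in the "+" set relative to the "-" set."""
    pos, neg = kmer_counts(positive, k), kmer_counts(negative, k)
    pos_total = sum(pos.values()) or 1
    neg_total = sum(neg.values()) or 1
    ratios = {}
    for kmer, count in pos.items():
        pos_freq = (count + pseudocount) / (pos_total + pseudocount)
        neg_freq = (neg[kmer] + pseudocount) / (neg_total + pseudocount)
        ratios[kmer] = pos_freq / neg_freq
    return sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)


# Toy usage: the shared hexamer ACGCGT is planted in both "+" sequences.
plus = ["AAACGCGTAAA", "TTACGCGTTT"]
minus = ["AAATTTAAATTT", "TTTAAATTTAAA"]
print(enriched_kmers(plus, minus, k=6)[0][0])  # ACGCGT
```

A real analysis would replace the raw ratio with a statistical test and correct for the number of oligomers examined.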
Yeast Regulatory Sequence
Analysis (YRSA) system
Yeast tests of YRSA System
• DNA-damage response, partially mediated by MCB
• Classic cell-cycle array data, re-clustered by Getz et al.
• PDR3-regulated genes from array study
THE PROBLEM:
Pattern Detection in Long Sequences
[Figure: MEF2 similarity score (y-axis, 10 to 18) against sequence length (x-axis, 0 to 600) for the MEF2 set and a random set]
Four Approaches to
Extend Sensitivity
• Phylogenetic Footprinting
– Human-Mouse eliminates ~75% of sequence
• Better background models
– e.g. AnnSpec
• Better definition of co-regulation
– Microarrays occasionally produce noise
• Use biochemical knowledge about TFs
– TFBS patterns are NOT random
Some characteristics have been explored…
• Segmentation: informative positions separated
by variable positions (proteins bind as dimers)
• Positional Variance: subset of positions
contain most of the info
• Palindromes are common in the patterns
Our Hypothesis
• Point 1: Structurally-related DNA binding
domains interact with similar target sequences
• Exceptions exist (e.g. Zn-fingers)
• Point 2: There are a finite number of binding
domains used in human TFs
• Approximately 20-25
• Idea: We could use the shared binding properties
for each family to focus pattern detection methods
• Constrain the range of patterns sought
Comparison of profiles requires
alignment and a scoring function
• Scoring function based on sum of
squared differences
• Align frequency matrices with modified
Needleman-Wunsch algorithm
• Calculate empirical p-values based on
simulated set of matrices
[Figure: frequency distribution of scores]
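A column score based on the sum of squared differences can be sketched in Python as below. To keep the example short it uses an ungapped sliding alignment as a stand-in for the modified Needleman-Wunsch step, so it is a simplification rather than the method used in the talk. The `2 - SSD` form, the `min_overlap` parameter and all function names are assumptions made for illustration.

```python
def normalize(counts):
    """Turn a list of count columns into frequency columns."""
    return [[c / sum(col) for c in col] for col in counts]


def column_score(col_a, col_b):
    """Similarity of two frequency columns: 2 minus the sum of squared differences.

    Identical columns score 2; columns with all mass on different bases score 0.
    """
    return 2.0 - sum((a - b) ** 2 for a, b in zip(col_a, col_b))


def best_ungapped_alignment(m1, m2, min_overlap=3):
    """Slide m2 along m1; return (best summed score, offset of m2 relative to m1)."""
    best = (float("-inf"), 0)
    for offset in range(-(len(m2) - min_overlap), len(m1) - min_overlap + 1):
        total = 0.0
        for j, col in enumerate(m2):
            i = offset + j
            if 0 <= i < len(m1):
                total += column_score(m1[i], col)
        best = max(best, (total, offset))
    return best


# Toy usage: a 4-column matrix compared with its last three columns.
m = normalize([[8, 0, 0, 0], [0, 8, 0, 0], [0, 0, 8, 0], [0, 0, 0, 8]])
print(best_ungapped_alignment(m, m[1:]))  # (6.0, 1)
```

An empirical p-value could then be estimated by scoring the query against many simulated matrices and counting how often the simulated score is at least as high.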
Prediction of TF Class
[Diagram: COMPARE a profile against the TF database (JASPAR); example result: match to bHLH]
• Jackknife test: 87% correct
• Independent test set: 93% correct
Familial Binding Profiles (FBPs) enhance
sensitivity of pattern detection
APPLICATION:
Cancer Protection Response
• Detoxification-related enzymes are induced by
compounds present in broccoli
• Arrays, SSH and hard work have defined a set of
responsive genes
• A known element mediates the response
(Antioxidant Responsive Element)
• Controversy over the type of mediating leucine
zipper TF
• NF-E2/Maf or Jun/Fos
Application (2)
[Diagram boxes: Gibbs Sampling; Classify New TF Motif; Gibbs with FBP Prior]
Maf (p<0.02)
Jun (p<0.98)
Problem: Given a set of co-regulated genes, determine the
common TFBS. Classify the mediating TF. We expect a
leucine zipper-type TF.
Regulatory Modules
TFs do NOT act in isolation
Layers of Complexity in Metazoan Transcription
Chromatin picture used with
permission of Zymogenetics.
Liver Differentiation
(data mostly from studies of hepatocytes)
[Diagram: stages labeled Stem, Early Fetal, and Mature; TFs shown: CEBPa, HNF3, HNF1, HNF4]
Liver regulatory modules
Models for Liver TFs…
(Data that takes 2 months to produce and 10 seconds to present)
(Or, what to do with an astrophysicist new to bioinformatics)
[Figure: profile models for HNF1, C/EBP, HNF3, and HNF4]
Training predictive models
for modules
• Limited by small size of positive training
set
• We elected to use logistic regression
analysis for the first models
• Your favorite statistical approach would
probably do equally well
– data limited
Logistic Regression Analysis
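• Each of the four inputs is multiplied by a weight (a1, a2, a3, a4),
and the weighted inputs are summed (Σ) to give the “logit”
• Optimize a vector to maximize
the distance between output values
for positive and negative training
data.
• Output value is:
$$p(x) = \frac{e^{\mathrm{logit}}}{1 + e^{\mathrm{logit}}}$$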
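The output function can be written as a few lines of Python. This is a minimal sketch: the weights and scores are placeholders, the optional `intercept` is an assumption not shown on the slide, and the function name is illustrative. The two branches are algebraically the same formula, arranged to avoid overflow for large positive or negative logits.

```python
import math


def module_probability(scores, weights, intercept=0.0):
    """Logistic regression output for one window of per-TF scores.

    logit = intercept + sum(a_i * x_i);  p = e^logit / (1 + e^logit)
    """
    logit = intercept + sum(a * x for a, x in zip(weights, scores))
    if logit >= 0:
        # Equivalent form that avoids overflow in exp() for large logits.
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)


# Toy usage with four hypothetical TF scores and weights a1..a4.
print(module_probability([1.0, 1.0, 1.0, 1.0], [0.5, 0.5, 0.5, 0.5]))
```

Fitting the weight vector itself would normally be done with a standard statistics package on the positive and negative training windows.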
UDPGT1 (Gilbert’s Syndrome)
[Figure: liver module model score (y-axis, -0.2 to 1) against “window” position in sequence (x-axis, 100 to 5840) for wildtype and mutant]
PERFORMANCE
• Liver (Genome Research, 2001)
– At 1 hit per 35 kbp, identifies 60% of modules
– Limited to genes expressed late in liver
development
• Skeletal Muscle (JMB, 1998)
– Set to 1 prediction per 35 kbp
– Identifies 66% of test set correctly
LRA models do not account for
multiple sites for the same TF*
* Side-track: Newer Methods
Combining Phylogenetic
Footprinting with a Module Model
Genome Scan
• Screened the available mouse genomic
sequences (~300 MB) for modules and
discarded hits for which sequence was not
conserved with human (BLAST)
• Removed regions for which corresponding
human sequence did not score as module
• Of ~100 predicted modules
• 20 annotated genes: 5 from training, 3 additional
modules, 5 liver-specific, 3 unknown and 4 not liver
de novo Discovery
of Regulatory Modules
Focus on regulatory modules
for pattern detection
[Diagram steps: Cluster Genes by Expression; Identify and Model Contributing TFs; Predictive Models]
Example count matrix from the diagram (7 columns, each summing to 8):
6 0 0 0 7 0 0
2 8 4 7 1 0 2
0 0 4 0 0 8 0
0 0 0 1 0 0 6
Finding binding sites in sets of
co-regulated human genes
• Sequence “space” is too large
– Narrow with Phylogenetic Footprinting
• Identify patterns in conserved blocks via
Gibbs sampling
• Assess quality of patterns based on
biological knowledge
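The Gibbs sampling step can be illustrated with a very small site sampler in Python. It is a teaching sketch under strong assumptions: exactly one site of fixed width per sequence, a uniform background, sequences containing only A, C, G and T, and a fixed number of iterations. It is not AnnSpec or the sampler used in the talk, and every name and default value is hypothetical.

```python
import math
import random


def gibbs_sampler(seqs, width, iterations=200, seed=0, pseudo=0.5):
    """Tiny Gibbs site sampler; returns one motif start position per sequence."""
    rng = random.Random(seed)
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        for held in range(len(seqs)):
            # Build column counts from every sequence except the held-out one.
            counts = [{b: pseudo for b in "ACGT"} for _ in range(width)]
            for k, (seq, pos) in enumerate(zip(seqs, starts)):
                if k != held:
                    for j in range(width):
                        counts[j][seq[pos + j]] += 1
            total = len(seqs) - 1 + 4 * pseudo
            # Weight each candidate start in the held-out sequence by its
            # likelihood ratio against a uniform background, then sample one.
            seq = seqs[held]
            weights = []
            for pos in range(len(seq) - width + 1):
                log_ratio = sum(
                    math.log(counts[j][seq[pos + j]] / total / 0.25)
                    for j in range(width)
                )
                weights.append(math.exp(log_ratio))
            starts[held] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts


# Toy usage: in practice the input would be the conserved blocks only.
blocks = ["TTTTACGCGTTTTT", "AAACGCGTAAAAAA", "GGGGGGACGCGTGG", "CACGCGTCCCCCCC"]
print(gibbs_sampler(blocks, width=6, iterations=50))
```

Restricting the input to conserved blocks shrinks the number of candidate start positions per sequence, which is the point of narrowing the sequence space before sampling.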
Phylogenetic Footprinting to
Identify Conserved Regions
Skeletal Muscle Genes
• One of the most extensively studied tissues for
transcriptional regulation
– 45 genes partially analyzed
– 26 genes with orthologous genomic sequence from
human and rodent
• Five primary classes of transcription factors
– Principal: Myf (myoD), Mef2, SRF
– Secondary: Sp1 (G/C rich patches), Tef (subset of
skeletal muscle types)
Regulatory regions directing
muscle-specific transcription
MyoD/Myf
SRF
Mef2
Tef
de novo Discovery of Skeletal Muscle
Transcription Factor Binding Sites
Mef2-Like
SRF-Like
Myf-Like
We will soon be able to define
modules for many contexts…
A gene-centric data
integration project...
COMING SOON:
The Integrated Module Sampler
[Diagram: Gene1 to Gene5; calls to GeneLynx; calls to ensEMBL; calls to BlastZ (switch to Lagan?); Module Sampler]
YOU SHOULD HAVE
BEEN THERE… THIS
SLIDE EXCLUDED FROM
THE POSTED FILE
Conclusions
• Evolution drives understanding in biology
– Phylogenetic Footprinting
• Biochemistry inspires Bioinformatics
– Regulatory Modules
– Familial Binding Profiles
• Analysis of regulatory sequences is improving
– Given sets of orthologous genes, one can predict regulatory
regions
– Given sets of co-regulated genes, it is possible to infer the
binding profiles for critical transcription factors
• Much more work is needed…
THANKS!
Wasserman Group – CMMT
Danielle Kemmer
Several Newcomers
Wasserman Group - Sweden
Albin Sandelin
Raf Podowski (CA)
Wynand Alkema
Collaborating Students
Malin Andersson (Odeberg)
Öjvind Johansson (Lagergren)
Hui Gao (Dahlman-Wright)
Emily Hodges (Höög)
Collaborators
Chip Lawrence (Wadsworth)
Boris Lenhard (K.I.)
Jens Lagergren (SBC)
Christer Höög (K.I.)
Brenda Gallie (OCI)
Jacob Odeberg (KTH)
Niclas Jareborg (AZ)
William Hayes (AZ)
Group Alumni
Per Engström
Elena Herzog
Annette Höglund
William Krivan
Boris Lenhard
Luis Mendoza
Support: Merck-Frosst, C&W, Pharmacia, EU–Marie Curie,
CGDN, KI-Funder
URLs...
• Group: www.cmmt.ubc.ca
• ConSite/DPB: www.phylofoot.org
• GeneLynx: www.genelynx.org