Introduction to TFBS

Finding Transcription Factor
Binding Sites
BNFO 602/691
Biological Sequence Analysis
Mark Reimers, VIPBG
Why TFBS?
• Protein transcription factors were among the
first gene regulators identified
• Many TFBS have distinct sequences, which
were suitable for bioinformatics analysis
• Now seen as one layer of mammalian gene
regulation, along with 3D structure of
chromatin, histones, anti-sense lncRNAs,
miRNAs, sequestration and transport
Outline
• Transcription factors and DNA-binding
proteins
• Factors affecting TF binding
• DNA motifs & PSWM
Transcription Factors and DNA-Binding Proteins
• Several dozen distinct families of proteins
have independently evolved binding to
specific DNA configurations (sequences)
• They perform a variety of functions in
organizing DNA or regulating transcription
• Usually several are involved in initiating gene
transcription
Transcription Factor Biology
• Most bind in major groove of DNA
• Many bind as dimers
• Typically TFs expressed at low levels
– Small changes in expression have big effects
• Mostly phosphorylated or otherwise activated
by other proteins
– Typically end-points of signalling cascades from
receptors on cell surface or in cytoplasm
Transcription Factor Families
• Over 40 known families
• In mammals majority of
TFs from three families:
– C2H2 zinc-finger
• (675 TFs in human)
– Homeodomain
• (257 TFs)
– Helix–loop–helix
• (87 TFs)
From Nature education
Transcription Factor Binding Motifs
• Binding of most factors is very specific, and only
significant at a small fraction of sites
• For many factors much of that specificity is captured
by the typical DNA sequence
• Sequence specificity often represented by motif
– Not all sites well represented by motif
NRF1 binding is well represented by motif
CTCF binding not captured well by motif
Transcription Factor Binding Sites
• ChIP-Seq experiments in the human genome
typically find from several hundred to >20,000
locations where a particular TF is binding
• Binding sites may be stronger or weaker
A typical set of ChIP-Seq reads for HNF4a (from BayesPeak paper)
TFs Often Bind Cooperatively
• Most TFBS occur in clusters in promoters or
enhancers/silencers with several others of
different kinds
• Usually only a few of these are actually
functional in any one cell type
• Different clusters operate in different cell
types
Dynamics of TF Binding
• TF comes on and off the DNA site, often
cycling in minutes or seconds
• Cooperative binding stabilizes TF
• Many TFs act in response to signals or stresses
– Not captured systematically in most samples
TFBS Locations Often Evolve Rapidly
• Most enhancer TFBS in human do not align to
TFBS in mouse
From Schmidt et al, Science 2010
From Odom et al, Nature Genetics 2007
Factors Affecting TF Binding - I
• Most TFs occupy less than a few percent of their
consensus target sites in the genome
• Chromatin state
– Most TFs can only recognize their motif(s) if the DNA is
relatively open
– Some ‘pioneer’ factors bind to their sites in the 30 nm fiber and
open up chromatin for others
Zaret & Carroll, Genes Dev, 2012
Factors Affecting TF Binding - II
• Allosteric hindrance
– Presence of another TF on
opposite side hinders binding by
spreading major groove
Kim et al Science 2013
• Cooperative Binding
– Some TFs provide binding sites or
enhance binding of specific
others to DNA
– Promotes non-linear step-like
expression response to stimuli
Spitz et al Nature Rev Genetics 2012
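One common way to illustrate why cooperative binding gives a step-like response is a Hill function. This is a minimal sketch, not a model taken from the cited papers: the Hill form, the function name `hill_occupancy`, and the values of `kd` and `n` are all assumptions chosen for illustration.

```python
def hill_occupancy(conc, kd, n):
    """Fraction of sites occupied under a Hill model.

    conc: TF concentration (arbitrary units, assumed)
    kd:   concentration giving half-maximal occupancy (assumed)
    n:    Hill coefficient; n = 1 is non-cooperative, n > 1 cooperative
    """
    return conc**n / (kd**n + conc**n)

# Compare a non-cooperative (n=1) and a cooperative (n=4) response
# over the same range of TF concentrations.
for conc in (0.25, 0.5, 1.0, 2.0, 4.0):
    independent = hill_occupancy(conc, kd=1.0, n=1)
    cooperative = hill_occupancy(conc, kd=1.0, n=4)
    print(f"conc={conc:<5} n=1: {independent:.3f}   n=4: {cooperative:.3f}")
```

With the larger Hill coefficient, occupancy stays near zero below `kd` and rises steeply just above it, which is the switch-like behaviour described in the bullet.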
TFBS Motifs Are Stable Over Evolution
• Most transcription factors favor almost the
same motifs in humans and in mice (and in
lizards … and often even in flies)
From Odom et al, Nature Genetics 2007
Position-Specific Weight Matrices
Represent TFBS Better than Motifs
• Represent log of probability of each base
occurring at each position in TFBS
• Often used to scan along genome calculating
log-likelihood at each position
A composite PSWM scan for SP1 (from PEAKS webpage)
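The scan described above can be sketched in a few lines of Python. This is an illustration under stated assumptions: the aligned sites are made-up GC-box-like strings (not real SP1 data), the background is uniform, scores are log2 odds against that background, and the names `build_pswm` and `scan` are hypothetical.

```python
import math

BASES = "ACGT"

def build_pswm(sites, background=None, pseudocount=1.0):
    """Build a log-odds PSWM (one dict per position) from aligned sites."""
    background = background or {b: 0.25 for b in BASES}
    width = len(sites[0])
    pswm = []
    for i in range(width):
        # Pseudocounts keep unseen bases from getting a score of -infinity.
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[i]] += 1
        total = sum(counts.values())
        pswm.append({b: math.log2(counts[b] / total / background[b]) for b in BASES})
    return pswm

def scan(sequence, pswm):
    """Return (start, log-odds score) for every window along the sequence."""
    width = len(pswm)
    return [
        (start, sum(col[base] for col, base in zip(pswm, sequence[start:start + width])))
        for start in range(len(sequence) - width + 1)
    ]

# Toy aligned binding sites (illustrative only).
sites = ["GGGCGG", "GGGCGG", "GGGCGT", "GGGAGG", "TGGCGG"]
pswm = build_pswm(sites)
hits = scan("ATATGGGCGGATAT", pswm)
best_start, best_score = max(hits, key=lambda h: h[1])
print(best_start, round(best_score, 2))
```

In practice a score threshold is applied to the scan, and both strands are scanned; this sketch covers only the forward strand.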
TFBS Motif Databases
• JASPAR - http://jaspar.genereg.net/
– High-quality curated public data
• TRANSFAC - http://www.biobaseinternational.com/product/transcriptionfactor-binding-sites
– Commercial product with dated public version
• Several research groups doing genome-wide
characterizations by various means
Finding TFBS and Motifs in Animals
• Sequence-based methods
– Scan for known TFBS motifs
– If several co-regulated genes are available, use HMM or Gibbs
sampler to identify common motif in them
• Data-based methods
– Use ChIP to identify locations of binding
• Needs good antibody; often picks up indirect binding
– Compare promoters across genomes
• Need depth; miss enhancers and species-related changes
– Look for DNase footprints
– Use SELEX or dsDNA microarray to profile TF’s DBD
Other Approaches to Finding TFBS
• Systematic Evolution of Ligands by Exponential
Enrichment (SELEX)
From Jolma et al, Cell, 2013
Generate random DNA sequence
library of moderate length. The
sequences in the library are exposed
to the target ligand, and those that
do not bind the target are removed
by affinity chromatography.
The bound sequences are eluted,
and then amplified by PCR, and the
process is run again under more
stringent elution conditions to purify
the tightest-binding sequences.
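The enrichment logic of repeated selection and amplification can be illustrated with a toy calculation. This is a sketch under loud assumptions: the three sequence classes, their starting frequencies and their per-round binding probabilities are invented, PCR is treated as unbiased, and stringency is held constant rather than increased between rounds.

```python
def selex_round(freqs, bind_prob):
    """One selection + amplification round.

    Each sequence class is retained in proportion to its binding probability,
    then unbiased PCR restores the pool (modelled here as renormalisation).
    """
    selected = {seq: f * bind_prob[seq] for seq, f in freqs.items()}
    total = sum(selected.values())
    return {seq: v / total for seq, v in selected.items()}

# Hypothetical starting library and binding probabilities.
freqs = {"strong": 0.01, "weak": 0.09, "nonbinder": 0.90}
bind_prob = {"strong": 0.8, "weak": 0.2, "nonbinder": 0.01}

history = [freqs]
for _ in range(4):
    freqs = selex_round(freqs, bind_prob)
    history.append(freqs)

for round_no, pool in enumerate(history):
    print(round_no, {seq: round(f, 3) for seq, f in pool.items()})
```

Even a rare tight binder comes to dominate the pool after a few rounds, which is why sequencing the later rounds reveals the preferred motif.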
Other Approaches to Finding TFBS
Identify recurrent motifs under DNaseI footprints
From Neph et al, Nature, 2012
Integrated Approaches to Identifying TFBS
• Combining Scores and TF-Specific ChIP-Seq
• Combining information from scanning and
PhastCons or PhyloP conservation
• Combining information from DNAse,
conservation and histone marks
– Integrating DGF
Finding TFBS Motif via TF-Specific ChIP-Seq
• ChIP gives
approximate (~200bp)
TFBS locations
• Sequence can identify
loci more specifically
within ChIP peaks
• Use HMM or Gibbs
• Indirect binding won’t
be found
• Weak binding can be
accommodated
From Gelfond et al Biometrics 2009
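As a rough illustration of the Gibbs option, the sketch below is a basic site sampler that assumes exactly one motif occurrence per peak sequence and a uniform background. It is not the model of Gelfond et al; the function names, the toy peak sequences, the motif width and the iteration count are all assumptions.

```python
import random

BASES = "ACGT"

def column_probs(sites, pseudocount=0.5):
    """Per-position base probabilities from aligned sites, with pseudocounts."""
    width = len(sites[0])
    probs = []
    for i in range(width):
        counts = {b: pseudocount for b in BASES}
        for s in sites:
            counts[s[i]] += 1
        total = sum(counts.values())
        probs.append({b: counts[b] / total for b in BASES})
    return probs

def gibbs_motif_search(seqs, width, iterations=200, seed=0):
    """Sample one motif start per sequence; returns the final sampled starts."""
    rng = random.Random(seed)
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        for i, seq in enumerate(seqs):
            # Build the motif model from every sequence except the held-out one.
            others = [s[p:p + width]
                      for j, (s, p) in enumerate(zip(seqs, starts)) if j != i]
            probs = column_probs(others)
            # Weight each candidate start by its likelihood ratio
            # against a uniform background (0.25 per base).
            weights = []
            for p in range(len(seq) - width + 1):
                w = 1.0
                for col, base in zip(probs, seq[p:p + width]):
                    w *= col[base] / 0.25
                weights.append(w)
            # Resample the held-out sequence's start in proportion to the weights.
            starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

# Toy "peak" sequences, each containing a planted TGACTCA (illustrative only).
peaks = ["ACGTTGACTCAGGA", "TTTGACTCACCGTA", "GGCATGACTCATTC", "CATGACTCAGTTGC"]
starts = gibbs_motif_search(peaks, width=7)
print([seq[p:p + 7] for seq, p in zip(peaks, starts)])
```

Because the sampler is stochastic, a run can settle on a shifted or spurious alignment; real tools use multiple restarts and allow peaks with zero or several sites.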
Finding Active TFBS in Tissues
• Need Bayes model to integrate information
from various sources
• Easiest if have some PSWM for binding site
• We will focus on this situation
• Increasingly being done to discover novel
motifs or PSWMs
Bayesian Hierarchical Models
• Prior probability of binding site set very low or
estimated from TF-specific ChIP data
• In principle binding should be a continuous
variable; we will treat as ‘yes-no’
• Need to estimate probability of various
genomic features – conservation, DNAse,
histone marks – for TFBS and for background
sequence
Bayes Model for Combining Scores and
Conservation
• How to estimate P(conserved | TFBS)?
• Depends on the evolutionary depth over which conservation is measured
– For mammals ~ 40%; primates ~ 80%
– Varies between promoter and enhancer
• Background state can be estimated from genome-wide
conservation (typically 5 - 10%)
• Then combine by Bayes Formula
P(B | C, S) = P(C & S | B) P(B) / P(C & S)
• C and S are conditionally independent given B, so
P(C&S|B) = P(C|B)P(S|B) (likewise for ~B)
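A small numerical sketch of this calculation follows, with B meaning "site is bound", C "site is conserved" and S "site has a high PSWM score". P(C | B) = 0.40 and P(C | ~B) = 0.05 are taken from the figures above; the prior P(B) and the two score probabilities are invented for illustration, and the function name is hypothetical.

```python
def posterior_bound(prior, p_c_bound, p_c_bg, p_s_bound, p_s_bg):
    """P(B | C, S), assuming C and S are conditionally independent
    given B and given ~B."""
    # Numerator: P(C|B) P(S|B) P(B)
    num = p_c_bound * p_s_bound * prior
    # Denominator: P(C & S), summed over the bound and background states.
    den = num + p_c_bg * p_s_bg * (1.0 - prior)
    return num / den

post = posterior_bound(
    prior=0.001,      # assumed prior P(B)
    p_c_bound=0.40,   # P(C | B), mammalian depth (from the slide)
    p_c_bg=0.05,      # P(C | ~B), low end of the slide's background range
    p_s_bound=0.50,   # assumed P(S | B)
    p_s_bg=0.001,     # assumed P(S | ~B)
)
print(round(post, 3))
```

Note that the denominator is expanded over both states rather than computed as P(C)P(S), since C and S are only independent once B is fixed.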
Bayes Model for Combining Scores and
DNase Sensitivity
• How to estimate P(DHS | TFBS)?
• Almost all (~98%) of known TFBS occur in DHS
• Background state can be estimated from
genome-wide levels (typically 1 or 2%)
• Then combine by Bayes Formula
P(D & S | B)P(B)
P(B | D, S) =
P(D & S)
• D & S are conditionally independent given B, so
P(D&S|B) = P(D|B)P(S|B)
– likewise P(D&S|~B) = P(D|~B)P(S|~B), so
P(D&S) = P(D&S|B)P(B) + P(D&S|~B)P(~B)
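The same update can be written in odds form, where each conditionally independent feature multiplies the prior odds by its likelihood ratio. The sketch below uses the ~98% and 2% DHS figures from this slide; the prior and the PSWM-score likelihood ratio are assumptions, and `update_with_likelihood_ratios` is a hypothetical helper.

```python
import math

def update_with_likelihood_ratios(prior_prob, likelihood_ratios):
    """Posterior P(B | features) from a prior and per-feature likelihood
    ratios P(feature | B) / P(feature | ~B), combined on the log-odds scale."""
    log_odds = math.log(prior_prob / (1.0 - prior_prob))
    for lr in likelihood_ratios:
        log_odds += math.log(lr)
    return 1.0 / (1.0 + math.exp(-log_odds))

prior = 0.001            # assumed prior P(B)
score_lr = 500.0         # assumed likelihood ratio for a strong PSWM match

in_dhs_lr = 0.98 / 0.02  # P(D | B) / P(D | ~B)
out_dhs_lr = 0.02 / 0.98 # P(~D | B) / P(~D | ~B)

post_open = update_with_likelihood_ratios(prior, [in_dhs_lr, score_lr])
post_closed = update_with_likelihood_ratios(prior, [out_dhs_lr, score_lr])
print(round(post_open, 3), round(post_closed, 3))
```

Under these assumed numbers the same motif match is very likely bound inside a DHS and very unlikely bound outside one, which is why DNase data are so informative.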
What Information from Histone
Marks?
• By themselves histone marks, esp H3K4me3,
H3K4me1, H3K27me3 can be very informative
• After introducing DNAse data, these marks do
not add much direct information
• Could be used to adjust probabilities for DHS
and conservation (not yet done)
Chromia – A Method for Using Histone
Marks and PSWM
• Uses an HMM approach to integrate PSWM
and histone marks (NB P300 ~ H3K27ac)
CENTIPEDE– A Method for Combining
DNAse, Conservation and PSWM Scores
• Combines several
kinds of genomic
information with
PSWM to identify
putative TFBS
• Confirmation by ChIP-Seq is quite good
Pique-Regi R et al. Genome Res. 2011;21:447-455
CENTIPEDE– A Method for Combining
DNAse, Conservation and PSWM Scores
Model learned by the CENTIPEDE approach for the transcription factor NRSF. (A) Empirical
density plots for key aspects of the data for sites inferred by CENTIPEDE to be bound (green
lines, CENTIPEDE posterior probabilities >0.95) and unbound (red lines, probabilities < 0.5
Pique-Regi R et al. Genome Res. 2011;21:447-455