DNA motif

The relation between amino-acid substitutions in the
interface of transcription factors and their
recognized DNA motifs
Álvaro Sebastian Yagüe
asebastian@eead.csic.es
Laboratory of Computational Biology
http://www.eead.csic.es/compbio
Estación Experimental de Aula Dei
CSIC, Zaragoza, España
February 2, 2010 - V National Conference BIFI 2011
Content index
• DNA recognition and binding
• 3D footprinting
• footprintDB database
• alignment of DNA motifs
• alignment of protein interfaces
DNA recognition and binding
DNA-binding proteins
DNA-binding proteins contain DNA-binding domains and therefore have a specific
or general affinity for single- or double-stranded DNA.
[Figure: lac repressor, with residues Tyr 7, Tyr 12 and Tyr 17 labelled]
Jones CE, Olson OM: Sequence-specific DNA-protein interaction: the lac repressor. J Theor Biol 64:323-332, 1977.
[Figure: lac repressor, with residues Tyr 7, Tyr 12 and Tyr 17 labelled]
Lewis M, Chang G, Horton NC, Kercher MA, Pace HC, Schumacher MA, Brennan RG, Lu P: Crystal structure of the
lactose operon repressor and its complexes with DNA and inducer. Science 271:1247-1254, 1996.
3D footprinting
Methods for studying protein-DNA interactions
Method | Advantages | Limitations
Nitrocellulose filter binding assay | Relatively simple handling | No localisation of binding site
Footprinting assays | Technical simplicity | Incomplete binding frequently results in unclear footprint
Methylation interference | Combined analysis of binding site and effect of epigenetic variations | Very complex workflow
Electrophoretic mobility shift assay (EMSA) | Technically simple assay that permits semi-quantitative studies | In complex analyses, no immediate information on binding sites or proteins involved
Chromatin immunoprecipitation (ChIP) | Applicable also for in vivo analyses | Relies very strongly on antibody specificity
DNA adenine methyltransferase identification (DamID) | In vivo detection | Requirement of exogenous fusion proteins
Systematic evolution of ligands by exponential enrichment (SELEX) | Enables in vitro selection of optimal binding partners | Only selection of best binding events
Yeast one-hybrid system | In vivo assay | Very complex system
DNA microarrays | High throughput | Analysis process for individual proteins
Protein microarrays | High throughput | Monomer-specificity
Proximity ligation | Highly specific and sensitive down to single-molecule detection | Complex sample preparation
Atomic force microscopy, X-ray crystallography, nuclear magnetic resonance | High-resolution structural information | No use for definition of interaction pairs or identification of genomic locations
Surface plasmon resonance (SPR) | Real-time recording of association and dissociation | No high throughput
Helwa R, Hoheisel JD: Analysis of DNA-protein interactions: from nitrocellulose filter binding assays to microarray
studies. Anal Bioanal Chem 398:2551-2561, 2010.
3D Footprinting
3D footprinting is a computational technique developed in our lab that annotates DNA-binding interfaces by analyzing published 3D structures from the PDB.
Interface calculated by 3D-footprint:
1D5Y
Interface residues for 1d5y_A TF: 32,34,35,37,38
http://floresta.eead.csic.es/3dfootprint/
footprintDB
footprintDB
We have designed, implemented and curated a database with more than 3000 unique DNA-binding proteins (mostly transcription factors, TFs) and 4000 Position Weight Matrices
(PWMs) extracted from the literature and other repositories.
The DNA-binding interface residues of the TF sequences in footprintDB are annotated by
aligning those sequences with 3D-footprint templates.
footprintDB
Database | Description | TFs | PWMs
TRANSFAC | Data on transcription factors, their experimentally proven binding sites, their positional weight matrices and regulated genes. | 367 | 608
JASPAR CORE | Curated, non-redundant set of profiles, derived from published collections of experimentally defined transcription factor binding sites for eukaryotes. | 443 | 465
RegulonDB | Curated data of the transcriptional regulatory network of Escherichia coli K12. | 70 | 70
3D-footprint | Database of DNA-binding protein structures that is updated weekly with Protein Data Bank complexes. | 1006 | 1225
AthaMap | Genome-wide map of potential transcription factor and small RNA binding sites in Arabidopsis thaliana. | 42 | 48
Drosophila CTFM | Motif models reported in 51 primary references in the form of PWMs for 56 Drosophila melanogaster transcription factors. | 59 | 62
ZIFDB | Repository of information on C2H2 zinc fingers and engineered zinc-finger arrays. | 858 | 873
ZifBASE | An extensive collection of various natural and engineered zinc finger proteins. | 139 | 144
AGRIS | Resource of Arabidopsis promoter sequences, transcription factors and their target genes. | 53 | 53
UniPROBE | Repository of experimental data from universal protein binding microarray (PBM) experiments. | 296 | 437
PLACE | Database of motifs found in plant cis-acting regulatory DNA elements, all from previously published reports. | 28 | 480
footprintDB
footprintDB predicts:
1. Transcription factors which bind a specific DNA site or motif
2. DNA motifs likely to be recognised by a specific DNA-binding protein
http://floresta.eead.csic.es/footprintdb/
alignment of protein interfaces
Alignment of protein interfaces
The rationale behind footprintDB is the observation that proteins which recognize a
similar DNA motif most often have a similar set of residues at the interface.
DNA motif ~ TF interface
yCAATTAws ~ RKRTQNTK
-yaATTAam ~ RRRIQNTK
-yAATTArg ~ RRRIQNAK
-TAATTArc ~ RRRIQNAK
-tmATTAAs ~ KRRIQNMK
Alignment of protein interfaces
Noyes et al. have recently shown that homeodomain binding specificities depend on
the interface residues involved in DNA motif recognition.
Noyes, M.B., Christensen, R.G., Wakabayashi, A., Stormo, G.D., Brodsky, M.H., Wolfe, S.A.: Analysis of homeodomain specificities allows the
family-wide prediction of preferred recognition sites. Cell 133 (2008) 1277-1289
Alignment of protein interfaces
Unknown homeodomain protein
Homeodomain interface residues: RRRIQNAK
Interface alignment with footprintDB annotated interfaces:
yCAATTAws ~ RKRTQNTK
-yaATTAam ~ RRRIQNTK
-TAATTArc ~ RRRIQNAK
-tmATTAAs ~ KRRIQNMK
Predicted DNA binding motif: TAATTArc
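The lookup on this slide can be sketched in a few lines of Python. This is only an illustration: the scoring used here (fraction of identical residues between equal-length interfaces) is an assumption, since the slides do not state how footprintDB scores interface alignments, and the names `ANNOTATED`, `interface_identity` and `predict_motif` are hypothetical.

```python
# Annotated (DNA motif, interface) pairs copied from the slide.
ANNOTATED = [
    ("yCAATTAws", "RKRTQNTK"),
    ("-yaATTAam", "RRRIQNTK"),
    ("-TAATTArc", "RRRIQNAK"),
    ("-tmATTAAs", "KRRIQNMK"),
]


def interface_identity(a, b):
    """Fraction of identical residues between two equal-length interfaces.

    Assumption: plain identity is used as the score; the real scoring
    scheme is not described in the slides.
    """
    if len(a) != len(b):
        raise ValueError("interfaces must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def predict_motif(query, annotated=ANNOTATED):
    """Return (motif, interface, score) for the best-scoring annotated interface."""
    hits = (
        (motif, iface, interface_identity(query, iface))
        for motif, iface in annotated
    )
    return max(hits, key=lambda hit: hit[2])


motif, iface, score = predict_motif("RRRIQNAK")
print(motif.strip("-"), iface, score)
```

With the query interface RRRIQNAK from the slide, the best hit is the annotated interface RRRIQNAK, so the predicted motif is TAATTArc.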
Alignment of protein interfaces
Scoring aligned protein interfaces should predict which DNA motif an unknown
DNA-binding protein binds more accurately than other scoring methods, such as
local alignment.
[Figure: ROC curves, one panel for homeodomains and one for bZIPs]
The ROC curves show that interface alignments improve DNA motif predictions in
comparison with BLAST scores.
alignment of DNA motifs
DNA motif alignment issues
• Three alignment combinations (forward/forward, forward/reverse complement, reverse complement/forward): ATC / GTT ; ATC / AAC ; GAT / GTT
longer calculation time and higher false positive rate than a single pairwise alignment
• Different motif sizes: TgAGt / ackrTGACGTCAycra
it’s not a big issue if we divide the score by the number of aligned nucleotides
• Small motifs are prone to false high-scoring alignments, due to the small
nucleotide alphabet size: AGt / CGT
high similarity thresholds are required, particularly with individual zinc fingers,
which usually recognize 3 nt
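As a small illustration of the first issue, the three strand combinations for a pair of motifs can be generated as follows. This is a sketch, not footprintDB code; `reverse_complement` and `strand_combinations` are hypothetical names, and the complement table covers the standard IUPAC nucleotide codes in both cases.

```python
# IUPAC complement table; upper and lower case are both preserved.
_COMPLEMENT = str.maketrans(
    "ACGTRYSWKMBDHVNacgtryswkmbdhvn",
    "TGCAYRSWMKVHDBNtgcayrswmkvhdbn",
)


def reverse_complement(motif):
    """Reverse complement of a motif written with IUPAC codes."""
    return motif.translate(_COMPLEMENT)[::-1]


def strand_combinations(m1, m2):
    """The three motif pairs that have to be aligned and scored."""
    return [
        (m1, m2),                      # both motifs as given
        (m1, reverse_complement(m2)),  # second motif on the opposite strand
        (reverse_complement(m1), m2),  # first motif on the opposite strand
    ]


for a, b in strand_combinations("ATC", "GTT"):
    print(a, "/", b)
```

For ATC and GTT this reproduces the three pairs listed on the slide: ATC / GTT, ATC / AAC and GAT / GTT.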
DNA motif alignment issues
• Complex motifs (multimeric proteins): ackrTGACGTCAycra /
rTGACwmAGCA
they are not easy to align and heteromultimers might bind different sites
• A single motif for TFs with multiple DNA-binding domains
it might not be possible to know which domain binds to each submotif
• TFs with different annotated motifs
as a result of different oligomeric conformations or experimental approaches
• Motifs with very low information content: akaTTrchhaAhcw
might be genuine or result from low-resolution experiments; a source of false positive (FP) hits
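One common way to quantify "low information content" is the per-column information in bits, 2 minus the Shannon entropy of the nucleotide frequencies. The sketch below assumes this standard definition without a small-sample correction; the slides do not say which measure was actually used, and `column_information` is a hypothetical name.

```python
import math


def column_information(counts):
    """Information content (bits) of one motif column.

    counts: nucleotide counts for A, C, G, T at one position.
    Assumption: uniform background and no small-sample correction.
    """
    total = sum(counts)
    entropy = -sum(
        (c / total) * math.log2(c / total) for c in counts if c > 0
    )
    return 2.0 - entropy


# A fully conserved column, a two-way ambiguous column and a flat column.
for counts in ([0, 0, 6, 0], [3, 0, 0, 3], [1, 1, 1, 1]):
    print(counts, round(column_information(counts), 2))
```

A conserved position carries 2 bits, a position split evenly between two nucleotides carries 1 bit, and a position where all four nucleotides are equally frequent carries none.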
Alignment of DNA motifs
Some families of transcription factors and their singularities:
Family | Motifs | Multimeric | Multidomain
Homeodomain | TAATkr, TGAyA | Sometimes | Unusual
Basic helix-loop-helix (bHLH) | CACGTG, CAsshG | Always (homodimers, heterodimers) | Never
Basic leucine zipper (bZIP) | CACGTG, -ACGT-, TGAGTC | Always (homodimers, heterodimers) | Never
MYB | GkTwGkTr | Usual (multimers) | Usual
High mobility group (HMG) | mTT(T)GwT, TTATC, ATTCA | Sometimes | Unusual
GAGA | GAGA | Never | Never
Fork head | TrTTTr | Unusual | Never
Fungal Zn(2)-Cys(6) binuclear cluster | CGG | Usual (homodimers) | Never
Ets | GGAw | Usual (homodimers, heterodimers, multimers) | Never
Rel homology domain (RHD) | GGnnwTyCC' | Always (homodimers, heterodimers) | Never
Interferon regulatory factor | AAnnGAAA | Always (homodimers, heterodimers, multimers) | Never
Alignment of DNA motifs
Motifs are aligned with an ungapped Smith-Waterman algorithm, and motif
similarity is calculated as the sum of the Pearson Correlation
Coefficients of the aligned motif positions.
GAC
GCC
Similarity: (1 + 0 + 1) / 3 = 0.67
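The toy calculation above can be sketched as follows. This is a simplified stand-in rather than the algorithm used by footprintDB: positions score 1 when the letters are identical and 0 otherwise (case-sensitive, no IUPAC handling), and the ungapped search only tries offsets where the shorter motif fully overlaps the longer one. The function names and the sequence TTGACTT are hypothetical.

```python
def identity_similarity(a, b):
    """Matches divided by aligned length for two equal-length motifs."""
    if len(a) != len(b):
        raise ValueError("motifs must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def best_ungapped_offset(m1, m2):
    """Slide the shorter motif along the longer one without gaps.

    Returns (score, offset) of the best placement; on ties the
    leftmost offset is kept.
    """
    short, long_ = (m1, m2) if len(m1) <= len(m2) else (m2, m1)
    placements = [
        (identity_similarity(short, long_[off:off + len(short)]), off)
        for off in range(len(long_) - len(short) + 1)
    ]
    return max(placements, key=lambda p: p[0])


print(round(identity_similarity("GAC", "GCC"), 2))
print(best_ungapped_offset("GAC", "TTGACTT"))
```

The first call reproduces the 0.67 of the slide; the second shows how an ungapped search locates GAC inside a longer sequence.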
Alignment of DNA motifs
Motif 1 (consensus GCC):
P0 | 01 | 02 | 03
A | 0 | 1 | 0
C | 0 | 4 | 4
G | 6 | 0 | 0
T | 0 | 1 | 2
Motif 2 (consensus GAC):
P0 | 01 | 02 | 03
A | 0 | 3 | 0
C | 0 | 1 | 4
G | 3 | 0 | 0
T | 1 | 0 | 0
Simil = r1 + r2 + r3 = 0.94 + 0.14 + 0.87 = 1.95
Pearson Correlation Coefficient, position 1:
$$
r_1 = \frac{(0-1.5)(0-1) + (0-1.5)(0-1) + (6-1.5)(3-1) + (0-1.5)(1-1)}
{\sqrt{(0-1.5)^2 + (0-1.5)^2 + (6-1.5)^2 + (0-1.5)^2}\;\sqrt{(0-1)^2 + (0-1)^2 + (3-1)^2 + (1-1)^2}}
= \frac{12}{\sqrt{27}\,\sqrt{6}} = 0.94
$$
Alignment of DNA motifs
4900 TRANSFAC individual DNA sites were aligned with their
corresponding DNA motifs (PWMs), yielding a mean similarity of 0.70.
DNA sites:
AGCTTCCTC
GGCATCCAG
GTCTTCCTA
AGCTTCCAC
GGCATCCAC
GACTTCCTC
PWM:
P0 | 01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 | 09
A | 2 | 1 | 0 | 2 | 0 | 0 | 0 | 3 | 1
C | 0 | 0 | 6 | 0 | 0 | 6 | 6 | 0 | 4
G | 4 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 1
T | 0 | 1 | 0 | 4 | 6 | 0 | 0 | 3 | 0
Consensus: GGCTTCCWC
Half of the DNA sites share <0.70 similarity with their motif
DNA motifs have a large variability
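The count matrix on this slide can be rebuilt from the six sites with a few lines of Python. The consensus rule used here (most frequent nucleotide, with ties written as the matching IUPAC code) is an assumption chosen because it reproduces the W at position 08; the slides do not state the rule, and the function names are hypothetical.

```python
# The six DNA sites shown on the slide.
SITES = [
    "AGCTTCCTC",
    "GGCATCCAG",
    "GTCTTCCTA",
    "AGCTTCCAC",
    "GGCATCCAC",
    "GACTTCCTC",
]

# IUPAC codes keyed by the tied nucleotides in A, C, G, T order.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N",
}


def count_matrix(sites):
    """Count each nucleotide at each position of equal-length sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    matrix = {nt: [0] * length for nt in "ACGT"}
    for site in sites:
        for pos, nt in enumerate(site):
            matrix[nt][pos] += 1
    return matrix


def consensus(matrix):
    """Most frequent nucleotide per position; ties become IUPAC codes."""
    letters = []
    for pos in range(len(matrix["A"])):
        top = max(matrix[nt][pos] for nt in "ACGT")
        tied = "".join(nt for nt in "ACGT" if matrix[nt][pos] == top)
        letters.append(IUPAC[tied])
    return "".join(letters)


pwm = count_matrix(SITES)
for nt in "ACGT":
    print(nt, pwm[nt])
print(consensus(pwm))
```

The rows printed for A, C, G and T match the matrix above, and the consensus comes out as GGCTTCCWC.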
Alignment of DNA motifs
4900 TRANSFAC individual DNA sites were aligned against random
footprintDB database motifs, yielding a mean similarity of 0.47.
[Figure: DNA site AGCTTCCTC aligned against a random PWM]
Individual DNA sites and motifs can yield
moderate similarities by chance
Alignment of DNA motifs
Which motif similarity threshold should
we use to identify DNA sites and motifs?
0.47 < ? < 0.70
Alignment of DNA motifs
A ROC curve built by interpolating TPR and FPR from the TRANSFAC
alignments shows that motif similarity thresholds between 0.60 and 0.55
give a sensitivity (TPR) of 0.71-0.80 and a specificity (1-FPR) of
0.88-0.74.
[Figure: ROC curve of TPR against FPR, with points labelled by similarity threshold from 0.1 to 1; the 0.55 – 0.60 range is highlighted]
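The points of such a ROC curve can be computed as sketched below. The score lists are invented toy values for illustration only, not the TRANSFAC results; the sketch assumes that site-versus-own-motif scores count as positives and site-versus-random-motif scores as negatives. The function names are hypothetical.

```python
def tpr_fpr(true_scores, random_scores, threshold):
    """TPR and FPR when scores >= threshold are called matches.

    true_scores:   similarities of sites against their own motif.
    random_scores: similarities of sites against random motifs.
    """
    tp = sum(s >= threshold for s in true_scores)
    fp = sum(s >= threshold for s in random_scores)
    return tp / len(true_scores), fp / len(random_scores)


def roc_points(true_scores, random_scores, thresholds):
    """List of (threshold, TPR, FPR) tuples, one per threshold."""
    return [
        (t, *tpr_fpr(true_scores, random_scores, t)) for t in thresholds
    ]


# Toy scores, invented for illustration only.
true_scores = [0.82, 0.74, 0.66, 0.58, 0.41]
random_scores = [0.61, 0.52, 0.47, 0.39, 0.30]

for threshold, tpr, fpr in roc_points(true_scores, random_scores, [0.60, 0.55]):
    print(threshold, tpr, 1 - fpr)  # threshold, sensitivity, specificity
```

Lowering the threshold can only raise TPR and FPR, which is why the sensitivity and specificity ranges quoted above move in opposite directions.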
Thanks for your attention
Laboratory of Computational Biology
Estación Experimental de Aula Dei / CSIC
Av. Montañana 1.005
50059 Zaragoza (Spain)
Tel.: +34 976716089
Web: http://www.eead.csic.es/compbio/
Questions?