The relation between amino-acid substitutions in the interface of transcription factors and their recognized DNA motifs

Álvaro Sebastian Yagüe
asebastian@eead.csic.es
Laboratory of Computational Biology
http://www.eead.csic.es/compbio
Estación Experimental de Aula Dei CSIC, Zaragoza, Spain
February 2, 2010 - V National Conference BIFI 2011

Content index
• DNA recognition and binding
• 3D footprinting
• footprintDB database
• alignment of DNA motifs
• alignment of protein interfaces

DNA recognition and binding

DNA-binding proteins
DNA-binding proteins contain DNA-binding domains and thus have a specific or general affinity for single- or double-stranded DNA.

Example: lac repressor (residues labelled in the figures: Tyr 7, Tyr 12, Tyr 17)

Jones CE, Olson OM: Sequence-specific DNA-protein interaction: the lac repressor. J Theor Biol 64:323-332, 1977.
Lewis M, Chang G, Horton NC, Kercher MA, Pace HC, Schumacher MA, Brennan RG, Lu P: Crystal structure of the lactose operon repressor and its complexes with DNA and inducer. Science 271:1247-1254, 1996.

3D footprinting

Methods for studying protein-DNA interactions

Method | Advantages | Limitations
Nitrocellulose filter binding assay | Relatively simple handling | No localisation of binding site
Footprinting assays | Technical simplicity | Incomplete binding frequently results in unclear footprint
Methylation interference | Combined analysis of binding site and effect of epigenetic variations | Very complex workflow
Electrophoretic mobility shift assay (EMSA) | Technically simple assay that permits semi-quantitative studies | In complex analyses, no immediate information on binding sites or proteins involved
Chromatin immunoprecipitation (ChIP) | Applicable also for in vivo analyses | Relies very strongly on antibody specificity
DNA adenine methyltransferase identification (DamID) | In vivo detection | Requirement of exogenous fusion proteins
Systematic evolution of ligands by exponential enrichment (SELEX) | Enables in vitro selection of optimal binding partners | Only selection of best binding events
Yeast one-hybrid system | In vivo assay | Very complex system
DNA microarrays | High throughput | Analysis process for individual proteins
Protein microarrays | High throughput | Monomer-specificity
Proximity ligation | Highly specific and sensitive down to single-molecule detection | Complex sample preparation
Atomic force microscopy, X-ray crystallography, nuclear magnetic resonance | High-resolution structural information | No use for definition of interaction pairs or identification of genomic locations
Surface plasmon resonance (SPR) | Real-time recording of association and dissociation | No high throughput

Helwa R, Hoheisel JD: Analysis of DNA-protein interactions: from nitrocellulose filter binding assays to microarray studies. Anal Bioanal Chem 398:2551-2561.

3D Footprinting
3D footprinting is a computational technique developed in our lab that annotates DNA-binding interfaces by analyzing published 3D structures from the PDB.
3D-footprint calculated interface: 1D5Y
Interface residues for 1d5y_A TF: 32,34,35,37,38
http://floresta.eead.csic.es/3dfootprint/

footprintDB

We have designed, implemented and curated a database with more than 3000 unique DNA-binding proteins (mostly transcription factors, TFs) and 4000 Position Weight Matrices (PWMs) extracted from the literature and other repositories. The DNA-binding interface residues of the TF sequences in footprintDB are annotated by aligning the sequences with 3D-footprint templates.

Database | Description | TFs | PWMs
TRANSFAC | Data on transcription factors, their experimentally proven binding sites, their positional weight matrices and regulated genes. | 367 | 608
JASPAR CORE | Curated, non-redundant set of profiles, derived from published collections of experimentally defined transcription factor binding sites for eukaryotes. | 443 | 465
RegulonDB | Curated data of the transcriptional regulatory network of Escherichia coli K12. | 70 | 70
3D-footprint | Database of DNA-binding protein structures that is updated weekly with Protein Data Bank complexes. | 1006 | 1225
AthaMap | Genome-wide map of potential transcription factor and small RNA binding sites in Arabidopsis thaliana. | 42 | 48
Drosophila CTFM | Motif models reported in 51 primary references in the form of PWMs for 56 Drosophila melanogaster transcription factors. | 59 | 62
ZIFDB | Repository of information on C2H2 zinc fingers and engineered zinc-finger arrays. | 858 | 873
ZifBASE | An extensive collection of various natural and engineered zinc finger proteins. | 139 | 144
AGRIS | Resource of Arabidopsis promoter sequences, transcription factors and their target genes. | 53 | 53
UniPROBE | Repository of experimental data from universal protein binding microarray (PBM) experiments. | 296 | 437
PLACE | Database of motifs found in plant cis-acting regulatory DNA elements, all from previously published reports. | 28 | 480

footprintDB predicts:
1. Transcription factors which bind a specific DNA site or motif
2. DNA motifs likely to be recognised by a specific DNA-binding protein
http://floresta.eead.csic.es/footprintdb/

alignment of protein interfaces

Alignment of protein interfaces
The rationale behind footprintDB is the observation that proteins which recognize a similar DNA motif most often have a similar set of residues at the interface.

DNA motif ~ TF interface
yCAATTAws ~ RKRTQNTK
-yaATTAam ~ RRRIQNTK
-yAATTArg ~ RRRIQNAK
-TAATTArc ~ RRRIQNAK
-tmATTAAs ~ KRRIQNMK

Noyes et al. have recently shown that homeodomain binding specificities depend on the interface residues involved in DNA motif recognition.
Noyes, M.B., Christensen, R.G., Wakabayashi, A., Stormo, G.D., Brodsky, M.H., Wolfe, S.A.: Analysis of homeodomain specificities allows the family-wide prediction of preferred recognition sites. Cell 133 (2008) 1277-1289

Example: unknown homeodomain protein
Homeodomain interface residues: RRRIQNAK
Interface alignment with footprintDB annotated interfaces:
yCAATTAws ~ RKRTQNTK
-yaATTAam ~ RRRIQNTK
-TAATTArc ~ RRRIQNAK
-tmATTAAs ~ KRRIQNMK
Predicted DNA binding motif: TAATTArc

Scoring aligned protein interfaces predicts which DNA motif an unknown DNA-binding protein binds more accurately than other scoring methods such as local alignment.
[Figure: ROC curves for homeodomains and bZIPs]
The ROC curves show that interface alignments improve DNA motif predictions in comparison with BLAST scores.
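
The homeodomain example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not say how footprintDB scores aligned interfaces, so plain residue identity is used here as a stand-in, and the names `ANNOTATED`, `interface_identity` and `predict_motif` are hypothetical. The motif/interface pairs are the four shown on the slide.

```python
# Motif -> interface pairs copied from the homeodomain slide.
ANNOTATED = {
    "yCAATTAws": "RKRTQNTK",
    "-yaATTAam": "RRRIQNTK",
    "-TAATTArc": "RRRIQNAK",
    "-tmATTAAs": "KRRIQNMK",
}


def interface_identity(a, b):
    """Fraction of identical residues between two pre-aligned interfaces.

    Assumption: identity is a placeholder score; the real scoring scheme
    is not described in the slides.
    """
    if len(a) != len(b):
        raise ValueError("interfaces must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def predict_motif(query, annotated=ANNOTATED):
    """Return (motif, score) for the annotated interface closest to the query."""
    scored = ((motif, interface_identity(query, iface))
              for motif, iface in annotated.items())
    return max(scored, key=lambda pair: pair[1])


motif, score = predict_motif("RRRIQNAK")
# Leading "-" characters are alignment gaps in the slide, not part of the motif.
print(motif.strip("-"), score)
```

With the interface RRRIQNAK the best hit is the annotated interface RRRIQNAK itself, so the sketch returns the motif TAATTArc, as in the slide.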

alignment of DNA motifs

DNA motif alignment issues
• Three alignment combinations (direct, and with either motif reverse-complemented): ATC / GTT ; ATC / AAC ; GAT / GTT
  longer calculation time and higher false positive rate than a single pairwise alignment
• Different motif sizes: TgAGt / ackrTGACGTCAycra
  it’s not a big issue if we divide the score by the number of aligned nucleotides
• Small motifs are prone to false high-scoring alignments, due to the small nucleotide alphabet size: AGt / CGT
  high similarity thresholds are required, particularly with individual zinc fingers, which usually recognize 3 nt
• Complex motifs (multimeric proteins): ackrTGACGTCAycra / rTGACwmAGCA
  they are not easy to align and heteromultimers might bind different sites
• A single motif for TFs with multiple DNA-binding domains
  it might not be possible to know which domain binds to each submotif
• TFs with different annotated motifs as a result of different oligomeric conformations or experimental approaches
• Motifs with very low information content: akaTTrchhaAhcw
  might be genuine or result from low-resolution experiments; a source of false-positive (FP) hits

Alignment of DNA motifs
Some families of transcription factors and their singularities:

Family | Motifs | Multimeric | Multidomain
Homeodomain | TAATkr, TGAyA | Sometimes | Unusual
Basic helix-loop-helix (bHLH) | CACGTG, CAsshG | Always (homodimers, heterodimers) | Never
Basic leucine zipper (bZIP) | CACGTG, -ACGT-, TGAGTC | Always (homodimers, heterodimers) | Never
MYB | GkTwGkTr | Usual (multimers) | Usual
High mobility group (HMG) | mTT(T)GwT, TTATC, ATTCA | Sometimes | Unusual
GAGA | GAGA | Never | Never
Fork head | TrTTTr | Unusual | Never
Fungal Zn(2)-Cys(6) binuclear cluster | CGG | Usual (homodimers) | Never
Ets | GGAw | Usual (homodimers, heterodimers, multimers) | Never
Rel homology domain (RHD) | GGnnwTyCC | Always (homodimers, heterodimers) | Never
Interferon regulatory factor | AAnnGAAA | Always (homodimers, heterodimers, multimers) | Never

Motifs are aligned with an ungapped Smith-Waterman algorithm, and motif similarity is calculated as the sum of the Pearson correlation coefficients of the aligned motif positions.

Consensus example: GAC vs. GCC
Similarity: (1 + 0 + 1) / 3 = 2 / 3 = 0.67

PWM example:

GCC motif
P0 | 01 | 02 | 03
A | 0 | 1 | 0
C | 0 | 4 | 4
G | 6 | 0 | 0
T | 0 | 1 | 2

GAC motif
P0 | 01 | 02 | 03
A | 0 | 3 | 0
C | 0 | 1 | 4
G | 3 | 0 | 0
T | 1 | 0 | 0

Simil = r1 + r2 + r3 = 0.94 + 0.14 + 0.87 = 1.95

Pearson correlation coefficient, position 1 (column means 1.5 and 1):
\[ r_1 = \frac{(0-1.5)(0-1) + (0-1.5)(0-1) + (6-1.5)(3-1) + (0-1.5)(1-1)}{\sqrt{\left[(0-1.5)^2 + (0-1.5)^2 + (6-1.5)^2 + (0-1.5)^2\right]\left[(0-1)^2 + (0-1)^2 + (3-1)^2 + (1-1)^2\right]}} = 0.94 \]

4900 individual TRANSFAC DNA sites were aligned with their corresponding DNA motifs (PWMs), yielding a mean similarity of 0.70.

Sites: AGCTTCCTC GGCATCCAG GTCTTCCTA AGCTTCCAC GGCATCCAC GACTTCCTC

P0 | 01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 | 09
A | 2 | 1 | 0 | 2 | 0 | 0 | 0 | 3 | 1
C | 0 | 0 | 6 | 0 | 0 | 6 | 6 | 0 | 4
G | 4 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 1
T | 0 | 1 | 0 | 4 | 6 | 0 | 0 | 3 | 0
Consensus: G G C T T C C W C

Half of the DNA sites share < 0.70 similarity with their motif: DNA motifs have a large variability.

The same 4900 individual TRANSFAC DNA sites were aligned against random footprintDB motifs, yielding a mean similarity of 0.47 (slide example: site AGCTTCCTC against an unknown PWM, shown as "?").
Individual DNA sites and motifs can yield moderate similarities by chance.

Which motif similarity threshold should we use to match DNA sites and motifs?
0.47 < ? < 0.70

A ROC curve interpolated from the TPR and FPR of the TRANSFAC alignments shows that motif similarity ratios between 0.60 and 0.55 give a sensitivity (TPR) of 0.71-0.80 and a specificity (1 - FPR) of 0.88-0.74.
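
The motif similarity described in the "Alignment of DNA motifs" slides can be sketched in Python as follows. It is a minimal illustration, not the footprintDB implementation, and it makes several assumptions: the function names and the `min_overlap` parameter are hypothetical; the full overlap is scored at every ungapped offset, which is a simplification of a true Smith-Waterman local alignment; the "similarity ratio" is taken to be the summed correlation divided by the number of aligned columns; and a column with zero variance is scored 0. PWM columns are written as (A, C, G, T) counts taken from the GCC / GAC example.

```python
import math


def pearson(x, y):
    """Pearson correlation between two PWM columns given as (A, C, G, T) counts."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    # Assumption: a flat column has no defined correlation, so score it 0.
    return num / den if den else 0.0


def reverse_complement(pwm):
    """Reverse the column order and swap A<->T, C<->G inside each column."""
    return [col[::-1] for col in reversed(pwm)]


def best_ungapped(pwm_a, pwm_b, min_overlap=3):
    """Slide pwm_b along pwm_a without gaps.

    Returns (sum of column correlations, aligned columns) for the best offset.
    """
    best = (float("-inf"), 0)
    for offset in range(min_overlap - len(pwm_b), len(pwm_a) - min_overlap + 1):
        start, end = max(0, offset), min(len(pwm_a), offset + len(pwm_b))
        if end - start < min_overlap:
            continue
        score = sum(pearson(pwm_a[i], pwm_b[i - offset]) for i in range(start, end))
        best = max(best, (score, end - start))
    return best


def motif_similarity(pwm_a, pwm_b, min_overlap=3):
    """Best similarity ratio over both orientations of pwm_b.

    Reverse-complementing pwm_a instead would give the same column scores,
    so two orientations are tried here.
    """
    score, n = max(best_ungapped(pwm_a, b, min_overlap)
                   for b in (pwm_b, reverse_complement(pwm_b)))
    return score / n if n else 0.0


# Columns of the GCC and GAC matrices from the slide, in (A, C, G, T) order.
gcc = [(0, 0, 6, 0), (1, 4, 0, 1), (0, 4, 0, 2)]
gac = [(0, 0, 3, 1), (3, 1, 0, 0), (0, 4, 0, 0)]

total, aligned = best_ungapped(gcc, gac)
print(round(total, 2), aligned, round(motif_similarity(gcc, gac), 2))
```

For the GCC / GAC example the sketch reproduces r1 + r2 + r3 = 1.95 over three aligned columns, a ratio of about 0.65, which could then be compared with a threshold in the 0.55-0.60 range.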
[Figure: ROC curve, TPR vs. FPR, with points labelled by similarity threshold; the 0.55-0.60 similarity range is highlighted]

Thanks for your attention

Laboratory of Computational Biology
Estación Experimental de Aula Dei / CSIC
Av. Montañana 1.005
50059 Zaragoza (Spain)
Tel.: +34 976716089
Web: http://www.eead.csic.es/compbio/

Questions?