Genome-wide Analysis of Gene Regulation
Presentation by: David Rozado
Berlin, 4th of May, 2005

Comparative analysis of methods for representing and searching for transcription factor binding sites
Robert Osada, Elena Zaslavsky and Mona Singh
Department of Computer Science & Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA

Introduction
• Identification of DNA binding sites for transcription factors
  – Important step in unraveling the transcriptional regulatory network
• Several approaches for searching for transcription factor binding sites
  – Consensus sequences
  – Position-specific scoring matrices
  – Berg and von Hippel
  – Centroid
• Such basic approaches can all be extended by incorporating:
  – pairwise nucleotide dependencies
  – per-position information content
• The paper evaluates the effectiveness of the basic approaches and their extensions in finding binding sites for a transcription factor of interest

Datasets
• 68 regulatory proteins and their aligned DNA binding domains
• Filters applied:
  – Only proteins with at least four binding sites were considered
  – Duplicate binding sites were removed in order to preserve the integrity of the leave-one-out cross-validation
  – Each binding site was unambiguously located within the E. coli K-12 genome and extracted along with flanking regions on each side
• This process left 35 transcription factors and 410 binding sites
  – Average of 11.7 ± 8.5 sites per transcription factor

Notation
• S: set of N DNA binding sites for a transcription factor
• n_j(b): number of times base b appears in the j-th position in S
• f_j(b): corresponding frequency
• n(b): number of times base b appears overall in the N binding sites
• f(b): overall frequency for base b
• n_ij(b, d): number of times the ordered pair (b, d) occurs in positions i and j
• f_ij(b, d): corresponding frequency
• t_j: j-th base of the sequence t to be scored

Approaches for representing and searching for binding
sites

Extension I - Pairwise correlations
• A method for incorporating pairwise correlations should only take into account those pairs that act together in determining DNA-protein binding specificity
• Such precise information is not always readily available
• As an approximation, focus on pairwise correlations between bases that are nearby in sequence
• Introduce the notion of scope to delimit which pairs are considered correlated
  – A scope of one restricts correlated positions to adjacent pairs
  – A scope of two considers both adjacent pairs and pairs separated by one intermediate base

Extension II - Information content
• Information content (IC) is a concept based on the information-theoretic notion of entropy
• In the current application, the entropy of a position expresses the number of bits necessary to describe the position in a binding site
• The information content of a position is calculated by subtracting its entropy from the maximum possible entropy
• The higher the information content, the more conserved (and presumably more important) the position

Cross-validation testing and analysis
• The common usage of any of the methods described above would be to scan non-coding regions in a genome in order to find binding sites for a particular transcription factor
• Such a framework is not easily applicable when we wish to evaluate and compare different methods
  – The E. coli genome contains many as yet uncharacterized binding sites
  – Predicted windows may correspond to true binding sites even if they are not annotated as such in the original dataset
• Instead, use a testing framework with sets of positive and negative examples

Cross-validation testing and analysis II
• Conduct leave-one-out cross-validation studies to evaluate a particular method
• Suppose s belongs to a set S of known binding sites, each of length l, for a particular transcription factor TF
• The method under consideration uses all the sites except s to build the binding site
representation for TF, and scores s as well as a set of negative examples
• The negative examples consist of all binding sites in our dataset except those known to be bound by TF
• It is still possible that transcription factor TF can bind some of the negative examples
• Nevertheless, s should be among the top-scoring sites in the overall pool

Comparing Methods
• For each site s of a transcription factor under consideration, a rank in cross-validation testing is computed by counting how many negative examples score as well as or better than s
  – A lower rank indicates better performance
• To compare how well two methods perform, a Wilcoxon matched-pairs signed-ranks test is used
• The number of times one method outperforms the other is compared with how many times such an event would happen merely by chance under the assumption that both methods perform equally well
• A ROC curve is created for each individual leave-one-out test, and then the average over all sites for that transcription factor is computed

Comparison of basic methods

Figure: ROC curves comparing performance when pairs are considered for Centroid

Figure: ROC curves comparing performance when pairs are considered for PSSM

Figure: ROC curves comparing Centroid-P with scope 2 using regular sites and sites with columns shuffled

Performance of methods based on averaged ranks per transcription factor

Conclusions
• Using per-position information content to weight positional scores improves the performance of all methods
  – Sometimes dramatically
• Methods based on nucleotide matches, such as consensus sequences and Centroid, show statistically significant improvements when incorporating pairwise nucleotide dependencies
  – Probabilistic methods, such as log-odds PSSMs, do not show statistically significant improvements when incorporating pairwise dependencies
• The difference in performance between methods decreases substantially once information content and pairwise correlations have been incorporated

Making connections between novel
transcription factors and their DNA motifs
Kai Tan (1), Lee Ann McCue (2), and Gary D. Stormo (1,3)
(1) Department of Genetics, Washington University School of Medicine, Saint Louis, Missouri 63110, USA; (2) The Wadsworth Center, New York State Department of Health, Albany, New York 12201-0509, USA

Introduction
• A computational method to connect novel transcription factors and DNA motifs in E. coli
• The method takes advantage of three types of information to assign a DNA binding motif to a given TF
  1. A distance constraint between a TF and its closest binding site in the genome (Dmin information)
  2. The phylogenetic correlation between TFs and their regulated genes (PC information)
  3. A binding specificity constraint for TFs having structurally similar DNA-binding domains (FMC information)
• The different types of information are combined to calculate the probability of a given transcription-factor–DNA-motif pair being a true pair

Distance constraint
• Besides auto-regulation, it has been noticed in many cases that TFs and the genes they regulate are near each other in the genome
  – This gives a distance constraint between the TF and its closest binding site in the genome
• Dmin_self is the distance between a TF gene and its closest binding site in the genome
• Dmin_cross is the distance between a TF gene and the closest binding site for a different TF

The phylogenetic correlation
• TFs and their regulated genes tend to evolve concurrently
• Connect TFs and DNA motifs through the correlation between their occurrences in a comparative analysis of multiple species
• Two types of phylogenetic correlation (PC) distributions
  – PC for true TF–DNA-motif pairs
  – PC for false TF–DNA-motif pairs

Binding specificity constraint
• TFs with more similar DNA-binding domains are expected to bind sites that are more similar to each other than the sites bound by dissimilar TFs
• Distribution of average similarity scores for motifs from the same family and from different families

Conclusions
• Hypothesis: information concerning the connection of a TF to its DNA motif is carried in the genome sequences
• TFs and their binding sites are often close to each other in the genome (Dmin information)
• TFs tend to evolve concurrently with their regulated genes (PC information)
• TFs from the same structural family tend to have similar DNA motifs (FMC information)

Functional determinants of transcription factors in Escherichia coli: protein families and binding sites
M. Madan Babu and Sarah A. Teichmann
MRC Laboratory of Molecular Biology, Hills Road, Cambridge CB2 2QH, UK

Introduction
• DNA-binding transcription factors regulate expression of genes near to where they bind
• These factors can be activators or repressors of transcription, or both
• A fundamental question: what determines whether a transcription factor acts as an activator or a repressor?
  – Protein–protein contacts
  – Position of the DNA-binding domain in the protein primary sequence
  – Altered DNA structure
  – Position of its binding site on the DNA relative to the transcription start site
• This work suggests that, in general, in E. coli, a transcription factor's protein family is not indicative of its regulatory function, but the position of its binding site on the DNA is

Domain Architectures for different TFs

Conclusions
• Activators, repressors and dual regulators in E.
coli belong to many of the same protein families and share some domain architectures
• A transcription factor's regulatory role is not determined by protein structure or evolutionary relationships
  – Transcription factors have evolved by duplication of an ancestral transcription factor, followed by a change in function through a shift in binding sites
• A transcription factor's regulatory role is determined to a large extent simply by the position of the transcription factor binding site
  – Activators have essentially only upstream binding sites
  – More than two thirds of repressors have at least one downstream binding site
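
Supplementary sketch for the first paper (Osada, Zaslavsky and Singh)
The notation, information-content and cross-validation slides of the first paper can be illustrated with a small Python sketch. It is not the authors' implementation: the function names, the pseudocount of 1, the background frequencies passed in by the caller, and the choice to multiply each positional log-odds term by that position's information content are all assumptions made for illustration. It computes f_j(b), the per-position information content (maximum entropy minus entropy), a log-odds PSSM score for a sequence t, and the leave-one-out rank of a held-out site against a pool of negative examples.

```python
import math
from collections import Counter

BASES = "ACGT"


def position_frequencies(sites, pseudocount=1.0):
    """Return f_j(b) for each position j; the pseudocount is an assumption."""
    length = len(sites[0])
    total = len(sites) + pseudocount * len(BASES)
    freqs = []
    for j in range(length):
        counts = Counter(site[j] for site in sites)  # n_j(b)
        freqs.append({b: (counts[b] + pseudocount) / total for b in BASES})
    return freqs


def information_content(freq_j):
    """Maximum possible entropy (2 bits for DNA) minus the position's entropy."""
    entropy = -sum(p * math.log2(p) for p in freq_j.values() if p > 0)
    return math.log2(len(BASES)) - entropy


def log_odds_score(t, freqs, background, weight_by_ic=False):
    """Sum over positions j of log2(f_j(t_j) / background(t_j)).

    With weight_by_ic=True each term is multiplied by the information
    content of its position (one possible weighting, assumed here).
    """
    score = 0.0
    for j, base in enumerate(t):
        term = math.log2(freqs[j][base] / background[base])
        if weight_by_ic:
            term *= information_content(freqs[j])
        score += term
    return score


def leave_one_out_rank(sites, index, negatives, background, weight_by_ic=False):
    """Count negatives scoring as well as or better than the held-out site."""
    held_out = sites[index]
    training = sites[:index] + sites[index + 1:]
    freqs = position_frequencies(training)
    s_score = log_odds_score(held_out, freqs, background, weight_by_ic)
    return sum(
        1
        for neg in negatives
        if log_odds_score(neg, freqs, background, weight_by_ic) >= s_score
    )


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    toy_sites = ["TTGACA", "TTGACT", "TTGATA", "TAGACA"]
    toy_negatives = ["GCGCGC", "CCCGGG"]
    uniform = {b: 0.25 for b in BASES}  # assumed uniform background
    for i in range(len(toy_sites)):
        print(toy_sites[i], leave_one_out_rank(toy_sites, i, toy_negatives, uniform))
```

A lower rank means fewer negative examples outscore the held-out site, matching the "lower rank indicates better performance" convention used in the comparison slides. Pairwise (scope-based) extensions are deliberately left out of this sketch.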