Genome-wide Analysis of Gene Regulation
Presentation by: David Rozado
Berlin, 4th of May, 2005
Comparative analysis of methods for
representing and searching for transcription
factor binding sites
Robert Osada, Elena Zaslavsky and Mona Singh
Department of Computer Science & Lewis-Sigler Institute for Integrative Genomics,
Princeton University, Princeton, NJ 08544, USA
Introduction
• Identification of DNA binding sites for transcription factors
– Important step in unraveling the transcriptional regulatory network
• Several approaches to searching for transcription factor binding sites
– consensus sequences
– position-specific scoring matrices (PSSMs)
– Berg and von Hippel
– Centroid
• Such basic approaches can all be extended by incorporating:
– pairwise nucleotide dependencies
– per-position information content
• The paper evaluates the effectiveness of the basic approaches and
their extensions in finding binding sites for a transcription factor of
interest
Datasets
• 68 regulatory proteins and their aligned DNA binding domains
• Filters applied:
– Only proteins with at least four binding sites were considered
– Duplicate binding sites were removed in order to preserve the integrity
of the leave-one-out cross-validation
– Each binding site had to be unambiguously located within the E. coli K-12
genome and was extracted along with flanking regions on each side
• This process left 35 transcription factors and 410 binding sites
– Average of 11.7 ± 8.5 sites per transcription factor
Notation
• S: set of N DNA binding sites for a transcription factor
• n_j(b): number of times base b appears in the j-th position in S
• f_j(b): corresponding frequency
• n(b): number of times base b appears overall in the N binding sites
• f(b): overall frequency for base b
• n_ij(b, d): number of times the ordered pair (b, d) occurs in positions i and j
• f_ij(b, d): corresponding frequency
• t_j: j-th base of the sequence t to be scored
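The counts and frequencies in the notation can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names and the toy sites are hypothetical, and no pseudocounts are applied.

```python
from collections import Counter

BASES = "ACGT"

def position_counts(sites):
    """n_j(b): count of base b at position j across equal-length sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    return [Counter(s[j] for s in sites) for j in range(length)]

def position_frequencies(sites):
    """f_j(b) = n_j(b) / N, with N the number of sites."""
    n = len(sites)
    return [{b: c[b] / n for b in BASES} for c in position_counts(sites)]

def pair_counts(sites, i, j):
    """n_ij(b, d): count of the ordered pair (b, d) at positions i and j."""
    return Counter((s[i], s[j]) for s in sites)

# Toy data (hypothetical, not taken from the dataset).
sites = ["ACGT", "ACGA", "TCGT", "ACTT"]
freqs = position_frequencies(sites)
pairs = pair_counts(sites, 0, 3)
```

Dividing `pair_counts` by N gives f_ij(b, d) in the same way that `position_frequencies` gives f_j(b).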
Approaches for representing and searching
for binding sites
Extension I - Pairwise correlations
• A method for incorporating pairwise correlations should only take
into account those pairs that act together in determining DNA–
protein binding specificity.
• Such precise information is not always readily available
• As approximation, focus on considering pairwise correlations
between bases that are nearby in sequence
• Introduce the notion of scope to delimit which pairs are considered
correlated.
– A scope of one restricts correlated positions to adjacent pairs
– A scope of two considers both adjacent pairs and pairs separated by one
intermediate base
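The notion of scope can be made concrete by listing the position pairs it admits. The helper below is a hypothetical sketch; it assumes 0-based positions and treats each pair (i, j) with i < j once.

```python
def pairs_within_scope(length, scope):
    """Return position pairs (i, j), i < j, whose separation j - i is at most scope."""
    if scope < 1:
        return []
    return [(i, j)
            for i in range(length)
            for j in range(i + 1, min(i + scope, length - 1) + 1)]

adjacent_only = pairs_within_scope(4, 1)
up_to_one_gap = pairs_within_scope(4, 2)
```

For a site of length 4, a scope of one yields the three adjacent pairs, and a scope of two adds the pairs (0, 2) and (1, 3).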
Extension II - Information content
• Information content (IC) is a concept based on the information-theoretic
notion of entropy.
• In the current application, the entropy of a position expresses the number
of bits necessary to describe the position in a binding site
• The information content of a position is calculated by subtracting its
entropy from the value of the maximum possible entropy
• The higher the information content, the more conserved (and presumably
more important) the position
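A minimal sketch of the per-position calculation, assuming a four-letter alphabet (so the maximum entropy is log2(4) = 2 bits) and raw frequencies without pseudocounts or small-sample correction. The function name and the toy sites are hypothetical.

```python
import math
from collections import Counter

def information_content(sites, alphabet="ACGT"):
    """IC_j = max entropy - entropy of position j, in bits."""
    max_entropy = math.log2(len(alphabet))
    n = len(sites)
    ic = []
    for j in range(len(sites[0])):
        counts = Counter(s[j] for s in sites)
        # Bases with zero count contribute nothing to the entropy.
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        ic.append(max_entropy - entropy)
    return ic

ic = information_content(["ACGT", "ACGA", "TCGT", "ACTT"])
```

In this toy example the second position is perfectly conserved and reaches the full 2 bits, while the other positions score lower. These values can then serve as per-position weights when scoring a candidate site.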
Cross-validation testing and analysis
• Common usage of any of the methods described above would be to
scan non-coding regions in a genome in order to find binding sites
for a particular transcription factor
• Such a framework is not easily applicable when we wish to evaluate
and compare different methods
– The E. coli genome contains many as yet uncharacterized binding sites
– Predicted windows may correspond to true binding sites even if they are
not annotated as such in the original dataset
• Testing framework with sets of positive and negative examples
Cross-validation testing and analysis II
• Conduct leave-one-out cross-validation studies to evaluate a
particular method
• Suppose s belongs to a set S of known binding sites, each of length
l, for a particular transcription factor TF
• The method under consideration uses all the sites except s, to build
the binding site representation for TF, and scores s as well as a set
of negative examples
• The negative examples consist of all binding sites in our dataset
except those known to be bound by TF
• It is still possible that transcription factor TF can bind some of the
negative examples
• Nevertheless, s should be among the top scoring sites in the overall
pool
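The leave-one-out procedure can be sketched with any of the scoring methods. The example below uses a log-odds PSSM as the method under test. The uniform background, the pseudocount of 1, the function names, and the toy sequences are all assumptions made for illustration, and negatives are taken to have the same length as the sites.

```python
import math

BASES = "ACGT"

def build_pssm(sites, pseudocount=1.0):
    """Log-odds PSSM against a uniform background (an assumption)."""
    n = len(sites)
    pssm = []
    for j in range(len(sites[0])):
        column = {}
        for b in BASES:
            count = sum(1 for s in sites if s[j] == b)
            freq = (count + pseudocount) / (n + len(BASES) * pseudocount)
            column[b] = math.log2(freq / 0.25)
        pssm.append(column)
    return pssm

def score(pssm, t):
    """Sum the column scores of the bases t_j of sequence t."""
    return sum(column[t_j] for column, t_j in zip(pssm, t))

def leave_one_out_scores(sites, negatives):
    """For each held-out site s, train on the rest and score s and the negatives."""
    results = []
    for k, s in enumerate(sites):
        training = sites[:k] + sites[k + 1:]
        pssm = build_pssm(training)
        results.append((score(pssm, s), [score(pssm, neg) for neg in negatives]))
    return results

# Toy data (hypothetical): four sites for one TF, two negative examples.
folds = leave_one_out_scores(["ACGT", "ACGA", "ACGT", "TCGT"], ["GGGG", "TTTT"])
```

Each fold yields the held-out site's score together with the scores of all negative examples, which is what is needed to place s within the overall pool.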
Comparing Methods
• For each site s of a transcription factor under consideration, a rank in
cross-validation testing is computed by counting how many negative examples
score as well as or better than s
– A lower rank indicates better performance
• To compare how well two methods perform, a Wilcoxon matched-pairs
signed-ranks test is used
• The number of times one method outperforms the other is compared with
how many times such an event would happen merely by chance under the
assumption that both methods perform equally well
• A ROC curve is created for each individual leave-one-out test, and then the
average over all sites for that transcription factor is computed
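The rank and the averaged ROC curve can be sketched as follows. Because each leave-one-out test has a single positive, its ROC curve is a step that jumps from 0 to 1 once the false-positive rate reaches rank / (number of negatives). The fixed grid of false-positive rates and the function names are assumptions of this sketch.

```python
def rank_of_site(site_score, negative_scores):
    """Number of negatives scoring as well as or better than the held-out site."""
    return sum(1 for x in negative_scores if x >= site_score)

def averaged_roc(folds, grid_size=11):
    """folds: list of (site_score, negative_scores). Returns [(fpr, mean_tpr), ...]."""
    curve = []
    for g in range(grid_size):
        fpr = g / (grid_size - 1)
        tprs = []
        for site_score, negs in folds:
            # False-positive rate at which the held-out site is recovered.
            fpr_needed = rank_of_site(site_score, negs) / len(negs)
            tprs.append(1.0 if fpr_needed <= fpr else 0.0)
        curve.append((fpr, sum(tprs) / len(tprs)))
    return curve

# Toy scores (hypothetical): two held-out sites against the same four negatives.
toy_folds = [(5.0, [1, 2, 3, 4]), (2.5, [1, 2, 3, 4])]
curve = averaged_roc(toy_folds)
```

Here the first site has rank 0 and the second has rank 2, so the averaged curve starts at a true-positive rate of 0.5 and reaches 1.0 at a false-positive rate of 0.5.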
Comparison of basic methods
ROC curves comparing performance when pairs are considered for Centroid
ROC curves comparing performance when pairs are considered for PSSM
ROC curves comparing Centroid-P with scope 2 using regular sites and sites with columns shuffled
Performance of methods based on averaged ranks per transcription factor
Conclusions
• Using per-position information content to weight positional scores improves
the performance of all methods
– Sometimes dramatically
• Methods based on nucleotide matches, such as consensus sequences and
Centroid, show statistically significant improvements when incorporating
pairwise nucleotide dependencies
– Probabilistic methods, such as log-odds PSSMs, do not show statistically
significant improvements when incorporating pairwise dependencies
• Difference in performance between methods decreases substantially once
information content and pairwise correlations have been incorporated
Making connections between novel transcription
factors and their DNA motifs
Kai Tan,1 Lee Ann McCue,2 and Gary D. Stormo1,3
1Department of Genetics, Washington University School of Medicine,
Saint Louis, Missouri 63110, USA; 2The Wadsworth Center,
New York State Department of Health, Albany, New York 12201-0509,
USA
Introduction
• A computational method to connect novel transcription factors and
DNA motifs in E. coli
• The method takes advantage of three types of information to assign
a DNA binding motif to a given TF
1. A distance constraint between a TF and its closest binding site
in the genome (Dmin information)
2. The phylogenetic correlation between TFs and their regulated
genes (PC information)
3. A binding specificity constraint for TFs having structurally similar
DNA-binding domains (FMC information)
• The different types of information are combined to calculate the
probability of a given transcription-factor–DNA-motif pair being a
true pair
Distance constraint
• Besides auto-regulation, it has been noticed in many cases that TFs and
the genes they regulate are near each other in the genome
– Distance constraint between the TF and its closest binding site in the genome
• Dmin_self is the distance between a TF gene and its closest binding site in
the genome
• Dmin_cross is the distance between a TF gene and the closest binding site for
a different TF
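A minimal sketch of the distance computation, under stated assumptions: each gene and each binding site is reduced to a single coordinate, and distance is measured along a circular chromosome. The function names, the coordinates, and the genome length are hypothetical, and the paper may define the distance differently.

```python
def circular_distance(a, b, genome_length):
    """Shortest distance between two coordinates on a circular chromosome."""
    d = abs(a - b) % genome_length
    return min(d, genome_length - d)

def d_min(tf_gene_position, site_positions, genome_length):
    """Distance from a TF gene to the closest site in site_positions."""
    return min(circular_distance(tf_gene_position, p, genome_length)
               for p in site_positions)

genome_length = 1000                                  # toy value
dmin_self = d_min(950, [20, 500], genome_length)      # sites of the TF itself
dmin_cross = d_min(950, [400, 700], genome_length)    # sites of a different TF
```

With these toy coordinates the TF's own closest site lies 70 positions away across the origin, while the closest site of the other TF is 250 positions away.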
The phylogenetic correlation
• TFs and their regulated genes tend to evolve concurrently
• Connect TFs and DNA motifs through correlation between their
occurrences in a comparative analysis of multiple species
• Two types of phylogenetic correlation (PC) distributions
– PC for true TF–DNA-motif pairs.
– PC for false TF–DNA-motif pairs
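One way to picture the phylogenetic correlation is as a correlation between two presence/absence profiles across species: one for the TF and one for the DNA motif. The sketch below uses a Pearson correlation on 0/1 vectors. That choice of measure, the function name, and the profiles are assumptions for illustration and are not stated on the slide.

```python
import math

def phylogenetic_correlation(tf_profile, motif_profile):
    """Pearson correlation between two 0/1 occurrence profiles across species."""
    n = len(tf_profile)
    if n != len(motif_profile):
        raise ValueError("profiles must cover the same species")
    mean_x = sum(tf_profile) / n
    mean_y = sum(motif_profile) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(tf_profile, motif_profile))
    var_x = sum((x - mean_x) ** 2 for x in tf_profile)
    var_y = sum((y - mean_y) ** 2 for y in motif_profile)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no co-occurrence signal
    return cov / math.sqrt(var_x * var_y)

# Toy profiles over six species (hypothetical).
tf = [1, 1, 0, 0, 1, 0]
pc_matching = phylogenetic_correlation(tf, [1, 1, 0, 0, 1, 0])
pc_opposite = phylogenetic_correlation(tf, [0, 0, 1, 1, 0, 1])
```

Computing this value for known true pairs and for mismatched pairs would give the two PC distributions listed above.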
Binding specificity constraint
• TFs that are more similar to one another are expected to bind sites that
are more similar to each other than the sites bound by dissimilar TFs
• Distribution of average similarity scores for motifs from the same family
and from different families
Conclusions
• Hypothesize that information concerning the connection of a TF to
its DNA motif is carried in the genome sequences
• TFs and their binding sites are often close to each other in the genome
(Dmin information)
• TFs tend to evolve concurrently with their regulated genes (PC
information)
• TFs from the same structural family tend to have similar DNA motifs
(FMC information)
Functional determinants of transcription factors in
Escherichia coli: protein families and binding sites
M. Madan Babu and Sarah A. Teichmann
MRC Laboratory of Molecular Biology, Hills Road,
Cambridge CB2 2QH, UK
Introduction
• DNA-binding transcription factors regulate expression of genes near to
where they bind
• These factors can be activators or repressors of transcription, or both
• A fundamental question is what determines whether a transcription
factor acts as an activator or a repressor?
– Protein–protein contacts
– Position of the DNA-binding domain in the protein primary sequence
– Altered DNA structure
– Position of its binding site on the DNA relative to the transcription start site
• This work suggests that, in general, in E. coli, a transcription factor’s
protein family is not indicative of its regulatory function, but the position
of its binding site on the DNA is
Domain Architectures for different TFs
Conclusions
• Activators, repressors and dual regulators in E. coli belong to many of the
same protein families and share some domain architectures
• A transcription factor’s regulatory role is not determined by protein
structure or evolutionary relationships
– Transcription factors have evolved by duplication of an ancestral transcription
factor, followed by a change in function through a shift in binding sites
• A transcription factor’s regulatory role is determined to a large extent
simply by the position of the transcription factor binding site
– Activators have essentially only upstream binding sites
– More than two thirds of repressors have at least one downstream binding site