Supplementary notes
1/ The computational pipeline
Figure 1: Overview of the computational pipeline for the prediction of
conserved direct target genes of a transcription factor.
To identify functional target sites, and consequently the target gene battery of a
given transcription factor, we devised a multi-step procedure (figure 1) that relies
on the evolutionary conservation of functionally relevant transcription factor
binding sites. Because the procedure is applied at a genome-wide level, it allows
whole sets of target genes to be predicted.
For the genome-wide analysis we first reduce the complexity by limiting the
search space to the regions upstream of annotated genes, which we subsequently search
for the presence of the motif corresponding to the transcription factor binding site.
Regions that do not contain the motif of interest are discarded in this step. We
then align each selected region pairwise with the corresponding region in closely
related genomes (e.g. human/mouse or human/rat) using promoterwise [1], an alignment
tool adapted for the analysis of non-coding sequences
(http://www.ebi.ac.uk/~birney/wise2/). Regions in which the identified binding site
does not lie in a conserved stretch are discarded. In the next step we scan orthologous
regions in more diverged species for the presence of the motif. This additional
filtering step is independent of any alignment, i.e. the motif does not have to lie in
a conserved stretch.
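The motif search described above can be illustrated with a minimal sketch. It assumes a relative match score, (score - min) / (max - min), which is one common way to express a PWM cutoff as a percentage; the notes do not state how the cutoff is computed. The matrix `TOY_PWM` and the function names are hypothetical and do not reproduce any Transfac entry. Only the forward strand is scanned, to keep the example short.

```python
from typing import Dict, List, Tuple

# Hypothetical toy matrix with consensus "GCGC" (one dict of base -> count per
# motif position). Illustrative numbers only, NOT a Transfac matrix.
TOY_PWM: List[Dict[str, int]] = [
    {"A": 1, "C": 1, "G": 7, "T": 1},
    {"A": 1, "C": 7, "G": 1, "T": 1},
    {"A": 1, "C": 1, "G": 7, "T": 1},
    {"A": 1, "C": 7, "G": 1, "T": 1},
]


def relative_score(pwm: List[Dict[str, int]], site: str) -> float:
    """Score a site and rescale to [0, 1] between the worst and best possible site."""
    score = sum(column[base] for column, base in zip(pwm, site))
    lowest = sum(min(column.values()) for column in pwm)
    highest = sum(max(column.values()) for column in pwm)
    return (score - lowest) / (highest - lowest)


def scan(pwm: List[Dict[str, int]], sequence: str,
         cutoff: float = 0.85) -> List[Tuple[int, str, float]]:
    """Return (start, site, relative score) for every forward-strand window >= cutoff."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width].upper()
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing N or other ambiguity codes
        rel = relative_score(pwm, window)
        if rel >= cutoff:
            hits.append((start, window, rel))
    return hits
```

An upstream region would be kept for the next step only if `scan` returns at least one hit; a real analysis would also scan the reverse complement.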
For many of these upstream sequences no significant alignment is detectable
between diverged genomes, which makes it difficult to identify the orthologous region
in the diverged genome. However, in most cases gene orthology is unambiguous and can
be used as an anchor to identify the corresponding upstream region. For example,
orthologous genes in fish (zebrafish and fugu) are identified and their upstream
regions are then scanned for the presence of the motif. An upstream region passes this
second filtering step only if the motif is present in one of the corresponding fish
regions. Predicted target genes are then defined by virtue of their linked upstream
region that contains the site.
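The successive filtering steps can be summarized as a cascade of yes/no tests per upstream region. The sketch below assumes that the three tests have already been computed elsewhere and stored as flags; the class `UpstreamRegion`, its field names and the gene names in the usage note are hypothetical. It also reads "present in one of the corresponding fish regions" as "present in at least one", which is an interpretation.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class UpstreamRegion:
    """Hypothetical record holding the outcome of each test for one human gene."""
    gene: str
    has_motif: bool                 # motif hit in the human upstream region
    motif_in_conserved_block: bool  # hit lies in a stretch aligned with rodent
    # motif found in the upstream region of the orthologous gene, per fish species
    fish_ortholog_has_motif: Dict[str, bool] = field(default_factory=dict)


def predict_targets(regions: List[UpstreamRegion]) -> List[str]:
    """Apply the three filters in order and return the genes that pass all of them."""
    targets = []
    for region in regions:
        if not region.has_motif:
            continue  # step 1: no motif, region discarded
        if not region.motif_in_conserved_block:
            continue  # step 2: motif not in a conserved human/rodent stretch
        if not any(region.fish_ortholog_has_motif.values()):
            continue  # step 3: motif absent from every orthologous fish region
        targets.append(region.gene)
    return targets
```

For example, a region with a conserved human/rodent hit and a motif upstream of the zebrafish ortholog only would pass, whereas a region whose fish orthologs both lack the motif would not.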
Benchmarking
We benchmarked our in silico procedure using the binding site of the transcription
factor E2F (Transfac, Jaspar [2,3]) and compared our data set with that obtained by
chromatin immunoprecipitation (IP) [4] (see also Materials and Methods). The E2F
position weight matrix (PWM) (Transfac, M00516) was used to search for E2F target
genes as described in Materials and Methods. The sensitivity and specificity of the
procedure were calculated for each cutoff of the matrix and the receiver operating
characteristic (ROC) curve was plotted (figure 2).
Figure 2: ROC curve. Two alternative PWMs for the E2F binding site were used
(M00050 and M00516 from Transfac (Matys et al. 2006)). Variable cutoffs were
analyzed and the predicted target genes were compared with the data from [4] as
described in Materials and Methods.
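The sensitivity and specificity behind each point of such a ROC curve follow from comparing the predicted gene set with the experimental positive set. The sketch below is a generic formulation that assumes the set of experimentally analyzed genes as the universe; the function names and the gene identifiers used to exercise it are hypothetical, and the notes do not describe the authors' exact implementation.

```python
from typing import Dict, List, Set, Tuple


def sensitivity_specificity(predicted: Set[str], positives: Set[str],
                            universe: Set[str]) -> Tuple[float, float]:
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    predicted = predicted & universe   # only genes that were assayed can be scored
    positives = positives & universe
    tp = len(predicted & positives)
    fp = len(predicted - positives)
    fn = len(positives - predicted)
    tn = len(universe) - tp - fp - fn
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


def roc_points(predictions_by_cutoff: Dict[float, Set[str]], positives: Set[str],
               universe: Set[str]) -> List[Tuple[float, float, float]]:
    """Return (cutoff, false positive rate, sensitivity) for each PWM cutoff."""
    points = []
    for cutoff in sorted(predictions_by_cutoff):
        sens, spec = sensitivity_specificity(
            predictions_by_cutoff[cutoff], positives, universe)
        points.append((cutoff, 1.0 - spec, sens))
    return points
```

Plotting sensitivity against the false positive rate (1 - specificity) over all cutoffs gives one curve per matrix.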
Next we set the PWM match cutoff to 85% using the matrix that performs best
(M00516). To test whether the computational procedure significantly enriches for
target genes experimentally identified by Ren et al. [4] as bound by human E2F1/E2F4
in their upstream regions, a randomization procedure was applied. A set of genes was
randomly sampled from the genes analyzed experimentally. The number of genes in
that random set equals the number of genes found by the computational procedure
that overlap with the set of genes analyzed by Ren et al. [4]. The overlap
between the random set and the positive set was assessed and compared with the real
overlap (obtained with the predicted target genes). This randomization procedure was
repeated 100,000 times (tables 1 and 2).
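The randomization procedure can be sketched as repeated sampling without replacement from the experimentally analyzed genes. This is a minimal illustration under stated assumptions: the function name, its parameters and the fixed seed are hypothetical, and the sizes of the analyzed and positive sets must be supplied by the user because they are not given in these notes.

```python
import random
from typing import Iterable, Tuple


def randomization_test(all_genes: Iterable[str], positive_genes: Iterable[str],
                       n_predicted: int, observed_overlap: int,
                       n_iter: int = 100_000, seed: int = 0) -> Tuple[float, float]:
    """Return (mean random overlap, fraction of random sets with overlap >= observed)."""
    rng = random.Random(seed)        # fixed seed so the sketch is reproducible
    pool = sorted(set(all_genes))    # genes analyzed experimentally (ALL)
    positives = set(positive_genes)  # experimentally bound genes (POS)
    total_overlap = 0
    at_least_observed = 0
    for _ in range(n_iter):
        # draw as many genes as the computational procedure predicted within ALL
        sample = rng.sample(pool, n_predicted)
        overlap = sum(1 for gene in sample if gene in positives)
        total_overlap += overlap
        if overlap >= observed_overlap:
            at_least_observed += 1
    return total_overlap / n_iter, at_least_observed / n_iter
```

The first returned value corresponds to an average overlap over all random sets, and the second is an empirical P value. If no random set reaches the observed overlap in 100,000 iterations, the empirical P value can only be bounded as below 1/100,000.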
Real data
Filtering step     In ALL    In POS    Percentage in POS
Motif present      165       34        20.6
Conserved w        29        19        65.5
Table 1: Number of genes predicted at each filtering step that overlap with the
genes analyzed by [4] (ALL) and with the positive set from [4] (POS). The filtering
steps are the presence of a hit in the upstream region of the human gene (Motif
present), the presence of the motif in a region conserved with rodent (conserved with
rodent), the presence of the motif in the orthologous region of fish (conserved with
fish), or the full filtering procedure as described in Materials and Methods.
Randomization
Filtering step     In ALL    In POS (average)    Percentage in POS
Motif present      165       13.5                8.5
Conserved w        29        2.4                 8.3
Table 2: Genes within ALL were randomly sampled and the overlap with POS was
calculated. For each step the number of genes sampled is the same as the number found
using the computational procedure, and the randomization was repeated 100,000 times
per filtering step.
In no case does the random dataset show an overlap equal to or better than that of
the experimental dataset (P value < 0.00001). Furthermore, the percentage of genes
found that overlap with the positive set increases at each filtering step. These
results confirm that the filtering procedure significantly enriches for target genes
and that this enrichment improves with each filtering step.
2/ Promoter analysis
We tested the ability of those constructs with an altered ATOH7 (Ath5) binding
site to drive the expression of a GFP reporter in vivo in transgenic medaka embryos.
Owing to the high stability of the GFP protein, fluorescence is also detected in mature
neurons, long after ath5 expression has ceased. Embryos injected with GFP reporter
constructs in which the E-box consensus in the two conserved motifs is disrupted
(HP) fail to express GFP efficiently in the endogenous domain (not shown, see also
Materials and Methods).
Interestingly, the variations that change the binding motif without altering the
E-box consensus (M, N and MN, see Table S3 for primer sequences) had a striking
effect on the specificity of the promoter. In the transgenic embryos of all
three variants, ectopic GFP expression domains were detected in addition to the
endogenous domain in the retina. These were the olfactory receptor neurons (Fig.
S1C), as well as yolk cells and the tail region (data not shown). Ath5 is not normally
expressed in these territories at any developmental stage, with the exception of
Xenopus [5], where a single nucleotide alteration within one of the two conserved
motifs is found (see Fig. 1A). Strikingly, the same alteration in the medaka promoter,
or the Xenopus promoter itself, when analyzed in transgenic medaka embryos, drives
GFP expression in both the retina and the olfactory receptor neurons (data not
shown). Thus the change of a single base pair in the E-box of the medaka promoter is
sufficient to attract a new regulatory input leading to the establishment of a new
expression domain.
3/ Specificity of activation
While Xenopus Neurod4 (Xath3) can activate the Ath5::GFP reporter, it only
partially activates the predicted Ath5 target genes (Fig. S3). Interestingly, this
activation is mediated by the induction of endogenous Ath5 expression and is
efficiently blocked in the presence of Ath5 morpholino oligos (Table S3). Xenopus
Ash1 (Xash1), in contrast, activates neither the Ath5 targets nor the Ath5 reporter.
Both are efficiently activated by Xenopus Xath5, an activation that cannot be
competed for by the medaka Ath5 morpholino, indicating the specificity of the
morpholino and the interaction.
4/ Mutant analysis
To investigate whether the in silico target genes are directly controlled by Ath5,
we examined the expression of six predicted targets in the retina of the zebrafish
mutant lakritz (lak), which does not express functional Ath5 [6]. In all cases analyzed
(Brn3C, CD166, Adam11, Gfi-1, HuC and NN-1, Fig. S4), RGC expression of the
target genes was specifically abolished in the lak mutant retinae, indicating that
lak/ath5 function is required for the expression of the target genes in this domain.
The expression of these Ath5 target genes is not affected in their other
expression domains outside the retina, indicating an Ath5-independent input into
their transcriptional control.
References
1. Birney E, Clamp M, Durbin R (2004) GeneWise and Genomewise. Genome Res
14: 988-995.
2. Matys V, Kel-Margoulis OV, Fricke E, Liebich I, Land S, et al. (2006)
TRANSFAC and its module TRANSCompel: transcriptional gene regulation
in eukaryotes. Nucleic Acids Res 34: D108-110.
3. Vlieghe D, Sandelin A, De Bleser PJ, Vleminckx K, Wasserman WW, et al. (2006)
A new generation of JASPAR, the open-access repository for transcription
factor binding site profiles. Nucleic Acids Res 34: D95-97.
4. Ren B, Cam H, Takahashi Y, Volkert T, Terragni J, et al. (2002) E2F integrates cell
cycle progression with DNA repair, replication, and G(2)/M checkpoints.
Genes Dev 16: 245-256.
5. Burns CJ, Vetter ML (2002) Xath5 regulates neurogenesis in the Xenopus olfactory
placode. Dev Dyn 225: 536-543.
6. Kay JN, Finger-Baier KC, Roeser T, Staub W, Baier H (2001) Retinal Ganglion
Cell Genesis Requires lakritz, a Zebrafish atonal Homolog. Neuron 30: 725-736.