Materials and Methods

SUPPLEMENTARY MATERIALS AND METHODS
Strains and transgenes used
N2 Bristol wild type (BRENNER 1974), cog-1(sy607) (PALMER et al. 2002), cog-1(ot119),
cog-1(ot201) (SARIN et al. 2007). The following transgenes were used: otIs114=Is[lim-6prom::gfp; rol-6(d)] (CHANG et al. 2003), otIs151=Is[ceh-36prom::rfp; rol-6(d)] (JOHNSTON
and HOBERT 2003), syIs54=Is[unc-119(+); ceh-2::gfp] (INOUE et al. 2005),
otEx1492=Ex[cog-1prom1::gfp; rol-6(d)], otEx2942=Ex[cog-1prom1del2::gfp] (ETCHBERGER et
al. 2007).
New transgenes:
otEx3761-63, three independent lines of Ex[cog-1promA::gfp; elt-2::gfp]
otEx3764-66, three independent lines of Ex[cog-1promA-ot119::gfp; elt-2::gfp]
otEx3767-69, three independent lines of Ex[cog-1promA-ot201::gfp; elt-2::gfp]
otEx3770-72, three independent lines of Ex[cog-1promB::gfp; elt-2::gfp]
otEx3782-84, three independent lines of Ex[cog-1promA-del2::gfp; elt-2::gfp]
otEx3527, 3536, 3538, three independent lines of Ex[gcy-1prom::gfp; elt-2::gfp]
otEx3550-52, three independent lines of Ex[gcy-1prom-del1::gfp; elt-2::gfp]
otEx3553-56, three independent lines of Ex[gcy-1prom-del2::gfp; elt-2::gfp]
otEx3779-81, three independent lines of Ex[gcy-1prom-del1,2::gfp; elt-2::gfp]
otEx3773-78, six independent rescuing lines of Ex[cog-1; elt-2::gfp]
Comparative genome hybridization
Details of the DNA preparation and aCGH procedure can be found in the accompanying
manuscript and in Maydan et al. Briefly, Roche NimbleGen Inc. manufactured the
microarrays and performed the DNA labeling, hybridization, image analysis and
extraction of raw fluorescence intensities. The DNA samples (ot119 and ot201) were
labeled with Cy3 and the reference DNA (wild-type control; all strains contain the
otIs114 transgene) with Cy5. No background was subtracted before calculating the
log2 ratio values, and normalization used a LOESS regression. The microarray
contains a single copy of every possible 50-mer oligonucleotide targeting each strand of
chromosome II between coordinates 14743042 and 15068429, with the exception of
oligonucleotides affected by the repeats listed in WormBase data freeze WS180 and
oligonucleotides exceeding 150 cycles in NimbleGen’s synthesis process. No other
constraints were applied while designing the oligonucleotides.
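The exhaustive tiling described above can be sketched in a few lines. This is only an illustration: it assumes the two coordinates are 1-based and inclusive, it ignores the repeat and synthesis-cycle filters, and the function names are hypothetical rather than part of any NimbleGen tool.

```python
# Sketch of an exhaustive k-mer tiling on both strands.
# Assumptions: uppercase ACGT input, inclusive 1-based coordinates, no filtering.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def tile_both_strands(seq, k=50):
    """Yield (offset, strand, oligo) for every k-mer window on both strands."""
    for offset in range(len(seq) - k + 1):
        window = seq[offset:offset + k]
        yield offset, "+", window
        yield offset, "-", reverse_complement(window)


def window_count(start, end, k=50):
    """Number of k-mer start positions per strand in the inclusive interval."""
    return (end - start + 1) - k + 1


# Candidate 50-mers per strand for the tiled interval, before any filtering.
print(window_count(14743042, 15068429))
```

Under these assumptions the interval yields 325,339 candidate windows per strand before the repeat and synthesis-cycle exclusions are applied.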
DNA constructs and generation of transgenic lines:
The cog-1promA reporter construct was created by TOPO cloning a 6081 bp PCR fragment
extending from -6081 to -1 bp relative to the transcriptional start site into vector pCR 2.1
using a PstI site and a BamHI site introduced at either end of the PCR product; this
fragment was subsequently subcloned into the pPD95.75 expression vector. The cog-1promB
reporter construct was created by cloning a 2104 bp PCR fragment extending
from -6081 to -3977 bp relative to the transcriptional start site directly into pPD95.75
using HindIII and XbaI sites introduced at either end of the PCR product. The
gcy-1prom::gfp reporter construct was created by cloning a 712 bp fragment extending from -712 to -1 bp
into pPD95.75 using HindIII and BamHI restriction sites introduced at either end of the
PCR product.
The cog-1promA-ot119, cog-1promA-ot201, cog-1promA-del2, gcy-1prom-del1,
gcy-1prom-del2, and gcy-1prom-del1,2 reporter constructs were created with the QuikChange II
XL Site-Directed Mutagenesis Kit (Stratagene) using the cog-1promA or gcy-1prom reporter
construct as the original template. Primer sequences for these constructs are as
follows:
cog-1promA PstI fwd: ATGCCTGCAGCCAAAGATGGTTTTTTTCCAC
cog-1promA BamHI rev: TAGAGGATCCTCTGGTTATGGTAGAGGGGAG
cog-1promB HindIII fwd: TTAAGCTTTCCAAAGATGGTTTTTTTCC
cog-1promB XbaI rev: TTTCTAGAGGGCTAGTTTTGGGAAAACG
cog-1promA-ot119 fwd: GCGAGACGAGAAGCTCTATCATCATTATAATTTTC
cog-1promA-ot119 rev: GAAAATTATAATGATGATAGAGCTTCTCGTCTCGC
cog-1promA-ot201 fwd: GAAAGCAGCGAGACGAAAAGCCCTATCATCA
cog-1promA-ot201 rev: TGATGATAGGGCTTTTCGTCTCGCTGCTTTC
cog-1promA-del2 fwd: CATATTTTCTATCTACATCATCATCTTC
cog-1promA-del2 rev: GAAGATGATGATGTAGATAGAAAATATG
gcy-1prom HindIII fwd: TTAAGCTTGTGTACTACAACAAGGGACTTTG
gcy-1prom BamHI rev: TTGGATCCCGAAAAGATAATTTCAAAACAAT
gcy-1prom-del1 fwd: TTTAACTAACTTGCAGACTTAATTTGGAAACATCATGATAAA
gcy-1prom-del1 rev: TGATGTTTCCAAATTAAGTCTGCAAGTTAGTTAAAATATTTT
gcy-1prom-del2 fwd: CACATTATCAAGAAAAACTAGTGTATTTTATTGGGAAATTTC
gcy-1prom-del2 rev: CCCAATAAAATACACTAGTTTTTCTTGATAATGTGTGTAGTT
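As a quick consistency check on the list above, the three cog-1promA mutagenesis primer pairs can be tested for exact reverse complementarity. The sketch below is illustrative only and its helper names are hypothetical. The gcy-1prom deletion primers are left out because, as listed, each pair is offset and overlaps only partially.

```python
# Check that each listed mutagenesis primer pair is an exact reverse-complement pair.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def is_complementary_pair(fwd, rev):
    """True when rev is exactly the reverse complement of fwd."""
    return reverse_complement(fwd) == rev


# Sequences copied from the primer list above.
PRIMER_PAIRS = {
    "cog-1promA-ot119": ("GCGAGACGAGAAGCTCTATCATCATTATAATTTTC",
                         "GAAAATTATAATGATGATAGAGCTTCTCGTCTCGC"),
    "cog-1promA-ot201": ("GAAAGCAGCGAGACGAAAAGCCCTATCATCA",
                         "TGATGATAGGGCTTTTCGTCTCGCTGCTTTC"),
    "cog-1promA-del2": ("CATATTTTCTATCTACATCATCATCTTC",
                        "GAAGATGATGATGTAGATAGAAAATATG"),
}

for name, (fwd, rev) in PRIMER_PAIRS.items():
    print(name, is_complementary_pair(fwd, rev))
```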
cog-1prom constructs were injected into otIs151 worms at a concentration of 25 ng/µl with
45 ng/µl elt-2::gfp injection marker. gcy-1prom constructs were first PCR amplified with
primers gcy-1prom HindIII fwd and gcy-1prom BamHI rev, and the resulting linear PCR product
was injected into otIs151 at 50 ng/µl with 45 ng/µl elt-2::gfp injection marker. Resulting
lines were maintained at 20°C and scored under a Zeiss Imager.Z1 microscope.
ot119 and ot201 mutants were rescued by injection of a PCR fragment containing the
full cog-1 genomic region, from 3977 bp upstream of the transcriptional start site
through 975 bp downstream of the stop codon, at 25 ng/µl with 45 ng/µl elt-2::gfp injection
marker. Primers used to amplify the cog-1 locus were as follows:
cog-1prom1 5’: TTGCATGCCTTGCTCAATGTACGTATATG
cog-1 3’UTR rvs: GTACGCTCCAGTTCAGAC
Bioinformatic analysis of the ASE motif
Position weight matrix calculation and scoring
We derived the ASE motif position weight matrix from a set of 37 12-mer sites
known to be functional (ETCHBERGER et al. 2007). The weights are calculated as a 4 x
12 matrix of the frequency of each possible nucleotide in each position. We did not add
pseudo-counts to the counts derived from the functional sites. The ‘score’ referred to in
the text for a given putatively functional ASE motif is defined as follows, for a locus in
the genome with sequence GAAGCCATAATT.
In the following,
SIF means ‘site is functional’
GAA… means ‘sequence is GAAGCCATAATT’
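$$P(\mathrm{SIF} \mid \mathrm{GAA\ldots}) = \frac{P(\mathrm{GAA\ldots} \mid \mathrm{SIF}) \, P(\mathrm{SIF})}{P(\mathrm{GAA\ldots})}$$
where
$$P(\mathrm{GAA\ldots} \mid \mathrm{SIF}) = M_{1,G} \, M_{2,A} \, M_{3,A} \, M_{4,G} \cdots M_{12,T}$$
where the M’s are the cells of the weight matrix, and
$$P(\mathrm{GAA\ldots}) = P(\mathrm{GAA}) \, P(\mathrm{G} \mid \mathrm{GAA}) \, P(\mathrm{C} \mid \mathrm{AAG}) \cdots P(\mathrm{T} \mid \mathrm{AAT})$$
which is a second-order Markov chain.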
The main effect of using a second-order Markov chain for the background probability
is to downgrade the score of poly-A runs, whose background probability would be poorly
estimated by a zeroth-order Markov chain.
Also, for all sites, P(SIF) is an unknown constant equal to the total number of functional
sites in the genome divided by the size of the genome. Because it is constant,
it does not affect comparisons between scores, so we set it equal to 1 in our score
calculations.
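The scoring scheme can be sketched as follows. This is a minimal illustration and not the authors' implementation. The function names are hypothetical, the background model is estimated from whatever sequence the caller supplies, and the Markov context length k is left as a parameter because this sketch cannot confirm the value the authors used.

```python
from collections import Counter

BASES = "ACGT"


def build_pwm(sites):
    """Per-position base frequencies from aligned sites; no pseudo-counts."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    pwm = []
    for pos in range(length):
        counts = Counter(site[pos] for site in sites)
        pwm.append({base: counts[base] / len(sites) for base in BASES})
    return pwm


def build_background(genome, k):
    """Count k-mers and their one-base extensions in a background sequence."""
    n = len(genome)
    return {
        "k": k,
        "starts": Counter(genome[i:i + k] for i in range(n - k + 1)),
        "n_starts": n - k + 1,
        # Contexts are counted only where a following base exists.
        "contexts": Counter(genome[i:i + k] for i in range(n - k)),
        "extended": Counter(genome[i:i + k + 1] for i in range(n - k)),
    }


def background_probability(seq, bg):
    """P(seq) = P(first k bases) * product of P(next base | k preceding bases)."""
    k = bg["k"]
    prob = bg["starts"][seq[:k]] / bg["n_starts"]
    for i in range(k, len(seq)):
        context = seq[i - k:i]
        seen = bg["contexts"][context]
        if seen == 0:
            return 0.0  # context never observed in the background
        prob *= bg["extended"][context + seq[i]] / seen
    return prob


def motif_score(seq, pwm, bg):
    """P(seq | SIF) / P(seq): the posterior with the constant P(SIF) set to 1."""
    likelihood = 1.0
    for pos, base in enumerate(seq):
        likelihood *= pwm[pos][base]
    background = background_probability(seq, bg)
    if background == 0.0:
        raise ValueError("sequence has zero background probability")
    return likelihood / background
```

In use, `sites` would hold the 37 functional 12-mers and `genome` the background sequence. Because no pseudo-counts are added, any candidate carrying a base never seen at that position in the training sites scores exactly zero.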
Retrieving ASE motifs
In a series of statistical experiments, we retrieved all ASE motifs occurring
anywhere upstream of all 20,183 C. elegans genes, using an in-house version of
CisOrtho (to be published). The upstream region was defined as the interval between
the translation start site (ATG of first exon) and the boundary of the nearest exon of a
neighboring gene, or 5,000 bases, whichever came first. We used version WS187 of
the C. elegans genome and gene annotations from
ftp://ftp.wormbase.org/pub/wormbase/genomes/elegans. For genes with multiple
isoforms, we defined unique gene bounds as the furthest 5’ and 3’ extent of any
isoform.
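The upstream-region rule can be expressed compactly. The sketch below is not CisOrtho code. It assumes inclusive 1-based coordinates, takes the ATG position as the first base of the start codon, and uses hypothetical function and argument names.

```python
def gene_bounds(isoforms):
    """Collapse (start, end) isoform spans into one span covering all of them."""
    starts, ends = zip(*isoforms)
    return min(starts), max(ends)


def upstream_interval(atg, strand, neighbor_exon_edges, max_len=5000):
    """Return the inclusive (start, end) upstream interval for one gene.

    atg: coordinate of the first base of the start codon.
    neighbor_exon_edges: exon boundary coordinates of neighboring genes.
    The interval stops at the nearest neighboring exon or after max_len
    bases, whichever comes first. An empty interval has start > end.
    """
    if strand == "+":
        edges = [e + 1 for e in neighbor_exon_edges if e < atg]
        start = max([1, atg - max_len] + edges)
        return start, atg - 1
    if strand == "-":
        edges = [e - 1 for e in neighbor_exon_edges if e > atg]
        end = min([atg + max_len] + edges)
        return atg + 1, end
    raise ValueError("strand must be '+' or '-'")


# A plus-strand gene whose nearest upstream neighbor exon ends at 9000.
print(upstream_interval(10000, "+", [2000, 9000]))
```

The minus-strand branch does not clip at the chromosome end, since the chromosome length is not an input here; a fuller version would need it.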
Top-N motif score
The top-N motif score for a given gene and value of N is defined as the
probability that all N top-scoring upstream ASE motifs are functional:
$$P(\mathrm{SIF}_1, \mathrm{SIF}_2, \ldots, \mathrm{SIF}_N \mid \mathrm{seq}_1, \mathrm{seq}_2, \ldots, \mathrm{seq}_N) = \prod_{m=1}^{N} P(\mathrm{SIF}_m \mid \mathrm{seq}_m)$$
If a gene has fewer than N upstream sites, its top-N motif score is defined to be
zero. We use this score to test the hypothesis that N sites are required to be functional
for a given gene. In our experiments, N varies from 1 to 10. As with individual motif
scores, the values are only proportional to probabilities, because the constant P(SIF) is
omitted. However, since we always compare top-N motif scores to others of the same N
value, omitting this factor makes no difference to the overall ordering.
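A minimal sketch of the top-N motif score follows. It takes the per-site scores of one gene as input; the function name is hypothetical.

```python
import math


def top_n_score(site_scores, n):
    """Product of the n highest site scores; zero if fewer than n sites exist."""
    if len(site_scores) < n:
        return 0.0
    top = sorted(site_scores, reverse=True)[:n]
    return math.prod(top)


# Scores for N = 1..4 for a gene with three upstream sites.
print([top_n_score([0.5, 2.0, 4.0], n) for n in range(1, 5)])
```

Because every site contributes one omitted P(SIF) factor, scores are only comparable between genes at the same N, which is how they are used here.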
SAGE tag score
After considering a few different ways of sorting genes based on SAGE tag
criteria, we found the most predictive quantity to be a modified ASE tag count to AFD
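tag count ratio:
$$\text{SAGE tag score} = \begin{cases} \mathrm{ASE}/\mathrm{AFD} & \text{if } \mathrm{ASE} > 0 \text{ and } \mathrm{AFD} > 0 \\ 0 & \text{if } \mathrm{ASE} = 0 \text{ and } \mathrm{AFD} \ge 0 \\ \mathrm{ASE} & \text{if } \mathrm{ASE} > 0 \text{ and } \mathrm{AFD} = 0 \end{cases}$$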
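The modified ratio can be sketched as a small function. The sketch assumes the natural reading of the three cases: a gene with no ASE tags scores zero, a gene with ASE tags but no AFD tags keeps its raw ASE count, and every other gene gets the plain ratio. The function name is hypothetical.

```python
def sage_tag_score(ase, afd):
    """Modified ASE/AFD tag-count ratio that avoids division by zero.

    Assumed cases: 0 when ase == 0; ase when afd == 0; ase / afd otherwise.
    """
    if ase < 0 or afd < 0:
        raise ValueError("tag counts must be non-negative")
    if ase == 0:
        return 0.0
    if afd == 0:
        return float(ase)
    return ase / afd


# (ASE, AFD) tag-count pairs covering each case.
for counts in [(10, 2), (4, 0), (0, 3), (0, 0)]:
    print(counts, sage_tag_score(*counts))
```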
Correlation analysis with SAGE data
In addition to the correlation experiments between ASE motifs and ASE
expression discussed in the text, we performed the same ROC curve analysis using SAGE
tag ratios as the ‘predictive quantity’ in place of the top-N motif score, despite their
experimental origin. We found the tag ratio to be a very good predictor of ASE
expression (as defined by the 52-gene set). The top four genes in this ordering are all
ASE-expressed. Among the top 1095 (5%) genes, 24 (50%) are ASE-expressed. The SAGE
tag ratio turned out to be a better predictor of ASE expression than any of the top-N
motif score based predictions.
In light of this we also tested the relationship between SAGE tag ratios and the
top-N motif score directly. Unlike the other analyses, however, this required first
interpreting each gene’s tag ratio as a discrete binary classification of ASE expression.
To achieve this we tried several different cutoff tag ratios ranging from 1 to 10, calling
genes above the cutoff ‘ASE-expressed’, and the rest ‘not ASE-expressed’. We did a
separate ROC analysis of each classification against each of the 10 top-N motif scores.
Unfortunately, none yielded a strong correlation. The best involved a SAGE tag ratio
cutoff of 7, which defined 55 ‘ASE-expressed’ genes. We did, however, again observe
the same trend: as N increased, the top-N motif score became a better
predictor of the SAGE tag ratio based classification of ASE expression. We speculate
that the weak correlation between SAGE tag ratios and top-N motif scores stems largely
from the considerable error introduced by forcing the SAGE tag ratio, a
continuous quantity, into a binary classification. Also, SAGE tag ratios and top-N motif
scores are only indirectly related, through actual ASE expression. Their correlation
is therefore expected to be weaker than either direct correlation.
Combining top-N motif ranking with SAGE tag ranking
The ASE motif scores and the SAGE tag score imply a ranking of all the genes in
C. elegans. The ranking is the integer position of the gene in the list sorted by that
criterion. We speculated that positive evidence (a high rank) indicates ASE expression
more reliably than negative evidence (a low rank) indicates its absence, and further that
this principle holds for both SAGE tag and top-N motif score
based evidence. To exploit this, we defined a ‘combined rank’ as the best rank attained
by a gene between its SAGE tag ratio score and its top-4 motif score. We performed ROC
analysis using this combined rank and found that it gives a modest improvement over the
SAGE tag ratio score alone, making it the best overall predictor of ASE expression.
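The combined rank can be sketched as follows. The sketch assumes that rank 1 is the highest score and that ties are broken by input order; the text does not specify tie handling, and the function names are hypothetical.

```python
def ranks(scores):
    """Map gene -> 1-based rank, highest score first (ties keep input order)."""
    ordered = sorted(scores, key=lambda gene: scores[gene], reverse=True)
    return {gene: position + 1 for position, gene in enumerate(ordered)}


def combined_rank(sage_scores, motif_scores):
    """Best (numerically smallest) rank a gene attains under either criterion."""
    by_sage = ranks(sage_scores)
    by_motif = ranks(motif_scores)
    return {gene: min(by_sage[gene], by_motif[gene]) for gene in sage_scores}


# Toy example with three genes; the values are illustrative, not from the data set.
sage = {"geneA": 5.0, "geneB": 1.0, "geneC": 3.0}
motif = {"geneA": 0.1, "geneB": 0.9, "geneC": 0.5}
print(combined_rank(sage, motif))
```

Taking the minimum means a gene needs strong support from only one line of evidence to rank highly, which matches the premise that positive evidence is the more reliable kind.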
Raw Data Format
Supplementary Table 1 is the complete set of data used in all analyses
presented in the main text, including the SAGE tag ratio correlation and combined rank
experiments. It has one line per C. elegans gene with the following columns
(explanation reproduced in the Excel file as well):
Gene: WormBase ID, e.g. ‘R03C1.3’
Common: common name of the gene, e.g. ‘cog-1’
ase_expressed: 1 if belonging to the 52 ASE-expressed gene set, 0 if not
ase_tag_count: SAGE ASE tag count
afd_tag_count: SAGE AFD tag count
site_info: list of all upstream ASE motifs up to 5000 bases, formatted as "offset:score:sequence", where offset is the number of bases upstream, score is the weight matrix score as defined above, and the sequence is given for the site on the strand that it is on
sage_tag_ratio: SAGE tag score as defined above
xvalues_sage_ratio: number of ‘ase_expressed = 0’ genes above this gene when sorted by sage_tag_ratio
yvalues_sage_ratio: number of ‘ase_expressed = 1’ genes above this gene when sorted by sage_tag_ratio
sage_motif_combined_rank: ordering based on the best rank a gene attained between score4 and sage_tag_ratio
xvalues_combined: number of ‘ase_expressed = 0’ genes sorted above this gene when sorted by sage_motif_combined_rank
yvalues_combined: number of ‘ase_expressed = 1’ genes sorted above this gene when sorted by sage_motif_combined_rank
scoreN: top-N motif score
xvaluesN: number of ‘ase_expressed = 0’ genes sorted above this gene when sorted by scoreN
yvaluesN: number of ‘ase_expressed = 1’ genes sorted above this gene when sorted by scoreN
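The xvalues and yvalues columns are cumulative counts along a sorted gene list, so they can be recomputed from the other columns. The sketch below assumes each row is a dict keyed by the column names above, that "above" means strictly earlier in the sorted order, and that ties keep their input order. The function name is hypothetical.

```python
def roc_counts(rows, sort_column, descending=True):
    """Return (Gene, xvalue, yvalue) per row in sorted order.

    xvalue: number of ase_expressed == 0 genes sorted above the gene.
    yvalue: number of ase_expressed == 1 genes sorted above the gene.
    Use descending=False for rank columns, where smaller is better.
    """
    ordered = sorted(rows, key=lambda row: row[sort_column], reverse=descending)
    negatives = positives = 0
    result = []
    for row in ordered:
        result.append((row["Gene"], negatives, positives))
        if row["ase_expressed"] == 1:
            positives += 1
        else:
            negatives += 1
    return result


# Toy rows with made-up identifiers and scores.
rows = [
    {"Gene": "g1", "ase_expressed": 1, "score4": 0.9},
    {"Gene": "g2", "ase_expressed": 0, "score4": 0.5},
    {"Gene": "g3", "ase_expressed": 1, "score4": 0.1},
]
print(roc_counts(rows, "score4"))
```

Plotting yvalue against xvalue for every gene traces the unnormalized ROC curve; dividing by the total number of positives and negatives gives the usual true and false positive rates.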
Data Visualization
We used R (R-project.org) and the contributed packages ‘ROCR’ and ‘lattice’,
available from the CRAN section of the website.
LITERATURE CITED
BRENNER, S., 1974 The genetics of Caenorhabditis elegans. Genetics 77: 71-94.
CHANG, S., R. J. JOHNSTON, JR. and O. HOBERT, 2003 A transcriptional regulatory
cascade that controls left/right asymmetry in chemosensory neurons of C.
elegans. Genes Dev 17: 2123-2137.
ETCHBERGER, J. F., A. LORCH, M. C. SLEUMER, R. ZAPF, S. J. JONES et al., 2007 The
molecular signature and cis-regulatory architecture of a C. elegans gustatory
neuron. Genes Dev 21: 1653-1674.
INOUE, T., M. WANG, T. O. RIRIE, J. S. FERNANDES and P. W. STERNBERG, 2005
Transcriptional network underlying Caenorhabditis elegans vulval development.
Proc Natl Acad Sci U S A.
JOHNSTON, R. J., and O. HOBERT, 2003 A microRNA controlling left/right neuronal
asymmetry in Caenorhabditis elegans. Nature 426: 845-849.
PALMER, R. E., T. INOUE, D. R. SHERWOOD, L. I. JIANG and P. W. STERNBERG, 2002
Caenorhabditis elegans cog-1 Locus Encodes GTX/Nkx6.1 Homeodomain
Proteins and Regulates Multiple Aspects of Reproductive System Development.
Dev Biol 252: 202-213.
SARIN, S., M. M. O'MEARA, E. B. FLOWERS, C. ANTONIO, R. J. POOLE et al., 2007 Genetic
Screens for Caenorhabditis elegans Mutants Defective in Left/Right Asymmetric
Neuronal Fate Specification. Genetics 176: 2109-2130.