The first wave of the zygotic transcription in mouse embryos is highly promiscuous
and uncoupled from splicing and 3’end processing.
SUPPLEMENTAL MATERIAL
Ken-ichiro Abe1,*, Ryoma Yamamoto1,*, Vedran Franke2,*, Minjun Cao1, Yutaka
Suzuki3, Masataka G. Suzuki1, Kristian Vlahovicek2,4,8, Petr Svoboda5,7,
Richard M. Schultz6,7 and Fugaku Aoki1,7
1 Department of Integrated Biosciences, Graduate School of Frontier Sciences, The
University of Tokyo, Kashiwa, Japan
2 Bioinformatics Group, Division of Biology, Faculty of Science, Zagreb University,
Zagreb, Croatia
3 Department of Medical Genome Science, Graduate School of Frontier Sciences, The
University of Tokyo, Kashiwa, Japan, The University of Tokyo, Tokyo, Japan.
4 Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo,
Norway
5 Institute of Molecular Genetics, Academy of Sciences of the Czech Republic, Prague,
Czech Republic
6 Department of Biology, University of Pennsylvania, Philadelphia, USA.
7 Address correspondence to:
Fugaku Aoki, Department of Integrated Biosciences, Graduate School of Frontier
Sciences, University of Tokyo, Room # 302, Seimei-Building, Kashiwa, Chiba
277-8562, Japan.
Tel: +81-471-36-3695; Fax: +81-471-36-3698; E-mail: aokif@k.u-tokyo.ac.jp
Petr Svoboda, Institute of Molecular Genetics, Academy of Sciences of the Czech
Republic, Prague, Czech Republic.
Tel: +420-241063147; E-mail: svobodap@img.cas.cz
Richard M. Schultz, Department of Biology, University of Pennsylvania, 433 South
University Avenue, Philadelphia, PA 19104-6018, USA.
Tel: 215-898-7869; E-mail: rschultz@sas.upenn.edu
8 Corresponding author for computational biology
*These authors contributed equally to this work.
Running title: genome activation
Key words: RNA-seq; preimplantation mouse embryo; transcription; gene expression;
pre-mRNA splicing.
SUPPLEMENTAL DATA
SUPPLEMENTAL EXPERIMENTAL PROCEDURES
Bioinformatic analyses
Mapping of Illumina RNA-seq reads on the mouse genome
RNA-seq libraries were sequenced on a Genome Analyzer IIx (Illumina). As the
first step, we filtered the reads and removed the adapters from both the 35-nt
single-end and 76-nt paired-end reads using the Trimmomatic software (doi:
10.1093/bioinformatics/btu170),
with the following parameters:
ILLUMINACLIP:TruSeq2-PE.fa:2:30:10 TRAILING:20 MINLEN:36
The filtered reads were mapped onto the mm9/NCBI37 version of the mouse
genome using the STAR mapper (doi: 10.1093/bioinformatics/bts635):
STAR --genomeDir mm9_GenomeIndex --readFilesIn $file1 $file2 --runThreadN 20
--genomeLoad LoadAndKeep --outFilterMultimapNmax 10 --outFileNamePrefix
$filename --outReadsUnmapped Fastx --outFilterMismatchNoverLmax 0.2
--outSAMstrandField intronMotif --sjdbScore 2
The genome index was constructed with the addition of the mm9 Ensembl
gene annotation, downloaded on 20.09.2013 from the Ensembl database:
STAR --runMode genomeGenerate --genomeDir ./ --genomeSAindexNbases 20
--genomeFastaFiles mm9.fa --runThreadN 10 --sjdbGTFfile Ensembl_Genes.gtf
--sjdbOverhang 99
Mapping of SOLiD RNA-seq reads on the mouse genome
We mapped 50SE SOLiD data (Park et al, 2013) with the TopHat aligner (version
1.3.2) (Trapnell et al, 2009) as in the original publication. The most stringent set of
parameters was used for mapping:
--initial-read-mismatches 3 --segment-mismatches 2 --segment-length 25
Mapping of RNA-seq reads on the rDNA sequence
A suffix array was constructed from the mouse rDNA fragment (NCBI:
BK000964). All libraries were mapped to the resulting index using the STAR aligner,
with the same parameters that were used for genome mapping.
HTS data normalization
Initial quantile normalization on replicates demonstrated data robustness
(data not shown). Since the total read counts used for mapping were similar, the
dynamic ranges of read counts did not vary significantly across experiments, which
permitted us to use CPM normalization for downstream analyses.
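The CPM (counts per million) scaling referred to above can be illustrated with a minimal sketch. The function name and the toy counts are illustrative assumptions and are not taken from the pipeline used in this study.

```python
def counts_per_million(counts):
    """Scale raw read counts so that the library sums to one million."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("library contains no mapped reads")
    return {feature: n * 1_000_000 / total for feature, n in counts.items()}


# Hypothetical toy library: three features, 1000 mapped reads in total.
library = {"geneA": 500, "geneB": 300, "geneC": 200}
cpm = counts_per_million(library)
```

Because CPM only divides by library size, it is appropriate when, as stated above, the libraries have similar depth and dynamic range.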
Read distribution analysis
All downstream bioinformatics analyses were done using the R software and
the Bioconductor framework (doi: 10.1371/journal.pcbi.1003118). For the analysis of
paired-end reads, we used only pairs that were properly paired during the mapping
(these are subsequently termed ‘reads’ throughout the manuscript).
The genomic distribution of reads was calculated by counting those overlapping
functional elements in the following order of precedence: Exon -> rDNA -> Repeats ->
Others. Repeat and rDNA locations were taken from the UCSC RepeatMasker
database, while the Exon set was downloaded from Ensembl.
For each multiexonic gene, we selected the transcript with the highest number of exons
as the representative.
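The precedence rule and the choice of a representative transcript can be sketched as follows. This is a simplified illustration only: the actual analysis was done in R/Bioconductor, and the interval representation (half-open coordinates), function names and toy annotation below are assumptions.

```python
PRECEDENCE = ("Exon", "rDNA", "Repeats")  # anything else is counted as "Others"


def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals [start, end) share at least one base."""
    return a_start < b_end and b_start < a_end


def classify_read(read, annotation):
    """Return the first category, in precedence order, that the read overlaps."""
    chrom, start, end = read
    for category in PRECEDENCE:
        for f_chrom, f_start, f_end in annotation.get(category, []):
            if f_chrom == chrom and overlaps(start, end, f_start, f_end):
                return category
    return "Others"


def representative_transcript(transcripts):
    """Pick the transcript with the most exons (ties: first encountered)."""
    return max(transcripts, key=lambda t: len(t["exons"]))


# Hypothetical toy annotation and reads.
annotation = {
    "Exon": [("chr1", 100, 200)],
    "rDNA": [("chr1", 500, 600)],
    "Repeats": [("chr1", 150, 400)],
}
reads = [("chr1", 180, 215), ("chr1", 300, 335), ("chr1", 550, 585), ("chr2", 10, 45)]
labels = [classify_read(r, annotation) for r in reads]

gene = [
    {"id": "tx1", "exons": [(100, 200)]},
    {"id": "tx2", "exons": [(100, 200), (300, 400), (500, 600)]},
]
chosen = representative_transcript(gene)
```

The first read overlaps both an exon and a repeat and is counted as Exon, which is the point of applying the categories in a fixed order.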
UCSC track visualization
Data were visualized in the UCSC browser by constructing bigWig tracks
using the Bedtools software (doi: 10.1093/bioinformatics/btq033). The
oocyte-subtracted tracks were made by marking all regions in the MII sample that
contained more than 5 reads and removing those regions from the other samples.
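The oocyte subtraction can be sketched as a simple masking step. How regions were delimited is not specified here, so the sketch assumes coverage has already been summarised per fixed region keyed by (chromosome, bin index); that keying and all names are hypothetical.

```python
MII_THRESHOLD = 5  # regions with more than 5 MII reads are masked (from the text)


def oocyte_subtract(sample_cov, mii_cov, threshold=MII_THRESHOLD):
    """Drop every region of sample_cov whose MII coverage exceeds threshold."""
    masked = {region for region, n in mii_cov.items() if n > threshold}
    return {region: n for region, n in sample_cov.items() if region not in masked}


# Hypothetical per-region read counts.
mii = {("chr1", 0): 12, ("chr1", 1): 5, ("chr1", 2): 0}
one_cell = {("chr1", 0): 30, ("chr1", 1): 7, ("chr1", 2): 4, ("chr1", 3): 2}
subtracted = oocyte_subtract(one_cell, mii)
```

A region with exactly 5 MII reads is retained, since only regions with more than 5 reads are marked.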
Hierarchical clustering
Unsupervised hierarchical clustering was performed on HTS datasets from
different stages of preimplantation development, using log2-transformed RPKM
values that were calculated by counting the number of paired reads falling
within the given transcript models. As the distance measure, we used 1 - correlation
between samples.
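The distance measure can be illustrated with a short sketch. The type of correlation, the handling of zero RPKM values and the linkage method are not specified above; the sketch assumes Pearson correlation and a pseudocount of 1 before the log2 transform, and stops at the distance matrix. All names and values are illustrative.

```python
import math


def rpkm(count, length_bp, total_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return count * 1e9 / (length_bp * total_reads)


def pearson(x, y):
    """Pearson correlation coefficient of two equally long vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_distance(profiles, pseudocount=1.0):
    """Pairwise 1 - correlation between samples on log2(RPKM + pseudocount)."""
    logged = {
        sample: [math.log2(v + pseudocount) for v in values]
        for sample, values in profiles.items()
    }
    names = sorted(logged)
    return {(a, b): 1.0 - pearson(logged[a], logged[b]) for a in names for b in names}


# Hypothetical RPKM profiles over three transcripts.
profiles = {"1c": [1.0, 3.0, 7.0], "2c": [1.0, 3.0, 7.0], "Bl": [7.0, 3.0, 1.0]}
dist = correlation_distance(profiles)
```

Identical profiles have distance 0 and perfectly anti-correlated profiles have distance 2; the resulting matrix can be passed to any agglomerative clustering routine.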
Transcription termination analysis
Reads mapped to transcripts were filtered for expression, and only the
transcripts with >1 read per kb of exon length (RPK) were selected, separately for
each stage. All loci with transcripts active in MII (any signal in reads per total exon
length) were excluded from further analysis. Further, for each transcript, the 10-kb
region downstream of the annotated transcription end site was subdivided into 1-kb
bins, and reads mapping to each bin were counted. Each bin count was then divided by
the exonic RPK of the respective transcript, and all values exceeding 1 were removed
from subsequent analyses (in order to remove transcription signals from adjacent loci
and intergenic transposable elements). The final numbers of analyzed transcripts per
stage were as follows: MII: 0; 1c: 1011; 2c: 1788; 4c: 1373; Mo: 1182; Bl: 1375. Median
bin/exon ratios were then calculated and plotted for each bin and cell stage (Fig. 6B).
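The filtering and ratio calculation described above can be sketched as follows. The record layout (dictionary keys such as exon_reads, bin_counts and mii_reads) is a hypothetical representation chosen for illustration, and the toy example uses two bins rather than ten.

```python
from statistics import median


def downstream_profile(transcripts, n_bins=10, min_rpk=1.0):
    """Median bin/exon ratio for each 1-kb bin downstream of the end site."""
    per_bin = [[] for _ in range(n_bins)]
    for t in transcripts:
        if t.get("mii_reads", 0) > 0:
            continue  # locus active in MII: excluded
        exon_rpk = t["exon_reads"] / (t["exon_length"] / 1000)
        if exon_rpk <= min_rpk:
            continue  # keep only transcripts with >1 RPK
        for i, count in enumerate(t["bin_counts"][:n_bins]):
            ratio = count / exon_rpk  # bins are 1 kb, so count is already per kb
            if ratio <= 1:  # values exceeding 1 are removed
                per_bin[i].append(ratio)
    return [median(values) if values else None for values in per_bin]


# Hypothetical transcripts with two downstream bins each.
toy = [
    {"exon_reads": 20, "exon_length": 2000, "bin_counts": [5, 2]},
    {"exon_reads": 40, "exon_length": 1000, "bin_counts": [10, 80]},
    {"exon_reads": 1, "exon_length": 2000, "bin_counts": [3, 3]},
    {"exon_reads": 50, "exon_length": 1000, "bin_counts": [1, 1], "mii_reads": 4},
]
profile = downstream_profile(toy, n_bins=2)
```

In the toy data, the third transcript fails the expression filter, the fourth is excluded as active in MII, and the second bin of the second transcript is dropped because its ratio exceeds 1.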
Splicing analysis
We used two different approaches to demonstrate aberrant splicing at the 1-cell
stage. First, reads mapped to known gene model loci on the mm9 genome were
quantified for each transcript, divided into reads mapping to introns and to exons,
normalized to the respective intron and exon lengths (per kilobase), and filtered to
retain transcripts with at least one read per 1 kb (RPK) of exonic sequence.
Transcripts with no mapped reads at the MII stage were selected for subsequent
analysis in order to prevent maternal RNA from interfering with the signal. The
intron value was then divided by the exon value to give the intron-to-exon read count
ratio. All ratios above 1 were discarded to filter out the transcription signal of
unannotated transposable elements in intronic regions. The total numbers of loci
analyzed per stage were as follows: MII: 0; 1c: 862;
1c+DRB: 119; 2c: 1535; 2c+APH: 2081; 4c: 1178; Mo: 994; Bl: 1168. The distribution
of ratio values is presented as a violin plot, scaled to the same distribution width
(Fig. 6C).
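The first approach reduces to a per-transcript calculation, sketched below. The dictionary keys and transcript names are hypothetical, and single-exon transcripts (intron length 0) are simply skipped in this sketch.

```python
def intron_exon_ratios(transcripts, min_exon_rpk=1.0, max_ratio=1.0):
    """Intron/exon RPK ratio per transcript after the filters described above."""
    ratios = {}
    for name, t in transcripts.items():
        if t.get("mii_reads", 0) > 0:
            continue  # possible maternal RNA: excluded
        if t["intron_length"] <= 0:
            continue  # no intron to measure
        exon_rpk = t["exon_reads"] / (t["exon_length"] / 1000)
        if exon_rpk < min_exon_rpk:
            continue  # requires at least one read per kb of exon
        intron_rpk = t["intron_reads"] / (t["intron_length"] / 1000)
        ratio = intron_rpk / exon_rpk
        if ratio <= max_ratio:  # ratios above 1 are discarded
            ratios[name] = ratio
    return ratios


# Hypothetical transcripts.
toy = {
    "a": {"exon_reads": 30, "exon_length": 3000, "intron_reads": 40, "intron_length": 10000},
    "b": {"exon_reads": 10, "exon_length": 1000, "intron_reads": 200, "intron_length": 10000},
    "c": {"exon_reads": 30, "exon_length": 3000, "intron_reads": 5, "intron_length": 10000,
          "mii_reads": 3},
    "d": {"exon_reads": 1, "exon_length": 2000, "intron_reads": 1, "intron_length": 10000},
}
ratios = intron_exon_ratios(toy)
```

Only transcript "a" survives all three filters in this toy set; the surviving ratios are the values that would be summarised per stage as a distribution.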
Second, from the total set of mapped paired-end reads, we removed the read
pairs in which both mates map to the same element of the transcript (i.e. both map
within an exon or within an intron). The remaining read pairs were divided into two
groups: one (termed ‘unspliced’) in which the mates map to different transcript
element types (i.e. intron/exon, including the cases where one read of a pair spans
the splice junction and maps to both an exon and the adjacent intron); and the other
(termed ‘spliced’) in which the mates map to two different exons (including cases
where one read spans the spliced exon boundary). The ratio of spliced to unspliced
read pairs was then calculated for each cell stage (Fig. 6D).
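The pair classification can be sketched as below. Each mate is represented by the list of transcript elements it touches (for example ["exon1", "intron1"] for a read spanning an exon-intron junction); this representation and the element naming are assumptions made for illustration. Pairs touching only introns are treated as uninformative here, which is also an assumption.

```python
def classify_pair(mate1, mate2):
    """Label a read pair as 'spliced', 'unspliced', or None (uninformative)."""
    elements = set(mate1) | set(mate2)
    has_intron = any(e.startswith("intron") for e in elements)
    exons = {e for e in elements if e.startswith("exon")}
    if has_intron and exons:
        return "unspliced"  # exon and intron sequence in the same pair
    if not has_intron and len(exons) >= 2:
        return "spliced"  # two different exons joined within the pair
    return None  # both mates within one element


def spliced_to_unspliced(pairs):
    """Ratio of spliced to unspliced pairs; None if no unspliced pair exists."""
    labels = [classify_pair(m1, m2) for m1, m2 in pairs]
    unspliced = labels.count("unspliced")
    if unspliced == 0:
        return None
    return labels.count("spliced") / unspliced


# Hypothetical read pairs.
pairs = [
    (["exon1"], ["exon1"]),
    (["exon1"], ["intron1"]),
    (["exon1", "intron1"], ["intron1"]),
    (["exon1"], ["exon2"]),
    (["exon1", "exon2"], ["exon2"]),
    (["intron1"], ["intron1"]),
    (["exon2"], ["exon3"]),
]
ratio = spliced_to_unspliced(pairs)
```

In the toy set, three pairs are spliced, two are unspliced and two are discarded, giving a ratio of 1.5.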
Plasmid construction
To generate the pGL3-Tktl1 and pGL3-Zp3 vectors, which contained the Tktl1 and
Zp3 promoters, respectively, the 1,800- and 480-bp regions upstream of the Tktl1 and
Zp3 genes were amplified using PrimeSTAR Max DNA Polymerase (Takara) and
HindIII-anchored primers (Table S4). PCR conditions were as follows: initial heat
treatment at 94 °C for 1 min followed by 32 cycles of 94 °C for 10 s, 63 or 65 °C for
5 s, and 72 °C for 10 s for the amplification of the Tktl1 or Zp3 promoter,
respectively. The PCR products were cloned into the pGL3-Basic vector (Promega,
Madison, WI) using the In-Fusion Cloning advantage kit (Clontech, Shiga, Japan).
To produce pEluc-test vectors (Toyobo, Tokyo) containing the 76-bp regions
upstream of the TSSs identified in the pGL3-Basic vector (see Figure S4), these
regions were amplified using ExTaq HS (Takara) with SpeI- and EcoRI-anchored
primers (Supplemental Table S6). The 76-bp length was chosen because we
previously identified a minimal 76-bp fragment upstream of the TSS of the Tktl1 gene
that showed significant transcriptional activity in a luciferase reporter gene assay
using 1-cell embryos (Hamamoto et al, 2014). PCR conditions were as follows: initial
heat treatment at 94 °C for 2 min followed by 32 cycles of 94 °C for 20 s, 60 °C for
30 s, and 72 °C for 15 s. The PCR products were cloned into the pEluc-test vector
(Toyobo, Tokyo, Japan) using the In-Fusion Cloning advantage kit (Clontech). All
constructs were verified by DNA sequencing using a 3130xl Genetic Analyzer
(Applied Biosystems, Carlsbad, CA).
Rapid amplification of 5’ cDNA ends (5’ RACE)
Total RNA was extracted using Isogen (Nippon Gene) from 100 1-cell
embryos that had been injected with the pGL3 vector, and was then treated with
TURBO DNase (Ambion, Austin, TX, USA) according to the manufacturer’s instructions.
Next, 5’ rapid amplification of cDNA ends (5’ RACE) was performed using the
GeneRacer Kit (cat#L1500-01; Invitrogen, Carlsbad, CA) according to the
manufacturer’s recommendations. After ligation to the adaptor, total RNA was reverse
transcribed using the PrimeScript RT-PCR Kit (Takara) with the inner primer specific
for firefly luciferase, and then amplified by PCR using LA Taq (Takara) and primer
sets for firefly luciferase and the adaptor under the following touchdown PCR program:
initial denaturation at 94 °C for 2 min, followed by 5 cycles of denaturation at 94 °C
for 30 s, annealing at 72 °C for 2 min and extension at 68 °C for 2 min; 5 cycles of
denaturation at 94 °C for 30 s, annealing at 70 °C for 2 min and extension at 68 °C for
2 min; and 25 cycles of denaturation at 94 °C for 30 s, annealing at 62 °C for 30 s and
extension at 68 °C for 2 min. For reverse transcription and PCR amplification,
primers for the adaptor and firefly luciferase were
5’-CGACTGGAGCACGAGGACACTGA-3’ and
5’-CGCGCCCAACACCGGCATAA-3’, respectively. The amplification products
were resolved in 1.5% agarose gels, and the 500- and 900-bp PCR products were
purified using the Wizard SV Gel and PCR Clean Up System (Promega). The
products were then cloned into the pCR II-TOPO vector (Promega). Several
independent clones were sequenced using a 3130xl Genetic Analyzer (Applied
Biosystems, Carlsbad, CA).
SUPPLEMENTAL FIGURES
Figure S1. Validation of the RNA sequence data for eggs and embryos. (A)
Comparison of sequencing replicates and biological replicates. The upper two panels
show the correlation of two sequencing replicates using the same total RNA samples
from MII eggs (left) and 1-cell embryos (right). The first sequencing was 35-nt
single-end (35SE) sequencing and the second was 76-nt paired-end (76PE) sequencing,
of which only the first 35 nucleotides of one of the paired reads were used for the
analysis. The lower two panels show the correlation of two replicates using different
preparations of total RNA from MII eggs (left) and 1-cell embryos (right). Again, the
first sequencing was 35SE and the second was 76PE, of which only the first 35
nucleotides of one of the paired reads were used for the analysis. The x- and y-axes
show log2 RPKM (reads per kilobase per million) for each gene. (B) Comparison of
76PE Illumina sequencing with the 50SE SOLiD sequencing data of Park et al (2013).
The x- and y-axes represent log10(number of reads) in 50-kb tiling windows on
chromosome 1 for MII eggs (left panel) and 1-cell embryos (right panel). Chromosome 1
was selected at random to reduce the number of displayed 50-kb windows and so offer
better visualization of the data. (C) Comparison of mRNA expression dynamics
identified by 76PE Illumina sequencing with the 50SE SOLiD sequencing data of Park
et al (2013). The x- and y-axes represent the log2 mRNA fold-change according to our
data (x-axis) and Park et al (y-axis). (D) Composition of 76PE libraries. Shown is the
proportion of reads matching rRNA-derived transcripts (rRNA), transcripts produced
from repeats identified by RepeatMasker (repeat), annotated mRNAs (mRNA), and
other sequences (other).
Figure S2. rRNA expression during ZGA. (A) Expression of rRNA during minor
ZGA. Grey bars and the left axis represent RPKMs mapping to the 5’ETS RNA
sequence, while black bars and the right axis represent RPKMs mapping to mature
18S rRNA. (B) Expression of rRNA during early development. Grey bars and the left
axis represent RPKMs mapping to the 5’ETS RNA sequence, while black bars and the
right axis represent RPKMs mapping to mature 18S rRNA. Note that the first clear
increase in 5’ETS RNA appears between the 1-cell and 2-cell stages, while the largest
surge in 5’ETS RNA abundance appears between the 2-cell and 4-cell stages.
Figure S3. Features of 1-cell transcription. (A) A higher density of low-CPM reads is
reproducibly observed in the 1-cell deep sequencing datasets. Shown is a UCSC browser
screenshot of a 5-Mb segment of chromosome 3. The inset provides a higher
magnification of the indicated region. The bottom track was produced by merging all
three 1-cell sequencing sets. Note that merging all three 1-cell datasets does not result
in higher peaks but in an increased density of low-CPM reads. Shown are reads mapped
to the genome; the vertical scale was trimmed at 20 reads, and trimming is indicated by
horizontal dashed lines. (B) Comparison of Y-chromosome-derived reads in 76PE
Illumina and 50SE SOLiD data (Park et al, 2013). A UCSC browser screenshot of
chromosome Y (nucleotides 1-3,000,000) shows 76PE and 50SE raw data mapped to
the genome. The vertical scale was trimmed at 10 reads for 76PE data and 40 reads
for 50SE data since the 50SE data are several-fold deeper; trimming is indicated by
horizontal dashed lines. The residual signal on chromosome Y in MII eggs and
1-cell embryos treated with DRB is an artifact caused by common
retrotransposon-derived sequences (mainly MT-derived).
Figure S4. MuERV-L retrotransposons yield a distinct asymmetric signature in HTS
data from 1-cell and 2-cell embryos. (A) HTS reveals different patterns of
retrotransposon expression during early development. The y-axis shows the relative
abundance of reads mapping to each specific retrotransposon. The value in 1-cell
embryos (red column) was set to one. (B) An example of transcriptional activity of
two MuERV-L elements located in an intergenic region in chromosome 1. A UCSC
browser screenshot shows 76-nt paired-end read data mapped to the genome, the
vertical scale represented by grey columns is 100 reads (there was no trimming). Note
two distinct peaks in the 2-cell embryos (reflecting high MuERV-L expression and
3’end processing) and apparent extension of transcription far into the genomic flank.
In contrast, the 1-cell embryo shows a low level of transcription within the same
genomic flank but not a distinct peak over MuERV-L elements (see also Fig. S2C).
(C) HTS data from the two intergenic regions analyzed by RT-PCR. Shown are reads
mapped to the genome, vertical scale was trimmed at 20 reads; trimming is indicated
by horizontal dashed lines. Position of PCR amplicons #1 and #2 is indicated by
dashed lines with arrowheads.
Figure S5. (A) Opportunistic transcription lacking defined core promoter elements in
1-cell embryos. pGL3-Basic vector was injected into the nucleus of growing oocytes,
2-cell embryos, and the male pronucleus of 1-cell embryos, and transcription from
TSS#1 and #2 (see Fig. 4) was examined using RT-PCR. (B) Analysis of
transcriptional regulatory elements in the pGL3-Basic vector. Transcription start sites
were identified using 5’ RACE. Transcriptional regulatory elements (GC-box, BRE,
TATA-box, initiator and DPE) located near the TSSs identified by 5’ RACE were
mapped in the pGL3-Basic vector using the definitions described in previous reports
(Smale & Kadonaga, 2003). We identified a TATA-box upstream of TSS#1s, but it
was distant from TSS#1-2. Although functional TATA-boxes are typically located ~30
bp upstream of the TSS (Butler & Kadonaga, 2002), the TATA-box was found 42 and
79 bp upstream of TSS#1-1 and TSS#1-2, respectively. There was no TATA-box in
the upstream region of TSS#2s. We also identified GC-boxes in the regions upstream
of four TSSs, but they were located closer to the TSSs than the expected position (~60
bp upstream of TSS) in TSS#1-1 and TSS#2-2. Inr was not identified around any TSS.
In addition, no DPE was found in the regions downstream of any TSS (except for
TSS#2-5, 30 and 38 bp downstream of which two potential DPEs were present).
Figure S6. Analysis of genes transcribed during the 1-cell stage. (A) Expression
levels in different tissues of the 4039 genes that we identified as transcribed in 1-cell
embryos. Expression of the 4039 genes was analyzed for tissue expression data
selected from Gnf1M microarray datasets from BioGPS (Su et al, 2004). The
greyscale indicates approximate mRNA expression level (log2[BioGPS expression
value]) in each tissue. (B) Hierarchical clustering of 76-nt paired-end datasets from
different stages of preimplantation development, performed on Spearman correlations
between log2 RPKM values for repeat-masked intron-mapping reads. (C) Examples of
two genes that did not pass our filtering criteria for the list of 4039 genes but appear
to be transcribed at the 1-cell stage. Both genes were previously found to be highly
sensitive to α-amanitin at the 2-cell stage (Zeng & Schultz, 2005).
Figure S7. Relaxed post-transcriptional processing in 1-cell embryos. (A) Deficient
splicing in the parthenogenetically activated embryos. MII eggs (MII) and 1-cell
embryos, which had been parthenogenetically activated (Par) or fertilized (Fer), were
subjected to RT-PCR using the primer pairs across the splicing junctions of Klf5, Nid2,
Mxra7 and Sord. (B) Concordance of the 76PE dataset with previously published
HTS data (50SE; Park et al, 2013). UCSC browser screenshots show raw HTS results
for MII oocytes and 1-cell embryos from Park et al (2013) and for MII, 1-cell,
1-cell+DRB and 2-cell samples from our 76PE data. The dashed line represents 20
reads; the 76PE libraries contain ~4-7 × 10^6 reads that do not map to rRNA,
whereas the depth of the other samples is ~10x larger.
Figure S8. Inefficient processing of protein coding transcripts at the 1-cell stage
in 50SE data (Park et al, 2013).
(A) Example of a gene transcribed at 1-cell stage. Shown are reads from 1-cell and
2-cell-derived 50SE sequenced libraries mapping to the Klf5 gene region. Their
distribution indicates inefficient splicing and 3’end processing in 1-cell embryos. In
contrast to the 2-cell embryo, the 1-cell embryo does not show any enrichment of
exon-derived reads (exons are depicted as black rectangles) and no apparent
transcriptional termination at the 3’end. A detailed analysis of all reads mapping to the
Klf5 gene identified a single read derived from a spliced Klf5 transcript. Shown below
is the profile of Zp3, an abundantly expressed oocyte-specific gene with very well
defined exon-intron boundaries, which are also retained at the 1-cell stage. The
vertical scale was trimmed at 2.5 CPM; trimming is indicated by horizontal dashed
lines. The blue scale bars represent 10kb. (B) Transcription termination analysis.
Lines represent median ratios of read counts per kb (RPK) of reads downstream of
transcription termination site to exons for gene sets transcribed in 1-cell embryos and
subsequent stages but not in MII eggs. Downstream regions for genes with at least
one RPK in exons are divided into 1-kb slices, and reads in each slice are counted and
divided by the RPK value of the respective exon (point 1, 100%). The 1-cell stage
(including 1-cell parthenogenotes) shows a higher downstream-to-exon read ratio,
indicating the extension of transcription past the polyA site. (C) Violin plot
distributions of intron/exon read count ratios per cell stage for genes not transcribed in
MII eggs. Intron and exon read counts were normalized to 1-kb length (RPK) and
divided to obtain the read ratio for each region transcribed at the 1-cell stage or later.
The 1-cell stage (including 1-cell parthenogenotes) shows a shift towards higher
intron/exon ratios indicating that a larger proportion of transcripts contain unspliced
intronic regions, compared to the later stages. The MII stage is displayed as control
and contains no values.
SUPPLEMENTAL TABLES
Table S1 (Table S1.xlsx)
Overview of all deep-sequenced libraries presented in the manuscript
Table S2 (Table S2.xlsx)
Genes transcribed in 1-cell embryos – selection based on exon-mapping reads.
Table S3 (Table S3.xlsx)
Genes transcribed in 1-cell embryos – selection based on intron-mapping reads.
Table S4 (Table S4.xlsx)
Primers and PCR conditions
SUPPLEMENTAL REFERENCES
Butler JE, Kadonaga JT (2002) The RNA polymerase II core promoter: a key
component in the regulation of gene expression. Genes Dev 16: 2583-2592
Hamamoto G, Suzuki T, Suzuki MG, Aoki F (2014) Regulation of transketolase like 1
gene expression in the murine one-cell stage embryos. PLoS One 9
Park S-J, Komata M, Inoue F, Yamada K, Nakai K, Ohsugi M, Shirahige K (2013)
Inferring the choreography of parental genomes during fertilization from
ultralarge-scale whole-transcriptome analysis. Genes Dev 27: 2736-2748
Smale ST, Kadonaga JT (2003) The RNA polymerase II core promoter. Annu Rev
Biochem 72: 449-479
Su AI, Wiltshire T, Batalov S, Lapp H, Ching KA, Block D, Zhang J, Soden R,
Hayakawa M, Kreiman G, Cooke MP, Walker JR, Hogenesch JB (2004) A gene atlas
of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci U S A
101: 6062-6067
Trapnell C, Pachter L, Salzberg SL (2009) TopHat: discovering splice junctions with
RNA-Seq. Bioinformatics 25: 1105-1111
Zeng F, Schultz RM (2005) RNA transcript profiling during zygotic gene activation
in the preimplantation mouse embryo. Dev Biol 283: 40-57