The first wave of the zygotic transcription in mouse embryos is highly promiscuous and uncoupled from splicing and 3’end processing.

SUPPLEMENTAL MATERIAL

Ken-ichro Abe1,*, Ryoma Yamamoto1,*, Vedran Franke2,*, Minjun Cao1, Yutaka Suzuki3, Masataka G. Suzuki1, Kristian Vlahovicek2,4,8, Petr Svoboda5,7, Richard M. Schultz6,7 and Fugaku Aoki1,7

1 Department of Integrated Biosciences, Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa, Japan
2 Bioinformatics Group, Division of Biology, Faculty of Science, Zagreb University, Zagreb, Croatia
3 Department of Medical Genome Science, Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa, Japan, The University of Tokyo, Tokyo, Japan.
4 Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway
5 Institute of Molecular Genetics, Academy of Sciences of the Czech Republic, Prague, Czech Republic
6 Department of Biology, University of Pennsylvania, Philadelphia, USA.
7 Address correspondence to:
Fugaku Aoki, Department of Integrated Biosciences, Graduate School of Frontier Sciences, University of Tokyo, Room # 302, Seimei-Building, Kashiwa, Chiba 277-8562, Japan. Tel: +81-471-36-3695; Fax: +81-471-36-3698; E-mail: aokif@k.u-tokyo.ac.jp
Petr Svoboda, Institute of Molecular Genetics, Academy of Sciences of the Czech Republic, Prague, Czech Republic. Tel: +420-241063147; E-mail: svobodap@img.cas.cz
Richard M. Schultz, Department of Biology, University of Pennsylvania, 433 South University Avenue, Philadelphia, PA 19104-6018, USA. Tel: 215-898-7869; E-mail: rschultz@sas.upenn.edu
8 Corresponding author for computational biology
* These authors contributed equally to this work.

Running title: genome activation

Key words: RNA-seq; preimplantation mouse embryo; transcription; gene expression; pre-mRNA splicing.
SUPPLEMENTAL DATA

SUPPLEMENTAL EXPERIMENTAL PROCEDURES

Bioinformatic analyses

Mapping of Illumina RNA-seq reads on the mouse genome
RNA-seq libraries were sequenced on a Genome Analyzer IIx (Illumina). As the first step, we filtered the reads and removed the adapters from both the 35-nt single-end and 76-nt paired-end reads using the Trimmomatic software (doi: 10.1093/bioinformatics/btu170), with the following parameters: ILLUMINACLIP:TruSeq2-PE.fa:2:30:10 TRAILING:20 MINLEN:36. The filtered reads were mapped onto the mm9/NCBI37 version of the mouse genome using the STAR mapper (doi: 10.1093/bioinformatics/bts635):

    STAR --genomeDir mm9_GenomeIndex --readFilesIn $file1 $file2 --runThreadN 20 --genomeLoad LoadAndKeep --outFilterMultimapNmax 10 --outFileNamePrefix $filename --outReadsUnmapped Fastx --outFilterMismatchNoverLmax 0.2 --outSAMstrandField intronMotif --sjdbScore 2

The genome index was constructed with the addition of the mm9 Ensembl gene annotation, downloaded from the Ensembl database on 20.09.2013:

    STAR --runMode genomeGenerate --genomeDir ./ --genomeSAindexNbases 20 --genomeFastaFiles mm9.fa --runThreadN 10 --sjdbGTFfile Ensembl_Genes.gtf --sjdbOverhang 99

Mapping of SOLiD RNA-seq reads on the mouse genome
We mapped the 50-nt single-end (50SE) SOLiD data (Park et al, 2013) with the Tophat aligner (version 1.3.2) (Trapnell et al, 2009) as in the original publication. The most stringent set of parameters was used for mapping:

    --initial-read-mismatches 3 --segment-mismatches 2 --segment-length 25

Mapping of RNA-seq reads on the rDNA sequence
A suffix array was constructed from the mouse rDNA fragment (NCBI: BK000964). All libraries were mapped to the resulting index using the STAR aligner, with the same parameters that were used for genome mapping.

HTS data normalization
Initial quantile normalization on replicates demonstrated data robustness (data not shown).
Since the total read counts used for mapping were similar, the dynamic ranges of read counts did not vary significantly across experiments, which permitted us to use CPM (counts per million) normalization for downstream analyses.

Read distribution analysis
All downstream bioinformatics analyses were done using the R software and the Bioconductor framework (doi: 10.1371/journal.pcbi.1003118). For the analysis of paired-end reads, we used only pairs that were properly paired during the mapping (these are subsequently termed ‘reads’ throughout the manuscript). The genomic distribution of reads was calculated by counting reads overlapping functional elements, assigning each read to the first matching class in the following order of precedence: Exon -> rDNA -> Repeats -> Others. Repeat and rDNA locations were taken from the UCSC RepeatMasker database, while the exon set was downloaded from Ensembl. For each multiexonic gene, we selected the transcript with the highest number of exons as the representative.

UCSC track visualization
Data were visualized in the UCSC browser by constructing bigWig tracks using the Bedtools software (doi: 10.1093/bioinformatics/btq033). The oocyte-subtracted tracks were made by marking all regions in the MII sample that contained more than 5 reads and removing those regions from the other samples.

Hierarchical clustering
Unsupervised clustering was performed on HTS datasets from different stages of preimplantation development. The hierarchical clustering was performed on log2 RPKM values, which were calculated by counting the number of paired reads that fall within the given transcript models. As the distance measure, we used 1 - correlation between samples.

Transcription termination analysis
Reads mapped to transcripts were filtered for expression, and only the transcripts with >1 read per kb of exon length (RPK) were selected, separately for each stage. All loci with transcripts active in MII (any signal in reads per total exon length) were excluded from further analysis.
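The expression filter described in the two preceding sentences can be illustrated with a short sketch. This is not the authors' code (their analyses were done in R/Bioconductor); the function names, the dictionary inputs and the transcript-level treatment of "active in MII" are assumptions made purely for illustration.

```python
from typing import Dict, Set


def rpk(read_count: int, exon_length_bp: int) -> float:
    """Reads per kilobase of total exon length."""
    return read_count / (exon_length_bp / 1000.0)


def select_transcripts(
    stage_counts: Dict[str, int],   # hypothetical: exon-mapped read count per transcript at one stage
    mii_counts: Dict[str, int],     # hypothetical: exon-mapped read count per transcript in MII eggs
    exon_lengths: Dict[str, int],   # hypothetical: total exon length (bp) per transcript
    min_rpk: float = 1.0,
) -> Set[str]:
    """Keep transcripts with RPK > min_rpk at this stage and no signal at all in MII.

    Assumption: exclusion is applied per transcript; the text excludes whole loci.
    """
    selected = set()
    for tx, length in exon_lengths.items():
        if mii_counts.get(tx, 0) > 0:
            continue  # any MII signal excludes the transcript
        if rpk(stage_counts.get(tx, 0), length) > min_rpk:
            selected.add(tx)
    return selected


if __name__ == "__main__":
    lengths = {"txA": 2000, "txB": 1000, "txC": 500}
    one_cell = {"txA": 5, "txB": 1, "txC": 3}
    mii = {"txC": 1}
    print(select_transcripts(one_cell, mii, lengths))
```

In this toy input, txB sits exactly at 1 RPK and is therefore not retained, because the threshold is strict (>1), and txC is dropped because it has signal in MII.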
Further, for each transcript, the 10-kb region downstream of the annotated transcription end site was subdivided into 1-kb bins, and the reads mapping to each bin were counted. Each bin count was then divided by the exonic RPK (reads per kb) of the respective transcript, and all values exceeding 1 were removed from subsequent analyses (in order to remove transcription signals from adjacent loci and intergenic transposable elements). The final analyzed transcript counts per stage are the following: MII: 0; 1c: 1011; 2c: 1788; 4c: 1373; Mo: 1182; Bl: 1375. Median bin/exon ratios were then calculated and plotted for each bin and cell stage (Fig. 6B).

Splicing analysis
We used two different approaches to demonstrate aberrant splicing at the 1-cell stage.

First, reads mapped to known gene model loci on the mm9 genome were quantified per transcript, divided into intron-mapping and exon-mapping reads, normalized to the respective intron and exon lengths (per kilobase), and filtered to retain transcripts with at least one read per kb (RPK) of exonic sequence. Transcripts with no mapped reads in the MII stage were selected for subsequent analysis in order to avoid maternal RNA interfering with the signal. The normalized intron count was then divided by the normalized exon count to give the intron-to-exon read count ratio. All ratios above 1 were discarded to filter out the transcription signal of unannotated transposable elements in intronic regions. The total loci analyzed per stage are the following: MII: 0; 1c: 862; 1c+DRB: 119; 2c: 1535; 2c+APH: 2081; 4c: 1178; Mo: 994; Bl: 1168. The distribution of ratio values is presented as a violin plot, scaled to the same distribution width (Fig. 6C).

Second, out of the total mapped paired-end read set, we removed the read pairs in which both mates map to the same element of the transcript (i.e. both map within the same exon or intron). The remaining read pairs were divided into two groups: one (termed ‘unspliced’) in which the mates map to different transcript element types (i.e. intron/exon, including the cases where one read of a pair spans the splice junction and maps to both the exon and the adjacent intron); and the other (termed ‘spliced’) in which the mates map to two different exons (including cases where one read spans the spliced exon boundary). The ratio of spliced to unspliced read pairs was then calculated for each cell stage (Fig. 6D).

Plasmid construction
To generate the pGL3-Tktl1 and pGL3-Zp3 vectors, which contained the Tktl1 and Zp3 promoters, respectively, the 1,800- and 480-bp regions upstream of the Tktl1 and Zp3 genes were amplified using PrimeSTAR Max DNA Polymerase (Takara) with HindIII-anchored primers (Table S4). PCR conditions were as follows: initial heat treatment at 94 °C for 1 min, followed by 32 cycles of 94 °C for 10 s, 63 or 65 °C for 5 s (for the amplification of the Tktl1 or Zp3 promoter, respectively), and 72 °C for 10 s. The PCR products were cloned into the pGL3-Basic vector (Promega, Madison, WI) using the In-Fusion Cloning advantage kit (Clontech, Shiga, Japan).

To produce the pEluc-test vector (Toyobo, Tokyo) containing 76-bp regions upstream of the identified TSSs in the pGL3-Basic vector (see Figure S4), these regions were amplified using ExTaq HS (Takara) with SpeI- and EcoRI-anchored primers (Supplemental Table S6). The 76-bp length was chosen because we previously identified a minimal 76-bp fragment upstream of the TSS of the Tktl1 gene that showed significant transcriptional activity in a luciferase reporter gene assay using 1-cell embryos (Hamamoto et al, 2014). PCR conditions were as follows: initial heat treatment at 94 °C for 2 min, followed by 32 cycles of 94 °C for 20 s, 60 °C for 30 s, and 72 °C for 15 s. The PCR products were cloned into the pEluc-test vector (Toyobo, Tokyo, Japan) using an In-Fusion Cloning advantage kit (Clontech). All constructs were verified by DNA sequencing using a 3130xl Genetic Analyzer (Applied Biosystems, Carlsbad, CA).

Rapid amplification of 5’ cDNA ends (5’ RACE)
Total RNA was extracted using Isogen (Nippon Gene) from 100 1-cell embryos that had been injected with pGL3-vector, and was treated with TURBO DNase (Ambion, Austin, TX, USA) according to the manufacturer’s instructions. Next, 5’ RACE was performed using the GeneRacer Kit (cat#L1500-01; Invitrogen, Carlsbad, CA) according to the manufacturer’s recommendations. After ligation to the adaptor, total RNA was reverse transcribed using the PrimeScript RT-PCR Kit (Takara) with the inner primer specific for firefly luciferase, and then amplified by PCR using LA Taq (Takara) and primer sets for firefly luciferase and the adaptor under the following touchdown PCR program: initial denaturation at 94 °C for 2 min; 5 cycles of denaturation at 94 °C for 30 s, annealing at 72 °C for 2 min and extension at 68 °C for 2 min; 5 cycles of denaturation at 94 °C for 30 s, annealing at 70 °C for 2 min and extension at 68 °C for 2 min; and 25 cycles of denaturation at 94 °C for 30 s, annealing at 62 °C for 30 s and extension at 68 °C for 2 min. For reverse transcription and PCR amplification, the primers for the adaptor and firefly luciferase were 5’-CGACTGGAGCACGAGGACACTGA-3’ and 5’-CGCGCCCAACACCGGCATAA-3’, respectively. The amplification products were resolved in 1.5% agarose gels, and the 500- and 900-bp PCR products were purified using the Wizard SV Gel and PCR Clean Up System (Promega). The products were then cloned into the pCR II-TOPO vector (Promega). Several independent clones were sequenced using a 3130xl Genetic Analyzer (Applied Biosystems, Carlsbad, CA).

SUPPLEMENTAL FIGURES

Figure S1. Validation of the RNA sequence data for eggs and embryos. (A) Comparison of sequencing replicates and biological replicates. The upper two panels show the correlation of two sequencing replicates using the same total RNA samples from MII eggs (left) and 1-cell embryos (right).
The first sequencing was 35-nt single-end (35SE) sequencing and the second was 76-nt paired-end (76PE) sequencing, where only the first 35 nucleotides of one of the paired reads were used for the analysis. The lower two panels show the correlation of two replicates using different preparations of total RNA samples from MII eggs (left) and 1-cell embryos (right). Again, the first sequencing was 35SE and the second was 76PE, where only the first 35 nucleotides of one of the paired reads were used for the analysis. The x- and y-axes show log2 RPKM (reads per kilobase per million) for each gene. (B) Comparison of 76PE Illumina sequencing with 50SE SOLiD sequencing data (Park et al, 2013). The x- and y-axes represent log10(number of reads) in 50-kb tiling windows from chromosome 1 from MII eggs (left panel) and 1-cell embryos (right panel). Chromosome 1 was selected at random to reduce the number of displayed 50-kb windows and so offer better visualization of the data. (C) Comparison of mRNA expression dynamics identified by 76PE Illumina sequencing with those from 50SE SOLiD sequencing data (Park et al, 2013). The axes represent log2 of the indicated mRNA fold-change according to Park et al. (y-axis) and our data (x-axis). (D) Composition of 76PE libraries. Shown is the proportion of reads matching rRNA-derived transcripts (rRNA), transcripts produced from repeats identified by RepeatMasker (repeat), annotated mRNAs (mRNA), and other sequences (other).

Figure S2. rRNA expression during ZGA. (A) Expression of rRNA during minor ZGA. Grey bars and the left axis represent RPKMs mapping to the 5’ETS RNA sequence, while black bars and the right axis represent RPKMs mapping to mature 18S rRNA. (B) Expression of rRNA during early development. Grey bars and the left axis represent RPKMs mapping to the 5’ETS RNA sequence, while black bars and the right axis represent RPKMs mapping to mature 18S rRNA.
Note that the first clear increase in 5’ETS RNA appears between the 1-cell and 2-cell stages, while the highest surge of 5’ETS RNA abundance appears between the 2-cell and 4-cell stages.

Figure S3. Features of 1-cell transcription. (A) A higher density of low-CPM reads is reproducibly observed in the 1-cell deep-sequencing datasets. Shown is a UCSC browser screenshot of a 5-Mb segment from chromosome 3. The inset provides a higher magnification of the indicated region. The bottom track was produced by merging all three 1-cell sequencing sets. Note that merging all three 1-cell datasets does not result in higher peaks but in an increased density of low-CPM reads. Shown are reads mapped to the genome; the vertical scale was trimmed at 20 reads, and trimming is indicated by horizontal dashed lines. (B) Comparison of Y-chromosome-derived reads in 76PE Illumina and 50SE SOLiD data (Park et al, 2013). A UCSC browser screenshot of chromosome Y (nucleotides 1-3,000,000) shows 76PE and 50SE raw data mapped to the genome. The vertical scale was trimmed at 10 reads for 76PE data and 40 reads for 50SE data, since the 50SE data are several-fold deeper; trimming is indicated by horizontal dashed lines. The residual signal on chromosome Y in MII eggs and in 1-cell embryos treated with DRB is an artifact caused by common retrotransposon-derived sequences (mainly MT-derived).

Figure S4. MuERV-L retrotransposons yield a distinct asymmetric signature in HTS data from 1-cell and 2-cell embryos. (A) HTS reveals different patterns of retrotransposon expression during early development. The y-axis shows the relative abundance of reads mapping to each specific retrotransposon. The value in 1-cell embryos (red column) was set to one. (B) An example of transcriptional activity of two MuERV-L elements located in an intergenic region on chromosome 1. A UCSC browser screenshot shows 76-nt paired-end read data mapped to the genome; the vertical scale represented by grey columns is 100 reads (there was no trimming).
Note the two distinct peaks in the 2-cell embryos (reflecting high MuERV-L expression and 3’end processing) and the apparent extension of transcription far into the genomic flank. In contrast, the 1-cell embryo shows a low level of transcription within the same genomic flank but no distinct peak over the MuERV-L elements (see also Fig. S2C). (C) HTS data from the two intergenic regions analyzed by RT-PCR. Shown are reads mapped to the genome; the vertical scale was trimmed at 20 reads, and trimming is indicated by horizontal dashed lines. The positions of PCR amplicons #1 and #2 are indicated by dashed lines with arrowheads.

Figure S5. (A) Opportunistic transcription lacking defined core promoter elements in 1-cell embryos. The pGL3-Basic vector was injected into the nucleus of growing oocytes, 2-cell embryos, and the male pronucleus of 1-cell embryos, and transcription from TSS#1 and #2 (see Fig. 4) was examined using RT-PCR. (B) Analysis of transcriptional regulatory elements in the pGL3-Basic vector. Transcription start sites were identified using 5’ RACE. Transcriptional regulatory elements (GC-box, BRE, TATA-box, initiator and DPE) located near the TSSs identified by 5’ RACE were searched for in the pGL3-Basic vector using the definitions described in previous reports (Smale & Kadonaga, 2003). We identified a TATA-box upstream of TSS#1s, but it was distant from TSS#1-2. Although functional TATA-boxes are typically located ~30 bp upstream of the TSS (Butler & Kadonaga, 2002), the TATA-box was found 42 and 79 bp upstream of TSS#1-1 and TSS#1-2, respectively. There was no TATA-box in the upstream region of TSS#2s. We also identified GC-boxes in the regions upstream of four TSSs, but for TSS#1-1 and TSS#2-2 they were located closer to the TSS than the expected position (~60 bp upstream of the TSS). Inr was not identified around any TSS. In addition, no DPE was found in the regions downstream of any TSS (except for TSS#2-5, where two potential DPEs were present 30 and 38 bp downstream).

Figure S6. Analysis of genes transcribed during the 1-cell stage. (A) Expression levels in different tissues of the 4039 genes that we identified as transcribed in 1-cell embryos. Expression of the 4039 genes was analyzed using tissue expression data selected from the Gnf1M microarray datasets from BioGPS (Su et al, 2004). The greyscale indicates the approximate mRNA expression level (log2[BioGPS expression value]) in each tissue. (B) Hierarchical clustering of 76-nt paired-end datasets from different stages of preimplantation development, performed on Spearman correlations between log2 RPKM values for repeat-masked intron-mapping reads. (C) Examples of two genes that did not pass our filtering criteria for the list of 4039 genes but appear to be transcribed at the 1-cell stage. Both genes were previously found to be highly sensitive to α-amanitin at the 2-cell stage (Zeng & Schultz, 2005).

Figure S7. Relaxed post-transcriptional processing in 1-cell embryos. (A) Deficient splicing in parthenogenetically activated embryos. MII eggs (MII) and 1-cell embryos, which had been parthenogenetically activated (Par) or fertilized (Fer), were subjected to RT-PCR using primer pairs spanning the splicing junctions of Klf5, Nid2, Mxra7 and Sord. (B) Concordance of the 76PE dataset with previously published HTS data (50SE; Park et al, 2013). UCSC browser screenshots show raw HTS results: MII oocyte and 1-cell data from Park et al (2013), and MII, 1-cell, 1-cell+DRB and 2-cell data from our 76PE libraries. The dashed lines represent 20 reads. The 76PE libraries contain ~4-7 × 10^6 reads that do not map to rRNA, while the depth of the other samples is ~10× larger.

Figure S8. Inefficient processing of protein-coding transcripts at the 1-cell stage in 50SE data (Park et al, 2013). (A) Example of a gene transcribed at the 1-cell stage. Shown are reads from 1-cell- and 2-cell-derived 50SE sequenced libraries mapping to the Klf5 gene region.
Their distribution indicates inefficient splicing and 3’end processing in 1-cell embryos. In contrast to the 2-cell embryo, the 1-cell embryo shows no enrichment of exon-derived reads (exons are depicted as black rectangles) and no apparent transcriptional termination at the 3’end. A detailed analysis of all reads mapping to the Klf5 gene identified a single read derived from a spliced Klf5 transcript. Shown below is the profile of Zp3, an abundantly expressed oocyte-specific gene with very well defined exon-intron boundaries, which are also retained at the 1-cell stage. The vertical scale was trimmed at 2.5 CPM; trimming is indicated by horizontal dashed lines. The blue scale bars represent 10 kb. (B) Transcription termination analysis. Lines represent median ratios of read counts per kb (RPK) downstream of the transcription termination site to RPK in exons, for gene sets transcribed in 1-cell embryos and subsequent stages but not in MII eggs. Downstream regions of genes with at least one RPK in exons are divided into 1-kb slices, and the reads in each slice are counted and divided by the exonic RPK value of the respective gene (point 1, 100%). The 1-cell stage (including 1-cell parthenogenotes) shows a higher downstream-to-exon read ratio, indicating the extension of transcription past the polyA site. (C) Violin plot distributions of intron/exon read count ratios per cell stage for genes not transcribed in MII eggs. Intron and exon read counts were normalized to 1-kb length (RPK) and divided to obtain the read ratio for each region transcribed at the 1-cell stage or later. The 1-cell stage (including 1-cell parthenogenotes) shows a shift towards higher intron/exon ratios, indicating that a larger proportion of transcripts contain unspliced intronic regions compared to the later stages. The MII stage is displayed as a control and contains no values.
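
As an illustration of the downstream-to-exon ratio plotted in panel B, the following minimal sketch divides the read count of each 1-kb downstream slice by the exonic RPK of the gene, discards ratios above 1 (as stated in the Transcription termination analysis procedure), and takes the per-slice median across genes. The function names and the list-based inputs are hypothetical and this is not the authors' implementation, which was written in R.

```python
from statistics import median
from typing import List, Optional


def bin_ratios(bin_counts: List[int], exon_rpk: float) -> List[Optional[float]]:
    """Ratio of each 1-kb downstream bin count to the exonic RPK of the gene.

    Ratios exceeding 1 are replaced by None, i.e. removed from the analysis.
    """
    ratios: List[Optional[float]] = []
    for count in bin_counts:
        ratio = count / exon_rpk
        ratios.append(ratio if ratio <= 1.0 else None)
    return ratios


def median_profile(per_gene: List[List[Optional[float]]]) -> List[Optional[float]]:
    """Median ratio per downstream bin across genes, ignoring removed values."""
    n_bins = len(per_gene[0])
    profile: List[Optional[float]] = []
    for i in range(n_bins):
        values = [ratios[i] for ratios in per_gene if ratios[i] is not None]
        profile.append(median(values) if values else None)
    return profile


if __name__ == "__main__":
    # Three toy genes with three downstream bins each (the procedure uses ten).
    genes = [
        bin_ratios([2, 1, 6], exon_rpk=4.0),
        bin_ratios([1, 3, 9], exon_rpk=4.0),
        bin_ratios([1, 8, 7], exon_rpk=5.0),
    ]
    print(median_profile(genes))
```

Because each slice is exactly 1 kb long, the raw slice count is already an RPK value, so the ratio compares like with like.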

SUPPLEMENTAL TABLES

Table S1 (Table S1.xlsx) Overview of all deep-sequenced libraries presented in the manuscript.
Table S2 (Table S2.xlsx) Genes transcribed in 1-cell embryos: selection based on exon-mapping reads.
Table S3 (Table S3.xlsx) Genes transcribed in 1-cell embryos: selection based on intron-mapping reads.
Table S4 (Table S4.xlsx) Primers and PCR conditions.

SUPPLEMENTAL REFERENCES

Butler JE, Kadonaga JT (2002) The RNA polymerase II core promoter: a key component in the regulation of gene expression. Genes Dev 16: 2583-2592

Hamamoto G, Suzuki T, Suzuki MG, Aoki F (2014) Regulation of transketolase like 1 gene expression in the murine one-cell stage embryos. PLoS One 9

Park S-J, Komata M, Inoue F, Yamada K, Nakai K, Ohsugi M, Shirahige K (2013) Inferring the choreography of parental genomes during fertilization from ultralarge-scale whole-transcriptome analysis. Genes Dev 27: 2736-2748

Smale ST, Kadonaga JT (2003) The RNA polymerase II core promoter. Annu Rev Biochem 72: 449-479

Su AI, Wiltshire T, Batalov S, Lapp H, Ching KA, Block D, Zhang J, Soden R, Hayakawa M, Kreiman G, Cooke MP, Walker JR, Hogenesch JB (2004) A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci U S A 101: 6062-6067

Trapnell C, Pachter L, Salzberg SL (2009) TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 25: 1105-1111

Zeng F, Schultz RM (2005) RNA transcript profiling during zygotic gene activation in the preimplantation mouse embryo. Dev Biol 283: 40-57