RNA-Seq technology and its application to dosage compensation between the X chromosome and autosomes in mammals
2011-12-05

Outline
– RNA-Seq: technologies and methodologies
– Application to the dosage compensation model

RNA-Seq: technologies and methodologies

Transcriptomics methods before RNA-Seq
Hybridization-based approaches
– Genomic tiling microarrays
– Fluorescently labelled cDNA with microarrays
Sequence-based approaches
– Sanger sequencing of cDNA or EST libraries
– Serial analysis of gene expression (SAGE)
– Cap analysis of gene expression (CAGE)
– Massively parallel signature sequencing (MPSS)

A typical RNA-Seq experiment

Sequencers used for RNA-Seq
– Illumina IG
– Applied Biosystems SOLiD
– Roche 454 Life Science
– Helicos Biosciences tSMS (has not yet been used for published RNA-Seq studies, data from Jan. 2009)

Direct RNA sequencing using the Helicos approach
a | RNA that is polyadenylated and 3′ deoxyblocked with poly(A) polymerase is captured on poly(dT)-coated surfaces. A 'fill-and-lock' step is performed, in which the 'fill' step is performed with natural thymidine and polymerase, and the 'lock' step is performed with fluorescently labelled A, C and G Virtual Terminator (VT) nucleotides and polymerase. This step corrects for any misalignments that may be present in poly(A) and poly(T) duplexes, and ensures that the sequencing starts in the RNA template rather than the polyadenylated tail.
b | Imaging is performed to locate the positions of the templates. Then, chemical cleavage of the dye–nucleotide linker is performed to release the dye and prepare the templates for nucleotide incorporation.
c | Incubation of this surface with one labelled nucleotide (C-VT is shown as an example) and a polymerase mixture is carried out. After this step, imaging is performed to locate the templates that have incorporated the nucleotide. Chemical cleavage of the dye allows the surface and DNA templates to be ready for the next nucleotide-addition cycle.
Nucleotides are added in the C, T, A, G order for 120 total cycles (30 additions of each nucleotide).

Advantages of RNA-Seq compared with other transcriptomics methods

Quantifying expression levels: RNA-Seq and microarray compared

Challenges for RNA-Seq
Library construction
– Bias in the results from different library construction methods (RNA fragmentation and cDNA fragmentation) for large RNA
– Strand-specific libraries are currently laborious to produce
Bioinformatic challenges
– The development of efficient methods to store, retrieve and process large amounts of data
– Mapping reads to the genome
Coverage versus cost

DNA library preparation: RNA fragmentation and DNA fragmentation compared
a | Fragmentation of oligo-dT primed cDNA (blue line) is more biased towards the 3' end of the transcript. RNA fragmentation (red line) provides more even coverage along the gene body, but is relatively depleted for both the 5' and 3' ends. Note that the ratio between the maximum and minimum expression level (the dynamic range) is 44 for microarrays and 9,560 for RNA-Seq. The tag count is the average sequencing coverage for 5,000 yeast ORFs.
b | A specific yeast gene, SES1 (seryl-tRNA synthetase), is shown.
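The kind of gene-body coverage profile compared in the caption above could be computed roughly as follows. This is a minimal sketch, not the method used in the cited figure: the function name `relative_coverage`, the `(length, read_starts, read_length)` input layout, and the choice of ten bins are all assumptions made for illustration. Position 0 is taken to be the 5' end of each transcript.

```python
def relative_coverage(transcripts, n_bins=10):
    """Average per-base read depth in n_bins equal slices of each transcript.

    transcripts: list of (length, read_starts, read_length) tuples, a
    hypothetical layout in which read_starts are 0-based offsets from the
    5' end of the transcript.
    """
    totals = [0.0] * n_bins
    for length, read_starts, read_length in transcripts:
        # Per-base depth along this transcript.
        depth = [0] * length
        for start in read_starts:
            for pos in range(start, min(start + read_length, length)):
                depth[pos] += 1
        # Mean depth inside each relative-position bin.
        for b in range(n_bins):
            lo = b * length // n_bins
            hi = (b + 1) * length // n_bins
            if hi > lo:
                totals[b] += sum(depth[lo:hi]) / (hi - lo)
    # Average the binned profiles over all transcripts.
    return [t / len(transcripts) for t in totals]


# Made-up example: every read falls at the 3' end of a 100-nt transcript.
three_prime_biased = relative_coverage([(100, [90] * 5, 10)])

# Made-up example: reads tile the same transcript evenly.
evenly_covered = relative_coverage([(100, list(range(0, 100, 10)), 10)])
```

With input like the first example, the profile is zero everywhere except the last bin, which is the 3'-biased shape the caption attributes to fragmentation of oligo-dT primed cDNA. The second example gives a flat profile.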
Coverage versus depth

Methodologies for RNA-Seq studies
– Mapping transcription start sites
– Strand-specific RNA-Seq
– Characterization of alternative splicing patterns
– Gene fusion detection
– Targeted approaches using RNA-Seq
– Small RNA profiling
– Direct RNA sequencing
– Profiling low-quantity RNA samples

Mapping transcription start sites (TSSs)
Advantages
– Low quantities of input RNA
– Paired-end sequencing enables identified TSSs to be linked to specific transcripts
– Paired-end sequencing alleviates the difficulty of aligning single short reads to repeat regions
Disadvantages
– Primer dimers can dominate sequencing data sets
– Dependent on cDNA synthesis or hybridization steps
– Challenging for short-lived transcripts

Strand-specific RNA-Seq
– Adaptors with known orientations are ligated to the ends of RNAs or to first-strand cDNA molecules
– Direct sequencing of the first-strand cDNA products
– Selective chemical marking of the second-strand cDNA synthesis products or RNA

Characterization of alternative splicing patterns
a | Sequence reads are mapped to genomic DNA or to a transcriptome reference to detect alternative isoforms of an RNA transcript. Mapping is based simply on read counts to each exon and reads that span the exonic boundaries. One infers the absence of the genomic exon in the transcript by virtue of no reads mapping to the genomic location.
b | Paired sequence reads provide additional information about exonic splicing events, as demonstrated by matching the first read in one exon and placing the second read in the downstream exon, creating a map of the transcript structure.

Gene fusion detection

Targeted approaches using RNA-Seq

Small RNA profiling

Direct RNA sequencing
a | RNA that is polyadenylated and 3′ deoxyblocked with poly(A) polymerase is captured on poly(dT)-coated surfaces.
A 'fill-and-lock' step is performed, in which the 'fill' step is performed with natural thymidine and polymerase, and the 'lock' step is performed with fluorescently labelled A, C and G Virtual Terminator (VT) nucleotides and polymerase. This step corrects for any misalignments that may be present in poly(A) and poly(T) duplexes, and ensures that the sequencing starts in the RNA template rather than the polyadenylated tail.
b | Imaging is performed to locate the positions of the templates. Then, chemical cleavage of the dye–nucleotide linker is performed to release the dye and prepare the templates for nucleotide incorporation.
c | Incubation of this surface with one labelled nucleotide (C-VT is shown as an example) and a polymerase mixture is carried out. After this step, imaging is performed to locate the templates that have incorporated the nucleotide. Chemical cleavage of the dye allows the surface and DNA templates to be ready for the next nucleotide-addition cycle. Nucleotides are added in the C, T, A, G order for 120 total cycles (30 additions of each nucleotide).

Profiling low-quantity RNA samples
a | Single-molecule DNA and RNA sequencing technologies could be modified for single-cell applications. Cells can be delivered to flow cells using fluidics systems, followed by cell lysis and capture of mRNA species on the poly(dT)-coated sequencing surfaces by hybridization. Standard sequencing runs could take place on channels with a 127.5 mm² surface area, requiring 2,750 images to be taken per cycle to image the entire channel area. The surface area needed to accommodate ~350,000 mRNA molecules contained in a single cell is ~0.4 mm²; thus, only eight images per cycle would be needed. Sequence analysis can be done with direct RNA sequencing (DRS) or on-surface cDNA synthesis followed by single-molecule DNA sequencing.
b | nCounter system workflow.
Two probes are used for each target site: the capture probe (shown in red) contains a target-specific sequence and a modification that allows the immobilization of the molecules on a surface; the reporter probe contains a different target-specific sequence (shown in blue) and a fluorescent barcode (shown by a green circle) that is unique to each target being examined. After hybridization of the capture and reporter probe mixture to RNA samples in solution, excess probes are removed. The hybridized RNA duplexes are then immobilized on a surface and imaged to identify and count each transcript with the unique fluorescent signals on the capture and reporter probes.

References
Wang, Z. et al. RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics 10, 57 (2009).
Ozsolak, F. et al. RNA sequencing: advances, challenges and opportunities. Nature Reviews Genetics 12, 87 (2011).
Martin, J. A. et al. Next-generation transcriptome assembly. Nature Reviews Genetics 12, 671 (2011).
Kapranov, P. et al. New class of gene-termini-associated human RNAs suggests a novel RNA copying mechanism. Nature 466, 642 (2010).

Application to the dosage compensation model

Background
Ohno's hypothesis
– X-linked genes are expressed at twice the level of autosomal genes per active allele to balance the gene dose between the X chromosome and autosomes.
– Microarray data (X:AA ~ 1)

Abstract from Xiong et al.
Mammalian cells from both sexes typically contain one active X chromosome but two sets of autosomes. It has previously been hypothesized that X-linked genes are expressed at twice the level of autosomal genes per active allele to balance the gene dose between the X chromosome and autosomes (termed 'Ohno's hypothesis'). This hypothesis was supported by the observation that microarray-based gene expression levels were indistinguishable between one X chromosome and two autosomes (the X to two autosomes ratio (X:AA) ~1).
Here we show that RNA sequencing (RNA-Seq) is more sensitive than microarray and that RNA-Seq data reveal an X:AA ratio of ~0.5 in human and mouse. In Caenorhabditis elegans hermaphrodites, the X:AA ratio reduces progressively from ~1 in larvae to ~0.5 in adults. Proteomic data are consistent with the RNA-Seq results and further suggest the lack of X upregulation at the protein level. Together, our findings reject Ohno's hypothesis, necessitating a major revision of the current model of dosage compensation in the evolution of sex chromosomes.

Expression level definition
Taking mouse as an example, we mapped all 25-mer RNA-Seq reads to the genome sequence. Only those reads uniquely mapped to exons were considered as valid hits for a given gene. The expression level of a gene is defined by the number of valid hits to the gene divided by the effective length of the gene, which is the total number of 25-mers in the DNA sequences of the exons of the gene that have no other matches anywhere in the genome. For comparisons between tissues or developmental stages, expression levels were normalized by dividing by the total number of valid hits in the sample.

Comparison of gene expression measured by microarray and RNA-Seq
Human liver is considered unless otherwise noted.
(a) Estimation variation measured by the fold difference of microarray intensities of two same-target probesets or of RNA-Seq signals from two halves of the same gene.
(b) Identical to a, except that mouse liver is considered here.
(c) Comparison of the internal consistency of RNA-Seq data and microarray data. The expression differences from one-half of the nucleotides (RNA-Seq) or a probeset (microarray) are shown for 1,000 randomly picked gene pairs each with twofold ± 0.01-fold expression difference from the other half of nucleotides (RNA-Seq) or from the other probeset (microarray). The central bold line shows the median, the box encompasses 50% of data points and the error bars include 90% of data points.
(d) Pearson's correlation (r) of microarray and RNA-Seq expression signals (gray) and of RNA-Seq signals from two independent experiments (black). A certain fraction of genes (x axis) with the highest expression according to one of the RNA-Seq datasets are examined. Error bars show 95% confidence intervals estimated by bootstrapping.
(e) Microarray consistently underestimates expression differences between genes. The microarray expression differences of 1,000 randomly picked gene pairs each with x-fold (x = 2 ± 0.01, 4 ± 0.02, 8 ± 0.04, 16 ± 0.08, 32 ± 0.16, and 64 ± 0.32) RNA-Seq expression difference are shown. The central bold line shows the median, the box encompasses 50% of data points and the error bars include 90% of data points.
(f) Relative liver expression of 55 mouse genes, measured by RNA-Seq, microarray and qRT-PCR.

Comparisons of RNA-Seq gene expression levels between the X chromosome and autosomes in 12 human tissues and 3 mouse tissues
(a) The median expression levels of X-linked genes (closed diamonds) and autosomal genes (open circles) are compared. Median expressions of autosomal genes were normalized to 1. Error bars show 95% bootstrap confidence intervals. Sex information is listed in the parentheses after the tissue names (M, male; F, female; NA, unknown).
(b) X:AA ratios of median expressions from the human liver when X is compared to individual autosomes. Error bars show 95% bootstrap confidence intervals.

Testing upregulation in Ohno's hypothesis
Upregulation in Ohno's hypothesis
In Ohno's hypothesis, upregulation is needed for those X-linked genes that had existed in the genome before the emergence of the X chromosome; X-linked genes that originated de novo on X presumably do not require upregulation.

Comparison of RNA-Seq gene expression levels of the X chromosome and autosomes in C. elegans

Caveats in this RNA-Seq analysis
– The Illumina sequencing used here may be biased toward certain sequences or nucleotides.
– Reverse transcription during cDNA library preparation is likely to be less efficient for longer transcripts.
– GC content may affect RNA-Seq results.
– A recent study using time-course microarray data excluded lowly expressed genes, which is inappropriate for measuring the absolute value of the X:AA ratio.

Main idea
Here we contend that the low estimate of the X:AA ratio by Xiong et al. stems from the disproportionate contribution of transcriptionally inactive genes, which are not relevant for the evaluation of dosage compensation mechanisms, to the X chromosome average. We show that when only active genes are considered, the RNA-seq data give X:AA ratios closer to 1, and the observed minor deviation of the X:AA ratio from 1 is within the range expected when taking into account chromosome-to-chromosome variability.

Key notes
– RPKM: the number of associated reads per kilobase of exonic sequence per million of total reads sequenced.
– We assert that the effect of a mechanism that regulates transcriptional dosage compensation pertains only to the expression magnitude of transcriptionally active genes.
– The fraction of undetected (RPKM = 0) genes is substantially higher on the X chromosome than on autosomes, accounting for as much as 40% of all the X-linked genes.
– Threshold in the analysis: RPKM ≥ 1 with at least 3 reads

Fraction of transcriptionally inactive genes on autosomes and X chromosome

The ratio of the median transcription magnitudes of X-linked and autosomal genes
The X:AA ratio estimates are shown based on the set of genes with minimal transcription (RPKM ≥ 1 and at least 3 associated reads). Black error bars show the 95% confidence interval (CI) based on bootstrap estimates incorrectly assuming independence of expression levels for neighboring genes (plotted here for reference; not used to make inferences).
Red bars show the range around 1 into which the X:AA ratio is expected to fall (95% CI) in the presence of twofold upregulation of the X chromosome, taking into account interchromosomal variation (sampling of contiguous blocks of X-chromosome size from the autosomal portion of the genome). The observed X:AA values (black dots) in all tissues fall within this range, indicating that the observed transcriptional magnitude of X-linked genes is compatible with the presence of twofold upregulation. The blue bars show the range around 0.5 into which the X:AA ratio is expected to fall in the absence of X-chromosome upregulation (50% of the autosomal expression level). The X:AA estimates for the first five samples fall outside of this range, indicating that the X-linked expression magnitude is significantly higher than that expected in the absence of dosage compensation. The X:AA values for other samples are within both the red and blue ranges, indicating that the two hypotheses (X:AA = 1 and X:AA = 0.5) cannot be clearly distinguished based on these individual data sets.

The chr. 10:A and chr. 11:A ratios illustrating chromosome-to-chromosome variability

Mouse RNA-seq data shows a lack of dosage compensation

Dependence of the X:AA estimates on the RPKM threshold
The tissue-averaged X:AA estimates are shown (black) as a function of the minimal RPKM threshold, from 0 (all genes, including those with undetected expression) to RPKM ≥ 2. The error bars correspond to the s.e.m. between different tissues. The largest change in the ratio is observed after exclusion of genes with undetected expression (RPKM > 0). As the RPKM threshold increases, the X:AA ratio largely stabilizes above RPKM = 1. The application of an RPKM threshold increases the median expression level and can artificially shift the X:AA ratio closer to 1.
The shaded gray region shows the 95% confidence envelope for the hypothetical X chromosome that is expressed at 50% of the autosomal level (see Supplementary Methods). For non-zero RPKM thresholds, the observed X:AA ratios lie outside of this 95% confidence interval, showing that the X:AA ratios are higher than would be expected from applying an RPKM threshold alone.

Discussion
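To make the threshold argument from the preceding slides concrete, here is a minimal sketch of an RPKM calculation and a median-based X:AA ratio with the RPKM ≥ 1 and at least 3 reads filter. The function names, the `(chromosome, read_count, exon_length_bp)` input layout and the toy numbers are assumptions made for illustration, not data or code from either study. Every chromosome not labelled "X" is treated as autosomal.

```python
import statistics


def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exonic sequence per million mapped reads."""
    return read_count / (exon_length_bp / 1_000) / (total_mapped_reads / 1_000_000)


def x_to_aa_ratio(genes, total_mapped_reads, min_rpkm=1.0, min_reads=3):
    """Median X-linked RPKM divided by median autosomal RPKM.

    genes: iterable of (chromosome, read_count, exon_length_bp) tuples,
    a hypothetical input layout chosen for this sketch.
    Genes below either threshold are excluded from both medians.
    """
    x_values, autosomal_values = [], []
    for chromosome, read_count, exon_length_bp in genes:
        value = rpkm(read_count, exon_length_bp, total_mapped_reads)
        if value < min_rpkm or read_count < min_reads:
            continue  # treated as transcriptionally inactive
        if chromosome == "X":
            x_values.append(value)
        else:
            autosomal_values.append(value)
    # statistics.median raises StatisticsError if either group is empty.
    return statistics.median(x_values) / statistics.median(autosomal_values)


# Made-up toy data: 1 kb genes and 1 million mapped reads,
# so each gene's RPKM equals its read count.
toy_genes = [("X", n, 1_000) for n in (0, 0, 10, 20, 30)]
toy_genes += [("1", n, 1_000) for n in (10, 20, 30)]

# All genes, including the two undetected X-linked genes.
ratio_all_genes = x_to_aa_ratio(toy_genes, 1_000_000, min_rpkm=0.0, min_reads=0)

# Only genes passing RPKM >= 1 with at least 3 reads.
ratio_active_genes = x_to_aa_ratio(toy_genes, 1_000_000)

print(ratio_all_genes, ratio_active_genes)
```

In this toy data set the two undetected X-linked genes pull the X median down, so the ratio is 0.5 over all genes and 1.0 once the threshold is applied. This only illustrates the direction of the effect. It does not reproduce the confidence envelopes described in the slides, which require resampling contiguous autosomal blocks.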