RNA-Seq technology and its application to
dosage compensation between the X
chromosome and autosomes in mammals
2011-12-05
Outline

RNA-Seq: technologies and their methodologies

Application to the dosage compensation model
RNA-Seq: technologies and their methodologies
Transcriptomics methods before RNA-Seq


Hybridization-based approaches

Genomic tiling microarrays

Fluorescently labelled cDNA with microarrays
Sequence-based approaches

Sanger sequencing of cDNA or EST libraries

Serial analysis of gene expression (SAGE)

Cap analysis of gene expression (CAGE)

Massively parallel signature sequencing
(MPSS)
A typical RNA-Seq experiment
Sequencer used for RNA-Seq

Illumina IG

Applied Biosystems SOLiD

Roche 454 Life Science

Helicos Biosciences tSMS (had not yet been used for published
RNA-Seq studies as of Jan. 2009)
Direct RNA sequencing using the Helicos
approach
a | RNA that is polyadenylated and 3′ deoxyblocked with poly(A) polymerase is captured on
poly(dT)-coated surfaces. A 'fill-and-lock' step is
performed, in which the 'fill' step is performed with
natural thymidine and polymerase, and the 'lock'
step is performed with fluorescently labelled A, C
and G Virtual Terminator (VT) nucleotides and
polymerase. This step corrects for any
misalignments that may be present in poly(A) and
poly(T) duplexes, and ensures that the sequencing
starts in the RNA template rather than the
polyadenylated tail. b | Imaging is performed to
locate the positions of the templates. Then,
chemical cleavage of the dye–nucleotide linker is
performed to release the dye and prepare the
templates for nucleotide incorporation. c |
Incubation of this surface with one labelled
nucleotide (C-VT is shown as an example) and a
polymerase mixture is carried out. After this step,
imaging is performed to locate the templates that
have incorporated the nucleotide. Chemical
cleavage of the dye allows the surface and DNA
templates to be ready for the next nucleotide-addition cycle. Nucleotides are added in the C, T, A,
G order for 120 total cycles (30 additions of each
nucleotide).
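The cycle scheme in panel c can be illustrated with a toy simulation. This is a minimal sketch, not Helicos software: the name `simulate_cycles`, the pairing table, and the assumption that a matching labelled nucleotide extends the strand by exactly one base per cycle are ours, and imaging and dye cleavage are not modelled.

```python
FLOW_ORDER = "CTAG"  # order of nucleotide additions given in the caption

# Nucleotide that pairs with each RNA template base during synthesis.
PAIRS_WITH = {"A": "T", "C": "G", "G": "C", "U": "A"}


def simulate_cycles(template, total_cycles=120, flow_order=FLOW_ORDER):
    """Return the strand synthesized against `template`.

    `template` lists RNA bases in the order the polymerase copies them.
    Assumption: a matching labelled nucleotide extends the strand by exactly
    one base per cycle; a non-matching nucleotide is washed away.
    """
    synthesized = []
    position = 0
    for cycle in range(total_cycles):
        if position == len(template):
            break  # template fully copied
        nucleotide = flow_order[cycle % len(flow_order)]
        if PAIRS_WITH[template[position]] == nucleotide:
            synthesized.append(nucleotide)
            position += 1
    return "".join(synthesized)


additions_per_nucleotide = 120 // len(FLOW_ORDER)  # 30, as stated in the caption
print(simulate_cycles("GAUC"))  # CTAG
```

Under this model a cycle extends only those templates whose next base matches the added nucleotide, so 120 cycles yield a read shorter than 120 bases.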
Advantages of RNA-Seq compared with
other transcriptomics methods
Quantifying expression levels: RNA-Seq
and microarray compared
Challenges for RNA-Seq



Library construction
– Different fragmentation methods for large RNAs (RNA
fragmentation versus cDNA fragmentation) bias the
results in different ways
– Strand-specific libraries are currently laborious to
produce
Bioinformatic challenges
– Developing efficient methods to store, retrieve and
process large amounts of data
– Mapping reads to the genome
Coverage versus cost
DNA library preparation: RNA fragmentation
and DNA fragmentation compared
a | Fragmentation of oligo-dT
primed cDNA (blue line) is more
biased towards the 3' end of the
transcript. RNA fragmentation (red
line) provides more even
coverage along the gene body,
but is relatively depleted for both
the 5' and 3' ends. Note that the
ratio between the maximum and
minimum expression levels (the
dynamic range) is 44 for microarrays
but 9,560 for RNA-Seq. The
tag count is the average
sequencing coverage for 5,000
yeast ORFs. b | A specific yeast
gene, SES1 (seryl-tRNA
synthetase), is shown.
Coverage versus depth
Methodologies for RNA-Seq studies

Mapping transcription start sites

Strand-specific RNA-Seq

Characterization of alternative splicing patterns

Gene fusion detection

Targeted approaches using RNA-Seq

Small RNA profiling

Direct RNA sequencing

Profiling low-quantity RNA samples
Mapping transcription start sites (TSSs)


Advantages

Low quantities of input RNA

Paired-end sequencing allows identified TSSs to be
assigned to specific transcripts

Paired-end sequencing alleviates the difficulty of aligning
single short reads to repeat regions
Disadvantages

Primer dimers can dominate sequencing data sets

Dependent on cDNA synthesis or hybridization steps

Challenging for short-lived transcripts
Strand-specific RNA-Seq



Adaptors with known orientations are ligated to the ends of
RNAs or to first-strand cDNA molecules
Direct sequencing of the first-strand cDNA products
Selective chemical marking of the second-strand cDNA
synthesis products or RNA
Characterization of alternative splicing patterns
a | Sequence reads are mapped to
genomic DNA or to a transcriptome
reference to detect alternative
isoforms of an RNA transcript.
Mapping is based simply on read
counts to each exon and reads that
span the exonic boundaries. One
infers the absence of the genomic
exon in the transcript by virtue of no
reads mapping to the genomic
location. b | Paired sequence reads
provide additional information about
exonic splicing events, as
demonstrated by matching the first
read in one exon and placing the
second read in the downstream
exon, creating a map of the
transcript structure.
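The inference described in this caption can be sketched in a few lines. The function `classify_exons`, its labels, and the `min_reads` cut-off are hypothetical names chosen for illustration; real pipelines work from alignments rather than pre-computed counts.

```python
def classify_exons(exon_order, exon_reads, junction_reads, min_reads=1):
    """Label each exon from read counts and exon-exon junction reads.

    exon_order     : exon names in genomic order
    exon_reads     : {exon: reads mapped inside the exon}
    junction_reads : {(upstream_exon, downstream_exon): reads spanning them}
    """
    index = {exon: i for i, exon in enumerate(exon_order)}
    calls = {}
    for exon, i in index.items():
        covered = exon_reads.get(exon, 0) >= min_reads
        # Reads that join an upstream exon directly to a downstream exon
        # are evidence that this exon was spliced out.
        skip_support = sum(
            count
            for (upstream, downstream), count in junction_reads.items()
            if index[upstream] < i < index[downstream]
        )
        skipped = skip_support >= min_reads
        if covered and skipped:
            calls[exon] = "alternative"  # both isoforms observed
        elif covered:
            calls[exon] = "included"
        elif skipped:
            calls[exon] = "skipped"
        else:
            calls[exon] = "absent"  # no reads and no junction evidence
    return calls


calls = classify_exons(
    ["E1", "E2", "E3", "E4"],
    {"E1": 40, "E2": 0, "E3": 25, "E4": 38},
    {("E1", "E3"): 12, ("E3", "E4"): 20},
)
print(calls["E2"])  # skipped
```

Paired reads (panel b) would enter the same function as additional `junction_reads` entries linking the exons in which the two mates land.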
Gene fusion detection
Targeted approaches using RNA-Seq
Small RNA profiling
Direct RNA sequencing
Profiling low-quantity RNA samples
a | Single-molecule DNA and RNA sequencing technologies could be modified for single-cell applications. Cells can be delivered to flow
cells using fluidics systems, followed by cell lysis and capture of mRNA species on the poly(dT)-coated sequencing surfaces by
hybridization. Standard sequencing runs could take place on channels with a 127.5 mm² surface area, requiring 2,750 images to be taken
per cycle to image the entire channel area. The surface area needed to accommodate ~350,000 mRNA molecules contained in a single
cell is ~0.4 mm²; thus, only eight images per cycle would be needed. Sequence analysis can be done with direct RNA sequencing (DRS)
or on-surface cDNA synthesis followed by single-molecule DNA sequencing. b | nCounter system workflow. Two probes are used for each
target site: the capture probe (shown in red) contains a target-specific sequence and a modification that allows the immobilization of the
molecules on a surface; the reporter probe contains a different target-specific sequence (shown in blue) and a fluorescent barcode (shown
by a green circle) that is unique to each target being examined. After hybridization of the capture and reporter probe mixture to RNA
samples in solution, excess probes are removed. The hybridized RNA duplexes are then immobilized on a surface and imaged to identify
and count each transcript with the unique fluorescent signals on the capture and reporter probes.
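The imaging figures in panel a follow from simple proportional scaling. The sketch below only re-derives them from the numbers quoted in the caption; the constant and function names are ours, and the assumption is that every image covers an equal share of the channel.

```python
CHANNEL_AREA_MM2 = 127.5     # surface area of one channel
IMAGES_PER_CHANNEL = 2750    # images needed to cover the whole channel
MRNA_PER_CELL = 350_000      # approximate mRNA molecules in one cell
CELL_FOOTPRINT_MM2 = 0.4     # area needed to hold those molecules


def images_needed(area_mm2):
    """Images required to cover `area_mm2`, assuming equal-sized images."""
    area_per_image = CHANNEL_AREA_MM2 / IMAGES_PER_CHANNEL
    return area_mm2 / area_per_image


def molecules_per_mm2(molecules, area_mm2):
    """Implied surface density of captured molecules."""
    return molecules / area_mm2


print(round(images_needed(CELL_FOOTPRINT_MM2), 1))
print(round(molecules_per_mm2(MRNA_PER_CELL, CELL_FOOTPRINT_MM2)))
```

Proportional scaling gives between eight and nine images per cycle, in line with the caption's figure of about eight and far below the 2,750 needed for a full channel.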
References

Wang, Z. et al. RNA-Seq: a revolutionary tool for
transcriptomics. Nature Reviews Genetics 10, 57
(2009).
Ozsolak, F. & Milos, P. M. RNA sequencing: advances, challenges
and opportunities. Nature Reviews Genetics 12, 87
(2011).
Martin, J. A. & Wang, Z. Next-generation transcriptome
assembly. Nature Reviews Genetics 12, 671 (2011).
Kapranov, P. et al. New class of gene-termini-associated
human RNAs suggests a novel RNA copying
mechanism. Nature 466, 642 (2010).
Application to the dosage compensation model
Background

Ohno's hypothesis


X-linked genes are expressed at twice the
level of autosomal genes per active allele to
balance the gene dose between the X
chromosome and autosomes.
Microarray data (X:AA ~ 1)
Abstract from Xiong et al.

Mammalian cells from both sexes typically contain one active X
chromosome but two sets of autosomes. It has previously been
hypothesized that X-linked genes are expressed at twice the level of
autosomal genes per active allele to balance the gene dose between
the X chromosome and autosomes (termed 'Ohno's hypothesis'). This
hypothesis was supported by the observation that microarray-based
gene expression levels were indistinguishable between one X
chromosome and two autosomes (the X to two autosomes ratio
(X:AA) ~1). Here we show that RNA sequencing (RNA-Seq) is more
sensitive than microarray and that RNA-Seq data reveal an X:AA ratio
of ~0.5 in human and mouse. In Caenorhabditis elegans
hermaphrodites, the X:AA ratio reduces progressively from ~1 in
larvae to ~0.5 in adults. Proteomic data are consistent with the RNA-Seq results and further suggest the lack of X upregulation at the
protein level. Together, our findings reject Ohno's hypothesis,
necessitating a major revision of the current model of dosage
compensation in the evolution of sex chromosomes.
Expression level definition

Taking mouse as an example, we mapped all 25-mer
RNA-Seq reads to the genome sequence. Only those
reads uniquely mapped to exons were considered as
valid hits for a given gene. The expression level of a
gene is defined by the number of valid hits to the gene
divided by the effective length of the gene, which is the
total number of 25-mers in the DNA sequences of the
exons of the gene that have no other matches
anywhere in the genome. For comparisons between
tissues or developmental stages, expression levels
were normalized by dividing by the total number of valid
hits in the sample.
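The definition above can be written out as a short sketch. It is an illustration under simplifying assumptions, not the authors' pipeline: the function names are ours, only one strand of the genome is scanned, and a toy k of 4 stands in for the 25-mers used in the study.

```python
from collections import Counter


def kmers(sequence, k=25):
    """Yield every k-mer of `sequence`, left to right."""
    for start in range(len(sequence) - k + 1):
        yield sequence[start:start + k]


def effective_length(exon_sequences, genome_kmer_counts, k=25):
    """Count exonic k-mers that occur exactly once in the genome."""
    return sum(
        1
        for exon in exon_sequences
        for kmer in kmers(exon, k)
        if genome_kmer_counts[kmer] == 1
    )


def expression_level(valid_hits, eff_length, total_valid_hits):
    """Valid hits per unique k-mer, normalized by the sample's total valid hits."""
    return valid_hits / eff_length / total_valid_hits


# Toy example with k = 4 so the sequences stay readable.
genome = "AAAACCCCAAAA"
counts = Counter(kmers(genome, k=4))
print(effective_length(["AAAAC"], counts, k=4))  # 1: AAAA occurs twice, AAAC once
```

Dividing by the effective length rather than the full exon length keeps genes with many repeated 25-mers from appearing artificially under-expressed.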
Comparison of gene expressions measured
by microarray and RNA-Seq
Human liver is considered unless otherwise noted. (a) Estimation
variation measured by the fold difference of microarray intensities
of two same-target probesets or of RNA-Seq signals from two
halves of the same gene. (b) Identical to a, except that mouse
liver is considered here. (c) Comparison of the internal
consistency of RNA-Seq data and microarray data. The
expression differences from one-half of the nucleotides (RNA-Seq) or a probeset (microarray) are shown for 1,000 randomly
picked gene pairs each with twofold ± 0.01-fold expression
difference from the other half of nucleotides (RNA-Seq) or from
the other probeset (microarray). The central bold line shows the
median, the box encompasses 50% of data points and the error
bars include 90% of data points. (d) Pearson's correlation (r) of
microarray and RNA-Seq expression signals (gray) and of RNA-Seq signals from two independent experiments (black). A certain
fraction of genes (x axis) with the highest expression according
to one of the RNA-Seq datasets are examined. Error bars show
95% confidence intervals estimated by bootstrapping. (e)
Microarray consistently underestimates expression differences
between genes. The microarray expression differences of 1,000
randomly picked gene pairs each with x-fold (x = 2 ± 0.01, 4 ±
0.02, 8 ± 0.04, 16 ± 0.08, 32 ± 0.16, and 64 ± 0.32) RNA-Seq
expression difference are shown. The central bold line shows the
median, the box encompasses 50% of data points and the error
bars include 90% of data points. (f) Relative liver expressions of
55 mouse genes, measured by RNA-Seq, microarray and qRT-PCR.
Comparisons of RNA-Seq gene expression
levels between the X chromosome and
autosomes in 12 human tissues and 3 mouse
tissues
(a) The median expression levels
of X-linked genes (closed
diamonds) and autosomal genes
(open circles) are compared.
Median expressions of autosomal
genes were normalized to 1.
Error bars show 95% bootstrap
confidence intervals. Sex
information is listed in the
parentheses after the tissue
names (M, male; F, female; NA,
unknown). (b) X:AA ratios of
median expressions from the
human liver when X is compared
to individual autosomes. Error
bars show 95% bootstrap
confidence intervals.
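The quantities plotted in this figure, a ratio of medians with a bootstrap confidence interval, can be sketched as follows. The function names, the percentile method, and the resampling of X-linked and autosomal genes independently are assumptions made for illustration; the caption does not specify the exact procedure.

```python
import random
from statistics import median


def xaa_ratio(x_expr, autosomal_expr):
    """Median X-linked expression divided by median autosomal expression."""
    return median(x_expr) / median(autosomal_expr)


def bootstrap_ci(x_expr, autosomal_expr, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the X:AA ratio.

    Assumption: genes are resampled with replacement and treated as
    independent observations.
    """
    rng = random.Random(seed)
    ratios = []
    for _ in range(n_boot):
        x_sample = rng.choices(x_expr, k=len(x_expr))
        a_sample = rng.choices(autosomal_expr, k=len(autosomal_expr))
        a_median = median(a_sample)
        if a_median > 0:  # skip resamples with an undefined ratio
            ratios.append(median(x_sample) / a_median)
    ratios.sort()
    lower = ratios[int(alpha / 2 * len(ratios))]
    upper = ratios[min(len(ratios) - 1, int((1 - alpha / 2) * len(ratios)))]
    return lower, upper


# Hypothetical expression values in which X is expressed at half the autosomal level.
autosomal = [float(i) for i in range(1, 101)]
x_linked = [0.5 * value for value in autosomal]
ratio = xaa_ratio(x_linked, autosomal)
lower, upper = bootstrap_ci(x_linked, autosomal)
```

Normalizing the autosomal median to 1, as in panel a, makes the X-linked median directly readable as the X:AA ratio.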
Test upregulation in Ohno's hypothesis

Upregulation in Ohno's hypothesis
In Ohno's hypothesis, upregulation is needed for those X-linked
genes that had existed in the genome before the emergence of
the X chromosome; X-linked genes that originated de novo on X
presumably do not require upregulation.
Comparison of RNA-Seq gene expression
levels of the X chromosome and autosomes
in C. elegans
Caveats in this RNA-Seq analysis




The Illumina sequencing used here may be biased
toward certain sequences or nucleotides.
Reverse transcription during cDNA library preparation
is likely to be less efficient for longer transcripts.
GC content may affect RNA-Seq results.
A recent study using time-course microarray data
excluded lowly expressed genes, which is
inappropriate for measuring the absolute value of the X:AA
ratio.
Main idea

Here we contend that the low estimate of the X:AA
ratio by Xiong et al. stems from the disproportionate
contribution of transcriptionally inactive genes, which
are not relevant for the evaluation of dosage
compensation mechanisms, to the X chromosome
average. We show that when only active genes are
considered, the RNA-seq data give X:AA ratios closer
to 1, and the observed minor deviation of the X:AA
ratio from 1 is within the range expected when taking
into account chromosome-to-chromosome variability.
Key notes




RPKM: the number of associated reads per kilobase
of exonic sequence per million total reads
sequenced.
We assert that the effect of a mechanism that
regulates transcriptional dosage compensation
pertains only to the expression magnitude of
transcriptionally active genes.
The fraction of undetected (RPKM = 0) genes is
substantially higher on the X chromosome than on
autosomes, accounting for as much as 40% of all the
X-linked genes.
Threshold used in the analysis: RPKM ≥ 1 with at least 3
associated reads
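The RPKM definition and the activity threshold listed above translate directly into code. This is a minimal sketch with hypothetical function names; the default cut-offs are the ones quoted on this slide.

```python
def rpkm(read_count, exon_length_bp, total_reads):
    """Reads per kilobase of exonic sequence per million total reads."""
    return read_count / (exon_length_bp / 1_000) / (total_reads / 1_000_000)


def is_active(read_count, exon_length_bp, total_reads, min_rpkm=1.0, min_reads=3):
    """Apply the slide's threshold: RPKM >= 1 with at least 3 reads."""
    return (
        read_count >= min_reads
        and rpkm(read_count, exon_length_bp, total_reads) >= min_rpkm
    )


def undetected_fraction(read_counts):
    """Fraction of genes with no associated reads (RPKM = 0)."""
    return sum(1 for count in read_counts if count == 0) / len(read_counts)


print(rpkm(500, 2_000, 10_000_000))     # 25.0
print(is_active(2, 100, 1_000_000))     # False: RPKM is 20 but only 2 reads
```

The read-count condition matters for short genes, where one or two reads are enough to push the RPKM above 1.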
Fraction of transcriptionally inactive genes
on autosomes and X chromosome
The ratio of the median transcription
magnitudes of X-linked and autosomal genes
The X:AA ratio estimates are shown based on the set of
genes with minimal transcription (RPKM ≥ 1 and at least 3
associated reads). Black error bars show the 95%
confidence interval (CI) based on bootstrap estimates
incorrectly assuming independence of expression levels
for neighboring genes (plotted here for reference; not used
to make inferences). Red bars show the range around 1
into which the X:AA ratio is expected to fall (95% CI) in the
presence of twofold upregulation of the X chromosome,
taking into account interchromosomal variation (sampling
of contiguous blocks of X-chromosome size from the
autosomal portion of the genome). The observed X:AA
values (black dots) in all tissues fall within this range,
indicating that the observed transcriptional magnitude of
X-linked genes is compatible with the presence of twofold
upregulation. The blue bars show the range around 0.5
into which the X:AA ratio is expected to fall in the absence
of X-chromosome upregulation (50% of the autosomal
expression level). The X:AA estimates for the first five
samples fall outside of this range, indicating that the X-linked expression magnitude is significantly higher than
that expected in the absence of dosage compensation.
The X:AA values for other samples are within both the red
and blue ranges, indicating that the two hypotheses (X:AA
= 1 and X:AA = 0.5) cannot be clearly distinguished based
on these individual data sets.
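The red and blue ranges in this figure come from sampling contiguous autosomal blocks. The sketch below illustrates the idea under stated assumptions: blocks are defined by a fixed number of neighbouring genes rather than by genomic size, all autosomes are treated as one ordered list, and the names `expected_ratio_range` and `compatible` are hypothetical.

```python
import random
from statistics import median


def expected_ratio_range(autosomal_expr, block_size, scale=1.0,
                         n_samples=1000, alpha=0.05, seed=0):
    """Range of block-to-autosome median ratios for contiguous gene blocks.

    autosomal_expr : expression values of autosomal genes in genomic order
    block_size     : genes per sampled block (stand-in for X-chromosome size)
    scale          : 1.0 models twofold upregulation (expected X:AA = 1),
                     0.5 models no upregulation (expected X:AA = 0.5)
    """
    rng = random.Random(seed)
    reference = median(autosomal_expr)
    ratios = []
    for _ in range(n_samples):
        start = rng.randrange(len(autosomal_expr) - block_size + 1)
        block = autosomal_expr[start:start + block_size]
        ratios.append(scale * median(block) / reference)
    ratios.sort()
    lower = ratios[int(alpha / 2 * n_samples)]
    upper = ratios[min(n_samples - 1, int((1 - alpha / 2) * n_samples))]
    return lower, upper


def compatible(observed_ratio, ratio_range):
    """True if the observed X:AA ratio lies inside the expected range."""
    lower, upper = ratio_range
    return lower <= observed_ratio <= upper
```

Sampling contiguous blocks, rather than individual genes, preserves the correlation between neighbouring genes that the simple bootstrap ignores, which is why these ranges are wider than the black error bars.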
The chr. 10:A and chr. 11:A ratios illustrating
chromosome-to-chromosome variability
Mouse RNA-seq data show a lack of
dosage compensation
Dependence of the X:AA estimates on the
RPKM threshold
Dependence of the X:AA estimates on the RPKM
threshold. The tissue-averaged X:AA estimates are
shown (black) as a function of the minimal RPKM
threshold, from 0 (all genes, including those with
undetected expression) to RPKM ≥2. The error
bars correspond to the s.e.m. between different
tissues. The largest change in the ratio is observed
after exclusion of genes with undetected
expression (RPKM >0). As the RPKM thresholds
increase, the X:AA ratio largely stabilizes above
RPKM = 1. The application of an RPKM threshold
increases the median expression level and can
artificially shift the X:AA ratio closer to 1. The
shaded gray region shows the 95% confidence
envelope for the hypothetical X chromosome that is
expressed at 50% of the autosomal level (see
Supplementary Methods). For non-zero RPKM
thresholds, the observed X:AA ratios lie outside of
this 95% confidence interval, showing that the
X:AA ratios are higher than expected from
setting an RPKM threshold alone.
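The threshold dependence described in this caption can be reproduced on toy data. The sketch below uses hypothetical RPKM values and a hypothetical function name; it only shows how excluding undetected genes changes a ratio of medians and does not model the confidence envelope.

```python
from statistics import median


def xaa_at_threshold(x_rpkm, autosomal_rpkm, min_rpkm=0.0, include_undetected=True):
    """X:AA ratio of medians over genes passing the RPKM filter.

    include_undetected=False drops genes with RPKM = 0 (the 'RPKM > 0' step).
    """
    def keep(values):
        return [
            value for value in values
            if value >= min_rpkm and (value > 0 or include_undetected)
        ]

    return median(keep(x_rpkm)) / median(keep(autosomal_rpkm))


# Hypothetical values: half of the X-linked genes are undetected.
x_values = [0.0, 0.0, 2.0, 4.0]
autosomal_values = [0.0, 2.0, 4.0, 8.0]

all_genes = xaa_at_threshold(x_values, autosomal_values)
detected_only = xaa_at_threshold(x_values, autosomal_values, include_undetected=False)
print(round(all_genes, 2), detected_only)  # 0.33 0.75
```

Because the X chromosome carries a larger share of undetected genes, removing them raises the X-linked median more than the autosomal one, which is the jump seen at the first non-zero threshold.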
Discussion