Flux capacitor and simulator

Flux Capacitor Manual
Table of Contents
Version
To Do
Usage
The Segment Graph
Read mapping
Flow Network Decomposition
Abundance Measures
Measurements
(Absolute) Frequency
Coverage
Experiment Size Normalization
Scope of measurement
To Do
The list is not sorted by priority.
• output reads observed in Introns
• command-line flag to deactivate prediction / deconvolution
• enable gene_id for sub-locus information
Usage
usage: flux

Each parameter is listed with its argument type and default value (where given), followed by its description.

-cb,-costBounds   Float,Float
    Cost boundaries in the form factor_min,factor_max, where factor_min is the
    factor that determines the minimum (and factor_max, correspondingly, the
    factor for obtaining the maximum) number of reads that can be predicted with
    respect to the originally observed reads. See the page about the objective
    function for details about the cost boundaries.

-cm,-costModel   lin|log[Float]   (default: log)
    Cost model, either linear or logarithmical. In the case of linear costs, a
    slope different from 1 can be controlled by the optional Float argument.

-cs,-costSplit   Integer   (default: 5)
    The number of linear segments that approximate the function underlying the
    cost model.

-d,-decompose
    Perform the flow network decomposition. Important: in Capacitor 20090718
    there is no output produced if this option is not applied.

-f,--force
    Suppresses communication on stderr.

-g,--graph
    Outputs graph information, i.e., GTF features fragment and junction.

-i,--lib   String
    Path to the lpsolve native libraries.

-l,--locus
    Additionally output elements of the locus, i.e., GTF features locus and
    transcript.

-m,--map
    Perform the mapping (i.e., "profiling") step. Important: in Capacitor
    20090718 this step alone does not produce any output. If applied, it
    collects information about the read distribution; otherwise reads are
    assumed to be uniformly distributed.

-o,--out   String   (default: stdout)
    Write output to a file with the given path.

-p,--pair   Integer,Integer
    Activate paired-end, specify insert size range by size_min,size_max.

-r,--ref   String
    Mandatory: file with the reference annotation (GTF format).

-s,--sra   String
    Mandatory: file with the short read alignment (BED format).

-t,--strand
    Activate strand information.

The Segment Graph [1]
Read mapping
Single read mapping
The assignment of reads, after having mapped them to genomic locations, is
not straightforward. The Flux Capacitor follows a conservative annotation
assignment, i.e., reads are assigned uniquely to genomic regions ("segments"
or "junctions"). These regions are defined by the exon-intron structure of each
locus; an example is shown in Fig.1.
Fig.1: An example locus with two transcripts I and II (names to the left) that
overlap in segments of their exons (green boxes denoted by letters A through E,
indices indicate segments of overlapping exons). The Flux Capacitor further
distinguishes 5 non-exonic areas. 19 sequencing reads (arrows with heart labels)
have been mapped in the area of the locus as shown.
The locus sketched in Fig.1 consists of 8 exons that cluster in 8 segments (A1, A2,
…, E) separated by 5 non-exonic regions, i.e., the 5' proximal area (F), 3 introns
(G, H, J), and the 3' proximal area (K). Additionally, there exist junctions between
all adjacent segments (e.g., FA1, A1A2, etc.), and between non-adjacent segments
that are spliced together (so-called splice junctions, for instance A2B1). Reads are
assigned to the region they completely fall into.
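The assignment rule can be sketched in a few lines of Python. This is an illustration only, not the Flux Capacitor's implementation: the function name `assign_read`, the region coordinates and the decision to leave reads spanning more than two regions unassigned are assumptions made for the example. Spliced reads (splice junctions such as A2B1) would additionally need the alignment blocks of the read and are not handled here.

```python
from bisect import bisect_right

def assign_read(regions, start, end):
    """Assign a read [start, end] (1-based, inclusive) to a region or junction.

    regions: sorted, contiguous, non-overlapping (name, first, last) tuples.
    Returns the region name if the read lies completely inside one region,
    the concatenated names if it straddles two adjacent regions, else None.
    """
    firsts = [first for _, first, _ in regions]
    i = bisect_right(firsts, start) - 1   # index of region containing start
    j = bisect_right(firsts, end) - 1     # index of region containing end
    if i < 0 or end > regions[-1][2]:
        return None                       # read leaves the described area
    if i == j:
        return regions[i][0]              # completely inside one region
    if j == i + 1:
        return regions[i][0] + regions[j][0]  # junction of adjacent regions
    return None                           # assumption: spans >2 regions

# Hypothetical coordinates for the first regions of a locus like Fig.1.
regions = [("F", 1, 10), ("A1", 11, 20), ("A2", 21, 30), ("G", 31, 40)]
assigned = [assign_read(regions, s, e) for s, e in [(12, 18), (8, 13), (18, 23)]]
```

With these coordinates the three example reads fall into A1, the junction FA1 and the junction A1A2, respectively.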
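Table: assignment of the 19 reads of Fig.1 to categories (the alignment of read IDs to category columns is not preserved).
category: FA1, A1A2, A2, G, GB1, B1B2, B1C1, B2H, C1, C1C2, C2, C3, C3J, J, E, EK, none
assigned read ID: 3, 7, 1, 2, 18, 4, 5, 17, 6, 15, 8, 14, 9, 10, 11, 12, 13, 19, 16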
Note: Under the rules of this mapping, read number 13 is not compatible with the
annotation and remains unassigned.
Read pair mapping
A read pair is mapped validly iff both mate reads map to a segment or junction
and their mapping distance on at least one of the transcripts that support both
mapping locations falls within the boundaries of expected insert sizes. Fig.2
summarizes how paired reads are counted and how coverage by read pairs is determined.
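The validity condition for a read pair can be written down as a small predicate. The sketch below is a hypothetical illustration: the function name, the argument layout and the example distances are assumptions, and the per-transcript mapping distances are taken as given rather than computed from an annotation.

```python
def is_valid_pair(mate1_region, mate2_region, distances, min_insert, max_insert):
    """Check the pair validity rule stated above.

    mate1_region, mate2_region: segment/junction of each mate, or None if
        the mate could not be assigned.
    distances: dict transcript_id -> mapping distance of the pair on that
        transcript, restricted to transcripts supporting both locations.
    """
    if mate1_region is None or mate2_region is None:
        return False
    # Valid if at least one supporting transcript yields an accepted distance.
    return any(min_insert <= d <= max_insert for d in distances.values())

# Hypothetical pair: accepted because the distance on transcript I fits.
ok = is_valid_pair("A1", "B1", {"I": 250, "II": 900}, 200, 300)
```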
Fig.2: Examples of exonic structures (green boxes are exons, introns are not
drawn to scale) and the distinct possible read mappings, for single reads (above
the structure) and paired-end reads (below). The read length is 3 and, for
paired ends, the insert size is 4 (no variation). For simplification, junctions are
not shown. (A) There are 10 possible mapping locations ("slots") in a mono-exonic
transcript of 12 nt. Reads starting at positions 11 or 12 fall partially outside of
the annotation, as do reads that start before position 1, and such reads are not
considered to belong to the exon as annotated. Correspondingly, 4 slots for
paired-end reads can be observed. (B) Example of a transcript with 2 exons.
Disregarding the splice junction, which is assigned the read mappings starting at
position 6 or 7, we observe 8 slots for single reads and 3 slots for paired-end reads.
(C) Example of a transcript with 3 exons (splice junctions disregarded). There are
7 slots for single reads and 2 for paired-end reads.
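The single-read slot counts of panels (A) and (B) can be reproduced with a short calculation. This is a minimal sketch under assumptions: the function name is hypothetical, and the exon lengths 7 and 5 for panel (B) are inferred from the caption (junction reads start at position 6 or 7), not read off the figure. Paired-end slots are left out because the caption does not state how the insert size is measured.

```python
def single_read_slots(exon_lengths, read_length):
    """Number of start positions at which a read fits completely into an exon.

    Junction-spanning reads are not counted, matching the caption, which
    disregards splice junctions.
    """
    return sum(max(0, length - read_length + 1) for length in exon_lengths)

slots_a = single_read_slots([12], 3)     # panel (A): one exon of 12 nt
slots_b = single_read_slots([7, 5], 3)   # panel (B): assumed exon lengths
```

Under these assumptions `slots_a` is 10 and `slots_b` is 8, in agreement with the caption.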
Flow Network Decomposition
Abundance Measures
Measurements
Different measures are conceivable for the abundance of RNA features, in general
loci, genes, transcripts, exons, or alternative splicing events. In the following,
the terms read and mapping are used as synonyms. However, we have to bear in mind
that all observed read alignments are mappings, and loosely speaking of reads
instead only holds for datasets with exactly one alignment per read.
(Absolute) Frequency
We use the terms frequency and absolute frequency equivalently to describe the
number of observations o for a certain feature, i.e., an exon, transcript or gene.
This number is directly derived from the number of reads that map to the
corresponding feature. Since Next Generation Sequencing technologies include an
intermediary step of library construction, frequency measures are biased by the
size of the feature as well as by the size of the experiment.
Coverage
We adopt the term coverage to describe the occupancy of features (i.e., exons,
transcripts, genes, events, etc.) by sequencing reads. Straightforwardly, this
can be measured in two ways: as nucleotide coverage or as read coverage.
As different sequencing experiments produce reads of different lengths, and
datasets with mixed read lengths are also to be considered, the read coverage is a
universal way to measure abundances of RNA molecules.
Definition: Read coverage c is the number of observed reads o aligning to a
certain feature divided by the number of hypothetically possible different reads s
in the feature: $c = o/s$
By the definition stated above, the Flux Capacitor determines c from the reads that
map to a certain feature (o) out of the number of different mapping possibilities (s).
In this way, coverage can be calculated for (linear) sequences (e.g., exons and
transcripts), and for constructs of partially overlapping or disconnected
sequences (as for instance in alternatively spliced genes). The basic idea follows
along the lines of the coverage measure proposed in there. It generalizes the
measurements focused on linear stretches of sequences as described there.
Coverage measures naturally normalize for the extent of a certain feature, and
consequently one can compare coverages of features with different sizes.
However, a linear correlation between observations and size is assumed
intrinsically in the fraction o/s.
Variable o is denoted by freq in the Flux Capacitor's output. For a complete list of
all possible abundance tags, see the GTF format description.
Experiment Size Normalization
In order to compare experiments of different sizes, both measurements can be
expressed relative to the number of reads from which they have been derived. It
remains important to consider which number of reads is the basis for the
comparison. Possibilities are the total number of reads in the experiment n_exp
(before or after a potential quality filtering step), the number of reads n_dna that
can be mapped to a reference genome, or the number of reads n_rna that can be
mapped to a reference transcriptome.
Clearly, n_exp does not contain information about the quality of the reads in terms
of mismatches when aligned to a reference. Therefore, between technical
replicates the number may be biased in favor of a technically better run in
comparison to one where fewer reads can be mapped due to high error rates.
Furthermore, sample contamination is not accounted for by a normalization over
n_exp, whereas it is accounted for when considering the number of sequences that
can be mapped to a reference genome, n_dna. But n_dna still does not comprise
information about the fraction of "undesired" reads, such as the ones derived from
ribosomal RNA or unspliced transcripts, which can vary substantially between
experiments depending on the applied protocols. In fact, when comparing mRNA
frequencies it seems most sensible to take n_rna of the considered transcriptome
into account. This naturally also balances biases from differing degrees of
incompleteness when comparing against different annotations.
We therefore define relative frequency rfreq and relative coverage rcov as follows:
$rfreq = o / n_{rna}$
and
$rcov = c / n_{rna}$
For single reads, our rcov measurement is close to the RPKM measure (reads per
kilobase per million mapped reads). The differences are that (i) not the
transcribed length, but merely the number of different alignment positions
("slots") is taken into account (slots = length - readlength), and (ii) the RPKM
measure scales the numeric space of the obtained values up by 10^9. In order to
produce measurements comparable to the currently popular RPKM values, the Flux
Capacitor produces rcov measures scaled by the factor explained in (ii).
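A small numeric example shows how close the scaled rcov is to RPKM. The sketch below is an illustration under assumptions, not the Flux Capacitor's code: the function names and the example numbers are hypothetical, rcov is read as coverage divided by the number of reads mapped to the transcriptome, and the scaling is taken to be a multiplication by 10^9.

```python
import math

def coverage(o, s):
    """Read coverage: observed reads o over the number of slots s."""
    return o / s

def rfreq(o, n_rna):
    """Relative frequency: observations per read mapped to the transcriptome."""
    return o / n_rna

def rcov(o, s, n_rna, scale=1e9):
    """Relative coverage, scaled by 10^9 (assumed) to be comparable to RPKM."""
    return scale * coverage(o, s) / n_rna

def rpkm(o, length, n_mapped):
    """Reads per kilobase of transcript per million mapped reads."""
    return 1e9 * o / (length * n_mapped)

# Hypothetical transcript: 1000 nt, 36 nt reads, 50 reads observed,
# 10 million reads mapped to the transcriptome.
o, length, read_length, n_rna = 50, 1000, 36, 10_000_000
s = length - read_length          # slots as given in (i)
values = (rcov(o, s, n_rna), rpkm(o, length, n_rna))
```

For this example the scaled rcov is about 5.19 and the RPKM is 5.0; the gap comes solely from dividing by the slots instead of the full length.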
Scope of measurement
The previous section introduced the measures frequency and coverage, and their
counterparts relative frequency and relative coverage after normalization by the
size of the respective experiment. All these measures have one component in
common: the number of read mappings counted as frequency. Read mappings are
counted on a certain base (see below) and in different scopes with respect to a
certain feature, i.e., an exon, transcript, etc.
The Flux Capacitor considers two bases, observation (obs) and prediction (pred).
Base obs is the observation after mapping to the reference annotation, and
distributes reads equally amongst overlapping features (a trivial deconvolution
algorithm, so to say). Base pred considers the values after flow network
deconvolution of the reads. Both bases are considered in 3 scopes:
• all: the number of all mapped reads that fall into the feature
• split: the number of read mappings from all that are assigned to the
transcript(s) listed in the transcript_id field of the feature
• unique: the subset of mappings in split that are in regions where exactly and
exclusively the transcript(s) of the transcript_id field are annotated
Here is a toy example for the different measurements. The figure sketches a locus
L with two transcripts T1 and T2 and 3 exons E1, E2 and E3, of which E2 is shared
by both transcripts. In total, 6 reads align within the 3 exons (splice junction
mappings are not shown for simplification). We count the following frequencies:
feature   transcript ID   measurement       value
E1        T1              obs_freq_all      2
E1        T1              obs_freq_split    2
E1        T1              obs_freq_unique   2
E2        T1,T2           obs_freq_all      3
E2        T1,T2           obs_freq_split    1.5
E2        T1,T2           obs_freq_unique   0
E3        T2              obs_freq_all      1
E3        T2              obs_freq_split    1
E3        T2              obs_freq_unique   1
T1        T1              obs_freq_all      5
T1        T1              obs_freq_split    3.5
T1        T1              obs_freq_unique   2
T2        T2              obs_freq_all      4
T2        T2              obs_freq_split    2.5
T2        T2              obs_freq_unique   1
L         T1,T2           obs_freq_all      6
L         T1,T2           obs_freq_split    6
L         T1,T2           obs_freq_unique   3
By definition of the measurement, all equals split for the locus. Moreover, for
the locus the unique measure counts the sum of read mappings in regions where all
of the transcripts in the locus are present. Consequently, it excludes reads from
regions that are unique to a single transcript or to a subset of the transcripts
in the locus. In this respect, the tag unique may be misleading for a locus.
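The transcript and locus rows of the toy example can be recomputed from the exon read counts. The following sketch is an illustration under assumptions: the data layout and function names are hypothetical, and split is modeled as the equal distribution of reads amongst overlapping transcripts described for base obs.

```python
# Toy locus from the example: exon E2 is shared by both transcripts.
transcripts = {"T1": ["E1", "E2"], "T2": ["E2", "E3"]}
reads = {"E1": 2, "E2": 3, "E3": 1}

def exon_owners(transcripts):
    """Map each exon to the set of transcripts that contain it."""
    owners = {}
    for tid, exons in transcripts.items():
        for exon in exons:
            owners.setdefault(exon, set()).add(tid)
    return owners

def transcript_scopes(tid, transcripts, reads):
    """Return (all, split, unique) read counts for one transcript."""
    owners = exon_owners(transcripts)
    exons = transcripts[tid]
    count_all = sum(reads[e] for e in exons)
    count_split = sum(reads[e] / len(owners[e]) for e in exons)
    count_unique = sum(reads[e] for e in exons if owners[e] == {tid})
    return count_all, count_split, count_unique

def locus_scopes(transcripts, reads):
    """Return (all, split, unique) for the locus as a whole."""
    owners = exon_owners(transcripts)
    everyone = set(transcripts)
    total = sum(reads.values())
    shared = sum(n for e, n in reads.items() if owners[e] == everyone)
    return total, total, shared

t1 = transcript_scopes("T1", transcripts, reads)
t2 = transcript_scopes("T2", transcripts, reads)
locus = locus_scopes(transcripts, reads)
```

The results match the table: (5, 3.5, 2) for T1, (4, 2.5, 1) for T2 and (6, 6, 3) for the locus L.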
Footnotes
1. It was on June 10th, 2009, around aperitif time, that we (Hagen, Thomas
and I) decided to prefer the term "segment", which subdivides bigger units,
to the term "fragment", which denotes physically separated parts of a whole.