Flux Capacitor FLUX CAPACITOR and FLUX SIMULATOR Flux Capacitor Manual Fold Table of Contents Version To Do Usage The Segment Graph Read mapping Flow Network Decomposition Abundance Measures Measurements (Absolute) Frequency Coverage Experiment Size Normalization Scope of measurement The list is not sorted by priority • output reads observed in Introns • command-line flag to deactivate prediction / deconvolution • enable gene_id for sub-locus information To comment on the To Do list, go to the details page. Usage usage: flux Parameter -cb,-costBounds Parameter Default Float,Float Description cost boundaries in the form of factorminfactorm , ax where factormin is the factor to determine the minimum (and factormax correspondingly the factor for obtaining the maximum) number of reads that can be predicted with respect -cm,-costModel lin| log log[Float] -cs,-costSplit Integer 5 -d,-decompose -f,--force String -o,--out -p,--pair -r,--ref -s,--sra -t,--strand perform the mapping (i.e., "profiling") step. Important: in Capacitor 20090718 this step alone does not produce any output. If applied, it collects information about the read distribution, otherwise the assumption is uniformally distributed reads. String stdout write output to a file with the given path Integer,In activate paired-end, specify insert size teger} range by sizeminsize , max Mandatory: file with the reference String annotation (GTF format) Mandatory: file with the short read String alignment (BED format) activate strand information The Segment Graph 1 path to the lpsolve native libraries additionally output elements of the locus, i.e., GTF features locus and transcript -l,--locus -m,--map the number of linear segments that approximate the function underlying the cost model perform the flow network decomposition. Important: in Capacitor 20090718 there is no output produced if this option is not applied. suppresses communication on stderr outputs graph information, i.e., GTF features fragment and junction -g,--graph -i,--lib to the originally observed reads. See the page about the objective function for details about the cost boundaries. cost model, either linear or logarithmical. In the case of linear costs, the a slope different than 1 can be controlled by the optional Float argument. Read mapping Single read mapping The assignment of reads — after having mapped them to genomic locations — is not straightforward. The Flux Capacitor follows a conservative annotation assignment,i.e., reads are assigned uniquely to genomic regions („segments” or ,,junctions). These regions are defined given the exon-intron structure of each locus, an example is shown in Fig.1. Fig.1: An example locus with two transcripts I and II (names to the left) that overlap in segments of their exons (green boxes denoted by letters A through E, indices indicate segments of overlapping exons). The Flux Capacitor distinguishes further 5 non-exonic areas. 19 sequencing reads (arrows with heart labels) have been mapped in the arrea of the locus as shown. The locus sketched in Fig.1 consists of 8 exons that cluster in 8 segments (A1, A2, …,E) separated by 5 non-exonic regions, i.e., the 5'proximal area (F), 3 introns (G,H,J), and 3'proximal (K). Additionally, there exist junctions between all adjacent segments (e.g., FA1, A1A2, etc. …), or between non-adjacent segments that are spliced together (so-called splice-junctions, for instance A2B1). Reads are assigned to the region they completely fall into. category FA1 A1A2 A2 G GB1 B1B2 B1C1 B2H C1 C1C2 C2 C3 C3J J E EK none assigned 3, 7, 1 2 18 4 5 17 6 15 8 14 9 10 11 12 13 read ID 19 16 Note: By meanings of the mapping, read number 13 is not compatible with the annotation and remains unassigned. Read pair mapping A read pair is mapped validly iff both mate reads map to a segment or junction and their mapping distance on at least one of the transcripts that support both mapping locations falls within the boundaries of expected insert sizes. How paired reads are counted and coverage by read pairs is determined summarizes Fig.2. Fig.2: Examples of exonic structures (green boxes are exons, introns are not drawn to scale) and distinct possible read mappings, for single (above the structure) and paired-end reads (below). The read length is 3 and, for pairedends, the insert size is 4 (no variation). For simplification, junctions are not shown. (A) There are 10 possible mapping locations („slots”)) in a mono-exonic transcript with 12nt. Reads starting at positions 11 or 12 fall partially outside of the annotation, as reads that start before position 1, and such reads are not considered to belong to the exon as annotated. Correspondingly, 4 slots with paired end reads can be observed. (B) Example of a transcript with 2 exons. Disconsidering the splice-junction, which is assigned read mappings starting in position 6 or 7, we observe 8 slots for single reads and 3 paired-end read slots. (C) Example of a transcript with 3 exons (splice-junctions disregarded). There are 7 slots for single reads, and 2 for paired-end reads. Flow Network Decomposition Abundance Measures Measurements Different measures can be thought of for measuring abundances of RNA features, in general loci, genes, transcripts, exons, or alternative splicing events. In the following the terms read and mapping are used as synonyms, however, we have to bear in mind that all observed read alignments are mappings, and slackly using reads instead does only hold for datasets with exclusively one alignment per read. (Absolute) Frequency We use the terms frequency and absolute frequency equivalently to describe the amount of observations o for a certain feature, i.e., an exon, transcript or gene. This number is directly derived from the number of reads that maps to the corresponding feature. Certainly, as Next Generation Sequencing technologies adopt an intermediary step of library construction, frequency measures are biased by the size of a certain feature, as well as the experiment size. Coverage We adopt the term of coverage to describe the occupance of features (i.e., exons, transcripts, genes, events, etc. …) by sequencing reads. Straightforwardly, this can be done in two ways, by measuring nucleotide coverage, or read coverage. As different sequencing experiments produce reads of different lengths, and also datasets with mixed read lengths are to be considered, the read coverage is a universal way to measure abundances of RNA molecules. Definition: Read coverage c is the number of observed reads o aligning to a certain feature divided by the number hypothetically possible different reads s in the feature: c=os By above stated definition, the Flux Capacitor determines c by the reads that map to a certain feature o out of the number of different mapping possibilities (i.e., s). By this, coverage can be calculated for (linear) sequences (e.g., exons and transcripts), and for constructs of partially overlapping or disconnected sequences (as for instance in alternatively spliced genes). The basic idea follows along the lines of the coverage measure proposed in there. It generalizes the measurements focused on linear stretches of sequences as described there. Coverage measures naturally normalize for the extend of a certain feature, and consequently one can compare coverages of features with different sizes. However, a llinear correlation between observations and size is assumed intrinsically in the fraction os. Variable o is denoted by freq in the Flux Capacitor's output. For a complete list of all possible abundance tags, see the GTF format description. Experiment Size Normalization In order to compare experiments of different sizes, both measurements can be relativized to the number of reads from which they have been derived. It remains important to consider which number of reads are the basis for the comparisons, possibilities here are the total number of reads in the experiment nexp (before or after a potential quality filtering process), the number of reads ndna that can be mapped to a reference genome, or the number of reads nrna that can be mapped to a reference transcriptome. Clearly nexp does not contain information about the quality of the reads, in terms of mismatches when aligned to a reference. Therefore, in different technical replicates the number may bias for a technical better run in comparison to one where less reads can be mapped due to high error rates. Furthermore, sample contamination is not considered by a normalization over nexp, which in contrast can be obtained when considering the number of sequences that can be mapped to a reference genome ndna. But ndna still does not comprise information about the fraction of "undesired" reads as the ones derived from ribosomal RNA or unspliced transcript - which can substantially vary between experiment depending on the applied protocols. In fact, when comparing mRNA frequencies it seems to be most senseful to take nrna of a considered transcriptome sequence into account. By this naturally also biases from a different degree of incompleteness when comparing against different annotations is balanced. We therefore define relative frequency rfreq and relative coverage rcov as follows rfreq=onrna and rcov=rnrna For single reads, our rcov measurement is close to the RPKM measure (reads per kilobase per million mapped reads). The differences are that (i) not the transcribed length, but merely the number of different alignment positions ("slots") is taken into account (slots= length- readlength), and (ii) the RPKM measure scales the numeric space of the obtained values up by 10 9. In order to produce measurements comparable to the currently popular RPKM values, the Flux Capacitor produces rcov measures shifted by the factor explained in (ii). Scope of measurement The section before introduced the different measures of frequency and coverage, and their counterparts relative frequency and relative coverage after normalization according to the size of the respective experiment. All these measures have one component in common, the number of read mappings counted as frequency. Read mappings are based on a certain base (see below) and can be counted in different scopes considering a certain feature, i.e., an exon, transcript, etc. … The Flux Capacitor considers two bases, i.e., observation obs and prediction pred. Base obs is the observation after mapping to the reference annotation, and distributes reads equally amongst overlapping features (a trivial deconvolution algorithm, so to say). Base pred considers the values after flow network deconvolution of the reads. Both bases are considered in 3 scopes: all - the number of all mapped reads that fall into the feature, split - the number of read mappings from all that are assigned to the transcript(s) listed in the transcript_id field of the feature, and unique - the subset of mappings in split that are in regions where exactly and exclusively the transcript(s) of the transcript_id field are annotated. Here a toy example for the different measurements. The figure sketches a locus L, with two transcripts T1 and T2 and 3 exons E1, E2 and E3 of which E2 is shared by both transcripts. In total, 6 reads align within the 3 exons (splice junction mappings are not shown for simplification). We count the following frequencies: feature transcript ID measurement E1 T1 obs_freq_all obs_freq_split obs_freq_unqiue E2 obs_freq_all T1,T2 obs_freq_split obs_freq_unqiue E3 T2 obs_freq_all obs_freq_split obs_freq_unqiue value 2 2 2 3 1.5 0 1 1 1 T1 T1 T2 T2 5 3.5 2 4 2.5 1 L T1,T2 obs_freq_all obs_freq_split obs_freq_unqiue obs_freq_all obs_freq_split obs_freq_unqiue obs_freq_all 6 obs_freq_split 6 obs_freq_unqiue 3 As by definition of the measurement, all equals split for the locus. Moreover, the unique measure counts the sum of read mappings in regions, where all of the transcripts in the locus are present. Consequently, these exclude reads from regions that are unique to a transcript, or that are unique to a subset of transcripts in the locus. To this end, the tag unique may be misleading for a locus. Footnotes 1. It has been June 10th in 2009 around aperitif time when we — Hagen, Thomas and me — decided to prefer the term „ segment ”, which subdivides bigger units, to the term „ fragment ” which denotes physically separated parts of a whole.