README_for_Supplemental_Datasets_7-11.

These files give the insertion positions, read counts and gene data for the
different datasets used in the paper. All the files are in plain-text tab-separated
format with headers, and can be opened as text or in Excel. The first 8
tab-separated fields in all files are as follows:
1) chromosome – name of the chromosome to which the insertion mapped.
Possible values include all nuclear genome chromosomes and scaffolds,
as well as the chloroplast and mitochondrial genomes and the insertion
cassette, all treated as separate chromosomes.
2) strand – the direction in which the insertion mapped to the given
chromosome: “+” means the cassette orientation matched the (arbitrary)
chromosome “+” direction, “-” means the cassette was in the opposite
orientation, and “both” means that particular insertion includes two
cassettes in two different directions (due to merging two apparent
insertions that mapped to the same position but opposite strand – this has
only been done for some of the files).
3) position_before_insertion – the number of the base immediately before the
insertion position, 1-based. Note that sometimes this base was directly
sequenced from the flanking region, and sometimes it was assumed
based on the other side of the insertion – most of the time we only saw
flanking region reads from one side of the insertion (the 5’ or 3’ side of the
cassette). If the strand is “+”, that means the 5’ flanking region is before
the cassette, and the 3’ flanking region is after the cassette – e.g. for an
insertion between bases 100 and 101, the 5’ flanking region will include
bases 80–100, and the 3’ flanking region will include bases 101–121. The
opposite is the case for “-” strand insertions. Therefore if the strand is “+”
in a 5’ flanking region file, or “-” in a 3’ flanking region file, or “both” in
either type of file, the position_before_insertion value is directly known based
on the flanking region sequence; in the other cases, what is actually
known is the position after the insertion, and the position_before_insertion
value is just set to one base before that (on the assumption that the
insertion was clean), but in reality that position may be different because
of a genomic deletion or other complicating factors.
4) gene – the v5.3 gene ID of the gene that includes the mapped insertion
position, or “intergenic” if there are no annotated genes at that position, “?”
if we did not check for annotated genes (for chloroplast and mitochondrial
genomes only), and “-” for insertions that map to the insertion cassette.
The v5.3 genome contains 192 pairs of overlapping genes – when a
mutant falls in a region where two genes overlap, the gene names are
separated by “&”.
5) orientation_to_gene – “sense” if the cassette is inserted in the same direction
as the gene, “antisense” otherwise (this is independent of the “+”/“-” strand
values – if the cassette is in the same direction as the “-” strand, and the
gene is on the “-” strand, or they’re both “+”, the orientation is “sense”; if
the cassette and gene strands are different, the orientation is “antisense”),
“both” if there are two copies of the cassette in opposite directions (when
strand is “both”), “-” for intergenic and cassette positions (when gene is
“intergenic” or “-”), and “?” when gene is “?”. If the mutant falls in a region
where two genes overlap, the orientations with regard to the two genes
are separated by “&”.
6) gene_feature – which feature of the gene the insertion mapped to: the main
possible values are “intron”, “CDS” (i.e. exon excluding UTRs),
“five_prime_UTR”, “three_prime_UTR”; the value can be two of the above
separated with “/” if the insertion mapped exactly to the edge between two
features, or one of the above with “gene_edge/” and/or “mRNA_edge/”
added, if it mapped to exactly the edge of the gene or mRNA; the value
can also be “-” for intergenic and cassette-mapped insertion positions, and
“?” when gene is “?”. 8.6% of genes in the v5.3 genome have multiple
splice variants; for mutants falling in those genes,
“MULTIPLE_SPLICE_VARIANTS” is given instead of a feature, since the
feature can be different for each variant. If the mutant falls in a region
where two genes overlap, the features of the two genes are separated
by “&”.
7) N_total_reads – how many deep-sequencing reads correspond to this
insertion (this includes perfect reads, reads with 1bp mismatches, and for
“both” strand cases it includes reads from both sides).
8) top_read_sequence – the most abundant unique sequence out of all the
reads corresponding to this insertion.
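The rules described for fields 3, 5 and 6 can be sketched in Python as follows. This is only an illustration under stated assumptions, not the code used for the paper: the function names are hypothetical, the strand values are assumed to be the plain characters "+" and "-", and flank_side (whether a file holds 5’ or 3’ flanking region reads) has to be supplied by the caller because it is not a column in the files.

```python
def position_is_directly_observed(strand, flank_side):
    """Return True if position_before_insertion came from the sequenced flank.

    strand: "+", "-" or "both" (field 2).
    flank_side: "5'" or "3'" (hypothetical argument, supplied by the caller).
    """
    if flank_side not in ("5'", "3'"):
        raise ValueError("flank_side must be \"5'\" or \"3'\"")
    if strand == "both":
        return True
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+', '-' or 'both'")
    # "+" in a 5' file or "-" in a 3' file: the base before the insertion was read.
    return (strand, flank_side) in {("+", "5'"), ("-", "3'")}


def orientation_to_gene(cassette_strand, gene_strand):
    """Field 5 for one annotated gene; gene_strand is '+' or '-'."""
    if cassette_strand == "both":
        return "both"
    return "sense" if cassette_strand == gene_strand else "antisense"


def split_per_gene(gene, orientation, feature):
    """Split '&'-joined fields 4-6 into one (gene, orientation, feature) tuple per gene.

    Assumes the three fields always contain the same number of '&'-separated parts.
    """
    parts = [gene.split("&"), orientation.split("&"), feature.split("&")]
    if len({len(p) for p in parts}) != 1:
        raise ValueError("fields 4-6 have different numbers of '&'-separated parts")
    return list(zip(*parts))
```

When position_is_directly_observed returns False, the stored value is the observed position after the insertion minus one, so it should be treated as approximate.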
Some of the files have 6 additional tab-separated fields after those, dealing with
gene annotation:
9) gene_synonyms – other IDs previously used for that gene on Phytozome,
based on the Phytozome Creinhardtii_236_synonym.txt file.
10–11) transcript_names, defline – further Phytozome information about the
gene (often absent, marked by “-”); defline is based on the Phytozome
Creinhardtii_236_defline.txt file, and transcript_names are derived
from Creinhardtii_236_gene.gff3.
12–14) best_arabidopsis_TAIR10_hit_name, best_arabidopsis_TAIR10_hit_symbol, and
best_arabidopsis_TAIR10_hit_defline – the name, symbol and description
of the best Arabidopsis homolog, or “-” if one wasn’t found, taken
directly from the Phytozome Creinhardtii_236_annotation_info.txt file.
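As an illustration of the layout described above, the following Python sketch reads one of the files into dictionaries. It is a minimal example under stated assumptions rather than our own parsing code: read_insertions is a hypothetical name, each file is assumed to start with exactly one header line, columns are matched by position in the order listed above, and fields 3 and 7 are assumed to always hold integers.

```python
import csv

CORE_FIELDS = (
    "chromosome", "strand", "position_before_insertion", "gene",
    "orientation_to_gene", "gene_feature", "N_total_reads", "top_read_sequence",
)
ANNOTATION_FIELDS = (
    "gene_synonyms", "transcript_names", "defline",
    "best_arabidopsis_TAIR10_hit_name", "best_arabidopsis_TAIR10_hit_symbol",
    "best_arabidopsis_TAIR10_hit_defline",
)


def read_insertions(lines):
    """Yield one dict per insertion from an iterable of text lines (e.g. an open file)."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader, None)  # skip the header line (assumed to be a single line)
    for values in reader:
        if not values:
            continue  # ignore blank lines
        # zip stops at the shorter input, so files with only the first 8
        # fields simply yield records without the annotation keys.
        record = dict(zip(CORE_FIELDS + ANNOTATION_FIELDS, values))
        record["position_before_insertion"] = int(record["position_before_insertion"])
        record["N_total_reads"] = int(record["N_total_reads"])
        yield record
```

With a file opened in text mode as handle, the records can be consumed with a loop such as "for record in read_insertions(handle)". Matching columns by position rather than by header text avoids relying on the exact header spelling, which this README does not specify.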
Each table contains the insertions for a different dataset:
Supplemental table 9, 10: Raw data for the two technical replicates of the
data used for Figures 3 and 4; includes insertions mapped to the
nuclear and organellar genomes and cassette-mapped insertions; no
low-abundance cutoffs were applied; no adjacent insertion merging
was done.
Supplemental table 11: Final filtered data used for Figures 3 and 4: after
removing insertions mapped to the organellar genomes, applying the
low-abundance cutoffs, adding the two technical replicates together,
and merging some adjacent insertions. Also includes insertions
mapped to the cassette; the low abundance cutoffs were applied to
those, but adjacent insertion merging was not.
Supplemental table 12, 13: Raw data for 5’ and 3’ flanking regions,
respectively, from the data used for Figure 5 (cassette-mapped
insertions only). Before low-abundance cutoffs, and before adding the
two replicates together.
Understanding the cassette-mapped insertion positions: In the insertion mapping
process, the cassette is treated as just another chromosome, and thus the data
is presented as if the insertion cassette was inserted into another cassette.
However, in reality the most likely explanation is multiple cassettes or cassette
fragments being ligated together, and an upstream or downstream cassette
fragment can be ligated to either a 5’ or a 3’ end of another cassette (Figure 5A).
In order to convert the +/– strand positions to the upstream/downstream
categories used in the analysis in Figure 5, use the following rules: an upstream
cassette fragment will map to the + strand when it’s read as a 5’ flanking region,
and to the – strand when it’s read as a 3’ flanking region; a downstream cassette
fragment will map in the opposite orientation (to the – strand when it’s read as a
5’ flanking region, and to the + strand when it’s read as a 3’ flanking region).
Also note the three cassette-end special cases: two intact cassette 5’ ends
ligated together will yield 5’ flanking regions mapping to position 0 strand – (same
from both cassettes); two intact cassette 3’ ends ligated together will yield 3’
flanking regions mapping to position 2,660 (last base of the cassette) strand –
(same from both cassettes); an intact cassette 5’ and 3’ end ligated together will
yield 5’ flanking regions mapping to position 2,660 strand + (from the first
cassette) and 3’ flanking regions mapping to position 0 strand + (from the second
cassette).
A note on the format: all files have been reformatted for easier reading, and
thus do not conform to the raw format generated by mutant_count_alignments.py
and parsed by mutant_analysis_classes.read_mutant_file. Please contact us if
you would like another version of the data.