These files give the insertion positions, read counts and gene data for the different datasets used in the paper. All the files are in plain-text tab-separated format with headers, and can be opened as text or in Excel.

The first 8 tab-separated fields in all files are as follows:
1) chromosome – name of the chromosome to which the insertion mapped. Possible values include all nuclear genome chromosomes and scaffolds, as well as the chloroplast and mitochondrial genomes and the insertion cassette, all treated as separate chromosomes.
2) strand – the direction in which the insertion mapped to the given chromosome: “+” means the cassette orientation matched the (arbitrary) chromosome “+” direction, “-” means the cassette was in the opposite orientation, and “both” means that particular insertion includes two cassettes in two different directions (due to merging two apparent insertions that mapped to the same position but opposite strands; this has only been done for some of the files).
3) position_before_insertion – the number of the base immediately before the insertion position, 1-based. Note that sometimes this base was directly sequenced from the flanking region, and sometimes it was inferred from the other side of the insertion, because most of the time we only saw flanking region reads from one side of the insertion (the 5’ or 3’ side of the cassette). If the strand is “+”, the 5’ flanking region is before the cassette, and the 3’ flanking region is after the cassette: e.g. for an insertion between bases 100 and 101, the 5’ flanking region will include bases 80–100, and the 3’ flanking region will include bases 101–121. The opposite is the case for “-” strand insertions.
Therefore if the strand is “+” in a 5’ flanking region file, or “-” in a 3’ flanking region file, or “both” in either type of file, the position_before_insertion value is directly known from the flanking region sequence. In the other cases, what is actually known is the position after the insertion, and the position_before_insertion value is set to one base before that (on the assumption that the insertion was clean); in reality that position may be different because of a genomic deletion or other complicating factors.
4) gene – the v5.3 gene ID of the gene that includes the mapped insertion position, or “intergenic” if there are no annotated genes at that position, “?” if we did not check for annotated genes (for chloroplast and mitochondrial genomes only), and “-” for insertions that map to the insertion cassette. The v5.3 genome contains 192 pairs of overlapping genes; when a mutant falls in a region where two genes overlap, the gene names are separated by “&”.
5) orientation_to_gene – “sense” if the cassette is inserted in the same direction as the gene, “antisense” otherwise. This depends only on whether the cassette and gene strands match, not on the “+”/“-” strand value itself: if the cassette and the gene are both on the “-” strand, or both on the “+” strand, the orientation is “sense”; if the cassette and gene strands are different, the orientation is “antisense”. The value is “both” if there are two copies of the cassette in opposite directions (when strand is “both”), “-” for intergenic and cassette positions (when gene is “intergenic” or “-”), and “?” when gene is “?”. If the mutant falls in a region where two genes overlap, the orientations with regard to the two genes are separated by “&”.
6) gene_feature – which feature of the gene the insertion mapped to. The main possible values are “intron”, “CDS” (i.e. exon excluding UTRs), “five_prime_UTR” and “three_prime_UTR”. The value can be two of the above separated with “/” if the insertion mapped exactly to the edge between two features, or one of the above with “gene_edge/” and/or “mRNA_edge/” added if it mapped exactly to the edge of the gene or mRNA. The value can also be “-” for intergenic and cassette-mapped insertion positions, and “?” when gene is “?”. 8.6% of genes in the v5.3 genome have multiple splice variants; for mutants falling in those genes, “MULTIPLE_SPLICE_VARIANTS” is given instead of a feature, since the feature can be different for each variant. If the mutant falls in a region where two genes overlap, the features with regard to the two genes are separated by “&”.
7) N_total_reads – how many deep-sequencing reads correspond to this insertion (this includes perfect reads and reads with 1 bp mismatches; for “both” strand cases it includes reads from both sides).
8) top_read_sequence – the most abundant unique sequence out of all the reads corresponding to this insertion.

Some of the files have 6 additional tab-separated fields after those, dealing with gene annotation:
9) gene_synonyms – other IDs previously used for that gene on Phytozome, based on the Phytozome Creinhardtii_236_synonym.txt file.
10–11) transcript_names, defline – further Phytozome information about the gene (often absent, marked by “-”); defline is based on the Phytozome Creinhardtii_236_defline.txt file, and transcript_names are derived from Creinhardtii_236_gene.gff3.
12–14) best_arabidopsis_TAIR10_hit_name, best_arabidopsis_TAIR10_hit_symbol, and best_arabidopsis_TAIR10_hit_defline – the name, symbol and description of the best Arabidopsis homolog, or “-” if one wasn’t found, taken directly from the Phytozome Creinhardtii_236_annotation_info.txt file.
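The field conventions above can be sketched in Python as follows. This is a minimal illustration, not the code used to generate the files. It assumes that the header row uses exactly the field names listed above, that “&”-separated values appear in the same gene order in all three gene fields, and that the user knows whether a file holds 5’ or 3’ flanking regions (the "5prime"/"3prime" labels, the function names, and the two example rows are all made up for this sketch).

```python
import csv
import io


def read_insertions(handle):
    """Read a tab-separated insertion table into a list of dicts."""
    # QUOTE_NONE: treat any quote characters in the data as ordinary text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for row in reader:
        # Assumes both numeric fields are plain integers in the file.
        row["position_before_insertion"] = int(row["position_before_insertion"])
        row["N_total_reads"] = int(row["N_total_reads"])
        rows.append(row)
    return rows


def position_is_directly_sequenced(strand, flank_side):
    """True if position_before_insertion comes straight from the read.

    flank_side is "5prime" or "3prime" (hypothetical labels for the file type).
    Directly known: "+" in a 5' file, "-" in a 3' file, or "both" in either.
    """
    if strand == "both":
        return True
    return (strand, flank_side) in {("+", "5prime"), ("-", "3prime")}


def per_gene_annotations(row):
    """Pair gene, orientation and feature for insertions in overlapping genes."""
    genes = [g.strip() for g in row["gene"].split("&")]
    orientations = [o.strip() for o in row["orientation_to_gene"].split("&")]
    features = [f.strip() for f in row["gene_feature"].split("&")]
    return list(zip(genes, orientations, features))


# Made-up example rows (not real data), using the first 8 fields only.
EXAMPLE = (
    "chromosome\tstrand\tposition_before_insertion\tgene\t"
    "orientation_to_gene\tgene_feature\tN_total_reads\ttop_read_sequence\n"
    "chromosome_1\t+\t100\tGENE_A&GENE_B\tsense&antisense\tCDS&intron\t12\t"
    "ACGTACGTACGTACGTACGT\n"
    "chromosome_2\t-\t2500\tintergenic\t-\t-\t3\tTTGACCATGGTTGACCATGG\n"
)

rows = read_insertions(io.StringIO(EXAMPLE))
for row in rows:
    known = position_is_directly_sequenced(row["strand"], "5prime")
    print(row["chromosome"], row["position_before_insertion"],
          "direct" if known else "inferred", per_gene_annotations(row))
```

For a real file, replace the io.StringIO(EXAMPLE) handle with an opened file and pass the flanking-region side that matches that file.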
Each table contains the insertions for a different dataset:
Supplemental tables 9, 10: Raw data for the two technical replicates of the data used for Figures 3 and 4; includes insertions mapped to the nuclear and organellar genomes and cassette-mapped insertions; no low-abundance cutoffs were applied; no adjacent insertion merging was done.
Supplemental table 11: Final filtered data used for Figures 3 and 4, after removing insertions mapped to the organellar genomes, applying the low-abundance cutoffs, adding the two technical replicates together, and merging some adjacent insertions. Also includes insertions mapped to the cassette; the low-abundance cutoffs were applied to those, but adjacent insertion merging was not.
Supplemental tables 12, 13: Raw data for 5’ and 3’ flanking regions, respectively, from the data used for Figure 5 (cassette-mapped insertions only). The data are given before the low-abundance cutoffs were applied and before the two replicates were added together.

Understanding the cassette-mapped insertion positions: In the insertion mapping process, the cassette is treated as just another chromosome, and thus the data is presented as if the insertion cassette was inserted into another cassette. In reality, the most likely explanation is that multiple cassettes or cassette fragments were ligated together, and an upstream or downstream cassette fragment can be ligated to either the 5’ or the 3’ end of another cassette (Figure 5A). To convert the “+”/“-” strand values to the upstream/downstream categories used in the analysis in Figure 5, use the following rules: an upstream cassette fragment will map to the “+” strand when it’s read as a 5’ flanking region, and to the “-” strand when it’s read as a 3’ flanking region; a downstream cassette fragment will map in the opposite orientation (to the “-” strand when it’s read as a 5’ flanking region, and to the “+” strand when it’s read as a 3’ flanking region).
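The strand conversion rule above can be written as a small Python function. This is only a sketch of the rule as stated. The function name and the "5prime"/"3prime" labels are made up for illustration, and the function applies the general rule to every position without treating any cassette-end position differently.

```python
def cassette_fragment_side(strand, flank_side):
    """Classify a cassette-mapped insertion as "upstream" or "downstream".

    strand     -- "+" or "-", as given in the strand field
    flank_side -- "5prime" for a 5' flanking region file (Supplemental table 12),
                  "3prime" for a 3' flanking region file (Supplemental table 13)
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-', got %r" % (strand,))
    if flank_side not in ("5prime", "3prime"):
        raise ValueError("flank_side must be '5prime' or '3prime'")
    # Upstream fragments: "+" strand in 5' reads, "-" strand in 3' reads.
    # Downstream fragments: the two remaining combinations.
    is_upstream = (strand == "+") == (flank_side == "5prime")
    return "upstream" if is_upstream else "downstream"


for strand in ("+", "-"):
    for flank_side in ("5prime", "3prime"):
        print(strand, flank_side, cassette_fragment_side(strand, flank_side))
```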
Also note the three cassette-end special cases:
- two intact cassette 5’ ends ligated together will yield 5’ flanking regions mapping to position 0, strand “-” (same from both cassettes);
- two intact cassette 3’ ends ligated together will yield 3’ flanking regions mapping to position 2,660 (last base of the cassette), strand “-” (same from both cassettes);
- an intact cassette 5’ end and 3’ end ligated together will yield 5’ flanking regions mapping to position 2,660, strand “+” (from the first cassette), and 3’ flanking regions mapping to position 0, strand “+” (from the second cassette).

A note on the format: all files have been reformatted for easier reading, and thus do not conform to the raw format generated by mutant_count_alignments.py and parsed by mutant_analysis_classes.read_mutant_file. Please contact us if you would like another version of the data.