Supplementary text

A high-density linkage map enables a second-generation collared flycatcher genome assembly and reveals the patterns of avian recombination rate variation and chromosomal evolution

by Takeshi Kawakami, Linnea Smeds, Niclas Backström, Arild Husby, Anna Qvarnström, Carina F. Mugal, Pall Olason and Hans Ellegren

Linkage analysis

All steps of the linkage analysis described below use individual options of the CRI-MAP package. The pedigree was split into ten sub-families comprising 14 to 104 individuals using the CRI-GEN option. The improved version of CRI-MAP can efficiently calculate recombination fractions and logarithm of odds (LOD) scores between large numbers of markers. Despite this improvement, the number of markers in our dataset exceeded the computational capacity available to us for calculating pair-wise LOD scores between all possible marker pairs. To reduce the number of pair-wise comparisons, we therefore subdivided the markers and analysed them sequentially, based on their genomic location in the FicAlb_1.4 assembly version of the flycatcher genome (Ellegren et al. 2012) and on linkage information from previous low-density flycatcher linkage maps (Backström et al. 2008; Backström et al. 2010). Z-linked markers were pre-selected by identifying SNPs that were heterozygous in males only (882 markers); the subsequent linkage analysis of this marker set was thus based on male recombination only.

Step 1: Building a pre-framework map

The FicAlb_1.4 assembly version was composed of 298 scaffolds, 45% of which (134 scaffolds) were anchored by at least one marker on the low-density linkage map. We initially built a pre-framework map using only markers from scaffolds with some previous linkage information. For these markers we calculated pair-wise LOD scores using the TWOPOINT option.
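To make the pair-wise LOD score concrete, the following is a minimal sketch for the simplest textbook case: phase-known meioses in which recombinants can be counted directly. This is an illustration only and not how CRI-MAP computes its scores, which are based on pedigree likelihoods. The function name `two_point_lod` and its arguments are hypothetical.

```python
import math
from typing import Tuple


def two_point_lod(recombinants: int, meioses: int) -> Tuple[float, float]:
    """Return (theta_hat, LOD) for phase-known informative meioses.

    Simplified counting model (an assumption for illustration):
    LOD = log10 L(theta_hat) - log10 L(0.5), with theta_hat = R / N
    capped at 0.5 (free recombination).
    """
    if meioses <= 0 or not 0 <= recombinants <= meioses:
        raise ValueError("need 0 <= recombinants <= meioses and meioses > 0")

    theta = min(recombinants / meioses, 0.5)
    non_recombinants = meioses - recombinants

    # log10 likelihood at theta_hat; terms with a zero count are skipped
    # so that 0 * log10(0) is treated as 0.
    log_lik = 0.0
    if recombinants:
        log_lik += recombinants * math.log10(theta)
    if non_recombinants:
        log_lik += non_recombinants * math.log10(1.0 - theta)

    # Under the null hypothesis of no linkage every meiosis contributes log10(0.5).
    lod = log_lik - meioses * math.log10(0.5)
    return theta, lod


# Example: 2 recombinants among 40 informative meioses.
print(two_point_lod(2, 40))
```

Under this simplified model, the LOD thresholds quoted in the text (for example LOD > 5) correspond to the likelihood of linkage at the estimated recombination fraction being more than 10^5 times the likelihood under free recombination.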
Markers were assigned to linkage groups based on pair-wise linkages and the grouping algorithm implemented in the AUTOGROUP option. This procedure efficiently removed uninformative markers, as well as putatively wrongly assigned markers, while classifying markers into linkage groups through iterative evaluation of markers and linkages over multiple parameter layers of decreasing stringency. Each layer specifies four parameters: minimum LOD score, minimum threshold for informative meioses, maximum number of shared linkages and minimum threshold for the linkage ratio. The parameter layer settings were as follows: first layer (40, 2.0, 2, 0.9), second layer (20, 1.5, 3, 0.7), third layer (10, 1.0, 5, 0.6), and fourth layer (5, 0.4, 6, 0.5). After selecting all informative markers on each chromosome, pre-framework linkage groups were constructed using the BUILD option with a threshold of LOD > 5. This was repeated three times with different starting marker pairs, and the map with the largest number of markers was chosen as the pre-framework map for each chromosome. Four linkage groups (chromosomes 18, 22, 25 and LGE22) had no anchored scaffolds; scaffolds were assigned to them based on alignments with the chicken and zebra finch genomes.

Step 2: Building a framework map

After constructing the pre-framework map, 36,531 markers remained in the unmapped pool. To assign these unmapped markers to linkage groups, the TWOPOINT and AUTOGROUP options were used. To maximize computational efficiency, the unmapped markers were subdivided into 32 groups based on their putative chromosomal location in FicAlb_1.4 (27-2,370 markers per group), and each of these groups was analysed together with the markers on the corresponding pre-framework linkage group.
In total, 33,586 markers could be assigned to the initial 31 pre-framework linkage groups, and 41 markers were assigned to three novel linkage groups during this step. Using the same procedure as for the pre-framework map (TWOPOINT and BUILD), framework maps were constructed for each linkage group by adding markers onto the pre-framework maps when the positions of the newly added markers were supported with LOD > 3.0. The FLIPS5 option was used to shuffle markers to confirm that no alternative marker order had a higher likelihood.

Step 3: Building a best-order map

Best-order maps were constructed by adding markers to the framework map while progressively relaxing the LOD threshold to 2, 1 and 0.1, again using the BUILD option. The order of markers in the best-order maps was checked by recurrent runs of the FLIPS5 option. The CHROMPIC option was used to identify and exclude unlikely tight double recombination events, which were probably the result of genotyping and/or ordering errors. Sex-specific and sex-averaged genetic distances were calculated for each linkage group using the Kosambi mapping function (Kosambi 1944), and genetic maps were visualised with MapChart (Voorrips 2002).

Identification of chimeric scaffolds

Discrepancies in the form of scaffolds including markers from more than one linkage group were investigated manually in IGV (Thorvaldsdóttir et al. 2012). If a region contained a high number of reads with mates (from paired-end or mate-pair reads) located on a different scaffold, we assumed that the scaffold had been mis-assembled into a chimera and split it.

Chimeras can be formed in several ways during the assembly process. One possibility is that a junction is caused by short repeats already in the de Bruijn graph, connecting two reads that share the same k-mer but that correspond to sequences from different parts of the genome.
Such incorrectly assembled contigs can be short, with a relatively limited amount of sequence surrounding the junction, although sufficient for connecting them to other contigs in the scaffolding process. As a consequence, the resulting chimeric scaffold can be quite large. Another possibility is that a chimera with a gap in the junction is formed in the scaffolding step due to reads mapping to repetitive regions at the ends of the two contigs. The extreme coverage observed at chimeras with gaps suggests that the ends may represent collapsed repeats, with reads corresponding to different locations in the genome forming a false link between the contigs.

Scaffolds that were chimeric with a gap in the middle were split around the gap so that the gap region was removed from the assembly. Chimeric scaffolds with a small repetitive region were split with overlap so that the middle part (judged by paired-end reads on either side) was assigned to both sides. We found a total of 45 chimeric scaffolds; these were split and the resulting pieces were placed anew on the basis of the linkage map. In total, 50.6 Mb of DNA sequence was relocated in the genome in this way. After splitting chimeras, all scaffolds were renamed according to size, starting with N00001 for the largest scaffold.

Gap size between super-scaffolds

Gaps between super-scaffolds are likely to be enriched for gap sizes longer than mate-pair library insert sizes. We attempted to approximate the gap size between super-scaffolds (or singleton scaffolds) by combining information on the local recombination rate, the genetic distance between the most distal markers on the respective scaffolds flanking the gap, and the location of those markers within the scaffolds. Most estimates were close to 0, sometimes negative and generally less than 1 Mb. However, with the resolution provided by the recombination map we could not obtain confident estimates on the kb-scale.
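The gap-size approximation described above can be sketched as follows. The exact way in which the three pieces of information were combined is not spelled out in the text, so the formula here is an assumption: the genetic distance between the flanking markers is converted to an expected physical distance using the local recombination rate, and the sequence already assembled between each marker and its scaffold end is subtracted. The function name, argument names and example numbers are hypothetical.

```python
def estimate_gap_bp(
    genetic_distance_cm: float,
    local_rate_cm_per_mb: float,
    left_marker_to_scaffold_end_bp: float,
    right_marker_to_scaffold_start_bp: float,
) -> float:
    """Approximate the gap (in bp) between two adjacent scaffolds.

    Assumed model (not taken from the source):
      expected physical distance = genetic distance / local recombination rate
      gap = expected physical distance - assembled sequence between the markers
    Negative values can occur when the recombination-based distance is
    shorter than the sequence already assembled between the two markers.
    """
    if local_rate_cm_per_mb <= 0:
        raise ValueError("local recombination rate must be positive")

    # Convert cM to an expected physical distance in bp via cM/Mb.
    expected_bp = genetic_distance_cm / local_rate_cm_per_mb * 1_000_000

    # Remove the known sequence between each flanking marker and the gap.
    return (
        expected_bp
        - left_marker_to_scaffold_end_bp
        - right_marker_to_scaffold_start_bp
    )


# Hypothetical example: markers 1.5 cM apart in a region with 3 cM/Mb,
# lying 200 kb and 250 kb from the scaffold ends that face the gap.
print(estimate_gap_bp(1.5, 3.0, 200_000, 250_000))
```

Because the estimate is a difference between a noisy recombination-based distance and a known physical distance, small or negative values are expected when the true gap is short, which is consistent with the estimates reported here.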
The more important conclusion from this analysis was that gaps are typically short, suggesting that the assembly covered the vast majority of each chromosome. We cannot, however, formally judge how much sequence might remain unassembled beyond the outermost scaffolds at either end of each chromosome.

Interpretation of inversion events

MGR can infer inversions, translocations, fusions and fissions, although potential translocations within chromosomes will be interpreted as series of inversions. Our estimate of the abundance of inversion events thus relies on this assumption, as has been the case in earlier studies in mammals (Bourque & Tesler 2008). However, to check for possible cases of transposition we separately aligned flycatcher chromosomes to the zebra finch and chicken genomes with LASTZ, as described above, filtered the anchors and formed synteny blocks. We identified only three syntenic blocks, of size 96, 106 and 127 kb, respectively, that were not located in the expected (syntenic) flycatcher chromosome given their chromosomal location in chicken and/or zebra finch. Each of these cases is described below. However, given the rarity of potential transposition events and the ambiguous nature of at least some of them, we do not consider transpositions in the main text.

1. Scaffold N00370 (198 kb) was anchored to chromosome 5 by multiple markers. However, a 96 kb fragment from this scaffold aligned to chromosome 1/1A in the two other species. Two of the markers anchoring N00370 to chromosome 5 reside within the 96 kb fragment, making it unlikely that the discrepancy is due to the scaffold being chimeric. This may thus represent a case of transposition.

2. Scaffold N00269 (569 kb) from (another part of) chromosome 5 has a 106 kb fragment that aligned to chromosome 4/4A in chicken and zebra finch.
It is linked to flycatcher chromosome 5 only via mate-pairs to another scaffold and contains no markers that confirm anchoring by genetic linkage. We cannot exclude that this scaffold is chimeric, which represents an alternative explanation to transposition.

3. Scaffold N00033 (8,574 kb) has multiple markers that anchored it to chromosome 10, and it also aligned in full to sequences on chromosome 10 in chicken. However, a 127 kb fragment from N00033 is syntenic to sequences on zebra finch chromosome Z. This region of N00033 lacks genetic markers, but the observation of the expected conserved synteny with chicken chromosome 10 argues against a chimera. This suggests that it represents either a translocation in the zebra finch lineage or a mis-assembly in the zebra finch genome.

References

Backström N, Karaiskou N, Leder EH, et al. (2008) A gene-based genetic linkage map of the collared flycatcher (Ficedula albicollis) reveals extensive synteny and gene-order conservation during 100 million years of avian evolution. Genetics 179, 1479-1495.

Backström N, Palkopoulou E, Qvarnström A, Ellegren H (2010) No evidence for Z-chromosome rearrangements between the pied flycatcher and the collared flycatcher as judged by gene-based comparative genetic maps. Molecular Ecology 19, 3394-3405.

Bourque G, Tesler G (2008) Computational tools for the analysis of rearrangements in mammalian genomes. In: Bioinformatics (ed. Keith J), pp. 431-455. Humana Press.

Ellegren H, Smeds L, Burri R, et al. (2012) The genomic landscape of species divergence in Ficedula flycatchers. Nature 491, 756-760.

Kosambi DD (1944) The estimation of map distances from recombination values. Annals of Eugenics 12, 172-175.

Thorvaldsdóttir H, Robinson JT, Mesirov JP (2012) Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Briefings in Bioinformatics.

Voorrips RE (2002) MapChart: Software for the graphical presentation of linkage maps and QTLs.
The Journal of Heredity 93, 77-78.