mec12810-sup-0003-AppendixS1

Supplementary text
A high-density linkage map enables a second-generation collared flycatcher genome
assembly and reveals the patterns of avian recombination rate variation and
chromosomal evolution
by Takeshi Kawakami, Linnea Smeds, Niclas Backström, Arild Husby, Anna Qvarnström,
Carina F. Mugal, Pall Olason and Hans Ellegren
Linkage analysis
All steps in the process of linkage analysis described below refer to the use of individual parts
of the CRI-MAP package. The pedigree was split into ten sub-families comprising 14 to 104
individuals using the CRI-GEN option. The improved version of CRI-MAP can efficiently
calculate recombination fractions and logarithm of odds (LOD) scores between large
numbers of markers. Despite the improvement, however, the number of markers in our dataset
exceeded our available computational capacity to calculate pair-wise LOD scores between all
possible marker pairs. Therefore, in order to reduce the number of pair-wise comparisons of
markers, we subdivided and sequentially analysed the markers based on their genomic
location in the FicAlb_1.4 assembly version of the flycatcher genome (Ellegren et al. 2012)
and linkage information in previous low-density flycatcher linkage maps (Backström et al.
2008; Backström et al. 2010). Z-linked markers were pre-selected by identifying SNPs that
were heterozygous in males only (882 markers), and the subsequent linkage analysis of this
marker set was therefore based on male recombination only.
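The pre-selection of Z-linked markers can be illustrated with a minimal sketch. In birds, females are ZW and therefore cannot be heterozygous at Z-linked loci, so a marker that is heterozygous in at least one male and in no female is a candidate. The data layout and the function names (`is_het`, `male_only_het_markers`) are hypothetical and are not part of CRI-MAP or of the pipeline used here; missing genotypes are not handled.

```python
def is_het(genotype):
    """Return True if a two-allele genotype such as ("A", "G") is heterozygous."""
    allele1, allele2 = genotype
    return allele1 != allele2


def male_only_het_markers(genotypes):
    """Select markers heterozygous in at least one male and in no female.

    genotypes: dict mapping marker name -> list of (sex, (allele1, allele2)),
    with sex given as "M" or "F" (hypothetical layout).
    """
    selected = []
    for marker, calls in genotypes.items():
        male_het = any(is_het(g) for sex, g in calls if sex == "M")
        female_het = any(is_het(g) for sex, g in calls if sex == "F")
        if male_het and not female_het:
            selected.append(marker)
    return selected


example = {
    "snp_z": [("M", ("A", "G")), ("M", ("A", "A")), ("F", ("G", "G"))],
    "snp_auto": [("M", ("C", "T")), ("F", ("C", "T"))],
    "snp_mono": [("M", ("T", "T")), ("F", ("T", "T"))],
}
print(male_only_het_markers(example))  # prints ['snp_z']
```

With real data, a small tolerance for genotyping errors in females might be needed; the strict rule is kept here for clarity.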
Step 1: Building a pre-framework map
The FicAlb_1.4 assembly version was composed of 298 scaffolds, 45% of which (134
scaffolds) were anchored by at least one marker on the low-density linkage map. We initially
built a pre-framework map using only markers from scaffolds with some previous linkage
information. For these markers we calculated pair-wise LOD scores using the TWOPOINT
option. Markers were assigned to linkage groups based on pair-wise linkages and the
grouping algorithm implemented in the AUTOGROUP option. This procedure efficiently
removed uninformative markers, as well as putatively wrongly assigned markers, while at the
same time classifying markers into linkage groups via iterative marker and linkage evaluation
with multiple parameter layers with different stringency criteria. Each layer specifies four
parameters: minimum LOD score, minimum threshold for informative meiosis, maximum
number of shared linkages and minimum threshold for the linkage ratio. The parameter layer
settings were as follows: first layer (40, 2.0, 2, 0.9), second layer (20, 1.5, 3, 0.7), third layer
(10, 1.0, 5, 0.6), and fourth layer (5, 0.4, 6, 0.5). After selecting all informative markers on
each chromosome, pre-framework linkage groups were constructed using the BUILD option
with a threshold of LOD > 5. This was repeated three times with different starting marker
pairs, and the map with the largest number of markers was chosen as the pre-framework map
for each chromosome. Four chromosomes (chromosomes 18, 22, 25, and linkage group
LGE22) had no scaffolds anchored by the low-density map; scaffolds were instead assigned
to these chromosomes based on alignments with the chicken and zebra finch genomes.
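As an illustration of the pair-wise LOD scores on which the grouping in this step relies, the following sketch computes a two-point LOD score for the simplest textbook case, in which phase is known and recombinants can be counted directly. CRI-MAP evaluates likelihoods on full pedigrees, so this is a simplified stand-in and not its algorithm; the function name `two_point_lod` is hypothetical.

```python
import math


def two_point_lod(recombinants, meioses):
    """Two-point LOD score for the phase-known case.

    Compares the likelihood at the maximum-likelihood recombination
    fraction (capped at 0.5) with the likelihood under free
    recombination (theta = 0.5). Returns (theta, lod).
    """
    if meioses <= 0 or not 0 <= recombinants <= meioses:
        raise ValueError("need meioses > 0 and 0 <= recombinants <= meioses")
    theta = min(recombinants / meioses, 0.5)
    non_recombinants = meioses - recombinants
    log_likelihood = 0.0
    if recombinants:
        log_likelihood += recombinants * math.log10(theta)
    if non_recombinants:
        log_likelihood += non_recombinants * math.log10(1.0 - theta)
    log_likelihood_null = meioses * math.log10(0.5)
    return theta, log_likelihood - log_likelihood_null


theta, lod = two_point_lod(recombinants=2, meioses=40)
print(f"theta = {theta:.3f}, LOD = {lod:.2f}")
```

Under this simplification, two recombinants in 40 informative meioses give a LOD score above the threshold of 5 used for the pre-framework map, whereas a pair with 50% recombinants gives a LOD score of 0.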
Step 2: Building a framework map
After constructing the pre-framework map, there were 36,531 markers remaining in the
unmapped pool. In order to assign these unmapped markers to linkage groups, the
TWOPOINT and the AUTOGROUP options were used. To maximize computational
efficiency, the unmapped markers were subdivided into 32 groups based on their putative
chromosomal location in FicAlb_1.4 (27-2,370 markers per group), and each of these groups
of unmapped markers was analysed together with the markers on the corresponding
pre-framework linkage group. In total, 33,586 markers could be assigned to the initial 31
pre-framework linkage groups, and 41 markers were assigned to three novel linkage groups
during this step. By using the same procedure as for the pre-framework map (TWOPOINT
and BUILD), framework maps were constructed for each linkage group by adding markers
onto the pre-framework maps when positions of newly added markers were supported with
LOD > 3.0. The FLIPS5 option was used to shuffle markers to confirm that there was no
alternative marker order with a higher likelihood.
Step 3: Building a best-order map
Best-order maps were constructed by adding markers to the framework map by progressively
relaxing the LOD threshold to 2, 1 and 0.1, again using the BUILD option. The order of
markers in the best-order maps was checked by recurrent runs of the FLIPS5 option. The
CHROMPIC option was used to identify and exclude unlikely tight double recombination
events, which were probably the result of genotyping and/or ordering errors. Sex-specific and
sex-averaged genetic distances were calculated for each linkage group using the Kosambi
correction function (Kosambi 1944), and genetic maps were visualised by MapChart
(Voorrips 2002).
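For reference, the Kosambi function converts a recombination fraction r into a map distance d (in Morgans) as d = 0.25 ln[(1 + 2r) / (1 - 2r)]. The sketch below applies this formula and its inverse; the function names are hypothetical, and CRI-MAP performs this conversion internally.

```python
import math


def kosambi_cm(r):
    """Kosambi map distance in centimorgans for a recombination fraction r."""
    if not 0.0 <= r < 0.5:
        raise ValueError("recombination fraction must satisfy 0 <= r < 0.5")
    return 25.0 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))


def kosambi_inverse(d_cm):
    """Recombination fraction for a Kosambi map distance in centimorgans."""
    return 0.5 * math.tanh(2.0 * d_cm / 100.0)


for r in (0.01, 0.10, 0.30):
    print(f"r = {r:.2f} -> {kosambi_cm(r):.2f} cM")
```

For small r the map distance is close to 100r cM, and the two diverge as r approaches 0.5, reflecting the correction for undetected double crossovers.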
Identification of chimeric scaffolds
Discrepancies in the form of scaffolds including markers from more than one linkage group
were investigated manually in IGV (Thorvaldsdóttir et al. 2012). If a region contained a high
number of reads with mates (from paired-end or mate-pair reads) located on a different
scaffold, we assumed the scaffold was mis-assembled into a chimera and thus split it.
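The first screening step, finding scaffolds whose markers fall in more than one linkage group, can be sketched as follows. The input layout and the function name `candidate_chimeras` are hypothetical; the manual inspection of read pairs in IGV is not reproduced here.

```python
from collections import defaultdict


def candidate_chimeras(marker_assignments):
    """Return scaffolds whose markers map to more than one linkage group.

    marker_assignments: iterable of (marker, scaffold, linkage_group)
    tuples (hypothetical layout).
    """
    groups_by_scaffold = defaultdict(set)
    for _marker, scaffold, linkage_group in marker_assignments:
        groups_by_scaffold[scaffold].add(linkage_group)
    return {
        scaffold: sorted(groups)
        for scaffold, groups in groups_by_scaffold.items()
        if len(groups) > 1
    }


assignments = [
    ("m1", "scafA", "chr1"),
    ("m2", "scafA", "chr1"),
    ("m3", "scafB", "chr2"),
    ("m4", "scafB", "chr5"),
]
print(candidate_chimeras(assignments))  # prints {'scafB': ['chr2', 'chr5']}
```

Scaffolds flagged in this way are only candidates; a single discordant marker could also reflect a genotyping or grouping error, which is why each case was inspected manually.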
Chimeras can be formed in several ways during the assembly process. One possibility is that a
junction is caused by short repeats already in the de Bruijn graph, connecting two reads that
share the same k-mer but that correspond to sequences from different parts of the genome.
Such incorrectly assembled contigs can be short, with a relatively limited amount of sequence
surrounding the junction, although sufficient for connecting them to other contigs in the
scaffolding process. As a consequence, the resulting chimeric scaffold may well be quite
large. Another possibility is that a chimera with a gap in the junction is formed in the
scaffolding step due to reads mapping to repetitive regions in the ends of the two contigs. The
extreme coverage observed at chimeras with gaps suggests that the ends may represent
collapsed repeats, with reads corresponding to different locations in the genome and forming
a false link between the contigs.
Scaffolds that were chimeric with a gap in the middle were split around the gap so that the gap
region was removed from the assembly. Chimeric scaffolds with a small repetitive region were
split with overlap so that the middle part (judged by paired-end reads on either side) was
assigned to both sides. We found a total of 45 chimeric scaffolds, which were split and
repositioned on the basis of the linkage map. In total, 50.6 Mb of DNA sequence was relocated
in the genome in this way. After splitting chimeras, all scaffolds were renamed according to
size, starting with N00001 for the largest scaffold.
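The renaming scheme can be sketched in a few lines. The function name `rename_by_size` and the tie-breaking rule (old name, alphabetically) are assumptions; the text specifies only that names follow decreasing size and start at N00001.

```python
def rename_by_size(scaffold_lengths):
    """Map old scaffold names to N00001, N00002, ... by decreasing length.

    scaffold_lengths: dict mapping old scaffold name -> length in bp.
    Ties are broken by old name (an assumption made for reproducibility).
    """
    ordered = sorted(scaffold_lengths.items(), key=lambda item: (-item[1], item[0]))
    return {
        old_name: f"N{rank:05d}"
        for rank, (old_name, _length) in enumerate(ordered, start=1)
    }


lengths = {"s1": 100, "s2": 500, "s3": 100}
print(rename_by_size(lengths))
```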
Gap size between super-scaffolds
Gaps between super-scaffolds are likely to be enriched for gap sizes longer than mate-pair
library insert sizes. We attempted to approximate gap size between super-scaffolds (or
singleton scaffolds) by combining information on the local recombination rate, the genetic
distance between the most distal markers on the respective scaffold flanking the gap, and the
location of those markers within the scaffolds. Most estimates were close to 0, sometimes
negative and generally less than 1 Mb. With the resolution provided by the
recombination map we could not obtain confident estimates on the kb scale. The
more important conclusion from this analysis was that gaps are typically short, suggesting that
the assembly covered the vast majority of each chromosome. However, we cannot formally
judge how much sequence might remain unassembled beyond the outermost scaffolds at
either end of each chromosome.
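The exact formula used for the gap estimates is not given here. One plausible way of combining the three quantities, shown purely as an assumption, is to convert the genetic distance between the flanking markers into an expected physical distance using the local recombination rate, and then to subtract the assembled sequence lying between each marker and the gap. The function and parameter names are hypothetical.

```python
def estimate_gap_mb(genetic_distance_cm, local_rate_cm_per_mb,
                    left_marker_to_end_mb, right_marker_to_start_mb):
    """Approximate the gap between two scaffolds in Mb (assumed formula).

    genetic_distance_cm: distance between the most distal markers
        flanking the gap, in cM.
    local_rate_cm_per_mb: local recombination rate, in cM/Mb.
    left_marker_to_end_mb / right_marker_to_start_mb: assembled sequence
        between each flanking marker and the scaffold end facing the gap.
    """
    if local_rate_cm_per_mb <= 0:
        raise ValueError("local recombination rate must be positive")
    expected_physical_mb = genetic_distance_cm / local_rate_cm_per_mb
    assembled_mb = left_marker_to_end_mb + right_marker_to_start_mb
    return expected_physical_mb - assembled_mb


print(estimate_gap_mb(1.5, 3.0, 0.2, 0.1))
print(estimate_gap_mb(0.6, 3.0, 0.2, 0.1))
```

Under this formulation a negative estimate arises whenever the assembled sequence between the flanking markers already exceeds the physical distance implied by the genetic distance, which is one way in which the negative values mentioned above could occur.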
Interpretation of inversion events
MGR can infer inversions, translocations, fusions and fissions, although potential
translocations within chromosomes will be interpreted as a series of inversions. Our estimation
of the abundance of inversion events thus relies on this assumption, as has been the case in
earlier studies in mammals (Bourque & Tesler 2008). However, to check for possible cases of
transposition we separately aligned flycatcher chromosomes to the zebra finch and chicken
genomes with LASTZ, as described above, filtered the anchors and formed synteny blocks.
We identified only three syntenic blocks, of size 96, 106 and 127 kb, respectively, that were
not located in the expected (syntenic) flycatcher chromosome given their chromosomal
location in chicken and/or zebra finch. Each of these cases is described below. However, given
the rarity of potential transposition events and the ambiguous nature of at least
some of them, we do not consider transpositions in the main text.
1. Scaffold N00370 (198 kb) was anchored to chromosome 5 by multiple markers. However,
a 96 kb fragment from this scaffold aligned to chromosome 1/1A in the two other species.
Two of the markers anchoring N00370 to chromosome 5 reside within the 96 kb fragment,
making it unlikely that the discrepancy is due to the scaffold being chimeric. This may
thus represent a case of transposition.
2. Scaffold N00269 (569 kb) from (another part of) chromosome 5 has a 106 kb fragment that
aligned to chromosome 4/4A in chicken and zebra finch. It is only linked to flycatcher
chromosome 5 via mate-pairs to another scaffold and contains no markers that confirm
anchoring by genetic linkage. We cannot exclude the possibility that this scaffold is chimeric,
which would be an alternative explanation to transposition.
3. Scaffold N00033 (8,574 kb) has multiple markers that anchored it to chromosome 10 and it
also aligned in full to sequences on chromosome 10 in chicken. However, a 127 kb fragment
from N00033 is syntenic to sequences on zebra finch chromosome Z. This region of N00033
lacks genetic markers but the observation of the expected conserved synteny with chicken
chromosome 10 argues against a chimera. This suggests that it represents either a
translocation in the zebra finch lineage or a mis-assembly in the zebra finch genome.
References
Backström N, Karaiskou N, Leder EH, et al. (2008) A gene-based genetic linkage map of the
collared flycatcher (Ficedula albicollis) reveals extensive synteny and gene-order
conservation during 100 million years of avian evolution. Genetics 179, 1479-1495.
Backström N, Palkopoulou E, Qvarnström A, Ellegren H (2010) No evidence for Z-chromosome
rearrangements between the pied flycatcher and the collared flycatcher
as judged by gene-based comparative genetic maps. Molecular Ecology 19, 3394-3405.
Bourque G, Tesler G (2008) Computational tools for the analysis of rearrangements in
mammalian genomes. In: Bioinformatics (ed. Keith J), pp. 431-455. Humana Press.
Ellegren H, Smeds L, Burri R, et al. (2012) The genomic landscape of species divergence in
Ficedula flycatchers. Nature 491, 756-760.
Kosambi DD (1944) The estimation of map distances from recombination values. Annals of
Eugenics 12, 172-175.
Thorvaldsdóttir H, Robinson JT, Mesirov JP (2012) Integrative Genomics Viewer (IGV):
high-performance genomics data visualization and exploration. Briefings in
Bioinformatics.
Voorrips RE (2002) MapChart: Software for the graphical presentation of linkage maps and
QTLs. The Journal of Heredity 93, 77-78.