Electronic Supplementary Material (ESM)

Kawahara and Breinholt
“Phylogenomics provides strong evidence for relationships of butterflies and moths”
Additional information - sample collection, RNA extraction, and library construction
We generated 33 new transcriptomes from purified RNA extracted from tissues that were
flash-frozen or stored in RNAlater Stabilization Reagent (catalog no. 76106, Qiagen). Specimens
were collected at seven localities on three continents (electronic supplementary material, table
S1). For large specimens, flight muscle tissue was removed from the thorax of the insect and
placed in a 1.5 mL tube containing RNAlater. For small moths, the entire specimen was placed
in a 1.5 mL tube containing RNAlater. Tissues were macerated using an autoclaved plastic pestle
(catalog no. 1450-5390, USA Scientific) and the remaining body regions were placed in a tube
containing 100% EtOH, to serve as an additional molecular voucher. RNAlater-preserved tissues
are stored in freezers at the McGuire Center for Lepidoptera and Biodiversity at the Florida
Museum of Natural History (FLMNH) and vouchers are catalogued in a reference database at the
FLMNH. The SV Total RNA Isolation System (catalog no. Z3100; Promega) and the
NucleoSpin RNA II Total RNA Isolation kit (catalog no. 740955; Clontech) were used to generate
RNA extracts. RNA sample quality was assessed with an Agilent 2100 Bioanalyzer and Qubit
2.0 Fluorometer. Multiplexed libraries with twelve different barcode adaptor combinations were
constructed with the TruSeq RNA Sample Prep Kit v2 (catalog no. RS-122-2001, Illumina) in
the Kawahara Lab at the FLMNH. HiSeq 2000 runs were performed as 100 bp single-end (SE),
100 bp paired-end (PE), or 150 bp PE reads with up to 12 samples per lane at the DNA Core
facilities of the University of
Missouri (Columbia, MO) and Florida State University (Tallahassee, FL) (for sequencing details
of each transcriptome, see electronic supplementary material, table S1).
Using 6,568 putative single-copy orthologous genes identified with OrthoDB from the
reference genomes, we constructed a custom ortholog set (here called LEP1-COS) for use in
HaMStR v8 [29] to extend the ortholog search to non-reference taxa. In HaMStR we used the
‘-representative’ option to predict a single sequence for each LEP1-COS locus from the
transcriptomic, genomic, or EST data of each taxon. This strategy picks the best hit to the
reference protein and further concatenates non-overlapping hits to increase the predicted
sequence length.
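
The selection strategy described above can be illustrated with a minimal sketch. This is not HaMStR's own code: the `Hit` record, its fields, and the greedy score-ordered selection are assumptions made for illustration only. The sketch keeps the best-scoring hit, adds any further hits that do not overlap those already chosen on the reference protein, and joins the chosen fragments in reference order.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Hit:
    """Hypothetical record for one hit against a reference protein."""
    start: int    # first aligned reference position (inclusive)
    end: int      # last aligned reference position (inclusive)
    score: float  # alignment score; higher is better
    seq: str      # predicted sequence fragment for this hit


def overlaps(a: Hit, b: Hit) -> bool:
    """True if two hits share at least one reference position."""
    return a.start <= b.end and b.start <= a.end


def representative(hits: List[Hit]) -> str:
    """Greedy sketch: best hit first, then any non-overlapping hits."""
    chosen: List[Hit] = []
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        if not any(overlaps(hit, kept) for kept in chosen):
            chosen.append(hit)
    # Concatenate fragments in the order they occur along the reference.
    chosen.sort(key=lambda h: h.start)
    return "".join(h.seq for h in chosen)
```

In this sketch a lower-scoring hit is discarded only when it overlaps a hit that has already been kept, so the predicted sequence can grow beyond the single best hit.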
We also tested whether HaMStR could identify all 6,568 loci in LEP1-COS from each
reference taxon. LEP1-COS genes that HaMStR could not identify in all four of the reference
taxa were not included in phylogenomic analyses. Genes that were represented in fewer than 41 of
46 taxa (< 90% of taxa) were also excluded from the data matrix. We built the data matrix from
the 33 new transcriptomes and from published transcriptomes that were assembled and processed from FASTQ
files downloaded from the GenBank SRA database. These taxa and their GenBank SRA codes
are: Actias luna (SRR1002974) [21], Cnaphalocrocis medinalis (SRR647910 - SRR647915) [66],
Enyo lugubris (SRR1002983) [21], Hemaris diffinis (SRR1002987) [21], Grapholita dimorpha
(SRR803483) [66], Papilio glaucus (SRR850324), and P. polytes (SRR850327) [67]. We also
incorporated data for Bicyclus anynana from the GenBank EST database [68]. In order to
exclude problematic signal that could be associated with saturation and rate heterogeneity,
nucleotides were degenerated using the degen v1.4 Perl script [69, 70] and third codon positions
removed.
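
The taxon-coverage filter described above can be sketched as follows. The function name and the dictionary layout (gene identifier mapped to the set of taxa in which that gene was recovered) are assumptions for illustration, not the pipeline actually used; the threshold of 41 taxa is the one stated in the text.

```python
from typing import Dict, Set


def filter_by_occupancy(
    gene_taxa: Dict[str, Set[str]], min_taxa: int = 41
) -> Dict[str, Set[str]]:
    """Keep only genes recovered in at least `min_taxa` taxa."""
    return {
        gene: taxa
        for gene, taxa in gene_taxa.items()
        if len(taxa) >= min_taxa
    }
```

For example, a gene recovered in 40 of the 46 taxa would be dropped, while one recovered in 41 would be kept.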
The 33 newly assembled transcriptomes and the custom Lepidoptera core ortholog set
(LEP1-COS) are available from the Dryad Digital Repository (http://datadryad.org; accession
doi:10.5061/dryad.qd27g). Raw Illumina reads used for transcriptome assembly have been
submitted to GenBank (BioProject PRJNA248471) and the GenBank Sequence Read Archive
(SRA) database. SRA accession numbers are: SRR1298384, SRR1299208-SRR1299214,
SRR1299217, SRR1299267, SRR1299274, SRR1299296, SRR1299306,
SRR1299316-SRR1299318, SRR1299347, SRR1299369, SRR1299394, SRR1299418, SRR1299435,
SRR1299495, SRR1299746, SRR1299750-SRR1299752, SRR1299755, SRR1299769,
SRR1299773, SRR1299782, SRR1300145, SRR1300148, SRR1300991.
Additional information - phylogenomic analysis
We estimated phylogenies using nucleotide and amino acid data. Maximum likelihood
(ML) analyses were performed using RAxML v7.7.7 [36] for nucleotides and ExaML
(https://github.com/stamatak/ExaML) for amino acids. For the nucleotide dataset, bootstrap
support values (BP) were calculated by creating bootstrap datasets in RAxML and estimating
trees from each dataset using the ‘-f d’ option. For the amino acid dataset, we created bootstrap
datasets in RAxML and ran ExaML bootstrap analyses on them. The bootstrap stopping criterion [39] was used to determine a
sufficient number of bootstrap replicates for each analysis. Bootstrap runs were terminated at an
interval of 50 bootstraps after the stopping criterion had been met.
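
One way to read the termination rule above is that replicates were run in batches of 50 and the stopping criterion was checked at each batch boundary. The sketch below illustrates that reading only. `run_batch` and `converged` are hypothetical callables standing in for the tree-inference step and the stopping criterion; neither is a RAxML or ExaML interface, and `max_replicates` is an assumed safety cap.

```python
from typing import Callable, List


def bootstrap_until_converged(
    run_batch: Callable[[int], List[str]],
    converged: Callable[[List[str]], bool],
    batch_size: int = 50,
    max_replicates: int = 1000,
) -> List[str]:
    """Run bootstrap replicates in fixed-size batches until `converged` is true.

    `run_batch(n)` is assumed to return n new bootstrap trees, and
    `converged(trees)` to apply the stopping criterion to all trees so far.
    """
    trees: List[str] = []
    while len(trees) < max_replicates:
        trees.extend(run_batch(batch_size))
        if converged(trees):
            break
    return trees
```

Under this reading the final replicate count is always a multiple of the batch size, which is consistent with totals such as 350 or 500.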
The reduced, 465-gene dataset was aligned as described for the larger dataset and the
ambiguously aligned regions were determined using ALISCORE v2.0 [33, 34] and removed with
ALICUT v2.2 [35]. Codon positions were identified, synonymous changes were degenerated,
and third codon positions were excluded using the same methods described above for the
2,696-gene dataset. The reduced dataset was filtered to ensure that every taxon was represented by at
least 100 bp of sequence data for each gene; this led to a total of 465 loci with a matrix that had
100% gene coverage. Partitioning schemes and models were selected using the Bayesian
information criterion (BIC) as implemented in PartitionFinder. We used the relaxed clustering
algorithm for the top 1% of
possible schemes. For the nucleotide and amino acid datasets, a total of 350 and 500 bootstrap
replicates were conducted, respectively.
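
Two of the filtering steps for the reduced dataset, exclusion of third codon positions and the 100 bp minimum per taxon per gene, can be sketched as below. The function names, the in-frame assumption, and the choice to count every character other than `-` and `?` as sequence data are assumptions for illustration. The text does not say whether the 100 bp threshold was applied before or after third positions were removed, and the degeneration of synonymous changes is not reproduced here.

```python
from typing import Dict, Iterable

# Characters treated as missing data in this sketch (an assumption).
GAP_CHARS = set("-?")


def drop_third_positions(seq: str) -> str:
    """Keep codon positions 1 and 2 of an in-frame nucleotide sequence."""
    return "".join(base for i, base in enumerate(seq) if i % 3 != 2)


def has_full_coverage(
    alignment: Dict[str, str], taxa: Iterable[str], min_bp: int = 100
) -> bool:
    """True if every taxon has at least `min_bp` non-gap characters."""
    for taxon in taxa:
        seq = alignment.get(taxon, "")
        if sum(1 for char in seq if char not in GAP_CHARS) < min_bp:
            return False
    return True
```

A locus would be retained in this sketch only if `has_full_coverage` returns true for the full taxon list, which yields a matrix with no missing genes.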