Electronic Supplementary Material (ESM)

Kawahara and Breinholt “Phylogenomics provides strong evidence for relationships of butterflies and moths”

Additional information - sample collection, RNA extraction, and library construction

We generated 33 new transcriptomes from purified RNA extracted from tissues that were flash-frozen or stored in RNAlater Stabilization Reagent (catalog no. 76106, Qiagen). Specimens were collected at seven localities on three continents (electronic supplementary material, table S1). For large specimens, flight muscle tissue was removed from the thorax and placed in a 1.5 mL tube containing RNAlater. For small moths, the entire specimen was placed in a 1.5 mL tube containing RNAlater. Tissues were macerated using an autoclaved plastic pestle (catalog no. 1450-5390, USA Scientific), and the remaining body regions were placed in a tube containing 100% EtOH to serve as an additional molecular voucher. RNAlater-preserved tissues are stored in freezers at the McGuire Center for Lepidoptera and Biodiversity at the Florida Museum of Natural History (FLMNH), and vouchers are catalogued in a reference database at the FLMNH.

The SV Total RNA Isolation System (catalog no. Z3100; Promega) and the NucleoSpin RNA II Total RNA Isolation kit (catalog no. 740955; Clontech) were used to generate RNA extracts. RNA sample quality was assessed with an Agilent 2100 Bioanalyzer and a Qubit 2.0 Fluorometer. Multiplexed libraries with twelve different barcode adaptor combinations were constructed with the TruSeq RNA Sample Prep Kit v2 (catalog no. RS-122-2001, Illumina) in the Kawahara Lab at the FLMNH. HiSeq 2000 runs were performed as 100 bp SE, 100 bp PE, or 150 bp PE reads with up to 12 samples per lane at the DNA Core facilities of the University of Missouri (Columbia, MO) and Florida State University (Tallahassee, FL) (for sequencing details of each transcriptome, see electronic supplementary material, table S1).
Using 6,568 putative single-copy orthologous genes identified with OrthoDB from the reference genomes, we constructed a custom ortholog set (here called LEP1-COS) for use in HaMStR v8 [29] to extend the ortholog search to non-reference taxa. In HaMStR we used the ‘-representative’ option to predict a single sequence for each LEP1-COS locus from the transcriptomic, genomic, or EST data of each taxon. This strategy picks the best hit to the reference protein and then concatenates non-overlapping hits to increase the predicted sequence length. We also tested whether HaMStR could identify all 6,568 loci in LEP1-COS from each reference taxon. LEP1-COS genes that HaMStR could not identify in all four of the reference taxa were not included in phylogenomic analyses. Genes that were represented in fewer than 41 of 46 taxa (< 90% of taxa) were also excluded from the data matrix.

We built the data matrix from the 33 new transcriptomes and from published transcriptomes that were assembled and processed from FASTQ files downloaded from the GenBank SRA database. These published taxa and their GenBank SRA codes are: Actias luna (SRR1002974) [21], Cnaphalocrocis medinalis (SRR647910-SRR647915) [66], Enyo lugubris (SRR1002983) [21], Hemaris diffinis (SRR1002987) [21], Grapholita dimorpha (SRR803483) [66], Papilio glaucus (SRR850324), and P. polytes (SRR850327) [67]. We also incorporated data for Bicyclus anynana from the GenBank EST database [68]. To exclude problematic signal that could be associated with saturation and rate heterogeneity, nucleotides were degenerated using the degen v1.4 Perl script [69, 70] and third codon positions were removed.

The 33 newly assembled transcriptomes and the custom Lepidoptera core ortholog set (LEP1-COS) are available from the Dryad Digital Repository (http://datadryad.org; accession doi:10.5061/dryad.qd27g).
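The two matrix-level filters described above (the taxon-occupancy cutoff of 41 of 46 taxa and the removal of third codon positions) can be sketched as follows. This is only an illustrative sketch, not the pipeline used in the study: the function names, the dictionary layout, and the assumption that each alignment row is in frame starting at codon position 1 are all hypothetical. Degeneration of synonymous changes, which the study performed with the degen v1.4 Perl script, is not reproduced here.

```python
from typing import Dict, Set


def filter_by_occupancy(gene_to_taxa: Dict[str, Set[str]],
                        min_taxa: int = 41) -> Dict[str, Set[str]]:
    """Keep genes represented in at least `min_taxa` taxa.

    `gene_to_taxa` is a hypothetical mapping from a locus name to the set
    of taxa in which that locus was recovered.
    """
    return {gene: taxa for gene, taxa in gene_to_taxa.items()
            if len(taxa) >= min_taxa}


def drop_third_positions(aligned_cds: str) -> str:
    """Remove every third site from one row of a codon alignment.

    Assumes the row is in frame, i.e. index 0 is codon position 1.
    """
    return "".join(base for i, base in enumerate(aligned_cds) if i % 3 != 2)
```

With 46 taxa in total, a locus recovered from 40 taxa would be dropped by `filter_by_occupancy`, while loci recovered from 41 or more would be retained.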
Raw Illumina reads used for transcriptome assembly have been submitted to GenBank (BioProject PRJNA248471) and the GenBank Sequence Read Archive (SRA) database. SRA accession numbers are: SRR1298384, SRR1299208-SRR1299214, SRR1299217, SRR1299267, SRR1299274, SRR1299296, SRR1299306, SRR1299316-SRR1299318, SRR1299347, SRR1299369, SRR1299394, SRR1299418, SRR1299435, SRR1299495, SRR1299746, SRR1299750-SRR1299752, SRR1299755, SRR1299769, SRR1299773, SRR1299782, SRR1300145, SRR1300148, SRR1300991.

Additional information - phylogenomic analysis

We estimated phylogenies using nucleotide and amino acid data. Maximum likelihood (ML) analyses were performed using RAxML v7.7.7 [36] for nucleotides and ExaML (https://github.com/stamatak/ExaML) for amino acids. For the nucleotide dataset, bootstrap support values (BP) were calculated by creating bootstrap datasets in RAxML and estimating trees from each dataset using the ‘-f d’ option. For the amino acid dataset, we created bootstrap datasets in RAxML and ran ExaML bootstrap analyses. The bootstrap stopping criterion [39] was used to determine a sufficient number of bootstrap replicates for each analysis. Bootstrap runs were terminated at an interval of 50 bootstraps after the stopping criterion had been met.

The reduced, 465-gene dataset was aligned as described for the larger dataset, and the ambiguously aligned regions were determined using ALISCORE v2.0 [33, 34] and removed with ALICUT v2.2 [35]. Codon positions were identified, synonymous changes were degenerated, and third codon positions were excluded using the same methods described above for the 2,696-gene dataset. The reduced dataset was filtered to ensure that every taxon was represented by at least 100 bp of sequence data for each gene; this led to a total of 465 loci with a matrix that had 100% gene coverage. Partitions and models were estimated using the BIC criterion as implemented in PartitionFinder. We used the relaxed clustering algorithm for the top 1% of possible schemes.
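The coverage filter applied to the reduced dataset (every taxon represented by at least 100 bp for each gene) could be expressed roughly as below. This is a minimal sketch under stated assumptions rather than the authors' implementation: the function names and the nested-dictionary input are hypothetical, and sequence length is assumed to be counted as the number of characters other than gap ('-') and missing-data ('?') symbols.

```python
from typing import Dict, Iterable, List

# Assumption: these symbols do not count toward sequence length.
MISSING_CHARS = set("-?")


def covered_length(seq: str) -> int:
    """Count sites in an alignment row that carry sequence data."""
    return sum(1 for ch in seq if ch not in MISSING_CHARS)


def genes_with_full_coverage(alignments: Dict[str, Dict[str, str]],
                             taxa: Iterable[str],
                             min_bp: int = 100) -> List[str]:
    """Return genes for which every taxon has at least `min_bp` of data.

    `alignments` is a hypothetical mapping: gene name -> {taxon -> row}.
    A taxon absent from a gene is treated as having zero coverage.
    """
    taxa = list(taxa)
    return [gene for gene, rows in alignments.items()
            if all(covered_length(rows.get(t, "")) >= min_bp for t in taxa)]
```

Any gene in which even one taxon falls below the threshold is discarded, which is why the resulting matrix has complete gene coverage across taxa.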
For the nucleotide and amino acid datasets, a total of 350 and 500 bootstrap replicates were conducted, respectively.