Focal species descriptions
Among the Asteraceae represented in our dataset are some of the most noxious weeds and invaders in the temperate world, including the six focal invasive taxa of our study. The species in the genus Centaurea, including C. solstitialis, C. stoebe ssp. micranthos, and C. diffusa, are native to Europe and Asia, and have successfully invaded North American rangelands, making Centaurea the most abundant noxious weed genus in western North America (Lejeune & Seastedt 2001). In fact, Centaurea is one of only 15 plant genera significantly more likely to contain invasive species in North America than expected by chance (Kuester et al. 2014). Cirsium arvense Scop. is native to temperate Eurasia, where it is one of the foremost noxious agricultural weeds (Schroeder et al. 1993). It has spread to four other continents and has become one of the most prevalent weeds in North America (Moore 1975). Species in the genus Ambrosia (ragweeds), including A. artemisiifolia and A. trifida, are native to North America and have become abundant colonizers of disturbed habitats across temperate North America, Europe and Asia (Bassett & Crompton 1975, 1982). Ragweeds are agricultural weeds and their allergenic pollen is one of the primary causes of hay fever (Laaidi et al. 2003). Lastly, thirteen members of the genus Helianthus (sunflowers), and in particular H. annuus, are globally introduced or invasive (Chamberlain and Szocs, 2013; EOL, 2014). The high levels of gene flow between members of this genus make it a primary concern for transgene escape and the evolution of a “super weed” (discussed in Lai et al. 2012).

Paralog removal
To remove potential paralogs from our alignments, we used a tree-based approach. Briefly, we constructed gene trees for each orthogroup using RAxML version 8.0.6 (Stamatakis 2006).
jModelTest 2.1.4 was run for each gene to guide the selection of the nucleotide substitution model (Darriba et al. 2012). To improve the resulting gene tree, we implemented TreeFix v1.1.8 (Wu et al. 2013), which uses the species tree topology to guide the reconstruction of the gene tree. Given a gene tree, TreeFix finds a “statistically equivalent” tree that minimizes a species tree-based cost function. Following this, we used the program NOTUNG v2.6 (Durand et al. 2006) and compared the reconciled gene tree with the species tree (see species tree reconstruction below) to identify likely paralogs. We removed sequences that were identified as paralogs using this method from the alignments. We included at most two sequences of the same species if they were in the alignments. Although this method may also eliminate genes crossing species boundaries, we preferred this conservative approach to avoid excessive numbers of false positives resulting from misidentification of orthology.

Pairwise comparisons of native and introduced transcriptomes
Specifically, we ran CODEML for the protein coding regions in runmode -2, with the F3X4 codon frequency model. Prior to the analysis we eliminated all alignments with average percent identity below 50%, as these likely represented misalignments (this procedure was conducted for all alignments, including those used in the branch-site and site-specific analyses). We also removed columns with missing data or gaps and only retained sequences with at least 150 nucleotides for the coding regions. As low divergence leads to uncertain dN/dS ratio estimates, cases where dS was below 0.01 were excluded. We also discarded orthogroups showing dS or dN > 2, indicating saturation of substitutions. Orthogroups with dN/dS > 1 were considered candidates for positive selection.
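The filtering thresholds described above can be illustrated with a minimal Python sketch. This is not the pipeline used in the study: the record type `PairwiseEstimate`, its field names, and the two functions are hypothetical, and the sketch assumes that dN, dS, ungapped alignment length, and average percent identity have already been extracted from the CODEML output for each orthogroup.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class PairwiseEstimate:
    """Hypothetical record for one pairwise comparison of an orthogroup."""
    orthogroup: str
    dn: float                # nonsynonymous substitutions per site
    ds: float                # synonymous substitutions per site
    length_nt: int           # coding nucleotides left after removing gapped columns
    percent_identity: float  # average percent identity of the alignment


def passes_filters(e: PairwiseEstimate,
                   min_identity: float = 50.0,
                   min_length_nt: int = 150,
                   min_ds: float = 0.01,
                   max_rate: float = 2.0) -> bool:
    """Apply the quality filters stated in the text, in the order given there."""
    if e.percent_identity < min_identity:   # likely misalignment
        return False
    if e.length_nt < min_length_nt:         # too short after gap removal
        return False
    if e.ds < min_ds:                       # low divergence, unreliable dN/dS
        return False
    if e.ds > max_rate or e.dn > max_rate:  # saturation of substitutions
        return False
    return True


def positive_selection_candidates(estimates: Iterable[PairwiseEstimate]) -> List[str]:
    """Return orthogroups that pass all filters and have dN/dS > 1."""
    return [e.orthogroup for e in estimates
            if passes_filters(e) and e.dn / e.ds > 1.0]


if __name__ == "__main__":
    demo = [
        PairwiseEstimate("OG1", dn=0.05, ds=0.02, length_nt=300, percent_identity=95.0),
        PairwiseEstimate("OG2", dn=0.01, ds=0.005, length_nt=300, percent_identity=99.0),
        PairwiseEstimate("OG3", dn=0.10, ds=2.50, length_nt=300, percent_identity=60.0),
        PairwiseEstimate("OG4", dn=0.01, ds=0.10, length_nt=300, percent_identity=92.0),
    ]
    print(positive_selection_candidates(demo))
```

In this made-up example only OG1 is retained as a candidate: OG2 is dropped for dS below 0.01, OG3 for saturation, and OG4 passes the filters but has dN/dS below 1. Because the dS filter runs before the ratio is computed, the division never encounters dS = 0.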
PAML branch and branch-site models
As gaps and other alignment errors have been known to generate false positives in dN/dS analyses (Fletcher & Yang 2010; Markova-Raina & Petrov 2011), each orthogroup and/or site inferred to be under positive selection was visually inspected in AliView 1.07 (Larsson 2014), and any cases of potential misalignment (e.g. errors associated with indels) were removed from the analysis. Furthermore, because nucleotide changes inferred to be under positive selection did not occur in all taxa designated as foreground, we performed an additional set of PAML analyses on the remaining orthogroups, setting as foreground only those taxa in which the changes of interest were present.

Gene Ontology analysis
We assigned GO terms to each orthogroup based on the A. thaliana GO mappings of the top hits and removed redundant GO terms. To identify the biological processes with which rapidly evolving genes were associated, we performed a GO enrichment analysis using topGO (Alexa et al. 2006). All orthogroups that were not significant in tests of positive selection were used as background. Significance for each individual GO identifier was computed with Fisher's exact test. As GO terms are non-independent, we used the parent-child method, which determines overrepresentation of terms in the context of annotations to the term's parents (Grossmann et al. 2007). This approach reduces the dependencies between the individual terms and helps avoid false positives.

References
Alexa A, Rahnenführer J, Lengauer T (2006) Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics, 22, 1600–1607.
Bassett IJ, Crompton CW (1975) The biology of Canadian weeds. 11. Ambrosia artemisiifolia L. and A. psilostachya D.C. Canadian Journal of Plant Science, 55, 463–476.
Durand D, Halldórsson B, Vernot B (2006) A hybrid micro–macroevolutionary approach to gene tree reconstruction. Journal of Computational Biology, 13, 320–335.
Fletcher W, Yang Z (2010) The effect of insertions, deletions, and alignment errors on the branch-site test of positive selection. Molecular Biology and Evolution, 27, 2257–2267.
Grossmann S, Bauer S, Robinson PN, Vingron M (2007) Improved detection of overrepresentation of Gene-Ontology annotations with parent child analysis. Bioinformatics, 23, 3024–3031.
Laaidi M, Laaidi K, Besancenot J-P, Thibaudon M (2003) Ragweed in France: An invasive plant and its allergenic pollen. Annals of Allergy, Asthma & Immunology, 91, 195–201.
Larsson A (2014) AliView: a fast and lightweight alignment viewer and editor for large datasets. Bioinformatics, 1–3.
Lejeune K, Seastedt T (2001) Centaurea species: The forb that won the west. Conservation Biology, 15, 1568–1574.
Markova-Raina P, Petrov D (2011) High sensitivity to aligner and high rate of false positives in the estimates of positive selection in the 12 Drosophila genomes. Genome Research, 21, 863–874.
Moore R (1975) The biology of Canadian weeds. 13. Cirsium arvense (L.). Canadian Journal of Plant Science, 55, 1033–1048.
Schroeder D, Stinson CSA, Station E (1993) A European weed survey in 10 major crop systems to identify targets for biological control, 33, 449–459.
Wu Y, Rasmussen D, Bansal M, Kellis M (2013) TreeFix: Statistically informed gene tree error correction using species trees. Systematic Biology, 62, 110–120.