Disentangling methodological and biological sources of gene tree discordance on Oryza (Poaceae) chromosome 3

APPENDIX 1
BOS Filtering Protocol

Orthology of annotated gene regions was determined using a three-step procedure we call BLAST-Overlap-Synteny filtering (BOS). This procedure was used twice in our pipeline, both before and after “annotation adjustment” (see online Appendix 2).

First, an all vs. all BLASTp analysis (Altschul et al. 1997) was run using protein translations of the set of annotated genes across all 11 genomes (command line “blastall -F mS -s T -p blastp -e 1.0e-10”, as recommended by Robbertse et al. 2011). This set of BLAST hits was filtered further for appropriate levels of BLAST HSP (high-scoring segment pair) overlap using the method of McMahon and Sanderson (2006; parameters: “query hit fraction”, “subject hit fraction” and “hit span fraction” all equal to 0.5), henceforth MS2006.

The final filtering step used DAGchainer (Haas et al. 2004) to take advantage of the high levels of synteny seen between our genomes to eliminate possibly spurious hits and nonsyntenic paralogs. For the DAGchainer analyses we used gene order information rather than nucleotide coordinates on the chromosome (default settings except gaplength=1, max distance between matches=2, min chain length=4). All 11 genomes were compared pairwise, and BLAST hits identified by DAGchainer as being part of a syntenic chain of homologous loci in one or more pairwise comparisons were retained. All other unchained BLAST hits within 5 genes (in terms of Euclidean distance) of a chained hit were also retained.

To define the Euclidean distance between BLAST hits, assume two genomes, X and Y, and a BLAST hit between them, c, that has been included in a syntenic chain by DAGchainer. Define (c_x, c_y) as the “coordinate” of that BLAST hit, i.e., the hit is between the x’th gene in the X genome and the y’th gene in the Y genome.
The Euclidean distance between c and another unchained BLAST hit h, at coordinate (h_x, h_y), is then

$$d(c, h) = \sqrt{(c_x - h_x)^2 + (c_y - h_y)^2}$$

Any hit h was retained if d(c, h) ≤ 5. All retained hits were then input into a single-linkage clustering algorithm to obtain ortholog clusters.

Literature cited

Altschul S.F., Madden T.L., Schäffer A.A., Zhang J., Zhang Z., Miller W., Lipman D.J. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389–3402.
Haas B.J., Delcher A.L., Wortman J.R., Salzberg S.L. 2004. DAGchainer: a tool for mining segmental genome duplications and synteny. Bioinformatics. 20:3643–3646.
McMahon M.M., Sanderson M.J. 2006. Phylogenetic supermatrix analysis of GenBank sequences from 2228 papilionoid legumes. Syst. Biol. 55:818–836.
Robbertse B., Yoder R.J., Boyd A., Reeves J., Spatafora J.W. 2011. Hal: an automated pipeline for phylogenetic analyses of genomic data. PLoS Curr. 3:RRN1213.

APPENDIX 2
Annotation Adjustment Procedure

Using the set of BLASTp hits filtered with the BOS protocol (omitting the MS2006 overlap test; see online Appendix 1), the following steps are performed for each pairing of O. sativa subsp. japonica with one of the other genomes in our dataset:

1. Identify cases in which multiple O. sativa subsp. japonica genes have BLAST hits to a single gene in another genome, or multiple genes in another genome have BLAST hits to a single gene in O. sativa subsp. japonica. Together these are termed “reannotation clusters”.
2. Abort reannotation of a given cluster if:
   a. Any genes show within-genome similarity (likely indicative of tandem duplication).
   b. Multiple genes are included in the cluster for both O. sativa subsp. japonica and the other genome.
   c. Multiple genes contained in a single genome are on different strands or have more than 5 intervening genes between them.
      Fused genes can be non-contiguous if a number of genes that would fall between them are missing due to sequencing or assembly error.
3. Extract the full sequence of all genes in the reannotation cluster and align all intergenomic gene pairs with the EMBOSS Needleman-Wunsch pairwise alignment routine, using parameters that encourage long gapped regions (needle parameters: gap open=50, gap extend=0.5).
4. Working inward from the ends of each of the shorter sequences, find the two outermost alignment windows of length 40 bp with 80% sequence identity. If no such regions exist, abort reannotation of the cluster.
5. Collect all annotated introns and exons that fully or partially overlap the aligned region.
6. In the case of creating multiple annotated genes from one, discard any introns not bounded on both sides by an exon.
7. In the case of creating one annotation from multiple, make a new pseudo-intron out of what was previously annotated as an intergenic region.

When annotations are altered, the coding frame phases are adjusted such that the new gene sequences translate properly, although they may lack an annotated initiation and/or termination codon.

APPENDIX 3
Alignment Block-shift Detection

The block-shift alignment test compares positions in each sequence to the corresponding column consensuses. An alignment column is considered to have a meaningful consensus if at least m sequences have a nucleotide at that position and more than a proportion c of those nucleotides are of the same state. A sliding window of length N is moved along each individual sequence in an alignment, noting the positions at which it differs from the consensus at each alignment column. If a proportion p or greater of the nucleotides within the window differ from their column consensuses, the region is tagged as poorly aligned (a block-shift) for that individual sequence.
The actual region tagged is always bounded by sites identified as mismatches, so the region may be shorter than the window size N, with a minimum length of N × p. The particular parameter values used for the detection algorithm were chosen based on experimentation (m=4, c=0.5, N=10, p=0.5).

Using this method, alignments could then be categorized, or “flagged”, by whether they contained a block-shift in any specified subset of taxa. BLSMASK alignments were created by masking block-shift regions with N’s in individual sequences. Masking was not done in the divergent outgroup O. brachyantha, for which initial tests suggested that the method was detecting spurious block-shifts due to true sequence divergence between homologous regions.
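
The block-shift test and the masking step described in this appendix can be illustrated with a minimal Python sketch. It is not the code used in the study, and it rests on several assumptions that the text does not settle: the window slides over alignment columns; gaps, ambiguity codes, and columns without a meaningful consensus are never counted as mismatches; the mismatch proportion is taken over the full window length N; and gap characters inside a tagged region are left unmasked. The function names (`column_consensus`, `block_shifts`, `mask_block_shifts`) are hypothetical. The defaults follow the definitions above, with m treated as a sequence count (4) and c as a proportion (0.5).

```python
from collections import Counter

GAP = "-"
BASES = "ACGT"


def column_consensus(alignment, m=4, c=0.5):
    """Return one consensus base per column, or None where the column has no
    meaningful consensus (fewer than m nucleotides, or no state above c)."""
    consensus = []
    for column in zip(*alignment):
        bases = [b for b in column if b in BASES]
        if not bases or len(bases) < m:
            consensus.append(None)
            continue
        base, count = Counter(bases).most_common(1)[0]
        consensus.append(base if count / len(bases) > c else None)
    return consensus


def block_shifts(seq, consensus, N=10, p=0.5):
    """Return half-open (start, end) column intervals tagged as block-shifts."""
    # Assumption: only unambiguous nucleotides in columns with a consensus
    # can count as mismatches.
    mismatch = [cons is not None and b in BASES and b != cons
                for b, cons in zip(seq, consensus)]
    tagged = [False] * len(seq)
    for start in range(len(seq) - N + 1):
        hits = [i for i in range(start, start + N) if mismatch[i]]
        if hits and len(hits) >= p * N:
            # Tag only the span bounded by the first and last mismatch.
            for i in range(hits[0], hits[-1] + 1):
                tagged[i] = True
    # Merge tagged columns into contiguous intervals.
    intervals, begin = [], None
    for i, flag in enumerate(tagged + [False]):
        if flag and begin is None:
            begin = i
        elif not flag and begin is not None:
            intervals.append((begin, i))
            begin = None
    return intervals


def mask_block_shifts(seq, intervals):
    """Replace tagged nucleotides with 'N' (assumption: gaps are kept)."""
    chars = list(seq)
    for start, end in intervals:
        for i in range(start, end):
            if chars[i] != GAP:
                chars[i] = "N"
    return "".join(chars)


# Toy alignment: four identical sequences plus one whose columns 6-15
# are shifted by one base relative to the others.
ref = "ACGT" * 6
shifted = ref[:6] + ref[7:17] + ref[16:]
alignment = [ref, ref, ref, ref, shifted]

consensus = column_consensus(alignment)
intervals = block_shifts(shifted, consensus)
print(intervals)
print(mask_block_shifts(shifted, intervals))
```

In this toy case the shifted sequence mismatches the consensus at every column of the displaced block, so the whole block is tagged and masked while the flanking columns are untouched. To skip a divergent outgroup, as was done for O. brachyantha, a caller would simply not apply `mask_block_shifts` to that sequence.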