
Disentangling methodological and biological sources of gene tree discordance on
Oryza (Poaceae) chromosome 3
APPENDIX 1
BOS Filtering Protocol
Orthology of annotated gene regions was determined using a three-step procedure
we call BLAST-Overlap-Synteny filtering (BOS). This procedure was used twice in our
pipeline, both before and after “annotation adjustment” (see online Appendix 2). First, an
all vs. all BLASTp analysis (Altschul et al. 1997) was run using protein translations of
the set of annotated genes across all 11 genomes (command line “blastall -F mS -s T -p
blastp -e 1.0e-10”, as recommended by Robbertse et al. 2011). This set of BLAST hits
was filtered further for appropriate levels of BLAST HSP (high-scoring segment pairs)
overlap using the method of McMahon and Sanderson (2006; parameters: “query hit
fraction”, “subject hit fraction” and “hit span fraction” all equal to 0.5), henceforth
MS2006. The final filtering step used DAGchainer (Haas et al. 2004) to take advantage
of the high levels of synteny seen between our genomes to eliminate possibly spurious
hits and nonsyntenic paralogs. For the DAGchainer analyses we used gene order
information rather than nucleotide coordinates on the chromosome (default settings except
gaplength=1, max distance between matches=2, min chain length=4). All 11 genomes
were compared pairwise, and BLAST hits identified by DAGchainer as being part of a
syntenic chain of homologous loci in one or more pairwise comparisons were retained.
All other unchained BLAST hits within 5 genes (in terms of Euclidian distance) of a
chained hit were also retained. To define the Euclidian distance between BLAST hits,
assume two genomes, X and Y, and a BLAST hit between them, c, that has been included
in a syntenic chain by DAGchainer. Define (cx, cy) as the “coordinate” of that BLAST
hit, i.e., the hit is between the x’th gene in the X genome and the y’th gene in the Y
genome. The Euclidian distance between c and another unchained BLAST hit h, at
coordinate (hx, hy), is then
$$d(c, h) = \sqrt{(c_x - h_x)^2 + (c_y - h_y)^2}$$
Any hit h was retained if d(c, h) ≤ 5. All retained hits were then input into a
single-linkage clustering algorithm to obtain ortholog clusters.
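The retention rule and the clustering step can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function names, the representation of a hit as an (x, y) pair of gene-order indices, and the use of a union-find structure for single-linkage clustering are all assumptions made for the example.

```python
import math
from collections import defaultdict


def retain_hits(chained, unchained, max_dist=5.0):
    """Keep all chained hits plus unchained hits within max_dist of a chained hit.

    Each hit is an (x, y) tuple of gene-order indices in genomes X and Y
    (hypothetical representation).
    """
    kept = list(chained)
    for h in unchained:
        if any(math.dist(c, h) <= max_dist for c in chained):
            kept.append(h)
    return kept


def single_linkage(pairs):
    """Group gene identifiers connected by any chain of retained hits."""
    parent = {}

    def find(g):
        parent.setdefault(g, g)
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path halving
            g = parent[g]
        return g

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    clusters = defaultdict(set)
    for g in list(parent):
        clusters[find(g)].add(g)
    return sorted(sorted(members) for members in clusters.values())
```

In this sketch a hit at distance exactly 5 from a chained hit is kept, matching the "≤ 5" condition, and two genes fall in the same cluster whenever any path of retained hits connects them.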
Literature cited
Altschul S.F., Madden T.L., Schäffer A.A., Zhang J., Zhang Z., Miller W., Lipman D.J.
1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs. Nucleic Acids Res. 25:3389–3402.
Haas B.J., Delcher A.L., Wortman J.R., Salzberg S.L. 2004. DAGchainer: a tool for
mining segmental genome duplications and synteny. Bioinformatics. 20:3643–3646.
McMahon M.M., Sanderson M.J. 2006. Phylogenetic supermatrix analysis of GenBank
sequences from 2228 papilionoid legumes. Syst. Biol. 55:818–836.
Robbertse B., Yoder R.J., Boyd A., Reeves J., Spatafora J.W. 2011. Hal: an automated
pipeline for phylogenetic analyses of genomic data. PLoS Curr. 3:RRN1213.
APPENDIX 2
Annotation Adjustment Procedure
Using the set of BLASTp hits filtered with the BOS protocol (omitting the MS2006
overlap test; see online Appendix 1), the following steps are performed for each pairing
of O. sativa subsp. japonica with one of the other genomes in our dataset:
1. Identify cases in which multiple O. sativa subsp. japonica genes have BLAST hits
to a single gene in another genome, or multiple genes in another genome have
BLAST hits to a single gene in O. sativa subsp. japonica. Together these are termed
“reannotation clusters”.
2. Abort reannotation of a given cluster if:
a. Any genes show within-genome similarity (likely indicative of tandem
duplication)
b. Multiple genes are included in the cluster for both O. sativa subsp.
japonica and the other genome
c. Multiple genes contained in a single genome are on different strands or
have more than 5 intervening genes between them. Fused genes can be
non-contiguous if a number of genes that would fall between them are
missing due to sequencing or assembly error.
3. Extract the full sequence of all genes in the reannotation cluster, align all
intergenomic gene pairs with the EMBOSS Needleman-Wunsch pairwise
alignment routine, using parameters that encourage long gapped regions (needle
parameters: gap open=50, gap extend=0.5).
4. Working inward from the ends of each of the shorter sequences, find the two
outermost alignment windows of length 40 bp with 80% sequence identity. If no
such regions exist, abort reannotation of the cluster.
5. Collect all annotated introns and exons that fully or partially overlap the aligned
region.
6. In the case of creating multiple annotated genes from one, discard any introns not
bounded on both sides by an exon.
7. In the case of creating one annotation from multiple, make a new pseudo-intron
out of what was previously annotated as an intergenic region.
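Steps 1 and 2 of the list above can be sketched as follows. This is an illustrative reading rather than the original implementation: the function names, the data layouts (hits as gene-identifier pairs, a `genes` dictionary of strand and gene-order index, a set of within-genome similar pairs) are hypothetical, and the sketch assumes that "intervening genes" are counted between neighboring cluster members in gene order.

```python
from collections import defaultdict


def find_clusters(hits):
    """Step 1: connected groups of hits with >1 gene from either genome.

    hits: iterable of (japonica_gene, other_gene) pairs for one genome pairing.
    Returns a list of (japonica_genes, other_genes) tuples of sorted lists.
    """
    adj = defaultdict(set)
    for j, o in hits:
        adj[("J", j)].add(("O", o))
        adj[("O", o)].add(("J", j))
    seen, clusters = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(adj[node] - comp)
        seen |= comp
        jap = sorted(g for side, g in comp if side == "J")
        oth = sorted(g for side, g in comp if side == "O")
        if len(jap) > 1 or len(oth) > 1:
            clusters.append((jap, oth))
    return clusters


def abort_reason(jap, oth, genes, similar_pairs, max_intervening=5):
    """Step 2: return 'a', 'b' or 'c' for the first abort rule met, else None.

    genes: dict gene -> (strand, gene_order_index)  (hypothetical layout)
    similar_pairs: set of frozensets of same-genome genes that hit each other
    """
    for group in (jap, oth):
        for i, a in enumerate(group):
            for b in group[i + 1:]:
                if frozenset((a, b)) in similar_pairs:
                    return "a"  # within-genome similarity
    if len(jap) > 1 and len(oth) > 1:
        return "b"  # multiple genes on both sides
    multi = jap if len(jap) > 1 else oth
    if len({genes[g][0] for g in multi}) > 1:
        return "c"  # different strands
    pos = sorted(genes[g][1] for g in multi)
    if any(q - p - 1 > max_intervening for p, q in zip(pos, pos[1:])):
        return "c"  # too many intervening genes
    return None
```

A cluster for which `abort_reason` returns `None` would proceed to the alignment in step 3.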
When annotations are altered the coding frame phases are adjusted such that the new
gene sequences translate properly, although they may lack an annotated initiation and/or
termination codon.
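The window search in step 4 can be illustrated with a short sketch. Several details are assumptions made for the example and are not stated in the text: windows are measured in alignment columns, a column counts as identical only when both sequences carry the same non-gap character, and the 80% figure is treated as a minimum. The function name is hypothetical.

```python
def outermost_windows(aln_a, aln_b, width=40, min_identity=0.8):
    """Find the outermost high-identity windows in a pairwise alignment.

    aln_a, aln_b: gapped, equal-length aligned sequences ('-' for gaps).
    Returns (start, end) alignment columns spanning from the leftmost to the
    rightmost qualifying window (end exclusive), or None if no window
    qualifies, in which case reannotation would be aborted.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    match = [int(a == b and a != "-")
             for a, b in zip(aln_a.upper(), aln_b.upper())]
    n = len(match)
    if n < width:
        return None
    count = sum(match[:width])  # matches in the current window
    starts = []
    for s in range(n - width + 1):
        if s > 0:
            # slide the window one column to the right
            count += match[s + width - 1] - match[s - 1]
        if count / width >= min_identity:
            starts.append(s)
    if not starts:
        return None
    return starts[0], starts[-1] + width
```

Annotated introns and exons overlapping the returned column range would then be the ones collected in step 5.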
APPENDIX 3
Alignment Block-shift Detection
The block-shift alignment test compares positions in each sequence to the
corresponding column consensuses. An alignment column is considered to have a
meaningful consensus if at least m sequences have a nucleotide at that position and more
than a proportion c of those nucleotides are of the same state. A sliding window of length
N is moved along each individual sequence in an alignment, noting the positions at which
it differs from the consensus at each alignment column. If a proportion p or greater of the
nucleotides within the window differ from their column consensuses, the region is tagged
as poorly aligned (a block-shift) for that individual sequence. The actual region tagged is
always bounded by sites identified as mismatches, so the region may be shorter than the
window size N, with a minimum length of N × p. The particular parameter values used
for the detection algorithm were chosen based on experimentation (m=4, c=0.5, N=10,
p=0.5).
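The detection procedure can be sketched as follows. This is one possible reading, not the original implementation: the function names are hypothetical, the window is moved over alignment columns, gaps and non-ACGT characters are never counted as mismatches, and overlapping tagged regions are merged. The defaults take m as a sequence count (4) and c as a proportion (0.5), following the definitions given in the text.

```python
from collections import Counter


def column_consensus(seqs, m=4, c=0.5):
    """Return the consensus base per column, or None where it is not meaningful."""
    cons = []
    for col in zip(*seqs):
        bases = [b for b in col if b in "ACGT"]
        if len(bases) >= m:
            state, n = Counter(bases).most_common(1)[0]
            cons.append(state if n / len(bases) > c else None)
        else:
            cons.append(None)
    return cons


def block_shifts(seq, cons, N=10, p=0.5):
    """Return merged (start, end) column ranges (end exclusive) tagged as block-shifts."""
    mism = [b in "ACGT" and k is not None and b != k
            for b, k in zip(seq, cons)]
    regions = []
    for s in range(len(seq) - N + 1):
        hits = [i for i in range(s, s + N) if mism[i]]
        if len(hits) / N >= p:
            # the tagged region is bounded by the first and last mismatch
            start, end = hits[0], hits[-1] + 1
            if regions and start <= regions[-1][1]:
                regions[-1] = (regions[-1][0], max(regions[-1][1], end))
            else:
                regions.append((start, end))
    return regions
```

Because each tagged region runs from the first to the last mismatch inside a qualifying window, the shortest possible region is N × p columns, as stated above.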
Using this method, alignments could then be categorized, or “flagged”, by
whether they contained a block-shift in any specified subset of taxa. BLSMASK
alignments were created by masking block-shift regions with N’s in individual sequences.
Masking was not done in the divergent outgroup O. brachyantha, for which initial tests
suggested that the method was detecting spurious block-shifts due to true sequence
divergence between homologous regions.
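Masking of tagged regions can be sketched in a few lines. The function name, the dictionary layouts, and the taxon label "O_brachyantha" are hypothetical, and leaving gap characters unmasked is an assumption of this example rather than something stated in the text.

```python
def mask_block_shifts(alignment, regions, skip=("O_brachyantha",)):
    """Replace block-shift regions with N's, leaving skipped taxa untouched.

    alignment: dict taxon -> aligned sequence
    regions: dict taxon -> list of (start, end) column ranges, end exclusive
    """
    masked = {}
    for name, seq in alignment.items():
        chars = list(seq)
        if name not in skip:
            for start, end in regions.get(name, []):
                for i in range(start, end):
                    if chars[i] != "-":  # keep alignment gaps as gaps
                        chars[i] = "N"
        masked[name] = "".join(chars)
    return masked
```

Each sequence keeps its original length, so the masked alignment remains column-compatible with the unmasked one.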