tpj12569-sup-0004-AppendixS1

Appendix 1
Additional experimental procedure details
Genomic DNA sequence generation and variant analysis
DNA was randomly sheared into ~250bp fragments and the resulting fragments were used to
create an Illumina library. This library was sequenced on an Illumina HiSeq2000 sequencing system,
generating 73-100bp paired-end reads. MAQ-0.7.1 (Li and Durbin 2009) and SOAP (Li et al.
2009b) packages were used to map quality filtered Illumina reads to the 8x genome assembly of the
Bd21 reference strain. We required read support from at least three reads for identifying variants
through the MAQ pipeline. Tracks of genomic positions with less than three reads were extracted and
merged within 100bp to identify potential deletions. We used custom perl scripts to generate the high
confidence SNP set, containing SNPs found in both MAQ and SOAP pipelines. SNPs common to the MAQ
and SOAP pipelines were further filtered to remove SNPs in which the consensus base was ambiguous
and SNPs in which the reference base was the most common allele (potential false positives).
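The low-coverage merging step described above can be sketched as follows. This is a minimal illustration, not the custom scripts used in the study: the function name, the input layout (sorted position/depth pairs for one chromosome) and the reading of "merged within 100bp" as "join low-coverage positions no more than 100 bp apart" are all assumptions.

```python
from typing import Iterable, List, Tuple


def merge_low_coverage(
    depths: Iterable[Tuple[int, int]],
    min_reads: int = 3,
    max_gap: int = 100,
) -> List[Tuple[int, int]]:
    """Merge positions with fewer than min_reads reads into intervals.

    depths: (position, read_depth) pairs for one chromosome, sorted by position.
    Returns inclusive (start, end) intervals. Low-coverage positions lying no
    more than max_gap bp apart are joined into one interval (an assumed
    interpretation of the merging rule).
    """
    intervals: List[Tuple[int, int]] = []
    for pos, depth in depths:
        if depth >= min_reads:
            continue  # position is adequately covered
        if intervals and pos - intervals[-1][1] <= max_gap:
            # Close enough to the previous low-coverage position: extend it.
            intervals[-1] = (intervals[-1][0], pos)
        else:
            intervals.append((pos, pos))
    return intervals


if __name__ == "__main__":
    example = [(1, 10), (2, 0), (3, 1), (4, 12), (50, 2), (400, 0), (401, 9)]
    print(merge_low_coverage(example))
```

Each returned interval is only a candidate deletion; in practice it would still need to be checked against the other variant calls.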
Putative SVs were called using BreakDancer, filtering for a confidence score of >90 (Chen et al.
2009). IMR/DENOM (Gan et al. 2011) was used in conjunction with the SOAP assembler and the BWA
aligner (Li and Durbin 2009). Ambiguous and heterozygous SNPs were removed from IMR/DENOM
output and variants from the pipeline were incorporated into the reference sequence using MCMERGE
to create line-specific genomes. ACT was used to generate variant saturation and correlation plots (Jee
et al. 2011). BEDTools was used to identify variable intersects between indels and other features as well
as calculate variant frequency among genomic windows (Quinlan and Hall 2010). To identify SNPs for
population divergence estimates we used samtools mpileup (Li et al. 2009a) to output counts for each
base for all genomic positions, requiring a minimum read mapping quality of 29, while keeping track of
genomic positions lacking sufficient sequence information. We further filtered the mpileup output for
each genomic position, requiring coverage of 5-200 reads. A total of 234,045,216 genomic positions met the
above criteria for at least two resequenced lines in addition to the Bd21 reference (~86% of the Bd21
reference genome). To count towards a particular base assignment in a given sample we required a
minimum base quality of phred 30. For subsequent comparative analysis we required the consensus
base to contribute to at least 60 percent of the total base calls for a given position in a sample or the
position was excluded as uninformative. A custom perl script was used to assign the consensus alternate
base for each line and compare nucleotides at each position while omitting positions with missing data
to calculate pairwise distance matrices using a 250kb sliding window with 100kb step size. A similar
script was used to calculate nucleotide diversity. The PHYLIP package, including the FITCH, CONSENSE
and DRAWTREE programs, was used to generate phylogenetic trees (Felsenstein 1989). Concordance
between large indels and gene expression was assessed using BEDTools (Quinlan and Hall 2010) and
custom perl scripts.
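As an illustration of the consensus-base filter and the sliding-window distance calculation described above, a minimal Python sketch is given below. It is not the custom Perl script used in the study. The function names and data layout are hypothetical, base counts are assumed to have already been filtered for mapping and base quality, and the distance is taken to be the proportion of differing sites among positions called in both lines.

```python
from bisect import bisect_left, bisect_right
from typing import Dict, List, Optional, Tuple


def consensus_base(
    counts: Dict[str, int],
    min_cov: int = 5,
    max_cov: int = 200,
    min_frac: float = 0.6,
) -> Optional[str]:
    """Return the consensus base at one position, or None if uninformative."""
    total = sum(counts.values())
    if not min_cov <= total <= max_cov:
        return None  # coverage outside the accepted range
    base, top = max(counts.items(), key=lambda item: item[1])
    return base if top / total >= min_frac else None


def windowed_distance(
    calls_a: Dict[int, str],
    calls_b: Dict[int, str],
    chrom_len: int,
    window: int = 250_000,
    step: int = 100_000,
) -> List[Tuple[int, int, int, Optional[float]]]:
    """Pairwise distance per window between two lines on one chromosome.

    calls_a, calls_b: position -> consensus base; positions with missing
    data are simply absent and are therefore never compared.
    Returns (start, end, sites_compared, distance) tuples; distance is None
    when a window contains no site called in both lines.
    """
    shared = sorted(calls_a.keys() & calls_b.keys())
    mismatch = [calls_a[p] != calls_b[p] for p in shared]
    results = []
    for start in range(1, chrom_len + 1, step):
        end = min(start + window - 1, chrom_len)
        lo = bisect_left(shared, start)
        hi = bisect_right(shared, end)
        compared = hi - lo
        diffs = sum(mismatch[lo:hi])
        results.append((start, end, compared, diffs / compared if compared else None))
    return results
```

Running `windowed_distance` for every pair of lines and collecting the per-window values would give one distance matrix per window, in the form expected by distance-based tree programs.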
We functionally annotated variants using the Phytozome 8 gene annotation and incorporated
SNPs and small indels into the Bd21 reference sequence to generate conservative synthetic genome
sequences for each line. Bd21 transcripts were mapped to the synthetic genome sequences with BLAT
and further processed with custom scripts to annotate gene models.
To obtain population genetic statistics separately for each protein-coding gene, the annotated
coding sequences were aligned with ClustalW2 (Larkin et al. 2007). Sites with an alignment gap in any
sequence were removed with a custom Perl script, and Tajima’s D was calculated with the PopGen module
(Stajich and Hahn 2005) in BioPerl (Stajich et al. 2002).
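For readers unfamiliar with the statistic, the gap removal and Tajima's D calculation can be sketched in Python as follows. This is an independent illustration of the standard formula, not the BioPerl PopGen code used in the study; the function names are hypothetical, sequences are assumed to be aligned and of equal length, and a site is counted as segregating whenever more than one base is observed in its column.

```python
from itertools import combinations
from math import sqrt
from typing import List, Optional


def drop_gap_columns(seqs: List[str]) -> List[str]:
    """Remove every alignment column in which any sequence has a gap."""
    kept = [col for col in zip(*seqs) if "-" not in col]
    return ["".join(col[i] for col in kept) for i in range(len(seqs))]


def tajimas_d(seqs: List[str]) -> Optional[float]:
    """Tajima's D for gap-free aligned sequences; None when undefined."""
    n = len(seqs)
    if n < 4:
        return None  # the variance term is degenerate for very small samples
    seg_sites = sum(1 for col in zip(*seqs) if len(set(col)) > 1)
    if seg_sites == 0:
        return None
    # Mean number of pairwise differences (pi).
    pairs = list(combinations(seqs, 2))
    pi = sum(sum(x != y for x, y in zip(s, t)) for s, t in pairs) / len(pairs)
    # Constants of the standard formula.
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    if variance <= 0:
        return None
    return (pi - seg_sites / a1) / sqrt(variance)
```

A typical call would be `tajimas_d(drop_gap_columns(aligned_cds))`, where `aligned_cds` is a hypothetical list holding one aligned coding sequence per line for a single gene.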
Quantitative mRNA-Seq data generation and analysis
RNA was extracted using the Spectrum Plant Total RNA Kit (Sigma Aldrich, St. Louis, MO, USA)
and quantified using a NanoDrop (NanoDrop Products). 1.5µg of total RNA was prepared for tag-based
RNA sequencing (Meyer et al. 2011). This method incorporates sequencing primers and barcodes at the
3’ end of mRNA molecules; sequenced tags are therefore enriched for 3’ UTR and 3’ exonic sequence
and do not require correction for gene length. Prepared library samples were analyzed for quality on a
BioAnalyzer (Agilent, Santa Clara, CA, USA) and then run on two lanes of the SOLiD 5500 system (Life
Technologies, Carlsbad, CA, USA) at the University of Texas at Austin Genome Sequence and Analysis
Facility.
Reads were sorted by barcode and assigned to individual sample plants. Low-quality reads
(those with homopolymer stretches > 15% of the read length and/or those containing >10 bases with
quality scores below 16) were removed and barcodes trimmed using custom Perl scripts. We mapped all
reads against the 8x Bd21 genome sequence using SHRiMP ver 2.1.1b (Rumble et al. 2009). We
recovered between 310,000 and 4,400,000 mapped reads per sample, with most samples falling in the
range of one to three million reads. We recovered very few reads from multiple samples of Bd21-3; this
line was excluded from further differential expression analyses. Mapping efficiency ranged from 61% to
70% and was significantly correlated with genotype (ANOVA: F6,32 = 5.58, P = 0.0005), though the mean
mapping efficiency varied only by 4.1% among genotypes. Mapped reads from each sample were used
to create a table reporting counts for each locus in each sample. Loci for which very few reads were
recovered (zero reads from >50% of samples) were excluded from further analysis. This filtering scheme
resulted in a dataset composed of 15,168 transcripts. Counts were normalized across samples using the
KDMM protocol implemented in JMPGenomics 6.0 (SAS Institute, Cary, NC).
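The two filtering rules in this paragraph (low-quality reads and sparsely covered loci) can be expressed compactly. The sketch below is illustrative only and is not the custom Perl code used in the study. The function names are hypothetical, and "homopolymer stretch" is interpreted as the longest run of a single repeated symbol in whatever read representation is supplied.

```python
from itertools import groupby
from typing import Sequence


def is_low_quality(
    seq: str,
    quals: Sequence[int],
    max_homopolymer_frac: float = 0.15,
    max_low_q_bases: int = 10,
    min_q: int = 16,
) -> bool:
    """True if a read fails the homopolymer or the base-quality criterion."""
    # Longest run of one repeated symbol in the read.
    longest_run = max((len(list(run)) for _, run in groupby(seq)), default=0)
    low_q = sum(1 for q in quals if q < min_q)
    return longest_run > max_homopolymer_frac * len(seq) or low_q > max_low_q_bases


def keep_locus(counts: Sequence[int], max_zero_frac: float = 0.5) -> bool:
    """True unless more than max_zero_frac of the samples have zero reads."""
    zeros = sum(1 for c in counts if c == 0)
    return zeros <= max_zero_frac * len(counts)
```

Here `counts` would be one row of the locus-by-sample count table, so `keep_locus` is applied once per locus before normalization.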
We modeled the variance among samples as a negative binomial function and then fit a fixed
effect general linear model (GLM) including terms for “genotype,” “treatment,” their interaction, and
random “block” in JMPGenomics. Overdispersion was accounted for by incorporating an overdispersion
parameter estimated in PROC GLIMMIX. We controlled for multiple testing using a positive false-discovery
rate of 0.05 (Storey and Tibshirani 2003). We also identified transcripts showing a significant treatment
response in each line using the negative binomial test implemented in DESeq (Anders and Huber 2010),
again using an FDR of 0.05. Compared to the full-model GLM, this approach results in a significant loss of
statistical power, but allows us to identify transcripts that specifically respond to the treatment in each
line.
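To make the multiple-testing step concrete, the sketch below computes q-values from a list of p-values. It is a simplification and not the procedure run in JMPGenomics or DESeq: the proportion of true null hypotheses (pi0) is estimated at a single tuning value lambda = 0.5 rather than by smoothing over a grid of lambda values, and the function name is hypothetical. When pi0 is estimated as 1, the result reduces to Benjamini-Hochberg adjusted p-values.

```python
from typing import List, Sequence


def q_values(pvalues: Sequence[float], lam: float = 0.5) -> List[float]:
    """Simplified q-values with pi0 estimated at one fixed lambda."""
    m = len(pvalues)
    if m == 0:
        return []
    # Estimated proportion of true nulls, capped at 1.
    pi0 = min(1.0, sum(p > lam for p in pvalues) / (m * (1.0 - lam)))
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pi0 * m * pvalues[idx] / rank)
        q[idx] = running_min
    return q
```

Transcripts whose q-value is at most 0.05 would then be reported as significant at that false-discovery threshold.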
We used the program ELEMENT (Mockler et al. 2007) to test the hypothesis that gene sets are
enriched for particular sequence elements that may control expression in cis. Specifically, we looked for
over-represented motifs in the proximal 1000 bp upstream of genes annotated in the Bd21 reference
sequence.
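The general idea of testing for over-represented upstream motifs can be illustrated with a simple k-mer count and a hypergeometric test. This sketch is not the ELEMENT algorithm, whose internals are not described here. It assumes that the 1000 bp upstream sequences have already been extracted, scores each k-mer by presence or absence per promoter on one strand only, and uses hypothetical function names. No multiple-testing correction is applied.

```python
from collections import defaultdict
from math import comb
from typing import Dict, Set


def hypergeom_sf(k: int, total: int, carriers: int, draws: int) -> float:
    """P(X >= k) when drawing `draws` items from `total`, `carriers` of which
    carry the feature of interest."""
    upper = min(carriers, draws)
    hits = sum(
        comb(carriers, i) * comb(total - carriers, draws - i)
        for i in range(k, upper + 1)
    )
    return hits / comb(total, draws)


def kmer_enrichment(
    promoters: Dict[str, str], gene_set: Set[str], k: int = 6
) -> Dict[str, float]:
    """P-value per k-mer for over-representation among gene_set promoters.

    promoters: gene ID -> upstream sequence for every annotated gene
    (the background). Only k-mers found in at least one gene_set promoter
    are reported.
    """
    carriers = defaultdict(set)
    for gene, seq in promoters.items():
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            carriers[seq[i:i + k]].add(gene)
    in_set = set(gene_set) & set(promoters)
    pvalues = {}
    for kmer, genes in carriers.items():
        hits = len(genes & in_set)
        if hits:
            pvalues[kmer] = hypergeom_sf(hits, len(promoters), len(genes), len(in_set))
    return pvalues
```

Because many k-mers are tested at once, the resulting p-values would need a multiple-testing correction before any motif is called enriched.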
Bd1-1 deep RNA sequencing and transcriptome assembly
The strand-specific libraries were sequenced with 101 cycles in a single-end run on an Illumina
HiSeq2000 sequencing system at the Center for Genome Research and Biocomputing (CGRB), Oregon
State University, Corvallis, OR. Three lanes, with 8 libraries (one experiment) multiplexed per lane, were
run in one flow cell. The processing of fluorescent images into sequences, base-calling and quality value
calculations were performed using the CASAVA software version 1.8.
Pooled reads from all experiments were filtered and assembled using Rnnotator (Martin et al.
2010). The resulting contigs were filtered by size (≥ or < 1kb) and aligned with GMAP (Wu and Watanabe
2005) to the Bd21 genome assembly (IBI 2010) and the line-specific genomes. BEDTools and custom
scripts were used to identify aligned contigs defining previously unannotated transcripts (Quinlan and
Hall 2010). TopHat (Trapnell et al. 2009) was used to map Bd1-1 and Bd21 Illumina RNA-Seq reads
(Davidson et al. 2012) (obtained from the NCBI sequence read archive, SRP008505) to the genome
sequences and independently determine expression support for each transcript not contained in the
reference annotation. Bd1-1 contigs that did not align to the Bd21 and Bd1-1 genome sequences were
extracted. These transcripts were annotated using Blast2GO by using BLASTx (E-value of <1e-6) against
the NCBI Genbank non-redundant protein database followed by InterProScan search (E-value of <1e-6)
(Conesa and Gotz 2008).
References
Anders, S. and Huber, W. (2010) Differential expression analysis for sequence count data. Genome Biol,
11, R106.
Chen, K., Wallis, J.W., McLellan, M.D., Larson, D.E., Kalicki, J.M., Pohl, C.S., McGrath, S.D., Wendl,
M.C., Zhang, Q., Locke, D.P., Shi, X., Fulton, R.S., Ley, T.J., Wilson, R.K., Ding, L. and Mardis,
E.R. (2009) BreakDancer: an algorithm for high-resolution mapping of genomic structural
variation. Nat Methods, 6, 677-681.
Conesa, A. and Gotz, S. (2008) Blast2GO: A comprehensive suite for functional analysis in plant
genomics. Int J Plant Genomics, 2008, 619832.
Davidson, R.M., Gowda, M., Moghe, G., Lin, H., Vaillancourt, B., Shiu, S.H., Jiang, N. and Buell, C.R.
(2012) Comparative transcriptomics of three Poaceae species reveals patterns of gene
expression evolution. Plant J, 71, 492-502.
Felsenstein, J. (1989) PHYLIP -- Phylogeny Inference Package (Version 3.2). Cladistics, 5, 164-166.
Gan, X., Stegle, O., Behr, J., Steffen, J.G., Drewe, P., Hildebrand, K.L., Lyngsoe, R., Schultheiss, S.J.,
Osborne, E.J., Sreedharan, V.T., Kahles, A., Bohnert, R., Jean, G., Derwent, P., Kersey, P.,
Belfield, E.J., Harberd, N.P., Kemen, E., Toomajian, C., Kover, P.X., Clark, R.M., Ratsch, G. and
Mott, R. (2011) Multiple reference genomes and transcriptomes for Arabidopsis thaliana.
Nature, 477, 419-423.
IBI (2010) Genome sequencing and analysis of the model grass Brachypodium distachyon. Nature, 463,
763-768.
Jee, J., Rozowsky, J., Yip, K.Y., Lochovsky, L., Bjornson, R., Zhong, G., Zhang, Z., Fu, Y., Wang, J., Weng,
Z. and Gerstein, M. (2011) ACT: aggregation and correlation toolbox for analyses of genome
tracks. Bioinformatics, 27, 1152-1154.
Larkin, M.A., Blackshields, G., Brown, N.P., Chenna, R., McGettigan, P.A., McWilliam, H., Valentin, F.,
Wallace, I.M., Wilm, A., Lopez, R., Thompson, J.D., Gibson, T.J. and Higgins, D.G. (2007) Clustal
W and Clustal X version 2.0. Bioinformatics, 23, 2947-2948.
Li, H. and Durbin, R. (2009) Fast and accurate short read alignment with Burrows-Wheeler transform.
Bioinformatics, 25, 1754-1760.
Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G. and Durbin,
R. (2009a) The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25, 2078-2079.
Li, R., Li, Y., Fang, X., Yang, H., Wang, J. and Kristiansen, K. (2009b) SNP detection for massively parallel
whole-genome resequencing. Genome Res, 19, 1124-1132.
Martin, J., Bruno, V.M., Fang, Z., Meng, X., Blow, M., Zhang, T., Sherlock, G., Snyder, M. and Wang, Z.
(2010) Rnnotator: an automated de novo transcriptome assembly pipeline from stranded RNA-Seq
reads. BMC Genomics, 11, 663.
Meyer, E., Aglyamova, G.V. and Matz, M.V. (2011) Profiling gene expression responses of coral larvae
(Acropora millepora) to elevated temperature and settlement inducers using a novel RNA-Seq
procedure. Mol Ecol, 20, 3599-3616.
Mockler, T.C., Michael, T.P., Priest, H.D., Shen, R., Sullivan, C.M., Givan, S.A., McEntee, C., Kay, S.A.
and Chory, J. (2007) The DIURNAL project: DIURNAL and circadian expression profiling, model-based
pattern matching, and promoter analysis. Cold Spring Harb Symp Quant Biol, 72, 353-363.
Quinlan, A.R. and Hall, I.M. (2010) BEDTools: a flexible suite of utilities for comparing genomic features.
Bioinformatics, 26, 841-842.
Rumble, S.M., Lacroute, P., Dalca, A.V., Fiume, M., Sidow, A. and Brudno, M. (2009) SHRiMP: accurate
mapping of short color-space reads. PLoS Comput Biol, 5, e1000386.
Stajich, J.E., Block, D., Boulez, K., Brenner, S.E., Chervitz, S.A., Dagdigian, C., Fuellen, G., Gilbert, J.G.,
Korf, I., Lapp, H., Lehvaslaiho, H., Matsalla, C., Mungall, C.J., Osborne, B.I., Pocock, M.R.,
Schattner, P., Senger, M., Stein, L.D., Stupka, E., Wilkinson, M.D. and Birney, E. (2002) The
Bioperl toolkit: Perl modules for the life sciences. Genome Res, 12, 1611-1618.
Stajich, J.E. and Hahn, M.W. (2005) Disentangling the effects of demography and selection in human
history. Mol Biol Evol, 22, 63-73.
Storey, J.D. and Tibshirani, R. (2003) Statistical significance for genomewide studies. Proc Natl Acad Sci
U S A, 100, 9440-9445.
Trapnell, C., Pachter, L. and Salzberg, S.L. (2009) TopHat: discovering splice junctions with RNA-Seq.
Bioinformatics, 25, 1105-1111.
Wu, T.D. and Watanabe, C.K. (2005) GMAP: a genomic mapping and alignment program for mRNA and
EST sequences. Bioinformatics, 21, 1859-1875.