tpj12173-sup-0016-MethodsS1

A draft genome sequence of the pulse crop chickpea (Cicer arietinum L.)
SUPPORTING TEXT
EXPERIMENTAL PROCEDURES
Plant materials
ICC4958 and the other chickpea genotypes were grown from seed in pots in a growth
chamber for four weeks before the fresh expanding leaves were harvested. DNA was
isolated using a standard phenol/chloroform method followed by RNase A and
proteinase K treatments and precipitation with ethanol. For RNA-seq, chickpea
(genotype ICC4958) plants were grown either in the culture room or in the field for
collection of
various tissue samples. Root and shoot tissues were collected from chickpea seedlings
(15-day-old) growing in the culture room. Mature leaves, stem, flowers and young
pod were collected from the field-grown plants. For stress treatments, chickpea
seedlings (10-day-old) were kept in water (control), in a 150 mM NaCl solution (salt
stress) or between folds of tissue paper (drought stress), and roots were harvested
after 5 h of treatment. RNA was isolated from different young organs grown under
different conditions following previously described methods (Garg et al., 2011).
Sequencing and assembly
For the draft assembly, sequence data were generated primarily by the 454/Roche GS
FLX Titanium platform using pyrosequencing technology. Construction of whole
genome shotgun (WGS) (insert size of 300-900 bp) and matepair (MP) (insert size of
3, 15 and 20 kb) libraries was performed as described by the manufacturer (Margulies
et al., 2005). Twenty-four runs of four different WGS libraries (average insert size 700 bp) and
a total of 10 runs of three MP libraries (3 kb, 15 kb and 20 kb) generated a total of
38.29 million filtered reads with 13.354 Gb of high-quality (Phred score, Q20) bases.
The reads were filtered according to the method used for the tomato genome
sequence (The Tomato Genome Consortium, 2012). The Illumina GA-IIx short read
sequencing platform was used to sequence two small-insert libraries (average insert
size of 520 and 620 bp) to produce 43.7 Gb (~59X) paired-end (PE) high quality
sequence data of 100-base read lengths after quality filtering. Duplicate reads were
removed during filtering. The short read data set was assembled using ABySS version
1.2.6 with K-mer length of 47 to produce 304,948,126 bases of assembled sequences.
The contigs larger than 2000 bases were split into contigs of 2000 bases with overlaps
of 100 bases. These contigs and all the filtered reads generated by the GS FLX
platform were assembled by the de novo assembly tool Newbler (GS de novo
assembler, Roche Applied Sciences) version 2.5.3 to obtain the primary assembly.
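
The splitting of long short-read contigs into overlapping 2000-base pieces can be illustrated with a minimal Python sketch. It is not the script used for the assembly: the function name split_contig is hypothetical, and the handling of the final piece (kept as a shorter remainder) is an assumption, since the text does not describe it.

```python
def split_contig(seq, piece_len=2000, overlap=100):
    """Split a contig into pieces of at most piece_len bases.

    Consecutive pieces share `overlap` bases. The last piece may be
    shorter than piece_len (an assumption; the source text does not say
    how the remainder was treated).
    """
    if piece_len <= overlap:
        raise ValueError("piece_len must be larger than overlap")
    if len(seq) <= piece_len:
        return [seq]  # contigs of 2000 bases or fewer are left intact
    step = piece_len - overlap  # distance between successive piece starts
    pieces = []
    start = 0
    while True:
        pieces.append(seq[start:start + piece_len])
        if start + piece_len >= len(seq):
            break  # this piece reached the end of the contig
        start += step
    return pieces


if __name__ == "__main__":
    contig = "ACGT" * 1250  # toy 5000-base contig
    for i, piece in enumerate(split_contig(contig)):
        print(i, len(piece))
```

Each piece begins 1900 bases after the previous one, so adjacent pieces share exactly 100 bases that the downstream assembler can use to rejoin them.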
The assembly parameters used were seed step 12, minimum overlap 40, minimum
overlap identity 95%, seed length 16, with default quality filter, adapter and primer
trimming of the input reads. Further scaffolding using publicly available BAC end
sequences (GenBank gi numbers 14645554 to 270242271) and genetic markers (Gaur
et al., 2012) was performed. Unordered pieces of working draft sequences of twelve
chickpea BACs (GenBank accession numbers AC137663, AC161101-AC161105,
AC145454-AC145458, AC145766) and the BAC end sequences were aligned by
BlastN to check the assembly. The contigs were extensively screened for
mitochondrial (GenBank accession numbers: O. sativa NC_011033.1, A. thaliana
NC_001284.2, V. radiata NC_015121.1), cloning vectors
(ftp://ftp.ncbi.nih.gov/pub/UniVec/) and microbial genome (GenBank accession
numbers: E. coli AC_000091.1, AP009048.1; A. tumefaciens NC_003062.2,
Rhizobium NC_008380.1, P. aeruginosa NC_002516.2, B. thuringiensis
NC_014171.1, X. campestris NC_007086.1, D. acidovorans NC_010002.1)
contamination. Screening was done through megablast alignment using a cut-off of
95% identity. Contigs showing 95% or greater identity over 90% or more of their
length were discarded. The screened contigs above 200 bases were submitted to the
National Center for Biotechnology Information (NCBI) with the bioproject
registration ID PRJNA78951 (GenBank accession no. AHII00000000).
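
The contamination criterion described above (95% identity or more over 90% or more of the contig length) can be sketched in Python as follows. This is an illustration only, not the screening script used for the assembly. It assumes that megablast hits have already been parsed into (contig, percent identity, query start, query end) tuples, and that coverage is summed over the merged query intervals of all qualifying hits; the text does not say whether coverage was computed per alignment or across alignments. The function name and the toy contig identifiers are hypothetical.

```python
from collections import defaultdict


def contaminated_contigs(hits, contig_lengths,
                         min_identity=95.0, min_coverage=0.90):
    """Return the set of contigs to discard as likely contamination.

    hits: iterable of (contig_id, percent_identity, qstart, qend),
          with 1-based inclusive query coordinates.
    contig_lengths: dict mapping contig_id to contig length in bases.
    """
    intervals = defaultdict(list)
    for contig, pident, qstart, qend in hits:
        if pident >= min_identity:
            lo, hi = sorted((qstart, qend))
            intervals[contig].append((lo, hi))

    flagged = set()
    for contig, spans in intervals.items():
        spans.sort()
        covered = 0
        cur_lo, cur_hi = spans[0]
        for lo, hi in spans[1:]:
            if lo <= cur_hi + 1:      # overlapping or adjacent: extend
                cur_hi = max(cur_hi, hi)
            else:                     # gap: close the current interval
                covered += cur_hi - cur_lo + 1
                cur_lo, cur_hi = lo, hi
        covered += cur_hi - cur_lo + 1
        if covered / contig_lengths[contig] >= min_coverage:
            flagged.add(contig)
    return flagged


if __name__ == "__main__":
    toy_hits = [("c1", 98.0, 1, 500), ("c1", 96.5, 450, 950),
                ("c2", 99.0, 1, 300), ("c3", 90.0, 1, 1000)]
    toy_lengths = {"c1": 1000, "c2": 1000, "c3": 1000}
    print(contaminated_contigs(toy_hits, toy_lengths))
```

In the toy data only c1 is flagged: c2 is covered over 30% of its length, and the single hit on c3 falls below the identity cut-off.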
Gene expression analysis
RNA-seq was performed with total RNA isolated from different chickpea
tissues/organs (root, shoot, mature leaves, stem, flowers and young pod) and roots of
seedlings subjected to control, drought and salt stress conditions according to Illumina
protocols using the Illumina GA-IIx platform. High-quality reads were filtered using
NGS QC Toolkit (Patel and Jain, 2012) and mapped using CLC Genomics
Workbench to the mRNA sequences of predicted chickpea genes allowing two
mismatches for quantification of gene expression. Only the uniquely mapped reads
were considered for gene expression analysis. Differential gene expression analysis
was performed using DESeq software (Anders and Huber, 2010). The genes showing
at least a two-fold change in expression with a p-value of ≤0.05 were regarded as
differentially expressed. The genes having at least two-fold higher expression in a
particular tissue as compared to all other tissues analyzed were identified as
tissue-preferential genes. A stringent criterion of having at least three reads per million
(rpm) in one tissue and less than one rpm in other tissues was used for the
identification of tissue-specific genes. GO term enrichment analysis was done using
BiNGO (Maere et al., 2005) with p-value cut-off of <0.05 after applying Bonferroni
Family-Wise Error Rate (FWER) correction.
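
The tissue-preferential and tissue-specific criteria given above can be expressed as a short Python sketch. It is an illustration under stated assumptions, not the analysis code: both functions take a dictionary of rpm values per tissue for one gene (the text does not specify which expression measure was used for the two-fold comparison), and the function names and toy values are hypothetical.

```python
def preferential_tissue(rpm_by_tissue, fold=2.0):
    """Return the tissue whose expression is at least `fold` times that
    of every other tissue, or None if no tissue qualifies."""
    top = max(rpm_by_tissue, key=rpm_by_tissue.get)
    top_value = rpm_by_tissue[top]
    others = [v for t, v in rpm_by_tissue.items() if t != top]
    if top_value > 0 and others and all(top_value >= fold * v for v in others):
        return top
    return None


def specific_tissue(rpm_by_tissue, min_rpm=3.0, max_other_rpm=1.0):
    """Return the tissue with at least min_rpm reads per million when all
    other tissues are below max_other_rpm, or None otherwise."""
    top = max(rpm_by_tissue, key=rpm_by_tissue.get)
    others = [v for t, v in rpm_by_tissue.items() if t != top]
    if rpm_by_tissue[top] >= min_rpm and all(v < max_other_rpm for v in others):
        return top
    return None


if __name__ == "__main__":
    gene = {"root": 6.0, "shoot": 0.2, "flower": 0.0}  # toy rpm values
    print(preferential_tissue(gene), specific_tissue(gene))
```

Under these definitions a gene can be tissue-preferential without being tissue-specific, for example when the other tissues are expressed at a lower level but at 1 rpm or more.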
Synteny and genome duplication
Duplicated blocks within the chickpea genome were identified using i-ADHoRe 3.0
(Proost et al., 2012). i-ADHoRe aligns the ordered gene lists corresponding to each
pseudomolecule in order to identify anchor points. Protein sequences corresponding
to detected anchor points or collinear regions were aligned using CLUSTALW,
followed by estimation of the synonymous substitution rate (Ks) using the codeml
program of the PAML 4.5 package (Yang et al., 2009). Collinear blocks were positioned
and visualized on the genome using Circos (Krzywinski et al., 2009). Soybean and
Medicago truncatula reference genome sequences were downloaded from
ftp://ftp.jgi-psf.org/pub/JGI_data/phytozome/v7.0/Gmax/assembly and
http://www.medicagohapmap.org/downloads.php, respectively. The whole genome
dot plot was generated with chickpea scaffolds representing 8 linkage groups on the
x-axis against chromosome arms (North and South) of soybean and M. truncatula.
For the soybean genome, large pericentromeric regions were removed. The PROmer
program of MUMmer 3.22 (Delcher et al., 2002) was used to align annotated genes
based on amino acid sequence. Whole genome dot plots were generated using
mummerplot and gnuplot 4.4 patch level 2. Reciprocal best matches between two
genomes were identified by Vmatch using the parameters query 85, subject coverage 70,
exdrop 100 and minimum length 100. Extracted protein sequences were aligned by
ClustalW. The extracted nucleotide sequences were aligned by PAML package using
the PAL2NAL tool. The yn00 tool was used to find gene duplications from each
cluster based on dN/dS (omega) values. The duplicate gene pairs were then anchored
to the pseudomolecules using i-ADHoRe. Microsynteny analysis was performed by
mapping chickpea unigene sequences on chickpea and Medicago pseudomolecules,
and the coordinates were viewed using the Multi-Genome Synteny Viewer (cas-bioinfo.cas.unt.edu/mgsv/).
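
The notion of reciprocal best matches used above can be illustrated with a minimal Python sketch. It does not reproduce Vmatch or its parameters. It assumes that pairwise matches between two genomes have already been reduced to (query, subject, score) tuples in each direction; the function names, gene identifiers and scores are hypothetical.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    hits: iterable of (query_id, subject_id, score). On ties the first
    hit encountered is kept (an arbitrary choice for this sketch).
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a's best match is b and b's best
    match is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


if __name__ == "__main__":
    # Toy data: chickpea genes (Ca*) against Medicago genes (Mt*).
    ca_vs_mt = [("Ca1", "Mt1", 200), ("Ca1", "Mt2", 150), ("Ca2", "Mt2", 180)]
    mt_vs_ca = [("Mt1", "Ca1", 210), ("Mt2", "Ca1", 190), ("Mt2", "Ca2", 170)]
    print(reciprocal_best_hits(ca_vs_mt, mt_vs_ca))
```

In the toy data Ca2 has Mt2 as its best match, but Mt2 matches Ca1 better, so only the Ca1/Mt1 pair is reciprocal.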
Nucleotide diversity
To detect SNPs and other structural variations in the chickpea genotypes, sequence
data were generated for three other chickpea genotypes on the 454/Roche GS
FLX Titanium platform. The reads from each genotype were separately aligned to the
ICC4958 draft sequence as reference to construct map-based assemblies of these
genotypes using GS Reference Mapper (454.com/product/analysis-software/index.asp).
The reads of one genotype were aligned to the ICC4958 draft
assembly or the map based assembly of another genotype to identify SNPs and other
structural variations. The coordinates of the structural variations between different
genotypes are available at http://nipgr.res.in/CGAP/home.php. SNP detection in the
transcriptomes of four chickpea genotypes was performed using GigaBayes as
previously described (Jhanwar et al., 2012).
URLs used
RepeatMasker and RepeatProteinMask, http://www.repeatmasker.org/
RepeatModeler, http://repeatmasker.org/RepeatModeler.html
PILER, http://www.drive5.com/piler/
RepeatScout, http://bix.ucsd.edu/repeatscout/
LTR_Finder, http://tlife.fudan.edu.cn/ltr_finder/
GLEAN, http://glean-gene.svn.sourceforge.net/viewvc/glean-gene/
EVidenceModeler, http://evidencemodeler.sourceforge.net/
Augustus, http://bioinf.uni-greifswald.de/augustus/
GENSCAN, http://genes.mit.edu/GENSCAN.html
FGENESH++, http://linux1.softberry.com/berry.phtml?topic=index&group=programs&subgroup=gfs
GeneWise, http://www.ebi.ac.uk/Tools/Wise2/
PASA, http://pasa.sourceforge.net/
CEGMA, http://korflab.ucdavis.edu/Datasets/cegma/
SwissProt and TrEMBL, http://www.uniprot.org/
TAIR10, http://www.arabidopsis.org
Blast2GO, http://blast2go.com/b2ghome
AutoFACT, http://megasun.bch.umontreal.ca/Software/AutoFACT.htm
BiNGO, http://psb.ugent.be/cbd/papers/BiNGO/Home.html
PlnTFDB, http://plntfdb.bio.uni-potsdam.de/v3.0/
i-ADHoRe, http://bioinformatics.psb.ugent.be/software
PAML, http://abacus.gene.ucl.ac.uk/software/paml.html
Circos, http://circos.ca/
TIGR Plant Transcript Assemblies, http://plantta.jcvi.org/
NGS QC Toolkit, http://www.nipgr.res.in/ngsqctoolkit.html
CLC Genomics Workbench, http://www.clcbio.com/index.php?id=1240
DESeq, http://www-huber.embl.de/users/anders/DESeq/
MeV, http://www.tm4.org/mev/
GigaBayes, http://bioinformatics.bc.edu/marthlab/Software_Release
MISA, http://pgrc.ipk-gatersleben.de/misa/misa.html
Vmatch, http://www.vmatch.de/
MCL, http://micans.org/mcl/
CLUSTALW, http://www.ebi.ac.uk/Tools/msa/clustalw2/
PAL2NAL, http://www.bork.embl.de/pal2nal/