A draft genome sequence of the pulse crop chickpea (Cicer arietinum L.)

SUPPORTING TEXT

EXPERIMENTAL PROCEDURES

Plant materials

Plants of ICC4958 and the other chickpea genotypes were grown from seed in pots in a growth chamber for four weeks before the fresh expanding leaves were harvested. DNA was isolated using a standard phenol/chloroform method followed by RNase A and proteinase K treatments and precipitation with ethanol. For RNA-seq, chickpea (genotype ICC4958) plants were grown either in the culture room or in the field for collection of various tissue samples. Root and shoot tissues were collected from 15-day-old seedlings growing in the culture room. Mature leaves, stems, flowers and young pods were collected from field-grown plants. For stress treatments, 10-day-old seedlings were kept in water (control), in 150 mM NaCl solution (salt stress) or between folds of tissue paper (drought stress), and roots were harvested after 5 h of treatment. RNA was isolated from the different organs grown under the different conditions following previously described methods (Garg et al., 2011).

Sequencing and assembly

For the draft assembly, sequence data were generated primarily on the 454/Roche GS FLX Titanium platform using pyrosequencing technology. Whole genome shotgun (WGS) libraries (insert size of 300-900 bp) and mate-pair (MP) libraries (insert sizes of 3, 15 and 20 kb) were constructed as described by the manufacturer (Margulies et al., 2005). Twenty-four runs of four different WGS libraries (average insert size 700 bp) and a total of 10 runs of three MP libraries (3 kb, 15 kb and 20 kb) generated 38.29 million filtered reads with 13.354 Gb of high-quality (Phred score, Q20) bases. The reads were filtered according to the method followed for the tomato genome sequence (The Tomato Genome Consortium, 2012).
The Illumina GA-IIx short-read sequencing platform was used to sequence two small-insert libraries (average insert sizes of 520 and 620 bp), producing 43.7 Gb (~59X) of high-quality paired-end (PE) sequence data with 100-base read lengths after quality filtering. Duplicate reads were removed during filtering. The short-read data set was assembled using ABySS version 1.2.6 with a k-mer length of 47 to produce 304,948,126 bases of assembled sequence. Contigs larger than 2000 bases were split into 2000-base contigs overlapping by 100 bases. These contigs and all the filtered reads generated by the GS FLX platform were assembled with the de novo assembly tool Newbler (GS de novo assembler, Roche Applied Sciences) version 2.5.3 to obtain the primary assembly. The assembly parameters were seed step 12, seed length 16, minimum overlap 40 and minimum overlap identity 95%, with default quality filtering and adapter and primer trimming of the input reads. Further scaffolding was performed using publicly available BAC end sequences (GenBank gi numbers 14645554 to 270242271) and genetic markers (Gaur et al., 2012). Unordered pieces of working draft sequences of twelve chickpea BACs (GenBank accession numbers AC137663, AC161101-AC161105, AC145454-AC145458, AC145766) and the BAC end sequences were aligned by BlastN to check the assembly.

The contigs were extensively screened for contamination from mitochondrial genomes (GenBank accession numbers: O. sativa NC_011033.1, A. thaliana NC_001284.2, V. radiata NC_015121.1), cloning vectors (ftp://ftp.ncbi.nih.gov/pub/UniVec/) and microbial genomes (GenBank accession numbers: E. coli AC_000091.1, AP009048.1; A. tumefaciens NC_003062.2, Rhizobium NC_008380.1, P. aeruginosa NC_002516.2, B. thuringiensis NC_014171.1, X. campestris NC_007086.1, D. acidovorans NC_010002.1). Screening was done by megablast alignment using a cut-off of 95% identity. Contigs showing at least 95% identity over at least 90% of their length were discarded.
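The discard rule above can be illustrated with a minimal sketch. It assumes that "over at least 90% of their length" means the union of all qualifying alignment segments on the contig, which is one possible reading of the criterion. The `Hit` tuple layout and the function names are hypothetical and do not correspond to any megablast output format or to the pipeline actually used.

```python
from typing import Iterable, Tuple

# One alignment segment on the contig: (start, end, percent_identity),
# using 1-based inclusive contig coordinates. Hypothetical layout.
Hit = Tuple[int, int, float]


def covered_fraction(contig_length: int, hits: Iterable[Hit],
                     min_identity: float = 95.0) -> float:
    """Fraction of the contig covered by hits at or above min_identity."""
    # Keep only qualifying hits and sort them by start position.
    intervals = sorted(
        (min(start, end), max(start, end))
        for start, end, identity in hits
        if identity >= min_identity
    )
    covered = 0
    counted_up_to = 0  # last contig position already counted
    for start, end in intervals:
        # Skip the part of this interval that overlaps earlier ones.
        start = max(start, counted_up_to + 1)
        if end >= start:
            covered += end - start + 1
            counted_up_to = end
    return covered / contig_length


def is_contaminant(contig_length: int, hits: Iterable[Hit],
                   min_identity: float = 95.0,
                   min_coverage: float = 0.90) -> bool:
    """Apply the >=95% identity over >=90% of contig length rule."""
    return covered_fraction(contig_length, hits, min_identity) >= min_coverage
```

Under this reading, a 1000-base contig with two overlapping segments of 96-98% identity that together span positions 1-950 would be discarded, whereas a single 94%-identity segment of the same span would not.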
The screened contigs above 200 bases were submitted to the National Center for Biotechnology Information (NCBI) under BioProject ID PRJNA78951 (GenBank accession no. AHII00000000).

Gene expression analysis

RNA-seq was performed on the Illumina GA-IIx platform according to Illumina protocols, using total RNA isolated from different chickpea tissues/organs (root, shoot, mature leaves, stem, flowers and young pod) and from roots of seedlings subjected to control, drought and salt stress conditions. Reads were filtered for high quality using NGS QC Toolkit (Patel and Jain, 2012) and mapped using CLC Genomics Workbench to the mRNA sequences of predicted chickpea genes, allowing two mismatches, for quantification of gene expression. Only uniquely mapped reads were considered for gene expression analysis. Differential gene expression analysis was performed using DESeq software (Anders and Huber, 2010). Genes showing at least a two-fold change with a p-value of ≤0.05 were regarded as differentially expressed. Genes having at least two-fold higher expression in a particular tissue than in all other tissues analyzed were identified as tissue-preferential genes. A stringent criterion of at least three reads per million (rpm) in one tissue and less than one rpm in all other tissues was used to identify tissue-specific genes. GO term enrichment analysis was done using BiNGO (Maere et al., 2005) with a p-value cut-off of <0.05 after applying Bonferroni family-wise error rate (FWER) correction.

Synteny and genome duplication

Duplicated blocks within the chickpea genome were identified using i-ADHoRe 3.0 (Proost et al., 2012). i-ADHoRe aligns the ordered gene lists corresponding to each pseudomolecule in order to identify anchor points.
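As a rough illustration of what an anchor point is, the sketch below pairs positions in two ordered gene lists whenever the genes belong to the same homolog family, and then chains pairs that advance together in both lists. This is a deliberately simplified toy with greedy, forward-orientation chaining only; it is not the i-ADHoRe algorithm. The gene names, the `family` mapping and the `max_gap` parameter are all hypothetical.

```python
from typing import Dict, List, Tuple

Anchor = Tuple[int, int]  # (index in list_a, index in list_b)


def find_anchor_points(list_a: List[str], list_b: List[str],
                       family: Dict[str, str]) -> List[Anchor]:
    """Pair positions whose genes share a homolog family."""
    positions_b: Dict[str, List[int]] = {}
    for j, gene in enumerate(list_b):
        if gene in family:
            positions_b.setdefault(family[gene], []).append(j)
    anchors: List[Anchor] = []
    for i, gene in enumerate(list_a):
        if gene in family:
            for j in positions_b.get(family[gene], []):
                anchors.append((i, j))
    return anchors


def chain_collinear(anchors: List[Anchor], max_gap: int = 3) -> List[List[Anchor]]:
    """Greedily chain anchors that advance in both lists by at most max_gap."""
    chains: List[List[Anchor]] = []
    for i, j in sorted(anchors):
        for chain in chains:
            last_i, last_j = chain[-1]
            if 0 < i - last_i <= max_gap and 0 < j - last_j <= max_gap:
                chain.append((i, j))
                break
        else:
            # No existing chain can be extended, so start a new one.
            chains.append([(i, j)])
    return chains


# Hypothetical ordered gene lists for two different pseudomolecules.
list_a = ["a1", "a2", "a3", "a4", "a5"]
list_b = ["b1", "b2", "bx", "b3", "b9"]
family = {"a1": "F1", "b1": "F1", "a2": "F2", "b2": "F2",
          "a3": "F3", "b3": "F3", "a5": "F5", "b9": "F9"}

anchors = find_anchor_points(list_a, list_b, family)
chains = chain_collinear(anchors)
```

In this toy input the three shared families yield three anchor points that fall into a single collinear chain, tolerating the one intervening gene `bx` in the second list.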
Protein sequences corresponding to detected anchor points or collinear regions were aligned using ClustalW, followed by estimation of the synonymous substitution rate (Ks) using the Codeml program of the PAML 4.5 package (Yang et al., 2009). Collinear blocks were positioned and visualized on the genome using Circos (Krzywinski et al., 2009).

Soybean and Medicago truncatula reference genome sequences were downloaded from ftp://ftp.jgi-psf.org/pub/JGI_data/phytozome/v7.0/Gmax/assembly and http://www.medicagohapmap.org/downloads.php, respectively. The whole-genome dot plot was generated with chickpea scaffolds representing the 8 linkage groups on the x-axis against the chromosome arms (North and South) of soybean and M. truncatula. For the soybean genome, large pericentromeric regions were removed. The PROmer program of the MUMmer 3.22 package (Delcher et al., 2002) was used to align annotated genes based on amino acid sequence. Whole-genome dot plots were generated using mummerplot and gnuplot 4.4 patch level 2. Reciprocal best matches between two genomes were identified by Vmatch using the parameters query 85, subject coverage 70, exdrop 100 and minimum length 100. Extracted protein sequences were aligned by ClustalW. The extracted nucleotide sequences were then aligned as codons, guided by the protein alignments, using the PAL2NAL tool for analysis with the PAML package. The yn00 tool was used to find gene duplications from each cluster based on dN/dS (omega) values. The duplicate gene pairs were then anchored to the pseudomolecules using i-ADHoRe. Microsynteny analysis was performed by mapping chickpea unigene sequences onto the chickpea and Medicago pseudomolecules, and the coordinates were viewed using the Multi-Genome Synteny Viewer (casbioinfo.cas.unt.edu/mgsv/).

Nucleotide diversity

To detect SNPs and other structural variations among the chickpea genotypes, sequence data were generated for three other chickpea genotypes on the 454/Roche GS FLX Titanium platform.
The reads from each genotype were separately aligned to the ICC4958 draft sequence as reference to construct map-based assemblies of these genotypes using GS Reference Mapper (454.com/product/analysis-software/index.asp). The reads of one genotype were then aligned to the ICC4958 draft assembly or to the map-based assembly of another genotype to identify SNPs and other structural variations. The coordinates of the structural variations between different genotypes are available at http://nipgr.res.in/CGAP/home.php. SNP detection in the transcriptomes of four chickpea genotypes was performed using GigaBayes as previously described (Jhanwar et al., 2012).

URLs used

RepeatMasker and RepeatProteinMask, http://www.repeatmasker.org/; RepeatModeler, http://repeatmasker.org/RepeatModeler.html; PILER, http://www.drive5.com/piler/; RepeatScout, http://bix.ucsd.edu/repeatscout/; LTR_Finder, http://tlife.fudan.edu.cn/ltr_finder/; GLEAN, http://gleangene.svn.sourceforge.net/viewvc/glean-gene/; EVidenceModeler, http://evidencemodeler.sourceforge.net/; Augustus, http://bioinf.uni-greifswald.de/augustus/; GENSCAN, http://genes.mit.edu/GENSCAN.html; FGENESH++, http://linux1.softberry.com/berry.phtml?topic=index&group=programs&subgroup=gfs; GeneWise, http://www.ebi.ac.uk/Tools/Wise2/; PASA, http://pasa.sourceforge.net/; CEGMA, http://korflab.ucdavis.edu/Datasets/cegma/; SwissProt and TrEMBL, http://www.uniprot.org/; TAIR10, http://www.arabidopsis.org; Blast2GO, http://blast2go.com/b2ghome; AutoFACT, http://megasun.bch.umontreal.ca/Software/AutoFACT.htm; BiNGO, http://psb.ugent.be/cbd/papers/BiNGO/Home.html; PlnTFDB, http://plntfdb.bio.uni-potsdam.de/v3.0/; i-ADHoRe, http://bioinformatics.psb.ugent.be/software; PAML, http://abacus.gene.ucl.ac.uk/software/paml.html; Circos, http://circos.ca/; TIGR Plant Transcript Assemblies, http://plantta.jcvi.org/; NGS QC Toolkit, http://www.nipgr.res.in/ngsqctoolkit.html; CLC Genomics Workbench, http://www.clcbio.com/index.php?id=1240; DESeq,
http://www-huber.embl.de/users/anders/DESeq/; MeV, http://www.tm4.org/mev/; GigaBayes, http://bioinformatics.bc.edu/marthlab/Software_Release; MISA, http://pgrc.ipk-gatersleben.de/misa/misa.html; Vmatch, http://www.vmatch.de/; MCL, http://micans.org/mcl/; ClustalW, http://www.ebi.ac.uk/Tools/msa/clustalw2/; PAL2NAL, http://www.bork.embl.de/pal2nal/