Supplementary materials

Background

The Puerto Rican parrot (Amazona vittata) is the only surviving native parrot species in the United States [1]. This endangered bird was once abundant throughout the island of Puerto Rico, but the population declined drastically during the 19th century with the decimation of the old-growth forest habitat [2]. One of the first people to call attention to the need for its conservation was President Theodore Roosevelt, who formed the Luquillo Forest Reserve for the purpose of protecting the parrots’ nesting grounds [1]. El Yunque National Forest and a few other remote locations were spared from deforestation, allowing a small population of parrots to survive [2-4]. In the second half of the 20th century, the species came very close to extinction: only 16 birds remained alive in 1975. The Puerto Rican parrot has been included on the Endangered Species List since 1967, and has been listed as critically endangered by the World Conservation Union since 1994.

A recovery effort has been initiated by two captive breeding programs, one at the Luquillo Aviary close to El Yunque and another at the Rio Abajo State Forest. This effort has proven to be a success: the captive population is steadily increasing, and several captive-bred birds have been released into the wild.

The genome-wide consequences of the population decline and their effects on the recovery of the species have never been assessed. As in most birds, the A. vittata genome is relatively small (1.58 Gb), less than half the size of the human genome [5]. It also shows very little variation owing to the recent population bottleneck [6]. These two aspects made the Puerto Rican parrot an ideal candidate for genome sequencing and assembly: the small genome size allows high sequencing coverage, while the low amount of genetic variation permits accurate alignments between sequenced fragments, resulting in fewer assembly errors.
Methods

Development of the local community involvement

The Puerto Rico National Parks Company recently started exhibiting Puerto Rican parrots and related species at the Puerto Rico Zoo Juan A. Rivero. This exhibition will bring attention to the conservation effort for this endangered species. In conjunction with the comparative genomics study to be conducted at the University of Puerto Rico, the parrot exhibition will become a valuable research and educational resource for the local and international scientific community. The sequence of the Puerto Rican parrot will be used as a reference to study differences between species and to find genomic regions responsible for adaptation and survival.

Sample collection and genome sequencing

All sampling procedures were carried out as approved by the University of Puerto Rico at Mayaguez Institutional Animal Care and Use Committee (IACUC#201109.1), and in accordance with the guidance for the Endangered Species Act. A veterinarian designated by the Puerto Rico Department of Natural and Environmental Resources (DNER) collected blood from two non-reproductive captive female birds at the Rio Abajo aviary during a routine health check procedure. Syringes were heparinized before samples were taken. Approximately 1.5 cc of blood was extracted from each bird using a 3 cc syringe with a 27-gauge needle. Birds were returned to their cages after the procedure and remained under observation for the rest of the day to make sure that no harm had been done.

Samples were aliquoted and extracted multiple times (23 times for one bird and 3 times for the other) to produce a total of 2 µg of purified DNA with the QIAamp® DNA Blood Midi kit (Qiagen, Venlo, The Netherlands) according to the manufacturer's instructions, with the following modifications. Blood samples were diluted in 1x PBS in a 1:10 ratio. The samples were incubated at 70°C in the lysis buffer and protease for 2–12 hours.
After cellular lysis, DNA was precipitated with 100% ethanol and the solution was passed twice through the Midi column. The DNA was eluted twice with 300 µl of pre-warmed elution buffer, incubated for 10 minutes at room temperature and then centrifuged for two minutes.

After the DNA isolation, DNA purity was assessed with a BioTek® microplate reader (BioTek, Winooski, VT, USA) (A260/A280) compared to a blank control after correcting for the baseline (A320). The remainder of the tissue material was stored at -80°C. The concentrations and quality of DNA were reassessed using a fluorometer and the Quant-iT™ PicoGreen® dsDNA binding assay (Invitrogen, Carlsbad, CA, USA), and four of the best DNA samples were chosen on this basis. These samples were shipped to Axeq Technologies (Rockville, MD, USA; and Seoul, South Korea) to be used for next generation sequencing, and the DNA concentrations were validated once more upon arrival (Table S1). The DNA concentrations did not deviate between different methods of measurement.

One sample (Pa9a) was finally selected and sequenced on the Illumina HiSeq 2000 platform (Illumina, San Diego, CA, USA) with both fragment and paired-end sequencing approaches, resulting in a total of 42,479,499,706 bases (Table S2). The sequencing was initiated with the construction of two genome libraries: a short fragment library (~300 bp inserts) for sequencing the majority of the genome, and a long fragment library (~2.5 kb inserts) to generate scaffolds used to order and assemble contigs derived from the short fragment library. Raw Illumina HiSeq reads were processed and filtered using the Genome Analyzer Pipeline software provided by the manufacturer, set to the default parameters. In total, 86.48% of the 309,060,168 paired-end reads and 85.14% of the 180,079,956 mate-pair reads generated passed Illumina quality control (QC). If one read from a pair failed the QC, the whole pair was filtered out.
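The sequencing totals quoted above can be cross-checked against the per-base counts listed in Table S2. The Python sketch below is illustrative only and is not one of the scripts used in the study; the names TABLE_S2, RAW_READS, LIBRARIES and the helper functions are hypothetical. It assumes that every retained read is exactly 101 bp long, as stated in Table S2, and that the raw counts of 309,060,168 and 180,079,956 refer to individual reads rather than to pairs.

```python
# Per-base counts (A, C, G, T, N) copied from Table S2, one entry per read file.
TABLE_S2 = {
    "Pa9a_1":    (3_848_744_755, 2_897_323_917, 2_900_927_647, 3_825_967_145, 23_781_474),
    "Pa9a_2":    (3_868_006_366, 2_890_106_491, 2_911_110_520, 3_826_704_806, 816_755),
    "Pa9a-MP_1": (2_174_635_550, 1_720_812_735, 1_688_621_488, 2_158_688_549, 246_593),
    "Pa9a-MP_2": (2_171_866_496, 1_674_935_595, 1_748_437_549, 2_147_171_234, 594_041),
}

READ_LENGTH = 101  # bp; Table S2 states that all sequences were 101 bp long

# Raw read counts before Illumina QC, as quoted in the text. Assumption: these
# count individual reads, so both mates of a pair are included.
RAW_READS = {"paired-end": 309_060_168, "mate-pair": 180_079_956}

# Hypothetical grouping of the Table S2 files into the two libraries.
LIBRARIES = {
    "paired-end": ("Pa9a_1", "Pa9a_2"),
    "mate-pair": ("Pa9a-MP_1", "Pa9a-MP_2"),
}


def bases_per_file(table):
    """Sum the A, C, G, T and N counts for each read file."""
    return {name: sum(counts) for name, counts in table.items()}


def total_bases(table):
    """Total number of sequenced bases over all read files."""
    return sum(bases_per_file(table).values())


def passing_reads(table, library, read_length=READ_LENGTH):
    """Reads of one library that passed QC, assuming a uniform read length."""
    per_file = bases_per_file(table)
    return sum(per_file[name] // read_length for name in LIBRARIES[library])


def pass_rate(table, library):
    """Percentage of the raw reads of one library retained after QC."""
    return 100 * passing_reads(table, library) / RAW_READS[library]


if __name__ == "__main__":
    print(f"total bases: {total_bases(TABLE_S2):,}")
    for library in LIBRARIES:
        print(f"{library}: {passing_reads(TABLE_S2, library):,} reads passed QC "
              f"({pass_rate(TABLE_S2, library):.2f}%)")
```

Under these assumptions, the sums reproduce the total of 42,479,499,706 bases and the pass rates of 86.48% and 85.14% reported in the text.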
Assembly

We performed two de novo assemblies: we used Ray [7] software on 256 CPUs for 8 hours for the first assembly, and SOAPdenovo [8] for the second (Table S3). Ultimately, only the Ray assembly was used in this study, but the contig and scaffold FASTA files for both assemblies and the associated parameter files have been deposited in our locally managed genome database [9] and in the GigaScience database [10].

For Assembly 1 with Ray: The Illumina paired-end and mate-pair reads were assembled together, with the k-mer defined iteratively. Here, we cautiously selected a k-mer value of 31 to ensure that the N50 parameter was not over-optimized (over-optimizing N50 may lead to chimeric contigs [11], because N50 values are indicative of the length of the contigs but do not yield any information on their veracity). Furthermore, to assess the overall quality of the assembly, reads were subsequently re-mapped to the 10 largest scaffolds in order to detect regions harboring unusually high or low coverage, and potential assembly errors were manually reviewed. Of these scaffolds, only one (scaffold 74754) contained a single chimera (Figure S2), while the other nine did not (i.e., 1 error per 1,930,389 bp).

To test for the possibility of bacterial contamination, we queried all scaffolds against the entire GenBank nucleotide database [12], and kept only matches with more than 95% identity. Most of the scaffolds (87%) did not match any of the fragments in the database. The rest seem to match either avian (6%) or mammalian (5%) DNA sequences (Figure S3 and Table S6A). This indicates that contamination, if it exists, is most likely minimal. The database resulting from this query contains highly conserved elements and can be used for the subsequent annotation effort (Table S6B).
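For readers unfamiliar with the N50 statistic discussed above, the following Python sketch shows one common way to compute it: the length L such that contigs of length L or longer account for at least half of the total assembled length. This is a minimal illustration, not the code used by Ray or SOAPdenovo; the function name n50 and the contig lengths in the example are hypothetical toy values, not data from this assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    cover at least half of the summed length of all sequences.
    """
    if not lengths:
        raise ValueError("at least one sequence length is required")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


if __name__ == "__main__":
    # Toy contig lengths (hypothetical), chosen only to illustrate the idea.
    toy_contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
    print(n50(toy_contigs))
```

Because the statistic depends only on lengths, joining two unrelated contigs into one chimeric contig raises N50 without making the assembly any more correct, which is why the assembly was also checked by re-mapping reads.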
In total, given that the genome size is predicted to be 1.58 Gb and the total scaffold length is 1,184,594,388 bp, we infer the overall coverage of the genome to be around 76%, a value that might be slightly overestimated given that some of the scaffolds may overlap but could not be joined during assembly. Sequence data were manipulated with custom Python scripts, and local alignments were performed with the MUSCLE algorithm [13,14] within Geneious [15]. Local BLAST and BLAT were used for the similarity searches. All statistical analyses in this study were performed with SAS 9.2 (1996-2012) software (SAS Institute, Cary, NC, USA).

Annotation

To evaluate the current assembly, we compared the entire collection of transcripts listed for G. gallus in the NCBI Entrez Gene database to the scaffolds using local BLAST [16]. Among the 28,846 queries, 20,138 (70%) were found on 245,947 scaffolds, resulting in 27,431 matches (Figure S1 and Table 3). As many as 11% of the scaffolds shared similarity with at least one G. gallus sequence, at an average density of 1.39 genes per kbp. The smallest number of matches was found on unmapped scaffolds (2%), but there they occurred at the highest density (3.72 genes per kbp). Mapped Entrez Gene sequences made up 4% of the scaffold length on average, but in the scaffolds that were not mapped to either of the two avian genomes, the proportion was much higher (22%) (Table 3). While the unmapped scaffolds were the shortest (Figure S4A), they contained the highest density and percentage of the gene sequences (Figure S4B and C; Table S4A). A database containing the G. gallus gene sequences that were mapped to the parrot scaffolds, together with their locations, is available in Tables S7A and S7B.

We used RepeatMasker software [17] to search scaffolds for the presence of the known repeat classes. Overall, 59% of the scaffolds contained at least one repeat, and the proportion was much higher in those that matched to another genome (Table 3).
Even though the proportion of scaffolds containing repeats was smaller in the unmatched category than in the other four categories, the percentage of their length classified as repetitive sequence was much higher than in the other classes (22%) (Figure S4D). The most common classes of repeats found on the scaffolds were LINEs/SINEs, low complexity regions, and simple repeats, which together add up to 96% of the length of all repetitive sequences we found (Figure S5). A representative database of the different repetitive elements and their locations on scaffolds is also hosted as supplementary information on our homepage and in the GigaScience database [10]. There were no observable differences in the distribution of different cases of repetitive elements among the different classes of scaffolds (Table 3 and Table S8).

Genome annotation and education

In addition to the preliminary annotation analysis, and with the goal of using the current genome sequence as an educational tool for training the next five generations of genomics and bioinformatics students at the University of Puerto Rico at Mayaguez, we developed a strategy of manual annotation of the genome. Manual annotation is also used as a method to validate high-throughput annotation, and to channel the desire of the local community to contribute to the project. Two annotation strategies are used: (1) annotation of scaffolds for gene and repeat elements, and (2) annotation of known genes from other species.

According to the first strategy, each student in the Genome Annotation class (20 undergraduate students) is given five of the 100 longest scaffolds, which range from 120 to 206 kbp in length, and learns to apply a variety of bioinformatics tools (Table S9). Students divide each scaffold into 25 kbp segments and use an online BLAT tool to search against the chicken genome in the UCSC Genome Browser.
When a segment does not align to the same chromosome as the other segments in the scaffold, the result is confirmed using an online UCSC BLAT query [18] against the zebra finch genome, as well as by an NCBI nucleotide BLAST [16] query against the chicken genome. As an output for each scaffold, a student receives an additive score of its segments, the leftmost and rightmost coordinates of the matches, the partial or complete RefSeq genes, and the number of elements conserved in at least four of the following six vertebrate species: human, mouse, rat, opossum, Xenopus tropicalis, and zebrafish. To investigate the level of evolutionary conservation of each RefSeq gene, students also score the taxonomic groups in which orthologs are found in NCBI’s Homologene or Gene. In addition, they look up the gene ontology in UniProtKB, always scoring the major biological processes in which the gene is involved, when known. Finally, students use an online version of RepeatMasker to score the number, type and extent of repetitive elements in their scaffolds. They also store the detailed RepeatMasker outputs with the coordinates and identification of each repetitive element. An example of the annotation output produced by a student is presented in Table S10.

The other annotation strategy is to search the assembled scaffolds for genes known from other species. Students in the Genome Annotation class receive either one or two complete genes, depending on the complexity of the gene structure. To annotate the coordinates where coding regions start and end, as well as those of the splice sites within coding regions, students first identify these sites in the chicken genome from Ensembl, and then locate them in the parrot genome using BLASTx against the chicken proteins and identifying the results of this query in the UCSC Genome Browser. All isoforms identified in UniProtKB for each gene are annotated.
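The segmentation step of the first annotation strategy (dividing each scaffold into 25 kbp segments before the BLAT search) can be sketched in Python as follows. This is only an illustration: the function name segment_scaffold is hypothetical, the coordinates are assumed to be 1-based and inclusive, and the last segment is assumed to hold whatever remainder is left, since the text does not specify how the remainder is handled.

```python
def segment_scaffold(length_bp, segment_bp=25_000):
    """Split a scaffold of length_bp into consecutive segments.

    Returns a list of (start, end) tuples using 1-based, inclusive
    coordinates. The final segment may be shorter than segment_bp
    (an assumption; the source does not describe the remainder).
    """
    if length_bp <= 0 or segment_bp <= 0:
        raise ValueError("lengths must be positive")
    return [
        (start + 1, min(start + segment_bp, length_bp))
        for start in range(0, length_bp, segment_bp)
    ]


if __name__ == "__main__":
    # The shortest and longest of the 100 scaffolds are about 120 and 206 kbp.
    for scaffold_length in (120_000, 206_000):
        segments = segment_scaffold(scaffold_length)
        print(scaffold_length, len(segments), segments[-1])
```

With these assumptions, a 120 kbp scaffold yields five segments and a 206 kbp scaffold yields nine, so each student handles roughly 25 to 45 BLAT queries for five scaffolds.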
Comparative analysis

Scaffolds were compared to the chicken (Gallus gallus) [19] and zebra finch (Taeniopygia guttata) [20] genomes using local BLAST [16]. First, the entire database of A. vittata scaffolds was queried against both reference avian genomes to assess coverage, mapping concordance and the long-range contiguity of the constructed scaffolds. Only the top alignment was taken into consideration for each scaffold. The alignment with the two other avian genomes (G. gallus and T. guttata) resulted in 93.4 Mbp of total length of alignments to the chicken genome with 82.7% identity on average (average bit score 577.3), and 41.7 Mbp of total length of alignments to the zebra finch genome with 84.5% identity on average (average bit score 431.1). Despite the overall better alignment, gap lengths in chicken were much larger (43 vs. 16 per scaffold alignment in the zebra finch genome). There was no relationship between the quality scores of the alignments of the parrot scaffolds to the chicken and zebra finch genomes (Figure S6A), but there was a positive correlation between the length of the scaffolds and the alignment score in zebra finch, which was not observed for chicken (Figure S6B): the longer scaffolds have better alignment scores in T. guttata, but not in G. gallus (Figure S7).

On average, chicken genes constitute 4% of our scaffolds, and as much as 20% in the shorter unmatched fragments (Table 3). This may be the consequence of a simple tendency for gene-rich regions to be the easiest to assemble: longer scaffolds contain less repetitive sequence (Figure S8.A), and matched to more of these coding sequences (Figure S8.B).

References

1. Brinkley D: The wilderness warrior: Theodore Roosevelt and the crusade for America, 1st edn. New York: HarperCollins; 2009.
2. Snyder NFR, Wiley JW, Kepler CB: The parrots of Luquillo: natural history and conservation of the Puerto Rican parrot. Los Angeles; 1987.
3.
Waide RB: The Effect of Hurricane Hugo on Bird Populations in the Luquillo Experimental Forest, Puerto Rico. Biotropica 1991, 23(4):475-480.
4. Brash AR: The history of avian extinction and forest conversion on Puerto Rico. Biological Conservation 1987, 39(2):97-111.
5. Tiersch TR, Wachtel SS: On the evolution of genome size of birds. The Journal of Heredity 1991, 82(5):363-368.
6. Brock MK, White BN: Application of DNA fingerprinting to the recovery program of the endangered Puerto Rican parrot. Proceedings of the National Academy of Sciences 1992, 89(23):11121-11125.
7. Boisvert S, Laviolette F, Corbeil J: Ray: simultaneous assembly of reads from a mix of high-throughput sequencing technologies. Journal of computational biology: a journal of computational molecular cell biology 2010, 17(11):1519-1533.
8. Li R, Zhu H, Ruan J, Qian W, Fang X, Shi Z, Li Y, Li S, Shan G, Kristiansen K et al: De novo assembly of human genomes with massively parallel short read sequencing. Genome research 2010, 20(2):265-272.
9. The Caribbean Genome Center Genomes Database [http://genomes.uprm.edu/parrots]
10. Oleksyk TK, Guiblet W, Pombert JF, Valentin R, Martinez-Cruzado JC: Genomic data of the Puerto Rican Parrot (Amazona vittata) from a locally funded project. GigaScience 2012. http://dx.doi.org/10.5524/100039
11. Ye L, Hillier L, Minx P, Thane N, Locke D, Martin J, Chen L, Mitreva M, Miller J, Haub K et al: A vertebrate case study of the quality of assemblies derived from next-generation sequences. Genome Biology 2011, 12(3):R31.
12. Benson DA, Karsch-Mizrachi I, Clark K, Lipman DJ, Ostell J, Sayers EW: GenBank. Nucleic Acids Res 2012, 40(Database issue):D48-53.
13. Edgar RC: MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 2004, 5:113.
14. Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 2004, 32(5):1792-1797.
15.
Drummond AJ, Ashton B, Buxton S, Cheung M, Cooper A, Duran C, Field M, Heled J, Kearse M, Markowitz S et al: Geneious v5.5. http://www.geneious.com; 2010.
16. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215(3):403-410.
17. Smit AFA, Hubley R, Green P: RepeatMasker Open-3.0. http://www.repeatmasker.org; 1996-2010.
18. Bhagwat M, Young L, Robison RR: Using BLAT to find sequence similarity in closely related genomes. Curr Protoc Bioinformatics 2012, Chapter 10:Unit10 18.
19. International Chicken Genome Sequencing Consortium: Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature 2004, 432(7018):695-716.
20. Warren WC, Clayton DF, Ellegren H, Arnold AP, Hillier LW, Kunstner A, Searle S, White S, Vilella AJ, Fairley S et al: The genome of a songbird. Nature 2010, 464(7289):757-762.

Supplementary Figures:

Figure S1. Venn diagram of the overlap between the number of A. vittata scaffolds and the G. gallus transcripts from GenBank that were mapped to them by BLAST

Figure S2. A single example of chimera detected on scaffold-74754 after visual inspection of reads mapped to 100 largest scaffolds

Figure S3. Percentage of scaffolds containing fragments with > 95% similarity to GenBank sequences

Figure S4. Comparison between categories of A. vittata scaffolds (described earlier in Figure 2): The box plots show the medians, Q1, Q3 and the extreme values. The means are shown in Table 3. A. Distribution of scaffold lengths. B. Distribution of densities of genes mapped per kbp of scaffold length. C. Differences in the distribution of the proportion of the length of the scaffold mapped to a G. gallus transcript from the NCBI Entrez Gene database. D. Differences in the distribution of the proportion of the length of the scaffold mapped to a known repeat class using RepeatMasker software [17].

Figure S5. Distribution of major classes of repetitive sequences found on A.
vittata scaffolds

Figure S6. Relationship between the quality scores of the alignments of the parrot scaffolds to the chicken and zebra finch genomes: A. All scaffolds. B. Mismatched scaffolds only (those scaffolds that shared similarity with sequences of the G. gallus and T. guttata genomes but mapped to different chromosomes in the two species; see classification in Figure 2). C. Matched sequences only (those that mapped to the same chromosome in the reference genomes of the two avian species).

Figure S7. Relationship between the size of a scaffold and the quality of its alignment to the T. guttata and/or G. gallus genome sequence: A. All scaffolds aligned to the T. guttata genome. B. All scaffolds aligned to the G. gallus genome. C. Mismatched scaffolds aligned to T. guttata (those that mapped to different chromosomes in G. gallus; see classification in Figure 2). D. Mismatched scaffolds aligned to G. gallus (those that mapped to different chromosomes in T. guttata). E. Matched sequences from T. guttata only (those that mapped to the same chromosome in the reference genomes of the two avian species). F. Matched sequences from G. gallus only (those that mapped to the same chromosome in the reference genomes of the two avian species).

Figure S8. Small fragments are repeat-rich and gene-rich: A. Relationship between the length of the scaffolds and the proportion of its length matched to the G. gallus sequences from the NCBI Entrez Gene database. B. Relationship between the length of the scaffolds and the proportion of its length designated by RepeatMasker as repetitive sequence.

Supplementary Tables

Table S1. Quality and volume of four DNA samples extracted from whole blood of two Amazona vittata parrots selected for the genome sequencing

Sample #                                   1        2        3        4
Sample name                                Pa1a     Pa9a     Pa15a    Pa16a
Concentration (ng/ul), PicoGreen method    180.68   171.15   242.57   258.89
Purity (A260/A280)                         1.79     1.79     1.87     1.87
Volume (ul)                                550      550      550      260
Total DNA amount (ug)                      99.37    94.13    133.41   67.31

Table S2.
Results of the genome sequencing (Illumina HiSeq, Axeq Technologies). Pa9a_1 and Pa9a_2 represent the opposite ends of the ~300 bp short fragments, and Pa9a-MP_1 and Pa9a-MP_2 are the opposite ends of the 2,500 bp mate pairs (MP). All sequences were 101 bp long.

                 Bases
Sequence name    A               C               G               T               N
Pa9a_1           3,848,744,755   2,897,323,917   2,900,927,647   3,825,967,145   23,781,474
Pa9a_2           3,868,006,366   2,890,106,491   2,911,110,520   3,826,704,806   816,755
Pa9a-MP_1        2,174,635,550   1,720,812,735   1,688,621,488   2,158,688,549   246,593
Pa9a-MP_2        2,171,866,496   1,674,935,595   1,748,437,549   2,147,171,234   594,041

Table S3. Results of the genome assembly by SOAPdenovo [8]

Category / Statistics        paired-end       PE + MP 5Gb         PE + MP All
                             (27.0Gb)         (27.0Gb + 5.2Gb)    (27.0Gb + 15.5Gb)
All Contigs
  # Contigs                  12,764,879       12,887,828          12,887,828
  Total Length               1,560,663,735    1,772,155,983       1,772,155,983
  Largest Contig             15,182           18,359              18,359
  Mean Length                122.3            137.5               137.5
  N50                        673              636                 636
Of All, Contigs ≥ 100bp
  # Contigs                  1,884,625        4,450,396           4,450,396
  Total Length               1,099,661,342    1,396,126,299       1,396,126,299
  Mean Length                583.5            313.7               313.7
  N50                        1,188            1,123               1,123
Scaffolds
  Largest Scaffold           N/A              2,014,591           3,309,686
  # Scaffolds                N/A              3,410,722           3,384,799
  Total Length               N/A              1,530,900,674       1,590,552,602
  Mean Length                N/A              448.8               470
  N50                        N/A              74,348              126,952

These results are not used in the study, but have been deposited at http://genomes.uprm.edu/parrot/Assembly-2011/

Table S4A. Summary of the alignment of A. vittata sequences to the G. gallus genome sequence containing only the top alignment for each scaffold, its chromosomal position and quality scores.

Table S4B. Summary of the alignment of A. vittata sequences to the T. guttata genome sequence containing only the top alignment for each scaffold, its chromosomal position and quality scores.

Table S4C. The database of the alignment information of A. vittata sequences to G. gallus and T. guttata genome sequence by BLAST

Table S5.
Proportions of sequences with some similarity that mapped to chromosomes of two reference avian genomes (G. gallus and T. guttata)

Table S6A. The summary of the database of GenBank sequences with more than 95% similarity with the parrot scaffolds

Table S6B. The database of GenBank sequences with more than 95% similarity with the parrot scaffolds found by BLAST

Table S7A. A map of G. gallus transcripts from the NCBI Entrez Gene database that mapped to one of the A. vittata scaffolds

Table S7B. The database of alignments between G. gallus transcripts from the NCBI Entrez Gene database and A. vittata scaffolds by BLAST

Table S8. Distribution of different cases of repetitive elements among different classes of A. vittata scaffolds

Table S9. Bioinformatics tools and outputs for scaffold and gene annotation

Elements annotated: Scaffolds
  Tools: BLAT, n-BLAST, Homologene, Gene, UniProtKB, RepeatMasker
  Outputs per scaffold:
  - Chromosome number and end coordinates of BLAT matches
  - Match scores
  - # of elements conserved
  - # and identity of RefSeq genes
  - Gene orthology and ontology

Elements annotated: Genes
  Tools: UniProtKB, x-BLAST, UCSC Genome Browser
  Outputs per gene:
  - Coordinates for start and end of each Ensembl coding region plus splice sites within

Table S10. An example of annotation output produced by a student in the Genome Annotation class using the A. vittata genome