The Genome Sequence of the Puerto Rican Parrot

Supplementary Materials
A Locally Funded Puerto Rican Parrot (Amazona vittata) Genome
Sequencing Project Increases Avian Data and Advances Young
Researcher Education
Oleksyk, T.K.1,*, Pombert, J.F.2, Siu, D.3, Mazo-Vargas, A.1, Ramos, B.1, Guiblet, W.1, Afanador, Y.1, Ruiz-Rodriguez, C.T.1,4, Nickerson, M.L.4, Logue, D.1, Dean, M.4, Figueroa, L.5, Valentin, R.6, and Martinez-Cruzado, J.C.1
Address:
1 University of Puerto Rico at Mayagüez, PUERTO RICO;
2 University of British Columbia, BC, CANADA;
3 Axeq Technologies, SOUTH KOREA;
4 Cancer and Inflammation Program, National Cancer Institute, NIH, Frederick, MD, U.S.A.;
5 Compañía de Parques Nacionales de Puerto Rico, PUERTO RICO;
6 Department of Natural and Environmental Resources, PUERTO RICO;
Email: Oleksyk, T.K.* - taras.oleksyk@upr.edu, Pombert, J.F. - jpombert@mail.ubc.ca, Siu, D. - danielsiu@axeq.com, Mazo-Vargas, A. - bioanyimima@gmail.com, Ramos, B. - brabrg@hotmail.com, Guiblet, W. - Wilfried.Guiblet@upr.edu, Afanador, Y. - yashira.afanador@upr.edu, Ruiz-Rodriguez, C.T. - christina.t.ruiz@gmail.com, Nickerson, M.L. - nickersonml@od.nih.gov, Logue, D. - david.logue@upr.edu, Dean, M. - deanm@mail.nih.gov, Figueroa, L. - wildvet@coqui.net, Valentin, R. - el.cotorro.electrico@gmail.com, and Martinez-Cruzado, J.C. - juancarlos.martinez@upr.edu
*To whom correspondence should be addressed
Background
The Puerto Rican parrot (Amazona vittata) is the only surviving native parrot species in the United States[1]. This endangered bird was once abundant throughout the island of Puerto Rico, but the population declined drastically during the 19th century with the decimation of the old growth forest habitat[2]. One of the first people to call attention to the need for its conservation was President Theodore Roosevelt, who formed the Luquillo Forest Reserve for the purpose of protecting the parrots’ nesting grounds[1]. El Yunque National Forest and a few other remote locations were spared from deforestation, allowing a small population of parrots to survive [2-4]. In the late 20th century, the species came very close to extinction: only 13 birds remained alive in 1975. The Puerto Rican parrot has been included on the Endangered Species List since 1967, and has been listed as critically endangered by the World Conservation Union since 1994. A recovery effort has been initiated by two captive breeding programs, one at the Luquillo Aviary close to El Yunque and another at the Rio Abajo State Forest. This effort has proven to be a success: the captive population is steadily increasing, and several captive-bred birds have been released into the wild. The genome-wide consequences of the population decline and their effects on the species’ recovery have never been assessed.
Like that of most birds, the A. vittata genome is relatively small (1.58Gb), less than half the size of the human genome[5]. It is also highly invariable due to the recent population bottleneck[6]. These two aspects made the Puerto Rican parrot an ideal candidate for genome sequencing and assembly: the small genome size allows high sequencing coverage, while the low amount of genetic variation permits accurate alignments between sequenced fragments, resulting in fewer assembly errors. In addition, a study of the A. vittata genome provides an unprecedented opportunity to describe the remaining genomic variation in the species. It will facilitate a search for the genomic variation relevant to the parrot’s survival, contribute to our understanding of the causes of its decline, and help guide conservation efforts through devising effective breeding strategies. Thus, the genome-wide information will eventually contribute to the conservation effort and to the species’ long-term recovery. In this paper, we present the initial de novo sequence assembly of the Amazona vittata genome.
This initiative is unique among genome studies because it was funded entirely from individual donations. The local community generously supported our project and allowed rapid progress of genome sequencing. Funding was generated at fundraisers organized by the faculty and students of the Biology Department of the University of Puerto Rico at Mayagüez, and by individual donations from Puerto Rican citizens. To our knowledge, this is the first example of a community-funded genome sequencing effort directed at an endangered species. We are hopeful that this project will serve as an example for future efforts that use genomics to raise awareness and contribute to the conservation of endangered species elsewhere.
Methods
Sample collection and genome sequencing: All sampling procedures were carried out as approved by the University of Puerto Rico at Mayagüez Institutional Animal Care and Use Committee (IACUC#201109.1), and in accordance with the guidance for the Endangered Species Act. A veterinarian designated by the Puerto Rico Department of Natural and Environmental Resources (DNER) collected blood from two non-reproductive captive female birds at the Rio Abajo aviary during a routine health check procedure. Syringes were heparinized before samples were taken. Approximately 1.5 cc of blood was extracted from each bird using a 3 cc syringe with a 27-gauge needle. Birds were returned to their cages after the procedure and remained under observation for the rest of the day to make sure that no harm had been done. Samples were aliquoted and extracted multiple times (23 times for one bird and 15 times for the other) to produce a total of 2 ug of purified DNA with the standard protocol of the QIAamp® DNA Blood Midi kit using the following modifications. Blood samples were diluted in 1x PBS in a 1:10 ratio. The samples were incubated at 70ºC in the lysis buffer and protease for 2-12 hours. After cellular lysis, DNA was precipitated with 100% ethanol and the solution was passed twice through the Midi column. The DNA was eluted two times with 300 ul of pre-warmed elution buffer, incubated for 10 minutes at room temperature and then centrifuged for two minutes. After the DNA isolation, DNA purity was assessed with a BioTek microplate reader (A260/A280) compared to a blank control after correcting for the baseline (A320). The remainder of the tissue material has been stored at -80°C.
The concentrations and quality of DNA were re-assessed using a fluorometer and the PicoGreen dsDNA binding assay (Invitrogen), and four of the best DNA samples were chosen. These samples were shipped to Axeq Technologies (Rockville, MD; and Seoul, South Korea) to be used for next-generation sequencing, and the DNA concentrations were validated once more upon arrival (Table S1). The DNA concentrations did not deviate between the different methods of measurement. One sample (Pa9a) was finally selected and sequenced on the Illumina HiSeq platform with both fragment and paired-end sequencing approaches, resulting in a total of 42,479,499,706 bases (Table S2).
The sequencing was initiated with the construction of two genome libraries: a short fragment library (~300 bp inserts) for sequencing the majority of the genome, and a long fragment library (~2.5 kb inserts) to generate scaffolds to be used to order and assemble contigs derived from the short fragment library. Raw Illumina HiSeq reads were processed and filtered using the Genome Analyzer Pipeline software provided by the manufacturer, set to the default parameters. Of the reads generated, 86.48% of the 309,060,168 paired-end reads and 85.14% of the 180,079,956 mate-pair reads passed Illumina quality control (QC). If one read from a pair failed the QC, the whole pair was filtered out. Based on the total number of base pairs generated (Table 1, S2) and the predicted genome size of 1.58Gb[5], we predicted a total coverage depth of 26.89X of the parrot’s genome: 17.08X coverage for the short fragment reads, and 9.8X coverage for the mate pairs (Table 2).
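The coverage figures quoted above follow from simple arithmetic: the number of bases that passed QC divided by the predicted genome size. The sketch below reproduces them to within rounding. It assumes 101 bp reads (the length stated in the caption of Table S2) and that the quoted read counts are individual reads; the function name is illustrative and is not taken from the authors' scripts.

```python
GENOME_SIZE_BP = 1.58e9   # predicted genome size [5]
READ_LENGTH_BP = 101      # read length given with Table S2 (assumed for all reads)

def expected_coverage(n_reads, pass_fraction,
                      read_length=READ_LENGTH_BP, genome_size=GENOME_SIZE_BP):
    """Expected depth = reads passing QC * read length / genome size."""
    return n_reads * pass_fraction * read_length / genome_size

paired_end = expected_coverage(309_060_168, 0.8648)  # short fragment library
mate_pair = expected_coverage(180_079_956, 0.8514)   # long fragment library
total = 42_479_499_706 / GENOME_SIZE_BP              # from the total base count

print(f"paired-end: {paired_end:.2f}X")
print(f"mate-pair:  {mate_pair:.2f}X")
print(f"total:      {total:.2f}X")
```

The small difference between the per-library estimates and the reported values comes from the rounded pass percentages.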
Assemblies: Two different de novo assemblies were attempted. In the first assembly, we used Ray[7] software on 256 CPUs for 8 hours (Table 2). The second assembly was produced with SOAPdenovo[8] (Table S3) but was not used in the current evaluation and annotation; only the Ray[7] assembly is discussed in this study. However, contig and scaffold FASTA files for both assemblies and the associated parameter files have been deposited in our locally managed genome database (http://genomes.uprm.edu/parrots) and in GigaDB. Sequence data were manipulated with custom Python scripts, and local alignments were performed with the MUSCLE algorithm[9, 10] within Geneious [11]. Local BLAST and BLAT were used for the similarity search. All statistical analyses in this study were performed with SAS 9.2 software (SAS Institute, Cary NC 1996-2012).
Comparative Analysis: Scaffolds were compared to the chicken (Gallus gallus) [12] and zebra finch (Taeniopygia guttata) [13] genomes using local BLAST [14]. First, the entire database of A. vittata scaffolds was queried against both reference avian genomes to assess coverage, mapping concordance and the long-range contiguity of the constructed scaffolds. Only the top alignment was taken into consideration for each scaffold.
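Keeping only the top alignment per scaffold can be sketched as a small filter over BLAST results. The example below assumes the hits were exported in the standard 12-column tabular layout (query ID first, bit score last), and it uses bit score as the ranking criterion; the paper does not state the output format or the criterion, so both are assumptions. The scaffold names and rows are hypothetical.

```python
import csv
import io

BITSCORE_COLUMN = 11  # 12th column of the standard BLAST tabular layout

def top_hits(lines):
    """Keep, for each query (first column), the row with the highest bit score."""
    best = {}
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query = row[0]
        score = float(row[BITSCORE_COLUMN])
        if query not in best or score > float(best[query][BITSCORE_COLUMN]):
            best[query] = row
    return best

# Hypothetical rows, for illustration only (not data from this study).
example = io.StringIO(
    "scaffold-A\tchr1\t82.7\t500\t80\t6\t1\t500\t1000\t1499\t1e-100\t577.3\n"
    "scaffold-A\tchr5\t80.1\t300\t55\t4\t1\t300\t2000\t2299\t1e-50\t210.0\n"
    "scaffold-B\tchr11\t84.5\t400\t60\t2\t1\t400\t5000\t5399\t1e-90\t431.1\n"
)
hits = top_hits(example)
for query, row in hits.items():
    print(query, row[1], row[BITSCORE_COLUMN])
```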
Results
Assembly: The Illumina paired-end and mate-pair reads were assembled together with Ray[7], with the k-mer length chosen iteratively. Here, we cautiously selected a k-mer value of 31 to ensure that the N50 parameter was not over-optimized (over-optimizing N50 may lead to chimeric contigs[15], as this value is indicative of the length of the contigs but does not yield any information on their veracity). Furthermore, to assess the overall quality of the assembly, reads were subsequently re-mapped to the 10 largest scaffolds in order to detect regions harboring unusually high or low coverage, and potential assembly errors were manually reviewed. Of these, only one (scaffold-74754) contained a single chimera (Figure S2), while the other 9 did not (i.e. 1 error per 1,930,389 bp).
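For readers unfamiliar with the statistic, N50 is the length L such that contigs (or scaffolds) of length L or longer together contain at least half of the assembled bases. The minimal sketch below shows the standard calculation. It measures contiguity only, which is why a larger N50 does not by itself indicate a more correct assembly. The function name and the example lengths are illustrative.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one length is required")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length

# Toy lengths, not values from this assembly.
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```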
To test for the possibility of bacterial contamination, we queried all scaffolds against the entire GenBank nucleotide database [16], and filtered matches with more than 95% identity. Most of the scaffolds (87%) did not match any of the fragments in the database. The rest appear to match sequences from either avian (6%) or mammalian (5%) DNA (Figure S3; Table S6.A). This indicates that contamination, if it exists, is most likely minimal. The database resulting from this query contains highly conserved elements and can be used for the subsequent annotation effort (Table S6.B). In total, given that the genome size is predicted to be 1.58Gb and the total scaffold length is 1,184,594,388 bp, we infer the overall coverage of the genome to be around 76%, a value that might be slightly overestimated given that some of the scaffolds may be overlapping but could not be assembled.
Annotation: To evaluate the current assembly, we compared the entire collection of transcripts listed for G. gallus in the NCBI Entrez Gene database against our scaffolds using local BLAST[14]. Among the 28,846 queries, 20,138 (70%) were found on 245,947 scaffolds, resulting in 27,431 matches (Figure S1; Table 3). Overall, 11% of the scaffolds shared similarity with at least one G. gallus sequence, at an average density of 1.39 genes/kbp. The smallest number of matches was found on unmapped scaffolds (2%), but there they occurred at the highest density (3.72 genes/kbp). Mapped Entrez Gene sequences made up 4% of the scaffold length on average, but in the scaffolds that were not mapped to either of the two avian genomes the proportion was much higher (22%) (Table 3). While the unmapped scaffolds were the shortest (Figure S4A), they contained the highest density and percentage of gene sequences (Figure S4B,C; Table S4.A). A database containing the G. gallus gene sequences that were mapped to the parrot scaffolds, together with their locations, is available in Table S7.B.
We used RepeatMasker software [17] to search the scaffolds for the presence of known repeat classes. Overall, 59% of the scaffolds contained at least one repeat, and the proportion was much higher in those that matched another genome (Table 3). Even though the proportion of scaffolds containing repeats was smaller in the unmatched category than in the other four categories, the percentage of its length classified as repetitive sequence was much higher than in the other classes (22%) (Figure S4D). The most common classes of repeats found on scaffolds were LINEs/SINEs, low complexity regions, and simple repeats, and together they add up to 96% of the length of all repetitive sequences we found (Figure S5). A representative database of the different repetitive elements and their locations on scaffolds is presented in Table S8A,B. There were no observable differences in the distribution of the different types of repetitive elements among the classes of scaffolds listed in Table 3 (Table S9).
Comparative analysis: The alignment with two other avian genomes (G. gallus and T. guttata) resulted in 93.4 Mbp of total length of alignments to the chicken genome with 82.7% identity on average (average bit score 577.3), and 41.7 Mbp of total length of alignments to the zebra finch genome with 84.5% identity on average (average bit score 431.1). Despite the overall better alignment, gap lengths in chicken were much larger (43 vs. 16 per scaffold alignment in the zebra finch genome). There was no relationship between the quality scores of alignment of the parrot scaffolds to the chicken and zebra finch genomes (Figure S6.A), but there was a positive correlation between scaffold length and alignment score in zebra finch that was not observed in chicken (Figure S6.B). Likewise, the longer scaffolds had better alignment scores in T. guttata, but not in G. gallus (Figure S7).
The top BLAST alignments were sorted by the midpoint of their locations, and their frequencies were calculated in 1 Mbp bins and plotted along chromosomes for both the G. gallus and T. guttata genomes using Circos[18] (Figure 1). The chicken genome coverage was higher (109 scaffolds per Mbp in chicken on average vs. 72 in zebra finch), and the chicken genome also had several locations where coverage was much higher than the average (Figure 1). In both genomes, chromosome 11 had elevated coverage by scaffolds: 210 and 220 on average in chicken and zebra finch, respectively (Figure 1). In total, 57% of the scaffolds partially aligned to one or both of the genomes: 21.7% aligned only to G. gallus, 10.6% aligned exclusively to T. guttata, and 25% aligned to both genomes (Figure 2). The data for this analysis are presented and summarized in Table S4.A for the chicken genome alignment and Table S4.B for the zebra finch genome alignment. All of the information from this analysis is available online (Table S4.C).
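The binning step described above, counting top alignments per 1 Mbp window by the midpoint of each alignment, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' script: alignments are given as hypothetical (chromosome, start, end) tuples on the reference genome, and bins are numbered from 0.

```python
from collections import Counter

BIN_SIZE_BP = 1_000_000  # 1 Mbp bins, as in the text

def bin_alignments(alignments, bin_size=BIN_SIZE_BP):
    """Count alignments per (chromosome, bin index) using each alignment's midpoint."""
    counts = Counter()
    for chromosome, start, end in alignments:
        # The midpoint is the same whether start < end or start > end
        # (BLAST reports minus-strand hits with start > end).
        midpoint = (start + end) / 2
        counts[(chromosome, int(midpoint // bin_size))] += 1
    return counts

# Hypothetical alignments, for illustration only.
example = [
    ("chr11", 100, 600),
    ("chr11", 999_900, 1_000_300),
    ("chr11", 1_500_400, 1_500_000),  # minus-strand hit
    ("chr1", 2_000_000, 2_000_500),
]
counts = bin_alignments(example)
for (chromosome, index), n in sorted(counts.items()):
    print(chromosome, index, n)
```

The resulting per-bin counts are the kind of table that a plotting tool such as Circos takes as a histogram track.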
Although a large proportion of scaffolds shared some similarity with the two avian genomes, there was some discordance in the alignment, as only 12.6% of the scaffolds (2.8% of the total number of scaffolds) aligned to the same chromosome in both species (Figure 2, top; Table S5), and the proportion of discordance varied across chromosomes, with the lowest value on chromosome 11 (Figure 2, bottom; Table S5). While this lack of synteny could point to extensive rearrangements during evolutionary history, the proportions of scaffolds discordantly aligned between chromosomes seem to be distributed similarly to the chromosome lengths, indicating a significant random component (Figure 3). To test for contiguity, we selected the 200 longest scaffolds and independently queried their 500 bp ends against the chicken genome. Of these, only 10 scaffolds (5%) showed discordance, with their opposite ends aligning to two or more different chicken chromosomes.
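The end-concordance test described above can be outlined in two steps: cut the terminal 500 bp from each scaffold, then compare the chromosomes to which the two ends align. The sketch below covers only the bookkeeping, since the alignment itself was done with BLAST. The helper names are hypothetical, and treating an unaligned end as "not discordant" is an assumption of this sketch, not a rule stated in the paper.

```python
def scaffold_ends(sequence, end_length=500):
    """Return the first and last `end_length` bases of a scaffold."""
    if len(sequence) < 2 * end_length:
        raise ValueError("scaffold is too short for non-overlapping ends")
    return sequence[:end_length], sequence[-end_length:]

def is_discordant(left_chromosome, right_chromosome):
    """True when both ends aligned, but to different chromosomes."""
    if left_chromosome is None or right_chromosome is None:
        return False  # assumption: an unaligned end is not counted as discordant
    return left_chromosome != right_chromosome

# Toy scaffold and hypothetical top-hit chromosomes for three scaffolds.
left, right = scaffold_ends("A" * 600 + "C" * 600)
calls = [("chr1", "chr1"), ("chr1", "chr5"), (None, "chr2")]
n_discordant = sum(is_discordant(a, b) for a, b in calls)
print(n_discordant, "of", len(calls), "scaffolds discordant")

# The proportion reported in the text: 10 of 200 scaffolds.
print(f"{10 / 200:.0%}")
```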
Discussion
Supported by funds raised through local community efforts, we completed the initial sequencing and assembly of the genome of the Puerto Rican parrot, the rarest parrot species in the wild. This study represents the first assembly of a genome sequence for a parrot endemic to the United States, and also the first genome of a species from Amazona, a diverse and ecologically important genus native to South America and the Caribbean.
The assembled sequence will be used as a starting point towards completing and annotating a draft genome sequence. With roughly 76% of the genome covered, several patterns are already visible that can be helpful in designing the future sequencing effort. Our sequence can already be used for annotation and comparative genomics, and it is encouraging that over 70% of the sequences reported for G. gallus in the NCBI Entrez Gene database matched our scaffolds. On average, genes constituted 4% of our scaffolds, but in the shorter unmatched fragments the proportion was much higher, about 20% (Table 3). There may also be a tendency for longer scaffolds to contain less repetitive sequence (Figure S8.A) and a higher number of coding sequences (Figure S8.B), which would make them easier to assemble.
At this point, our scaffolds are still too short. Unfortunately, closing the remaining gaps by Sanger or Roche 454 sequencing would be cost-prohibitive for our mode of funding. As an alternative, another run of deeper resequencing and longer (8 kbp) mate-pair Illumina sequencing could constitute a sensible approach in the short term. As the sequence coverage increases, we would be able to increase the length of the scaffolds, to better identify regions that cannot be assembled using short sequencing reads, and then to select the appropriate approaches to resolve these gaps. The final assembly will likely be hindered by the presence of repetitive sequences: some have been previously reported in parrots (Psittaciformes) [19], and the repeated sequences in the A. vittata genome are likely to be underrepresented in the current assemblies. We are following the development of other cost-effective sequencing technologies (e.g. nanopore [20]) with great interest, and plan to implement them when they become available.
We have chosen a de novo assembly strategy primarily because of the absence of a closely related draft genome assembly with high quality comparable to the genomes of model organisms, such as human, mouse or the fruit fly (Drosophila). The closest related species with an assembled genome is the budgerigar (Melopsittacus undulatus), used as a model genome in Assemblathon [21] (http://assemblathon.org), whose divergence time from the Amazon parrots has recently been estimated at 39-64 MYA depending on the calibration of the molecular clock [22]. For comparison, primates as a group arose in the Paleocene (65 to 56 MYA) and diverged in the Eocene (56 to 34 MYA) into major groups such as Lemuroidea, Lorisoidea, Platyrrhini, and Catarrhini [23]. A reference assembly of the A. vittata genome may be hindered by the presence of repetitive sequences. At least some of the repeats have already been reported in parrots (Psittaciformes): 6.8% of a parrot genome consisted of a tandemly repeated 190-bp sequence (P1) located in the centromeres of many if not all chromosomes[19]. Our results show a similar proportion of 4.3%, but this number is probably an underestimate, as the repeated sequences are likely to be underrepresented in the assembly. The divergence from the zebra finch and chicken genomes is greater still, making reference assemblies difficult. The de novo assembly therefore allows us to perform an unbiased and conservative reconstruction of contigs and scaffolds that could be subsequently analyzed for homology and synteny against other such assemblies.
However, if a genome of a closely related bird is sequenced and assembled in the near future, it may help organize contigs generated in this study. Complementarily, expressed sequence tag (EST) or RNA-Seq studies from various parrot species will permit the development of robust gene prediction models for the genome annotation of these birds. Several parrot species assemblies are currently being developed by other groups: the parakeet [21] and the macaw genomes. Detailed genome comparisons between them will help interpret avian evolution and point to unique adaptive traits of the Puerto Rican parrot. However, at this point, the de novo assembly of the PR parrot genome is still necessary to avoid alignment bias towards a reference genome from another avian species.
Potential implications
Development of genome project for an endangered species: The initial sequence of A. vittata is only the first step in a comprehensive effort that aims to incorporate science, public education, and community activities in order to raise awareness about conservation efforts essential to save the last extant parrot species in the United States. It provides a basis for future studies as part of the Puerto Rico Genome Project, which can be divided into three stages: (1) Improvement of the draft assembly and annotation of genes and regulatory regions, (2) Examination of genetic diversity through study of captive and free living A. vittata populations, (3) Comparative genomics studies of the PR parrot and other species from the genus Amazona. The efforts proposed for these three stages are described and discussed below.
The genome sequence of the Puerto Rican parrot has important implications beyond the species’ conservation effort. Amazon parrots (Genus: Amazona) are an important and diverse group throughout the Neotropics. A genome sequence of one Amazona is reasonably justified if only to serve as a baseline reference sequence for additional assemblies of genome sequences from other related species. Comparative genomics studies among closely related birds occupying different ecological niches will likely uncover adaptations at the molecular level, and may serve as a useful model for discovery of genes and/or genetic alterations necessary for adaptation of parrots and other birds. Lastly, many diverse assembly and annotation projects can be generated from genomic datasets to provide unique training opportunities for graduate and undergraduate students in genomics, bioinformatics, ecology, and species conservation.
Our next objective is to improve the current coverage and fill the assembly gaps in the current data. A complementary sequencing of genome libraries will be undertaken to obtain missing sequences or improve the quality of sequences when necessary. In addition to the gene models identified by computer algorithms, genes and other genetic elements of the parrot’s genome will be identified by comparing its sequence to existing genome sequences of other birds such as the zebra finch, parakeet, turkey and chicken, as well as reptiles such as the anole lizard (Anolis carolinensis). Finally, cDNA library construction and sequencing of genes expressed in various tissues of related, but not critically endangered, species of parrots, such as the Hispaniolan parrot (Amazona ventralis), are proposed to generate models for all genes, since this species is known to be genetically closely related[6, 24]. A collaboration is being developed to study the species transcriptome to help identify functional genes. Eventually, the combination of multiple efforts will allow us to publish a Genome Draft: a nearly complete sequence of the genome (>90%) with genes marked in the sequence.
Genome Annotation and Education: In addition to the preliminary annotation analysis, and with the goal of using the current genome sequence as an educational tool for training the next generation of genomics and bioinformatics students at the University of Puerto Rico at Mayagüez, we developed a strategy of manual annotation of the genome. Manual annotation is also used as a way to validate high-throughput annotation, and to channel the desire of the local community to contribute to the project. Two annotation strategies are used: (1) annotation of scaffolds for gene and repeat elements, and (2) annotation of known genes from other species.
According to the first strategy, each student in the Genome Annotation class (20 undergraduate students) is given five of the one hundred longest scaffolds, ranging from 120 to 206 kbp in length, and learns to apply a variety of bioinformatics tools (Table S10). Students divide each scaffold into 25 kbp segments and use an online BLAT tool to search them against the chicken genome in the UCSC Genome Browser. When one segment does not align to the same chromosome as the other segments in the scaffold, the result is confirmed using an online UCSC BLAT query [25] against the zebra finch genome, as well as by an NCBI nucleotide BLAST [14] query against the chicken genome. As an output for each scaffold, a student receives an additive score of its segments, the leftmost and rightmost coordinates of the matches, the partial or complete RefSeq genes, and the number of conserved elements in at least four of the following six vertebrate species: human, mouse, rat, opossum, Xenopus tropicalis, and zebrafish. To investigate the level of evolutionary conservation of each RefSeq gene, students also score the taxonomic groups where orthologs in NCBI’s HomoloGene or Gene are found. In addition, they look up the gene ontology in UniProtKB, always scoring the major biological processes in which the gene is involved, when known. Finally, students use an online version of RepeatMasker to score the number, type, and extent of repetitive elements in their scaffolds. They also store the detailed RepeatMasker outputs with the coordinates and identification of each repetitive element. An example of the annotation output produced by a student is presented in Table S11.
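The first two steps of this classroom workflow, cutting a scaffold into 25 kbp segments and flagging any segment whose best match falls on a different chromosome from the rest, can be sketched in a few lines. The students used online tools for the searches themselves, so the chromosome calls below are hypothetical inputs, and the majority-vote rule for picking the "expected" chromosome is an assumption of this sketch.

```python
from collections import Counter

def split_scaffold(sequence, segment_length=25_000):
    """Return (start offset, segment) pairs covering the scaffold left to right."""
    return [
        (start, sequence[start:start + segment_length])
        for start in range(0, len(sequence), segment_length)
    ]

def outlier_segments(segment_chromosomes):
    """Indices of segments whose chromosome differs from the most common one."""
    majority, _ = Counter(segment_chromosomes).most_common(1)[0]
    return [i for i, chromosome in enumerate(segment_chromosomes)
            if chromosome != majority]

# A toy 120 kbp scaffold gives four full segments and one 20 kbp remainder.
segments = split_scaffold("A" * 120_000)
print([(start, len(segment)) for start, segment in segments])

# Hypothetical best-hit chromosomes for the five segments.
flagged = outlier_segments(["chr2", "chr2", "chr7", "chr2", "chr2"])
print(flagged)
```

Flagged segments are the ones that would then be re-checked against the zebra finch genome and with NCBI BLAST, as described above.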
The other strategy of annotation is to search for genes known from other species in the assembled scaffolds. Students in the Genome Annotation class receive either one or two complete genes, depending on gene structure complexity. To annotate the coordinates where coding regions start and end, as well as those of the splice sites within coding regions, students first identify these sites in the chicken genome from Ensembl, and then localize them in the parrot genome using BLASTx against the chicken proteins and identifying the results of this query in the UCSC Genome Browser. All isoforms identified in UniProtKB for each gene are annotated.
Development of local community involvement: The Puerto Rico National Parks Company recently started exhibiting Puerto Rican parrots and related species at the Puerto Rico Zoo. This exhibition will bring attention to the conservation effort for this endangered species. In conjunction with the comparative genomics study of the group to be conducted at the University of Puerto Rico, the parrot exhibition will become a valuable research and educational resource for the local and international scientific community. The sequence of the Puerto Rican parrot will be used as a reference to study differences between species and find genomic regions responsible for adaptation and survival.
Bibliography
1. Brinkley D: The wilderness warrior: Theodore Roosevelt and the crusade for America, 1st edn. New York: HarperCollins; 2009.
2. Snyder NFR, Wiley JW, Kepler CB: The parrots of Luquillo, natural history and conservation of the Puerto Rican parrot. Los Angeles; 1987.
3. Waide RB: The Effect of Hurricane Hugo on Bird Populations in the Luquillo Experimental Forest, Puerto Rico. Biotropica 1991, 23(4):475-480.
4. Brash AR: The history of avian extinction and forest conversion on Puerto Rico. Biological Conservation 1987, 39(2):97-111.
5. Tiersch TR, Wachtel SS: On the evolution of genome size of birds. The Journal of heredity 1991, 82(5):363-368.
6. Brock MK, White BN: Application of DNA fingerprinting to the recovery program of the endangered Puerto Rican parrot. Proceedings of the National Academy of Sciences 1992, 89(23):11121-11125.
7. Boisvert S, Laviolette F, Corbeil J: Ray: simultaneous assembly of reads from a mix of high-throughput sequencing technologies. Journal of computational biology : a journal of computational molecular cell biology 2010, 17(11):1519-1533.
8. Li R, Zhu H, Ruan J, Qian W, Fang X, Shi Z, Li Y, Li S, Shan G, Kristiansen K et al: De novo assembly of human genomes with massively parallel short read sequencing. Genome research 2010, 20(2):265-272.
9. Centers for Disease Control and Prevention C: Growth Charts. In.; 2010.
10. Slaughter MH, Lohman TG, Boileau RA, Horswill CA, Stillman RJ, Van Loan MD, Bemben DA: Skinfold equations for estimation of body fatness in children and youth. Hum Biol 1988, 60(5):709-723.
11. Deurenberg P, Pieters JJ, Hautvast JG: The assessment of the body fat percentage by skinfold thickness measurements in childhood and young adolescence. Br J Nutr 1990, 63(2):293-303.
12. Steinberger J, Jacobs DR, Raatz S, Moran A, Hong CP, Sinaiko AR: Comparison of body fatness measurements by BMI and skinfolds vs dual energy X-ray absorptiometry and their relation to cardiovascular risk factors in adolescents. Int J Obes (Lond) 2005, 29(11):1346-1352.
13. Pereira MA, FitzerGerald SJ, Gregg EW, Joswiak ML, Ryan WJ, Suminski RR, Utter AC, Zmuda JM: A collection of Physical Activity Questionnaires for health-related research. Med Sci Sports Exerc 1997, 29(6 Suppl):S1-205.
14. Martinez-Cruzado JC, Toro-Labrador G, Ho-Fung V, Estevez-Montero MA, Lobaina-Manzanet A, Padovani-Claudio DA, Sanchez-Cruz H, Ortiz-Bermudez P, Sanchez-Crespo A: Mitochondrial DNA analysis reveals substantial Native American ancestry in Puerto Rico. Hum Biol 2001, 73(4):491-511.
15. Marshall AL, Smith BJ, Bauman AE, Kaur S: Reliability and validity of a brief physical activity assessment for use by family doctors. Br J Sports Med 2005, 39(5):294-297; discussion 294-297.
16. Martínez-Cruzado JC, Toro-Labrador G, Viera-Vera J, Rivera-Vega MY, Startek J, Latorre-Esteves M, Román-Colón A, Rivera-Torres R, Navarro-Millán IY, Gómez-Sánchez E et al: Reconstructing the population history of Puerto Rico by means of mtDNA phylogeographic analysis. American Journal of Physical Anthropology 2005, 128(1):131-155.
17. Kish L: A Procedure for Objective Respondent Selection Within the Household. In: Leslie Kish Selected Papers. Edited by Kalton G, Heeringa S; 2003.
18. Brown WJ, Trost SG, Bauman A, Mummery K, Owen N: Test-retest reliability of four physical activity measures used in population surveys. J Sci Med Sport 2004, 7(2):205-215.
19. Madsen CS, de Kloet DH, Brooks JE, de Kloet SR: Highly repeated DNA sequences in birds: The structure and evolution of an abundant, tandemly repeated 190-bp DNA fragment in parrots. Genomics 1992, 14(2):462-469.
20. Sellmann T, Saeed D, Danzeisen O, Albert A, Blehm A, Kram R, Kindgen-Milles D, Hoehn T, Winterhalter M: Extracorporeal Membrane Oxygenation Implantation via Median Sternotomy for Fulminant Pulmonary Edema After Cold Water Submersion with Cardiac Arrest. J Cardiothorac Vasc Anesth 2011.
21. Hao K, Li C, Rosenow C, Wong WH: Detect and adjust for population stratification in population-based association study using genomic control markers: an application of Affymetrix Genechip Human Mapping 10K array. Eur J Hum Genet 2004, 12(12):1001-1006.
22. Booth ML, Okely AD, Chey T, Bauman A: The reliability and validity of the physical activity questions in the WHO health behaviour in schoolchildren (HBSC) survey: a population study. Br J Sports Med 2001, 35(4):263-267.
23. Brener ND, Billy JO, Grady WR: Assessment of factors affecting the validity of self-reported health-risk behavior among adolescents: evidence from the scientific literature. J Adolesc Health 2003, 33(6):436-457.
24. Brock MK, White BN: Multifragment Alleles in DNA Fingerprints of the Parrot, Amazona ventralis. Journal of Heredity 1991, 82(3):209-212.
25. Cahais V, Gayral P, Tsagkogeorga G, Melo-Ferreira J, Ballenghien M, Weinert L, Chiari Y, Belkhir K, Ranwez V, Galtier N: Reference-free transcriptome assembly in non-model animals from next-generation sequencing data. Mol Ecol Resour 2012.
Supplementary Tables
Table S1. Quality and volume of four DNA samples extracted from whole blood of two
Amazona vittata parrots selected for the genome sequencing

Sample #   Sample name   Concentration (ng/µl, PicoGreen method)   Purity (A260/A280)   Volume (µl)   Total DNA amount (µg)
1          Pa1a          180.68                                    1.79                 550           99.37
2          Pa9a          171.15                                    1.79                 550           94.13
3          Pa15a         242.57                                    1.87                 550           133.41
4          Pa16a         258.89                                    1.87                 260           67.31

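The total DNA amounts in Table S1 follow from concentration multiplied by volume. The short Python sketch below reproduces that arithmetic as a consistency check; the dictionary layout and the function name `total_dna_ug` are illustrative assumptions and are not part of the original analysis.

```python
# Values transcribed from Table S1: (concentration in ng/µl, volume in µl).
# The data structure and names are illustrative, not from the original study.
samples = {
    "Pa1a": (180.68, 550),
    "Pa9a": (171.15, 550),
    "Pa15a": (242.57, 550),
    "Pa16a": (258.89, 260),
}


def total_dna_ug(conc_ng_per_ul, volume_ul):
    """Total DNA in micrograms: ng/µl times µl gives ng; divide by 1000 for µg."""
    return conc_ng_per_ul * volume_ul / 1000.0


# Round to two decimals, the precision reported in the table.
totals = {name: round(total_dna_ug(c, v), 2) for name, (c, v) in samples.items()}

for name, amount in totals.items():
    print(f"{name}: {amount:.2f} µg")
```

The rounded results agree with the "Total DNA amount" column of the table for all four samples.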
Table S2. Results of the genome sequencing (Illumina HiSeq, Axeq Technologies). Pa9a_1 and
Pa9a_2 represent the opposite ends of the 300 bp short reads, and Pa9a-MP_1 and
Pa9a-MP_2 are the 2,500 bp mate pairs (MP). All sequences were 101 bp long.

                Bases
Sequence name   A               C               G               T               N
Pa9a_1          3,848,744,755   2,897,323,917   2,900,927,647   3,825,967,145   23,781,474
Pa9a_2          3,868,006,366   2,890,106,491   2,911,110,520   3,826,704,806   816,755
Pa9a-MP_1       2,174,635,550   1,720,812,735   1,688,621,488   2,158,688,549   246,593
Pa9a-MP_2       2,171,866,496   1,674,935,595   1,748,437,549   2,147,171,234   594,041

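The per-nucleotide counts in Table S2 can be summed to recover the sequencing volume of each library. The Python sketch below does this and also derives the read count (total bases divided by the 101 bp read length) and the GC fraction among called bases. The function name `summarize` and the dictionary layout are illustrative assumptions, and the GC fraction is a derived quantity that is not reported in the original table.

```python
READ_LENGTH = 101  # all sequences were 101 bp long (Table S2 caption)

# Base counts transcribed from Table S2.
base_counts = {
    "Pa9a_1": {"A": 3_848_744_755, "C": 2_897_323_917, "G": 2_900_927_647,
               "T": 3_825_967_145, "N": 23_781_474},
    "Pa9a_2": {"A": 3_868_006_366, "C": 2_890_106_491, "G": 2_911_110_520,
               "T": 3_826_704_806, "N": 816_755},
    "Pa9a-MP_1": {"A": 2_174_635_550, "C": 1_720_812_735, "G": 1_688_621_488,
                  "T": 2_158_688_549, "N": 246_593},
    "Pa9a-MP_2": {"A": 2_171_866_496, "C": 1_674_935_595, "G": 1_748_437_549,
                  "T": 2_147_171_234, "N": 594_041},
}


def summarize(counts):
    """Return total bases, implied read count, and GC fraction of called bases."""
    total = sum(counts.values())
    called = total - counts["N"]  # exclude ambiguous base calls
    return {
        "total_bases": total,
        "reads": total // READ_LENGTH,
        "gc_fraction": (counts["G"] + counts["C"]) / called,
    }


summary = {name: summarize(counts) for name, counts in base_counts.items()}

# Combined volume of each library (both read ends), in bases.
pe_total = summary["Pa9a_1"]["total_bases"] + summary["Pa9a_2"]["total_bases"]
mp_total = summary["Pa9a-MP_1"]["total_bases"] + summary["Pa9a-MP_2"]["total_bases"]

for name, stats in summary.items():
    print(name, stats["total_bases"], stats["reads"], round(stats["gc_fraction"], 3))
print("paired-end total:", pe_total, "mate-pair total:", mp_total)
```

The two ends of each library sum to the same number of bases, and the combined totals (about 27.0 Gb for the paired-end reads and about 15.5 Gb for the mate pairs) match the data volumes quoted in the column headings of Table S3.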
Table S3. Results of the genome assembly by SOAPdenovo[8]

Statistics category          Paired-end      PE + MP 5Gb         PE + MP All
                             (27.0Gb)        (27.0Gb + 5.2Gb)    (27.0Gb + 15.5Gb)
All contigs
  # Contigs                  12,764,879      12,887,828          12,887,828
  Total Length               1,560,663,735   1,772,155,983       1,772,155,983
  Largest Contig             15,182          18,359              18,359
  Mean Length                122.3           137.5               137.5
  N50                        673             636                 636
Of all, contigs ≥ 100bp
  # Contigs                  1,884,625       4,450,396           4,450,396
  Total Length               1,099,661,342   1,396,126,299       1,396,126,299
  Mean Length                583.5           313.7               313.7
  N50                        1,188           1,123               1,123
Scaffolds
  Largest Scaffold           N/A             2,014,591           3,309,686
  # Scaffolds                N/A             3,410,722           3,384,799
  Total Length               N/A             1,530,900,674       1,590,552,602
  Mean Length                N/A             448.8               470
  N50                        N/A             74,348              126,952

These results are not used in the study, but have been deposited at http://genomes.uprm.edu/parrot

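Table S3 reports N50 and mean length for contigs and scaffolds. As a reminder of how these statistics are defined, the Python sketch below computes them from a list of sequence lengths. It is a generic illustration under the standard definition of N50, not the code used by SOAPdenovo or by the authors; the function names and the toy lengths are hypothetical.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    cover at least half of the total assembled length.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("n50() requires at least one sequence length")


def mean_length(lengths):
    """Arithmetic mean of the sequence lengths."""
    return sum(lengths) / len(lengths)


# Toy lengths (hypothetical, not taken from the assembly).
toy_contigs = [80, 70, 50, 40, 30, 20]
toy_n50 = n50(toy_contigs)            # cumulative 80, then 150 >= 145, so N50 is 70
toy_mean = mean_length(toy_contigs)

# Consistency check against Table S3: mean length = total length / number of contigs
# (paired-end assembly, all contigs).
table_mean_pe = 1_560_663_735 / 12_764_879

print(toy_n50, round(toy_mean, 1), round(table_mean_pe, 1))
```

The same ratio of total length to count reproduces the other mean lengths in the table, which supports the column assignment used in the reconstructed layout.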
Table S4.A. Summary of the alignment of A. vittata sequences to the G. gallus genome sequence
containing only the top alignment for each scaffold, its chromosomal position and quality scores.
This is an online table available locally at: Table S4.A.xls as well as at the GigaDB site: Table
S4.A.xls

Table S4.B. Summary of the alignment of A. vittata sequences to the T. guttata genome
sequence containing only the top alignment for each scaffold, its chromosomal position and
quality scores.
This is an online table available locally at: Table S4.B.xls as well as at the GigaDB site: Table
S4.B.xls

Table S4.C. The database of the alignment information of A. vittata sequences to the G. gallus and
T. guttata genome sequences by BLAST.
This is an online table available locally at: Table S4.C.txt as well as at the GigaDB site: Table
S4.C.txt

Table S5. Proportions of sequences with some similarity that mapped to chromosomes of two
reference avian genomes (G. gallus and T. guttata)
This is an online table available locally at: Table S5.xls as well as at the GigaDB site: Table
S5.xls

Table S6.A. The summary of the database of GenBank sequences with more than 95% similarity
with the parrot scaffolds
This is an online table available locally at: Table S6.A.xls as well as at the GigaDB site: Table
S6.A.xls

Table S6.B. The database of GenBank sequences with more than 95% similarity with the parrot
scaffolds found by BLAST
This is an online table available locally at: Table S6.B.txt as well as at the GigaDB site: Table
S6.B.txt

Table S7.A. A map of G. gallus transcripts from the NCBI Entrez Gene database that mapped to one
of the A. vittata scaffolds
This is an online table available locally at: Table S7.A.xlsx as well as at the GigaDB site: Table
S7.A.xlsx

Table S7.B. The database of alignments between G. gallus transcripts from the NCBI Entrez
Gene database and A. vittata scaffolds by BLAST
This is an online table available locally at: Table S7.B.txt as well as at the GigaDB site: Table
S7.B.txt

Table S8.A. A map of known repeats identified on A. vittata scaffolds by the RepeatMasker
software
This is an online table available locally at: Table S8.A.xlsx as well as at the GigaDB site: Table
S8.A.xlsx

Table S8.B. The database of alignments between the known classes of repetitive sequences and
A. vittata scaffolds by RepeatMasker
This is an online table available locally at: Table S8.B.xlsx as well as at the GigaDB site: Table
S8.B.txt

Table S9. Distribution of different cases of repetitive elements among different classes of A.
vittata scaffolds
This is an online table available locally at: Table S9.xlsx as well as at the GigaDB site: Table
S9.xlsx

Table S10. Bioinformatics tools and outputs for scaffold and gene annotation

Elements annotated: Scaffolds
Tools: BLAT, n-BLAST, Homologene, Gene, UniProtKB, RepeatMasker
Outputs per scaffold:
- Chromosome number and end coordinates of BLAT matches
- Match scores
- # of elements conserved
- # and identity of RefSeq genes
- Gene orthology and ontology

Elements annotated: Genes
Tools: UniProtKB, Ensembl, x-BLAST, UCSC Genome Browser
Outputs per gene:
- Coordinates for start and end of each coding region plus splice sites within

Table S11. An example of annotation output produced by a student in the Genome annotation
class using the A. vittata genome
This is an online table available locally at: Table S11.xlsx as well as at the GigaDB site: Table
S11.xlsx

Supplementary Figure Legends
Figure S1. Venn diagram of the overlap between the number of A. vittata scaffolds and the G.
gallus transcripts from GenBank that were mapped to them by BLAST

Figure S2. A single example of a chimera detected on scaffold-74754 after visual inspection of
reads mapped to the 100 largest scaffolds

Figure S3. Percentage of scaffolds containing fragments with >95% similarity to GenBank
sequences

Figure S4. Comparison between categories of A. vittata scaffolds (described earlier in Figure
2). The box plots show the medians, Q1, Q3 and the extreme values. The means are shown in
Table 3. A. Distribution of scaffold lengths. B. Distribution of densities of genes mapped per
kbp of scaffold length. C. Differences in the distribution of the proportion of the scaffold
length mapped to a G. gallus transcript from the NCBI Entrez Gene database. D. Differences in
the distribution of the proportion of the scaffold length mapped to a known repeat class using
RepeatMasker software [17].

Figure S5. Distribution of major classes of repetitive sequences found on A. vittata scaffolds

Figure S6. Relationship between the quality scores of the alignments of the parrot
scaffolds to the chicken and zebra finch genomes: A. All scaffolds. B. Mismatched scaffolds
only (those scaffolds that shared similarity with sequences of the G. gallus and T. guttata genomes
but mapped to different chromosomes in the two species; see classification in Figure 2). C.
Matched sequences only (those that mapped to the same chromosome in the reference genomes of
the two avian species).

Figure S7. Relationship between the size of a scaffold and the quality of its alignment to the T.
guttata and/or G. gallus genome sequence: A. All scaffolds aligned to the T. guttata genome. B.
All scaffolds aligned to the G. gallus genome. C. Mismatched scaffolds aligned to the T. guttata
genome (those that mapped to a different chromosome in G. gallus; see classification in Figure 2).
D. Mismatched scaffolds aligned to the G. gallus genome (those that mapped to a different
chromosome in T. guttata). E. Matched sequences aligned to the T. guttata genome only (those that
mapped to the same chromosome in the reference genomes of the two avian species). F. Matched
sequences aligned to the G. gallus genome only (those that mapped to the same chromosome in the
reference genomes of the two avian species).

Figure S8. Small fragments are repeat-rich and gene-rich: A. Relationship between the length of
the scaffolds and the proportion of their length matched to the G. gallus sequences from the NCBI
Entrez Gene database. B. Relationship between the length of the scaffolds and the proportion of
their length designated by RepeatMasker as repetitive sequence.

Supplementary Figures