mec13327-sup-0001-supinfo

Supporting Information for “Evolutionary history inferred from the de novo genome
assembly of a non-model organism, the blue-eyed black lemur”
Appendix S1. Sample information and DNA extraction for whole-genome sequenced
samples
For high-coverage sequencing, blood was obtained from one female blue-eyed black lemur
(Harlow) and one female black lemur (Harmonia) during annual check-ups at the Duke Lemur
Center. These individuals were chosen because of the potential availability of eye tissue for
Harlow and the availability of other tissues for family members of Harmonia. For low-coverage
population resequencing, blood or tissue was obtained from three blue-eyed black lemurs (one
female: Lamour; and two males: Bogart and Monroe) and three black lemurs (one female:
Blanche-Niege; and two males: Epimetheus and Zephyrus) during annual checkups or during
necropsy. These individuals were chosen due to their limited relatedness to each other and to the
individuals sequenced at high coverage (one low-coverage blue-eyed black lemur was a
grandparent of the reference individual, but all other relationships were at least first cousins, once
removed). We extracted DNA from all samples using a standard phenol-chloroform method.
Appendix S2. Choice of sequencing libraries, library preparation, and sequencing for
genome assemblies
To generate a genome assembly with large contig and scaffold lengths (Salzberg et al. 2012), we
sequenced libraries for several insert sizes, including paired-end (hereafter PE) libraries with
mean target insert sizes of 180 bp, 500 bp and 1 kb, and mate-pair (MP) libraries with mean
target insert sizes of 3 kb and 8 kb. The 180 bp library was chosen because the overlap of 100 bp
reads from each direction can be used to create longer “superreads,” thus enabling the use of a
larger value for k (Gnerre et al. 2011). We sequenced the PE libraries to high fold coverage and
used the low-coverage sequencing reads from the MP libraries for scaffolding (Figure 2).
PE libraries of target insert sizes 180 bp, 500 bp, and 1 kb were prepared using the
recommended Illumina protocols at the University of Chicago and sequenced using an Illumina
HiSeq2000 at Princeton University (Princeton, NJ). PE libraries with mean target insert size of 1
kb (blue-eyed black lemur) and 500 bp (black lemur) were prepared with Illumina’s TruSeq
DNAseq Sample prep kit and MP libraries were prepared with Illumina’s Mate-pair version 2
Library prep kit at the Keck Sequencing Center of the University of Illinois (Champaign, IL); these
libraries were also sequenced at the Keck Center. PE libraries were quantitated by qPCR and
sequenced for 100 cycles from each end of the fragments on a HiSeq2000 machine using a
TruSeq SBS sequencing kit version 3. Reads were analyzed with Casava 1.8 (pipeline 1.9) and
subsequently stripped of adaptors. For low-coverage resequencing, PE libraries of target insert
size 350 bp were generated using Illumina’s TruSeq DNA PCR-Free Sample Preparation Kit (cat
no: FC-121-3001) and sequenced on a HiSeq2500 (2 lanes paired-end 100 bp in Rapid Run
Mode) at the University of Chicago Genomics Core Facility (Chicago, IL).
Appendix S3. Quality Control on Raw Reads
Reads from each lane were screened to remove completely unusable reads (all bases read as
‘N’), and quality scores along the reads were visualized using the FASTX tool kit
(http://hannonlab.cshl.edu/fastx_toolkit/). Because the ends of blue-eyed black lemur PE reads
appeared to have highly variable quality scores, these reads were trimmed using the
fastx_trimmer program to remove low quality suffixes prior to the stand-alone error correction
programs (see S4). Black lemur reads were not trimmed using FASTX because their ends had
higher quality scores and lower variance; however, any reads in which fewer than 70% of bases
had a quality score > 20 were filtered out.
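The read-level filter described above can be sketched as follows. This is a minimal illustration, not the script used in the study; the function name and the Phred+33 quality encoding are assumptions.

```python
def passes_quality_filter(quality_string, min_quality=20, min_fraction=0.70,
                          phred_offset=33):
    """Return True if at least `min_fraction` of bases have quality > `min_quality`.

    Assumes Phred+33 encoded quality strings (an assumption; the encoding of
    the original data is not stated in the text).
    """
    if not quality_string:
        return False
    # Count bases whose decoded quality score is strictly above the cutoff.
    good = sum(1 for ch in quality_string if ord(ch) - phred_offset > min_quality)
    return good / len(quality_string) >= min_fraction
```

A read is kept when the function returns True and filtered out otherwise.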
MP reads were screened for redundancy by comparing the first 32 base pairs of reads, and only
unique reads were retained. 24% of reads in the 3 kb library and 57% of reads in the 8 kb library
were identified as redundant. Excluding them, usable read coverage from MP libraries dropped to
2X (Figure 2a).
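A redundancy screen of this kind might look like the sketch below. It keys each read pair on the first 32 bases of both mates, which is an assumption: the text does not specify whether one or both mates were compared, and the function name is hypothetical.

```python
def deduplicate_pairs(pairs, prefix_length=32):
    """Keep the first read pair seen for each (mate1 prefix, mate2 prefix) key.

    `pairs` is an iterable of (read1_sequence, read2_sequence) tuples.
    Keying on both mates is an assumption made for this sketch.
    """
    seen = set()
    unique = []
    for read1, read2 in pairs:
        key = (read1[:prefix_length], read2[:prefix_length])
        if key not in seen:
            seen.add(key)
            unique.append((read1, read2))
    return unique
```

The fraction of redundant pairs is then 1 - len(unique) / len(pairs).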
Appendix S4. Choice of pre-assembly correction method for PE reads
Previous work suggests that the quality of the assembly is dependent upon both the quality of the
data and the choice of assembler (Salzberg et al. 2012). Sequencing errors that are present in
raw reads complicate the de Bruijn graph. Correcting these errors prior to assembly not only
improves accuracy and contiguity but also simplifies the graph, which significantly lowers the
memory overhead associated with the graph construction process (Schatz et al. 2010). In our
own data, when assembling with raw reads only, the unitig (a unique contiguously assembled
region ending at the boundaries of repeats) N50 was only 596 bp, and the corresponding scaffold
N50 was approximately 8 kb. In order to increase these lengths, we used an external error
correction program in addition to SOAPdenovo’s internal graph correction procedure.
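For reference, the N50 statistic quoted here is the length L such that sequences of length at least L contain half or more of the total assembled bases. A minimal sketch of that standard definition (the function name is ours) is:

```python
def n50(lengths):
    """Return the N50 of a collection of contig, unitig, or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the running sum reaches half the total.
        if running >= half_total:
            return length
    return 0  # empty input
```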
In order to assess how the choice of an error correction program affects assembly output, we
compared the effect of two stand-alone assembly error correction programs: i) Quake (Kelley et
al. 2010); and ii) a beta version of the then-in-development error correction program for
SOAPdenovo, kindly provided to us by Ruibang Luo. The SOAP error correction package
(version 0.04, compiled on March 6, 2012) is comparable to the method currently implemented in
SOAPdenovo2 (Luo et al. 2012), with the exception that the version we implemented did not
support long k-mers.
Quake distinguishes between k-mers that are “trusted” to reflect genome sequence and those
that are likely due to errors by evaluating the distribution of k-mer counts weighted by base quality
values (“q-mers”). This process attempts to retain true, low-coverage k-mers while eliminating
erroneous, high-coverage k-mers resulting from repetitive sequence (Kelley et al. 2010). To
generate q-mer counts, we used the program Jellyfish (Marçais & Kingsford 2011), a fast and
efficient k-mer or q-mer counter. The ‘jellyfish count’ program was run with the ‘-m 19 -C --quake’
options, which specify a q-mer size of 19, as recommended for the 3 Gb human genome
(Kelley et al. 2010), and count q-mers on both strands. We used a hash size (-s) large
enough to fit all q-mers in memory at once, since this minimizes the number of intermediate files
written to disk. The counts output by the ‘jellyfish qdump’ command were used as input to the
program ‘cov_model.py,’ a Quake accessory script that generates a histogram of q-mer counts.
The program also calculates a coverage cutoff using, by default, the point at which the probability
of the error distribution in a mixture model is 200 times the probability of the true distribution. To
strike a balance between leaving many error k-mers uncorrected and over-correcting true k-mers,
we chose a cutoff between that recommended by cov_model.py (1.25) and the local minimum of
the q-mer distribution (5.25) for correction (c = 3; Figure S4b). q-mers were corrected with the
‘correct’ program using ‘-k 19 -c 3’ options.
The SOAP error correction method makes use of the frequency spectra of both standard k-mers
(“consecutive k-mers”) and k-mers incorporating a gap (“space k-mers”) to detect and correct
error k-mers. It also uses overlapping PE information from short-insert libraries to avoid correcting
true k-mers. For the SOAP error correction, we first ran the ‘k-merfreq’ program to count
consecutive and space 17-mers in the trimmed reads (using k = 17, as recommended by Li et al.
(2010)). We used the first local minimum in the resulting k-mer counts (c = 14; Figure S4c)
as the coverage cutoff, following Chaisson et al. (2009). The ‘ErrorCorrection correct’ program
was then run in parallel with ‘-I -L -j -y’ options to account for overlapping reads from the 176 bp
insert library.
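The idea of taking the first local minimum of the k-mer count spectrum as a coverage cutoff can be illustrated with the sketch below. It is not the SOAP ‘k-merfreq’ implementation: the function names are hypothetical, k-mers are counted on one strand only, and gapped (“space”) k-mers are not handled.

```python
from collections import Counter


def kmer_histogram(reads, k):
    """Map each multiplicity to the number of distinct k-mers seen that often.

    Simplification: k-mers are counted on the forward strand only.
    """
    counts = Counter(read[i:i + k] for read in reads
                     for i in range(len(read) - k + 1))
    return Counter(counts.values())


def first_local_minimum(histogram):
    """Return the smallest multiplicity at which the histogram stops falling.

    `histogram` maps multiplicity -> number of distinct k-mers. Returns None
    if the histogram has no interior local minimum.
    """
    depths = sorted(histogram)
    for prev, cur, nxt in zip(depths, depths[1:], depths[2:]):
        if histogram[cur] < histogram[prev] and histogram[cur] <= histogram[nxt]:
            return cur
    return None
```

On real data the spectrum is noisy, so some smoothing before locating the minimum may be needed; that step is omitted here.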
For both error correction methods, reads that could not be corrected were discarded. The
remaining reads were loaded to the assembly as PE if both mapped in the correct orientation to
the same contig following error correction, or as single-end (SE) otherwise. Corrected read
coverages were similar for Quake and SOAP (Figure 2a), and compared to Quake, SOAP
maintained longer read lengths following correction (Table S1). Supporting Information Table S2
shows statistics for assemblies generated using SOAP-corrected reads and Quake-corrected
reads (henceforth SCA and QCA, respectively). The contig N50 was comparable for both
assemblies; however, QCA had a smaller proportion of chaff bases (bases within contigs less
than 200 bp in length, which are discarded as unmapped reads), in addition to having larger
contigs and scaffolds.
We generated binary sequence alignment/map (BAM) files using the Burrows-Wheeler Aligner
(BWA; Li & Durbin 2009) and SAMtools (Li et al. 2009) in order to assess the accuracy of read
pair placement and to re-estimate the insert sizes for each library within the assembly. We
indexed the scaffolds using ‘bwa index’, aligned all reads using the ‘bwa aln’ command with default
parameters, generated sequence alignment/map (SAM) files using the ‘bwa sampe’ (for PE reads)
and ‘bwa samse’ (for SE reads) commands, and converted the SAM files to BAM files using
‘samtools view.’ We filtered these alignments based on the mapping quality (MQ ≥ 20) and only
included PE reads for which both reads aligned to the same scaffold. Filtered alignments were
sorted using the ‘samtools sort’ command, and PCR duplicates were removed using ‘samtools
rmdup.’ For each target insert size, all BAM files corresponding to reads that were loaded as PE
during the assembly were merged together, and the ‘TLEN’ fields were used to determine the
insert size for each read pair. The merged BAM files were processed to retain only correctly
oriented (-> <-) PE read pairs whose insert size was within three standard deviations of the mean
(“happy pairs” according to Schatz et al. (2007)).
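The “happy pair” criterion can be sketched as below. The input is assumed to be a list of (insert size, inward-facing flag) tuples already extracted from the TLEN and orientation fields of the BAM records; estimating the mean and standard deviation from the inward-facing pairs only is also an assumption of this sketch.

```python
from statistics import mean, stdev


def happy_pairs(pairs, n_sd=3.0):
    """Return insert sizes of correctly oriented pairs within n_sd SDs of the mean.

    `pairs` is a list of (insert_size, is_inward) tuples, where is_inward is
    True for the expected (-> <-) orientation. At least two inward-facing
    pairs are required to estimate the standard deviation.
    """
    inward = [size for size, is_inward in pairs if is_inward]
    mu = mean(inward)
    sd = stdev(inward)
    return [size for size in inward if abs(size - mu) <= n_sd * sd]
```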
The placement of PE reads in both assemblies appears to be highly accurate, with 98% of read
pairs that map to the same scaffold identified as correctly oriented and happy (531,426,773 of
541,936,640 pairs for QCA, and 531,138,411 of 539,582,569 pairs for SCA). Only 0.024% of
QCA mapped pairs and 0.027% of SCA mapped pairs mapped to different scaffolds.
In choosing between QCA and SCA, our goal was to identify an assembly accurately
representing the genome in as few contigs and scaffolds as possible, and in which the majority of
read pairs were correctly oriented and happy. Although we did not detect any significant
differences between the assemblies in proportion of happy pairs or in similarity to the black lemur
bacterial artificial chromosome (BAC) sequences (see S9), our results indicated that QCA had
more read pairs that were effectively merged into larger sequences. SCA had more than twice as
many chaff contigs as QCA, in addition to a 71% increase in the number of unused reads, called
singletons (Table S2). Such a high number of singletons could be due to assembly errors; for
instance, when a tandem repeat has been incorrectly collapsed, the reads spanning the junction
of the repeat cannot be mapped and become singletons (Treangen & Salzberg 2012). Singletons
could also result from the presence of contaminant sequences, but such singletons should be
present in both QCA and SCA. Chaff contigs, on the other hand, indicate the retention of low
quality, repetitive sequences that are not able to be placed in scaffolds (Salzberg et al. 2012);
such chaff contigs are more abundant in SCA than in QCA. These results suggest that more
repeats may be unresolved in SCA than in QCA. We also observed a lower scaffold N50 and a
lower percentage of scaffolds with no gaps in SCA than in QCA; this may be a consequence of
using a higher depth threshold for classifying error k-mers, and thus eliminating a larger
proportion of true k-mers. Alternatively, the use of a lower coverage cutoff in the QCA error
correction may have led to the inclusion of some k-mers containing errors that would have been
corrected in SCA, and to the assembler aggressively merging two erroneous sequences together
(Salzberg & Yorke 2005). Based on the previous analyses and our preference for larger contigs
and scaffolds, we chose QCA as our final assembly, and all analyses in the main text are done on
this assembly.
We also tried the strategy of retaining the error reads that could not be corrected by Quake or
SOAPdenovo; however, this did not improve the contiguity of the assemblies. Instead, retention
of error reads considerably increased the time to completion of each step of the assembly along
with memory overhead, with peak memory exceeding 370 GB during graph construction.
Appendix S5. Choice of assembler
At the time our assembly was generated, there were several genome assemblers capable of
assembling mammalian-size datasets (Zerbino & Birney 2008; Miller et al. 2008; Simpson et al.
2009; Li et al. 2010; Gnerre et al. 2011; Simpson & Durbin 2012). In choosing among the
available options, we prioritized ease of use, ability to run quickly and uninterrupted on a large
dataset, ability to produce an assembly using primarily short insert library data, and performance
in simulated datasets (Earl et al. 2011). Based on these criteria, we chose the de Bruijn graph
(Idury & Waterman 1995; Pevzner et al. 2001) based short read assembler SOAPdenovo (Li et
al. 2010).
Appendix S6. Evaluation of insert size distributions and resolution of “bimodal” libraries
The first step of scaffolding is to align the reads to the contig assembly, so that the paired end
information can be used to join the contigs, and the size of gaps between contigs can be
estimated (Li et al. 2010). The distribution of the true insert sizes for one library should in principle
be normal, with mean μ approximately equal to the designed insert length for the library and
variance σ². However, the initial insert size estimates, inferred by comparing the migration of
libraries within a gel to that of standardized DNA fragments, are often incorrect (Phillippy et al.
2008). Furthermore, the MP library construction protocol involves circularization and shearing of
DNA fragments, thereby generating two populations of fragments: i) fragments that are truly
separated by the MP distance and ii) contaminating short-insert fragments. If the library contains
a large proportion of short-insert fragments, these reads provide little to no improvement in
scaffolding. In order to assess the accuracy of the initial insert size estimates and to estimate the
proportion of contaminating short-insert fragments within the MP libraries, we estimated PE and
MP insert sizes by mapping pairs of reads to the assembly and determining the distance between
them (see S4).
We observed bimodal distributions for the 500 bp library and one of the two 1 kb libraries (Figure
S6a and b). The assembly-based insert size distributions from a 1 kb library prepared at a
different sequencing center did not appear bimodal (Figure S6c), which suggested that the
additional peaks in the two distributions were artifacts caused by laboratory protocols, rather than
assembly errors. Since SOAPdenovo assumes a normal distribution of insert sizes for each
library, such bimodal distributions could be problematic for the assembler, leading to an incorrect
estimation of library means, and thus potentially affecting the contiguity of contigs and scaffolds.
Assuming that the bimodal distributions of insert sizes represented a mixture of two normal
distributions, as visual inspection suggested, we estimated the two means and standard
deviations using the R package ‘mixtools,’ which provides a set of functions based on
expectation-maximization (EM) algorithms for analyzing finite mixture models (Benaglia et al.
2009). The normalmixEM() function was used to calculate the posterior probabilities of each
insert size falling in each of two distributions, and the results were then used to assign read pairs
to the distribution with the higher posterior probability for their mapped insert size. This resolved
the distribution of insert sizes for the target 500 bp library into two component distributions, with
mean insert sizes of 205 bp and 480 bp, and it resolved the 1 kb library distribution into
distributions with mean insert sizes of 371 bp and 958 bp. These estimated insert sizes were then
used to load the read pairs assigned to each distribution separately. This procedure resulted in a
25% increase in the contig N50 for QCA (16.3 kb) and a 28.9% increase for SCA (17.9 kb; Table
S2), as well as a reduction in the percentage of chaff contig bases and singletons for both
assemblies.
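The mixture-model step can be illustrated in Python with a hand-written expectation-maximization loop for two normal components. This is a sketch of the general technique, not a port of mixtools' normalmixEM(); the function names, starting values, and synthetic data are assumptions chosen only to resemble the 205 bp and 480 bp components reported above.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def fit_two_normals(values, mu_init, sigma_init, n_iter=100):
    """Fit a two-component normal mixture by EM; returns (weights, mus, sigmas)."""
    weights, mus, sigmas = [0.5, 0.5], list(mu_init), list(sigma_init)
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each value.
        resp = []
        for x in values:
            dens = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas)]
            total = sum(dens)
            resp.append([d / total for d in dens] if total > 0 else [0.5, 0.5])
        # M-step: update weight, mean, and SD of each component.
        for j in range(2):
            nj = sum(r[j] for r in resp)
            if nj > 0:
                weights[j] = nj / len(values)
                mus[j] = sum(r[j] * x for r, x in zip(resp, values)) / nj
                var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, values)) / nj
                sigmas[j] = max(math.sqrt(var), 1e-6)
    return weights, mus, sigmas


def assign_components(values, weights, mus, sigmas):
    """Assign each value to the component with the higher posterior probability."""
    labels = []
    for x in values:
        dens = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas)]
        labels.append(0 if dens[0] >= dens[1] else 1)
    return labels


# Synthetic insert sizes (illustrative only, not the study data).
rng = random.Random(0)
sizes = ([rng.gauss(205, 20) for _ in range(300)]
         + [rng.gauss(480, 40) for _ in range(700)])
weights, mus, sigmas = fit_two_normals(sizes, mu_init=(150.0, 600.0),
                                       sigma_init=(50.0, 50.0))
labels = assign_components(sizes, weights, mus, sigmas)
```

Read pairs labelled 0 and 1 would then be loaded as two separate libraries, each with its own estimated mean insert size.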
Since MP libraries contain a combination of inward and outward facing reads, we used Stampy
(Lunter & Goodson 2011) for MP alignments. Stampy trains insert size distributions separately for
each orientation of read pairs and subsequently attempts the realignment of mates within each
distribution. For re-estimation of insert sizes, we first mapped a small fraction of the data and
used reads that fell within three standard deviations of the means as a training set for future
mapping in Stampy. The resulting distributions were used as initial parameters, and the
procedure was repeated until estimates approximately converged. We found approximately 7%
and 17% PE contamination in the 3 kb and 8 kb libraries, respectively. These PE reads (with
mean insert sizes of 420 bp) were separated from the MP reads and loaded along with the other
PE libraries for the assembly. From the reads determined to map as MP, we used the outward
facing reads that mapped within three standard deviations of the estimated mean insert size for
the final assembly. Figure 2b shows the increase in scaffold N50 values for QCA upon sequential
addition of different PE and MP libraries.
Appendix S7. Command line parameters for assembly generation
All de novo assemblies were done with a k-mer size of 33 because we obtained the best N50
statistics with this k-mer size. We used the SOAPdenovo63mer executable for all computations.
The first step of the de novo assembly is to load all the reads into a de Bruijn graph, which was
constructed using the following command:
./SOAPdenovo63mer pregraph -s config_file -p 12 -K 33 -o Quake_K33
Contigs were constructed by merging similar sequences and resolving tiny repeats from read
paths using the following command:
./SOAPdenovo63mer contig -g Quake_K33 -M 3 -R
Reads were then mapped back to the contigs to transfer the paired-end information from the
reads, using the following command:
./SOAPdenovo63mer map -s config_file -g Quake_K33 -p 12
Finally, scaffolds were constructed with an option to fill small gaps using the following command:
./SOAPdenovo63mer scaff -g Quake_K33 -F -p 12
At this point, we obtained scaffolds with masked repeats appearing as gaps in the assembled
sequence. In order to disambiguate the repeats and fill in the gaps, we ran the ‘GapClosure’
module on the scaffolds. This program aligns the PE reads to the scaffolds and gathers the pairs
in which one end has aligned and the other end falls in a gap region. A local assembly of the gap
region is then done. The following command was used for gap closure:
./GapClosure -a Quake_K33.scafSeq -b config_file -o Quake_K33.gapClosed.scafSeq -p 31 -t 12
This procedure closed 79% of gaps.
For the black lemur assembly, we used the recommended pipeline within BWA version 0.5.9 (Li &
Durbin 2009). We first aligned reads separately for each end of the PE library and for the SE
reads whose pair failed quality filters using the following command:
./bwa aln -n 0.04 -o 1 -q 15
We then converted the alignments of the PE reads and the SE reads to separate SAM files using
the following command:
./bwa sampe -a 1000 -n 1
and
./bwa samse -n 1
We filtered out reads with mapping quality less than 10 and converted to a BAM file in SAMtools
version 0.1.17 using the following command:
./samtools view -S -h -q 10 -b
To build a consensus genome sequence for the black lemur, we used the Genome Analysis
Toolkit (GATK; McKenna et al. 2010; DePristo et al. 2011) to call polymorphic sites from the
resulting BAM file and to produce an all-sites Variant Call Format (VCF) file including genotypes
at all sites (see S11). We then used a custom python script to retain calls at all sites with at least
three reads confidently mapped; at sites identified as polymorphic, the more common allele
among the reads was selected as the consensus. Note that the minimum coverage for reporting a
base pair within the consensus sequence is lower than that used for calling high confidence sites
for the pairwise sequential Markovian coalescent (PSMC; Li & Durbin 2011) and other population
genetic analyses.
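The per-site consensus rule described above (at least three confidently mapped reads, with the more common allele chosen at polymorphic sites) can be sketched as follows. This is a simplified stand-in for the custom script, which operated on the all-sites VCF; the function name, the use of ‘N’ for masked sites, and the tie-breaking behaviour are assumptions.

```python
from collections import Counter


def consensus_base(bases, min_depth=3, missing="N"):
    """Return the most common base among confidently mapped reads at one site.

    `bases` is a string or list of the bases observed at the site. Sites with
    fewer than `min_depth` reads are masked with `missing` (an assumption).
    Ties are resolved in favour of the allele observed first.
    """
    if len(bases) < min_depth:
        return missing
    return Counter(bases).most_common(1)[0][0]
```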
Custom scripts or modifications of available scripts used for processing these data are available
at https://github.com/sorrywm/genome_analysis.
Appendix S8. Details of memory usage and run times
The peak memory consumption and run times for different stages of the assembly pipeline are
shown in Figure S2. Note that this is for the error-corrected reads, whereas assembling the raw
reads required a peak memory of 379 GB during the graph construction stage. This difference
between memory usage for raw and corrected reads illustrates how, although in theory the size of
the de Bruijn graph should depend primarily on the size of the reference genome, in practice
sequencing errors create their own nodes in the graph and thus require substantial memory
resources.
Peak memory consumption was during the final step of the assembly, the Gap (gap closure)
stage, whereas the maximum run time was for error correction. We performed all memory-intensive
computations on two computers: one with 16 quad-core Intel Xeon 2.4 GHz CPUs with
288 GB memory installed and the other with 48, 12-core AMD Opteron Processors with 512 GB
memory installed.
Using these machines, both QCA and SCA required approximately 5 days from the start of k-mer
counting to the completion of the gap closure step.
Similar figures were observed for each assembly step in SCA, and the error correction for SCA
required 21 GB of RAM.
Appendix S9. Aligning blue-eyed black lemur contigs to black lemur bacterial artificial
chromosomes (BACs)
We aligned the contigs of the blue-eyed black lemur genome assembly to the eight genomic
regions (4.86 Mb total sequence) of the black lemur that have been sequenced using a BAC-by-BAC
shotgun sequencing strategy (NISC Comparative Sequencing Initiative; NCBI trace archive
library ID CHORI-273; GenBank accession numbers DP000024.1, DP0011503.3, DP000556.1,
DP000490.3, DP000520.1, DP000468.3, DP000905.1, and DP001278.1). Contig alignments
were created as described in Salzberg et al. (2012). Briefly, we used the ‘dnadiff’ wrapper script,
which is a part of the MUMmer package (Kurtz et al. 2004), to create the alignments. We used
the BAC sequence as the reference and blue-eyed black lemur contigs (excluding chaff contigs)
as the query, and we aligned these using the ‘nucmer’ program. Use of the shorter sequence as
the reference minimized the memory requirement for alignment. Using the supermap algorithm,
optimal alignments were created, while retaining large scale re-arrangements (Dubchak et al.
2009). Alignments with <95% identity or >95% overlap with another alignment were discarded
using the ‘delta-filter’ program. The resulting high-quality one-to-one alignment mappings were
used to calculate divergence by parsing the one-to-one alignment blocks, considering
substitutions where ≥20 bp matched exactly on either side of the alignment.
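The flank requirement for counting substitutions can be sketched as below, reading “on either side” as requiring exact matches on both sides. The input is assumed to be a pair of equal-length aligned strings with ‘-’ for gaps; the function name is hypothetical, and the parsing of the MUMmer output is not shown.

```python
def count_flanked_substitutions(ref, qry, flank=20):
    """Count mismatches with `flank` exactly matching, gap-free bases on both sides.

    `ref` and `qry` are aligned sequences of equal length, with '-' for gaps.
    """
    if len(ref) != len(qry):
        raise ValueError("aligned sequences must have equal length")
    count = 0
    # Only positions with a full flank available on each side are considered.
    for i in range(flank, len(ref) - flank):
        if ref[i] == qry[i] or "-" in (ref[i], qry[i]):
            continue
        left_ref, left_qry = ref[i - flank:i], qry[i - flank:i]
        right_ref, right_qry = ref[i + 1:i + 1 + flank], qry[i + 1:i + 1 + flank]
        if (left_ref == left_qry and right_ref == right_qry
                and "-" not in left_ref + right_ref):
            count += 1
    return count
```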
Appendix S10. Core Eukaryotic Gene Mapping Approach (CEGMA) analysis
We ran the CEGMA pipeline (Parra et al. 2007, 2009) against our draft scaffolds to estimate the
proportion of conserved genes represented in our assembly. The set of 248 highly conserved
core eukaryotic genes (CEGs) have been divided into four groups based on the degree of
conservation. Group 1 contains the least conserved proteins, while Group 4 contains the most
conserved proteins. ‘Complete’ refers to a predicted protein that yields an alignment to the profile
hidden Markov model (HMM) for that protein that is at least 70% of the profile HMM length and
exceeds a pre-computed minimum alignment score, obtained by calculating the maximum
alignment score for all non-CEG genes. Matches that exceed this threshold are generally full-length
proteins (Parra et al. 2009). ‘Partial’ refers to proteins with alignments that are shorter but
still exceed the minimum alignment score. These partial matches may correspond to shorter
fragments of proteins, such as domains. While we are able to map 94.7% of the proteins as
partial matches, we only map 59.2% of the proteins as complete. One explanation could be that
even though the N50 of our scaffolds is high, 421 kb, contigs within scaffolds are separated by
gaps. If the average gene length for CEGs in Eulemur flavifrons is 24.6 kb (the mean length for
the most highly conserved CEGs in humans), some genes may extend into gaps within the
assembly because only 12.5% of contigs exceed 70% of this length. In addition, some genes may
overlap repetitive regions of the genome that have been excluded as singleton contigs.
Appendix S11. SNP calling
We called SNPs from the blue-eyed black lemur BAM files generated to assess PE correctness
and insert size distribution (see S4), renaming the merged BAM files to assign all data to the
same sample using SAMtools ‘reheader,’ and the black lemur BAM file used for the reference-based
assembly (see Methods and S7). To call SNPs, we used the Genome Analysis Toolkit
(GATK; McKenna et al. 2010; DePristo et al. 2011) version 2.2-16. PCR duplicates had already
been removed from each library of the blue-eyed black lemur data separately (see S4). For the
black lemur, we marked duplicates using the MarkDuplicates tool within Picard
(http://picard.sourceforge.net/) version 1.81, with the following arguments:
ASSUME_SORTED=TRUE VALIDATION_STRINGENCY=LENIENT
MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000
Because this duplicate marking process disrupts the read group information in the BAM file, we
then replaced the read groups using Picard’s AddOrReplaceReadGroups, with the following
arguments:
SORT_ORDER=coordinate RGLB=Em500PE RGPL=illumina RGPU=unknown
RGSM=EmHarmonia CREATE_INDEX=True VALIDATION_STRINGENCY=LENIENT
We realigned reads locally around indels using GATK’s RealignerTargetCreator and
IndelRealigner. Finally, we called SNPs using UnifiedGenotyper, with the following arguments:
-out_mode EMIT_ALL_SITES -mbq 20 -rbs 50000000
The ‘EMIT_ALL_SITES’ option was used to make calls for both monomorphic and polymorphic
sites, and only those bases that had a base quality of 20 or greater were initially called. The raw
SNP calls contain many false positives due to sequencing artifacts, mismapped reads, and
indels, as well as potential false negatives. We therefore performed filtering to eliminate potential
false positive SNPs and to mask monomorphic sites at which a SNP could not have been
confidently called using several recommended criteria (Abecasis et al. 2010; Auton et al. 2012),
excluding:
i) sites with base quality < 30
ii) sites with read depth < half the mean read depth, or > twice the mean depth
iii) sites with MQ (root mean square of the mapping quality of all reads spanning a site) < 20
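The three exclusion criteria can be expressed as a single predicate, sketched below. The function and argument names are ours; in practice the values would be read from the QUAL, DP, and MQ fields of each VCF record, which is not shown.

```python
def passes_site_filters(qual, depth, mapping_quality, mean_depth,
                        min_qual=30, min_mq=20):
    """Return True if a site survives the quality, depth, and MQ filters."""
    if qual < min_qual:
        return False
    # Depth must lie between half and twice the mean read depth.
    if depth < 0.5 * mean_depth or depth > 2 * mean_depth:
        return False
    if mapping_quality < min_mq:
        return False
    return True
```

Passing min_mq=30 reproduces the more stringent MQ cutoff used when estimating diversity and divergence.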
When calculating diversity within and divergence between species, we directly used the QUAL
value (the Phred-scaled probability estimated by GATK of a type I or type II error at a site)
provided in the VCF for the base quality cutoff. For the conversion of the VCF to the fastq files
used for PSMC, we further weighted the QUAL value by the GQ (the Phred-scaled probability of
an error in the called genotype) in cases in which the called genotype was not homozygous
reference. We additionally modified the ‘vcf2fq’ function in the vcfutils.pl script provided by
SAMtools in order to enable genotype calls at sites where both alleles were non-reference and to
mask sites with insufficient data, rather than calling them homozygous reference (see
‘vcf2fqnonref’ in https://github.com/sorrywm/genome_analysis/vcfutils_mod.pl). When estimating
diversity and divergence, we used a more stringent MQ filter, excluding sites with MQ < 30.
Appendix S12. Sanger-based resequencing of additional samples within the scaffold
containing the OCA2 ortholog
We generated Sanger sequence data for subsections of the scaffold containing the OCA2
ortholog in a larger sample of individuals. To this end, we used DNA samples for five blue-eyed
black lemurs and five black lemurs from a previous study of ours (Meyer et al. 2013). DNA
samples from three additional blue-eyed black lemurs and one additional black lemur were
obtained from liver samples provided by the Duke Lemur Center, Durham, NC, using the DNeasy
Blood and Tissue Kit (Qiagen).
We designed primers to amplify regions with a target size of 1.2 kb and a Tm of 60ºC, using a
custom script employing Primer3
(https://github.com/sorrywm/genome_analysis/DesignOCA2Primers.py). We filtered the resulting
primers to obtain one primer pair per 30 kb of sequence, with the properties that the amplicons
captured the maximum number of “fixed differences” between reference genomes and that there
were no polymorphisms in the primer sequence based on the four chromosomes already
sequenced. Primers yielding single amplicons for all samples, along with additional internal
sequencing primers, are listed in Table S3.
We amplified 20 ng of DNA in a total volume of 20 µL with 1X PCR buffer, 1.5 mM MgCl2, 500
µM, 100 nM each primer, and 0.5 units Platinum® High Fidelity Taq (Invitrogen); to improve
specificity, 1 µL each Betaine and DMSO were added to some reactions. The PCR conditions
were as follows: 94ºC for 2 min, 30 cycles of 94ºC for 30 s, 55ºC for 30 s, and 68ºC for 75 s, 68ºC
for 10 min. PCR products were purified using Exo-SAP. For sequencing, 1.2 μL of purified PCR
product and 1.2 μL of 4 μM primer were added to 2 μL BigDye® v3.1 (Applied Biosystems) and 2
μL water, and this reaction mix was cycled 50 times for 5 sec at 96ºC, 10 sec at 50ºC, and 3 min
at 60ºC. Water was added to approximately 25 μL, and reactions were purified by Sephadex® G-50
Superfine (GE Healthcare Bio-Sciences AB) in a 96-well filter plate (Pall). These were directly
placed on the Applied Biosystems 3730XL or 3130 capillary sequencer and run as a standard 2
hour run using POP-7™ polymer (Applied Biosystems).
Resulting sequences were determined from trace files using a python script utilizing the abifpy
module (https://github.com/bow/abifpy/), and any bases with quality score < 20 were checked by
eye from the chromatogram. In total, we obtained sequence data for regions totaling 9.9 kb in
length from the blue-eyed black lemur and 15.1 kb from the black lemur. Within these regions, we
observed 88.6% concordance of Sanger calls with GATK-derived polymorphisms (out of 35
heterozygous sites called by GATK, there were four discordant sites, all of which occur within the
same amplicon) and only three sites where GATK failed to identify a SNP present in Sanger
sequencing. We visually inspected the chromatogram for the amplicon containing the four sites
that had been called heterozygous in the whole genome sequencing but homozygous in Sanger
sequencing, and in three of these cases, the Sanger call appeared to be in error. Assuming no
additional errors, these results suggest that the probability of incorrectly calling a homozygous
site heterozygous is approximately 0.00012, and the probability of incorrectly calling a
heterozygous site homozygous is approximately 0.029. We therefore concluded that the initial
GATK calls were of high quality, and we used these data without any further recalibration in
subsequent analyses.
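The two checks described above (flagging bases with quality score < 20 for visual inspection, and computing concordance between GATK and Sanger calls) can be sketched as follows. This is a minimal illustration only: the function names are hypothetical, the quality values are invented, and the sketch does not use the abifpy module or reproduce the original script.

```python
def low_quality_positions(quals, threshold=20):
    """Return 0-based positions whose Phred quality is below the threshold."""
    return [i for i, q in enumerate(quals) if q < threshold]


def concordance(n_called, n_discordant):
    """Fraction of called heterozygous sites confirmed by the second method."""
    return (n_called - n_discordant) / n_called


if __name__ == "__main__":
    # Invented quality scores for a five-base read.
    print(low_quality_positions([35, 12, 40, 19, 20]))
    # 35 heterozygous GATK calls with four discordant sites, as reported above.
    print(round(100 * concordance(35, 4), 1))
```

With the counts reported in the text (35 GATK heterozygous calls, four discordant), the second function reproduces the stated 88.6% concordance.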
In order to determine haplotypes and estimate FST, sequences obtained for each region were
phased using SHAPEIT2 (Delaneau et al. 2013). All polymorphic sites identified from sequencing
were combined into one map file for phasing, and a constant recombination rate of 1 cM/Mb was
assumed for determining genetic distance between markers. Effective population size was
estimated from pairwise differences as follows: $N_e = \left((\pi_{Ef} + 2\pi_{Ef\_Em} + \pi_{Em})/4\right)/(4\mu)$, where $\pi_{Ef}$
and $\pi_{Em}$ represent genome-wide within-species diversity for blue-eyed black lemur and black
lemur, respectively, and $\pi_{Ef\_Em}$ represents genome-wide divergence. SHAPEIT2 was run using
the following parameters:
Shapeit --input-bed AllAmpliconsMerged_bfile.bed AllAmpliconsMerged_bfile.bim
AllAmpliconsMerged_bfile.fam -M AllAmpliconsMerged.gmap --exclude-ind
UnsequencedLemurs.ind -W 0.5 --effective-size 35000
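The two quantities supplied to SHAPEIT2 above (an effective population size derived from pairwise differences and genetic positions under a constant 1 cM/Mb rate) can be computed as in the sketch below. The function names and the diversity values in the example are hypothetical; they are chosen only to illustrate the arithmetic and are not the estimates used in the analysis.

```python
def effective_size(pi_ef, pi_em, pi_between, mu):
    """Ne = ((pi_Ef + 2*pi_Ef_Em + pi_Em) / 4) / (4 * mu)."""
    mean_pi = (pi_ef + 2 * pi_between + pi_em) / 4
    return mean_pi / (4 * mu)


def genetic_position_cm(pos_bp, rate_cm_per_mb=1.0):
    """Convert a physical position (bp) to cM under a constant recombination rate."""
    return pos_bp / 1e6 * rate_cm_per_mb


if __name__ == "__main__":
    # Illustrative (invented) diversity and divergence values.
    print(effective_size(pi_ef=0.002, pi_em=0.002, pi_between=0.004, mu=2.0e-8))
    print(genetic_position_cm(2_500_000))
```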
Appendix S13. Simulations of pairwise sequential Markovian coalescent (PSMC)
performance on scaffold data and with a population split
The PSMC method (Li & Durbin 2011) was developed for use on data from linear genomes and
initially tested on whole human genome sequences. In contrast, our assembly contains much
more missing data, and the sequences are present in several thousand scaffolds rather than a
small number of chromosomes. In order to determine how these features could influence the
estimates of PSMC, we simulated data with similar properties to those of our draft genome
assembly. Specifically, for a given demographic history (similar to the demographic history of the
blue-eyed black lemur, estimated by PSMC), we simulated a sequence of length $2.5 \times 10^{9}$ bp (the
approximate length of the blue-eyed black lemur genome), using a modification of msHOT
(Hudson 2002; Hellenthal & Stephens 2007) developed by Heng Li and available at
https://github.com/lh3/foreign/tree/master/msHOT-lite. Using a python script, we further simulated
100 assembly-like sequences by selecting random segments from the whole data, with lengths
drawn at random from the blue-eyed black lemur scaffold lengths and adding up to $2.0 \times 10^{9}$ bp
(the approximate length of the assembly used for PSMC). In addition, we replaced blocks of
these segments with missing values (N’s), where the lengths and distribution of these blocks
were determined using the distribution of missing data from the blue-eyed black lemur assembly.
The resulting plot of trajectories obtained by applying PSMC on simulated assemblies suggests
that using scaffolds rather than the full data increases the variance of the inferred demographic
trajectory, as might be expected, but does not lead to any marked biases (Figure S3b).
Importantly, this increase in variance is negligible in the parameter range in which PSMC is
expected to produce reliable estimates (i.e., excluding very recent or ancient times).
The ms input for the simulated demographic history (Figure S3b) was:
msHOT-lite 2 100 -t 50440 -r 6710 25000000 -l -eN 0.011 0.380 -eN 0.042 0.987 -eN 0.111 0.638
-eN 0.301 1.171 -eN 0.694 0.925 -eN 1.910 2.725 -eN 4.158 2.323
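The construction of assembly-like sequences described above can be sketched as follows. This is not the original script: the function names are hypothetical, segments are allowed to overlap (an assumption, since the text does not say otherwise), and missing-data blocks are placed uniformly at random rather than following the empirical distribution from the assembly.

```python
import random


def sample_scaffolds(genome_len, scaffold_lengths, target_total, rng):
    """Pick random (start, end) segments until their lengths sum to target_total."""
    segments, total = [], 0
    while total < target_total:
        # Draw a length from the empirical scaffold lengths; truncate the last one.
        length = min(rng.choice(scaffold_lengths), target_total - total, genome_len)
        start = rng.randrange(0, genome_len - length + 1)
        segments.append((start, start + length))
        total += length
    return segments


def mask_blocks(seq, gap_lengths, n_gaps, rng):
    """Replace n_gaps randomly placed blocks of seq with 'N' (missing data)."""
    chars = list(seq)
    for _ in range(n_gaps):
        gap = min(rng.choice(gap_lengths), len(chars))
        start = rng.randrange(0, len(chars) - gap + 1)
        chars[start:start + gap] = "N" * gap
    return "".join(chars)


if __name__ == "__main__":
    rng = random.Random(1)
    segs = sample_scaffolds(1000, [50, 120, 300], 800, rng)
    print(len(segs), sum(e - s for s, e in segs))
    print(mask_blocks("A" * 60, [5, 10], 2, rng))
```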
In addition, we observed that the time at which the PSMC-inferred trajectories for the blue-eyed
black lemur and black lemur began to overlap was more ancient than the split time inferred from
divergence and shared polymorphisms (see S14). To investigate whether this discrepancy might
be due in part to PSMC smoothing of abrupt changes in the effective population size, we ran the
method on simulated data generated with a sudden population split. We simulated demographies
imitating the demographic histories of the two species initially inferred by PSMC, with a split time
based on our estimates using diversity levels (Figure S3c). We further generated bootstrap
resampling data (100 bootstraps of the total genome length using 500 kb intervals) using PSMC
and obtained a distribution for the estimated time at which the two trajectories split based on the
PSMC output (Figure S3c). As we suspected, the estimated split time tended to be greater than
the simulated split time, consistent with the discrepancy between the PSMC output and the split
time estimated from diversity and divergence data (Figure S3d; see also supporting information
for Prado-Martinez et al. (2013) and Freedman et al. (2014)).
The ms command for this simulation was:
msHOT-lite 4 100 -t 14647.1163653362 -r 2084.10374328881 5000000 -I 2 2 2 0 -en 0.0334 1 1 -en 0.06 1 3 -en 0.38 1 5 -ej 1.095 2 1 -en 1.095 1 3.5 -en 1.6335 1 3.5 -en 5 1 10 -en 10 1
5.8929 -en 0.0241 2 1.0000 -en 0.0958 2 3 -en 0.33 2 2 -en 1.095 2 3.5
Appendix S14. Estimating species split time
The probability P(D) that the allele at a site differs between one chromosome from each of the
species is given by the average divergence between lineages and the mutation rate. Under a
simple isolation model and Wright-Fisher assumptions, the average time to the most recent
common ancestor (TMRCA) of two lineages, one sampled in each of two species, is the time to
the species split (t) plus the average time to coalescence in the ancestral population (2Na). Thus,
P(D) = 2μ(t + 2Na), where μ is the mutation rate.
Under the same assumptions, if the effective population size of both descendant species is Ne,
then the probability P(S) of a polymorphism with both alleles shared identical by descent (IBD) in
a sample of two diploid individuals (one per species) is:
$P(S) = \left(e^{-t/2N_e}\right)^2 \times \frac{2}{3} \times (2N_a) \times \mu$
This equation represents the product of the mutation rate and three terms:
1) The probability of no coalescences occurring in either species prior to time t, which is $\left(e^{-t/2N_e}\right)^2$.
2) The probability of the four lineages in the ancestral population coalescing in an order that
could lead to a shared polymorphism. Namely, the first coalescence must occur between
lineages from different species, an event that has probability 2/3.
3) The total length of lineages on which a mutation leading to a shared polymorphism could
occur. Once the first coalescent event has occurred (between lineages from different
species), there are three possible orders for the second coalescent event. In one of these,
the second coalescence does not involve the lineage resulting from the first coalescent event,
and the expected branch length on which a mutation leading to a shared polymorphism could
occur is two times the expected TMRCA for three lineages minus the expected time to the
first coalescence in a sample of three, or 2 x (8Na/3) - 2Na/3 = 14Na/3. The other possibility is
a coalescence between the branch resulting from the first coalescence and one of the other
two lineages; in this case, the expected branch length on which a mutation leading to a
shared polymorphism could occur is the time to the first coalescence in a sample of three, or
2Na/3. Thus, summing over all three cases, the expected branch length is 1/3 x (14Na/3) + 2/3
x (2Na/3) = 2Na.
The product of these terms is the probability of a polymorphism shared IBD.
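The branch-length arithmetic in term 3 can be checked with exact fractions, as in the sketch below. Times are expressed in units of Na generations, assuming a pairwise coalescence rate of 1/(2Na); the variable and function names are ours and purely illustrative.

```python
from fractions import Fraction as F
import math

# Expected times in units of Na generations (pairwise coalescence rate 1/(2Na)).
t_first_of_three = F(2, 3)          # first coalescence among three lineages: 2Na/3
tmrca_three = t_first_of_three + 2  # plus 2Na for the last pair: 8Na/3

# Case A (probability 1/3): second coalescence excludes the merged lineage.
len_case_a = 2 * tmrca_three - t_first_of_three  # 14Na/3
# Case B (probability 2/3): merged lineage coalesces with one of the other two.
len_case_b = t_first_of_three                    # 2Na/3

expected_len = F(1, 3) * len_case_a + F(2, 3) * len_case_b  # in units of Na


def p_shared(t, ne, na, mu):
    """P(S) = (exp(-t / 2Ne))^2 * 2/3 * (expected length, 2Na) * mu."""
    return math.exp(-t / (2 * ne)) ** 2 * (2 / 3) * float(expected_len) * na * mu


if __name__ == "__main__":
    print(len_case_a, len_case_b, expected_len)
```

The weighted sum evaluates to exactly 2 (that is, 2Na), matching the value used in the expression for P(S).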
Alternatively, we can write these equations in terms of τ, the split time scaled by 2Ne, and Na/Ne,
the ratio of ancestral effective population size (Na) to current Ne, so the probability of a fixed
difference between the two genomes is:
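$P(D) = 2 \times (2N_e\tau + 2(N_a/N_e) \times N_e) \times \mu$, and
$P(S) = \left(e^{-\tau}\right)^2 \times \frac{2}{3} \times (2(N_a/N_e) \times N_e) \times \mu$.
These relationships can be written in terms of mean pairwise diversity (π = 4Neμ) as:
$P(D) = \pi \times (\tau + N_a/N_e)$
$P(S) = \left(e^{-\tau}\right)^2 \times \frac{1}{3} \times (N_a/N_e) \times \pi$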
We used these equations to estimate the split time and Na by a moment estimator. Specifically,
we took the harmonic mean of the pairwise differences within each species to estimate π, and we
solved for τ and Na/Ne by substituting the observed genome-wide divergence between species for
P(D), and the observed fraction of shared polymorphic sites for P(S). We solved the system of
equations using the uniroot function in R version 2.15.2 (Brent 1973).
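As an illustration of the moment estimator, the two equations can be reduced to a single unknown by substituting Na/Ne = P(D)/π − τ into the expression for P(S) and then solving for τ with a one-dimensional root finder. The sketch below uses plain bisection in Python as a stand-in for R's uniroot; the reduction to one equation, the function name, and the input values are our assumptions, not the original code.

```python
import math


def solve_split(pi, p_d, p_s, tol=1e-12):
    """Return (tau, Na/Ne) with P(D) = pi*(tau + r) and P(S) = exp(-2*tau)*r*pi/3."""
    def f(tau):
        r = p_d / pi - tau  # Na/Ne implied by the divergence equation
        return math.exp(-2 * tau) * r * pi / 3 - p_s

    lo, hi = 0.0, p_d / pi  # tau must keep Na/Ne non-negative
    if f(lo) < 0:
        raise ValueError("no solution: P(S) exceeds P(D)/3")
    # f decreases monotonically on [lo, hi], so bisection finds the unique root.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    tau = (lo + hi) / 2
    return tau, p_d / pi - tau


if __name__ == "__main__":
    # Invented inputs generated from tau = 0.5 and Na/Ne = 1.2.
    pi = 0.002
    p_d = pi * (0.5 + 1.2)
    p_s = math.exp(-1.0) * 1.2 * pi / 3
    print(solve_split(pi, p_d, p_s))
```

Feeding the function values of P(D) and P(S) generated from known τ and Na/Ne recovers those parameters, which is a simple sanity check on the algebra.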
Appendix S15. Choice of parameters for PSMC and scaling of PSMC output and species
split time
Within the PSMC program, heterozygosity is summarized across a “window,” with a default size
of 100 bp. Based on the estimated SNP density within the two lemur genomes, we chose a
window size of 75 bp, in order to reduce the number of windows containing 2 or more SNPs to
approximately 1%, and thus strike a balance between information loss and computational
complexity. However, we additionally ran PSMC on the blue-eyed black lemur data with window
sizes ranging from 20 to 500 bp, in order to assess what effect this choice might have on our
conclusions. With the exception of 500 bp, the trajectory of ancestral population size appears
highly similar across window sizes (Figure S3a).
In order to translate the output of PSMC and the estimate of τ (see S14), both of which are given
in terms of 2Ne (two times the current effective population size) generations, to a more
interpretable temporal scale, we needed an estimate of the mutation rate. The genome-wide
per-generation mutation rate has not been directly estimated for lemurs; however, pedigree-based
estimates are available for human data ($1.2 \times 10^{-8}$ per bp per generation; Kong et al. 2012), and
estimates based on experimental mutation accumulation are available for mouse (Mus musculus:
$3.8 \times 10^{-8}$; Lynch 2010). Within primates, mutation rate has been shown to be inversely correlated
with generation time, which may explain the “hominoid slowdown” in molecular divergence rates
(Kim et al. 2006; Tsantes & Steiper 2009). This correlation may be due to DNA replication
error-generated mutations, which accumulate over cell divisions; in contrast, mutations of CpG to T,
which do not result from replication error, tend to behave in a clocklike manner across mammals
(Hwang & Green 2004). This clocklike behavior suggests that the proportion of all mutations on a
given branch that are CpG to T should negatively correlate with the per-generation mutation rate.
Because the generation time of the blue-eyed black lemur is shorter than that of human (see
below) and the proportion of CpG to T mutations is lower in the lemur than in the human branch
(Hwang & Green 2004), we used a mutation rate estimate higher than that reported for humans
(but lower than that for mouse), or $2.0 \times 10^{-8}$ per bp per generation. The generation times of the
blue-eyed black lemur and black lemur are also unknown; however, based on the reproductive
ages of individuals in captivity at the Duke Lemur Center (1.5 - 25 years) and on the mean age of
first reproduction of blue-eyed black lemurs in the wild (3 years; Volampeno et al. 2011), and
assuming some reproductive senescence, we assumed the generation time for both species to
be 5 years. The effects of using different point estimates within biologically plausible ranges
($0.5$ – $5.0 \times 10^{-8}$ for mutation rate per generation and 2.5 – 10 years for generation time) are shown in
Figure S3e-f.
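The conversion of the scaled split time τ into years can be sketched as below, using π = 4Neμ to obtain Ne and then multiplying τ by 2Ne generations and by the generation time. The function name and the example values of τ and π are hypothetical; the default mutation rate and generation time are the point estimates stated above. This sketch does not reproduce PSMC's own output scaling.

```python
def scale_to_years(tau, pi, mu=2.0e-8, gen_time=5.0):
    """Return (Ne, split time in years) for a split time tau in units of 2Ne generations."""
    ne = pi / (4 * mu)          # from pi = 4 * Ne * mu
    generations = tau * 2 * ne  # tau is scaled by 2Ne
    return ne, generations * gen_time


if __name__ == "__main__":
    # Invented tau and pi, evaluated across the plausible ranges given in the text.
    for mu in (0.5e-8, 2.0e-8, 5.0e-8):
        for g in (2.5, 5.0, 10.0):
            print(mu, g, scale_to_years(0.5, 0.002, mu=mu, gen_time=g))
```

Because Ne is inversely proportional to μ, a lower assumed mutation rate pushes both Ne and the split time upward, while the generation time rescales only the time axis.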
Because the black lemur was sequenced to lower coverage than the blue-eyed black lemur, we
assumed that the ability to detect polymorphisms in this dataset might be reduced, leading to an
under-estimate of overall diversity. To estimate the influence of coverage on SNP-calling ability,
we called SNPs using a subset of the blue-eyed black lemur data containing approximately the
same number of mapped reads as for the black lemur dataset and compared the results to those
obtained using the full dataset. The number of “callable” bases for the thinned blue-eyed black
lemur data turned out to be lower than that for the black lemur data, presumably due to
differences in library preparation and sequencing between centers. We therefore estimated the
effect of the reduced coverage by assuming a linear relationship between the number of callable
bases and the proportion of true SNPs called as SNPs. We incorporated this effect into the
PSMC output by reducing the mutation rate for the black lemur to $1.951 \times 10^{-8}$ per generation per
bp. The plots for the full blue-eyed black lemur dataset and the subset with adjusted mutation rate
show strong concordance, indicating that this correction factor appears to adequately capture the
effect of lower coverage (Figure S3g).
We note that, because the two individuals we sequenced were female, some of the information
may be derived from the X chromosome, which has Ne ¾ that of an autosome. This may lead to
an underestimate of Ne by PSMC. Because karyotypic and chromosome painting work suggests
that the X chromosomes of humans and black lemurs are homologous (Müller et al. 1997), the X
should represent approximately 5.8% of the black lemur genome length, and thus the effect on Ne
inference should be small.
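A rough sense of the size of this effect can be obtained from a length-weighted average of relative Ne across autosomes and the X chromosome. This is only a crude approximation under our own assumption that the bias scales with the fraction of sequence on the X; it is not how PSMC combines information across the genome, and the function name is hypothetical.

```python
def mean_relative_ne(x_fraction=0.058, x_ratio=0.75):
    """Length-weighted mean of relative Ne (autosomes = 1, X = x_ratio)."""
    return (1 - x_fraction) * 1.0 + x_fraction * x_ratio


if __name__ == "__main__":
    rel = mean_relative_ne()
    print(rel, 1 - rel)
```

Under this approximation the genome-wide value is about 98.6% of the autosomal Ne, i.e. an underestimate on the order of 1.5%.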
Appendix S16. Identification of candidate regions for recent positive selection in one
species
Recent positive selection may be detectable as a reduction in diversity relative to divergence in
one of the species. In the absence of a genetic map for these species, we initially assessed
diversity and divergence on a physical scale of 100 kb, which is roughly the span over which a
recent, strong selective sweep would impact genetic diversity at linked sites (assuming a
selection coefficient of 1% and an average recombination rate similar to that of humans; Kaplan
et al. 1989; Kong et al. 2004). In order to identify regions that may have been subject to weaker
selection, as well as to assess the genome-wide empirical significance of individual regions, we
further considered non-overlapping windows of 20 kb. We calculated FST (1 – πw/πb), but this
statistic is extremely noisy when using a sample size of two chromosomes from each population.
We also considered the following summary statistic:
Ps1 = (hs1 + ss) / (hs1 + hs2 + ss + fd),
where hs1 represents the number of sites heterozygous only within the focal species, hs2 the
number of sites heterozygous only within the other species, ss the number of shared
heterozygous sites, and fd the number of sites at which the two species are homozygous for
different alleles.
We expect recent selection to result in a decrease in hs1 and ss, and an increase in fd, but not
affect hs2, so we are particularly interested in regions for which this statistic is unusually small. In
order to delineate the boundaries of strongly selected regions for the annotation of genes based
on the initial two-sample dataset, we calculated Ps1 in each window of 100 kb, sliding by 10 kb.
Within all such windows with at least 80 kb of callable sites in both species, 3.9% had Ps1 = 0 for
the blue-eyed black lemur (0.5% for the black lemur). We obtained an initial set of candidate
regions to annotate for each species by identifying all windows that fell in the 3.9% tail of Ps1 for
100 kb sliding windows in that species and then combining any overlapping windows into larger
regions. To determine the empirical percentile of individual regions described in the main text, as
well as to identify regions subject to weaker selection in the larger dataset, we also calculated Ps1
and FST for 20 kb non-overlapping windows (Figure S7). We selected 20 kb windows in the 1%
FST tail from the larger dataset, combining any adjacent windows, to generate a list of regions
from the larger dataset to annotate for gene ontology analysis (see S21).
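The window-based scan described in this appendix can be sketched as follows: compute Ps1 from per-window site counts, enumerate 100 kb windows sliding by 10 kb, and merge overlapping candidate windows into regions. The function names are hypothetical and the coordinates are half-open intervals, which is our convention rather than one stated in the text; with this convention, windows that merely touch are also merged.

```python
def ps1(hs1, hs2, ss, fd):
    """Ps1 = (hs1 + ss) / (hs1 + hs2 + ss + fd); NaN if the window has no variable sites."""
    total = hs1 + hs2 + ss + fd
    return (hs1 + ss) / total if total else float("nan")


def sliding_windows(length, size=100_000, step=10_000):
    """Half-open (start, end) windows of the given size, sliding by step."""
    return [(s, s + size) for s in range(0, length - size + 1, step)]


def merge_overlapping(windows):
    """Combine overlapping or touching windows into larger regions."""
    merged = []
    for start, end in sorted(windows):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


if __name__ == "__main__":
    print(ps1(hs1=0, hs2=5, ss=0, fd=5))  # no diversity in the focal species
    print(sliding_windows(120_000))
    print(merge_overlapping([(0, 100_000), (10_000, 110_000), (300_000, 400_000)]))
```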
Appendix S17. Annotation of orthologs of OCA2 and additional human iris pigmentation
candidate genes within the blue-eyed black lemur genome
We downloaded the sequence of exons from the longest human transcript of OCA2 from
Ensembl GRCh37. We used BLASTN 2.2.26+ with default parameters to compare each human
exon to the reference genome assembly of the blue-eyed black lemur. From the six exons that
successfully mapped in this initial search, we determined that the coding region was likely to be
within scaffold 2503. We downloaded the RefSeq-annotated human OCA2 sequence, including
untranslated regions and all introns from the UCSC genome browser and used BLASTN to
compare it to scaffold 2503. We used the overlap of the BLASTN hits with the positions of human
exons from the RefSeq gene annotation to identify the positions of the remaining lemur exons
within the scaffold. Part of the first exon, which lacks strong conservation in mammals, could not
be annotated in this way. For this exon, we determined the first base pair of the sequence by
finding the start codon that would maintain the same translation frame nearest to the predicted
start position based on the human exon length. We obtained sequence for all exons for both
species from the SNP-containing references, and we translated the resulting open reading frames
using eBioX version 1.5.1 (http://www.ebioinformatics.org/).
We found a single difference between the protein sequences of the two species, namely, that
position 89 was L in blue-eyed black lemur, whereas in black lemur, it was heterozygous for L
and P. Because this site is segregating within black lemurs, it cannot be solely responsible for the
lemurs’ fixed iris pigmentation difference, yet it is possible that it could contribute to the
phenotype. To determine whether this site might play a role in iris pigmentation, we aligned the
inferred protein coding sequences of the two lemurs against coding sequences for nine other
mammals (chimpanzee, dog, horse, human, macaque, marmoset, mouse, mouse lemur, and
orangutan), downloaded from Ensembl. We observed that this site was not highly conserved, and
notably both L and P alleles occurred in other brown-eyed mammals, and we thus inferred that
this locus is likely not involved in iris color determination in lemurs.
In addition to OCA2, a number of other genes have been much more weakly associated with
natural iris pigmentation variation in humans or associated with disease or mutant phenotypes
resulting in blue irises in humans or mice (Kanetsky et al. 2002; Loftus et al. 2002; Frudakis et al.
2003; Sulem et al. 2007; Sulem 2008; Sturm et al. 2008; Liu et al. 2009, 2010; Valenzuela et al.
2010; Pingault et al. 2010; Hellström et al. 2011). We identified the location of orthologs of 16
such pigmentation candidate genes (ASIP, CYP1A2, DCT, DSCR9, HGS, MITF, MYO5A,
NPLOC4, PAX3, PMEL, RAB38, SLC24A4, SLC24A5, SLC45A2, TYR, and TYRP1) within the
blue-eyed black lemur genome using BLAST, and we looked for evidence of selection within
these regions. Specifically, we downloaded one DNA sequence for each gene from Ensembl
GRCh37 and used BLASTN version 2.2.27+ (Altschul et al. 1990, 1997) with default parameters
to identify the best match between each exon and the lemur reference genome. We examined the
regions containing multiple exons for signatures of selection in the blue-eyed black lemur. In our
initial, two-sample analysis, two of three 20 kb windows overlapping DCT and one of seven
windows overlapping ASIP had Ps1 = 0. In the analysis of the larger dataset, the signals of
selection at DCT and ASIP were somewhat reduced (DCT minimum Ps1 = 0.20, 9.1%-tile;
maximum FST = 0.80, 18.7%-tile; ASIP minimum Ps1 = 0.17, 6.6%-tile; maximum FST = 0.88,
6.4%-tile). However, several other candidate genes overlapped windows in the tail of these test
statistics in the larger dataset: MITF and TYR, which play known roles in the melanin biosynthesis
pathway and have been robustly associated with human pigmentation variation (Sturm et al.
2008; Visser et al. 2012), and LYST and NPLOC4, which overlap regions associated with
quantitative pigmentation variation in one recent study (Liu et al. 2010). When we corrected for
the number of windows tested for each gene by comparing FST and Ps1 statistics in these
windows to those in genome-wide regions matched for length (see Methods), only the signal at
MITF remained unusual (corrected p-values: ASIP FST p = 0.269, Ps1 p = 0.287; DCT FST p =
0.383, Ps1 p = 0.212, LYST FST p = 0.155, Ps1 p = 0.177; MITF FST p = 0.029, Ps1 p = 0.014;
NPLOC4 FST p = 0.061, Ps1 p = 0.107; TYR FST p = 0.351, Ps1 p = 0.132).
We determined the amino acid sequences for three of these genes that showed potential
signatures of selection in the larger dataset and whose function in melanin biosynthesis is
well-established: ASIP, MITF, and TYR. We downloaded the sequence of exons from the two known
transcripts of ASIP, five known transcripts of MITF, and one known transcript of TYR from
Ensembl GRCh37. As in the annotation of OCA2, we used a BLASTN search with default
parameters to find the scaffold or scaffolds most likely to contain the gene. We then extracted the
sequence of those scaffolds and performed a discontiguous megablast of each human exon not
found in the initial search to the scaffolds using the online BLASTN tool
(http://www.ncbi.nlm.nih.gov/blast). In this way, we were able to identify the positions of all exons
for known transcripts of these genes. We obtained sequence for all exons for both species from
the SNP-containing references, and we translated the resulting open reading frames using eBioX
version 1.5.1 (http://www.ebioinformatics.org/).
Appendix S18. Identification of candidate regulatory changes within the scaffold containing
the OCA2 ortholog
We found no fixed amino acid differences between the blue-eyed black lemur and black lemur
OCA2 sequences obtained by translating the coding sequence annotated from the two genomes
(see S17). We therefore focused our search for candidate loci influencing iris pigmentation on
regions with potential regulatory function. In humans, the causal variant disrupts a HLTF binding
site; the blue-eyed allele is also associated with reduced binding of the transcription factors LEF1
and MITF at nearby motifs in comparison to binding at the brown-eyed allele, even though there
are no sequence changes in these motifs (Eiberg et al. 2008; Visser et al. 2012). These findings
suggest that all three factors (HLTF, LEF1, and MITF) may jointly regulate OCA2 expression in
humans, and thus mutations that disrupt binding of one of these factors in blue-eyed black lemurs
provide strong candidates for a causal mutation. We therefore searched for sequences predicted
to bind HLTF, LEF1, or MITF strongly in the black lemur and weakly in the blue-eyed black lemur.
Because in human all three of these factors bind near the causal site, we focused on inferred
differential binding sites in lemur where another transcription factor also had a motif nearby.
Previous research has demonstrated strong conservation of binding site preferences for some
transcription factors across vertebrates (Schmidt et al. 2010). Assuming that binding site
preferences for HLTF, LEF1, and MITF have been conserved between lemurs and humans, we
used human-derived motifs to search for potential binding site changes in lemurs. We
downloaded position weight matrices (PWMs) for RUSH (the mammalian ortholog of HLTF),
LEF1, and E-box (the type of binding site recognized by MITF) from TRANSFAC (Wingender et
al. 1996) or JASPAR (Sandelin et al. 2004). We used the PWMs to calculate "PWM scores,"
which measure how strongly observed sequences match a motif. Specifically, a PWM score is the
likelihood of binding of a transcription factor given the observed sequence. Conditional on
transcription factor binding at one allele, the difference in PWM scores between alleles has been
shown to be a good predictor of a change in transcription factor binding (McVicker et al. 2013).
We calculated PWM scores for each factor at each site within the scaffold 2503 sequence for
black lemur and blue-eyed black lemur. For each factor, we identified sites where the black lemur
and blue-eyed black lemur samples were fixed for different alleles, where the black lemur allele
resulted in a PWM score within the top 10% of scores within the scaffold, and where the
blue-eyed black lemur allele resulted in a 10-fold lower predicted binding (a reduction of one
log10 unit in PWM score).
For these sites, we determined which allele was derived using one of two outgroups: aye-aye
(data downloaded from http://giladlab.uchicago.edu/data/AyeAyeGenome/) or mouse lemur (data
downloaded from ftp://ftp.ensembl.org/pub/release-74/fasta/microcebus_murinus/dna).
Specifically, we used BLASTN (Altschul et al. 1990, 1997) to identify the ortholog of the 100 bp
surrounding each allele in aye-aye and/or mouse lemur, and we determined whether each
sequence at the divergent site resulted in a match with the outgroup(s) in the top BLAST hit. Our
candidates are cases in which the black lemur allele at the divergent site matched the outgroup
allele and the blue-eyed black lemur allele did not (i.e., providing evidence that the black lemur
allele was ancestral and the blue-eyed black lemur allele was derived). We required that the
sequence surrounding both alleles have BLAST hits for at least one of the two outgroups, and
that neither outgroup be discordant (discordant cases were ones in which the black lemur allele
was inferred to be derived or in which both alleles had perfect matches in the outgroup). We then
identified the subset of these candidates residing within 200 bp of a site within the top 10% of
PWM scores for at least one of the other transcription factors (Figure 5b). The scripts used to
perform these searches are available at
https://github.com/sorrywm/genome_analysis/regulatory_annotation.
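For readers unfamiliar with PWM scoring, the filtering criteria used in this appendix (a high-scoring black lemur allele and a 10-fold lower score for the blue-eyed black lemur allele) can be illustrated with the sketch below. It is not the code in the repository above: the toy three-position matrix, the function names, and the scoring of a site as the product of per-position base probabilities are all assumptions made for illustration.

```python
import math

# Toy motif (invented): one dict of base probabilities per motif position.
TOY_PWM = [
    {"A": 0.70, "C": 0.10, "G": 0.10, "T": 0.10},
    {"A": 0.10, "C": 0.70, "G": 0.10, "T": 0.10},
    {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
]


def pwm_score(pwm, site):
    """Likelihood of the site under the motif: product of per-position probabilities."""
    return math.prod(col[base] for col, base in zip(pwm, site))


def top_fraction_cutoff(scores, fraction=0.10):
    """Smallest score that still falls within the top fraction of all scores."""
    ranked = sorted(scores, reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return ranked[k - 1]


def is_candidate(pwm, strong_site, weak_site, cutoff, fold=10.0):
    """True if the strong allele passes the cutoff and the weak allele scores >= fold lower."""
    strong = pwm_score(pwm, strong_site)
    weak = pwm_score(pwm, weak_site)
    return strong >= cutoff and weak * fold <= strong


if __name__ == "__main__":
    print(pwm_score(TOY_PWM, "ACG"), pwm_score(TOY_PWM, "ACT"))
    print(is_candidate(TOY_PWM, "ACG", "ACT", cutoff=0.4))
```

In practice the cutoff would be derived from the scores at every position in the scaffold (for example with `top_fraction_cutoff`), and the two sites would be the black lemur and blue-eyed black lemur sequences at a fixed difference.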
Appendix S19. Calculation of summary statistics from the combined sample using ANGSD
and ngsTools
Following raw read filtering, reads were aligned separately for each sample and lane using bwa
and samtools, with the same alignment procedure as for the high coverage black lemur data (SI
Text S7). Duplicates were marked with Picard, and reads were re-aligned locally around indels
using GATK’s RealignerTargetCreator and IndelRealigner. The resulting alignment (.bam) files
were used as input to ANGSD (http://popgen.dk/wiki/index.php/ANGSD). Initial site frequency
spectrum likelihoods were estimated per site based on individual genotype likelihoods, assuming
Hardy-Weinberg equilibrium (-doSaf 1), with a GATK model for genotype likelihoods (-GL 2). For
this step, sites were filtered to require a minimum mapping quality of 1, minimum quality of 20,
and a minimum of three individuals per species with data. The resulting likelihoods for 99% of all
sites were used to estimate a genome-wide site frequency spectrum for each species separately,
using emOptim2. This site frequency spectrum was used as a prior to estimate posterior
probabilities for per-site allele frequencies (-pest). For this step, sites were filtered as previously,
with the additional requirement of a minimum depth of 9.
The posterior probabilities of per-site allele frequency spectra were provided as input to the
ngsTools programs ngsStat and ngsFST (https://github.com/mfumagalli/ngsTools) (Fumagalli
2013; Fumagalli et al. 2013). These programs output summary statistics, which were then
summarized over 20 kb windows to assess the genome-wide distribution of the statistics Ps1 and
FST. Specifically, we calculated Ps1 for each species as ∑ss/∑Pvar, where ss represents the
per-site probability of a segregating site in individuals of that species, and Pvar represents the per-site
probability of a segregating site in the whole sample. We used the per-species heterozygosity
(2pq) and dXY (p1q2 + p2q1) output by ngsStat as HW and HB, respectively, to calculate
single-species FST as 1 – HW/HB. Across regions, we summarized FST as 1 – ∑HW/∑HB. For the
calculation of both statistics, we excluded sites with Pvar < 0.8. We excluded regions in which
<90% of sites passed the data filters above from analysis.
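The windowed summaries defined above can be sketched in a few lines. The input layout (one dict per site with the hypothetical keys shown) and the function names are our assumptions; the sketch does not parse actual ngsStat or ngsFST output.

```python
def heterozygosity(p):
    """Within-species heterozygosity, 2pq."""
    return 2 * p * (1 - p)


def dxy(p1, p2):
    """Between-species divergence, p1*q2 + p2*q1."""
    return p1 * (1 - p2) + p2 * (1 - p1)


def window_stats(sites, min_pvar=0.8):
    """Return (Ps1, FST) for a window, excluding sites with Pvar below min_pvar.

    Each site is a dict with keys 'p_seg_species', 'p_var', 'h_w', 'h_b'.
    """
    kept = [s for s in sites if s["p_var"] >= min_pvar]
    sum_pvar = sum(s["p_var"] for s in kept)
    sum_hb = sum(s["h_b"] for s in kept)
    ps1 = sum(s["p_seg_species"] for s in kept) / sum_pvar if sum_pvar else float("nan")
    fst = 1 - sum(s["h_w"] for s in kept) / sum_hb if sum_hb else float("nan")
    return ps1, fst


if __name__ == "__main__":
    # Invented per-site values; the third site fails the Pvar filter.
    sites = [
        {"p_seg_species": 0.9, "p_var": 1.0, "h_w": 0.5, "h_b": 0.5},
        {"p_seg_species": 0.0, "p_var": 1.0, "h_w": 0.0, "h_b": 1.0},
        {"p_seg_species": 0.5, "p_var": 0.5, "h_w": 0.4, "h_b": 0.4},
    ]
    print(window_stats(sites))
```

Summing numerators and denominators separately before taking the ratio, as in the text, avoids averaging noisy per-site ratios.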
Appendix S20. Assessment of admixture in the combined sample
We used a subset of high confidence polymorphic sites (Pvar ≥ 0.8 as in S19) in the combined
sample to perform principal components analysis (PCA) and an estimation of admixture
proportions. To minimize linkage disequilibrium between sites, we randomly sampled sites from
scaffolds longer than 100 kb, selecting at most one site per 100 kb. For PCA, we first called
genotypes using ANGSD -doGeno 32, and subsequently estimated the covariance matrix among
sites using ngsCovar (Fumagalli 2013; Fumagalli et al. 2013). We plotted the first two principal
components, which represent 34.6% and 12.4% of the total variation, respectively (Figure S5a).
For the estimation of admixture proportions, we generated Beagle format genotype likelihoods
using ANGSD -doGlf 2, and then ran NGSadmix (Skotte et al. 2013) with k = 2 to infer the
proportion of each sample derived from two source populations (Figure S5b). The command lines
for angsd, ngsCovar, and NGSadmix were as follows:
angsd -bam $BAMLIST -nInd 8 -doGeno 32 -doPost 1 -out $PCAOUT -doGlf 2 -P 5 -sites
$SITESFILE -GL 1 -doMajorMinor 1 -doMaf 2
ngsCovar -probfile $PCAOUT.geno -outfile $PCAOUT.covar -nind 8 -nsites $N_SITES -call 0
NGSadmix -likes $PCAOUT.beagle.gz -K 2 -P 4 -o $PCAOUT.ngsadmix
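The site-thinning step that produces the sites file used above (at most one randomly chosen site per 100 kb, restricted to scaffolds longer than 100 kb) can be sketched as follows. Dividing each scaffold into fixed 100 kb bins is our assumption about how "at most one site per 100 kb" might be implemented; the function name and example data are hypothetical.

```python
import random


def thin_sites(sites, scaffold_lengths, bin_size=100_000, min_scaffold=100_000, seed=0):
    """Keep at most one random site per bin on scaffolds longer than min_scaffold.

    sites: iterable of (scaffold, position); scaffold_lengths: dict of scaffold -> length.
    """
    rng = random.Random(seed)
    bins = {}
    for scaffold, pos in sites:
        if scaffold_lengths[scaffold] > min_scaffold:
            bins.setdefault((scaffold, pos // bin_size), []).append(pos)
    return sorted((scaf, rng.choice(positions)) for (scaf, _), positions in bins.items())


if __name__ == "__main__":
    lengths = {"s1": 250_000, "s2": 50_000}  # invented scaffolds
    sites = [("s1", 10), ("s1", 5_000), ("s1", 150_000), ("s2", 100)]
    print(thin_sites(sites, lengths))
```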
Appendix S21. Annotation of candidate selected regions and gene ontology analysis
We annotated genes in the regions identified in SI Text S16 (from the two-sample dataset, using
100 kb sliding windows, and from the full dataset, using 20 kb non-overlapping windows) by
aligning human protein sequences to the blue-eyed black lemur genome. We obtained protein
sequences for human genome build hg18 and used TBLASTN version 2.2.22+ (Altschul et al.
1990, 1997), with an e-value threshold of $10^{-5}$ (two-sample dataset) or $5 \times 10^{-5}$ (full dataset) to
identify orthologs within the regions of the blue-eyed black lemur reference genome
corresponding to the 3.9% Ps1 tail (two-sample dataset) or 1% FST tail (full dataset). We then took
the list of all human proteins with hits within candidate regions and performed TBLASTN for these
proteins against the entire lemur genome. We retained proteins whose best genome-wide match
(containing the lowest e-value or maximum mean percent identity) for any subset of the protein
sequence overlapped the candidate region. In cases in which multiple proteins mapped to the
same location (>50% protein length overlapping, presumably representing multiple transcripts of
the same gene or multiple genes in the same family), we retained the protein with the largest total
length spanned by initial TBLASTN hits or the largest mean percent identity. If multiple proteins
were equivalent by these two tests (generally representing multiple transcripts of the same gene),
we selected one at random for further testing. All transcripts mapped to regions in the 1% tail of
FST from the full dataset in either species are provided in Dataset S1 (available on Dryad at
doi:10.5061/dryad.rn745). We converted the candidate lists to unique Ensembl gene IDs for gene
ontology (GO) enrichment analysis.
We tested for an enrichment of specific GO categories among genes annotated within candidate
regions for the blue-eyed black lemur from the full dataset using the Database
for Annotation, Visualization and Integrated Discovery (DAVID) v6.7 (Huang et al. 2008, 2009). In
order to focus on genes with evidence for selection specific to the blue-eyed black lemur lineage,
we excluded any genes mapping to regions with black lemur FST in the 20% tail. We performed
functional annotation with default settings in DAVID using this candidate gene list, subsampled to
include only one gene per region, with the background gene list generated from the hg18 unique
Ensembl gene IDs. We report fold enrichment and EASE score (adjusted Fisher Exact Test
p-value) from the functional annotation chart. Our initial analysis indicated an enrichment of genes
related to melanocyte development (Biocarta, 50.5-fold enrichment, p = 0.034) and within the
melanogenesis pathway (KEGG, 7.3-fold enrichment, p = 0.057); however, both of these
categories have median gene lengths that are substantially longer than those of the background
gene set. To correct for this, we ranked all genes in the background set by longest transcript
length and selected the sets of genes that would result in the same median gene length as those
of the categories of interest, starting with the longest genes; we then used these background sets
to re-run the GO enrichment analysis.
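The length-matched background described above can be sketched as follows. This is a minimal illustration rather than the script used in the study: the function name, the toy gene lengths, and the stopping rule (keep the largest set of longest genes whose median length is still at least the target median) are assumptions made for the example.

```python
def length_matched_background(lengths_by_gene, target_median):
    """Return the longest genes whose median length is still >= target_median.

    lengths_by_gene: dict mapping gene ID -> longest transcript length.
    target_median: median gene length of the category of interest.
    """
    # Rank background genes from longest to shortest transcript.
    ranked = sorted(lengths_by_gene.items(), key=lambda kv: kv[1], reverse=True)
    lengths = [length for _, length in ranked]

    def prefix_median(n):
        # Median of the n longest genes (list is sorted in descending order).
        mid = n // 2
        if n % 2:
            return lengths[mid]
        return (lengths[mid - 1] + lengths[mid]) / 2

    keep = 0
    for n in range(1, len(lengths) + 1):
        # Adding shorter genes can only lower the median, so stop at the
        # first prefix whose median drops below the target.
        if prefix_median(n) < target_median:
            break
        keep = n
    return [gene for gene, _ in ranked[:keep]]


# Hypothetical toy data: five genes and a category median of 70.
toy = {"a": 100, "b": 80, "c": 60, "d": 40, "e": 20}
print(length_matched_background(toy, 70))
```

With the toy data, the four longest genes are kept because their median (70) matches the target, whereas adding the fifth gene would lower it to 60.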
Melanocyte development was the most enriched category in our initial analysis. Other GO
categories with initial enrichments at least as strong as those for melanogenesis were
U2A'/phosphoprotein 32 family A, C-terminal (Interpro, 41.8-fold enrichment, p = 0.046); Spectrin
repeat (Interpro, 23.9-fold enrichment, p = 0.079) and repeat: Spectrin 1, 2, 3, 4, and 5
(UP_SEQ_FEATURE, 18.9 – 35.0-fold enrichment, p = 0.055 – 1); LRRcap (SMART, 34.6-fold
enrichment, p = 0.055); single fertilization and fertilization (GOTERM_BP_FAT, 12.8- and 9.7-fold
enrichment, p = 0.022 and 0.037, respectively); and transcription factor (SP_PIR_KEYWORDS,
11.2-fold enrichment, p = 0.029).
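The two statistics reported for each category can be illustrated with a short sketch. Fold enrichment is the proportion of list genes in a category divided by the proportion of background genes in that category. DAVID describes the EASE score as a more conservative variant of the one-tailed Fisher exact test in which one gene is removed from the list hits; the function below approximates that idea with a hypergeometric tail and may not reproduce DAVID's exact contingency table. All function names and counts here are hypothetical.

```python
from math import comb


def fold_enrichment(list_hits, list_total, pop_hits, pop_total):
    """Category proportion in the gene list divided by its proportion in the background."""
    return (list_hits / list_total) / (pop_hits / pop_total)


def upper_tail_p(list_hits, list_total, pop_hits, pop_total):
    """One-tailed hypergeometric probability of observing >= list_hits category genes."""
    max_hits = min(list_total, pop_hits)
    all_draws = comb(pop_total, list_total)
    favorable = sum(
        comb(pop_hits, i) * comb(pop_total - pop_hits, list_total - i)
        for i in range(list_hits, max_hits + 1)
    )
    return favorable / all_draws


def ease_like_p(list_hits, list_total, pop_hits, pop_total):
    """Conservative variant (assumption): drop one list hit before computing the tail."""
    return upper_tail_p(max(list_hits - 1, 0), list_total, pop_hits, pop_total)


# Hypothetical counts: 3 of 300 list genes and 40 of 30000 background genes in a category.
print(fold_enrichment(3, 300, 40, 30000))
print(upper_tail_p(3, 300, 40, 30000), ease_like_p(3, 300, 40, 30000))
```

Because one hit is removed, the EASE-like value is never smaller than the unadjusted tail probability, which penalizes categories supported by very few genes.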
Appendix S22. Genome size estimation
One way to estimate the size of a genome is by dividing the total sequence contained within k-mers of a given length by the peak depth of such k-mers. To do this, we calculated the total
sequence length of 17-mers within trimmed reads from the blue-eyed black lemur data (Ktotal).
The 17-mers present only once are unique k-mers, which are likely sequencing errors, and
considering them can lead to overestimation of the genome size. Thus, we subtracted Kunique (the
total sequence length of unique 17-mers) from Ktotal and divided the result by the peak 17-mer
sequencing depth to obtain the estimated genome size.
Ktotal = 130.259 Gb
Kunique = 1.554 Gb
Peak sequencing depth = 48X
The genome size is thus estimated to be (130.259 – 1.554) / 48 = 2.681 Gb. This estimate
corresponds well to molecular weight-based genome size estimates for the black lemur, which
range from 2.62 to 3.51 Gb (http://www.genomesize.com). In particular, the most recent
molecular estimate based on flow cytometry indicates a genome size of approximately 2.62 Gb
(Krishan et al. 2005).
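The calculation above can be reproduced in a few lines. The function and argument names are chosen for this illustration only; the input values are those reported in this appendix.

```python
def estimate_genome_size_gb(k_total_gb, k_unique_gb, peak_depth):
    """Genome size (Gb) = (total k-mer sequence - unique k-mer sequence) / peak k-mer depth."""
    if peak_depth <= 0:
        raise ValueError("peak_depth must be positive")
    return (k_total_gb - k_unique_gb) / peak_depth


# Values reported above: Ktotal = 130.259 Gb, Kunique = 1.554 Gb, peak depth = 48X.
size_gb = estimate_genome_size_gb(130.259, 1.554, 48)
print(f"{size_gb:.3f} Gb")  # 2.681 Gb
```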
Appendix S23. Identification of neighboring scaffolds to the scaffold containing the OCA2
ortholog
In order to investigate patterns of divergence and diversity beyond the 600 kb in scaffold 2503,
where the orthologs of HERC2 and OCA2 reside, we identified the neighboring scaffolds using
two different methods. First, we used BLASTN to identify the regions of the human genome (build
hg19) orthologous to the first and last 100 kb of scaffold 2503. We then downloaded these
sequences, along with the 100 kb beyond the mapping position of the end of scaffold 2503, from
the UCSC genome browser (http://genome.ucsc.edu/), and used BLASTN to compare each
region to the lemur assembly. Assuming synteny had been conserved to human, we inferred that
the scaffold with the highest scoring match to the 100 kb beyond the mapping position of scaffold
2503 was likely the adjacent scaffold. The BLASTN results indicated that the first 60 kb of
scaffold 3393 “preceded” scaffold 2503 (i.e., was adjacent to the 0 end) with approximately 1.6 kb
of unmappable sequence separating the two, and that scaffold 2174 “followed” scaffold 2503 (i.e.,
was adjacent to the 600 kb end), separated by approximately 7 kb.
We additionally used information about the mapping of MP paired reads to provide further
evidence that the first 60 kb of scaffold 3393 mapped adjacent to the 0 end of scaffold 2503. We
used SAMtools to identify read pairs from the unfiltered BAM file that did not map as “proper
pairs” with both reads mapping in the appropriate orientation to the same scaffold (using “view –f
2”), and in which one read mapped to scaffold 2503. For the 3 kb MP library, reads within the first 3 kb of
scaffold 2503 had mates mapping at 56.6 kb of scaffold 3393; for the 8 kb MP library, reads within the first
8 kb had mates mapping between 51.5 kb and 59.2 kb of scaffold 3393. No reads within the last 8 kb of scaffold 2503 had pairs that mapped
to different scaffolds, suggesting that the “following” scaffold was separated by too long a region
of unmappable DNA (presumably repetitive DNA) to be identified in this way.
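The read-pair filter described above can be sketched directly on SAM text records. This is an illustration under stated assumptions, not the commands used in the study: the function names and scaffold names are hypothetical, and the logic follows the description in the text (pairs lacking the "proper pair" flag, with one read on the scaffold of interest and its mate on another scaffold). In the SAM flag field, bit 0x2 marks a proper pair; in SAMtools, `view -f 2` retains reads with that bit set, whereas `view -F 2` excludes them.

```python
PAIRED = 0x1       # read is part of a pair
PROPER_PAIR = 0x2  # both reads mapped in the expected orientation and distance


def is_improper_pair(flag):
    """True for paired reads that are not flagged as a proper pair."""
    return bool(flag & PAIRED) and not (flag & PROPER_PAIR)


def cross_scaffold_mates(sam_lines, scaffold):
    """Yield (read position, mate scaffold, mate position) for reads on `scaffold`
    whose mate maps to a different scaffold."""
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag, rname, pos = int(fields[1]), fields[2], int(fields[3])
        rnext, pnext = fields[6], int(fields[7])
        if rname != scaffold or not is_improper_pair(flag):
            continue
        if rnext in ("=", "*"):  # mate on the same scaffold, or unmapped
            continue
        yield pos, rnext, pnext


# Hypothetical records (only the first has a mate on another scaffold).
records = [
    "@SQ\tSN:scaffold2503\tLN:600000",
    "r1\t97\tscaffold2503\t1500\t60\t35M\tscaffold3393\t56600\t0\t*\t*",
    "r2\t99\tscaffold2503\t2000\t60\t35M\t=\t2300\t335\t*\t*",
    "r3\t97\tscaffold9\t10\t60\t35M\tscaffold3393\t5\t0\t*\t*",
]
print(list(cross_scaffold_mates(records, "scaffold2503")))
```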
The patterns of diversity and divergence across scaffold 2174 suggested that the orientation of
this scaffold was not the same as that of scaffold 2503, relative to the human genome. We
hypothesize that an inversion may have occurred within this region since the ancestor of lemurs
and humans, and we represent scaffold 2174 in Figure 4a (the scaffold to the right of scaffold
2503) assuming such an inversion.
Supporting References
Abecasis GR, Altshuler D, Auton A et al. (2010) A map of human genome variation from
population-scale sequencing. Nature, 467, 1061–73.
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool.
Journal of Molecular Biology, 215, 403–10.
Altschul SF, Madden TL, Schäffer AA et al. (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic acids research, 25, 3389–402.
Auton A, Fledel-Alon A, Pfeifer S et al. (2012) A fine-scale chimpanzee genetic map from
population sequencing. Science (New York, N.Y.), 336, 193–8.
Benaglia T, Chauveau D, Hunter DR, Young DS (2009) mixtools: An R Package for Analyzing
Mixture Models. J Stat Softw, 32, 1–29.
Brent RP (1973) Algorithms for minimization without derivatives. Prentice-Hall,
Englewood Cliffs.
Chaisson MJ, Brinza D, Pevzner PA (2009) De novo fragment assembly with short mate-paired
reads: Does the read length matter? Genome research, 19, 336–46.
Delaneau O, Zagury J-F, Marchini J (2013) Improved whole-chromosome phasing for disease
and population genetic studies. Nature methods, 10, 5–6.
DePristo MA, Banks E, Poplin R et al. (2011) A framework for variation discovery and genotyping
using next-generation DNA sequencing data. Nature Genetics, 43, 491–8.
Dubchak I, Poliakov A, Kislyuk A, Brudno M (2009) Multiple whole-genome alignments without a
reference organism. Genome research, 19, 682–9.
Earl D, Bradnam K, St John J et al. (2011) Assemblathon 1: a competitive assessment of de novo
short read assembly methods. Genome research, 21, 2224–41.
Eiberg H, Troelsen J, Nielsen M et al. (2008) Blue eye color in humans may be caused by a
perfectly associated founder mutation in a regulatory element located within the HERC2
gene inhibiting OCA2 expression. Human Genetics, 123, 177–87.
Freedman AH, Gronau I, Schweizer RM et al. (2014) Genome Sequencing Highlights the
Dynamic Early History of Dogs. PLoS Genetics, 10, e1004016.
Frudakis T, Thomas M, Gaskin Z et al. (2003) Sequences associated with human iris
pigmentation. Genetics, 165, 2071–2083.
Fumagalli M (2013) Assessing the effect of sequencing depth and sample size in population
genetics inferences. PLOS ONE, 8, e79667.
Fumagalli M, Vieira FG, Korneliussen TS et al. (2013) Quantifying population genetic
differentiation from next-generation sequencing data. Genetics, 195, 979–92.
Gnerre S, Maccallum I, Przybylski D et al. (2011) High-quality draft assemblies of mammalian
genomes from massively parallel sequence data. Proceedings of the National Academy of
Sciences of the United States of America, 108, 1513–8.
Hellenthal G, Stephens M (2007) msHOT: modifying Hudson’s ms simulator to incorporate
crossover and gene conversion hotspots. Bioinformatics (Oxford, England), 23, 520–1.
Hellström AR, Watt B, Fard SS et al. (2011) Inactivation of Pmel alters melanosome shape but
has only a subtle effect on visible pigmentation. PLoS genetics, 7,
e1002285.
Huang DW, Sherman BT, Lempicki RA (2008) Systematic and integrative analysis of large gene
lists using DAVID bioinformatics resources. Nat. Protocols, 4, 44–57.
Huang DW, Sherman BT, Lempicki RA (2009) Bioinformatics enrichment tools: paths toward the
comprehensive functional analysis of large gene lists. Nucleic Acids Research, 37, 1–13.
Hudson RR (2002) Generating samples under a Wright-Fisher neutral model of genetic variation.
Bioinformatics (Oxford, England), 18, 337–8.
Hwang DG, Green P (2004) Bayesian Markov chain Monte Carlo sequence analysis reveals
varying neutral substitution patterns in mammalian evolution. Proceedings of the National
Academy of Sciences of the United States of America, 101, 13994–14001.
Idury RM, Waterman MS (1995) A new algorithm for DNA sequence assembly. Journal of
computational biology : a journal of computational molecular cell biology, 2, 291–306.
Kanetsky PA, Swoyer J, Panossian S et al. (2002) A polymorphism in the agouti signaling protein
gene is associated with human pigmentation. American journal of human genetics, 70, 770–
5.
Kaplan NL, Hudson RR, Langley CH (1989) The “hitchhiking effect” revisited. Genetics, 123,
887–99.
Kelley DR, Schatz MC, Salzberg SL (2010) Quake: quality-aware detection and correction of
sequencing errors. Genome Biology, 11, R116.
Kim S-H, Elango N, Warden C, Vigoda E, Yi SV (2006) Heterogeneous genomic molecular
clocks in primates. PLoS genetics, 2, e163.
Kong A, Barnard J, Gudbjartsson DF et al. (2004) Recombination rate and reproductive success
in humans. Nature Genetics, 36, 1203–6.
Kong A, Frigge ML, Masson G et al. (2012) Rate of de novo mutations and the importance of
father’s age to disease risk. Nature, 488, 471–5.
Krishan A, Dandekar P, Nathan N et al. (2005) DNA index, genome size, and electronic nuclear
volume of vertebrates from the Miami Metro Zoo. Cytometry - Part A, 65, 26–34.
Kurtz S, Phillippy A, Delcher AL et al. (2004) Versatile and open software for comparing large
genomes. Genome biology, 5, R12.
Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows-Wheeler transform.
Bioinformatics (Oxford, England), 25, 1754–60.
Li H, Durbin R (2011) Inference of human population history from individual whole-genome
sequences. Nature, 475, 493–6.
Li H, Handsaker B, Wysoker A et al. (2009) The Sequence Alignment/Map format and SAMtools.
Bioinformatics (Oxford, England), 25, 2078–9.
Li R, Zhu H, Ruan J et al. (2010) De novo assembly of human genomes with massively parallel
short read sequencing. Genome Research, 20, 265–72.
Liu F, van Duijn K, Vingerling JR et al. (2009) Eye color and the prediction of complex
phenotypes from genotypes. Current biology : CB, 19, R192–3.
Liu F, Wollstein A, Hysi PG et al. (2010) Digital quantification of human eye color highlights
genetic association of three new loci. PLoS genetics, 6, e1000934.
Loftus SK, Larson DM, Baxter LL et al. (2002) Mutation of melanosome protein RAB38 in
chocolate mice. Proceedings of the National Academy of Sciences of the United States of
America, 99, 4471–6.
Lunter G, Goodson M (2011) Stampy: a statistical algorithm for sensitive and fast mapping of
Illumina sequence reads. Genome research, 21, 936–9.
Luo R, Liu B, Xie Y et al. (2012) SOAPdenovo2: an empirically improved memory-efficient
short-read de novo assembler. GigaScience, 1, 18.
Lynch M (2010) Evolution of the mutation rate. Trends in genetics : TIG, 26, 345–52.
Marçais G, Kingsford C (2011) A fast, lock-free approach for efficient parallel counting of
occurrences of k-mers. Bioinformatics (Oxford, England), 27, 764–70.
McKenna A, Hanna M, Banks E et al. (2010) The Genome Analysis Toolkit: a MapReduce
framework for analyzing next-generation DNA sequencing data. Genome research, 20,
1297–303.
McVicker G, van de Geijn B, Degner JF et al. (2013) Identification of genetic variants that affect
histone modifications in human cells. Science (New York, N.Y.), 342, 747–9.
Meyer WK, Zhang S, Hayakawa S, Imai H, Przeworski M (2013) The convergent evolution of blue
iris pigmentation in primates took distinct molecular paths. American Journal of Physical
Anthropology, 151, 398–407.
Miller JR, Delcher AL, Koren S et al. (2008) Aggressive assembly of pyrosequencing reads with
mates. Bioinformatics (Oxford, England), 24, 2818–24.
Müller S, O’Brien PCM, Ferguson-Smith MAF, Wienberg J (1997) Reciprocal chromosome
painting between human and prosimians (Eulemur macaco macaco and E. fulvus
mayottensis). Cytogenetic and Genome Research, 78, 260–271.
Parra G, Bradnam K, Korf I (2007) CEGMA: a pipeline to accurately annotate core genes in
eukaryotic genomes. Bioinformatics (Oxford, England), 23, 1061–7.
Parra G, Bradnam K, Ning Z, Keane T, Korf I (2009) Assessing the gene space in draft genomes.
Nucleic Acids Research, 37, 289–97.
Pevzner PA, Tang H, Waterman MS (2001) An Eulerian path approach to DNA fragment
assembly. Proceedings of the National Academy of Sciences of the United States of
America, 98, 9748–53.
Phillippy AM, Schatz MC, Pop M (2008) Genome assembly forensics: finding the elusive
mis-assembly. Genome biology, 9, R55.
Pingault V, Ente D, Dastot-Le Moal F et al. (2010) Review and update of mutations causing
Waardenburg syndrome. Human Mutation, 31, 391–406.
Prado-Martinez J, Sudmant PH, Kidd JM et al. (2013) Great ape genetic diversity and population
history. Nature, 499, 471–5.
Salzberg SL, Phillippy AM, Zimin A et al. (2012) GAGE: A critical evaluation of genome
assemblies and assembly algorithms. Genome research, 22, 557–67.
Salzberg SL, Yorke JA (2005) Beware of mis-assembled genomes. Bioinformatics (Oxford,
England), 21, 4320–1.
Sandelin A, Alkema W, Engström P, Wasserman WW, Lenhard B (2004) JASPAR: an open-access
database for eukaryotic transcription factor binding profiles. Nucleic acids research,
32, D91–4.
Schatz MC, Delcher AL, Salzberg SL (2010) Assembly of large genomes using
second-generation sequencing. Genome research, 20, 1165–73.
Schatz MC, Phillippy AM, Shneiderman B, Salzberg SL (2007) Hawkeye: an interactive visual
analytics tool for genome assemblies. Genome biology, 8, R34.
Schmidt D, Wilson MD, Ballester B et al. (2010) Five-vertebrate ChIP-seq reveals the
evolutionary dynamics of transcription factor binding. Science (New York, N.Y.), 328, 1036–
40.
Simpson JT, Durbin R (2012) Efficient de novo assembly of large genomes using compressed
data structures. Genome research, 22, 549–56.
Simpson JT, Wong K, Jackman SD et al. (2009) ABySS: a parallel assembler for short read
sequence data. Genome research, 19, 1117–23.
Skotte L, Korneliussen TS, Albrechtsen A (2013) Estimating Individual Admixture Proportions
from Next Generation Sequencing Data. Genetics, 195, 693–702.
Sturm RA, Duffy DL, Zhao ZZ et al. (2008) A single SNP in an evolutionary conserved region
within intron 86 of the HERC2 gene determines human blue-brown eye color. American
Journal of Human Genetics, 82, 424–31.
Sulem P (2008) Two newly identified genetic determinants of pigmentation in Europeans. Nature
genetics, 40, 835.
Sulem P, Gudbjartsson DF, Stacey SN et al. (2007) Genetic determinants of hair, eye and skin
pigmentation in Europeans. Nature genetics, 39, 1443–52.
Treangen TJ, Salzberg SL (2012) Repetitive DNA and next-generation sequencing:
computational challenges and solutions. Nature reviews. Genetics, 13, 36–46.
Tsantes C, Steiper ME (2009) Age at first reproduction explains rate variation in the strepsirrhine
molecular clock. Proceedings of the National Academy of Sciences of the United States of
America, 106, 18165–70.
Valenzuela RK, Henderson MS, Walsh MH et al. (2010) Predicting phenotype from genotype:
normal pigmentation. Journal of forensic sciences, 55, 315–22.
Visser M, Kayser M, Palstra R-J (2012) HERC2 rs12913832 modulates human pigmentation by
attenuating chromatin-loop formation between a long-range enhancer and the OCA2
promoter. Genome Research, 22, 446–55.
Volampeno MSN, Masters JC, Downs CT (2011) Life history traits, maternal behavior and infant
development of blue-eyed black lemurs (Eulemur flavifrons). American journal of
primatology, 73, 474–84.
Wingender E, Dietze P, Karas H, Knüppel R (1996) TRANSFAC: a database on transcription
factors and their DNA binding sites. Nucleic acids research, 24, 238–41.
Zerbino DR, Birney E (2008) Velvet: algorithms for de novo short read assembly using de Bruijn
graphs. Genome research, 18, 821–9.
Supporting Information Figures
[Figure S1 schematic: "Pipeline for generation and analysis of genomic data". Samples shown: blue-eyed black lemur (Eulemur flavifrons), black lemur (Eulemur macaco), and additional individuals resequenced to low coverage. Photos courtesy D. Haring.]
Glossary (from figure):
Contig: contiguous sequence of DNA obtained by merging overlapping k-mers
k-mer: sequences of k bp in length, derived from sequencing reads and used in generating a de novo assembly
Mate pair: outward-facing pairs of short (32 bp) reads sequenced from the ends of large (2 – 10 kb) circularized amplicons
N50: the size of the smallest contig/scaffold such that 50% of the genome is contained in contigs/scaffolds of that size or larger
Paired-end: inward-facing pairs of short (50 – 150 bp) reads sequenced from the ends of small (200 – 1000 bp) amplicons
Scaffold: sets of contigs merged using paired-end or mate pair data, with N's separating contiguous sequence
Pipeline steps (from figure):
1) Prepare genomic DNA libraries: 180, 500, 1000 bp insert size paired-end and 3 kb + 8 kb mate-pair; 500 bp insert size paired-end; 350 bp insert size paired-end.
2) Sequence: 9 lanes Illumina HiSeq (achieved 52x mean coverage); 1 lane Illumina HiSeq (achieved 21x mean coverage); 1 lane Illumina HiSeq, all 6 samples barcoded and pooled (achieved 4 - 15x coverage).
3) Perform quality control: filter reads with FASTX-Toolkit; trim reads with bwa (Li and Durbin 2009); error correction with Quake (Kelley et al. 2010).
4) Assemble/align: de novo assembly using SOAPdenovo (Li et al. 2010) (scaffold N50: 420 kb); align to assembly using bwa (Li and Durbin 2009) and SAMtools (Li et al. 2009); generate reference-based assembly using custom python script.
5) Identify variable sites: identify polymorphic and divergent sites using GATK (DePristo et al. 2011); generate genotype likelihoods using ANGSD and ngsTools (Fumagalli et al. 2013).
Downstream analyses: estimate regional heterozygosity and divergence; use PSMC (Li and Durbin 2011) to estimate historic Ne; estimate Ne and species split time, and identify signatures of selection.
Figure S1. Overview of assembly and analysis pipeline. A schematic highlighting the steps in
the assembly and variant detection pipelines for the two individuals sequenced to high coverage,
as well as the additional samples sequenced to low coverage for further population genetic
analyses. In the upper left is a brief glossary of key assembly-related terms.
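As an illustration of the N50 definition in the glossary, the following sketch computes N50 from a list of contig or scaffold lengths. The function name and the toy lengths are hypothetical. By default the 50% threshold is taken relative to the summed lengths supplied; passing an estimated genome size instead (often called NG50) is offered here as an assumption about how the glossary wording could be applied.

```python
def n50(lengths, genome_size=None):
    """Length L such that sequences of length >= L contain at least half of the total.

    lengths: contig or scaffold lengths.
    genome_size: optional total to use instead of sum(lengths).
    Returns None if the sequences never reach half of the total.
    """
    total = genome_size if genome_size is not None else sum(lengths)
    running = 0
    # Accumulate from the longest sequence down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return None


# Hypothetical scaffold lengths (arbitrary units); total = 290, half = 145.
print(n50([80, 70, 50, 40, 30, 20]))  # 70
```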
Figure S2. Peak memory consumption and time to completion for genome assembly steps.
Jellyfish: q-mer counting; Quake: error correction; Graph: de Bruijn graph construction; Contig:
merging sequences into contigs; Map: mapping reads to contigs; Scaffold: using PE information
to construct scaffolds; Gap: using local assemblies to close scaffold gaps.
Figure S3. Simulations to assess impact of window size, scaffolds, generation time,
mutation rate, adjustment for coverage, and population split on output of PSMC. A)
Different window sizes produce similar PSMC output. B) PSMC output for scaffolds is comparable
to that for full data. The trajectory was simulated using the parameters in SI Text S13.
Figure S3, continued. C) True and inferred trajectories including a population split. Red: PSMC
estimate, Purple: True (simulated) trajectories; Green: Bootstrap estimates. D) Histogram of the
bootstrapped PSMC estimates of the population split time, using the same scaling parameters as
in Figure 4. The red dashed line denotes the simulated split time. E) Influence of mutation rate
estimates on inferred demographic history. F) Influence of generation time estimates on inferred
demographic history. G) Adjusted mutation rate provides an adequate correction for the reduced
ability to call SNPs in lower coverage data.
[Figure S4 panels: A) percent of genome vs. mapped read coverage; B) proportion of all q-mers vs. q-mer coverage; C) proportion of all k-mers vs. k-mer coverage.]
Figure S4. Coverage distributions for mapped reads, q-mers, and k-mers. A) Distribution of
mapped read coverage for blue-eyed black lemur assembly. B) Distribution of coverage weighted
by quality score for 17-mers, which was used to choose the coverage cutoff of 3 (red line) for
Quake error correction. The mean of the distribution excluding all 17-mers with coverage ≤ 3
(assumed to represent true k-mers) is 23.35, and the variance is 60.72. This shows that the k-mer
coverage distribution before error correction does not correspond to the expectation of a single
Poisson, likely due in part to sequencing errors. C) Distribution of frequency of 17-mers in the
unfiltered dataset. Peak k-mer coverage is at 48X. The first valley in k-mer frequency is at 14,
shown by the red line. We used this coverage cutoff to correct the reads using SOAP’s error
correction. For A) and B), the plotted region represents 99.5% of the total. For C), the line at 250
represents k-mers from repetitive regions that are present more than 250 times.
Figure S5. Principal components analysis (PCA) and estimation of admixture proportions
indicate the absence of admixture in the combined sample. A) The samples’ projections
along the first and second principal component are displayed, with blue-eyed black lemurs in blue
and black lemurs in orange. The majority of variation separates the two species. B) Shown are
inferred proportions of ancestry from each of two source populations (arbitrarily colored in purple
and green), with Ef1 – 4 representing blue-eyed black lemur samples and Em1 – 4 black lemur
samples. None of the samples appears to have admixed ancestry.
[Figure S6 panels: histograms of frequency vs. insert size for A) bad 500bp library; B) bad 1Kb library; C) good 1Kb library.]
Figure S6. Bimodal distributions of estimated insert sizes indicate the presence of artifacts
in some library preparations. Mean insert sizes were estimated by fitting a mixture model with
two normal distributions using mixtools (Benaglia et al. 2009) in R. For the 500 bp library (A),
the two means were 205 and 480 bp; for the first 1 kb library (B), they were 371 and 958 bp.
The second 1 kb library (C) did not show evidence of an artifactual inclusion of smaller
insert sizes.
[Figure S7 panels A – H: histograms of frequency vs. PS1 or FST for the blue-eyed black lemur and the black lemur, in the two-sample and full datasets.]
Figure S7. Distributions of summary statistics from scans for selection in two-sample and
full datasets. Histograms of Ps1 and FST for each species, summarized across 20 kb windows. A)
Ps1 for the blue-eyed black lemur from the two-sample dataset. B) Ps1 for the black lemur from the
two-sample dataset. C) FST for the blue-eyed black lemur from the two-sample dataset.
Figure S7, continued. D) FST for the black lemur from the two-sample dataset. E) Ps1 for the
blue-eyed black lemur from the full dataset. F) Ps1 for the black lemur from the full dataset. G) FST
for the blue-eyed black lemur from the full dataset. H) FST for the black lemur from the full dataset.
Supporting Information Tables
Table S1. Read counts for each blue-eyed black lemur library
Library            # of     # of raw reads       # of Quake-corrected     # of SOAP-corrected
(fragment size)    lanes    (read length)        reads (read length)      reads (read length)
180 bp             1        157441980 (100)      155080642 (80.76)        130918792 (99)
500 bp             6        1211058780 (100)     961903196 (79.13)        851845445 (84)
1 kb               3        574221050 (100)      564925769 (86.74)        514601130 (91.5)
3 kb / 8 kb        1        318329630 (35)       194896994 (35)           194896994 (35)
Total              11       2261051440           1876806601               1692262361
Average read lengths are shown in parentheses after the read counts. Although SOAP discards
more reads as errors, it produces a longer corrected read length and achieves comparable
coverage to that from Quake.
Table S2. Statistics for QCA and SCA
S2a Assembly statistics prior to resolution of bimodal insert size libraries
Statistic                             QCA        SCA
Estimated genome size (Gb)            2.6813     2.6813
Contig N50 (kb)                       13.0       13.9
# of contigs                          398910     385282
% chaff bases                         0.88       1.38
Largest contig size (kb)              178.1      156.5
Scaffold N50 (kb)                     328.6      197.0
# of scaffolds                        28413      34879
% singletons                          2.1        2.7
Largest scaffold size (bp)            2423633    1708178
% bases assembled from scaffolds      80.46      79.8
S2b Assembly statistics following resolution of bimodal insert size libraries
Statistic                             QCA        SCA
Estimated genome size (Gb)            2.6813     2.6813
Contig N50 (kb)                       16.3       17.9
# of contigs                          280211     263463
% chaff bases                         0.39       0.86
Largest contig size (kb)              222.3      195.1
Scaffold N50 (kb)                     421        323.1
# of scaffolds                        21210      22730
% singletons                          0.6        1.1
Largest scaffold size (bp)            3296881    2934864
% bases assembled from scaffolds      79.8       79.6
Table S3. Primers
Name             Sequence                   Position on      SeqName      Notes
                                            scaffold 2503
EmEf_12197_F     GCCATCCTGTTTTCATTTCG       12417            EO-1F
EmEf_12197_R     GTGTCTGAAAGCCCATCTCC       13653            EO-1R
EmEf_35470_F     CAGAGTCTTTCCCGACCTTG       35725            EO-2F
EmEf_35470_R     GGAAGGGTGAGAAAGGTGGT       36928            EO-2R
EmEf_67146_F     TGCACTTGAGTGAAGGACCTAA     67308            EO-3F
EmEf_3int_R      ATCCCATCATTCTGCCTCTG       67937            EO-3intR     internal primer for amplicon 3
EmEf_67146_R     AAAGGTCTTCTGCGACTTGC       68551            EO-3R
EmEf_142843_F    TCTCCTTTCCCTGCTCTGTG       143079           EO-5F
EmEf_142843_R    GGACCAGTGAAGGCAAGATT       144299           EO-5R
EmEf_192282_F    GGGTTTTGGTTGTAGGTCCA       192539           EO-7F
EmEf_192282_R    CCAAAGGAGAAAAGCAGGAG       193761           EO-7R
EmEf_211317_F    ATTTAAGTGGGTGCCAATGC       211500           EO-8F
EmEf_211317_R    CCACTGAATAGGAGAAAACATCC    212730           EO-8R
EmEf_263789_F    TAGTGGGAATGGGAAGCAAA       263979           EO-9F
EmEf_263789_R    TGTTTCCAAACTGCGGTCTA       265211           EO-9R
EmEf_335375_F    AGGGGCAACCTAATGCTCTC       335554           EO-12F
EmEf_12int_R     CCTTTTACAGTGGCCTGTAGC      336282           EO-12intR    internal primer for amplicon 12
EmEf_335375_R    GATGTGGGGGCAGAGTGTAG       336791           EO-12R
EmEf_357225_F    ATTTTCCTGAGCCCTTCTGG       357419           EO-12.5F
EmEf_357225_R    CCTCACGGCAGATTCTTAGC       358647           EO-12.5R
EmEf_371731_F    CAGGCCATTTCTTTCCCTTT       371917           EO-13F
EmEf_13int_F     CTGCAGGAAAACTCGTGGAT       372462           EO-13intF    internal primer for amplicon 13
EmEf_371731_R    TAGACATGTCCCAGCTCCTG       373168           EO-13R
EmEf_405684_F    GATTGCTGGCCAGAGTTTTT       405885           EO-14bF
EmEf_405684_R    ATCAAAGCTAGCACCCCAAA       407113           EO-14bR
EmEf_447752_F    TCCACGGTCTATTTGTTTGG       447978           EO-15bF
EmEf_447752_R    CCTTCAAGCGTGACAATTCC       449223           EO-15bR
EmEf_450326_F    CGTTGGCACATCTCCACTTA       450549           EO-16bF
EmEf_450326_R    AGCTGAGCAATCCCTGATGT       451782           EO-16bR
EmEf_492021_F    CATTTTTCATCTCCGCCAGT       492262           EO-17bF
EmEf_492021_R    ACCCTGAGTTAAGCAAAGATTG     493491           EO-17bR
EmEf_509600_F    GGAAAGAGCTGGAGGAACAA       509841           EO-17F
EmEf_17int_F     GCAGCATGCACTGTCTTGAT       510724           EO-17intF    internal primer for amplicon 17
EmEf_509600_R    AGTGGCTGAAAGCAGAGTCC       511059           EO-17R
EmEf_513958_F    AGGGGTTCTTGAGCTCTGTG       514179           EO-18F
EmEf_18int_R     TGCTTAATGGCCTTCAGAGG       514908           EO-18intR    internal primer for amplicon 18
EmEf_513958_R    GGGGGTTATGGCTTCAACTT       515411           EO-18R
EmEf_591196_F    GACAGTACTGGGGGCTCAAA       591291           EO-20F
EmEf_20int_F     GTATCCGGGAGCAGTTCTCA       591829           EO-20intF    internal primer for amplicon 20
EmEf_591196_R    TGGAGGTCATGCCTCTTTTC       592616           EO-20R