

Recent natural selection in human noncoding sequences

Heather A. Lawson, Joel Martin, David C. King, Belinda Giardine, Webb Miller, Hiroshi Akashi, and Ross C. Hardison

Center for Comparative Genomics and Bioinformatics, Huck Institutes of Life Sciences, and Departments of Anthropology, Biology, Computer Science and Engineering, Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA 16802

Running Title: Recent selection in noncoding sequences

Contact information for corresponding author:

Ross Hardison

Email: rch8@psu.edu

Phone: 814-863-0113

FAX: 814-863-7024


Abstract

It has long been speculated that adaptive changes in noncoding gene regulatory sequence, rather than in coding sequence, underlie many of the phenotypic differences between humans and other primates. Therefore, identification of noncoding regions bearing the signature of recent positive or negative selection is an important step toward understanding the evolutionary history of our species. We examine the ability of current polymorphism data (dbSNP126 filtered for quality) and divergence from the chimpanzee and rhesus genomes to provide useful information in a variation of the McDonald-Kreitman test (the MKAR test). To evaluate deviation from neutral expectation, this test compares the ratio of polymorphic sites to divergent sites in noncoding sequences with the same ratio in a widely distributed model for neutral DNA (repeat sequences ancestral to catarrhines).

The test can be conducted on almost half the human genome, and at a false discovery rate of 10%, only 0.26% of the genome shows significant deviation from neutrality. The windows implicated as non-neutral by the MKAR test are supported by other lines of evidence for selection: they show significant overlap with genes previously implicated as targets of recent selection, and the non-neutral windows suggestive of positive selection are enriched in genomic duplications and in intervals showing evidence of positive selection by Tajima’s D. The intervals deviating from neutrality overlap little when divergence is determined with chimpanzee versus rhesus, revealing distinctive signatures on different evolutionary time scales that may reflect the distinct adaptive histories of these lineages. Some of the non-neutral noncoding segments contain DNase hypersensitive sites; these are strong candidates for further experimental investigation of the role of gene regulation in human evolution. Results of this study are a useful addition to comparative genomics resources aimed at finding functional noncoding DNA sequences.


Introduction

Only a small fraction of mammalian genomes, roughly 5% ( 1-4 ), is thought to carry out functions conserved in all mammals. This fraction should be subject to negative (purifying) selection and thus can be detected as sequences showing significantly less change relative to neutrally evolving DNA over the span of mammalian evolution. Other functional sequences show significantly more change relative to neutral expectations along certain mammalian lineages and hence appear to be under positive (adaptive) selection. Such a phylogenetic approach has been used to identify protein-coding genes showing evidence of selection on the human lineage ( 5-9 ). Most standard tests for lineage-specific selection have been applied to protein-coding regions, e.g. the McDonald-Kreitman test ( 10 ), relative rate tests ( 11 ), and a variety of tests based on differences between synonymous and nonsynonymous sites, e.g. D_N/D_S ( 12 ).

However, functional sequences under selection include both coding and non-coding sequences, such as cis -regulatory modules, and identifying signatures of selection in non-coding DNA is an important line of inquiry into mammalian evolution in general, and into human evolution in particular. For example, mutations in a regulatory region near the lactase ( LCT ) gene are associated with selection for adult lactase persistence ( 13 ), and a mutation in the promoter region of the Duffy ( FY ) gene has been associated with selection for P. vivax malarial resistance in African populations ( 14 ). Other recent dramatic examples of presumptive positive selection in non-coding regions are the highly accelerated regions of the human genome ( 15, 16 ), and the conserved noncoding sequences found to be undergoing accelerated evolution by examining human-specific substitutions in interspecies comparisons ( 17 ). Human chromosomal intervals suggestive of positive selection have been identified by skews in the frequency spectrum of polymorphisms (e.g. Tajima’s D in the study of Carlson et al. 2006) and extended regions of homozygosity (Akey ….). These latter studies utilize polymorphism frequency data in both coding and noncoding DNA sequences, and the results tend to be interpreted as recent selection in coding regions.


Another approach to finding DNA segments under recent selection compares the number of polymorphic sites in one species (reflecting the mutation rate) with the number of sites divergent between related species (reflecting the fixation rate for mutations). Neutral sites are expected to accumulate intraspecific changes (polymorphisms) and interspecies changes (divergences) at proportional rates, e.g. a neutral region that mutates frequently in a population will also show a high number of divergent sites when compared with a related species. McDonald and Kreitman introduced an explicit test of deviation from neutral expectation in protein-coding DNA sequences ( 10 ) by comparing counts of polymorphisms and interspecies divergent sites at nonsynonymous (test) positions within coding regions with counts of polymorphisms and divergences at synonymous sites, which are assumed to be neutral. Most test sites should show a ratio of polymorphism to divergence counts indistinguishable from that at neutral sites ( 18 ); these will have a ratio of polymorphism to divergence relative to that at neutral sites (r_pd) of around 1 [Akashi, 1995 #67]. Test sites under positive selection will diverge more between species than the neutral sites, giving an r_pd less than 1. Conversely, test sites under negative selection will diverge less between species than the neutral sites, producing an r_pd larger than 1. Population-level processes such as nonrandom mating, population growth, and population bottlenecks can also affect the r_pd ( 19 ) but, in general, selection will differentially affect specific sequences whereas population-level processes will affect all sequences in a genome ( 20 ).
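The relative ratio described above can be illustrated with a short sketch. It assumes four counts per region (polymorphic and divergent sites in the test class and in the neutral class); the function names and the example counts are hypothetical and are not taken from the study. The direction label is only suggestive and would be meaningful only for regions that also deviate significantly from neutrality.

```python
def relative_pd_ratio(test_poly, test_div, neutral_poly, neutral_div):
    """Return r_pd = (P_test / D_test) / (P_neutral / D_neutral)."""
    if min(test_div, neutral_poly, neutral_div) <= 0:
        # The ratio is undefined when any denominator is zero.
        raise ValueError("divergence counts and neutral polymorphism count must be positive")
    return (test_poly / test_div) / (neutral_poly / neutral_div)


def suggested_direction(r_pd):
    """Map r_pd to the direction of selection it would suggest."""
    if r_pd > 1:
        return "negative"   # relative deficit of divergence in test sites
    if r_pd < 1:
        return "positive"   # relative excess of divergence in test sites
    return "neutral"


# Hypothetical counts: test sites have 10 polymorphic and 40 divergent sites,
# neutral sites have 20 polymorphic and 40 divergent sites.
example = relative_pd_ratio(10, 40, 20, 40)
print(example, suggested_direction(example))
```

With these made-up counts the test class has half the polymorphism-to-divergence ratio of the neutral class, which is the pattern the text associates with positive selection.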

In this report, we explore the ability of current genome-wide data to formulate a useful McDonald-Kreitman test for recent selection in noncoding DNA sequences in humans. The human genome was partitioned into windows large enough to get sufficiently high counts for polymorphism and divergence; we found that 10kb windows gave a good balance between resolution and coverage of the genome (about half is covered). Repetitive DNA ancestral to the species compared was used as the model for neutral sites, because these sites occur frequently and are interspersed with nonrepetitive DNA on a 10kb scale. Divergences of human from two primates, the closely related chimpanzee and the more distant rhesus macaque, were both used to examine effects of selection over two different time scales. The largest set of human polymorphism data (dbSNP126) was used because it was less biased than the HapMap polymorphic sites; more comprehensive and less biased polymorphism data can be included as they become available.

Our results indicate that the current application of the test does give insights into evidence suggestive of recent selection on non-coding DNA sequences in the half of the human genome examined. The segments of DNA that show statistically significant deviation from neutral expectation by this test also appear to be biologically meaningful, because they overlap with several loci whose protein products appear to be under selection in humans. In contrast to the observations for recent selection in coding sequences, the non-neutral segments of non-coding DNA show an r_pd suggestive of negative selection more frequently than positive selection. The non-neutral DNA segments that overlap with DNase hypersensitive sites in chromatin are particularly strong candidates for gene regulatory regions under recent selection. The MKAR test is a useful and timely addition to comparative whole-genome resources aimed at finding functional non-coding DNA sequences.

Results

McDonald-Kreitman test in noncoding sequences

We implemented a variation of the McDonald-Kreitman test (MKAR test) to search for evidence of non-neutrality in equally-sized segments of DNA across the human genome. While developing this, we investigated many issues that affect the outcome of the test. First, we decided to exclude certain sites in the human genome. Coding portions of exons were masked so that only non-coding DNA sequences contribute to the test, and the hypermutable CpG dinucleotides were not included. The next issue was the choice of the neutral model. Previous studies [Waterston, 2002 #54; Hardison, 2003 #15] had shown that sites in interspersed repeats ancestral to mouse and human (ancestral repeats, or ARs) behave similarly to other sites that are good models for neutral DNA, but they are far more frequent and are highly interspersed with nonrepetitive noncoding DNA, which contains the sites we want to investigate for evidence of recent selection. We evaluated the effectiveness of repeats deduced to be ancestral to euarchontoglires (e.g. rodents and primates), simians and catarrhines (Old World Monkeys, apes and humans) as the neutral model, and chose those ancestral to catarrhines (Fig. 1) because their higher abundance led to greater coverage of the human genome. Other models for neutral sites that are highly abundant, such as intergenic or intronic DNA [The_Chimpanzee_Sequencing_and_Analysis_Consortium, 2005 #56] could not be used because they cover the potential targets for recent selection.

The closest relative to humans with a sequenced genome is chimpanzee, and divergences between human and chimp were computed in one version of the MKAR test. This gives insight into selection over the past 6 million years, but the two species do share some polymorphic sites [The_Chimpanzee_Sequencing_and_Analysis_Consortium, 2005 #56]. Thus divergence from the Old World Monkey rhesus macaque [Gibbs, 2007 #76] was also computed, and the resulting version of the MKAR test gives information on possible selection over the past 23 million years since the divergence between rhesus and human.

The choice of polymorphism data was largely determined by the size of available datasets. Build 126 of dbSNP has 5,608,719 validated entries that map to a single location in the human genome. This is the largest collection of SNPs available and it was chosen for the current implementations of the MKAR test. We considered restricting the polymorphisms to those genotyped in the HapMap project [The_International_HapMap_Consortium, 2005 #33], but these were under-represented in the more recent repeats (Fig. 2), as might be expected from the requirement that they produce unambiguous genotyping signals. The entries in dbSNP are biased toward the more common SNPs [Clark, 2005 #34]. However, this bias affects both the test and AR classes of sites, and the test depends not on the frequency of polymorphisms but on counts of polymorphic and divergent sites. Thus we expect that the bias toward common SNPs does not preclude an effective MKAR test. Nevertheless, it will be important to repeat the test on more comprehensive SNP datasets as they become available.

With these selections of the target regions in the human genome, the neutral model, the comparison species and the source of polymorphism data, we could then evaluate the size of windows for genome partitioning. The windows must be large enough for the ARs to be intercalated with the test sites and to have sufficient counts of polymorphic and divergent sites, yet small enough to give an informative resolution across the human genome. After examining window sizes from 1kb to 50kb, we found that almost half the 10kb non-overlapping windows had the desirable properties, whereas other sizes compromised coverage or resolution to an unacceptable extent. An assumption of the McDonald-Kreitman test is that the classes of sites compared are intercalated and hence have the same genealogical history. The catarrhine ARs are widely dispersed; all but 0.24% of the 10kb windows have ARs, and the windows have an average of 14 individual repeats (containing an average of 4371 AR nucleotide sites) per window. These individual repeats are interspersed with nonrepetitive DNA, with an average distance between them of 444 bp. This distance is much smaller than the 10kb window, thus supporting an appropriate level of interspersion of neutral and test sites for the MKAR test. We restricted the MKAR test to windows with a minimum expected count of five for each category. Of the total of 281,871 windows for the human genome assembly (hg18), 139,701 had sufficient counts for the MKAR test with divergence from chimpanzee. Thus using 10kb windows allowed the test to be run in 46% of the human genome.
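The minimum-expected-count filter mentioned above can be sketched as follows. This is an illustration under assumptions, not the authors' script: it treats each window as a 2x2 table (TEST and AR rows, polymorphism and divergence columns), computes expected cell counts from the margins in the usual way, and assumes a window passes when the smallest expected count is at least 5. The function names and example counts are hypothetical.

```python
def min_expected_count(test_poly, test_div, ar_poly, ar_div):
    """Smallest expected cell count of the 2x2 table under independence."""
    total = test_poly + test_div + ar_poly + ar_div
    if total == 0:
        return 0.0
    row_totals = (test_poly + test_div, ar_poly + ar_div)
    col_totals = (test_poly + ar_poly, test_div + ar_div)
    # Expected count for a cell is row_total * col_total / total, so the
    # smallest expected count pairs the smallest row with the smallest column.
    return min(row_totals) * min(col_totals) / total


def has_sufficient_counts(test_poly, test_div, ar_poly, ar_div, threshold=5.0):
    """Assumed rule: keep the window if every expected count is >= threshold."""
    return min_expected_count(test_poly, test_div, ar_poly, ar_div) >= threshold


# Hypothetical windows: the first has ample counts, the second is too sparse.
print(has_sufficient_counts(10, 40, 20, 40))
print(has_sufficient_counts(2, 3, 20, 40))
```

Filtering on expected rather than observed counts keeps windows in which one observed cell is small because of a real deviation, while still excluding windows whose overall counts are too low to test.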

For each window, the numbers of polymorphic and divergent sites were counted in the ARs and in the rest of the unmasked portion of the window. The ratio of polymorphism to divergence in the non-AR sites divided by that ratio for AR sites gives the relative ratio, called r_pd, for that window. The significance of the deviation from neutral expectation was evaluated by Fisher’s exact test followed by FDR correction for multiple testing [Storey, 2002 #25]. For this report, the 737 windows passing an FDR threshold of 0.10 are considered to deviate significantly from neutrality (based on divergence from chimpanzee), and they will be called non-neutral. It is likely that other windows not passing this FDR threshold may have informative signals suggestive of selection, and thus the full data on r_pd, p-values and FDR values for each window tested by MKAR are made available as a custom track on the UCSC Genome Browser ( 49 ) at http://genometest.cse.ucsc.edu/ (hg18 March 2006 assembly, Track Group = “Experimental Tracks”, track name = “McDonald-Kreitman AR 4”). {We will request that these tracks be included in the UCSC Genome Browser once this report is accepted for publication.}

The data for all 139,701 windows with sufficient counts for the test can be downloaded from http://www.bx.psu.edu/~ross/dataset/MKARchimpEnufCtsPfdr.txt.
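The per-window statistics described above can be illustrated with a small standard-library sketch. It computes a two-sided Fisher's exact p-value for a 2x2 table and then adjusts a list of p-values for multiple testing. One substitution should be noted plainly: the adjustment shown is the Benjamini-Hochberg procedure, used here only as a simple stand-in, whereas the text cites the q-value approach of Storey. Function names and example numbers are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum over all tables at least as unlikely as the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, p)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (stand-in for Storey q-values)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):          # walk from largest to smallest p
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Rows are TEST and AR; columns are polymorphic and divergent counts (made up).
p = fisher_exact_two_sided(10, 40, 20, 40)
print(p, benjamini_hochberg([0.01, 0.04, 0.03, 0.5]))
```

In a genome-wide run, windows whose adjusted value falls at or below 0.10 would correspond to the "non-neutral" set, subject to the difference between the two FDR procedures.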

Features of windows with non-neutral noncoding DNA

The results of the MKAR test for many windows differ when divergence is computed relative to chimpanzee or rhesus (Table 1). A substantially larger number of windows pass the FDR threshold when divergence is determined from chimpanzee, and only half the non-neutral windows observed with rhesus divergence are in common with those observed with chimp divergence. Thus, different regions of the genome show unique signatures of selection over the two evolutionary timeframes of approximately 6 million (chimp) and 23 million years ago (rhesus). We will focus on the windows that show significant results with comparison to chimpanzee for most of our examples, but it is important to keep in mind that other windows are significant when the comparison is with rhesus.

For the non-neutral windows, the direction of selection can be inferred from the value of r_pd, with values greater than 1 suggestive of negative selection and values less than 1 suggestive of positive selection. Most of the non-neutral windows show an r_pd suggestive of negative selection (Fig. 3); this is the case whether divergence from chimpanzee or from rhesus is considered (Fig. 4). The dominance of negative selection was also observed in a genome-wide survey of selection in protein-coding regions based on a comparison of rates of substitution between human and chimpanzee at synonymous and nonsynonymous sites [Bustamante, 2005 #41]. In both our MKAR tests and the Bustamante et al. [, 2005 #41] study, the divergence from a related species contributes to the signal being evaluated, and negative selection has had sufficient time to impact the signal. In contrast, tests for selection based on allele frequencies or other data only in humans may be influenced more by the disadvantageous alleles still in the human population and thus provide a stronger signal for positive selection [Nielsen, 2007 #98]. {Hiroshi - am I misinterpreting or mis-stating this? - rh} The limited overlap in windows detected as non-neutral when divergence is considered from chimpanzee versus rhesus is found regardless of the suggested direction of selection (Fig. 4).

The non-neutral windows detected by this test tend to cluster in certain regions of the human genome, including many with segmental duplications (Fig. 3). For example, the pericentromeric regions of chromosomes 1, 2, 4, 9, 10, 15 and 16 are enriched in both non-neutral windows and segmental duplications. These two features overlap in many other chromosomal regions, such as chromosomal band 11p15.4, which contains a large family of olfactory receptor genes that show a clear signal for selection {ref}. Genes with duplicated copies and regions with segmental duplications tend to be targets of recent positive selection {refs}, and thus one may expect that the windows deviating from neutrality by the MKAR test would be enriched for signals suggestive of positive selection. Indeed, this was observed (Table __). Almost 30% of the windows suggestive of positive selection overlap with segmental duplications, compared with only 6% of all the windows tested by MKAR and 15% of the windows suggestive of recent negative selection. The extensive overlap between windows deviating from neutrality and regions with segmental duplication, along with the concordance in direction of selection, lends support for the validity of the MKAR test. The large number of windows deviating significantly from neutral expectation by the MKAR test in these bands is consistent with this interpretation.

Several of the windows that are suggestive of positive selection by the MKAR test are congruent with an independent assay for evidence of recent positive selection in the human genome. Carlson et al. {ref} used HapMap (?) data to find chromosomal intervals with a site frequency spectrum skewed toward rare alleles, as measured by Tajima’s D {ref}. Contiguous regions with negative values for Tajima’s D are consistent with the expectation for positive selection. In the genome-wide study of Carlson et al., about 10% of the windows have negative values for Tajima’s D. Of the windows suggestive of positive selection by the MKAR test, 14% overlap with intervals suggestive of positive selection by Tajima’s D, whereas about 10% of all non-neutral windows by MKAR overlap with these intervals with negative values of Tajima’s D.

As a group, the non-neutral 10kb windows have a similar amount of overlap with functional regions as do windows for the entire genome (Table 2). The sets of functional regions compiled throughout the human genome include two groups of experimentally determined functional regions, i.e. DNase hypersensitive sites observed in CD4+ cells (DHS) [Boyle, 2008 #99], which are markers for most gene-regulatory regions (in that cell line), and exons of protein-coding genes [Hsu, 2006 #74], plus a set of DNA segments predicted to be involved in gene regulation based on two independent analyses of multi-species alignments (PRPs) [Miller, 2007 #2231]. A similar amount of overlap with functional features is seen for all the windows with sufficient counts for the MKAR tests and the ones that are non-neutral, regardless of whether the overlap is measured as the fraction of windows that overlap the functional segments or as the portion of DNA in the intervals composed of functional segments (Table 2). Interestingly, the non-neutral windows suggestive of recent positive selection in humans show a significant reduction in overlap with the two functional categories that are changing more slowly over the span of mammalian evolution [Miller, 2007 #2231]. This lends support to the evolutionary inferences based on the MKAR test, and suggests that some positively selected noncoding DNA elements are separated from negatively selected ones by at least the window size used here (10kb). In addition, this analysis shows that the half of the genome lacking sufficient counts of polymorphism and/or divergence for the current version of the MKAR test is slightly enriched in functional regions (Table 2).

Comparison with genes implicated as targets of recent selection


A good test of the effectiveness of this implementation of the MKAR test is to examine overlap with protein-coding genes previously implicated as being under recent positive or negative selection. While there is little concurrence among results of recent studies identifying genes affected by positive selection on the human lineage ( 35 ), these differences result at least in part from different types of evolutionary signals and distinct time-frames examined [Nielsen, 2007 #88]. Thus we looked at all the genes implicated in recent selection, and tested for overlap and proximity to the windows with noncoding DNA deviating significantly from neutrality in the MKAR test. An important question is whether or not a similar signal of selection is found in the noncoding regions of genes deviating from neutrality as is seen in their coding regions.

Of the 737 non-neutral windows (using chimp divergence), 11 overlap with genes previously implicated as targets of recent selection, whereas random expectation is that 6 will overlap. Conversely, of the 420 genes in our compilation of genes under recent selection, 12 overlap with windows with non-neutral noncoding DNA, compared to 6 expected for randomly chosen windows. In contrast, extension of the windows to 100kb gives the same level of overlap as expected for random intervals. This shows that noncoding sequences within about 10 kb of coding sequences under selection also tend to show evidence of recent selection (p-value from a chi-squared test = ______), but this effect does not extend to distances of 100kb. This congruence with the results of multiple analyses of selection in coding genes lends support to the utility of the results of analysis of noncoding DNA by MKAR.

Within these non-neutral windows overlapping genes previously identified as targets of recent selection in the human lineage, several cases were found in which the direction of selection appears to be the same in both protein-coding and noncoding DNA. For simplicity of presentation, these examples are limited to those observed by the MKAR test using divergence from chimpanzee, and additional examples with divergence from rhesus are in ______ Supplementary Table ___. Three genes previously identified as being positively selected on the human lineage overlap non-neutral windows also suggestive of positive selection in noncoding DNA. These are PCDH15 ( 37 ), linked to inherited forms of deafness, ABCC1 ( 38 ), an anion transporter related to immune function in cells, and SLC30A9 ( 37 ), which is expressed in human embryonic lung. ____ genes showing evidence of negative selection on the human lineage overlapped windows also suggestive of negative selection on noncoding DNA. These include IQGAP2 , HLA-DPA1 , and CNTN5 . Although considerably more experiments are required to test this hypothesis, these examples are consistent with the expectation for loci in which both the protein-coding exons and the proximal DNA sequences involved in regulation are undergoing similar adaptive evolution (coding and noncoding sequences with signals suggestive of positive selection) or purifying selection (both classes showing signals for negative selection).

Several examples of different directions of selection in coding versus noncoding DNA sequences were also observed. DOCK1 , a signaling factor required for neurite outgrowth and showing evidence of negative selection ( 6 ), overlaps an MKAR window suggestive of positive selection. Two genes with evidence of positive selection overlapped non-neutral windows suggestive of negative selection. These are TLL2 , associated with bone morphogenesis ( 5, 9 ), and MCPH1 , mutations in which are associated with primary microcephaly ( 39, 40 ). It is important to note, however, that other work finds evidence of negative selection in MCPH1 ( 6 ), and quite a bit of controversy surrounds claims of human-specific adaptive evolution in this gene ( 35 ).

Despite the ambiguities for MCPH1 , it is certainly possible for the direction of selection on coding sequences to differ from that on noncoding sequences. For example, changes in a protein structure may be adaptive in one lineage, while the tissues and developmental stages at which the protein is produced remain constant across different lineages.

Candidates for cis -regulatory modules under selection in the human lineage

The MKAR test is executed across the human genome, and thus it will find non-neutral windows regardless of their position with respect to genes. Although many cis -regulatory modules (CRMs) are within 5 kb of transcription start sites for genes [Birney, 2007 #100], others can be hundreds of kb or more from the promoters they regulate, particularly for genes under complex regulatory control such as those underlying embryonic development ( 43 ). We have detected strong signals suggestive of positive and negative selection in noncoding regions both proximal to genes and in gene-sparse regions that may house CRMs. These regions are good candidates for further investigation of potential regulatory elements, and this should be most productive when combined with biochemical data indicative of gene regulatory regions.

The DNaseI hypersensitive sites (DHSs) in chromatin of CD4+ T-cells [Boyle, 2008 #99] comprise a good dataset for a feature frequently associated with cis -regulatory modules, i.e. a discrete region with altered chromatin structure such that the transcription machinery (and nucleases added experimentally) have access to the DNA. DHSs have been identified as the locations of regulatory elements such as promoters, silencers, enhancers and locus-control regions ( 47, 48 ). Tables 4 and 5 list MKAR windows both significantly deviating from neutrality and containing DNaseI hypersensitive sites. Fig. 6 illustrates a non-neutral window with noncoding DNA significantly suggestive of positive selection. This window also contains a DHS along with many segments of high regulatory potential [Taylor, 2006 #27]. The DNaseI hypersensitive site in this window is 183 bp long and located 112 bp upstream of the gene MTRF1L , which encodes a protein related to a mitochondrial translational release factor. The human-chimp pairwise alignment for this site has 3 diverged bases and the human-rhesus alignment has 17 diverged bases ( FIG 7 ), but no SNPs are in this DHS in the dataset used for our tests. The excess of divergence relative to polymorphism in this site is a signature of positive selection. It is reasonable to speculate that this site has undergone adaptive evolution since the human lineage split from that of the Old World monkeys, although more experimentation is necessary to understand the precise function of this DNaseI hypersensitive site. The MTRF1L gene is similar to the MTRF1 gene, presumably arising by gene duplication. Duplicated genes are more likely to undergo adaptive changes, and this may be an example of adaptive change in its pattern of expression. {Do the GNF tracks suggest any differences in the patterns of expression?}


Discussion

{This is just a holding area for various ideas now.}

Results of the MKAR test can be used to find noncoding regions throughout the human genome that are candidate targets of natural selection over the past 23 Myr (comparisons with rhesus) or 6 Myr (comparisons with chimp) of primate evolution. These are potentially functional noncoding sequences, some of which may be CRMs. By leveraging polymorphism counts and recent divergence, the MKAR test explores a phylogenetic span, and likely some functional regions, different from those investigated by studies of conservation among non-primate mammalian species or conservation from mammals to other vertebrates, such as birds and fish. Thus the results of this test are a useful addition to comparative genomics resources aimed at finding functional DNA sequences. These results also point to candidate noncoding regions that may have played a role in the evolution of the human lineage. Follow-up of these candidates promises to be a fruitful area of further investigation.

Similar modifications of the McDonald-Kreitman test have been used successfully in Drosophila to evaluate evidence of selection in non-coding DNA, even though the classes of sites compared are not truly intercalated. Those studies separated sites into transcription factor binding (test) sites and non-transcription factor binding (neutral) sites ( 29, 30 ), or into noncoding (test) and synonymous (neutral) sites ( 31 ). Our study represents the first such modification to evaluate evidence for negative and positive selection in noncoding DNA spanning the entire human genome, including regions far from genes. Furthermore, by comparing counts in the noncoding test sites to those in AR sites for each 10kb window individually, we capture local variations in neutral evolutionary rates.

Compare with Tajima’s D analysis and extended homozygosity data?


However, the MKAR test does not resolve the actual sequence under selection. Rather it shows the 10kb window housing the sequence having the signature of selection. The source of the signal, and potential target of selection, could be the putative CRM, or it could be the surrounding region. As our understanding of molecular evolution of noncoding DNA in general, and of CRMs in particular, improves, we will be able to refine our methods of searching for signatures of selection in these regions and tease out the source sequence deviating from neutrality.

Methods

Datasets examined


Divergence data for the MKAR test was generated from three-way whole-genome alignments of human-chimp-macaque computed on the March 2006 human assembly

(hg18), the March 2006 chimp assembly (panTro2), and the January 2006 macaque assembly (rheMac2) using MULTIZ ( 50 ). Alignments are available for bulk download from the UCSC Genome Browser ( http://genome.ucsc.edu/ ). Polymorphism data for the

MKAR test was obtained from two sources: 1) dbSNP build 126 ( 51 ) downloaded from the UCSC Genome Table Browser and filtered for SNPs having a known validation status that map to a single genomic location; and 2) The International HapMap Project

( 33 ) non-redundant public release #22 ( http://www.hapmap.org/index.html

). All polymorphism variants and coordinates are for the forward (+) strand. Intervals corresponding to interspersed repeats ancestral to human and macaque, defined as

LTR, SINE, LINE and DNA elements excluding AluY, L1PA1-L1PA7, and L1HS, were obtained from the UCSC Genome Table Browser using the Galaxy metaserver ( 52 )

( http://g2.trac.bx.psu.edu/ ) . These ARs were filtered to remove MIR121 family elements, shown to be evolving non-neutrally ( 24 ), and socalled ‘exapted repeats’, conserved non-exonic elements deposited by mobile elements and shown to have acquired functional regulatory roles ( 25 ). All coding exon start and stop intervals were obtained from the UCSC Genome Table Browser UCSC Genes track ( 53 ).

DNaseI hypersensitive site coordinates were downloaded (http://research.nhgri.nih.gov/DNaseHS/chimp2006) and start and stop coordinates were converted from the July 2003 human genome assembly (hg16) to the March 2006 assembly (hg18) using the UCSC genome coordinate conversion tool, liftOver. Custom Python scripts were used to extract sites overlapping significant MKAR windows (those passing the 10% FDR threshold).
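The overlap step can be illustrated with a minimal sketch. It is not the script used in the study: the function name, the half-open (chrom, start, end) tuple layout, and the representation of significant windows by their start coordinate are all assumptions made for illustration.

```python
from collections import defaultdict

WINDOW = 10_000  # window size in bp, matching the 10kb tiling


def sites_in_significant_windows(sites, significant_windows, window=WINDOW):
    """Return the sites that overlap at least one significant window.

    sites: iterable of (chrom, start, end), half-open coordinates (assumed).
    significant_windows: iterable of (chrom, window_start) for windows that
        passed the FDR threshold; windows are assumed to be non-overlapping
        and aligned to multiples of `window`.
    """
    sig = defaultdict(set)
    for chrom, w_start in significant_windows:
        sig[chrom].add(w_start // window)

    hits = []
    for chrom, start, end in sites:
        first = start // window
        last = (end - 1) // window  # index of the last window the site touches
        if any(idx in sig[chrom] for idx in range(first, last + 1)):
            hits.append((chrom, start, end))
    return hits
```

Because the windows are fixed-size and non-overlapping, integer division of a coordinate by the window size gives the window index directly, so no interval tree is needed.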

MKAR Test

CpG sites were masked out of the multiple alignments according to the ‘inclusive’ definition, where all sites are divided into CpG and non-CpG sites (54) in each pairwise comparison (human-chimp and human-macaque). CpG sites are those that are CG in at least one species. Non-CpG sites are those not CG in either species.
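A minimal sketch of this masking rule for one pairwise alignment follows. The function names are hypothetical, and the sketch judges CG adjacency on neighbouring alignment columns, which is a simplifying assumption (a gap between a C and a G is not bridged).

```python
def cpg_mask(seq_a, seq_b):
    """Flag alignment columns that are CpG under the 'inclusive' definition.

    seq_a, seq_b: equal-length aligned sequences (for example the human and
    chimp rows of a pairwise alignment). A column is flagged when it is the C
    or the G of a CG dinucleotide in at least one of the two sequences.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    mask = [False] * len(seq_a)
    for seq in (seq_a.upper(), seq_b.upper()):
        for i in range(len(seq) - 1):
            if seq[i] == "C" and seq[i + 1] == "G":
                mask[i] = mask[i + 1] = True
    return mask


def non_cpg_columns(seq_a, seq_b):
    """Indices of the columns retained after masking CpG sites."""
    return [i for i, is_cpg in enumerate(cpg_mask(seq_a, seq_b)) if not is_cpg]
```

A column is dropped if either species carries a CG there, so a site that is CG in human but TG in chimp is masked in the human-chimp comparison.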

Non-overlapping windows of size 10kb were tiled across the genome. A 2X2 contingency table was generated for each window by apportioning SNP counts and divergence counts within a window into AR and TEST categories. A SNP count is registered when a SNP is present in an aligned position. A divergence count is registered either when aligned bases differ and no SNP is present at an aligned position, or when aligned bases differ and a SNP is present, but all the known SNP variants differ from the aligned base. Thus, at any given aligned base, the MKAR test registers one of four possibilities: 1) a SNP count, or 2) a divergence count, or 3) both a SNP and a divergence count, or 4) neither a SNP nor a divergence count (FIG 8). An AR site is registered when either a SNP or a diverged base falls in an ancestral repeat site within a window. A TEST site is registered if either a SNP or a diverged base falls outside an AR site within a window. Gaps or regions that do not align in either species pair (human-chimp or human-macaque) are not considered.
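The counting rules above can be sketched as follows. This is an illustration under stated assumptions rather than the original implementation: the function names and the per-column tuple layout (human base, comparison-species base, set of known SNP alleles or None, AR flag) are hypothetical.

```python
BASES = set("ACGT")


def classify_site(human, other, snp_alleles=None):
    """Return (snp_count, divergence_count) for one aligned column."""
    human, other = human.upper(), other.upper()
    if human not in BASES or other not in BASES:
        return 0, 0  # gaps and unaligned positions are not considered
    differs = human != other
    if not snp_alleles:
        return 0, int(differs)  # no SNP: divergence only if the bases differ
    alleles = {a.upper() for a in snp_alleles}
    # With a SNP present, divergence is counted only when every known
    # variant differs from the aligned comparison-species base.
    diverged = differs and other not in alleles
    return 1, int(diverged)


def window_table(columns):
    """Apportion counts for one window into AR and TEST classes.

    columns: iterable of (human, other, snp_alleles, in_ar).
    """
    table = {"AR": {"snp": 0, "div": 0}, "TEST": {"snp": 0, "div": 0}}
    for human, other, snp_alleles, in_ar in columns:
        snp, div = classify_site(human, other, snp_alleles)
        site_class = "AR" if in_ar else "TEST"
        table[site_class]["snp"] += snp
        table[site_class]["div"] += div
    return table
```

The four outcomes listed in the text correspond to the return values (1, 0), (0, 1), (1, 1) and (0, 0) of `classify_site`.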

The minimum expected value for cells in each 2X2 contingency table is calculated based on the apportionment of SNP and divergence counts for each window and, for those windows passing a minimum expected value threshold of 5.0 for each cell, a Fisher’s Exact Test tests the significance of the table’s deviation from neutrality. The false discovery rate (FDR) procedure for multiple-testing correction is applied to the ranked distribution of Fisher’s Exact Test p-values as implemented in R, using the qvalue library and the bootstrap method of estimating the proportion of true null hypotheses (28).
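The per-window test can be sketched in plain Python as below. The table layout (rows are the AR and TEST classes, columns are SNP and divergence counts), the function names, and the use of a two-sided test are assumptions for illustration; the study's own implementation may differ.

```python
from math import comb


def min_expected(table):
    """Smallest expected cell count of a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    n = a + b + c + d
    if n == 0:
        return 0.0
    rows, cols = (a + b, c + d), (a + c, b + d)
    return min(r * k / n for r in rows for k in cols)


def fisher_exact_two_sided(table):
    """Two-sided Fisher's exact p-value for a 2x2 table of counts."""
    (a, b), (c, d) = table
    r1, r2, c1 = a + b, c + d, a + c
    denom = comb(r1 + r2, c1)

    def prob(x):  # hypergeometric probability of x in the top-left cell
        return comb(r1, x) * comb(r2, c1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    # Sum the probabilities of all tables no more likely than the observed
    # one; the small tolerance guards against floating-point ties.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-7))


def mkar_window_test(table, threshold=5.0):
    """Return the p-value, or None if any expected cell is below threshold."""
    if min_expected(table) < threshold:
        return None
    return fisher_exact_two_sided(table)
```

Windows that return `None` are simply untested; only the p-values of tested windows would enter the multiple-testing correction.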

All MKAR results can be viewed and analyzed as a custom track on the UCSC Genome Browser (49) at http://hgwdev-giardine.cse.ucsc.edu/cgi-bin/hgTracks (track name = “McDonald-Kreitman AR”), and in Supplementary Materials.

{I took this from the primary text of the previous version - please merge it in with the rest of the methods, removing any redundancy.}

The estimated rates of neutral substitutions in ARs co-vary with those of four-fold degenerate sites (21), and ARs are frequently used (1, 2, 22), and indeed have been quantified (23), as a good model of neutral molecular evolution. Still, there is some evidence that certain repetitive elements have evolved non-neutrally (24, 25) and we removed these elements from our neutral class of sites. We mask out the coding portions of exons so each window’s test regions are composed only of presumptive non-coding DNA. Additionally, we mask out hypermutable CpG dinucleotide sites, which may lead to underestimates of true divergence in the MKAR test, particularly when divergence is reckoned with rhesus (26, 27). A 2X2 contingency table is constructed for each window binning polymorphism and divergence counts in AR and in non-coding test sites and, for windows passing a minimum expected value threshold of 5 counts for each cell of each window’s contingency table, deviation from neutrality is evaluated for significance by a Fisher’s Exact Test. To correct for multiple testing, a q-value is computed from the ranked distribution of Fisher’s Exact p-values and used to estimate the false discovery rate (FDR) (28).
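As a simplified illustration of how q-values follow from ranked p-values, the sketch below takes the proportion of true null hypotheses, pi0, as a parameter. It does not implement the bootstrap estimate of pi0 from the R qvalue library; with the default pi0 = 1.0 it reduces to Benjamini-Hochberg adjusted p-values, which is a more conservative stand-in. The function name is hypothetical.

```python
def q_values(p_values, pi0=1.0):
    """q-values from p-values for a given proportion of true nulls, pi0.

    With pi0 = 1.0 the result equals Benjamini-Hochberg adjusted p-values.
    Estimating pi0 from the data (for example by bootstrap) is not done here.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that q-values are monotone in rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pi0 * p_values[i] * m / rank)
        q[i] = running_min
    return q
```

Windows whose q-value is at most 0.10 would then be reported at a false discovery rate of 10%.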

An assumption of the McDonald-Kreitman test is that the classes of sites compared are intercalated and hence have the same genealogical history. While the classes of sites examined here (noncoding DNA and ARs) are certainly not intercalated to the extent that synonymous and nonsynonymous sites are in a coding sequence, the ARs examined here are genomically ubiquitous as well as quite evenly distributed, with an average distance of 443.7 nucleotides between AR sequences in each window (Supplementary Table 7). This rather homogeneous intercalation suggests that the risk of type I error due to partial linkage is minimal.
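One way such a spacing statistic could be computed for a single window is sketched below. How the reported average was actually derived is not stated here, so the definition used (mean gap between consecutive, merged AR intervals), the function name, and the half-open coordinates are assumptions.

```python
def mean_ar_spacing(ar_intervals):
    """Mean gap in bp between consecutive AR intervals in one window.

    ar_intervals: list of (start, end), half-open. Overlapping or abutting
    intervals are merged first. Returns None when fewer than two separate
    intervals remain, since no gap is defined.
    """
    merged = []
    for start, end in sorted(ar_intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    gaps = [nxt[0] - cur[1] for cur, nxt in zip(merged, merged[1:])]
    return sum(gaps) / len(gaps) if gaps else None
```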

When interpreting any test for selection using polymorphism and divergence data, one must be aware of the limitations of the input data. Some of the SNPs in dbSNP126 come from datasets with known ascertainment biases (32); however, these biases affect both our test and AR classes of sites, so their effects on the MKAR results are negligible.

A more serious concern is the technical difficulty of genotyping SNPs in ARs, particularly in the younger repetitive sequences (33), which leads to an under-representation of SNPs in our neutral class of sites. We examined SNP density, using both dbSNP126 and HapMap, in the ARs tested here binned by percent divergence from a consensus sequence. An individual AR’s percent divergence from an overall consensus sequence, which is assumed to approximate the ancestral repetitive sequence, is a proxy for AR age. As expected, we found that SNP density is indeed higher in older repeats, regardless of which human polymorphism dataset is examined. However, SNP density is both greater and more evenly distributed across all our ARs using dbSNP126 (FIG 2), and so in this main text we report results generated using human polymorphism data from dbSNP126. The MKAR test was run on both SNP datasets, however, and there are highly significant positive correlations between results generated with dbSNP126 and with HapMap (Supplementary Tables 1 and 2; Supplementary Figs 1 and 2). It will be important to continue to evaluate signals for selection as more polymorphism data become available, especially for less biased data. The divergence data also have limitations, largely in the quality of the assemblies of the comparison species, but also in the contribution of non-orthologous regions to the alignments. As genome assemblies and alignments improve, it will be important to re-run these tests. Nevertheless, the MKAR test is a useful and timely addition to comparative whole-genome resources aimed at finding functional non-coding DNA sequences.
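The binning of SNP density by repeat age can be sketched as follows. The input layout (length, percent divergence from consensus, SNP count per repeat), the 5% bin width, and the per-kb unit are illustrative assumptions, not parameters taken from the analysis.

```python
from collections import defaultdict


def snp_density_by_divergence(repeats, bin_width=5.0):
    """SNPs per kb of AR sequence, binned by percent divergence from consensus.

    repeats: iterable of (length_bp, percent_divergence, snp_count).
    Returns {bin_lower_edge: snps_per_kb}, with bins in ascending order.
    """
    bases = defaultdict(int)
    snps = defaultdict(int)
    for length, pct_div, n_snps in repeats:
        lower = int(pct_div // bin_width) * bin_width
        bases[lower] += length
        snps[lower] += n_snps
    return {b: 1000.0 * snps[b] / bases[b] for b in sorted(bases) if bases[b] > 0}
```

Density is pooled over all bases in a bin rather than averaged per repeat, so long repeats are not under-weighted relative to short ones.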

{The comparison we had in the first submission seemed to me to have value - shouldn’t we continue to include it? The text follows.}

When interpreting any test for selection using polymorphism and divergence data, one must be aware of the limitations in the input data. Some of the data in dbSNP126 come from datasets with known biases toward more frequent alleles (22, 23). This bias will apply equally to both AR and non-AR sites, so the possible effect on the MKAR results may not be large. One approach to evaluating the magnitude of the effect of this ascertainment bias is to compare MKAR results obtained using dbSNP126 data with results obtained using a polymorphism dataset that should be less biased toward frequent alleles. This dataset combined resequenced HapMap data (22) with HapMap Phase II data in the 10 resequenced ENCODE (24) regions. For the 417 10kb windows that could be tested in these resequenced ENCODE regions, the p-values for deviation of windows from neutrality correlated between the MKAR tests with an r of 0.535 (using divergence from chimp). A similar comparison using divergence from rhesus gave an r of 0.538. The NI values correlated as well, with r = 0.531 and 0.539 for divergence from chimpanzee and rhesus, respectively (p-value < 2.2 × 10^-16 in both cases). Thus even with known biases in the current polymorphism data, the test is fairly robust. However, it will be important to continue to evaluate signals for recent selection as more polymorphism data become available, especially for less biased data. The divergence data also have limitations, largely in the quality of the assemblies of the comparison species, but also in the contribution of non-orthologous regions to the alignments (for example, see Fig. 3). As genome assemblies and alignments improve, it will be important to re-run these tests.

Literature Cited

1. R. H. Waterston et al., Nature 420, 520 (2002).

2. F. Chiaromonte et al., Cold Spring Harb Symp Quant Biol 68, 245 (2003).

3. K. Lindblad-Toh et al., Nature 438, 803 (2005).

4. A. Siepel et al., Genome Res 15, 1034 (2005).

5. A. G. Clark et al., Science 302, 1960 (2003).

6. C. D. Bustamante et al., Nature 437, 1153 (2005).

7. The Chimpanzee Sequencing and Analysis Consortium, Nature 437, 69 (2005).

8. R. A. Gibbs et al., Science 316, 222 (2007).

9. R. Nielsen et al., PLoS Biol 3, e170 (2005).

10. J. H. McDonald, M. Kreitman, Nature 351, 652 (1991).

11. A. L. Hughes, M. Nei, Nature 335, 167 (1988).

12. Z. Yang, J. P. Bielawski, Trends in Ecology and Evolution 15, 496 (2000).

13. S. A. Tishkoff et al., Nat Genet 39, 31 (2007).

14. M. T. Hamblin, A. Di Rienzo, Am J Hum Genet 66, 1669 (2000).

15. K. S. Pollard et al., PLoS Genet 2 (2006).

16. K. S. Pollard et al., Nature 443, 167 (2006).

17. S. Prabhakar, J. P. Noonan, S. Paabo, E. M. Rubin, Science 314, 786 (2006).

18. H. Akashi, Genetics 139, 1067 (1995).

19. A. Eyre-Walker, Genetics 162, 2017 (2002).

20. M. W. Nachman, in Evolutionary Genetics: Concepts and Case Studies, C. W. Fox, J. B. Wolf, Eds. (Oxford University Press, Oxford, 2006) pp. 103-118.

21. R. C. Hardison et al., Genome Res 13, 13 (2003).

22. E. M. Kvikstad, S. Tyekucheva, F. Chiaromonte, K. D. Makova, PLoS Comput Biol 3, 1772 (2007).

23. G. Lunter, C. P. Ponting, J. Hein, PLoS Comput Biol 2, e5 (2006).

24. M. Kamal, X. Xie, E. S. Lander, Proc Natl Acad Sci U S A 103, 2740 (2006).

25. C. B. Lowe, G. Bejerano, D. Haussler, Proc Natl Acad Sci U S A 104, 8005 (2007).

26. J. Meunier, L. Duret, Mol Biol Evol 21, 984 (2004).

27. R. D. Hernandez, S. H. Williamson, C. D. Bustamante, Mol Biol Evol 24, 1792 (2007).

28. J. D. Storey, Journal of the Royal Statistical Society Series B-Statistical Methodology 64, 479 (2002).

29. M. Z. Ludwig, M. Kreitman, Mol Biol Evol 12, 1002 (1995).

30. D. L. Jenkins, C. A. Ortori, J. F. Brookfield, Proc Biol Sci 261, 203 (1995).

31. P. Andolfatto, Nature 437, 1149 (2005).

32. A. G. Clark, M. J. Hubisz, C. D. Bustamante, S. H. Williamson, R. Nielsen, Genome Res 15, 1496 (2005).

33. The International HapMap Consortium, Nature 437, 1299 (2005).

34. D. M. Rand, L. M. Kann, Mol Biol Evol 13, 735 (1996).

35. R. Nielsen, I. Hellmann, M. Hubisz, C. Bustamante, A. G. Clark, Nat Rev Genet 8, 857 (2007).

36. P. C. Sabeti et al., Science 312, 1614 (2006).

37. P. C. Sabeti et al., Nature 449, 913 (2007).

38. E. C. Walsh et al., Hum Genet 119, 92 (2006).

39. P. D. Evans, J. R. Anderson, E. J. Vallender, S. S. Choi, B. T. Lahn, Hum Mol Genet 13, 1139 (2004).

40. Y. Q. Wang, B. Su, Hum Mol Genet 13, 1131 (2004).

41. M. Ashburner et al., Nat Genet 25, 25 (2000).

42. T. Beissbarth, T. P. Speed, Bioinformatics 20, 1464 (2004).

43. O. S. Akbari et al., Development 135, 123 (2008).

44. L. A. Lettice et al., Proc Natl Acad Sci U S A 99, 7548 (2002).

45. M. A. Nobrega, I. Ovcharenko, V. Afzal, E. M. Rubin, Science 302, 413 (2003).

46. J. Taylor et al., Genome Res (2006).

47. G. E. Crawford et al., Nature Methods 3, 503 (2006).

48. The ENCODE Project Consortium, Science 306, 636 (2004).

49. W. J. Kent et al., Genome Res 12, 996 (2002).

50. M. Blanchette et al., Genome Res 14, 708 (2004).

51. S. T. Sherry et al., Nucleic Acids Res 29, 308 (2001).

52. B. Giardine et al., Genome Res 15, 1451 (2005).

53. F. Hsu et al., Bioinformatics 22, 1036 (2006).

54. J. Taylor, S. Tyekucheva, M. Zody, F. Chiaromonte, K. D. Makova, Mol Biol Evol 23, 565 (2006).
