Exploring the Role of Non-Coding DNA in the Function of the Human Genome through Variation
Christine Bird, cpb@sanger.ac.uk

Hypothesis: Conserved non-coding DNA has a function in the human genome
Does human variation data suggest selection is acting on non-coding DNA?
Are conserved non-coding sequences selectively constrained?
Detection of fast-evolving conserved non-coding sequence.
Exploring the properties and genomic context of human fast-evolving non-coding regions.

The Human Genome
~25,000 genes.
1 to 1.5% of human DNA is coding.
Is the remaining 98.5% "junk"?

Selective constraint in mammalian genomes
[Figure: neutral vs. constrained sequence; constrained 5%.]
Waterston et al. Nature 2002

Proportions of Lineage Specific Conserved non-coding (CNC) sequences
418 MCSs (Multiple vertebrate Conserved Sequences) in 571 Kb: 58 coding, 46 UTRs and 314 non-coding.
~27 species
Margulies et al. PNAS 2005

CNCs are evenly distributed in the human genome
Dermitzakis et al. Nat Rev Genet 2005

The density of CNCs and exons is negatively correlated
Dermitzakis et al. Nat Rev Genet 2005

Why study conserved non-coding DNA?
Abundance beyond that expected under neutral evolution.
If function is gene regulation, understanding is limited.
Gene regulation is considered a crucial contributor to evolutionary change (King and Wilson, 1975).
Conserved non-coding sequences (CNCs) may well harbour critical regulatory changes that have driven recent human evolution.

Conserved non-coding sequences
Top conserved 5% of the human genome as detected with a phylogenetic hidden Markov model (phylo-HMM) (Siepel, 2005).
Best-in-genome pairwise alignments by blastz, followed by chaining.
A multiple alignment constructed by MULTIZ.
PhastCons constructs a two-state phylo-HMM for conserved and non-conserved regions.
Remove overlap with Ensembl gene annotation.
http://genome.ucsc.edu/

Are conserved non-coding sequences selectively constrained?
Conservation of non-coding sequence due to forces acting on the human genome.
CNC SNP density is only 82% of that in non-coding non-conserved sequence.
$3.9 \times 10^{-4}$ vs. $4.8 \times 10^{-4}$; $\chi^2 = 686$, 1 df; $p < 10^{-99}$
Just due to low local mutation rates?
Or are new alleles deleterious, and therefore less likely to be fixed in the population?
Address this by looking at the derived allele frequency (DAF) spectrum, as it is unaffected by local mutation rates.
Drake et al. Nat Genet 2006

Derived Allele Frequency
Selective constraint shifts the distribution of constrained alleles toward rarer frequencies (Fay & Wu, 2000).
Allele frequencies in 4 populations from 210 unrelated individuals in the HapMap project:
CEU - American of European ancestry (60)
YRI - Yoruba from Nigeria (60)
JPT - Japanese from Tokyo (45)
CHB - Han Chinese from Beijing (45)
Derived Allele Frequency (DAF) was generated for 1 million Phase I HapMap SNPs & 4 million Phase II.
The ancestral allele was inferred by comparison to chimp and/or macaque.
SNPs were assigned to defined genomic features to allow comparison.
Drake et al. Nat Genet 2006

CNCs are selectively constrained
[Figure: fraction of SNPs (y-axis, 0 to 0.25) by binned derived allele frequency (x-axis, low to high, bins of 0.1 from >0-0.1 to 0.9-<1) for conserved and non-conserved sequence. Annotation: selective constraint.]
Mann-Whitney U test; $P \ll 10^{-4}$
Drake et al. Nat Genet 2006

CNCs have an excess of low frequency derived alleles compared to introns
[Figure: fraction of SNPs (y-axis, 0 to 0.35) by binned derived allele frequency (x-axis, low to high, bins of 0.1 from >0-0.1 to 0.9-<1) for CNCs, exons, introns and rest.]
Mann-Whitney U test; CNC vs introns $P \ll 10^{-16}$

CNC sequences are selectively constrained and not mutation cold spots
Nucleotide variation revealed strong selective constraints upon CNCs in human populations.
SNP density in CNCs is only 82% of that in non-conserved non-coding sequence (18% lower).
CNCs have an excess of low frequency derived alleles.
CNCs are subject to purifying selection in humans and are likely to harbour functionally important variants.
Drake et al.
Nat Genet 2006

Why are they conserved?
Regions of the genome are therefore selectively constrained despite being non-coding.
But what is the reason for this conservation?
What is novel about their biology?
How can we tackle this question for so many elements?
What are the most interesting regions?
A subset of CNCs undergoing rapid change with potential common properties or roles.

Why study fast-evolving non-coding?
If CNCs are part of chimpanzee-human lineage differentiation by changes in gene regulation, then changes in their nucleotide sequence should be expected despite their overall conservation.
Following gene duplication, subfunctionalization by the partitioning of gene regulation among descendant copies (Force, 1999).
Older models of gene duplication proposed an important role for positive selection after duplication (Bridges 1935, Ohno 1970, Ohta 1987).

Subfunctionalization
Duplicated genes preserved through subfunctionalization by the Duplication-Degeneration-Complementation model.
[Figure: duplicated gene and separated tissue-specific regulation; panel labels: Brain, Heart.]
Lynch and Force, Genetics 2000
If CNCs are regulatory elements involved in this process they would have changed rapidly since duplication.

Detecting fast-evolving non-coding sequences
MULTIZ alignments (Webb Miller) of human, chimp and macaque:
Human    GACTACGTTTGGTTTAGAGAT
Chimp    GACTGGCTTTACTTTTGAGAT
Macaque  GTCTGGGTTTACTTTTCAGAT
Lineage-specific substitutions in this example: S1 (human) = 5, S2 (chimp) = 1, macaque = 2.
Tajima's relative rate test:
$\chi^2 = \frac{(S_1 - S_2)^2}{S_1 + S_2}$
Tajima, Genetics 1993
$\chi^2$ test of base substitutions.
Alignments = 304,291
Power to detect acceleration = 26,477
Accelerated (P < 0.05) = 2,794 (11%)
Accelerated in chimp = 1438
Accelerated in human = 1356: ANC (Accelerated Non-Coding)

Are Accelerated Non-Coding (ANC) sequences functional?
Compare to 3 sets of control sequences:
Power CNCs (not lineage specific): CNCs with >= 4 substitutions = 23,683
Non-accelerated CNCs: CNCs with < 4 substitutions = 277,814
DAF controls 1 & 2: 1356 x 20 Kb windows 500 Kb from the 5' & 3' of ANCs.
Repeat analyses excluding potential confounders: Segmental Duplications (SD), Copy Number Variants (CNV), pseudogenes and retroposed genes.

Are ANC sequences functional?
Does nucleotide variation data indicate particular modes of selection implying function? (Is acceleration recent or ancient?)
  Derived allele frequency spectrum comparisons
  Population differentiation, FST
Are ANCs involved in subfunctionalization?
  Is there enrichment in recently duplicated sequences?
What function do these rapidly evolving sequences have?
  Association of ANC variation with expression levels of nearby genes

Excess of high frequency derived alleles in ANCs
[Figure: fraction of SNPs (y-axis, 0 to 0.35) by binned derived allele frequency (x-axis, bins of 0.1 from >0-0.1 to 0.9-<1) for non-accelerated CNCs, control and ANCs. Annotations: selective constraint; loss of constraint & directional selection?]
Mann-Whitney U test; non-accelerated CNCs vs ANCs $P = 1.63 \times 10^{-6}$

Power CNCs are neutral
[Figure: fraction of SNPs (y-axis, 0 to 0.35) by binned derived allele frequency (x-axis, bins of 0.1 from >0-0.1 to 0.9-<1) for non-accelerated CNCs, ANCs, control and power CNCs.]
Mann-Whitney U test; power CNCs vs control P = 0.15

Excess of rare alleles in ANCs excluding confounding elements
[Figure: fraction of SNPs (y-axis, 0 to 0.35) by binned derived allele frequency (x-axis, bins of 0.1 from >0-0.1 to 0.9-<1) for non-accelerated CNCs, control, power CNCs, ANCs and ANCs with no confounders. Annotation: loss of constraint & directional selection?]
Mann-Whitney U test; ANCs vs ANCs with no confounders P = 0.48

Detecting recent evolution and population-specific selection
A measure of population structure, Wright's FST.
Compares the mean amount of genetic diversity found within subpopulations to that of the meta-population.
Sampling from 2 diverged subpopulations as if they were a single panmictic population gives an excess of homozygotes & a deficiency of heterozygotes.
FST can be defined as:
$F_{ST} = \frac{H_T - H_S}{H_T}$
Calculated for ANCs using:
MSG - mean square error within populations
MSP - mean square error between populations
nc - variance-corrected average sample size
Weir and Cockerham, Evolution 1984

ANC FST values higher than non-accelerated CNCs
[Figure: frequency (y-axis, 0 to 0.35) by FST bin (x-axis, bins of 0.05 from -0.05 to 1) for ANCs with no confounders, ANCs, power CNCs and non-accelerated CNCs.]
Mann-Whitney U test; non-accelerated CNCs vs ANCs P = 0.0504; non-accelerated CNCs vs ANCs with no confounders P = 0.0363

Enrichment in Segmental Duplications
Approximately 5-6% of the human genome is in SDs (Bailey et al. Science 2002).
ANCs 8%
Power CNCs 10%
Non-accelerated CNCs 5%
Excess of ANCs and power CNCs in SDs (chi-square; $P < 10^{-4}$).
The general enrichment in SDs is not surprising, as it has been observed that sequence divergence is elevated in duplicated sequences (Hurles et al. GenBio. 2004; She et al. GenRes. 2006).

Excess of recent segmental duplications associated with ANCs
[Figure: fraction of category overlapping SDs (y-axis, 0 to 0.2) by % identity of SDs (x-axis, 90 to 100) for non-accelerated CNCs, power CNCs and ANCs. Annotation: human specific.]
Mann-Whitney U test; $P \ll 10^{-4}$

Testing for evidence of involvement in Gene Regulation
[Figure: schematic with labels GENE, ANC SNP, mRNA and Association.]

ANC SNP-Expression Association
What is the functional impact of ANC variation on gene expression phenotypes?
Associate SNP genotypes within ANCs with transcript expression levels by linear regression.
47,294 transcripts probed in lymphoblastoid cell lines of 210 unrelated HapMap individuals.
Additive association model: linear regression, e.g. CC = 0, CT = 1, TT = 2.
Statistical significance adjusted following 10,000 permutations per gene.
[Figure: expression level (y-axis, 8.0 to 9.5) by genotype (x-axis: CC = 0, CT = 1, TT = 2).]

SNPs within ANCs are significantly associated with gene expression phenotypes
Significant SNPs at the 0.01 permutation threshold:
68% of ANC SNPs tested (496 out of 729)
9% of power CNC SNPs tested (1047 out of 11468)
A SNP within an ANC is 7 times more likely to be associated with gene expression levels than a SNP within a power CNC.
Significant at the 0.01 permutation threshold:
16% of ANCs tested (59 out of 366)
3% of power CNCs tested (165 out of 5968)
Nucleotide variation within ANCs is 5 times more likely to be associated with gene expression levels than variation in a power CNC.
Tendency for derived alleles within ANCs to be associated with lower expression levels.

Summary
CNCs are not mutation cold spots but selectively constrained.
Fast-evolving non-coding sequences in the human lineage have lost this constraint and some are potentially undergoing positive selection.
This may have contributed to some recent differentiation in human populations.
ANCs are enriched in the most recent segmental duplications.
SNPs in ANCs are associated with significant change in gene expression phenotypes.
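
The additive association model used for the ANC SNP-expression analysis (genotypes coded CC = 0, CT = 1, TT = 2, with significance assessed by permutation) can be illustrated with a minimal Python sketch. This is not the pipeline used in the study. The function names and the toy genotype and expression values are hypothetical, and the permutation is done per SNP, which is a simplification of the per-gene scheme named on the slide.

```python
import random

# Additive coding from the slide: number of copies of the T allele.
GENOTYPE_CODE = {"CC": 0, "CT": 1, "TT": 2}


def additive_slope(dosages, expression):
    """Least-squares slope of expression level on genotype dosage."""
    n = len(dosages)
    if n != len(expression) or n < 2:
        raise ValueError("need two equal-length samples of size >= 2")
    mean_d = sum(dosages) / n
    mean_e = sum(expression) / n
    sxx = sum((d - mean_d) ** 2 for d in dosages)
    if sxx == 0:
        raise ValueError("SNP is monomorphic in this sample")
    sxy = sum((d - mean_d) * (e - mean_e) for d, e in zip(dosages, expression))
    return sxy / sxx


def permutation_p_value(dosages, expression, n_perm=10_000, seed=0):
    """Two-sided empirical P value from shuffling expression across individuals."""
    rng = random.Random(seed)
    observed = abs(additive_slope(dosages, expression))
    shuffled = list(expression)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(additive_slope(dosages, shuffled)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical toy data (not from the study): 12 individuals, one SNP, one transcript.
genotypes = ["CC", "CC", "CC", "CC", "CT", "CT", "CT", "CT", "TT", "TT", "TT", "TT"]
expression = [9.4, 9.1, 9.3, 9.0, 8.8, 8.6, 8.9, 8.5, 8.2, 8.0, 8.3, 8.1]

dosages = [GENOTYPE_CODE[g] for g in genotypes]
slope = additive_slope(dosages, expression)
p_value = permutation_p_value(dosages, expression)
print(f"slope per T allele = {slope:.3f}, permutation P = {p_value:.4f}")
```

A negative slope in this coding means that each additional copy of the T allele is associated with lower expression. Shuffling expression values rather than assuming a normal error model keeps the sketch close to the permutation-based significance described on the slide.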
Acknowledgements
Thanks to my joint supervisors Emmanouil Dermitzakis and Matthew Hurles and the members of their teams: Barbara Stranger, Dan Jeffares, Catherine Ingle, Julian Huppert, Antigone Dimas, Sarah Lindsay, Dan Andrews, Dan Turner, Chris Barnes.
Particular thanks to my other co-authors:
Webb Miller - human-chimpanzee-macaque alignments
Daryl Thomas - DAF for both phase I and II SNPs
Maureen Liu - quantifying gene density
The Rhesus Macaque Genome Sequencing Consortium (RMGSC) and the HapMap consortium for making data available, and the Wellcome Trust and MRC for funding.

Fig. 3. Phylogenetic tree of vertebrate species. By using the generated 27-species multisequence alignment, branch lengths were calculated based on analysis of synonymous coding positions. The branch lengths (as substitutions per synonymous site) between human and each species are listed (with additional pair-wise branch lengths provided in the supporting information). The last common ancestor among the catarrhine primates (A) is estimated at 25 mya (36, 37), between the rodents and primates (B) at 75 mya (5, 6), between eutherians and metatherians (C) at 185 mya (14), between monotremes and other therians (D) at 200 mya (14), and between mammals and birds (E) at 310 mya (13).
Margulies et al. PNAS 2005

Proportions of Lineage Specific Conserved non-coding sequences
Fig. 4. Lineage specificity of MCSs. The proportion of nonexonic MCSs found in the sequences of species in each category is indicated. Note that virtually all MCSs overlapping known exonic sequences are present in all mammals (data not shown). All Mammals: cat, dog, cow, pig, rat, mouse, N.A. opossum, wallaby, and platypus; Eutherian: cat, dog, cow, pig, rat, and mouse; Marsupials: N.A.
opossum and wallaby; and Other: species combinations containing 2% of the analyzed MCSs (see the supporting information for the complete data set). Hashed areas of "All Mammals" reflect portions lacking one or both rodents, and hashed portions of "Eutherian Marsupials" reflect portions lacking both rodents.
Margulies et al. PNAS 2005

Distribution of large and small CNCs (Conserved Non-Coding sequences) and exons on Hsa21
[Figure: frequency of exons, big CNCs and small CNCs by position in megabases along the long arm (x-axis, 0 to 30 Mb).]
Big CNCs: 70% ID, 100 bps ungapped
Small CNCs: 85% ID, 35-99 bps ungapped
Dermitzakis et al. Nature 2002

Conservation of CNCs in multiple species
[Figure: number of conserved sequences (0 to 220) per species (Wallaby, Platypus, Elephant Shrew, Cat, Bat, Pig, Rabbit, Lemur, Green Monkey, Mouse, Human); labels: human, mouse, conserved block.]
Dermitzakis et al. Science 2003
Drake et al. Nat Genet 2006

Testing DAF spectrum distributions
Non-parametric distributions of unequal sample size.
Mann-Whitney U test: compares the medians of two populations; uses the rank order of values in the two samples.
Kolmogorov-Smirnov test: measures differences in the entire distributions of two samples, in both shape and location, at the cost of being less sensitive to differences in location only.
The KS test is less powerful than the Mann-Whitney U test with respect to the alternative hypothesis of a difference in location.

No. of significant CNC to gene associations
Columns: population; set (ANC or power CNCs); no. of tested CNCs; no. of SNPs; no. of probes tested; no. of associations; significant associations at thresholds 0.01, 0.001 and 0.0001; no. of significant CNCs of those tested (with %) at thresholds 0.01, 0.001 and 0.0001.

Population  Set    CNCs  SNPs  Probes  Associations  Significant associations  Significant CNCs of those tested
                                                     0.01   0.001   0.0001     0.01       0.001     0.0001
CEPH        ANC     387   555    8673         23330    77       9        0      59 (15%)    9 (2%)    0 (0%)
CEPH        Power  6232  8388   14906        350309   181      36       18     149 (2%)    33 (1%)   17 (0%)
CHB         ANC     356   499    8092         21291    83      13        0      56 (16%)   11 (3%)    0 (0%)
CHB         Power  5737  7579   14893        317518   202      41       15     159 (3%)    39 (1%)   15 (0%)
CHB & JPT   ANC     342   466    7919         20163   109      11        1      59 (17%)    9 (3%)    1 (0%)
CHB & JPT   Power  5474  7162   14852        301636   203      12        1     149 (3%)    12 (0%)    1 (0%)
JPT         ANC     355   490    8197         21166    88      12        0      59 (17%)   11 (3%)    0 (0%)
JPT         Power  5674  7531   14852        315476   241      48       20     194 (3%)    42 (1%)   19 (0%)
YRI         ANC     391   583    9118         24310   113      15        2      64 (16%)   15 (4%)    2 (1%)
YRI         Power  6724  9218   14908        381407   196      32       15     173 (3%)    30 (0%)   14 (0%)
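
Tajima's relative rate test, used in the main slides to detect accelerated sequences, can be worked through with the three example sequences shown there. The Python sketch below is an illustration only, not the code used in the study. The function names are hypothetical, gaps and ambiguous bases are not handled, and the P value is taken from the chi-square distribution with 1 df via the complementary error function.

```python
import math


def lineage_specific_counts(seq1, seq2, outgroup):
    """Count sites where exactly one ingroup sequence differs from the other two."""
    if not (len(seq1) == len(seq2) == len(outgroup)):
        raise ValueError("aligned sequences must have equal length")
    s1 = s2 = 0
    for a, b, o in zip(seq1, seq2, outgroup):
        if a != b and b == o:
            s1 += 1  # change assigned to lineage 1
        elif a != b and a == o:
            s2 += 1  # change assigned to lineage 2
    return s1, s2


def tajima_relative_rate(s1, s2):
    """Return (chi-square, P value) for chi2 = (S1 - S2)^2 / (S1 + S2), 1 df."""
    if s1 + s2 == 0:
        raise ValueError("no lineage-specific substitutions to test")
    chi2 = (s1 - s2) ** 2 / (s1 + s2)
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Example alignment from the slide; macaque is the outgroup.
HUMAN = "GACTACGTTTGGTTTAGAGAT"
CHIMP = "GACTGGCTTTACTTTTGAGAT"
MACAQUE = "GTCTGGGTTTACTTTTCAGAT"

s1, s2 = lineage_specific_counts(HUMAN, CHIMP, MACAQUE)
chi2, p_value = tajima_relative_rate(s1, s2)
print(f"S1 = {s1}, S2 = {s2}, chi-square = {chi2:.2f}, P = {p_value:.3f}")
```

For the example alignment this gives S1 = 5 and S2 = 1, matching the counts on the slide, and a chi-square of 16/6 (about 2.67), which is below the 3.84 critical value for P < 0.05 with 1 df. Under this test at least four substitutions, all on one lineage, are needed to reach P < 0.05 (chi-square = 4), which is consistent with the >= 4 substitution threshold used to define power CNCs.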