Conserved non-coding DNA has a function in the human genome

Exploring the Role of Non-Coding
DNA in the Function of the Human
Genome through Variation.
Christine Bird
cpb@sanger.ac.uk
Hypothesis: Conserved non-coding DNA
has a function in the human genome
Does human variation data suggest selection
is acting on noncoding DNA?
 Are conserved non-coding sequences
selectively constrained?
 Detection of fast evolving conserved noncoding sequence.
 Exploring the properties and genomic context
of human fast evolving non-coding regions.
The Human Genome:
~25,000 genes
1 to 1.5% of human DNA is coding
Is the remaining 98.5% “junk”?
Selective constraint in mammalian genomes
[Figure: pie chart of the genome, neutral vs. constrained (about 5%)]
Waterston et al. Nature 2002
Proportions of Lineage Specific
Conserved non-coding (CNC) sequences
418 MCSs (Multiple vertebrate Conserved Sequences) in 571 kb, ~27 species:
58 coding, 46 UTRs and 314 non-coding.
Margulies et al. PNAS 2005
CNCs are evenly distributed in the human
genome
Dermitzakis et al. Nat Rev Genet 2005
The density of CNCs and exons is
negatively correlated
Dermitzakis et al. Nat Rev Genet 2005
Why study conserved non-coding DNA?
 Abundance beyond that expected under neutral
evolution.
 If their function is gene regulation, our understanding of it is still limited.
 Gene regulation is considered a crucial contributor to
evolutionary change (King and Wilson, 1975).
 Conserved non-coding sequences (CNCs) may well
harbour critical regulatory changes that have driven
recent human evolution.
Conserved non-coding sequences
 Top conserved 5% of the human genome as
detected with a phylogenetic hidden Markov
model (phyloHMM) (Siepel, 2005).
 Best-in-genome pairwise alignments by blastz,
followed by chaining.
 A multiple alignment constructed by MULTIZ.
 PhastCons constructs a two-state phylo-HMM for
conserved and non-conserved regions.
 Remove overlap with Ensembl gene
annotation.
http://genome.ucsc.edu/
Are conserved non-coding sequences
selectively constrained?

Conservation of non-coding sequence due to forces
acting on the human genome.

CNC SNP density is only 82% of that in non-coding non-conserved sequence: $3.9 \times 10^{-4}$ vs. $4.8 \times 10^{-4}$; $\chi^2 = 686$, 1 df; $p < 10^{-99}$
Is this just due to low local mutation rates?
Or
are new alleles deleterious, and therefore less likely to be fixed in the population?

Address this by looking at the derived allele frequency (DAF) spectrum, which is unaffected by local mutation rates.
Drake et al. Nat Genet 2006
Derived Allele Frequency


Selective constraint shifts the distribution of constrained
alleles toward rarer frequencies (Fay & Wu, 2000).
Allele frequencies in 4 populations from 210 unrelated
individuals in the HapMap project:
CEU - American of European ancestry (60)
YRI - Yoruba from Nigeria (60)
JPT - Japanese from Tokyo (45)
CHB - Han Chinese from Beijing (45)



Derived Allele Frequency (DAF) was generated for 1
million Phase I HapMap SNPs & 4 million Phase II.
The ancestral allele was inferred by comparison to
chimp and/or macaque.
SNPs were assigned to defined genomic features to
allow comparison.
Drake et al. Nat Genet 2006
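A minimal sketch of how a derived allele frequency and a binned DAF spectrum could be computed, assuming each SNP is biallelic and its ancestral allele is already known (here supplied by hand; in the slides it is inferred from chimp and/or macaque). The function names, the input layout and the ten equal-width bins are illustrative assumptions, not the pipeline used in the study.

```python
def derived_allele_frequency(allele_counts, ancestral):
    """Return the DAF for one biallelic SNP, or None if it cannot be polarised.

    allele_counts: dict mapping allele -> number of observed chromosomes.
    ancestral: the allele inferred to be ancestral (assumed given).
    """
    if len(allele_counts) != 2 or ancestral not in allele_counts:
        return None
    total = sum(allele_counts.values())
    if total == 0:
        return None
    return (total - allele_counts[ancestral]) / total


def daf_bin(daf, n_bins=10):
    """Index of the equal-width bin that a DAF value falls into."""
    return min(int(daf * n_bins), n_bins - 1)


def daf_spectrum(dafs, n_bins=10):
    """Fraction of polymorphic SNPs (0 < DAF < 1) in each bin."""
    kept = [d for d in dafs if d is not None and 0.0 < d < 1.0]
    counts = [0] * n_bins
    for d in kept:
        counts[daf_bin(d, n_bins)] += 1
    return [c / len(kept) for c in counts] if kept else counts


if __name__ == "__main__":
    # 120 chromosomes, C ancestral, T derived on 30 of them.
    print(derived_allele_frequency({"C": 90, "T": 30}, "C"))
    print(daf_spectrum([0.05, 0.25, 0.25, 0.95]))
```

Comparing the spectra of two feature classes (for example CNCs and introns) then amounts to calling `daf_spectrum` once per class on the SNPs assigned to it.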
CNCs are selectively constrained
[Figure: fraction of SNPs (y-axis) in each binned derived allele frequency class (x-axis, low to high, ten bins from >0-0.1 to 0.9-<1) for conserved and non-conserved sequence, annotated "Selective constraint".]
Mann-Whitney U test; $P \ll 10^{-4}$
Drake et al. Nat Genet 2006
CNCs have an excess of low frequency
derived alleles compared to Introns
[Figure: fraction of SNPs (y-axis) in each binned derived allele frequency class (x-axis, low to high) for CNCs, exons, introns and the rest of the genome.]
Mann-Whitney U test; CNC vs introns $P \ll 10^{-16}$
CNC sequences are selectively
constrained and not mutation cold spots
 Nucleotide variation revealed strong selective
constraints upon CNCs in human populations.
SNP density in CNCs is only 82% of that in non-conserved non-coding sequence.
CNCs have an excess of low frequency
derived alleles.
 CNCs subject to purifying selection in humans,
likely to harbour functionally important variants.
Drake et al. Nat Genet 2006
Why are they conserved?
 Regions of the genome are therefore selectively
constrained despite being non-coding.
But what is the reason for this conservation…?
 What is novel about their biology?
 How can we tackle this question for so many elements?
 What are the most interesting regions?
 A subset of CNCs undergoing rapid change with
potential common properties or roles.
Why study fast-evolving non-coding?
 If CNCs are part of chimpanzee-human lineage
differentiation by changes in gene regulation then changes
in their nucleotide sequence should be expected despite
their overall conservation.
 Following gene duplication, subfunctionalization can occur by the
partitioning of gene regulation among descendant copies
(Force, 1999).
 Older models of gene duplication proposed an important
role for positive selection after duplication (Bridges 1935,
Ohno 1970, Ohta, 1987).
Subfunctionalization
 Duplicated genes preserved through
subfunctionalization by the Duplication-Degeneration-Complementation model.
[Figure: a gene expressed in brain and heart is duplicated, and the tissue-specific regulation is separated between the copies.]
Lynch and Force, Genetics 2000
 If CNCs are regulatory elements involved in this
process they would have changed rapidly since
duplication.
Detecting fast-evolving non-coding
sequences
Human    GACTACGTTTGGTTTAGAGAT
Chimp    GACTGGCTTTACTTTTGAGAT
Macaque  GTCTGGGTTTACTTTTCAGAT
MULTIZ alignments (Webb Miller).
Lineage-specific substitutions: human (S1) = 5, chimp (S2) = 1, macaque = 2.
Tajima’s relative rate test:
$\chi^2 = \frac{(S_1 - S_2)^2}{S_1 + S_2}$
Tajima, Genetics 1993
 χ2 test of base substitutions.
Alignments = 304,291
Power to detect acceleration = 26,477
Accelerated at P < 0.05 = 2,794 (11%): ANC (Accelerated Non-Coding)
Accelerated in chimp = 1438
Accelerated in human = 1356
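As a worked illustration of Tajima's relative rate test, the sketch below counts lineage-specific substitutions in the 21 bp human/chimp/macaque example alignment from the slide and evaluates the test statistic. The function names are illustrative, and the P value uses the standard one-degree-of-freedom chi-square tail computed with `math.erfc`; this is a sketch of the test itself, not of the genome-wide pipeline.

```python
from math import erfc, sqrt


def lineage_specific_counts(seq1, seq2, outgroup):
    """Count sites where exactly one of three aligned sequences differs."""
    s1 = s2 = s_out = 0
    for a, b, o in zip(seq1, seq2, outgroup):
        if a != b and b == o:
            s1 += 1      # change on the seq1 lineage
        elif a != b and a == o:
            s2 += 1      # change on the seq2 lineage
        elif a == b and a != o:
            s_out += 1   # change on the outgroup lineage
        # sites where all three differ are uninformative and skipped
    return s1, s2, s_out


def tajima_relative_rate(s1, s2):
    """Return (chi-square, P) for Tajima's test with 1 degree of freedom."""
    if s1 + s2 == 0:
        raise ValueError("no lineage-specific substitutions to test")
    chi2 = (s1 - s2) ** 2 / (s1 + s2)
    p_value = erfc(sqrt(chi2 / 2))  # upper tail of chi-square, 1 df
    return chi2, p_value


if __name__ == "__main__":
    human = "GACTACGTTTGGTTTAGAGAT"
    chimp = "GACTGGCTTTACTTTTGAGAT"
    macaque = "GTCTGGGTTTACTTTTCAGAT"
    s1, s2, s_out = lineage_specific_counts(human, chimp, macaque)
    print(s1, s2, s_out)
    print(tajima_relative_rate(s1, s2))
```

For this toy alignment the counts are 5, 1 and 2, giving a statistic of 16/6 (about 2.67), which does not reach P < 0.05. With so few substitutions the test has little power, which is why only alignments with enough substitutions are counted as having power to detect acceleration.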
Are Accelerated Non-Coding (ANCs)
sequences functional?
 Compare to 3 sets of control sequences:
 Power CNCs (not lineage specific):
CNCs with >= 4 substitutions = 23,683
 Non-accelerated CNCs:
CNCs < 4 substitutions = 277,814
 DAF controls 1&2:
1356 x 20Kb windows 500Kb from 5’ & 3’ of ANCs.
Repeat analyses excluding potential confounders:
Segmental Duplications (SD), Copy Number Variants
(CNV), pseudogenes and retroposed genes.
Are ANC sequences functional?
 Does nucleotide variation data indicate particular modes
of selection implying function?
(Is acceleration recent or ancient?)
 Derived allele frequency spectrum comparisons
 Population differentiation, FST
 Are ANCs involved in subfunctionalization?
 Is there enrichment in recently duplicated sequences?
 What function do these rapidly evolving sequences
have?
 Association of ANC variation with expression levels of
nearby genes
Excess of high frequency derived alleles
in ANCs
[Figure: fraction of SNPs (y-axis) in each binned derived allele frequency class (x-axis) for non-accelerated CNCs, controls and ANCs, annotated "Selective constraint" and "Loss of constraint & Directional Selection?".]
Mann-Whitney U test; non-accelerated CNCs vs ANCs $P = 1.63 \times 10^{-6}$
Power CNCs are neutral
[Figure: fraction of SNPs (y-axis) in each binned derived allele frequency class (x-axis) for non-accelerated CNCs, ANCs, controls and power CNCs.]
Mann-Whitney U test; power CNCs vs control P = 0.15
Excess of rare alleles in ANCs excluding
confounding elements
[Figure: fraction of SNPs (y-axis) in each binned derived allele frequency class (x-axis) for non-accelerated CNCs, controls, power CNCs, ANCs and ANCs with no confounding elements, annotated "Loss of constraint & Directional Selection?".]
Mann-Whitney U test; ANCs vs ANCs with no confounders P = 0.48
Detecting recent evolution and
population-specific selection
 A measure of population structure, Wright’s FST.
 Compares the mean amount of genetic diversity found
within subpopulations to that of the meta-population.
 Sampling from 2 diverged subpopulations as if they were a single
panmictic population gives an excess of homozygotes & a
deficiency of heterozygotes.
 FST can be defined as:
$F_{ST} = \frac{H_T - H_S}{H_T}$
where $H_T$ is the expected heterozygosity of the total population and $H_S$ the mean expected heterozygosity within subpopulations.
 Calculated for ANCs
 MSG - mean square error within populations
 MSP - mean square error between populations
 nc - variance-corrected average sample size
Weir and Cockerham, Evolution 1984
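To make the heterozygosity definition of FST concrete, here is a small sketch for a single biallelic SNP, assuming the allele frequency in each subpopulation is known exactly. It implements only the (HT - HS) / HT definition; it is not the variance-component estimator of Weir and Cockerham (built from MSG, MSP and nc) that was applied to the ANCs, and the function names and the size weighting are assumptions.

```python
def expected_heterozygosity(p):
    """Expected heterozygosity 2p(1 - p) at a biallelic site."""
    return 2.0 * p * (1.0 - p)


def fst_biallelic(freqs, sizes=None):
    """FST = (H_T - H_S) / H_T from per-subpopulation allele frequencies.

    freqs: allele frequency of one allele in each subpopulation.
    sizes: optional subpopulation sizes used as weights (equal if omitted).
    """
    if sizes is None:
        sizes = [1] * len(freqs)
    total = sum(sizes)
    weights = [n / total for n in sizes]
    p_bar = sum(w * p for w, p in zip(weights, freqs))
    h_t = expected_heterozygosity(p_bar)
    h_s = sum(w * expected_heterozygosity(p) for w, p in zip(weights, freqs))
    if h_t == 0.0:
        return 0.0  # site is monomorphic overall, so no differentiation
    return (h_t - h_s) / h_t


if __name__ == "__main__":
    print(fst_biallelic([0.5, 0.5]))  # identical frequencies
    print(fst_biallelic([0.2, 0.8]))  # strongly diverged frequencies
```

With frequencies 0.2 and 0.8 in two equally sized subpopulations, HS is 0.32 and HT is 0.5, so FST is 0.36: pooling the two groups makes heterozygotes look deficient relative to the pooled expectation.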
ANC FST values higher than nonaccelerated CNCs
[Figure: frequency (y-axis) of FST values in bins of width 0.05 (x-axis, from below 0 to 1) for ANCs with no confounders, ANCs, power CNCs and non-accelerated CNCs.]
Mann-Whitney U test; non-accelerated CNCs vs ANCs P = 0.0504; non-accelerated CNCs vs ANCs with no confounders P = 0.0363
Enrichment in Segmental Duplications
 Approximately 5-6% of the human genome in SDs
(Bailey et al, Science 2002)
Fraction of each category in SDs: ANCs 8%, power CNCs 10%, non-accelerated CNCs 5%.
 Excess of ANCs and power CNCs in SDs
(chi-square; $P < 10^{-4}$).
 The general enrichment in SDs is not surprising, as it
has been observed that sequence divergence is
elevated in duplicated sequences.
(Hurles et al. GenBio. 2004; She et al. GenRes. 2006).
Excess of recent segmental duplications
associated with ANCs
[Figure: fraction of each category overlapping SDs (y-axis, 0 to 0.2) against % identity of SDs (x-axis, 90 to 100) for non-accelerated CNCs, power CNCs and ANCs, with a "Human Specific" annotation.]
Mann-Whitney U test; $P \ll 10^{-4}$
Testing for evidence of involvement in
Gene Regulation
[Diagram: a SNP within an ANC is tested for association with the mRNA level of a nearby gene.]
ANC SNP-Expression Association
 What is the functional impact of ANC variation on gene expression phenotypes?
 Associate SNP genotypes within ANCs to transcript expression levels by linear regression.
 47,294 transcripts probed in lymphoblastoid cell lines of 210 unrelated HapMap individuals.
 Additive association model: linear regression, e.g. CC = 0, CT = 1, TT = 2.
 Statistical significance adjusted following 10,000 permutations per gene.
[Figure: expression level (y-axis, about 8.0 to 9.5) against genotype CC, CT, TT coded as 0, 1, 2 (x-axis).]
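The additive association model can be sketched as a least-squares regression of expression on allele dosage, with significance assessed by permuting expression values. This is a simplified, single-SNP illustration under stated assumptions: the function names, the toy data and the two-sided slope statistic are hypothetical, and the per-gene adjustment across all tested SNPs described on the slide is not reproduced.

```python
import random


def encode_additive(genotypes, counted_allele):
    """Code each genotype as the number of copies of counted_allele (0, 1, 2)."""
    return [g.count(counted_allele) for g in genotypes]


def slope(x, y):
    """Least-squares slope of y regressed on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    if sxx == 0:
        raise ValueError("genotypes are monomorphic; slope is undefined")
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / sxx


def permutation_p_value(dosages, expression, n_perm=10000, seed=0):
    """Two-sided permutation P value for the regression slope."""
    rng = random.Random(seed)
    observed = abs(slope(dosages, expression))
    shuffled = list(expression)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the genotype-expression pairing
        if abs(slope(dosages, shuffled)) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    genotypes = ["CC", "CC", "CT", "CT", "TT", "TT"]   # toy data
    expression = [8.0, 8.2, 8.5, 8.7, 9.0, 9.2]       # toy expression levels
    dosages = encode_additive(genotypes, "T")
    print(slope(dosages, expression))
    print(permutation_p_value(dosages, expression, n_perm=2000))
```

In this toy data each additional T allele raises expression by 0.5 units. Permuting expression values rather than assuming normal residuals keeps the test valid for skewed expression distributions, at the cost of run time.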
SNPs within ANCs are significantly
associated with gene expression
phenotypes.
 Significant SNPs at the 0.01 permutation threshold:
68% of ANC SNPs tested (496 out of 729)
9% of power CNC SNPs tested (1047 out of 11468)
A SNP within an ANC is 7 times more likely to be
associated with gene expression levels than a SNP
within a power CNC.
 Significant elements at the 0.01 permutation threshold:
16% of ANCs tested (59 out of 366)
3% of power CNCs tested (165 out of 5968)
Nucleotide variation within an ANC is 5 times more likely to
be associated with gene expression levels than variation
in a power CNC.
 Tendency for derived alleles within ANCs to be
associated with lower expression levels.
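The fold differences quoted above follow directly from the counts on the slide; the short sketch below reproduces the arithmetic. The variable and function names are illustrative only.

```python
def fraction(significant, tested):
    """Fraction of tested items that are significant."""
    return significant / tested


# Counts taken from the slide (0.01 permutation threshold).
anc_snps = fraction(496, 729)
power_snps = fraction(1047, 11468)
anc_elements = fraction(59, 366)
power_elements = fraction(165, 5968)

snp_fold = anc_snps / power_snps
element_fold = anc_elements / power_elements

if __name__ == "__main__":
    print(f"SNPs: {anc_snps:.1%} vs {power_snps:.1%}, fold = {snp_fold:.2f}")
    print(f"Elements: {anc_elements:.1%} vs {power_elements:.1%}, fold = {element_fold:.2f}")
```

The ratios come out at roughly 7.5 for SNPs and 5.8 for elements, which the slide reports as 7 times and 5 times.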
Summary
 CNCs are not mutation cold spots but selectively
constrained.
 Fast evolving noncoding sequences in the human
lineage have lost this constraint and some are potentially
undergoing positive selection.
 This may have contributed to some recent differentiation
in human populations.
 ANCs are enriched in the most recent segmental
duplications.
 SNPs in ANCs are associated with significant change in
gene expression phenotypes.
Acknowledgements
Thanks to my joint supervisors Emmanouil Dermitzakis and Matthew
Hurles and the members of their teams:
Barbara Stranger
Dan Jeffares
Catherine Ingle
Julian Huppert
Antigone Dimas
Sarah Lindsay
Dan Andrews
Dan Turner
Chris Barnes
Particular thanks to my other co-authors,
 Webb Miller - human-chimpanzee-macaque alignments
 Daryl Thomas - DAF for both phase I and II SNPs
 Maureen Liu - quantifying gene density
The Rhesus Macaque Genome Sequencing Consortium (RMGSC) and the
HapMap consortium for making data available, and the Wellcome Trust and MRC
for funding.
Fig. 3. Phylogenetic tree of
vertebrate species. By using
the generated 27-species
multisequence alignment,
branch lengths were calculated
based on analysis of
synonymous coding positions.
The branch lengths (as
substitutions per synonymous
site) between human and each
species are listed (with
additional pair-wise branch
lengths provided in the
supporting information).
The last common ancestor
among the catarrhine primates
(A) is estimated at 25 mya (36,
37), between the rodents and
primates
(B) at 75 mya (5, 6), between
eutherians and metatherians
(C) at 185 mya (14), between
monotremes and other
therians
(D) at 200 mya (14), and
between mammals and birds
(E) at 310 mya (13).
Margulies et al. PNAS 2005
Proportions of Lineage Specific
Conserved non-coding sequences
Fig. 4. Lineage specificity of
MCSs. The proportion of
nonexonic MCSs found in the
sequences of species in each
category is indicated. Note that
virtually all MCSs overlapping
known exonic sequences are
present in all mammals (data not
shown). All Mammals: cat, dog,
cow, pig, rat, mouse, N.A.
opossum, wallaby, and platypus;
Eutherian: cat, dog, cow, pig, rat,
and mouse; Marsupials: N.A.
opossum and wallaby; and Other:
species combinations containing
2% of the analyzed MCSs (see the
supporting information for the
complete data set). Hashed areas
of ‘‘All Mammals’’ reflect portions
lacking one or both rodents, and
hashed portions of ‘‘Eutherian
Marsupials’’ reflect portions lacking
both rodents.
Margulies et al. PNAS 2005
Distribution of large and small CNCs
(Conserved Non-Coding sequences) and
exons on Hsa21
[Figure: frequency (y-axis, 0 to 400) of exons, big CNCs and small CNCs along the long arm of Hsa21 (x-axis in megabases).]
Big CNCs: 70% ID, 100 bps ungapped
Small CNCs: 85% ID, 35-99 bps ungapped
Dermitzakis et al. Nature 2002
Conservation of CNCs in multiple species
[Figure: number of conserved sequences (x-axis, 0 to 220) found in each species: wallaby, platypus, elephant, shrew, cat, bat, pig, rabbit, lemur, green monkey, mouse and human; conserved blocks are defined between human and mouse.]
Dermitzakis et al. 2003 Science
Drake et al. Nat Genet 2006
Testing DAF spectrum distributions
 Non-parametric distributions of unequal sample size
 Mann-Whitney U-test:
 Compares the median of two populations
 Uses the rank order of values in the two samples.
 Kolmogorov-Smirnov test:
 Measures differences in the entire distributions of two samples in both
shape and location of distributions, but at the cost that it is less sensitive
to differences in location only.
 KS is less powerful with respect to the alternative hypothesis of
differences in location than the Mann-Whitney U-test
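The difference between the two tests can be illustrated by computing both statistics on the same pair of samples. The sketch below is a plain-Python illustration with hypothetical function names: it returns the Mann-Whitney U statistic (using mid-ranks for ties) and the Kolmogorov-Smirnov D statistic, and leaves out P values.

```python
def midranks(values):
    """1-based ranks of values, with tied values given their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def mann_whitney_u(a, b):
    """U statistic for sample a: pairs where a exceeds b, ties counting 0.5."""
    ranks = midranks(list(a) + list(b))
    rank_sum_a = sum(ranks[: len(a)])
    return rank_sum_a - len(a) * (len(a) + 1) / 2


def ks_statistic(a, b):
    """Largest absolute difference between the two empirical CDFs."""
    d = 0.0
    for x in sorted(set(a) | set(b)):
        cdf_a = sum(v <= x for v in a) / len(a)
        cdf_b = sum(v <= x for v in b) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


if __name__ == "__main__":
    constrained = [0.02, 0.05, 0.05, 0.10, 0.30]   # toy DAF values
    neutral = [0.05, 0.20, 0.40, 0.60, 0.90]       # toy DAF values
    print(mann_whitney_u(constrained, neutral))
    print(ks_statistic(constrained, neutral))
```

U depends only on the rank order of the pooled values, so it responds to a shift in location. D looks at the whole cumulative distribution and can also pick up differences in shape.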
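No. of significant CNC to gene associations
Thresholds 0.01, 0.001 and 0.0001 are given first for associations, then for significant CNCs of those tested (count and %).

Population | Set | No. of tested CNCs | No. of SNPs | No. of probes tested | No. of associations | Assoc. 0.01 | Assoc. 0.001 | Assoc. 0.0001 | Sig. CNCs 0.01 | Sig. CNCs 0.001 | Sig. CNCs 0.0001
CEPH | ANC | 387 | 555 | 8673 | 23330 | 77 | 9 | 0 | 59 (15%) | 9 (2%) | 0 (0%)
CEPH | Power | 6232 | 8388 | 14906 | 350309 | 181 | 36 | 18 | 149 (2%) | 33 (1%) | 17 (0%)
CHB | ANC | 356 | 499 | 8092 | 21291 | 83 | 13 | 0 | 56 (16%) | 11 (3%) | 0 (0%)
CHB | Power | 5737 | 7579 | 14893 | 317518 | 202 | 41 | 15 | 159 (3%) | 39 (1%) | 15 (0%)
CHB&JPT | ANC | 342 | 466 | 7919 | 20163 | 109 | 11 | 1 | 59 (17%) | 9 (3%) | 1 (0%)
CHB&JPT | Power | 5474 | 7162 | 14852 | 301636 | 203 | 12 | 1 | 149 (3%) | 12 (0%) | 1 (0%)
JPT | ANC | 355 | 490 | 8197 | 21166 | 88 | 12 | 0 | 59 (17%) | 11 (3%) | 0 (0%)
JPT | Power | 5674 | 7531 | 14852 | 315476 | 241 | 48 | 20 | 194 (3%) | 42 (1%) | 19 (0%)
YRI | ANC | 391 | 583 | 9118 | 24310 | 113 | 15 | 2 | 64 (16%) | 15 (4%) | 2 (1%)
YRI | Power | 6724 | 9218 | 14908 | 381407 | 196 | 32 | 15 | 173 (3%) | 30 (0%) | 14 (0%)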