Evolutionary and genomic approaches to find gene regulatory sequences

Penn State University, Center for Comparative Genomics and
Bioinformatics: Webb Miller, Francesca Chiaromonte, Anton
Nekrutenko, Kateryna Makova, Stephan Schuster, Ross Hardison
University of California at Santa Cruz: David Haussler, Jim Kent
Children’s Hospital of Philadelphia: Mitch Weiss
NimbleGen: Roland Green
University of Nebraska, Lincoln, February 14, 2007
Major goals of comparative genomics
• Identify all DNA sequences in a genome that are
functional
– Selection to preserve function
– Adaptive selection
• Determine the biological role of each functional sequence
• Elucidate the evolutionary history of each type of
sequence
• Provide bioinformatic tools so that anyone can easily
incorporate insights from comparative genomics into their
research
Known types of gene regulatory regions
G.A. Maston, S.K. Evans, M.R. Green (2006) Ann. Rev. Genomics & Human Genetics 7:29-59.
Regulatory regions tend to be clusters of
transcription factor binding sites
Sequence-specific
SV40 promoters and enhancer
Properties of known regulatory regions
• Binding sites for transcription factors, many with
sequence specificity
• Clusters of binding sites
• Conventional promoters encompass major start
sites for transcription
• Conserved over evolutionary time???
Structures involved in transcription are probably
more complex
Middle image:
Green: active transcription (Br-UTP label)
Red: all nucleic acids
HeLa cell
Sides: EM spreads of transcripts
Peter R. Cook, Oxford University,
http://users.path.ox.ac.uk/~pcook/images/Images.html
Domain opening is associated with movement to nonheterochromatic regions
Schubeler, Francastel, Cimbora, Reik, Martin, Groudine (2000) Genes & Dev. 14: 940-950
Other possible activities for sequences
involved in gene regulation
• Opening or closing a chromosomal domain
• Move a gene to or away from a transcription factory
• Control how long a gene is in a transcription factory
– Long association
• High level expression
• Really long gene
– Short association
• Lower level expression
• Rapid regulation
• Are these conserved over evolutionary time?
3 modes of evolution
Sequence matches at longer phylogenetic distances could reflect purifying selection
Sequence differences at closer phylogenetic distances could reflect adaptive evolution.
Conservation vs. Constraint
• Conserved sequences are those that align between two
species thought to be descended from a common
ancestor
• Constrained sequences show evidence in their
alignments of negative (purifying) selection
– E.g. change at a rate significantly slower than “neutral”
DNA
Ideal cases for interpretation
Human vs mouse: similarity along the chromosome, compared with neutral DNA
– Negative (purifying) selection: DNA segments with a function common to divergent species.
Human vs rhesus: similarity and P (not neutral) along the chromosome, compared with neutral DNA
– Positive (adaptive) selection: DNA segments in which change is beneficial to at least one of the two species.
Messages about evolutionary approaches to
predicting regulatory regions
• Regulatory regions are conserved, but not all to the same
phylogenetic distance.
• Incorporation of pattern and composition information
along with conservation can lead to effective
discrimination of functional classes (regulatory potential).
• Regulatory potential in combination with conservation of a
GATA-1 binding motif is an effective predictor of enhancer
activity.
• In vivo occupancy by GATA-1 suggests other activities in
addition to enhancers.
• Comparison of polymorphism and divergence from closely
related species can reveal regulatory regions that are
under recent selection.
Finding all gene regulatory regions is a
challenge for comparative genomics
Known regulatory regions for the HBB complex
23 total
19 conserved (align) between human and mouse
Many others show no significant difference in a measure
of constraint (phastCons) from the bulk or neutral DNA
Two extremes of constraint in TRRs
ENCODE projects
• ENCODE (ENCyclopedia Of DNA Elements): consortium
aiming to find function for all human DNA sequences
– Phase I focused on 1% of human DNA
– 30 Mb, 44 regions
• About 10 regions had known genes of interest (CFTR, HOX)
• Others were chosen to get a sampling of regions varying in gene
density and alignability with mouse
• Major areas
– Genes and transcripts
– Transcriptional regulation
– Chromatin structure
– Multiple sequence alignment
– Variation in human populations
Biochemical assays for protein-binding sites in DNA
Purified protein
& Naked DNA
Chromatin Immunoprecipitation:
DNA sites occupied by a protein
inside cells.
ChIP-on-chip to examine many sites
Putative transcriptional regulatory regions = pTRRs
• Antibodies vs 10 sequence-specific factors:
– Sp1, Sp3, E2F1, E2F4, cMyc, STAT1, cJun, CEBPe, PU1, RA Receptor A
– High resolution ChIP-chip platforms: Affymetrix and NimbleGen
– Data from several different labs in ENCODE consortium
• High likelihood hits for ChIP-chip
– 5% false discovery rate
• Supported by chromatin modification data
– Modified histones in chromatin: H4Ac, H3Ac, H3K4me, H3K4me2,
H3K4me3, etc.
– DNase hypersensitive sites (DHSs) or nucleosome depleted sites
• Result: set of 1369 pTRRs
A small fraction of cis-regulatory modules are
conserved from human to chicken
• About 4% of pTRRs, 4% of DNase HSs, 4-7% of promoters active in multiple cell lines
• Tend to regulate genes whose products control transcription and development
[Figure: phylogenetic tree with divergence times in millions of years: 91, 173, 310, 450]
David King
Most pTRRs are conserved in eutherian mammals
Percentage of class that align no further than:

              pTRRs    DNase HSs    Promoters
Primates        3%       11%         1-13%
Eutherians     71%       70%         63%
Marsupials     21%       14%         16-28%
Tetrapods       4%        4%         4-7%
Vertebrates     1%        1%         2-4%

[Figure: phylogenetic tree with divergence times in millions of years: 91, 173, 310, 450]
Within aligned noncoding DNA of eutherians, need to distinguish constrained
DNA (purifying selection) from neutral DNA.
Measures of conservation and constraint
capture only a subset of pTRRs
• Fraction overlapping an MCS: stringent constraint
• phastCons (background rate corrected): allows a range of constraint
• Composite alignability (background rate corrected): aligns, but no inference about purifying selection
Different measures perform better on specific functional regions
[Plot axes: sensitivity vs. 1-specificity]
Examples of clade-specific pTRRs
Regulatory potential (RP) to distinguish
functional classes
Good performance of ESPERR for gene
regulatory regions (RP)
James Taylor, Francesca Chiaromonte
Conservation of predicted binding sites for
transcription factors
Binding site for GATA-1
Genes Co-expressed in Late Erythroid Maturation
G1E-ER cells: proerythroblast line lacking the transcription factor GATA-1.
Can rescue by expressing an estrogen-responsive form of GATA-1
Rylski et al., Mol Cell Biol. 2003
Predicted cis-Regulatory Modules (preCRMs)
Around Erythroid Genes
B:Yong Cheng, Ross, Yuepin Zhou, David King
F:Ying Zhang, Joel Martin, Christine Dorman, Hao
Wang
preCRMs with conserved consensus GATA-1 BS
tend to be active on transfected plasmids
preCRMs with conserved consensus GATA-1
BS tend to be active after integration into a
chromosome
Examples of validated preCRMs
Correlation of Enhancer Activity with RP Score
Validation status for 99 tested fragments
preCRMs with High RP and Conserved
Consensus GATA-1 Tend To Be Validated
CACC box helps distinguish validated from
nonvalidated preCRMs
All validated preCRMs
Same parameters
All nonvalidated preCRMs
Compare the outputs
Consensus for EKLF binding site:
CCNCMCCCW
Ying Zhang
CCNCMCCCW
CCNCMCCCW
preCRMs with conserved consensus GATA-1 binding
sites are usually occupied by that protein: ChIP assay
Design of ChIP-chip for occupancy by GATA-1
1. Non-overlapping tiling array with 50 bp probes and 100 bp resolution (NimbleGen)
2. Coverage: mouse chr7:57225996-123812258 (~70 Mbp)
3. Antibody against the ER portion of GATA-1-ER protein in rescued G1E-ER4 cells
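The array layout in item 1 can be illustrated with a minimal sketch. It assumes that "50 bp probe and 100 bp resolution" means one 50 bp probe starting every 100 bp, and it ignores any repeat masking or probe-quality filtering a real design would apply. The function name `tile_probes` is hypothetical.

```python
def tile_probes(start, end, probe_len=50, step=100):
    """Yield (probe_start, probe_end) in 1-based inclusive coordinates.

    Assumption: one probe of probe_len bp begins every `step` bp,
    and a probe is kept only if it fits entirely inside [start, end].
    """
    pos = start
    while pos + probe_len - 1 <= end:
        yield pos, pos + probe_len - 1
        pos += step


# Coordinates of the tiled interval as given on the slide (mouse chr7).
REGION_START, REGION_END = 57225996, 123812258

region_len = REGION_END - REGION_START + 1
n_probes = sum(1 for _ in tile_probes(REGION_START, REGION_END))

print(f"region length: {region_len} bp")
print(f"probes under this layout: {n_probes}")
```

Under these assumptions the interval is about 66.6 Mbp, consistent with the rounded "~70 Mbp" above.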
Yong Cheng, with Mitch Weiss & Lou Dore
(CHoP), Roland Green (NimbleGen)
Signals in known occupied sites in Hbb LCR
[Figure panels: HS1, HS2, HS3]
1) Cluster of high signals
2) “Hill” shape of the signals
Peak Finding Programs
• TAMALPAIS
Mark Bieda from Peggy Farnham’s lab
Focus more on the cluster of the signals
4 thresholds based on number of consecutive probes
with signals in the 98th or 95th percentiles
• MPEAK
Bing Ren’s lab
Focus more on the “hill” shape of the signal
4 thresholds, for a series of probes with at least one that
is 3, 2.5, 2 or 1 standard deviations above the mean
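The "cluster of signals" idea can be sketched in a few lines. This is not the TAMALPAIS or MPEAK code; it is a simplified illustration that calls a hit wherever a run of consecutive probes lies strictly above a chosen percentile. The names `percentile` and `call_runs`, the nearest-rank percentile definition, and the `min_run` values are assumptions made for the example.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile (pct between 0 and 100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def call_runs(signals, pct=98, min_run=6):
    """Return (first_index, last_index) for each run of at least min_run
    consecutive probes whose signal is strictly above the pct-th percentile."""
    cutoff = percentile(signals, pct)
    hits, run_start = [], None
    # A trailing -inf sentinel closes a run that reaches the last probe.
    for i, s in enumerate(list(signals) + [float("-inf")]):
        if s > cutoff:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if i - run_start >= min_run:
                hits.append((run_start, i - 1))
            run_start = None
    return hits


# Toy data: low background with five strong consecutive probes at 50..54.
signals = [0.01 * (i % 20) for i in range(100)]
for i in range(50, 55):
    signals[i] = 5.0

hits_loose = call_runs(signals, pct=95, min_run=3)
hits_strict = call_runs(signals, pct=95, min_run=6)
print(hits_loose, hits_strict)
```

Requiring a longer run makes the call more stringent, which is the role the multiple thresholds play in the programs above.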
ChIP-chip hits for GATA-1 occupancy
Technical replicates of ChIP-chip
with antibody against GATA1-ER
Mpeak: 275 hits in both
TAMALPAIS: 276 hits in both
[Venn diagram counts: 59, 216, 60]
321 total ChIP-chip hits
ChIP-chip hits validate at a high rate
Validation determined by quantitative PCR.
19 of the 321 hits were tested.
13 (~70%) were validated.
ChIP DNA
Validation rate is similar at different thresholds
9 regions were “hits” in only one of the two technical replicates.
None were validated.
Association of WGATAR and conservation
with ChIP-chip Hits
1. 249 out of the 321 (78%) have WGATAR
motifs, binding site for GATA-1
2. Of the GATA-1 binding motifs in those 249
hits, 112 (45%) are conserved between
mouse and at least one non-rodent species.
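Searching a sequence for WGATAR can be sketched as follows. The IUPAC codes are standard (W = A or T, R = A or G, M = A or C, N = any base), but the function names and the toy sequence are hypothetical, and this sketch does not check conservation in other species.

```python
import re

# Standard IUPAC nucleotide codes used by the motifs in this talk.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "[AT]", "R": "[AG]", "M": "[AC]", "Y": "[CT]",
    "S": "[CG]", "K": "[GT]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def iupac_to_regex(motif):
    """Translate an IUPAC motif such as WGATAR into a regular expression."""
    return "".join(IUPAC[ch] for ch in motif)


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def find_motif(seq, motif="WGATAR"):
    """Return sorted (start, strand) pairs for matches on either strand.

    start is 0-based on the forward strand; a lookahead is used so that
    overlapping matches are all reported.
    """
    seq = seq.upper()
    pattern = re.compile("(?=(" + iupac_to_regex(motif) + "))")
    n, k = len(seq), len(motif)
    hits = [(m.start(), "+") for m in pattern.finditer(seq)]
    hits += [(n - m.start() - k, "-")
             for m in pattern.finditer(reverse_complement(seq))]
    return sorted(hits)


toy = "CCAGATAAGGCTTATCTGG"  # made-up sequence with one site per strand
print(find_motif(toy))
print(iupac_to_regex("CCNCMCCCW"))  # EKLF consensus from an earlier slide
```

The same translation handles the EKLF consensus CCNCMCCCW, so one helper covers both motifs discussed in the talk.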
Expected and unexpected ChIP-chip hits
Distribution of ChIP-chip hits on 70Mb of
mouse chr7
Yong Cheng, Yuepin Zhou and Christine Dorman
Almost half the GATA-1 ChIP-chip hits increase expression of a transgene in K562 cells
[Figure: bar chart of fold change over parent for each tested fragment, grouped as GATA-1 occupied sites by ChIP-chip vs. no GATA-1; individual fragment labels not recoverable]
24 validated out of 56 fragments with ChIP-chip hits tested (43%)
Conserved and nonconserved ChIP-chip
hits can be active as enhancers
Conserved, active
Conserved, not active
Not conserved, active
Polymorphism as a transient phase of evolution
Slide from Dr. Hiroshi Akashi
Test of neutrality using polymorphism and
divergence data
Test for recent selection in human noncoding DNA
• McDonald-Kreitman test
• Use ancestral repeats as neutral model (MKAR test)
• Count polymorphisms in human using dbSNP126
• Count divergence of human from
– Chimpanzee (great ape, diverged from human lineage 6 Myr ago)
– Rhesus macaque (Old World monkey, diverged from human lineage 23 Myr ago)
• Tiled windows, most analysis on 10 kb windows
• Compute p-value for neutrality by chi-square test
• Ratio of the polymorphism-to-divergence ratios gives an indication of the
direction of inferred selection
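The test described in this list can be sketched as a 2x2 chi-square comparison of polymorphism and divergence counts in a window against the same counts in ancestral repeats. The sketch below is an illustration, not the MKAR implementation: the function name `mk_test`, the example counts, and the absence of a continuity correction are all assumptions. The p-value uses the exact relation for one degree of freedom, p = erfc(sqrt(chi2 / 2)).

```python
import math


def mk_test(poly_win, div_win, poly_neutral, div_neutral):
    """Chi-square test (1 d.f., no continuity correction) on the 2x2 table

                    polymorphism   divergence
        window        poly_win      div_win
        neutral       poly_neutral  div_neutral

    Returns (chi2, p_value, ratio), where ratio is
    (poly_win / div_win) / (poly_neutral / div_neutral).
    """
    a, b, c, d = poly_win, div_win, poly_neutral, div_neutral
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0 or b == 0 or c == 0 or d == 0:
        raise ValueError("counts too sparse for this sketch")
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / denom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    ratio = (a / b) / (c / d)
    return chi2, p_value, ratio


# Made-up counts for one window and for ancestral repeats.
chi2, p_value, ratio = mk_test(10, 40, 100, 100)
print(f"chi2={chi2:.2f}  p={p_value:.2g}  ratio={ratio:.2f}")
```

In the usual McDonald-Kreitman reading, which is assumed here rather than stated on the slide, a ratio below 1 (excess divergence) points toward positive selection and a ratio above 1 (excess polymorphism) points toward negative selection.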
Heather Lawson, Anthropology, PSU
pTRR apparently under positive selection
A promoter distal to the beta-like globin genes
has a signal for recent purifying selection
Selection on a primate-specific promoter
The distal promoter is close to the locus control
region for beta-globin genes
Many thanks …
B:Yong Cheng, Ross, Yuepin Zhou, David King
F:Ying Zhang, Joel Martin, Christine Dorman, Hao
Wang
PSU Database crew: Belinda Giardine,
Cathy Riemer, Yi Zhang, Anton Nekrutenko
RP scores and other bioinformatic input:
Francesca Chiaromonte, James Taylor, Shan Yang, Diana Kolbe, Laura Elnitski
Alignments, chains, nets, browsers, ideas, …
Webb Miller, Jim Kent, David Haussler
Funding from NIDDK, NHGRI, Huck Institutes of Life Sciences at PSU
Computing Regulatory Potential (RP)
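Alignment and collapsed alphabet (one column per alignment position):

seq1                G T A C C T A T A C G C A
seq2                G T G T C G ? A G C C C A
seq3                A T G T C A C A A T G T A
Collapsed alphabet  1 2 1 3 4 5 7 6 8 3 6 3 9

(? = character not legible in the source slide)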
• A 3-way alignment has 124 types of columns. Collapse these to a smaller alphabet
with characters s (for example, 1-9).
• Train two order-t Markov models for the probability that t alignment columns are followed
by a particular column in training sets:
– positive (alignments in known regulatory regions)
– negative (alignments in ancestral repeats, a model for neutral DNA)
– E.g. frequency that 3 4 is followed by 5:
0.001 in regulatory regions
0.0001 in ancestral repeats
• RP of any 3-way alignment is the sum of the log likelihood ratios of finding the strings of
alignment characters in known regulatory regions vs. ancestral repeats.
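The scoring rule in the last bullet can be sketched with two small Markov models estimated by counting. This is an illustration only: the training strings are invented, the add-one pseudocount and the natural logarithm are assumptions, and `train_markov` and `rp_score` are hypothetical names rather than the published implementation.

```python
import math
from collections import defaultdict


def train_markov(sequences, order, alphabet, pseudocount=1.0):
    """Estimate P(symbol | previous `order` symbols) by counting.

    A pseudocount (an assumption of this sketch) keeps unseen
    transitions from producing zero probabilities.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[tuple(seq[i - order:i])][seq[i]] += 1

    def prob(symbol, context):
        ctx = counts.get(tuple(context), {})
        total = sum(ctx.values()) + pseudocount * len(alphabet)
        return (ctx.get(symbol, 0.0) + pseudocount) / total

    return prob


def rp_score(segment, p_reg, p_ar, order):
    """Sum of log likelihood ratios over positions with a full context."""
    return sum(
        math.log(p_reg(segment[i], segment[i - order:i])
                 / p_ar(segment[i], segment[i - order:i]))
        for i in range(order, len(segment))
    )


ALPHABET = list(range(1, 10))            # collapsed symbols 1-9
reg_training = [[3, 4, 5, 3, 4, 5, 3, 4, 5]]   # toy "regulatory" strings
ar_training = [[3, 4, 1, 3, 4, 2, 3, 4, 6]]    # toy "ancestral repeat" strings

p_reg = train_markov(reg_training, 2, ALPHABET)
p_ar = train_markov(ar_training, 2, ALPHABET)
score = rp_score([3, 4, 5], p_reg, p_ar, 2)
print(f"RP for 3 4 5: {score:.3f}")
```

With the example frequencies in the list above, the single transition "3 4 followed by 5" would contribute log(0.001 / 0.0001) = log 10 to the score, a positive term.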
\[
\mathrm{RP} = \sum_{a \,\in\, \text{segment}} \log \left( \frac{p_{\mathrm{REG}}(s_a \mid s_{a-1} \ldots s_{a-t})}{p_{\mathrm{AR}}(s_a \mid s_{a-1} \ldots s_{a-t})} \right)
\]
ESPERR: Evolutionary Sequence and Pattern Extraction using Reduced Representations
Stage 1: Reduced representations
Stage 2: Improve encoding
Train models for classification
6 6 2 may occur frequently in the positive training set and rarely in the negative
training set, and thus contribute to discrimination.
If the positive training set is known regulatory regions, this would
contribute to a positive RP.
Note that many different columns are reduced to a single “encoding” (a number in the
figure). E.g. four different columns are each called “3”.
Categories of Tested DNA Segments
Example that suggests turnover
GATA-1 BSs
Additional methods find CACC box as distinctive for
validation

CLOVER (Zlab)
Background: mouse chr 19 (42.8% C+G), NCBI Build 30
EKLF PWM (Dr. Perkins)

Input                       Motif    P(mm_chr19.m)
All validated preCRMs       EKLF     0.0008
All nonvalidated preCRMs    none     none

ELPH (UMaryland)

          validated    non-validated
6-mer     TTATYT       GGCAGR
7-mer     CCWCAGM      RGRCAGR
8-mer     CASCCWGC     CAGGGAWR
9-mer     CCWGGCWGM    CWGRGAWRA

Hexamer counting

          counts                     expected
          validated  nonvalidated    validated  nonvalidated
NCACCC    60         32              16.31      5.81
CACCCW    56         27              11.74      4.36
Using Galaxy to find predicted CRMs