Gill: Functional Genomics - A computational tour of the human genome

advertisement
CS273A
Lecture 16: Functional Genomics
MW 12:50-2:05pm in Beckman B302
Profs: Serafim Batzoglou & Gill Bejerano
TAs: Harendra Guturu & Panos Achlioptas
http://cs273a.stanford.edu [BejeranoFall13/14]
1
Gene set enrichment analysis:
The gene regulatory version
http://cs273a.stanford.edu [BejeranoFall13/14]
2
Cluster all genes for differential expression
Experiment Control
(replicates) (replicates)
genes
Most significantly up-regulated genes
Unchanged genes
Most significantly down-regulated genes
http://cs273a.stanford.edu [BejeranoFall13/14]
3
The Gene Sets to test come from the Ontologies
TJL-2004
4
Ask about whole gene sets
+
Exper. Control
Gene
Set 1
Gene
Set 2
Gene
Set 3
Gene set 3
up regulated
ES/NES statistic
Gene set 2
down regulated
http://cs273a.stanford.edu [BejeranoFall13/14]
Combinatorial Regulatory Code
2,000 different proteins can bind specific DNA sequences.
Proteins
DNA
Protein binding site
Gene
DNA
A regulatory region encodes 3-10 such protein binding sites.
When all are bound by proteins the regulatory region turns “on”,
and the nearby gene is activated to produce protein.
http://cs273a.stanford.edu [BejeranoFall13/14]
6
ChIP-Seq: first glimpses of the
regulatory genome in action
Peak Calling
Cis-regulatory peak
http://cs273a.stanford.edu [BejeranoFall13/14]
7
What is the transcription factor
I just assayed doing?
• Collect known literature of the form
• Function A: Gene1, Gene2, Gene3, ...
• Function B: Gene1, Gene2, Gene3, ...
• Function C: ...
• Ask whether the binding sites you discovered are
preferentially binding (regulating) any one or more
of the functions listed above.
• Form hypothesis and perform further experiments.
Cis-regulatory peak
Gene transcription start site
http://cs273a.stanford.edu [BejeranoFall13/14]
8
Example: inferring functions of Serum Response
Factor (SRF) from its ChIP-seq binding profile
Gene transcription start site
SRF binding ChIP-seq peak
• ChIP-seq identified 2,429 SRF binding peaks
in human Jurkat cells1
• SRF is known as a “master regulator of the actin cytoskeleton”
• In the ChIP-Seq peaks, we expect to find binding sites
regulating (genes involved in) actin cytoskeleton formation.
http://cs273a.stanford.edu [BejeranoFall13/14]
9
Example: inferring functions of Serum Response
Factor (SRF) from its ChIP-seq binding profile
Gene transcription start site
SRF binding ChIP-seq peak
Ontology term (e.g. ‘actin cytoskeleton’)
Existing, gene-based method to analyze enrichment:
• Ignore distal binding events.
• Count affected genes.
• Rank by enrichment
hypergeometric p-value.
N
K
n
k
= 8 genes in genome
= 3 genes annotated with
= 2 genes selected by proximal peaks
= 1 selected gene annotated with
P = Pr(k ≥1 | n=2, K =3, N=8)
http://cs273a.stanford.edu [BejeranoFall13/14]
10
We have (reduced ChIP-Seq into) a gene list!
What is the gene list enriched for?
Pro: A lot of tools out there for
the analysis of gene lists.
Cons: These tools are built
for microarray analysis.
Does it matter ??
Microarray
data
Microarray
data
Gene
regulation
data
Microarray tool
http://cs273a.stanford.edu [BejeranoFall13/14]
11
SRF Gene-based enrichment results
• Original authors can only state: “basic cellular processes,
particularly those related to gene expression” are enriched1
SRF
SRF acts on genes both in
nucleus and cytoplasm, that
are involved in transcription
and various types of binding
Where’s the signal?
Top “actin” term is
ranked #28 in the list.
SRF
[1] Valouev A. et al., Nat. Methods, 2008
http://cs273a.stanford.edu [BejeranoFall13/14]
12
Associating only proximal
peaks loses a lot of information
Relationship of binding peaks to nearest genes for
eight human (H) and mouse (M) ChIP-seq datasets
SRF (H: Jurkat)
NRSF (H: Jurkat)
GABP (H: Jurkat)
Stat3 (M: ESC)
p300 (M: ESC)
p300 (M: limb)
p300 (M: forebrain)
p300 (M: midbrain)
Fraction of all elements
0.7
0.6
Restricting to
proximal peaks
often leads to
complete loss of
key enrichments
0.5
0.4
0.3
0.2
0.1
0
0-2
2-5
5-50
50-500
> 500
Distance to nearest transcription start site (kb)
http://cs273a.stanford.edu [BejeranoFall13/14]
13
Bad Solution: Associating distal peaks
brings in many false enrichments
Why bad? 14% of human genes tagged ‘multicellular organismal development’.
But 33% of base pairs have such a gene nearest upstream/downstream.
SRF ChIP-seq set has >2,000 binding events.
Throw a random set of 2,000 regions at the genome.
What do you get from a gene list analysis?
Term
Bonferroni
corrected p-value
nervous system development
5x10-9
system development
8x10-9
anatomical structure development
7x10-8
multicellular organismal development 1x10-7
developmental process
2x10-6
http://cs273a.stanford.edu [BejeranoFall13/14]
Large “gene deserts” are often
next to key developmental genes
14
Real Solution: Do not convert to gene list.
Analyze the set of genomic regions
Gene transcription start site
Ontology term ( ‘actin cytoskeleton’)
Gene regulatory domain
Genomic region (ChIP-seq peak)
p = 0.33 of genome annotated with
n = 6 genomic regions
k = 5 genomic regions hit annotation
GREAT = Genomic Regions
Enrichment of Annotations Tool
P = Prbinom(k ≥5 | n=6, p =0.33)
Since 33% of base pairs are near a ‘multicellular organismal development’
gene, we now expect 33% of genomic regions to hit this term by chance.
=> Toss 2,000 random regions at genome, get NO (false) enrichments.
http://cs273a.stanford.edu [BejeranoFall13/14]
15
How does GREAT know how to assign
distal binding peaks to genes?
Future: High-throughput assays
based on chromosome
conformation capture (3C)
methods will elucidate complex
regulation mechanisms
Currently: Flexible computational definitions allow assignment of
peaks to nearest gene, nearest two genes, etc.
• Default: each gene has a “basal regulatory domain” of 5 kb up- and 1kb downstream
of transcription start site, extends to basal domain of nearest genes within 1 Mb
• Though some associations may be missed or incorrect, in
general signal richness and robustness is greatly improved by
associating distal peaks
http://cs273a.stanford.edu [BejeranoFall13/14]
16
GREAT infers many specific functions
of SRF from its binding profile
Top GREAT enrichments of SRF
Top gene-based
enrichments of SRF
Ontology
Term
# Genes Binomial Experimental
P-value
support*
30
31
7x10-9
5x10-5
Miano et al. 2007
Pathway
Commons
TRAIL signaling
32
Class I PI3K signaling 26
5x10-7
2x10-6
Bertolotto et al. 2000
TreeFam
FOS gene family
5
1x10-8
Chai & Tarnawski
2002
TF Targets
Targets of SRF
Targets of GABP
Targets of YY1
Targets of EGR1
84
28
44
23
5x10-76
4x10-9
1x10-6
2x10-4
Positive control
Gene Ontology actin cytoskeleton
actin binding
(top actin-related
term 28th in list)
Miano et al. 2007
Poser et al. 2000
ChIp-Seq support
Natesan & Gilman
1995
* Known from literature – as in function is known, SOME of the
genes are known, and the binding sites highlighted are NOT.
Similar results for GABP, NRSF, Stat3, p300 ChIP-Seq
[McLean et al., Nat Biotechnol., 2010]
http://cs273a.stanford.edu [BejeranoFall13/14]
17
Limb P300: I was blind and I can see
Gene List
http://cs273a.stanford.edu [BejeranoFall13/14]
18
GREAT works with ANY cis-regulatory rich set
Example: GWAS Compendium set
Heightassociated
unlinked
SNPs
http://cs273a.stanford.edu [BejeranoFall13/14]
19
GREAT analysis of histone mark combinations
http://cs273a.stanford.edu [BejeranoFall13/14]
20
GREAT includes multiple ontologies
• Twenty ontologies spanning broad categories of biology
• 44,832 total ontology terms tested in each GREAT run
(2,800 terms)
(5,215)
(834)
(6,700)
(3,079)
(911)
(5,781)
(427)
(456)
(150)
(1,253)
(288)
(706)
http://cs273a.stanford.edu [BejeranoFall13/14]
(615)
(19)
(222)
(9)
(6,857)
(8,272)
(238)
Michael Hiller
21
Advantages of the GREAT approach
Tailored to the biology of gene regulation:
• Distal sites are incorporated, not ignored
• Variable length gene regulatory domains
• Multiple bindings next to same target gene rewarded
• Binding sites associated to (both) TSS, not gene body
• Extensive ontologies, some home-made
http://cs273a.stanford.edu [BejeranoFall13/14]
22
Algorithmic Optimization: A it works; B make it efficient
http://cs273a.stanford.edu [BejeranoFall13/14]
23
enter GREAT.stanford.edu
Choose genome
Input peak list
Hit submit!
http://cs273a.stanford.edu [BejeranoFall13/14]
24
GREAT web app:
(Optional): alter association rules
http://great.stanford.edu
Three association
rule choices
Lnp
Evx2 HoxD cluster
Literature-curated
domains for a small
subset of genes
http://cs273a.stanford.edu [BejeranoFall13/14]
[adapted from Spitz, Gonzalez, & Duboule, Cell, 2003]
25
GREAT web app: output summary
Additional ontologies,
term statistics,
multiple hypothesis
corrections, etc.
Ontology-specific
enrichments
Cool visualization
opportunities!
http://cs273a.stanford.edu [BejeranoFall13/14]
26
GREAT web app: term details page
Genes annotated as “actin
binding” with associated
genomic regions
Genomic regions annotated
with “actin binding”
Drill down to explore how a particular peak
regulates Plectin and its role in actin binding
http://cs273a.stanford.edu [BejeranoFall13/14]
27
You can also submit any track
straight from the UCSC Table Browser
A simple, well documented
programmatic interface allows
any tool to submit directly to GREAT.
(See our Help / Inquiries welcome!)
http://cs273a.stanford.edu [BejeranoFall13/14]
28
GREAT web app: export data
HTML output displays all user selected rows and columns
Tab-separated values also available for additional postprocessing
http://cs273a.stanford.edu [BejeranoFall13/14]
29
GREAT Web Stats
200-400 job submissions per day, from 8,000 IP addrs
Over 175,000 jobs served, hundreds of citations
http://cs273a.stanford.edu [BejeranoFall13/14]
30
Adding a new species to GREAT
We need:
1. A good assembly
2. A high quality gene set
3. Good gene annotations*
*Most valuable for species with independent annotations!
http://cs273a.stanford.edu [BejeranoFall13/14]
31
Test case: early neocortex development
E12.5
E14.5
E16.5
Dissected the dorsal cerebral wall
We Performed p300, H3K27Ac ChIP-seq. What have we learned?
http://cs273a.stanford.edu [BejeranoFall13/14]
32
Collect, Call Peaks: Mostly Distal
Did the experiment work?
http://cs273a.stanford.edu [BejeranoFall13/14]
33
In silico validation
Most enriched terms: what you want,
where & when you want it
http://cs273a.stanford.edu [BejeranoFall13/14]
34
Are all cell populations represented?
SVZ-IZ
CP
VZ
[Ayoub et al., PNAS. 2011]
http://cs273a.stanford.edu [BejeranoFall13/14]
What are different enhancer groups doing?
http://cs273a.stanford.edu [BejeranoFall13/14]
36
Which genes are most regulated?
[Ayoub et al, 2011]
http://cs273a.stanford.edu [BejeranoFall13/14]
37
The most densely regulated genes
http://cs273a.stanford.edu [BejeranoFall13/14]
38
Mn1 – Cryba4 gene desert
http://cs273a.stanford.edu [BejeranoFall13/14]
39
ChIP-seq is a powerful technique
crosslink
ChIP-seq lessons:
Transcription factors (TF) bind next
to many relevant target genes.
Example: SRF regulates actin cytoskeleton.
Binds near 40 actin genes in immune cells.
Most binding is distal.
Ex: 75% of SRF binding sites near actin genes
well over 1kb from TSS (upto 300kb).
Binding is context specific.
Ex: SRF also regulates muscle development.
SRF is not enriched near muscle genes, when
assayed in immune cells.
fragment
antibody
immunoprecip.
high throughput
sequencing
map to genome
+ find peaks
Many TF functions yet undiscovered.
http://cs273a.stanford.edu [BejeranoFall13/14]
40
How can we discover new TF functions?
Exciting new technologies: designer genome editing.
Which TF to knock out next? Where to look for its effects?
ChIP-seq requires
• Quality antibody
• Enough cells of right context
• Sequencing costs
 Seldom used in exploratory mode
(ENCODE sampled 100 mostly basal TFs in very few contexts)
TF function exploration: expression perturbations, genetics screens.
Which TF next? Where to look? Educated guesses.
Goal: Devise a rapid method for TF function prediction
http://cs273a.stanford.edu [BejeranoFall13/14]
41
How can we discover new TF functions?
Observations:
1. Gene function annotation is very rich.
(Examples: GO, MGI expression & phenotype, etc. etc.)
2. Conserved non-coding, likely cis-regulatory, sequence
makes up 5-10% of the mammalian genomes.
3. Conserved binding site prediction has become accurate.
(Predict only with strong evidence, minimize false positives)
Idea:
Use transcription factor tendency to bind next to many
context-relevant genes to make function predictions.
http://cs273a.stanford.edu [BejeranoFall13/14]
42
Transcription Factor Function Prediction
1. Curate a rich library of TF binding site motifs.
Pro: hundreds are now known. (inc. from Tim, Jussi’s work)
2. Predict cross-species conserved binding sites.
Pro: careful conserved binding site prediction is accurate.
3. Search for extreme binding site concentration
next to genes of any particular function.
Pro: Leverage the observed phenomenon, and the very large
body of knowledge about all (target) genes in the genome.
Con: You will not get all binding sites of any function.
Pro: You may predict many diverse TF functions.
SRF
http://cs273a.stanford.edu [BejeranoFall13/14]
Predict SRF to regulate:
Actin cytoskeleton,
Muscle development, ...
43
This is what we did: 1. Build motif library
300 motifs
332 motifs for
300 transcription factors
(updating for many more)
http://cs273a.stanford.edu [BejeranoFall13/14]
44
2. Predict conserved binding sites
. = same as human
•
•
•
•
Human
Chimp
Gorilla
Orangutan
Rhesus
Tarsier
Mouse lemur
Bushbaby
Tree shrew
Mouse
Guinea pig
Squirrel
Rabbit
Alpaca
Cow
Cat
Microbat
Megabat
Hedgehog
Rock hyrax
Tenrec
Armadillo
Sloth
TTTCCCTTAAAAGGCTTAAATAAACTCACCAGTGTTTAATT
.........................................
.........................................
...T.....................................
...T.....................................
...T..............G...........G......T...
...T............C........AT..TG.....C....
...T................C....AT...G.....C....
...T.........................TG..........
...T.....................C...TG.....G.G..
...T.........................TG........CG
...T.........................TG..........
...T......................T...G......G...
...T.....................................
...T........................TTG..........
...T..........................GAC.......A
...T..........................C..........
...T......................T.-.CA.........
...T...............G.....................
...T.....................................
...T.....................................
...T.........................TG..........
...T.........................TG..........
We in fact allow:
•
Imperfect motif matches
•
Binding site / alignment wobble
•
Subset of species support
Guard against alignment fragmentations
Predict efficiently
Improve state of the art using
“Excess conservation” scoring
http://cs273a.stanford.edu [BejeranoFall13/14]
45
Compare to shuffled motifs & weed out!
SRF motif
shuffle #1
shuffle #2
shuffle #3
shuffle #4
…
shuffle #10
http://cs273a.stanford.edu [BejeranoFall13/14]
46
Use three reference genomes
rich in gene function annotations
Predicted for human, for mouse and for zebrafish
http://cs273a.stanford.edu [BejeranoFall13/14]
47
3. Predict binding site/TF functions
MYOG
MYF6
BMP4
NKX2-5
gene transcription start site
SRF binding site
CAV3
ACTA1
Enhancer to gene association
 SRF must regulate muscle structure development:
predicted to bind next to 157 genes (p=7.43×10-41)
http://cs273a.stanford.edu [BejeranoFall13/14]
48
Conservative multiple testing correction
• Control false positives when predicting
function for hundreds of transcription factors:
SRF motif
shuffle #1
shuffle #2
shuffle #3
shuffle #4
shuffle #10
Remove terms with
E[occurrences] ≥ 1
kidney morphogenesis
absent semicircular canals
dorsal spinal cord development
…
Retains only 5% of GREAT predictions.
2,543 transcription factor to function links (16% FDR)
http://cs273a.stanford.edu [BejeranoFall13/14]
49
PRISM vs. ChIP-seq
Actin cytoskeleton
SRF
Term
PRISM
T cell
ChIP-seq
actin cytoskeleton
Known
structural constituent of muscle
Known
dilated heart ventricles
Known
regulation of insulin secretion
Novel
muscle
SRF
http://cs273a.stanford.edu [BejeranoFall13/14]
heart
50
PRISM vs. ChIP-seq
Actin cytoskeleton
SRF
Term
PRISM
T cell
ChIP-seq
actin cytoskeleton
Known
structural constituent of muscle
Known
dilated heart ventricles
Known
regulation of insulin secretion
Novel
muscle
Every known function is supported
SRF
by dozens or hundreds
of novel binding
heartsites.
http://cs273a.stanford.edu [BejeranoFall13/14]
51
PRISM re-discovers known functions
TF
SRF
function
muscle structure development
p-value
7.43×10-41
target genes
157
GLI2
skeletal system development
7.07×10-48
192
Is the number of re-discovered known
functions impressive?
CRX
retinal photoreceptor degeneration 1.30×10-10
AR
abnormal spermiogenesis
1.19×10-6
http://cs273a.stanford.edu [BejeranoFall13/14]
34
26
52
PRISM re-discovers many known functions
Genes involved in
“muscle structure development”
SRF
http://cs273a.stanford.edu [BejeranoFall13/14]
53
Distal binding sites are important
3.6×
http://cs273a.stanford.edu [BejeranoFall13/14]
54
PRISM predicts many novel functions
TF
MYF6
GABPA
GATA6
…
function
abnormal pancreas development
transcription by RNA polymerase I
abnormal pancreas development
p-value
1.67×10-10
3.64×10-11
5.69×10-13
target genes
21
10
23
Nature Genetics,
Dec 2011
Function prediction word cloud
(word size correlates with its frequency in our function predictions)
http://cs273a.stanford.edu [BejeranoFall13/14]
55
PRISM works: Experimental validation
TF
MYF6
function
abnormal pancreas development
p-value
1.67×10-10
http://cs273a.stanford.edu [BejeranoFall13/14]
target genes
21
56
PRISM is built for the experimentalist
http://PRISM.stanford.edu
Search:
Search for:
• Human,
•
•
•
•
• Mouse, or
• Zebrafish
Transcription factor: e.g. “MYOG”,
Function: e.g. “insulin”,
Target gene: e.g. “ACTA1”, or
Target genomic region: e.g. 9p21
http://cs273a.stanford.edu [BejeranoFall13/14]
57
Download