ChIP-seq Methods & Analysis

Gavin Schnitzler
Asst. Prof. Medicine TUSM, Investigator at MCRI, TMC
gschnitzler@tuftsmedicalcenter.org
617-636-0615
ChIP-seq COURSE OUTLINE
• Day 1: ChIP techniques, library production,
UCSC browser tracks
• Day 2: QC on reads, Mapping binding site
peaks, examining read density maps.
• Day 3: Analyzing peaks in relation to
genomic features, etc.
• Day 4: Analyzing peaks for transcription
factor binding site consensus sequences.
• Day 5: Variants & advanced
approaches.
Day 5 Outline
• Introduction to variations on ChIP-seq
methods
• Extensions & variations on TFBS analysis
• Analyzing published data & across
platforms
• Downloading & installing programs
• Writing your own programs
Next-Generation Sequencing Analysis
“ChIP-Seq is the best thing that happened to
ChIP since the antibody. It is 100x better than
ChIP-Chip since it escapes most of the problems
of microarray probe hybridization. Plus it is
cheaper, and genome wide. But ChIP-Seq is
only the tip of the iceberg - there are many
inventive ways to use a sequencer.”
Quote from intro to HOMER software at: http://biowhat.ucsd.edu/homer/ngs/index.html
Extensions of ChIP-seq
ChIP-Seq: Isolation and sequencing of genomic DNA "bound" by a specific transcription factor, covalently
modified histone, or other nuclear protein. This methodology provides genome-wide maps of factor
binding. Most of HOMER's routines cater to the analysis of ChIP-Seq data.
DNase-Seq: Treatment of nuclei with an endonuclease such as DNase I will result in cleavage of DNA at
accessible regions. Isolation of these regions and their detection by sequencing allows the creation of
DNase hypersensitivity maps, providing information about which regulatory elements are accessible in the
genome. (FAIRE-seq is a related technique for mapping accessible regions.)
MNase-Seq: Micrococcal Nuclease (MNase) is a nuclease that degrades genomic DNA not
wrapped around histones. The remaining DNA represents nucleosomal DNA, and can be sequenced to
reveal nucleosome positions along the genome. This method can also be combined with ChIP to map
nucleosomes that contain specific histone modifications.
RNA-Seq: Extraction, fragmentation, and sequencing of RNA populations within a sample. The replacement
for gene expression measurements by microarray. There are many variants on this, such as Ribo-Seq
(isolation of ribosomes translating RNA), small RNA-Seq (to identify miRNAs), etc.
GRO-Seq: RNA-Seq of nascent RNA. Transcription is halted, nuclei are isolated, labeled nucleotides are
added back, and transcription briefly restarted resulting in labeled RNA molecules. These newly created,
nascent RNAs are isolated and sequenced to reveal "rates of transcription" as opposed to the total number of
stable transcripts measured by normal RNA-seq.
Hi-C: Genomic interaction assay for understanding genome 3D structure. This assay is much more
specialized. For more information about how to use HOMER to analyze Hi-C data, see the Hi-C
analysis section of the HOMER documentation.
Examining long-range interactions
by ChIP-seq
Two DNA fragments
associated with the same
IP’d protein are ligated
together.
Sequencing identifies both
short-range and long-range
interactions.
Nature Reviews Genetics 2012 13:840
Fine scale
information from
DNase-seq
Sequencing the ends of
DNase cuts identifies
regions of bare DNA.
Fine scale analysis of this
data can identify individual
TF binding sites.
Nature Reviews Genetics 2012 13:840
Capturing allele-specific
information using SNPs in reads
CTCF binds
better to the A
variant
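One simple way to ask whether a factor prefers one allele is to count the ChIP-seq reads carrying each base at a heterozygous SNP and compare the counts with the 50:50 split expected if there were no preference. The following is a minimal sketch, assuming ungapped forward-strand reads given as hypothetical (start, sequence) pairs with 0-based coordinates; the exact binomial test is an assumption chosen for illustration, since the slide does not name a test, and the function names are not from any package used in the course.

```python
from math import comb

def allele_counts(reads, snp_pos):
    """Count bases observed at snp_pos; reads are (start, sequence) pairs, 0-based."""
    counts = {}
    for start, seq in reads:
        idx = snp_pos - start
        if 0 <= idx < len(seq):  # read covers the SNP
            counts[seq[idx]] = counts.get(seq[idx], 0) + 1
    return counts

def binomial_two_sided(k, n, p=0.5):
    """Exact two-sided binomial p-value: sum of all outcomes no more likely than k."""
    probs = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    # Small tolerance so that ties with probs[k] are included despite rounding
    return min(1.0, sum(q for q in probs if q <= probs[k] * (1 + 1e-9)))
```

For example, `binomial_two_sided(counts["A"], counts["A"] + counts["G"])` would test an A/G SNP. In real data, reads carrying the non-reference allele can map less efficiently, so a skew toward the reference allele should be interpreted with care.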
Mapping CpG DNA
methylation patterns
Approaches:
•IP of DNA fragments using
antibodies against meC or
meCpG binding proteins.
•Selection of DNA fragments
using methyl-sensitive restriction
enzymes.
•Whole genome bisulfite
sequencing.
Bormann Chung CA, Boyd VL, McKernan KJ, Fu Y, et al. (2010) Whole Methylome Analysis by Ultra-Deep Sequencing Using Two-Base Encoding. PLoS ONE 5(2): e9320.
doi:10.1371/journal.pone.0009320 http://www.plosone.org/article/info:doi/10.1371/journal.pone.0009320
Mapping nucleosome
positions
Approaches:
• 1) Fragmentation to mononucleosome size by sonication or micrococcal nuclease (MNase)
→ ChIP w/ antibody against histone modification (H3K4me1): can map positions of nucleosomes with this mark.
→ Whole genome sequencing.
Nat Struct Mol Biol. 2011 June; 18(6): 742–746.
Plotting ChIP-seq read density versus genomic features
Taking average normalized .bedgraph data relative to
TSSes…
[Figure: ChIP-seq reads/10bp/promoter (y-axis, 0 to 60) vs. BP from TSSes of gene group (x-axis, -2000 to 2000). Series: LiERBS_v_LiD, LiERBS_v_LiU, LiERBS_v_LiNon-regl., LiERBS_v_LiNon-expr., AoERBS_v_AoD, AoERBS_v_AoU, AoERBS_v_AoNon-regl.]
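The averaging behind a plot like this can be sketched as follows. This is a minimal illustration, assuming bedGraph-style intervals (chrom, start, end, value) already loaded into memory and a hypothetical list of (chrom, TSS position, strand) tuples for one gene group; the function names are made up for this example. Expanding the signal to one entry per base is only practical for small regions, so a genome-scale analysis would more likely use binned arrays or an existing tool.

```python
from collections import defaultdict

def build_signal(bedgraph_rows):
    """Expand bedGraph intervals (chrom, start, end, value) into per-base values."""
    signal = defaultdict(dict)
    for chrom, start, end, value in bedgraph_rows:
        for pos in range(start, end):  # bedGraph intervals are half-open
            signal[chrom][pos] = value
    return signal

def average_profile(signal, tss_list, flank=2000, bin_size=10):
    """Sum signal per bin around each TSS, then average over all TSSes."""
    n_bins = 2 * flank // bin_size
    totals = [0.0] * n_bins
    for chrom, tss, strand in tss_list:
        for offset in range(-flank, flank):
            # Flip minus-strand genes so that upstream is always negative
            pos = tss + offset if strand == "+" else tss - offset
            totals[(offset + flank) // bin_size] += signal[chrom].get(pos, 0.0)
    return [t / len(tss_list) for t in totals]
```

Each returned value corresponds to one point on the plot (signal per `bin_size` bp per promoter), with the first bin starting at `-flank`.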
Using input chromatin read density to measure
nucleosome densities
Hypothesis: Sonication mostly cuts in nucleosome-free regions or internucleosomal spacers. Thus, read positions give information about
nucleosome positions.
Initial support: Average normalized .bedgraph data from INPUT sample
relative to TSSes recapitulates the low nucleosome occupancy seen
genome-wide over promoters.
[Figure: Li INPUT shifted reads/ 10 bp/ promoter (y-axis, 0 to 8) vs. BP from Li Down promoter TSSes (x-axis, -2000 to 2000). Series (norm'd): LiINPUT_v_LiD_pros, LiINPUT_v_LiU_pros, LiINPUT_v_AoD_pros, LiINPUT_v_AoU_pros.]
Extensions & variations on TFBS analysis
Many approaches to TFBS analysis
Hannenhalli S Bioinformatics 2008;24:1325-1331
Also: Ladunga I. An overview of the computational analyses and
discovery of transcription factor binding sites. Methods Mol Biol. 2010;674:1-22.
doi: 10.1007/978-1-60761-854-6_1 (an introduction to a set of about a dozen methods papers).
De Novo Search Algorithms
The Gibbs sampler approach
Objective: Find a conserved segment of length k in n unrelated sequences.
The program will need to run once for each k:
e.g. 6 bp, 7 bp, 8 bp sequences, etc. (either
automatically, or by hand).
From: Lawrence, C. et al. (1993) Detecting Subtle Sequence Signals: A Gibbs Sampling Strategy for Multiple Alignment. Science 262:208.
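The idea can be illustrated with a stripped-down site sampler. This is a sketch under stated assumptions, not the published implementation: it assumes exactly one motif occurrence per sequence, uppercase A/C/G/T only, and a uniform background (so the usual motif-to-background ratio reduces to the motif probability). The function name and parameters are hypothetical.

```python
import random

def gibbs_motif_search(seqs, k, n_iter=200, pseudocount=1.0, seed=0):
    """Site sampler: returns one motif start position (length k) per sequence."""
    rng = random.Random(seed)
    alphabet = "ACGT"
    # Start from a random motif position in every sequence
    starts = [rng.randrange(len(s) - k + 1) for s in seqs]
    for _ in range(n_iter):
        for i, seq in enumerate(seqs):
            # Build position-specific counts from all sequences except i
            counts = [{b: pseudocount for b in alphabet} for _ in range(k)]
            for j, other in enumerate(seqs):
                if j != i:
                    for p in range(k):
                        counts[p][other[starts[j] + p]] += 1
            total = (len(seqs) - 1) + 4 * pseudocount
            # Score every window of the held-out sequence under that profile
            weights = []
            for s in range(len(seq) - k + 1):
                w = 1.0
                for p in range(k):
                    w *= counts[p][seq[s + p]] / total
                weights.append(w)
            # Sample the new start in proportion to its probability
            starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

# One run per candidate width, as noted above
example = ["ACGTACGTTTGACA", "TTGACAACGGTCAG", "GGCATTGACATCGA"]
for k in (6, 7, 8):
    found = gibbs_motif_search(example, k)
    print(k, [s[p:p + k] for s, p in zip(example, found)])
```

Because the sampler is stochastic, different seeds can settle on different (sometimes shifted) alignments, so in practice several restarts are compared.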
The EM approach (in MEME etc.)
The Expectation Maximization (EM) algorithm alternates an expectation (E) step and a
maximization (M) step until the estimates converge. For an explanation of
the process see Nature Biotechnology 26, 897-899
(2008).
Adapted from: www.stats.ox.ac.uk/~hein/Signals.ppt
Two de novo search methods
• DME is part of the same CREAD
package that Storm is in (run in UNIX).
• SEME uses some of the same refinements
as CentDist to do de novo searches:
http://biogpu.d1.comp.nus.edu.sg/~chipseq/webseqtools2/
Extensions to Basic Models
Composite Patterns:
BioOptimizer: the Bayesian Scoring Function Approach to Motif Discovery. Bioinformatics.
Regulatory Modules:
De novo cis-regulatory module elicitation for eukaryotic genomes. Proc Nat’l Acad Sci USA, 102, 7079-84
[Figure labels: Start, M1, M2, M3, Stop; p12, p21; Gene A, Gene B]
Adapted from: www.stats.ox.ac.uk/~hein/Signals.ppt
Combining Signals and other Data
[Figure labels: Motifs; Coding regions]
Expression and Motif Regression:
Integrating Motif Discovery and Expression Analysis. Proc. Natl. Acad. Sci. 100:3339-44
1. Rank genes by E = log2(expression fold change).
2. Find “many” (hundreds) candidate motifs.
3. For each motif pattern m, compute the vector Sm of matching scores
for genes with the pattern.
4. Regress E on Sm:
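$Y_g = \alpha + \beta_m S_{mg} + \epsilon_g$ (where $Y_g$ is the expression value E of gene g and $S_{mg}$ is the matching score of motif m for gene g)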
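The regression step for a single motif can be sketched with ordinary least squares. This is a minimal illustration, assuming one expression value and one motif score per gene in matching order; the function name is hypothetical and the published method involves further steps (such as selecting among many motifs) that are not shown here.

```python
def regress_expression_on_motif(expression, scores):
    """Least-squares fit of expression = alpha + beta * score; returns (alpha, beta)."""
    n = len(expression)
    mean_s = sum(scores) / n
    mean_e = sum(expression) / n
    # Sum of squares of the scores and their cross-product with expression
    sxx = sum((s - mean_s) ** 2 for s in scores)
    sxy = sum((s - mean_s) * (e - mean_e) for s, e in zip(scores, expression))
    beta = sxy / sxx
    alpha = mean_e - beta * mean_s
    return alpha, beta
```

A slope (beta) far from zero suggests that genes with better matches to motif m tend to change expression in a consistent direction; the scores must not all be identical, or the slope is undefined.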
ChIP-on-chip (1-2 kb resolution information on protein/DNA interaction):
An Algorithm for Finding Protein-DNA Interaction Sites with Applications to Chromatin Immunoprecipitation Microarray Experiments. Nature Biotechnology, 20, 835-39
[Figure labels: Protein binding in neighborhood; Coding regions]
Adapted from: www.stats.ox.ac.uk/~hein/Signals.ppt
Assessment of evolutionary
conservation
Modules shared across species
are most highly rated.
For use of evolutionary conservation information w/ individual
motifs see: Das & Dai 2007 BMC Bioinformatics 8:S21.
For regulatory modules see: Su J, Teichmann SA, Down TA (2010) Assessing Computational
Methods of Cis-Regulatory Module Prediction. PLoS Comput Biol 6(12): e1001020.
doi:10.1371/journal.pcbi.1001020
http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1001020
Adapted from: www.stats.ox.ac.uk/~hein/Signals.ppt
Integrating data from multiple sources
w/ permutation of average ranks
Let’s say we want to combine data from several sources or
metrics to decide which are the most relevant enriched TFs.
e.g. 1) p.value in CentDist, 2) p.value in Storm & 3) p.value
of homologous sequence in DME
Establish a ranking metric for each (e.g. 1 best to 10 worst).
It doesn’t have to be the same for 1, 2 & 3, but you need to
apply the same rank system across different biological
conditions.
For each TF compute the average rank.
(1)   (2)   (3)   (avg)
 1     3     2     2
 4     5     8     5.7
 3     4     2     3
 8     9     7     8
 .     .     .     .
Permutation of average ranks
Now take the same columns of ranks for (1), (2) & (3) and
randomize each one separately.
Repeat this many times (until
you have thousands of random
average ranks) & plot frequency
vs. avg. rank…
(1)   (2)   (3)   (avg)
 8     5     2     5
 4     4     8     5.3
 3     3     7     4.3
 1     9     2     4
 .     .     .     .
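Example: an average rank of 2.0 is observed 34/10,000 times in the
permuted averages, so the estimated FDR is ~3.4e-3.
[Figure: frequency vs. average rank for the permuted averages; axis ticks 1, 3, 5, 7, 10.]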
The number of times a given value is observed divided by
the total number of iterations gives an estimate of false
discovery rate.
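The permutation procedure above can be sketched as follows. This is a minimal illustration with hypothetical function names, using the four example rows from the rank table as input. One assumption differs slightly from the wording above: the last function counts permuted averages that are at least as good (as low) as the observed value, not only exact matches.

```python
import random

def average_ranks(rank_columns):
    """Row-wise mean across rank columns (one column per metric, one row per TF)."""
    return [sum(row) / len(row) for row in zip(*rank_columns)]

def permuted_average_ranks(rank_columns, n_perm=1000, seed=0):
    """Shuffle each column independently and pool the resulting average ranks."""
    rng = random.Random(seed)
    pooled = []
    for _ in range(n_perm):
        shuffled = [rng.sample(col, len(col)) for col in rank_columns]
        pooled.extend(average_ranks(shuffled))
    return pooled

def chance_frequency(observed_avg, pooled):
    """Fraction of permuted averages at least as good (low) as the observed one."""
    return sum(1 for v in pooled if v <= observed_avg) / len(pooled)

# Ranks for metrics (1), (2) and (3); only the four example TFs are used here
columns = [[1, 4, 3, 8], [3, 5, 4, 9], [2, 8, 2, 7]]
observed = average_ranks(columns)
pooled = permuted_average_ranks(columns)
for avg in observed:
    print(round(avg, 1), chance_frequency(avg, pooled))
```

With only four TFs the permuted distribution is coarse; the approach becomes informative when each column holds ranks for the full list of candidate TFs.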
Analyzing published data & across platforms
What if you want to know something from a published
dataset, but they’ve only provided the raw data on SRA?
Getting data from SRA
Go to: http://www.ncbi.nlm.nih.gov/sra
Find an experiment by searching, e.g. “encode h1-hesc h3k4me3”
Click on the name to the left of the smaller file (1.9M) & then on the
downloads tab.
Right click on the ftp link for the run & copy the link location.
Open putty & login to your account at cluster.uit.tufts.edu
Go to your /cluster/shared/[userID]/chip directory & do:
wget [pasted URL]
Decoding the .sra format
The SRR227387.sra file you now have is in a special file format, but it
does have all the original .fastq information in it.
To get that info do:
bsub /cluster/tufts/cbi*/Ch*/ESC*/sra*/bin/fastq-dump SRR227387.sra
[fastq-dump is part of a package of programs for handling .sra files that you can
download, unpack & run immediately from your shared directory – at least as far as
simple files like fastq-dump are concerned]
This gives you the same .fastq format you’re familiar with.
Use head to confirm the format, but then you might as well delete the
file with rm so as not to clutter up the cluster.
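Beyond eyeballing the output of head, the four-line record structure can be checked programmatically. This is a minimal sketch, assuming a plain-text FASTQ in which each record is exactly four lines (header starting with "@", sequence, separator starting with "+", quality string); the function name is hypothetical, and the file name in the usage comment assumes fastq-dump wrote SRR227387.fastq to the current directory.

```python
def read_fastq_records(lines, n_records=2):
    """Return the first n_records as (header, sequence, quality) tuples."""
    records = []
    it = iter(lines)
    for _ in range(n_records):
        # Pull the next four lines; missing lines become empty strings
        block = [next(it, "").rstrip("\n") for _ in range(4)]
        if not block[0]:
            break  # end of input
        header, seq, plus, qual = block
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("not a 4-line FASTQ record: %r" % header)
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ: %r" % header)
        records.append((header, seq, qual))
    return records

# Usage (assumed file name):
# with open("SRR227387.fastq") as fh:
#     for record in read_fastq_records(fh):
#         print(record)
```

Because the function only consumes as many lines as it needs, it is safe to call on a very large file.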
After this week you are now ready to do any analysis you want
on this data, from mapping reads to the genome (w/ bowtie) to
peak calling (w/ MACS), to TFBS analysis.
“Liftover” programs to convert
between genomes & builds
Several useful tools for this in Cistrome/Galaxy (under Liftover/Others):
• Convert between RefSeq, Gene Symbols to Entrez IDs using Bioconductor.
• Liftover Wig Files: Liftover wig files.
• [Galaxy] Convert genome coordinates between assemblies and genomes.
• Extract data from Wiggle: Extract data for certain chromosome from a wiggle file.
• Extract data from Bed: Extract data for certain chromosome from a BED file.
In the UCSC genome browser:
•Tools-> Liftover
•Choose the starting genome/build & the one you want to convert to.
•Upload a .bed file w/ the ranges you want & hit go (only works for bed
files… may work with bedGraph, although I haven’t confirmed this)
Downloading & installing programs
Don’t be intimidated!
There’s nothing to prevent you from installing a program
you want to run in your cluster account.
Before you begin, though, type “module available” to see if it’s
already installed as a module.
Also go to /cluster/tufts/ngsp/ngsp/ to see if it’s installed there.
• Read the documentation from the creator’s lab, download,
unzip &/or unpack the file, read the INSTALL or README
files included, & give it a try.
• You may need to be running a specific version of perl or
python, etc. If so, check “module available” to see if it’s
installed on the cluster & use “module load [name]” to add
it.
• You may also need to set system variables using “export
VARIABLE=$VARIABLE:/new/path”. README files should
tell you enough to know what to try.
• If you get stuck, the cluster support folks are friendly &
helpful (and respond moderately fast). Contact them at:
cluster-support@tufts.edu.
A different integrated package of
tools to run in UNIX
HOMER
Software for motif discovery and next-gen sequencing analysis
http://biowhat.ucsd.edu/homer/ngs/index.html
• Mapping to the genome (NOT performed by HOMER, but important to understand)
• Creation of tag directories, quality control, and normalization (makeTagDirectory)
• UCSC visualization (makeUCSCfile, makeBigWig.pl)
• Peak finding / Transcript detection / Feature identification (findPeaks)
• Motif analysis (findMotifsGenome.pl)
• Annotation of peaks (annotatePeaks.pl)
• Quantification of data at peaks/regions in the genome; histograms and heatmaps (annotatePeaks.pl)
• Quantification of transcripts (analyzeRNA.pl)
Additional analysis strategies:
• General sequence manipulation tools (homerTools)
• Miscellaneous tools for sharing data between programs, etc. (tagDir2bed.pl, bed2pos.pl, pos2bed.pl ...)
• Finding overlapping or differentially bound peaks (mergePeaks, getDifferentialPeaks)
• ChIP-Seq analysis automation (analyzeChIP-Seq.pl)
• Description of file formats
Could be very useful… & with (only a bit of) luck, you’ll be
able to install & run them yourself.
Installing a program in R
Check out the Key R Commands link at
http://sites.tufts.edu/cbi/resources/chip-seq/
This is not an introduction to programming in R!
Instead it gives basic instructions for how to:
1) install & run R packages that may be needed for
your research,
2) how to move data files into R
3) how to perform simple edits on this data that may be
required by the package
& 4) how to output your results.
Note: I find that the documentation for R packages is
generally quite good.
Writing your own programs
Mastering simple UNIX tools
find, awk, grep, sort, sed & more
One line commands to let you search and
manipulate large data files w/o writing a
program or trying to use the kludgy and limited
tools in Galaxy.
Find out more at:
http://sites.tufts.edu/cbi/resources/rna-seqcourse/unix-resources/
Programming: Get your feet wet
Perl Tutorials - learn.perl.org
learn.perl.org/tutorials/
Many tutorials are available if you are interested in learning Perl. These
tutorials are introductions.
Beginning Perl (free) - www.perl.org
www.perl.org/books/beginning-perl/
This book is for those new to programming who want to learn with Perl.
A ton of Perl programs for you to use/adapt/modify:
http://www.bioperl.org/wiki/Main_Page
For learning R:
Check out Josh’s links at: http://sites.tufts.edu/cbi/resources/rnaseq-course/r-resources/
Also check out my notes on using R (specifically geared to the minimum
you need to install & use existing programs) & a brief reference sheet on
Perl at http://sites.tufts.edu/cbi/resources/chip-seq/
Look at examples, check the
web…
If you’re looking for a command in UNIX, R, Perl, Python, etc.
do a Google search (for R add “statistical” to your search to
specify what you mean).
If you’re wondering how to get a program to do something,
look at other programs & see how they did it.
You don’t need to memorize the language, beyond a few
basics, just look at what you (or someone else) did before &
copy it.
Questions?
What would you like to explore?
What’s the next bioinformatics
challenge in your research?
Course evaluation forms…