Introduction to ChIPseq analysis

Sean Thomas, Ph.D.
a little about me…
BS Biology: studying antibiotic resistance
BA Architecture: evolving buildings in a computer
PhD Molecular Biology: regulation of gene expression in a eukaryote with no known protein-coding promoters
postdoc: numerous collaborations with ENCODE and related projects
[timeline figure: 2000, 2006]
collaborative research portfolio
fee-based consultation
teaching, training and mentorship
generalized pipeline of a ChIPseq study
design study
obtain input chromatin
perform precipitation
construct library
sequence library
filter sequences
align sequences
get tag density tracks
assess data quality
understand the data
proceed to downstream analyses
basic study design principles
- three or more replicates per sample
- technical replicates are generally a waste of time and money
  [figure: replication at the level of samples versus libraries or sequencing, marked ✓ and X]
- many studies do not account for batch effects
  i. time
  ii. origin
- so if you care about reproducibility
  [figure: experiment1, experiment2, experiment3… along a time axis; libraries, sequencing, etc.]
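One way to keep batch effects (time, origin) from being confounded with the comparison of interest is to make sure every batch contains every condition. The following is a minimal Python sketch of that idea, not a prescribed protocol. The condition names, the function name `assign_batches` and the rule "one replicate of each condition per batch" are assumptions made for illustration.

```python
import random

def assign_batches(conditions, n_replicates, seed=0):
    """Return (batch, condition, replicate, processing_order) tuples.

    Assumption: each batch (e.g. one processing date) holds exactly one
    replicate of every condition, so batch never coincides with condition.
    The processing order inside each batch is shuffled.
    """
    rng = random.Random(seed)
    design = []
    for rep in range(1, n_replicates + 1):
        order = list(conditions)
        rng.shuffle(order)
        for position, condition in enumerate(order, start=1):
            design.append((f"batch{rep}", condition, rep, position))
    return design

# Hypothetical two-condition study with three biological replicates.
design = assign_batches(["control", "treated"], n_replicates=3)
for row in design:
    print(row)
```

Each batch here corresponds to one experiment on the time axis, with its own libraries and sequencing.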
basic study design principles
- each sample has a matched input
- sequencing one input is not rigorous (different samples have different chromatin backgrounds)
  [figure: input and ChIP replicates through library/sequencing, marked X]
- input samples should ideally be sequenced comparably to ChIP samples
  [figure: ChIP with under-sequenced input (X) versus ChIP with well-sequenced input (✓)]
basic study design principles
- do not pool data
  [figure: actual replicates versus pooled data, marked ✓ and X]
- if you need to pool your data, then it is under-sequenced
  [figure: under-sequenced data versus pooled data]
measure twice, cut once
fastqc – quality control
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
quality filter
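As an illustration of what a quality filter does, here is a minimal Python sketch that keeps reads whose mean base quality meets a threshold. It assumes Phred+33 encoded, four-line FASTQ records; the threshold of 20 and the function names are hypothetical choices, not the settings of any particular tool.

```python
import io

def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual

def mean_phred(qual, offset=33):
    """Mean Phred score of a quality string (assumes Phred+33 encoding)."""
    return sum(ord(ch) - offset for ch in qual) / len(qual)

def quality_filter(records, min_mean_q=20):
    """Keep only records whose mean quality is at least min_mean_q."""
    for header, seq, qual in records:
        if qual and mean_phred(qual) >= min_mean_q:
            yield header, seq, qual

# Two toy reads: 'I' encodes Phred 40, '#' encodes Phred 2.
example = io.StringIO(
    "@read1\nACGTACGT\n+\nIIIIIIII\n"
    "@read2\nACGTACGT\n+\n########\n"
)
kept = list(quality_filter(read_fastq(example)))
print([header for header, _, _ in kept])
```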
adapter/barcode trimming
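Adapter trimming can be sketched as follows. This is a simplified Python example that only handles exact matches of a 3' adapter (including a partial adapter at the very end of the read); real trimming tools also tolerate mismatches. The adapter sequence shown and the `min_overlap` parameter are assumptions for illustration.

```python
def trim_adapter(seq, qual, adapter, min_overlap=3):
    """Remove a 3' adapter (exact match only) and everything after it.

    If the full adapter is not found, look for a prefix of the adapter
    of at least min_overlap bases at the 3' end of the read.
    """
    idx = seq.find(adapter)
    if idx == -1:
        longest = min(len(adapter) - 1, len(seq))
        for k in range(longest, min_overlap - 1, -1):
            if seq.endswith(adapter[:k]):
                idx = len(seq) - k
                break
    if idx == -1:
        return seq, qual  # no adapter detected
    return seq[:idx], qual[:idx]

# Assumed adapter sequence, used here only as an example.
ADAPTER = "AGATCGGAAGAGC"

full = trim_adapter("ACGTACGT" + ADAPTER + "TT", "I" * 23, ADAPTER)
partial = trim_adapter("ACGTACGTAGATC", "I" * 13, ADAPTER)
clean = trim_adapter("ACGTACGT", "I" * 8, ADAPTER)
print(full, partial, clean)
```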
tag uniqueness
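One simple way to look at tag uniqueness is the fraction of distinct sequences among all reads. The Python sketch below works at the sequence level and is an assumption about how the measure could be computed; duplication is often assessed after alignment, by genomic position, instead.

```python
from collections import Counter

def tag_uniqueness(sequences):
    """Return (fraction of distinct sequences, per-sequence counts)."""
    counts = Counter(sequences)
    total = sum(counts.values())
    fraction = len(counts) / total if total else 0.0
    return fraction, counts

# Toy reads: one sequence is seen three times.
reads = ["ACGT", "ACGT", "TTGA", "CCAT", "ACGT"]
fraction, counts = tag_uniqueness(reads)
print(f"distinct fraction: {fraction:.2f}")
print("most duplicated:", counts.most_common(1))
```

A low distinct fraction suggests a low-complexity library, in which additional sequencing mostly returns tags that have already been seen.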
Bowtie2 - alignment
http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
i.) build genome database
format:
bowtie2-build [options]* <fasta_reference_in> <bt2_index_base>
sample code:
bowtie2-build myDB.fasta myDB
ii.) test bowtie2 and database
bowtie2 -x myDB --no-head -c ACATACTTCTTTATATGCCCATA
iii.) run alignment
bowtie2 -x myDB -U seq.fq -S seq_alignedToMyDB.sam
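After the alignment finishes, a quick sanity check is to count how many reads actually aligned. The Python sketch below reads SAM text and uses the FLAG field (bit 0x4 means unmapped; bits 0x100 and 0x800 mark secondary and supplementary records). The function name and the toy records are assumptions for illustration; the file name in the usage note is the one from the command above.

```python
def alignment_summary(sam_lines):
    """Return (total_reads, mapped_reads) from an iterable of SAM lines."""
    total = mapped = 0
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 0x100 or flag & 0x800:
            continue  # count each read once: skip secondary/supplementary
        total += 1
        if not flag & 0x4:
            mapped += 1
    return total, mapped

# Toy SAM records: two aligned reads and one unaligned read.
sam = [
    "@HD\tVN:1.0\tSO:unsorted",
    "r1\t0\tchr1\t101\t42\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
    "r2\t16\tchr1\t201\t42\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGT\tIIIIIIII",
]
total, mapped = alignment_summary(sam)
print(f"{mapped} of {total} reads aligned")
```

With a real file, the same function can be called on an open handle, for example `with open("seq_alignedToMyDB.sam") as fh: alignment_summary(fh)`.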
get tag density tracks
i.) segment the genome into appropriately sized bins (e.g. 20 bp)
chr1	25	45
chr1	45	65
chr1	65	85
…
http://code.google.com/p/bedops/
ii.) count the number of tags within ‘x’ bp of each genomic bin (e.g. 75 bp)
chr1	25	45	0
chr1	45	65	2
chr1	65	85	5
…
iii.) convert file to BigWig format (e.g. bedGraphToBigWig)
iv.) host file on public web server
v.) generate url link text:
track type=bigWig name="SRX081879" description="SRX081879" bigDataUrl=http://lighthouse.ucsf.edu/public_files_no_password/sthomas/chicken/SRX081879_gal3.20bpdensity.bw
vi.) load tracks into UCSC genome browser (genome.ucsc.edu)
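Steps i.) and ii.) above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the pipeline used for the slides: a tag is represented by a single position, "within x bp of a bin" is taken to mean that the position falls in [bin start - x, bin end + x), and the tag positions are made-up values. The output is in four-column bedGraph layout, the text form that is converted to BigWig in step iii.).

```python
from bisect import bisect_left

def make_bins(chrom, start, end, bin_size=20):
    """Segment [start, end) on one chromosome into fixed-size bins."""
    return [(chrom, s, min(s + bin_size, end)) for s in range(start, end, bin_size)]

def tag_density(bins, tag_positions, window=75):
    """Count tags within `window` bp of each bin.

    tag_positions maps chromosome name -> sorted list of tag positions.
    """
    rows = []
    for chrom, start, end in bins:
        positions = tag_positions.get(chrom, [])
        lo = bisect_left(positions, start - window)
        hi = bisect_left(positions, end + window)
        rows.append((chrom, start, end, hi - lo))
    return rows

# Hypothetical tag positions on chr1.
tags = {"chr1": sorted([125, 130, 150, 155, 159, 400])}
bins = make_bins("chr1", 25, 85)
density = tag_density(bins, tags)
for chrom, start, end, count in density:
    print(f"{chrom}\t{start}\t{end}\t{count}")
```

Keeping the positions sorted lets each bin be counted with two binary searches instead of a scan over all tags.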
tag density distribution
reproducibility
similarity of coverage
…
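Reproducibility between replicates can be summarized, as a first pass, by correlating their per-bin tag counts. The Python sketch below computes a Pearson correlation by hand; the choice of Pearson correlation and the toy count vectors are assumptions for illustration, not a recommended standard.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length lists of bin counts."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-bin tag counts over the same six bins.
rep1 = [0, 2, 5, 9, 3, 0]
rep2 = [1, 2, 6, 8, 2, 0]      # resembles rep1
unrelated = [5, 0, 1, 0, 4, 6]  # does not resemble rep1

print(f"rep1 vs rep2:      {pearson(rep1, rep2):.3f}")
print(f"rep1 vs unrelated: {pearson(rep1, unrelated):.3f}")
```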
unexpected signals
systematic biases
confounding factors
new biology
peak calling
- not standardized; it is an active field of research
- most respected genomics labs have their own methods; there are five distinct methods within the ENCODE project alone:
http://www.ncbi.nlm.nih.gov/pubmed/22955991
http://genome.ucsc.edu/cgi-bin/hgTrackUi?g=wgEncodeTfBindingSuper
- appropriate methodologies depend on data type: punctate, mixed signal, broad signal
SPP: http://compbio.med.harvard.edu/Supplements/ChIP-seq/
MACS: http://liulab.dfci.harvard.edu/MACS/
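To make the idea of peak calling concrete, here is a deliberately naive Python sketch: it scales the input to the ChIP sequencing depth, computes a per-bin fold enrichment with a pseudocount, and merges adjacent enriched bins. It is a toy illustration only and is not how SPP, MACS or any ENCODE method works; the threshold, the pseudocount and the counts are all assumptions.

```python
def call_enriched_regions(chip, control, min_fold=2.0, pseudocount=1.0):
    """Return merged (chrom, start, end, max_fold) regions.

    chip and control are lists of (chrom, start, end, count) over
    identical, position-sorted bins.
    """
    chip_total = sum(row[3] for row in chip)
    control_total = sum(row[3] for row in control)
    scale = chip_total / control_total if control_total else 1.0
    regions = []
    for (chrom, start, end, c), (_, _, _, k) in zip(chip, control):
        fold = (c + pseudocount) / (k * scale + pseudocount)
        if fold < min_fold:
            continue
        if regions and regions[-1][0] == chrom and regions[-1][2] == start:
            prev = regions[-1]  # adjacent enriched bin: extend the region
            regions[-1] = (chrom, prev[1], end, max(prev[3], fold))
        else:
            regions.append((chrom, start, end, fold))
    return regions

# Hypothetical 20 bp bins with ChIP and matched input counts.
starts = range(0, 120, 20)
chip = [("chr1", s, s + 20, c) for s, c in zip(starts, [2, 3, 60, 80, 3, 2])]
control = [("chr1", s, s + 20, 25) for s in starts]
regions = call_enriched_regions(chip, control)
print(regions)
```

A fixed fold threshold like this has no statistical model behind it, which is one reason dedicated peak callers exist and differ from one another.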
what do you need to get to the point of doing sequence tag alignments?
- reproducible experimental system
- molecular biology lab/reagents/expertise
- well-conceived study design
- modern computer running bowtie and fastqc
- reliable library construction and sequencing lab/reagents/expertise
in order to build and view tracks on the UCSC genome browser and call ChIP peaks
- mac/linux machine
- a web server
- beginner bioinformatics expertise (~8 hrs of training, for a motivated novice)
in order to do solid downstream analyses
- combination of advanced genomics, bioinformatics and biology experience (either one individual or a team working together)
downstream analysis for labs without current computational capability
unsuccessful projects
- underestimate the importance of proper study design
- fail to appreciate or apply necessary expertise to the problem
successful projects
- effective collaboration with a computational scientist
- a lab member undergoes intensive mentoring with a computational scientist
case study
A Temporal Chromatin Signature in Human Embryonic Stem Cells
Identifies Regulators of Cardiac Development
http://www.cell.com/retrieve/pii/S0092867412010586
iterative exploration
expertise from a range of different fields is necessary to synthesize genomic data into understanding.
exploit all of the information present in your data
http://www.cell.com/retrieve/pii/S0092867412010586
experimental validation