genetics_bootcamp_tolstorukov

Analysis of ChIP-chip experiment
Pipeline:
• Raw data processing to obtain enrichment profile
• Visualization of the data
• Finding enriched regions and/or peaks
• Feature analysis:
– Calculating average profiles for different regions of interest (genes, intergenic regions, exons, etc.)
– Analysis of the profiles for different genome regions and groups of genes (heterochromatin vs. euchromatin, silent vs. expressed genes, etc.)
The ChIP-chip data shown below were obtained as part of the modENCODE project
(PIs G. Karpen, S. Elgin, V. Pirrotta, M. Kuroda, P. Park).
The data analysis pipeline was developed and is maintained by Peter Kharchenko, the Park Lab.
Raw data processing
• Input: fluorescence intensity for each probe, for ChIP and control (if a control is available)
• Input: probe position on the array and in the genome (provided by the array manufacturer)
• Correction for probe copy number and sequence bias
• Calculation of a smoothed log2 intensity ratio profile or a P-value profile
based on a slide by Peter Park
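The last step of raw data processing can be illustrated with a minimal sketch. It assumes that a control channel is available, that probes are sorted by genomic position on one chromosome, and that smoothing is a plain moving average over a fixed window. The function name, the `window_bp` parameter and the choice of smoother are assumptions made for illustration; they are not taken from the modENCODE pipeline, and the copy-number and sequence-bias correction is not shown.

```python
import math

def smoothed_log2_ratio(chip, control, positions, window_bp=500):
    """Return a smoothed log2(ChIP/control) value for each probe.

    chip, control: positive fluorescence intensities, one per probe.
    positions: sorted genomic coordinates of the probes (one chromosome).
    Smoothing is a moving average over probes within +/- window_bp / 2
    (an illustrative choice, not the pipeline's actual smoother).
    """
    ratios = [math.log2(c / k) for c, k in zip(chip, control)]
    half = window_bp / 2
    n = len(positions)
    smoothed = []
    lo = 0  # index of the first probe inside the window
    hi = 0  # index one past the last probe inside the window
    for i in range(n):
        while positions[lo] < positions[i] - half:
            lo += 1
        while hi < n and positions[hi] <= positions[i] + half:
            hi += 1
        smoothed.append(sum(ratios[lo:hi]) / (hi - lo))
    return smoothed
```

Because the window is defined in base pairs rather than in number of probes, regions with sparse probe coverage are averaged over fewer probes.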
Software packages for tiling array analysis
• Bioconductor packages for R:
– Oligo, Ringo, ACME… (http://www.bioconductor.org)
• MAT, MA2C (the Liu Lab at Dana-Farber Cancer Institute)
http://liulab.dfci.harvard.edu/
• TAS (Affymetrix)
http://www.affymetrix.com/partners_programs/programs/developer/tools/affytools.affx
Visualization
GBrowse
Perl-based application, widely used for genome browsing and for visualization of various features and annotations; a very popular browser in the Generic Model Organism Database (GMOD) project
UCSC Genome Browser
Integrative web-based application supported at the UCSC Genome Bioinformatics site. The human ENCODE data are available through this browser.
IGB (Integrated Genome Browser)
An application, available through the Affymetrix site, for genome visualization and exploration of data annotations from various sources
IGB installation
• Go to http://www.affymetrix.com/partners_programs/programs/developer/tools/download_igb.affx and press “Launch IGB (768 MB)”,
or go to https://compbio.med.harvard.edu/wiki/x/g4RG and follow the instructions on IGB installation
• NOTE: If Java is not installed on your system, follow the link “Free Java Download” on the IGB page, or go to http://www.java.com and press “Free Java Download”
• Load the enrichment profiles into the browser from https://compbio.med.harvard.edu/wiki/x/g4RG
• The controls of the IGB browser allow changing the chromosome in view, data scale, graph color, etc.
Questions to explore
• Which modifications are activation marks and which are repression marks?
• Compare expression and modification profiles in two cell lines. Do changes in expression correlate with changes in modification levels?
– Examples to examine: H3K4me3/H3K36me3/mRNA profiles for S2 and BG3 cells at
chr2L: 21,808,548 - 21,857,626
chr3L: 22,243,063 - 22,460,589
chrX: 2,821,457 - 2,936,998
chrX: 2,965,624 - 3,103,080
chr3R: 11,356,133 - 11,384,217
Finding enriched regions/peaks
• Using Bayesian networks / Hidden Markov Models
e.g. BAC package (Bioconductor, http://www.bioconductor.org)
• Thresholding
e.g. TAS (Affymetrix, http://www.affymetrix.com)
Ringo (Bioconductor, http://www.bioconductor.org)
Thresholding
How many clusters? The answer depends on three parameters:
• Threshold – minimal allowed enrichment value
• Min run – minimal width of an enriched region to be identified as a ‘cluster’
• Max gap – maximal allowed gap inside a cluster
These parameters are manually adjustable in IGB!
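How the three parameters interact can be shown with a small sketch. It assumes probes sorted by position on one chromosome, with min run and max gap both measured in base pairs. The function name and the exact boundary handling (inclusive comparisons) are assumptions; TAS, Ringo and IGB may treat edge cases differently.

```python
def find_clusters(positions, values, threshold, min_run, max_gap):
    """Return (start, end) intervals of enriched clusters.

    positions: sorted probe coordinates; values: enrichment per probe.
    A probe is enriched if its value is >= threshold. Enriched probes
    separated by at most max_gap bp are merged into one cluster, and
    clusters narrower than min_run bp are discarded.
    """
    clusters = []
    start = end = None
    for pos, val in zip(positions, values):
        if val < threshold:
            continue
        if start is not None and pos - end <= max_gap:
            end = pos  # extend the current cluster across a tolerated gap
        else:
            if start is not None and end - start >= min_run:
                clusters.append((start, end))
            start = end = pos  # open a new cluster
    if start is not None and end - start >= min_run:
        clusters.append((start, end))
    return clusters
```

Raising the threshold or min run, or lowering max gap, can only split or remove clusters, which is why the number of reported clusters depends so strongly on these settings.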
• Calculating the threshold
– Generate randomizations
• More randomizations allow a higher level of statistical significance
– Find a threshold that corresponds to a given expected value (EV) of the false discovery rate (FDR) → optimization problem
[Figure: number of clusters plotted against threshold, with regions labeled ‘few clusters’ and ‘many clusters’; EV plotted against threshold value, marking the threshold value that corresponds to the required expected value.]
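One way to turn the randomization idea into code is sketched below. The slides do not specify how the randomizations are generated, so this sketch assumes a simple shuffle of probe values, counts a cluster as a run of consecutive enriched probes, and estimates the FDR at a threshold as the mean number of clusters in the shuffled profiles divided by the number in the real profile. All function and parameter names are hypothetical.

```python
import random

def count_clusters(values, threshold, min_probes=3):
    """Count runs of at least min_probes consecutive values >= threshold."""
    count = run = 0
    for v in values:
        if v >= threshold:
            run += 1
            if run == min_probes:  # count each run once, when it qualifies
                count += 1
        else:
            run = 0
    return count

def threshold_for_fdr(values, target_fdr, n_random=100, seed=0, min_probes=3):
    """Return the lowest threshold whose estimated FDR is <= target_fdr.

    Randomization = shuffling probe values (an assumption of this sketch).
    Returns None if no candidate threshold meets the target.
    """
    rng = random.Random(seed)
    shuffled = []
    for _ in range(n_random):
        s = list(values)
        rng.shuffle(s)
        shuffled.append(s)
    for t in sorted(set(values)):  # scan candidate thresholds from low to high
        observed = count_clusters(values, t, min_probes)
        if observed == 0:
            continue
        expected_false = sum(count_clusters(s, t, min_probes) for s in shuffled) / n_random
        if expected_false / observed <= target_fdr:
            return t
    return None
```

Shuffling destroys the spatial correlation between neighboring probes, so it gives an optimistic null model; a randomization that preserves local structure would be more conservative.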
Feature analysis
[Figure: modENCODE data, average gene profiles by expression quintile; panel labels: Meta-gene (5’ to 3’), Gene Size.]
Based on a slide by Peter Park
Calculating average enrichment profile for ‘meta-gene’
Challenges:
Genes have different sizes
Genes are oriented differently in genome
Array probes are distributed non-uniformly along genes
For each gene i in a set:
• Determine the set of probes that belong to the i-th gene
• Calculate a relative gene coordinate for each probe in the set
– Direct gene orientation? Yes:
$P_{rel} = \frac{P_{chr} - \mathrm{Gene.Start}_i}{\mathrm{Gene.End}_i - \mathrm{Gene.Start}_i}$
– Direct gene orientation? No:
$P_{rel} = 1 - \frac{P_{chr} - \mathrm{Gene.End}_i}{\mathrm{Gene.Start}_i - \mathrm{Gene.End}_i}$
• Group probes in bins that are regularly distributed along the gene to obtain a relative intensity profile for each gene (or extrapolate the intensity values for regularly located points along the gene)
Calculate the position-wise average of the relative intensity profiles over all genes
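The meta-gene procedure can be sketched as follows. The sketch assumes that every gene is annotated with start < end in genomic coordinates plus a separate strand flag, and that the relative coordinate of a reverse-strand gene is flipped so that 0 is always the 5’ end. It uses binning rather than interpolation. The function name, the input layout and the `n_bins` parameter are assumptions made for illustration.

```python
def metagene_profile(genes, probes, n_bins=10):
    """Average enrichment along a length-normalized 'meta-gene'.

    genes: list of (start, end, strand) with start < end, strand '+' or '-'.
    probes: list of (position, value) on the same chromosome.
    Returns n_bins averages running 5' to 3' (None where no gene has probes).
    """
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for start, end, strand in genes:
        bin_sum = [0.0] * n_bins
        bin_n = [0] * n_bins
        for pos, val in probes:
            if not (start <= pos <= end):
                continue
            rel = (pos - start) / (end - start)  # 0..1 along the gene
            if strand == '-':
                rel = 1.0 - rel  # measure from the 5' end of reverse-strand genes
            b = min(int(rel * n_bins), n_bins - 1)
            bin_sum[b] += val
            bin_n[b] += 1
        # Each gene contributes one value per bin, so genes with dense
        # probe coverage do not dominate the average.
        for b in range(n_bins):
            if bin_n[b]:
                sums[b] += bin_sum[b] / bin_n[b]
                counts[b] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]
```

Averaging within each gene first and across genes second addresses two of the challenges listed above: different gene sizes and non-uniform probe distribution.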
[Figure: modENCODE data. Average enrichment profile of H3K36me3 in S2 cells along the meta-gene (5’ to 3’, position relative to TSS) for all genes, expressed genes and silent genes.]
[Figure: modENCODE data. Average enrichment profile of H3K4me3 in S2 cells along the meta-gene (5’ to 3’, position relative to TSS) for all genes, expressed genes and silent genes.]
Specific issues of ChIP-Seq data analysis
• Alignment of the sequenced tags to the reference sequence (genome):
– ELAND, BLAT, MAQ, SOAP
• Correction for the biases due to DNA fragmentation and sequencing
• Accounting for specific patterns of sequenced tag distribution at protein binding sites
MNase digestion and sonication leave different sequence signatures at DNA fragmentation sites
[Figure: GC-profiles for H3K4me3-enriched human nucleosomes, MNase digestion vs. sonication; the two datasets show different ‘background’ GC-content.]
MNase-digested dataset from Barski et al, Cell, 2007
Sonicated dataset from Robertson et al, Genome Res, 2008
GC-content of Solexa tags is biased towards higher values as compared to randomly selected genomic sequences
[Figure: GC-content of ChIP and control tags; NRSF Solexa data, Johnson et al, Science 2007]
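A comparison of this kind can be sketched in a few lines. The sketch assumes tags are available as plain sequence strings and that the background is drawn uniformly at random from a genome string held in memory. The function names are hypothetical, and a real analysis would sample from each chromosome in proportion to its length and skip unassembled regions.

```python
import random

def gc_content(seq):
    """Fraction of G/C among the A/C/G/T bases of seq (None if there are none)."""
    seq = seq.upper()
    acgt = sum(seq.count(b) for b in "ACGT")
    if acgt == 0:
        return None
    return (seq.count("G") + seq.count("C")) / acgt

def mean_gc(seqs):
    """Mean GC-content over sequences that contain at least one A/C/G/T base."""
    vals = [g for g in map(gc_content, seqs) if g is not None]
    return sum(vals) / len(vals) if vals else None

def sample_background(genome, tag_len, n, seed=0):
    """Draw n random substrings of length tag_len from a genome string."""
    rng = random.Random(seed)
    starts = [rng.randrange(len(genome) - tag_len + 1) for _ in range(n)]
    return [genome[s:s + tag_len] for s in starts]
```

Comparing `mean_gc(tags)` with `mean_gc(sample_background(genome, tag_len, n))` gives a first indication of whether the sequenced tags are GC-enriched relative to the genome.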

Kharchenko et al., 2008
Search for two peaks on the positive and negative strands separated by a characteristic length δ
Shift the tag density profiles on the positive and negative strands by δ/2 to match the peaks on both strands
based on a slide by Peter Park
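The strand-shift idea can be sketched as follows. The sketch assumes per-base counts of tag 5’ ends on each strand for one chromosome and estimates the characteristic separation as the offset that maximizes a raw cross-product between the two strand profiles. This is a simplified stand-in, not the SPP implementation, and the function names are hypothetical.

```python
def estimate_strand_shift(plus, minus, max_shift):
    """Offset d in 0..max_shift that maximizes sum(plus[i] * minus[i + d]).

    plus, minus: equal-length lists of tag counts per position.
    Uses an unnormalized cross-product instead of a correlation coefficient.
    """
    n = len(plus)
    best_shift, best_score = 0, float("-inf")
    for d in range(max_shift + 1):
        score = sum(plus[i] * minus[i + d] for i in range(n - d))
        if score > best_score:
            best_shift, best_score = d, score
    return best_shift

def merge_strands(plus, minus, d):
    """Shift plus-strand counts right and minus-strand counts left by d // 2, then add."""
    n = len(plus)
    h = d // 2
    merged = [0] * n
    for i in range(n):
        if plus[i] and i + h < n:
            merged[i + h] += plus[i]
        if minus[i] and i - h >= 0:
            merged[i - h] += minus[i]
    return merged
```

After merging, tags from both strands pile up at the inferred center of the bound fragment, which sharpens the peak at the binding site.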
[Figure: sequenced tag distribution around well-positioned H3K4me3 nucleosomes at TIMM17A; Solexa sequencing data, Barski et al, Cell, 2007]
Packages for analysis of ChIP-Seq data
• SPP – the Park Lab, Harvard Med School
(Kharchenko et al, Nat Biotechnol, 2008)
• QuEST – the Sidow Lab, Stanford University
(Valouev et al, Nat Methods, 2008)
• SISSRs – the Zhao Lab, NHLBI, NIH
(Jothi et al, NAR, 2008)
• CisGenome – the Wong Lab, Stanford University
(Ji et al, Nat Biotechnol, 2008)