Differential Methylation lecture

Differential Methylation Analysis
Simon Andrews
simon.andrews@babraham.ac.uk
@simon_andrews
v2015-07
A basic question…
Factors to consider
• Number of observations
• Magnitude of effect
• Technical considerations
• Biological variability
• Biological common sense
The problem of power…
• Ideally want to cover every Cytosine (CpG)
• Have to correct for the number of tests
• There’s no way you’ll collect enough data to analyse each C and have p-values which survive multiple testing correction
• Stats have to find a way to work round this.
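To put a number on the power problem, the sketch below compares the smallest p-value a single CpG could ever produce with a Bonferroni threshold. It assumes a two-sample Fisher's exact test, equal read depth in both samples, and roughly 28 million tests (an approximate CpG count for a human genome, used here as an assumption). The names `N_TESTS`, `min_fisher_p` and `min_depth_to_survive` are illustrative and do not come from the lecture.

```python
from math import comb

N_TESTS = 28_000_000   # assumed number of CpGs tested (approximate, human genome)
ALPHA = 0.05
bonferroni_threshold = ALPHA / N_TESTS


def min_fisher_p(depth):
    """Smallest possible two-sided Fisher's exact p-value when both samples
    have `depth` reads: one sample fully methylated, the other fully
    unmethylated. Each of the two extreme tables has probability
    1 / C(2 * depth, depth)."""
    return min(1.0, 2 / comb(2 * depth, depth))


def min_depth_to_survive(threshold):
    """Lowest per-sample depth at which even a 0% vs 100% change can pass."""
    depth = 1
    while min_fisher_p(depth) > threshold:
        depth += 1
    return depth


if __name__ == "__main__":
    print(f"Bonferroni threshold: {bonferroni_threshold:.2e}")
    for depth in (5, 10, 20):
        print(f"depth {depth:>2}: best possible p = {min_fisher_p(depth):.2e}")
    print("Minimum depth needed:", min_depth_to_survive(bonferroni_threshold))
```

Under these assumptions, 10 reads per sample can do no better than about 1e-5, and at least 17 reads per sample are needed before even a complete methylation switch could survive correction. Real changes are smaller, so the practical requirement is much higher.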
Maximising power
• Options
– Analyse in windows
– Pre-filter
– Hierarchical or Adaptive filtering
Window sizes
Effect size
Small
• Good resolution
• Specific biological effects
• High multiple testing correction (MTC) burden
• Few observations
• High p-values
Large
• Lots of data
• High statistical power
• Low MTC burden
• Low p-values
• Effect averaging
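The trade-off between small and large windows can be seen by pooling the same calls at two window sizes. This is a minimal sketch using fixed, non-overlapping windows; the function name `pool_into_windows`, the input layout and the example calls are all hypothetical.

```python
from collections import defaultdict


def pool_into_windows(calls, window_size):
    """Sum methylated / unmethylated read counts in fixed, non-overlapping
    windows.

    calls: iterable of (position, methylated_reads, unmethylated_reads)
    Returns {window_start: (methylated, unmethylated, n_cpgs)}.
    """
    windows = defaultdict(lambda: [0, 0, 0])
    for position, meth, unmeth in calls:
        start = (position // window_size) * window_size
        windows[start][0] += meth
        windows[start][1] += unmeth
        windows[start][2] += 1
    return {start: tuple(counts) for start, counts in sorted(windows.items())}


# Hypothetical CpG calls: (position, methylated reads, unmethylated reads)
calls = [(105, 3, 1), (180, 2, 2), (420, 0, 5), (1310, 4, 0), (1775, 1, 3)]

small = pool_into_windows(calls, 500)    # more windows, fewer reads in each
large = pool_into_windows(calls, 2000)   # one window, all reads pooled

if __name__ == "__main__":
    print("500 bp windows: ", small)
    print("2000 bp windows:", large)
```

With the 500 bp windows the fully methylated CpG at position 1310 stays visible but rests on only 4 reads. The 2000 bp window has 21 reads but averages that CpG away with its neighbours.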
Hierarchical testing
• Test larger regions
– Windows / Features etc.
• Take significant hits and subdivide
– Smaller windows
– Individual CpGs
– Correct only for these tests
• Assemble hits together to make up DMRs
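The two-stage idea above can be sketched as follows. This is an illustration under stated assumptions, not the method of any particular package: it uses an uncorrected 2x2 chi-square test at both stages, Bonferroni correction within each stage, and hypothetical names (`chi2_p`, `hierarchical_test`) and data. It makes no claim about the overall error rate of the combined procedure.

```python
from math import erfc, sqrt


def chi2_p(m1, u1, m2, u2):
    """P-value of a 2x2 chi-square test (1 d.f., no continuity correction)
    on methylated (m) and unmethylated (u) read counts from two samples."""
    row1, row2, col_m, col_u = m1 + u1, m2 + u2, m1 + m2, u1 + u2
    if 0 in (row1, row2, col_m, col_u):
        return 1.0  # a zero margin carries no information
    n = row1 + row2
    chi2 = n * (m1 * u2 - u1 * m2) ** 2 / (row1 * row2 * col_m * col_u)
    return erfc(sqrt(chi2 / 2))  # chi-square survival function for 1 d.f.


def hierarchical_test(regions, alpha=0.05):
    """regions: {name: [(m1, u1, m2, u2), ...]}, one tuple per CpG.

    Stage 1: test pooled counts per region, correcting for the number of regions.
    Stage 2: test individual CpGs only inside significant regions,
             correcting only for the CpGs tested at this stage.
    Returns {region_name: [indices of significant CpGs]}.
    """
    significant = []
    for name, cpgs in regions.items():
        pooled = [sum(column) for column in zip(*cpgs)]
        if chi2_p(*pooled) * len(regions) <= alpha:
            significant.append(name)

    n_stage2 = sum(len(regions[name]) for name in significant)
    return {
        name: [i for i, cpg in enumerate(regions[name])
               if chi2_p(*cpg) * n_stage2 <= alpha]
        for name in significant
    }


# Hypothetical counts for two CpG islands
regions = {
    "cgi_a": [(18, 2, 3, 17), (20, 0, 2, 18), (10, 10, 9, 11)],
    "cgi_b": [(10, 10, 11, 9), (5, 15, 6, 14)],
}

if __name__ == "__main__":
    print(hierarchical_test(regions))
```

Adjacent significant CpGs returned by the second stage would then be assembled into DMRs.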
Hierarchical testing
[Figure: genome tracks of CpG islands (CGI), some marked with X]
Statistically ‘creative’ solution to not having enough data
Un-replicated Analysis
Contingency tests
• Chi-square / G-test / Fisher’s exact test
– Differ only at low numbers of observations
– Significant changes require enough observations that any of these should give the same answer
• Operates on single replicates
• Technical measure of difference
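As an illustration of how the contingency tests behave, here is a minimal standard-library sketch of a two-sided Fisher's exact test and an uncorrected chi-square test on the same 2x2 table of methylated and unmethylated read counts. The function names and the example counts are hypothetical.

```python
from math import comb, erfc, sqrt


def fisher_exact_p(m1, u1, m2, u2):
    """Two-sided Fisher's exact test: sum the probabilities of all tables
    (with the same margins) that are no more likely than the observed one."""
    row1, row2, col_m = m1 + u1, m2 + u2, m1 + m2
    total = comb(row1 + row2, col_m)

    def prob(a):
        # Probability of seeing `a` methylated reads in sample 1
        return comb(row1, a) * comb(row2, col_m - a) / total

    observed = prob(m1)
    lowest, highest = max(0, col_m - row2), min(row1, col_m)
    p = sum(prob(a) for a in range(lowest, highest + 1)
            if prob(a) <= observed * (1 + 1e-9))  # tolerance for float ties
    return min(1.0, p)


def chi2_p(m1, u1, m2, u2):
    """2x2 chi-square test (1 d.f., no continuity correction)."""
    row1, row2, col_m, col_u = m1 + u1, m2 + u2, m1 + m2, u1 + u2
    if 0 in (row1, row2, col_m, col_u):
        return 1.0
    n = row1 + row2
    chi2 = n * (m1 * u2 - u1 * m2) ** 2 / (row1 * row2 * col_m * col_u)
    return erfc(sqrt(chi2 / 2))


if __name__ == "__main__":
    # Same 30% vs 80% methylation at two depths (hypothetical counts)
    for table in [(3, 7, 8, 2), (30, 70, 80, 20)]:
        print(table, f"Fisher {fisher_exact_p(*table):.3g}",
              f"chi-square {chi2_p(*table):.3g}")
```

At 10 reads per sample the two tests disagree (roughly 0.07 for Fisher against 0.025 for chi-square). At 100 reads per sample both p-values are vanishingly small, so the choice of test no longer changes the conclusion.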
Chi-Square results
Biological considerations
• Minimum relevant effect size?
– Balance power vs change
– What makes biological sense
– (what would you follow up?)
• Minimum coverage worth testing
– No point testing poorly covered regions
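A pre-filter along these lines might look like the sketch below. The thresholds (10 reads, 20 percentage points), the function name `prefilter` and the example regions are all assumptions chosen for illustration. Filtering on the observed difference is not independent of the test that follows, so the lighter correction is best read as a pragmatic choice.

```python
def prefilter(regions, min_coverage=10, min_difference=0.2):
    """Keep only regions worth testing.

    regions: {name: (m1, u1, m2, u2)} methylated / unmethylated read counts
             for sample 1 and sample 2.
    A region is kept if both samples have at least `min_coverage` reads and
    the absolute methylation difference is at least `min_difference`.
    """
    kept = []
    for name, (m1, u1, m2, u2) in regions.items():
        cov1, cov2 = m1 + u1, m2 + u2
        if min(cov1, cov2) < min_coverage:
            continue  # too few observations to be worth testing
        if abs(m1 / cov1 - m2 / cov2) < min_difference:
            continue  # change too small to follow up
        kept.append(name)
    return kept


# Hypothetical regions
regions = {
    "r1": (18, 2, 3, 17),    # well covered, large change
    "r2": (2, 1, 0, 3),      # poorly covered
    "r3": (10, 10, 11, 9),   # well covered, negligible change
    "r4": (30, 10, 12, 28),  # well covered, moderate change
}

kept = prefilter(regions)

if __name__ == "__main__":
    print("Regions to test:", kept)
    print("Bonferroni threshold before:", 0.05 / len(regions))
    print("Bonferroni threshold after: ", 0.05 / len(kept))
```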
Effect of pre-filtering
Beta Binomial Models
What is the probability distribution for the true methylation level?
Simple model: Binomial stats to estimate confidence
Can we do better?
Genome-wide methylation profile: not all levels are equally likely.
This can inform the construction of a custom beta-binomial distribution.
Beta-binomial model
Measure 3/20 observations as methylated
A binomial distribution would be defined by the observed mean and the number of observations
Using the whole-genome prior, a beta-binomial model would upweight the lower methylation levels, since these are more common.
Provides increased power in comparisons
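The effect of the prior on the 3/20 example can be sketched with a conjugate Beta prior. The lecture does not say how the genome-wide prior is fitted, so the Beta(1, 9) prior below (most of its weight on low methylation) is purely a hypothetical stand-in, as is the function name.

```python
def posterior_mean(methylated, total, prior_a, prior_b):
    """Mean of the Beta posterior for the true methylation level after
    observing `methylated` of `total` reads, given a Beta(prior_a, prior_b)
    prior. The posterior is Beta(prior_a + methylated,
    prior_b + total - methylated)."""
    return (prior_a + methylated) / (prior_a + prior_b + total)


raw_estimate = 3 / 20                          # plain binomial estimate
flat_prior = posterior_mean(3, 20, 1, 1)       # all levels equally likely
low_prior = posterior_mean(3, 20, 1, 9)        # hypothetical low-methylation prior

if __name__ == "__main__":
    print(f"raw estimate:           {raw_estimate:.3f}")
    print(f"flat Beta(1, 1) prior:  {flat_prior:.3f}")
    print(f"low Beta(1, 9) prior:   {low_prior:.3f}")
```

With the low-methylation prior the estimate is pulled below the raw 15%, which is the upweighting of lower levels described above. With more reads the data dominate and the prior matters less.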
Globally changing samples
• Change the default expectation
• Find average difference for each starting point
• Select points which exhibit unusual change
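One way to read these three steps is sketched below: bin regions by their starting methylation level, take the mean change within each bin as the expected change, and flag regions that deviate from it. The binning, the 0.2 deviation cut-off, the function name and the data are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def unusual_changes(pairs, n_bins=10, min_deviation=0.2):
    """pairs: list of (level_a, level_b) methylation fractions per region.

    Returns the indices of regions whose change (b - a) differs from the
    mean change of regions with a similar starting level by at least
    `min_deviation`.
    """
    def bin_of(level):
        return min(int(level * n_bins), n_bins - 1)

    changes_by_bin = defaultdict(list)
    for a, b in pairs:
        changes_by_bin[bin_of(a)].append(b - a)
    expected = {k: mean(v) for k, v in changes_by_bin.items()}

    return [i for i, (a, b) in enumerate(pairs)
            if abs((b - a) - expected[bin_of(a)]) >= min_deviation]


# Hypothetical data: highly methylated regions lose about 0.2 globally
pairs = [
    (0.82, 0.60), (0.85, 0.66), (0.88, 0.70), (0.81, 0.62),
    (0.86, 0.20),                  # loses far more than its peers
    (0.12, 0.10), (0.15, 0.14),
    (0.11, 0.55),                  # gains while its peers stay flat
]

flagged = unusual_changes(pairs)

if __name__ == "__main__":
    print("Unusual regions:", flagged)
```

Because the mean includes the outliers themselves, a more robust average such as the median may be preferable with real data.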
Replicated Analysis
Dealing with replicates
• Simple approach
– Merge data from replicates together
– Single test, High power
– Post-hoc test for consistency
• Explicitly account for batch effects
– Logistic regression
– Measures batch effects and excludes them from final significance calculation
– Beta binomial grouped tests – pool information between tests to gain power.
• Work with methylation values
– Normalise percentage methylation values
– Use conventional statistics (t-tests etc) for comparing groups
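For the last approach, a minimal sketch of comparing groups of replicate methylation percentages with Welch's t statistic, using only the standard library. The replicate values and the function name are hypothetical, and the normalisation step is not shown.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom for
    two groups of replicate values (each group needs at least 2 values and
    the groups must not both have zero variance)."""
    var_a = variance(group_a) / len(group_a)
    var_b = variance(group_b) / len(group_b)
    t = (mean(group_a) - mean(group_b)) / sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (
        var_a ** 2 / (len(group_a) - 1) + var_b ** 2 / (len(group_b) - 1)
    )
    return t, df


# Hypothetical percentage methylation for one region, three replicates each
control = [72.0, 68.5, 75.1]
treated = [41.2, 47.9, 44.0]

t_stat, dof = welch_t(control, treated)

if __name__ == "__main__":
    print(f"t = {t_stat:.2f}, df = {dof:.2f}")
```

Turning the statistic into a p-value needs the t distribution, for example via `scipy.stats.ttest_ind(control, treated, equal_var=False)` if SciPy is available.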
Replicated Count Based Analyses
• Logistic Regression
– Effectively a replicated contingency test
– Accounts for sampling bias
– Compares samples in multiple conditions for difference
– No modelling of observed distribution / dispersion
• Beta binomial modeling
– Extends previous model to include variance
– Can relate observed variances to methylation levels
– Can pool variance between related points to increase power
Replicated methylation level analyses
• Works from methylation percentages
• Noise is very high – individual values unreliable
• Expect no sudden changes
• Observation level should predict noise
• Can generate “smoothed” data
• Use normal continuous statistics on smoothed data
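To show what smoothing does to noisy per-CpG values, the sketch below uses a simple coverage-weighted average over neighbouring CpGs. This is a deliberately simplified stand-in and not the BSmooth algorithm. The bandwidth, the function name and the data are assumptions.

```python
def smooth_methylation(cpgs, bandwidth=500):
    """cpgs: list of (position, methylated_reads, total_reads).

    For each CpG, returns the pooled fraction of methylated reads over all
    CpGs within `bandwidth` bases. Pooling reads is equivalent to averaging
    the per-CpG fractions weighted by coverage, so well-covered CpGs count
    for more. Quadratic in the number of CpGs; fine for a sketch only.
    """
    smoothed = []
    for position, _, _ in cpgs:
        meth = total = 0
        for other_position, m, n in cpgs:
            if abs(other_position - position) <= bandwidth:
                meth += m
                total += n
        smoothed.append(meth / total if total else float("nan"))
    return smoothed


# Hypothetical calls: three low-coverage neighbours and one distant,
# well-covered CpG
cpgs = [(100, 1, 4), (250, 4, 4), (400, 0, 4), (5000, 18, 20)]

raw = [m / n for _, m, n in cpgs]
smoothed = smooth_methylation(cpgs)

if __name__ == "__main__":
    print("raw:     ", [round(x, 2) for x in raw])
    print("smoothed:", [round(x, 2) for x in smoothed])
```

The three low-coverage neighbours jump between 0.25, 1.0 and 0.0 in the raw data but share one pooled value after smoothing. The distant, well-covered CpG is unchanged.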
BSmooth algorithm
black: 25x (Lister)
pink: 4x (Lister)
BSmooth t-values
Methylation statistics packages
• swDMR (Perl/R-package)
Sliding window DMR finding (choose between t_test, Kolmogorov, Fisher, ChiSquare, Wilcoxon for n = 2; ANOVA, Kruskal for n > 3)
• methylKit* (R-package by A. Akalin et al.)
Sliding window, Fisher’s exact test or logistic regression. Adjusts p-values to q-values using SLIM method.
• bsseq* (R/Bioconductor by K.D. Hansen)
Implements the BSmooth smoothing algorithm. Numerous CpG-wise t-tests and p-value cutoff to define DMRs. Outperforms Fisher’s exact test. Requires biological replicates for DMR detection
• BiSeq* (R/Bioconductor by K. Hebestreit et al.)
Beta regression model, impractical for very large data other than RRBS or targeted BS-Seq
• RnBeads* (R package by F. Mueller et al.)
works for 450K arrays, BS-Seq, MeDIP or MBD-Seq data
• DMAP* (C command line tool by P. Stockwell et al.)
RRBS fragment or fixed window approach, Fisher’s exact test, Chi-squared or ANOVA
• RADMeth (C++ command line tool by E. Dolzhenko and A.D. Smith)
Beta-binomial regression analysis to find DMCs or DMRs, local likelihood, adjust for neighbouring CpGs
• MOABS* (C++ command line tool by D. Sun et al.)
Beta binomial hierarchical model to capture sampling and biological variation, Credible Methylation Difference (CDIF) single metric that combines biological and statistical significance
• ComMet (Y. Saito et al., 2014)
Bisulfighter suite; DMR detection based on hidden Markov models (HMMs) that enable automated adjustment of DMC chaining criteria. Does not require biological replicates
• DSS (R/Bioconductor by Feng et al., 2014)
Constructs genome-wide prior distribution for beta-binomial dispersion. Bayesian hierarchical model to detect differentially methylated loci
* interface well with