Differential Methylation Analysis Simon Andrews simon.andrews@babraham.ac.uk @simon_andrews v2015-07 1 A basic question… 2 Factors to consider • • • • • Number of observations Magnitude of effect Technical considerations Biological variability Biological common sense 3 The problem of power… • Ideally want to cover every Cytosine (CpG) • Have to correct for the number of tests • There’s no way you’ll collect enough data to analyse each C and have p-values which survive multiple testing correction • Stats have to find a way to work round this. 4 Maximising power • Options – Analyse in windows – Pre-filter – Hierarchical or Adaptive filtering 5 Window sizes Effect size Small • • • • • Good resolution Specific biological effects High MTC burden Small observations High p-values Large • • • • • Lots of data High statistical power Low MTC burden Low p-values Effect averaging 6 Hierarchical testing • Test larger regions – Windows / Features etc. • Take significant hits and subdivide – Smaller windows – Individual CpGs – Correct only for these tests • Assemble hits together to make up DMRs 7 Hierarchical testing CGI CGI CGI Genome CGI X CGI CGI X CGI Genome CGI X X X CGI X Genome CGI X X CGI CGI CGI CGI CGI CGI CGI CGI Statistically ‘creative’ solution to not having enough data 8 Un-replicated Analysis 9 Contingency tests • Chi-square / G-test / Fisher’s exact test – Differ only at low observations – Significant changes require enough observations that any of these should give the same answer • Operates on single replicates • Technical measure of difference 10 Chi-Square results 11 Biological considerations • Minimum relevant effect size? – Balance power vs change – What makes biological sense – (what would you follow up?) • Minimum coverage worth testing – No point testing poorly covered regions 12 Effect of pre-filtering 13 Beta Binomial Models What is the probability distribution for the true methylation level? Simple model: Binomial stats to estimate confidence Can we do better? Genome-wide methylation profile. All levels are not equally likely Can inform the construction of a Custom beta binomial distribution 14 Beta-binomial model Measure 3/20 observations as methylated The binomial distribution would be defined by the mean and observations Using the whole genome prior a beta-binomial model would upweight the lower methylation levels, since these are more common. Provides increased power in comparisons 15 Globally changing samples • Change the default expectation • Find average difference for each starting point • Select points which exhibit unusual change 16 Replicated Analysis 17 Dealing with replicates • Simple approach – Merge data from replicates together – Single test, High power – Post-hoc test for consistency • Explicitly account for batch effects – Logistic regression – Measures batch effects and excludes them from final significance calculation – Beta binomial grouped tests – pool information between tests to gain power. • Work with methylation values – Normalise percentage methylation values – Use conventional statistics (t-tests etc) for comparing groups 18 Replicated Count Based Analyses • Logistic Regression – – – – Effectively a replicated contingency test Accounts for sampling bias Compares samples in multiple conditions for difference No modelling of observed distribution / dispersion • Beta binomial modeling – Extends previous model to include variance – Can relate observed variances to methylation levels – Can pool variance between related points to increase power 19 Replicated methylation level analyses • • • • Works from methylation percentages Noise is very high – individual values unreliable Expect no sudden changes Observation level should predict noise • Can generate “smoothed” data • Use normal continuous statistics on smoothed data 20 BSmooth algorithm black: 25x (Lister) pink: 4x (Lister) 21 Bsmooth t-values 22 Methylation statistics packages • swDMR (Perl/R-package) n > 3) • Sliding window DMR finding (choose between t_test, Kolmogorov, Fisher, ChiSquare, Wilcoxon for n = 2; ANOVA, Kruskal for methylKit* (R-package by A. Akalin et al.) Sliding window, Fisher’s exact test or logistic regression. Adjusts p-values to q-values using SLIM method. • bsseq* (R/Bioconductor by K.D. Hansen) Fisher’s • Implements the BSmooth smoothing algorithm. Numerous CpG-wise t-tests and p-value cutoff to define DMRs. Outperforms exact test. Requires biological replicates for DMR detection BiSeq* (R/Bioconductor by K. Hebestreit et al.) Beta regression model, impractical for very large data other than RRBS or targeted BS-Seq • RnBeads* (R package by F. Mueller et al.) works for 450K arrays, BS-Seq, MeDIP or MBD-Seq data • DMAP* (C command line tool by P. Stockwell et al.) RRBS fragment or fixed window approach, Fisher’s exact test, Chi-squared or ANOVA • RADMeth (C++ command line tool by E. Dolzhenko and A.D. Smith) Beta-binomial regression analysis to find DMCs or DMRs, local likelihood, adjust for neighbouring CpGs • MOABS* (C++ command line tool by D. Sun et al.) Beta binomial hierarchical model to capture sampling and biological variation, Credible Methylation Difference (CDIF) single metric that combines biological and statistical significance • ComMet (Y. Saito et al., 2014) Bisulfighter suite; DMR detection based on hidden Markov models (HMMs) that enable automated adjustment of DMC chaining criteria. Does not require biological replicates • DSS (R/Bioconductor by Feng et al., 2014) Constructs genome-wide prior distribution for beta-binomial dispersion. Bayesian hierarchical model to detect differentially methylated loci * interface well with 23