A review of quality control and preprocessing measures for the Illumina 450K BeadChip
Randa Stringer
Supervisor: Dr. Guillaume Paré

Steps for Review
  Sample quality
  Probe quality
  Background correction
  Normalization
  Cellular composition
  Batch effects

Array Design
  > 485,000 CpG sites
  Covers 99% of RefSeq genes
  Average of 17 sites per gene
  Distributed across promoter, 5’ UTR, first exon, gene body, and 3’ UTR
  Covers 96% of known CpG islands

Sample Quality
  Reported vs. predicted sex
    Use DNA methylation to predict sex
    minfi getSex function: if yMed - xMed is less than the cutoff, the sample is predicted female; otherwise male
  Sample detection cut-offs
    Threshold of failed probes in a sample (usually < 0.05 or 0.01)

Probe Quality
  Probe detection cut-offs
  Bead count (> 3)
  Remove probes on sex chromosomes
  Probes containing SNPs
  Cross-reactive probes
  MAF > 1%

Background Correction
  Background subtraction method
    Available in GenomeStudio
    Background calculated from negative control probes is subtracted from all probes (separately for each channel, red vs. green)
  (GenomeStudio Methylation Module v1.8 User Guide)

Normalization
  Goal: reduce non-biological variation
  Equalizes probe intensity and signal distributions across arrays and between colour channels
  New challenges with DNA methylation vs. gene expression techniques
    Systematic/technical variation
    Novel probe design

Normalization for Illumina 450K
  Problem: 2-type probe design
  Infinium I probe: 2 different probes per CpG
  Infinium II probe: single base extension at CpG
  (Maksimovic et al. Genome Biology 2012)

CpG Content
  Infinium II: ≤ 3
  Infinium I: ≥ 3
  Compressed β value distribution in InfII
  Solution: scale Infinium II probes to InfI probes
  (Maksimovic et al. Genome Biology 2012)

Normalization to Internal Controls
  Illumina GenomeStudio
  Probe intensity multiplied by a constant normalization factor (NF)
  NF calculated as the average of controls in a reference sample
  (GenomeStudio Methylation Module v1.8 User Guide)
  Doesn’t account for the InfI vs. InfII probe issues

Peak-Based Correction (PBC)
  Uses peak summits to correct β values
  Convert β to M values
  Determine peaks for I and II probes with kernel density estimation
  Rescale M values by peak summits
  Rescale these corrected M values to the I range and convert back to β values
  [Figure panels: Raw; PBC]
  (Dedeurwaerder et al. Epigenomics 2011)

Subset Quantile Normalization (SQN)
  Modeled after SQN methods in expression
  Probes separated and those with poor detection removed
  ‘Anchors’ (reference quantiles, RQs) chosen from InfI probes
  Target quantiles are estimated for InfI and II
  InfI and II normalized to their RQs
  Dataset is rebuilt
  (Touleimat and Tost, Epigenomics, 2012)

SQN Cont’d
  [Figure panels: No normalization; RQs by ‘relation to CpG’; Unique RQs; RQs by ‘relation to gene sequence’]
  (Maksimovic et al. Genome Biology 2012)

Subset Within-Array Normalization (SWAN)
  Allows InfI and InfII probes to be normalized together
  Subset of N InfI and InfII probes chosen based on underlying CpG content
  Separate methylated and unmethylated channels
  Mean intensity for each of 3N calculated
  InfI and II probes adjusted separately by linear interpolation
  (Maksimovic et al. Genome Biology 2012)

Beta-MIxture Quantile normalization (BMIQ)
  Novel normalization method
  Fit a 3-state (U/H/M) beta mixture model to InfI and InfII probes separately
  Transform InfII U and M probes using the inverse of the cumulative beta distribution estimated from the respective InfI probes
  For H probes, perform a dilation transformation to fit the data into the gap
  (Teschendorff et al. Bioinformatics 2012)

START Data
  [Figure panels: Raw Data; SWAN Normalized]

Cellular Composition
  Adapted from Correa-Rocha et al. Pediatric Research 2012
  Estimations by Houseman
  (Houseman et al. BMC Bioinformatics 2012)

Batch Effects
  Can be assessed using principal component analysis or variations on singular value decomposition (e.g. sva)
  ComBat method uses a parametric or nonparametric empirical Bayes framework to adjust for a known source of batch effects

Singular Value Decomposition (START)

Questions & Discussion
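
Supplementary sketch: the batch-effect check described under Batch Effects (principal component analysis via singular value decomposition) could be outlined roughly as below. This is a minimal illustration in Python with NumPy, not the minfi or sva workflow; the function names `principal_components` and `batch_association`, the simulated M-value matrix, and the size of the batch shift are all assumptions made for the example.

```python
import numpy as np


def principal_components(m_values, n_components=2):
    """SVD-based PCA on a probes x samples matrix.

    Returns per-sample scores (samples x components) and the fraction
    of total variance carried by each returned component.
    """
    # Centre each probe (row) across samples before decomposing.
    centred = m_values - m_values.mean(axis=1, keepdims=True)
    # Thin SVD: rows of vt describe samples, s holds singular values.
    _, s, vt = np.linalg.svd(centred, full_matrices=False)
    scores = (vt[:n_components] * s[:n_components, None]).T
    var_explained = (s ** 2) / np.sum(s ** 2)
    return scores, var_explained[:n_components]


def batch_association(scores, batch):
    """R^2 of each component on batch: between-batch sum of squares
    divided by total sum of squares."""
    batch = np.asarray(batch)
    r2 = []
    for pc in scores.T:
        total = np.sum((pc - pc.mean()) ** 2)
        between = sum(
            np.sum(batch == b) * (pc[batch == b].mean() - pc.mean()) ** 2
            for b in np.unique(batch)
        )
        r2.append(between / total)
    return np.array(r2)


# Hypothetical toy data: 500 probes, 12 samples, two batches of six.
rng = np.random.default_rng(0)
batch = np.array([0] * 6 + [1] * 6)
m_values = rng.normal(0.0, 1.0, size=(500, 12))
m_values[:100, batch == 1] += 2.0  # simulated batch shift on 100 probes

scores, var_explained = principal_components(m_values)
r2 = batch_association(scores, batch)
print("variance explained:", np.round(var_explained, 3))
print("R^2 with batch:", np.round(r2, 3))
```

A leading component that is strongly associated with a known batch variable (plate, chip, run date) suggests a batch effect worth adjusting for, for example with ComBat. M values are used here rather than β values on the assumption that their distribution is better suited to a linear decomposition.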