Genetic algorithms
Hidden Markov Models
Circular Binary Segmentation
Wavelets
Maximum Likelihood
Demystifying the Black Box
• Intensity to Copy Number
• LogR and Allelic Data
• Detection Algorithms
– CNV
– LCSH/LOH
– Allelic Imbalance
– Breakpoint Mapping
• Calling Accuracy / Resolution
The ‘Black Box’
1. Image acquisition
2. Image analysis
3. Signal normalization / smoothing
– has a significant effect on final data quality
4. Duplicate treatment
LogR (signal ratio) and BAF values are calculated for each probe
5. Apply segmentation and/or Calling Algorithms
6. Visual Inspection / Analysis
LogR = $\log_2(I_A + I_B)$, taken as a ratio to the expected (2-copy) signal
Expected values as (copy ratio, LogR): ‘normal’ (2/2, 0), het loss (1/2, -1), het dup (3/2, 0.58), etc.
BAF = $I_B / (I_A + I_B)$
AA, BB, AB, A, B, AAB, ABB, etc
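The two per-probe quantities above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not any vendor's implementation: the function names, the `expected_total` parameter (the reference signal for two copies) and the idealised intensities of one unit per allele copy are all assumptions made for the example.

```python
import math

def log_r(i_a, i_b, expected_total):
    """LogR: log2 of the observed total intensity over the expected 2-copy total."""
    return math.log2((i_a + i_b) / expected_total)

def baf(i_a, i_b):
    """B-allele frequency: share of the total signal contributed by allele B."""
    return i_b / (i_a + i_b)

# Assumed, noise-free intensities: one unit of signal per allele copy.
examples = {
    "AB": (1.0, 1.0),   # normal heterozygote
    "A": (1.0, 0.0),    # heterozygous loss
    "AAB": (2.0, 1.0),  # heterozygous duplication
}

for genotype, (i_a, i_b) in examples.items():
    print(genotype,
          round(log_r(i_a, i_b, expected_total=2.0), 2),
          round(baf(i_a, i_b), 2))
```

With these idealised inputs the LogR values reproduce the 0, -1 and 0.58 quoted above, and BAF separates AB (0.5) from AAB (0.33) even though both carry allele B.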
So Why Do We Need Calling Algorithms?
Various sources of variation that create ‘noise’
• Hybridization differences/efficiencies, intensity-dependent effects, small local effects (e.g. GC content), etc.
• Errors during scanning, human error during set-up
Pre-processing and Calling Algorithms
• Increase detection of aberrations, precision and confidence
• Remove some of the subjectivity and simplify analysis (faster, genomic coordinates)
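As one illustration of pre-processing, a sliding-window median can suppress isolated noisy probes before any calling is attempted. This is only a sketch: the function name, the window size and the LogR values are assumptions for the example, and real packages use their own smoothing schemes.

```python
from statistics import median

def median_smooth(values, window=5):
    """Replace each value with the median of a centred window (truncated at the ends)."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(median(values[lo:hi]))
    return smoothed

# Assumed LogR values: a single outlier probe (2.0) followed by a deletion.
log_r = [0.0, 0.1, -0.1, 2.0, 0.0, -1.0, -0.9, -1.1, -1.0, -1.0]
print(median_smooth(log_r, window=3))
```

The outlier at index 3 is pulled back to the neutral level while the run of values near -1 is preserved. A wider window removes more noise but blurs short aberrations, which is one reason smoothing has such a strong effect on final data quality.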
Detection Algorithms
Algorithm | Computational Approach | Platform / Software
PennCNV | HMM (combined LogR and BAF) | Illumina / Affymetrix SNP array
cnvPartition | Segmentation of CN estimates, smoothing (sliding window) | Illumina SNP array
CBS | Circular binary segmentation | Array-CGH
BirdSuite | HMM | Affymetrix
GLAD | Segmentation, adaptive weights smoothing | Array-CGH
FACADE | Segmentation, edge detection and non-parametric statistics |
Nexus Rank and SNPRank | Segmentation, including segment significance calculation |
multi | Smoothing | Array-CGH
Detection Algorithms
Eilers et al. Bioinformatics, 2005
Detection Algorithms
Calling approaches model data at the probe level (gain, loss, neutral)
[Figure: LogR profile called using a 3-state HMM; states: Gain, Normal, Loss]
Extend probe states to detect contiguous aberrations (i.e. breakpoint calling)
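A 3-state calling approach of this kind can be sketched with a small hidden Markov model decoded by the Viterbi algorithm. Everything numeric here is an assumption for illustration: the state means reuse the expected LogR values (-1, 0, 0.58), while the noise level `SD`, the self-transition probability `STAY` and the input values are made up. It is not the model used by PennCNV, BirdSuite or any other package.

```python
import math

STATES = ["Loss", "Normal", "Gain"]
MEANS = {"Loss": -1.0, "Normal": 0.0, "Gain": 0.58}  # expected LogR per state
SD = 0.25    # assumed probe-level noise
STAY = 0.98  # assumed probability of remaining in the same state

def log_emission(x, state):
    """Log of a Gaussian density centred on the state's expected LogR."""
    z = (x - MEANS[state]) / SD
    return -0.5 * z * z - math.log(SD * math.sqrt(2 * math.pi))

def log_transition(prev, cur):
    """Log probability of moving from prev to cur between adjacent probes."""
    return math.log(STAY if prev == cur else (1 - STAY) / (len(STATES) - 1))

def viterbi(log_r):
    """Return the most likely state for each probe, given the whole sequence."""
    score = {s: math.log(1 / len(STATES)) + log_emission(log_r[0], s) for s in STATES}
    back = []
    for x in log_r[1:]:
        new_score, pointers = {}, {}
        for cur in STATES:
            best_prev = max(STATES, key=lambda p: score[p] + log_transition(p, cur))
            new_score[cur] = (score[best_prev] + log_transition(best_prev, cur)
                              + log_emission(x, cur))
            pointers[cur] = best_prev
        score = new_score
        back.append(pointers)
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

def segments(path):
    """Collapse a per-probe path into (state, first_index, last_index) runs."""
    runs, start = [], 0
    for i in range(1, len(path) + 1):
        if i == len(path) or path[i] != path[start]:
            runs.append((path[start], start, i - 1))
            start = i
    return runs

# Assumed LogR values: a deletion spanning probes 3-7 with one noisy probe (-0.2).
log_r = [0.05, -0.1, 0.0, -0.9, -1.1, -0.2, -1.0, -0.95, 0.1, 0.0]
print(segments(viterbi(log_r)))
```

Because switching states is penalised, the noisy probe inside the deletion is absorbed into one contiguous Loss call rather than splitting it, so the run boundaries act as the called breakpoints.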
Detection Algorithms
Segmentation methods seek to identify contiguous regions of common means (LogR, SD, SE, t-statistic)
[Figure: segmented LogR profile with segment means labelled 2 copy (neutral) and 1 copy (del)]
Once the means are calculated and breakpoints defined, a separate procedure (model) is used to assign copy number states (i.e. calling)
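The two-stage idea (segment first, then call) can be sketched as follows. Note that this uses plain recursive binary segmentation with a t-like statistic, not circular binary segmentation, and that it skips the permutation-based significance testing a real method would need. The split threshold, minimum segment size, calling cut-offs and input values are all assumptions chosen for the example.

```python
import math
from statistics import mean, pvariance

def best_split(values, min_size=3):
    """Find the split point with the largest t-like statistic between left and right means."""
    best_t, best_k = 0.0, None
    for k in range(min_size, len(values) - min_size + 1):
        left, right = values[:k], values[k:]
        # Pooled variance from the two within-segment sums of squares.
        pooled = (pvariance(left) * len(left) + pvariance(right) * len(right)) / (len(values) - 2)
        se = math.sqrt(pooled * (1 / len(left) + 1 / len(right))) or 1e-9
        t = abs(mean(left) - mean(right)) / se
        if t > best_t:
            best_t, best_k = t, k
    return best_t, best_k

def segment(values, start=0, threshold=6.0, min_size=3):
    """Split recursively while a split exceeds the threshold; return (start, end, mean), end exclusive."""
    if len(values) >= 2 * min_size:
        t, k = best_split(values, min_size)
        if k is not None and t > threshold:
            return (segment(values[:k], start, threshold, min_size)
                    + segment(values[k:], start + k, threshold, min_size))
    return [(start, start + len(values), mean(values))]

def call_state(segment_mean, loss_below=-0.5, gain_above=0.3):
    """Separate calling step: map a segment mean to a copy number state (assumed cut-offs)."""
    if segment_mean < loss_below:
        return "1 copy (del)"
    if segment_mean > gain_above:
        return "3 copy (dup)"
    return "2 copy (neutral)"

# Assumed LogR values: six neutral probes followed by six deleted probes.
log_r = [0.05, -0.1, 0.0, 0.1, -0.05, 0.0, -0.9, -1.1, -1.0, -0.95, -1.05, -1.0]
for seg_start, seg_end, seg_mean in segment(log_r):
    print(seg_start, seg_end, round(seg_mean, 2), call_state(seg_mean))
```

Keeping `call_state` separate from `segment` mirrors the point above: breakpoints come from the segmentation, while the copy number label comes from a second model applied to each segment mean.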
Empirical Comparisons
• Sensitivity-specificity tradeoff
• Effect of algorithm on number of CNVs detected
• Effect of algorithm on accuracy of assigning breakpoints and copy number/ allelic states
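A probe-level version of the sensitivity-specificity comparison can be sketched as below. It assumes a known truth set (for example simulated data) marking each probe as aberrant or not; the function name and the example vectors are illustrative only and do not come from the cited comparison.

```python
def sensitivity_specificity(truth, called):
    """Compare per-probe calls with a truth set; both are sequences of 0/1 flags."""
    tp = sum(1 for t, c in zip(truth, called) if t and c)
    fn = sum(1 for t, c in zip(truth, called) if t and not c)
    fp = sum(1 for t, c in zip(truth, called) if not t and c)
    tn = sum(1 for t, c in zip(truth, called) if not t and not c)
    sensitivity = tp / (tp + fn)  # fraction of truly aberrant probes that were called
    specificity = tn / (tn + fp)  # fraction of normal probes left uncalled
    return sensitivity, specificity

# Assumed example: 1 = probe lies inside an aberration, 0 = normal.
truth  = [0, 0, 1, 1, 1, 0, 0, 0, 1, 1]
called = [0, 0, 1, 1, 0, 0, 1, 0, 1, 1]
print(sensitivity_specificity(truth, called))
```

Loosening an algorithm's thresholds typically raises sensitivity at the cost of specificity, which is the tradeoff named in the first bullet.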
Lai et al. Bioinformatics, 2005
Factors Influencing Call Performance
• Array specification (probe target, count, distribution)
• Hybridisation characteristics
– Link calling performance to sample QC
• Segmentation/Calling Algorithm
– Data model may be too restrictive, e.g. an HMM with a set number of expected states
• Suboptimal parameter settings e.g. SD and LogR thresholds, ‘confidence’, CNV length, probe number
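The effect of parameter settings on the final call set can be illustrated with a simple post-hoc filter. The `CnvCall` record, its field names, the threshold defaults and the three example calls are all hypothetical and not taken from any particular software.

```python
from dataclasses import dataclass

@dataclass
class CnvCall:
    """Hypothetical record for one called segment."""
    chrom: str
    start: int
    end: int
    n_probes: int
    mean_log_r: float
    confidence: float

def passes(call, min_probes=5, min_length=10_000, min_abs_log_r=0.3, min_confidence=10.0):
    """Keep a call only if it clears every threshold (all defaults are assumed values)."""
    return (call.n_probes >= min_probes
            and call.end - call.start >= min_length
            and abs(call.mean_log_r) >= min_abs_log_r
            and call.confidence >= min_confidence)

calls = [
    CnvCall("chr1", 1_000_000, 1_250_000, 40, -0.95, 85.0),  # large, well-supported loss
    CnvCall("chr2", 5_000_000, 5_004_000, 3, 0.60, 12.0),    # short gain on few probes
    CnvCall("chr7", 2_000_000, 2_060_000, 8, 0.25, 20.0),    # weak LogR shift
]

strict = [c for c in calls if passes(c)]
relaxed = [c for c in calls if passes(c, min_probes=3, min_length=1_000, min_abs_log_r=0.2)]
print(len(strict), len(relaxed))
```

The same raw calls yield one CNV under the stricter settings and three under the relaxed ones, so reported CNV counts are only comparable when the thresholds are reported alongside them.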
Zhang et al. PLoS ONE, 2011
Dellinger et al. Nucleic Acids Research, 2010
confidence/ significance estimates
Model complex genomes