GIMscan: A New Statistical Method for Analyzing Whole-Genome Array CGH Data

RECOMB 2007 Presentation
Yanxin Shi¹, Fan Guo¹, Wei Wu², Eric P. Xing¹
¹ School of Computer Science, Carnegie Mellon University
² Division of Pulmonary, Allergy, and Critical Care Medicine, University of Pittsburgh
Outline
• Motivation and Background
• Computational framework
• Experiments and Results
• Summary
Copy number aberration and Array CGH
• DNA copy number (a.k.a. dosage state)
– Normal: 2 DNA copies
– Aberrations: deletion (0 copies), loss (1 copy), gain (3 copies), amplification (>3 copies)
• Array CGH: a high-throughput method for measuring DNA copy number
Array CGH data
Ideally, with LR the log2 ratio of test to reference copy number:
Deletion (0 copies): LR = log2(0/2) = -∞
Loss (1 copy): LR = log2(1/2) = -1
Normal (2 copies): LR = log2(2/2) = 0
Gain (3 copies): LR = log2(3/2) = 0.58
Amplification (>=4 copies): LR >= log2(4/2) = 1
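The ideal values above can be reproduced with a few lines of Python. This is a minimal sketch; the function name `ideal_log_ratio` and the default reference of 2 copies are illustrative choices, not part of the presentation.

```python
import math

def ideal_log_ratio(copies, reference_copies=2):
    """Ideal log2 ratio of test copy number to reference copy number."""
    if copies == 0:
        return float("-inf")  # log2(0/2) diverges to minus infinity
    return math.log2(copies / reference_copies)

# Print the ideal LR for each dosage state listed on the slide.
for name, copies in [("deletion", 0), ("loss", 1), ("normal", 2),
                     ("gain", 3), ("amplification", 4)]:
    print(f"{name:>13}: {ideal_log_ratio(copies):.2f}")
```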
However…
• Factors influencing the LR values
– Impurity of the test sample (e.g. mixture of normal and cancer cells)
– Variations of hybridization efficiency
– Base compositions of different probes
– Saturation of array
– Divergent sequence lengths of the clones
– Measurement noise
– Many others…
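To see how sample impurity alone pulls LR values toward zero, consider a simple linear mixing model. This model is an assumption made here for illustration (the slides do not specify one): a fraction of cells carries the aberrant copy number and the rest carry the normal 2 copies.

```python
import math

def mixed_log_ratio(tumor_copies, tumor_fraction, normal_copies=2):
    """LR for a sample mixing tumor cells with normal cells.

    Assumes the measured signal is the cell-weighted average copy number
    (a hypothetical simplification, not the authors' model).
    """
    average = tumor_fraction * tumor_copies + (1 - tumor_fraction) * normal_copies
    return math.log2(average / normal_copies)

# A single-copy loss in a pure sample versus a 50% pure sample.
print(mixed_log_ratio(1, 1.0))
print(mixed_log_ratio(1, 0.5))
```

Under this assumption, a loss in a half-normal sample gives log2(1.5/2), well short of the ideal -1, which is one reason fixed cutoffs can miscall states.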
Segmental pattern and spatial drift
[Figure: LR profile with labeled regions: spatial drift, segmental pattern]
Existing Computational Methods
• Threshold Method
• Mixture Models (e.g. Hodgson et al., 2001)
– Assume observations are iid samples from a mixture distribution.
• Regression Models (e.g., Hsu et al., 2005; Myers et al.,
2004)
– Smoothing for visual inspection to detect copy number states.
• Segmentation Models (e.g. Hupé et al., 2004)
– Directly search for breakpoints in sequential data;
• Spatial Dynamics Models (e.g. Fridlyand et al., 2004)
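As a point of reference for the simplest approach in the list above, a threshold method can be sketched as follows. The cutoffs of ±0.3 are hypothetical values picked for illustration; the slides do not state which thresholds were used.

```python
def threshold_call(log_ratios, loss_cut=-0.3, gain_cut=0.3):
    """Call each clone independently by comparing its LR to fixed cutoffs.

    The cutoffs are illustrative assumptions, not values from the talk.
    """
    calls = []
    for lr in log_ratios:
        if lr <= loss_cut:
            calls.append("loss")
        elif lr >= gain_cut:
            calls.append("gain")
        else:
            calls.append("normal")
    return calls

print(threshold_call([-0.5, 0.0, 0.6]))
```

Because each clone is called on its own, this approach ignores both the segmental structure and any drift in the signal along the chromosome.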
Spatial Dynamic Methods
• Hidden Markov Models
– Dosage states form a Markov chain of hidden
variables
– Observed LR values are generated from state-specific Gaussian distributions
[Figure: HMM with a chain of hidden dosage states emitting the observed LR values]
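The HMM described above can be decoded with the Viterbi algorithm. The sketch below is not the implementation used in the work; it assumes a uniform initial distribution, a single shared emission variance, and a transition model that stays in the current state with probability `stay_prob` and otherwise switches uniformly. All of these are simplifications chosen for illustration.

```python
import math

def gaussian_logpdf(x, mean, var):
    """Log density of a univariate Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def viterbi(obs, means, var, stay_prob):
    """Most likely dosage-state path under a Gaussian-emission HMM."""
    n_states = len(means)
    log_stay = math.log(stay_prob)
    log_switch = math.log((1 - stay_prob) / (n_states - 1))
    # Initial scores: uniform prior plus the first emission.
    score = [gaussian_logpdf(obs[0], m, var) - math.log(n_states) for m in means]
    back = []
    for x in obs[1:]:
        new_score, pointers = [], []
        for j in range(n_states):
            # Best predecessor state for landing in state j.
            best_i = max(range(n_states),
                         key=lambda i: score[i] + (log_stay if i == j else log_switch))
            trans = log_stay if best_i == j else log_switch
            new_score.append(score[best_i] + trans + gaussian_logpdf(x, means[j], var))
            pointers.append(best_i)
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda i: score[i])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# States 0, 1, 2 stand for loss, normal, gain with the ideal LR means.
print(viterbi([-1.0, -0.9, -1.1, 0.0, 0.1, 0.6, 0.5], [-1.0, 0.0, 0.58], 0.04, 0.9))
```

Note that each state's mean is fixed here, which is exactly the limitation the dosage-specific Kalman filters are meant to relax.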
Dosage-Specific Kalman Filters
• Introduce a hidden trajectory to model state-specific LR distributions (the mean is no longer fixed)
Linear Dynamics for dosage state m
Switching Kalman Filters
[Figure: trajectories 1 through M, with the dosage state chain selecting among them]
• An SKF generates each observation from one of the trajectories.
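The generative process can be illustrated with a small simulator. The exact form of the linear dynamics is not given in the slides, so the sketch below assumes a mean-reverting first-order model for each trajectory; `levels`, `a`, and `stay_prob` are hypothetical names, while `b` and `r` follow the slide notation for the noise in the hidden dynamics and in the observation. It also assumes at least two dosage states.

```python
import random

def simulate_skf(n_steps, levels, a=0.9, b=0.01, r=0.02, stay_prob=0.95, seed=0):
    """Simulate a switching Kalman filter with one trajectory per dosage state.

    Assumed dynamics (illustrative only):
      x_t[m] = levels[m] + a * (x_{t-1}[m] - levels[m]) + N(0, b)
      y_t    = x_t[s_t] + N(0, r)
    where s_t is a Markov chain over dosage states.
    """
    rng = random.Random(seed)
    n_states = len(levels)
    x = list(levels)                    # every trajectory starts at its level
    state = rng.randrange(n_states)     # initial dosage state, chosen uniformly
    states, trajectories, observations = [], [], []
    for _ in range(n_steps):
        if rng.random() > stay_prob:    # switch to a different dosage state
            state = rng.choice([m for m in range(n_states) if m != state])
        # All M trajectories evolve at every step, whether selected or not.
        x = [levels[m] + a * (x[m] - levels[m]) + rng.gauss(0.0, b ** 0.5)
             for m in range(n_states)]
        # The observation is read off the trajectory picked by the state chain.
        y = x[state] + rng.gauss(0.0, r ** 0.5)
        states.append(state)
        trajectories.append(x)
        observations.append(y)
    return states, trajectories, observations

states, trajectories, observations = simulate_skf(50, [-1.0, 0.0, 0.58])
print(states[:10])
```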
Posterior Inference
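• Dosage annotation is equivalent to estimating the posterior of the dosage states given the observed LR values.
• Recovering the hidden trajectory: estimate the posterior of the trajectories given the observed LR values.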
Variational Inference
• Exact posterior inference is intractable.
• Variational inference: decouple the hidden chains.
• Decoupled chains have tractable distributions.
Variational Inference
• Use this tractable distribution to approximate the true
distribution by minimizing KL divergence.
• Fixed point equations to update the variational parameters.
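The approximation described above can be written compactly. The notation is assumed here for illustration, since the symbols did not survive on the slides: S denotes the dosage state chain, X^{(m)} the m-th hidden trajectory, and Y the observed LR values over T clones.

```latex
% Decoupled (factorized) approximating family
q\big(S_{1:T}, X^{(1:M)}_{1:T}\big) \;=\; q(S_{1:T}) \prod_{m=1}^{M} q\big(X^{(m)}_{1:T}\big)

% Choose the member closest to the true posterior in KL divergence
q^{*} \;=\; \arg\min_{q} \; \mathrm{KL}\Big( q\big(S_{1:T}, X^{(1:M)}_{1:T}\big) \,\Big\|\, p\big(S_{1:T}, X^{(1:M)}_{1:T} \mid Y_{1:T}\big) \Big)
```

Each factor is then a standard chain (an HMM for S, a Kalman filter for each X^{(m)}), so the fixed-point updates can alternate between forward-backward and Kalman smoothing passes.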
Parameter Sharing
• The CGH dataset contains whole-genome
measurements for multiple individuals.
• Chromosome-specific parameters, shared across individuals: trajectory parameters
• Individual-specific parameters, shared across chromosomes: all other parameters (e.g. output noise variance)
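A quick count shows why sharing matters for a dataset of this shape. The sketch below is only an accounting exercise; the number of chromosomes and the per-group parameter counts are hypothetical inputs, not figures from the presentation.

```python
def count_parameters(n_individuals, n_chromosomes, n_trajectory, n_other, shared=True):
    """Count free parameters with and without the sharing scheme.

    n_trajectory: trajectory parameters per chromosome (hypothetical count)
    n_other: remaining parameters per individual (hypothetical count)
    """
    if shared:
        # Trajectory parameters are tied across individuals,
        # all other parameters are tied across chromosomes.
        return n_chromosomes * n_trajectory + n_individuals * n_other
    # Without sharing, every (individual, chromosome) pair has its own full set.
    return n_individuals * n_chromosomes * (n_trajectory + n_other)

# 125 individuals as in the dataset; 23 chromosomes and the counts 2 and 3 are assumptions.
print(count_parameters(125, 23, 2, 3, shared=True))
print(count_parameters(125, 23, 2, 3, shared=False))
```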
Experiment Design
• Simulation Analysis:
– Data generated from SKFs.
– Compare with: threshold, HMM.
• aCGH profiles of 125 colorectal tumors
(Nakao et al. 2004)
– Case studies of 3 representative chromosomes.
– Populational analysis over 125 genomes
Simulation Analysis (1)
Performance of dosage state prediction
(b – noise in hidden dynamics, r – noise in observation, M=5)
Simulation Analysis (2)
[Figure panels: Synthetic Data, Prediction by HMM, Prediction by SKF]
Real aCGH Profile
Spatial Patterns Difficult for Conventional Methods
(1) Flat-Arch Pattern
(2) Step Pattern
(3) Spikes Pattern
Populational Analysis
Frequency of dosage state alteration of 125 individuals
red bar – copy number gain or amplification
blue bar – copy number loss or deletion
solid vertical lines – boundary between chromosomes
Populational Analysis
Frequency of dosage state alteration on 2 chromosomes
top, red square – copy number gain
top, blue circle – copy number loss
bottom, red square – copy number amplification
bottom, blue circle – copy number deletion
Summary
• SKF for whole-genome analysis of aCGH data.
• SKF can capture variations in the hybridization
efficiency.
• Parameter sharing scheme for data integration.
• Possible Extensions:
– Gene expression concordance analysis
– Incorporate information about sequence length and
distance between clones
Thank you!
Populational Analysis
Detailed spectrum of GIM rates over 125 colorectal cancer patients in 4 hotspot regions, annotated with cancer-related genes
• M is selected by AIC.
• We have also run experiments comparing SKF with segmentation methods (results not shown here).
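Selecting M by AIC, as noted above, amounts to fitting the model for several candidate values and keeping the one with the lowest AIC = 2k − 2 ln L, where k is the number of free parameters and L the maximized likelihood. The sketch below shows only the selection step; the log-likelihoods and parameter counts are made-up numbers for illustration, not results from the work.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood

def select_m(candidates):
    """Pick the number of dosage states M with the smallest AIC.

    candidates maps M to a (log_likelihood, n_params) pair from a fitted model.
    """
    return min(candidates, key=lambda m: aic(*candidates[m]))

# Hypothetical fits for M = 3, 4, 5 (values invented for this example).
fits = {3: (-520.0, 12), 4: (-500.0, 16), 5: (-498.5, 20)}
print(select_m(fits))
```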
Switching Kalman Filters
• An SKF generates each observation from one of the trajectories.
• The dosage state chain is the switching process, as in an HMM.
• The observations are the LR values.