Charalampos (Babis) E. Tsourakakis Joint work with Gary Miller, Richard Peng, Russell Schwartz, Stanley Shackney, Dave Tolliver, Maria A. Tsiarli Speaking Skills Machine Learning Journal Club 23 Feb. 2010 Algorithms for Denoising aCGH Data 1 Motivation & Problem Definition Related Work Our Problem Formulation and a O(n2) solution Experimental Results ~ Theoretical Ramifications: a O(n1.5) algorithm within additive ε error) Conclusion & Future Work Algorithms for Denoising aCGH Data 2 For each probe we obtain a noisy measurement of log(T/R) where T: true DNA copy number R=2 for humans (diploid organisms) Test DNA: Patient Reference DNA: Healthy subject Algorithms for Denoising aCGH Data 3 Ideal Scenario Copy Number log(T/R) 0 -Inf 1 -1 2 0 (Healthy probe) 3 0.58 4 1 T: true DNA copy number R=2 for humans (diploid organisms) In practice, for a variety of reasons (e.g., sample impurity, measurement noise) we obtain a noisy measurement log(T/R) per probe. Algorithms for Denoising aCGH Data 4 Input A vector (t~1,t~2,…,t~n), where ~ti is the measurement at the i-th proble Output A vector (t1,t2,…,tn) with discrete values corresponding to the true DNA copy number Algorithms for Denoising aCGH Data 5 log(T/R) Probe id Blue x : noisy measurements (input) Red □ : true value (output) Algorithms for Denoising aCGH Data 6 log(T/R) Nearby probes tend to have the same DNA copy number! Probe id 1)Treat Data as 1D time series 2) Fit Piecewise Constant Segments Algorithms for Denoising aCGH Data 7 Motivation & Problem Definition Related Work Our Problem Formulation and a O(n2) solution Experimental Results ~ Theoretical Ramifications: a O(n1.5) algorithm within additive ε error) Conclusion & Future Work Algorithms for Denoising aCGH Data 8 Not Exhaustive Lasso (Tibshirani et al., Huang et al.) Kalman Filters (Xing et al.) Hidden Markov Models (Fridlyand et al.) Bayesian Hidden Markov Models (Guha et al.) Wavelets (Hsu et al.) Hierarchical clustering (Tibshirani et al.) Circular Binary Segmentation (Olshen et al.) Statistical likelihood tests Loweless, i.e., Locally weighted regression Genetic Local search (Jong et al.) Algorithms for Denoising aCGH Data 9 Not Exhaustive Gaussian Mixtures fitting using EM (Hodgson et al.) Variable-bandwidth kernel methods (Muller et al.) Variable-knot splines (Stone et al.) Fused quantile regression (Wang et al.) Non parametric regression Thresholding Algorithms for Denoising aCGH Data 10 CBS: Matlab Toolbox, modification of Binary Segmentation. CGHSEG: Gaussianity, AIC&BIC, DP Both methods perform consistently better than others on real data (Lai et al., Willenbrock et al.) Algorithms for Denoising aCGH Data 11 Motivation & Problem Definition Related Work Our Problem Formulation and a O(n2) solution Experimental Results ~ Theoretical Ramifications: a O(n1.5) algorithm within additive ε error) Conclusion & Future Work Algorithms for Denoising aCGH Data 12 For the vector (p1,…,pn) we define the following recurrence equation: Breakpoint Squared error Penalty per segment Tradeoff(λ) Algorithms for Denoising aCGH Data 13 Keep the first and second order moments in an “online” way. Run time O(n2) Algorithms for Denoising aCGH Data 14 λ=0.2 Train on synthetic data generated by a realistic simulator Willenbrock et al λ=0.2 results in Precision=0.98 Recall=0.91 Algorithms for Denoising aCGH Data 15 Motivation & Problem Definition Related Work Our Problem Formulation and a O(n2) solution Experimental Results ~ Theoretical Ramifications: a O(n1.5) algorithm within additive ε error) Conclusion & Future Work Algorithms for Denoising aCGH Data 16 A)CSB B)CGHTrimmer Lai et al. (2005) Dataset Available from http://compbio.med.harvard.edu C)CGHSEG Algorithms for Denoising aCGH Data 17 Snijders et al., 15 Cell lines Golden Standard Dataset with two main characteristics a) Knowledge of ground truth after tedious biological tests b) “Easy” dataset (low noise levels) #Cell Description Color Lines CGHTrimmer performs worse that at least one competitor 0 CGHTrimmer performs equally well with both competitors 9 CGHTrimmer performs better than one competitor 5 CGHTrimmer performs better than both competitors 1 Algorithms for Denoising aCGH Data 18 A)CBS B)CGHTrimmer Breast Cancer Cell Line BT474 Chromosome 1 C)CGHSEG Algorithms for Denoising aCGH Data 19 A)CBS B)CGHTrimmer Breast Cancer Cell Line BT474 Chromosome 17 Results supported by oncology literature C)CGHSEG Algorithms for Denoising aCGH Data 20 A)CBS B)CGHTrimmer Breast Cancer Cell Line T47D Chromosome 1 C)CGHSEG Algorithms for Denoising aCGH Data 21 CGHTrimmer CGHSEG CBS Coriell 5.78sec 8.15min 47.7min Breast Cancer 22.76 23.3min 4.95hours REMARKS A) 1 to 3 orders of magnitude faster. B) Reason for speedup: different approach compared to competitors (lack of statistical assumptions, tests, likelihood functions but an intuitive formulation and a simple dynamic programming algorithm) Algorithms for Denoising aCGH Data 22 Motivation & Problem Definition Related Work Our Problem Formulation and a O(n2) solution Experimental Results ~ Theoretical Ramifications: a O(n1.5) algorithm within additive ε error) Conclusion & Future Work Algorithms for Denoising aCGH Data 23 Is O(n2) tight? Probably not… Lemma If |pi-pj| > 2 2C then in the optimal solution points i,j belong to different segments. Question: Can we use one of the existing “tricks” to speed up our dynamic program? Algorithms for Denoising aCGH Data 24 Unfortunately, existing “tricks” for dynamic programming do not work for us (e.g., Monge property) But we can find an good approximation algorithm! Algorithms for Denoising aCGH Data 25 Define a shifted i by2 a constant objective function, pm i.e., DPi=OPTi- m 1 Claim: DPi satisfies the following optimization formula where Si=p1+…+pi Algorithms for Denoising aCGH Data 26 Find the maximum and the minimum value that the shifted objected DPi can take. Claim: DPi takes values in [0,U2n] where Algorithms for Denoising aCGH Data 27 Perform binary search on ~ by guessing for each index i a DPi ( 0 ) ~ DP i U2n - DPi Algorithms for Denoising aCGH Data 28 constant Dot product of two 4D points ~ j,2Sj,Sj2+jDPj) ~ (x,i,Si,-1) and (j,DP Reporting points in Halfspaces, Matousek FOCS 1992 Algorithms for Denoising aCGH Data 29 ε/n by performing log( U2n/(ε/n) ) iterations Remember: We want DPi not DPi ( ) U2n 0 ~ DPi - DPi DPi Set i=n, our algorithm is an ε-additive approximation algorithm. Run Time: Algorithms for Denoising aCGH Data 30 Motivation & Problem Definition Related Work Our Problem Formulation and a O(n2) solution Experimental Results ~ Theoretical Ramifications: a O(n1.5) algorithm within additive ε error) Conclusion & Future Work Algorithms for Denoising aCGH Data 31 CHGtrimmer method: Simple, intuitive dynamic programming algorithm outperforms state-of-the-art competitors: Important biological biomarkers for aCGH data. Good precision-recall results. Significantly faster. New paradigm for Dynamic Programming by reducing the problem to a computational geometry halfspace query problem. Algorithms for Denoising aCGH Data 32 Structure: O(n2) unlikely to be tight. Extend our approximation to 2D recurrences (benefit many applications) Preprocess breast cancer data using Trimmer to get a discretized version, essential to perform standard tumor phylogenetics Make the DS practical. Algorithms for Denoising aCGH Data 33 CGHTrimmer: Discretizing Noisy Array CGH Data C.E.T, D. Tolliver, M. A. Tsiarli, S. Shackney, R. Schwartz Algorithms for Denoising aCGH Data G.L. Miller, R. Peng. R. Schwartz, C.E.T Code will be made available at http://www.cs.cmu.edu/~ctsourak/ Algorithms for Denoising aCGH Data 34