Approximate l-fold Cross-Validation with Least Squares SVM and Kernel Ridge Regression

Richard E. Edwards (1), Hao Zhang (1), Lynne E. Parker (1), Joshua R. New (2)

(1) Distributed Intelligence Lab, Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, TN, USA
(2) Whole Building and Community Integration Group, Oak Ridge National Laboratory, Oak Ridge, TN, USA

December 7, 2013. Funded by the United States Department of Energy.

Outline: Introduction, Related Work, Preliminaries, Approach, Experiments, Conclusion


Introduction

Applying Kernel Methods to Large Datasets

- Direct kernel application scales poorly:
  - requires O(n^2) memory
  - model solve time increases
  - model selection time increases
- Scaling improvements:
  - faster model solvers
  - problem decompositions
  - low-rank kernel approximations
- Most scaling improvements apply to standard SVMs.

- Least Squares Support Vector Machine (LS-SVM):
  - naive cross-validation model calibration complexity: O(l n^3) (a sketch of this baseline follows the Related Work section)
  - best exact leave-one-out (LOO) cross-validation complexity: O(n^2)
  - best approximate cross-validation complexity: O(m^2 n)
- We can do better:
  - approximate cross-validation complexity: O(n log n)
  - applies to LOO as well


Related Work

Previous LS-SVM Model Selection

- T. Pahikkala et al. (2006) and Cawley et al. (2004) obtained O(n^2) LOO cross-validation:
  - utilizes matrix-inverse properties
- An et al. (2007) obtained O(m^2 n) l-fold cross-validation:
  - uses a low-rank kernel approximation
  - removes redundancy from the validation process
  - introduces a new cross-validation algorithm
- L. Ding et al. (2011) obtained O(l n log n) l-fold cross-validation:
  - O(n^2 log n) LOO cross-validation
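To ground the O(l n^3) baseline, here is a minimal sketch of naive l-fold cross-validation for an LS-SVM classifier with an RBF kernel: every fold rebuilds and re-solves a dense linear system from scratch. This is an illustration under our own assumptions, not code from the talk; the helper names (rbf_kernel, lssvm_fit, naive_lfold_cv_error) and the hyperparameters gamma and sigma are ours.

```python
import numpy as np

def rbf_kernel(X, Z, sigma):
    # Gaussian RBF kernel matrix between the rows of X and the rows of Z.
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def lssvm_fit(X, y, gamma, sigma):
    # Solve the LS-SVM dual system [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y].
    n = len(y)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = rbf_kernel(X, X, sigma) + np.eye(n) / gamma
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))  # O(n^3) dense solve
    return sol[0], sol[1:]                                # bias b, coefficients alpha

def naive_lfold_cv_error(X, y, l, gamma, sigma):
    # One O(n^3) solve per fold -> O(l n^3) overall: the cost this talk attacks.
    folds = np.array_split(np.arange(len(y)), l)
    errors = 0
    for held in folds:
        train = np.setdiff1d(np.arange(len(y)), held)
        b, alpha = lssvm_fit(X[train], y[train], gamma, sigma)
        g = rbf_kernel(X[held], X[train], sigma) @ alpha + b
        errors += np.sum(np.sign(g) != y[held])
    return errors / len(y)
```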
Preliminaries

Multi-Level Matrices

- Matrices indexed by factors.
- Example 3-level matrix with factors 2x2, 4x4, 2x2:
  - |M| = (2 × 4 × 2) × (2 × 4 × 2)
  - Level 1:
    M = \begin{pmatrix} A_{00} & A_{01} \\ A_{10} & A_{11} \end{pmatrix}
  - Level 2:
    A_{00} = \begin{pmatrix} B_{00} & B_{01} & B_{02} & B_{03} \\ B_{10} & B_{11} & B_{12} & B_{13} \\ B_{20} & B_{21} & B_{22} & B_{23} \\ B_{30} & B_{31} & B_{32} & B_{33} \end{pmatrix}
  - Level 3:
    B_{00} = \begin{pmatrix} 6 & 2 \\ 5 & 1 \end{pmatrix}

Circulant Matrices

- A special Toeplitz matrix.
- Its inverse can be computed in O(n log n) via the Fast Fourier Transform (an FFT-based solve is sketched at the end of this section).
- Example:
  \begin{pmatrix} 1 & 2 & 3 & 4 \\ 4 & 1 & 2 & 3 \\ 3 & 4 & 1 & 2 \\ 2 & 3 & 4 & 1 \end{pmatrix}
- General definition:
  \begin{pmatrix} c_0 & c_1 & \cdots & c_n \\ c_n & c_0 & \cdots & c_{n-1} \\ \vdots & \vdots & \ddots & \vdots \\ c_1 & c_2 & \cdots & c_0 \end{pmatrix}

P-Level Circulant Matrices

- Combines circulant matrices with multi-level matrices:
  - each level is a circulant matrix
  - all factors are now one-dimensional
- Example 3-level matrix with factors 2, 4, 2:
  - Level 1:
    M = \begin{pmatrix} A_0 & A_1 \\ A_1 & A_0 \end{pmatrix}
  - Level 2:
    A_0 = \begin{pmatrix} B_0 & B_1 & B_2 & B_3 \\ B_3 & B_0 & B_1 & B_2 \\ B_2 & B_3 & B_0 & B_1 \\ B_1 & B_2 & B_3 & B_0 \end{pmatrix}
  - Level 3:
    B_0 = \begin{pmatrix} 5 & 6 \\ 6 & 5 \end{pmatrix}
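The O(n log n) claim rests on the fact that the discrete Fourier transform diagonalizes every circulant matrix, so a circulant system can be solved with a few FFTs. A minimal sketch, assuming the matrix is represented by its first column c (the function name circulant_solve is ours):

```python
import numpy as np

def circulant_solve(c, b):
    # Solve C x = b where C is the circulant matrix whose first column is c,
    # i.e. C[i, j] = c[(i - j) % n]. The DFT of c gives the eigenvalues of C,
    # so the whole solve is three FFTs: O(n log n).
    eig = np.fft.fft(c)
    return np.fft.ifft(np.fft.fft(b) / eig).real

# Check against a dense solve on the 4x4 example above
# (first row 1 2 3 4 corresponds to first column 1 4 3 2).
c = np.array([1.0, 4.0, 3.0, 2.0])
C = np.array([[c[(i - j) % 4] for j in range(4)] for i in range(4)])
b = np.arange(4.0)
assert np.allclose(circulant_solve(c, b), np.linalg.solve(C, b))
```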
Approach

Overview

- We use the same approximation method as L. Ding et al. (2011).
- We remove inefficiencies from the cross-validation process.
- Result: O(n log n) LOO cross-validation.
- L. Ding et al.'s LOO cross-validation: O(n^2 log n).

Kernel Approximation via P-Level Circulant Matrices

- Song et al. (2010) introduced a P-level circulant approximation of the RBF kernel:
  - the approximation converges as the matrix level factors approach infinity
  - however, 2 to 3 factors work well in practice (L. Ding et al. (2011) and our work)
  - result: O(n + n 2^p) complexity, which
    - allows O(n log n) model solve time
    - allows fast model selection
- One caveat: this approximation method only applies to RBF kernels.

Algorithm 1: Kernel Approximation with a P-level Circulant Matrix
Input: M (kernel matrix size), F = {n_0, n_1, ..., n_{p-1}} (level factors), k (kernel function)
1: N ← {all multi-level indices defined by F}
2: T ← zeros(M), U ← zeros(M)
3: H_n ← {x_0, x_1, ..., x_{p-1}} ∈ R^p s.t. ∀ x_i ∈ H_n, x_i > 0
4: for all j ∈ N do
5:   T_j ← k(‖j H_n‖_2)
6: end for
7: for all j ∈ N do
8:   D_j ← D_{j,0} × D_{j,1} × ⋯ × D_{j,p−1}
9:   U_j ← Σ_{l ∈ D_j} T_l
10: end for
11: K̃ ← U
Output: K̃ (the approximate kernel matrix)

Efficient Cross-Validation

Theorem. Let y^{(k)} = sign[g_k(x)] denote the classifier formulated by leaving the k-th group out, and let β_{k,i} = y_{k,i} − g_k(x_{k,i}). Then β_{(k)} = C_{kk}^{−1} α_{(k)}.

- Proven by An et al. (2007).
- Take-aways:
  - a single kernel matrix inverse serves all folds
  - only small per-fold inverses are needed to compute the hold-out results (a dense sketch follows this section)

Algorithm 2: Efficient Cross-Validation
Input: K (kernel matrix), l (number of folds), y (response)
1: K_γ^{−1} ← inv(K + (1/γ)I), d ← 1_n^T K_γ^{−1} 1_n
2: C ← K_γ^{−1} − (1/d) K_γ^{−1} 1_n 1_n^T K_γ^{−1}
3: α ← K_γ^{−1} y − (1/d) K_γ^{−1} 1_n 1_n^T K_γ^{−1} y
4: n_k ← size(y)/l, y^{(k)} ← zeros(l, n_k)
5: for k ← 1 to l do
6:   solve C_{kk} β_{(k)} = α_{(k)}
7:   y^{(k)} ← sign[y_{(k)} − β_{(k)}]
8: end for
9: error ← (1/2) Σ_{k=1}^{l} Σ_{i=1}^{n_k} |y_{k,i} − y_i^{(k)}|
Output: error

Approximate l-fold Cross-Validation

Theorem. If K is a p-level circulant matrix with factorization n = n_0 n_1 ⋯ n_{p−1} and l = n_0 n_1 ⋯ n_s s.t. s ≤ p − 1, then the computational complexity of An et al.'s cross-validation algorithm is O(n log n).

- Take-aways:
  - this combination produces an O(n log n) runtime
  - works for any l-fold, provided the factorizations align

Extension to Kernel Ridge Regression

- An et al.'s changes to their algorithm:
  - change C's value to K_γ^{−1}
  - change α's value to K_γ^{−1} y
- Our theorem still holds under these settings.
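For concreteness, here is a dense-algebra sketch of Algorithm 2 for the LS-SVM classifier. It omits the circulant machinery (so the up-front inverse is still O(n^3)); what it illustrates is the work-sharing: one inverse feeds every fold, and each fold pays only for an n_k × n_k solve. The names are ours, and gamma plays the role of γ.

```python
import numpy as np

def efficient_lfold_cv_error(K, y, l, gamma):
    n = len(y)
    Kg_inv = np.linalg.inv(K + np.eye(n) / gamma)   # step 1: shared O(n^3) inverse
    ones = np.ones(n)
    d = ones @ Kg_inv @ ones
    u = Kg_inv @ ones
    C = Kg_inv - np.outer(u, u) / d                 # step 2 (K_gamma^{-1} is symmetric)
    alpha = C @ y                                   # step 3: alpha = C y
    errors = 0.0
    for idx in np.array_split(np.arange(n), l):
        # step 6: beta_(k) = C_kk^{-1} alpha_(k) gives the fold's hold-out residuals
        beta = np.linalg.solve(C[np.ix_(idx, idx)], alpha[idx])
        y_pred = np.sign(y[idx] - beta)             # step 7: hold-out predictions
        errors += 0.5 * np.abs(y[idx] - y_pred).sum()  # step 9: misclassification count
    return errors / n
```

On small problems this should agree, up to numerical error, with the naive fold-by-fold baseline sketched after the Related Work section, while doing the expensive linear algebra only once.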
Experiments

Experimental Setup

- Scaling:
  - measured with randomly generated data
  - dataset sizes range from 2^13 to 2^20 samples
- Approximation quality:
  - measured on benchmark datasets
- Hyperparameter selection quality:
  - test exact models on real-world datasets

Single CPU Scaling Test

# Examples   E-LOO     A-LOO-LSSVM   A-LOO-KRR
2^13         4.43s     1.3s          0.54s
2^14         35.25s    2.6s          1.06s
2^15         281.11s   5.32s         2.14s
2^16         –         10.88s        4.3s
2^17         –         22.45s        8.55s
2^18         –         47.41s        17.28s
2^19         –         101.36s       35.39s
2^20         –         235.83s       68.22s

Runtime Scaling Comparison

[Figure: log-log plot of runtime (s) vs. dataset size for E-LOO-LSSVM, A-LOO-LSSVM, and A-LOO-KRR]

- A-LOO scales the same for LS-SVM and KRR (same slopes; a slope check on the table above appears at the end of this section).

[Figure: log-log plot of runtime (s) vs. dataset size for An et al.'s method, A-LOO-LSSVM, and A-LOO-KRR]

- We scale no worse than An et al.'s low-rank approximation.
- We are assumption-free, whereas An et al. require m << n.

Benchmark Dataset Performance

Data set      #Train  #Test  A-Error (L. Ding et al.)  A-Error, H_n ∈ (1,2)  A-Error, H_n ∈ (10,11)  E-Error
1) Titanic    150     2051   22.897±1.427              23.82±1.44            22.80±0.68              22.92±0.43
2) B. Cancer  200     77     27.831±5.569              29.87±5.59            26.75±5.92              25.97±4.40
3) Diabetes   468     300    26.386±4.501              25.67±1.13            25.27±2.07              23.00±1.27
4) F. Solar   666     400    36.440±2.752              35.65±2.78            36.65±2.47              33.75±1.44
5) Banana     400     4900   11.283±0.992              14.10±1.74            18.98±1.76              10.97±0.57
6) Image      1300    1010   4.391±0.631               17.64±1.52            6.89±0.73               2.47±0.53
7) Twonorm    400     7000   2.791±0.566               15.64±25.71           6.85±8.86               2.35±0.07
8) German     700     300    25.080±2.375              29.93±1.61            27.40±1.79              21.87±1.77
9) Waveform   400     4600   Not Reported              19.85±3.87            17.57±1.93              9.77±0.31
10) Thyroid   140     75     4.773±2.291               29.33±4.07            17.33±3.89              4.17±3.23

- The real values selected for H_n affect approximation quality.
- Hyperparameter selection is now over R^{p+2}, rather than R^2.

Real World Dataset

Data set   CoV(%)       MAPE(%)     CoV(%)      MAPE(%)
House 1    19.6±1.69    15.3±0.47   20.1±0.81   16.1±0.85
Sensor A   1.3±0.05     1.0±0.05    –           –
Sensor B   17.2±4.89    10.8±0.25   –           –
Sensor C   12.0±2.31    7.8±0.68    –           –
Sensor D   1.4±0.09     0.9±0.03    –           –
S1         13.1±0.00    10.0±0.00   13.7±0.00   11.2±0.00
S2         3.1±0.00     4.7±0.00    6.4±0.00    4.5±0.00

- Selected hyperparameters work well with exact models.
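As a quick sanity check on the O(n log n) claim, one can fit a line to the A-LOO timings above on a log-log scale: an O(n log n) method should show a slope near 1 (slightly above, due to the log factor), whereas a cubic-cost method would show a slope near 3. A small sketch using the reported numbers:

```python
import numpy as np

# Timings copied from the single-CPU scaling table (2^13 .. 2^20 samples).
n = 2.0 ** np.arange(13, 21)
t_lssvm = np.array([1.3, 2.6, 5.32, 10.88, 22.45, 47.41, 101.36, 235.83])
t_krr = np.array([0.54, 1.06, 2.14, 4.3, 8.55, 17.28, 35.39, 68.22])

for name, t in [("A-LOO-LSSVM", t_lssvm), ("A-LOO-KRR", t_krr)]:
    slope = np.polyfit(np.log(n), np.log(t), 1)[0]  # empirical growth exponent
    print(f"{name}: log-log slope ~ {slope:.2f}")   # about 1.1 and 1.0
```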
Conclusion

- The approach provides an O(n log n) l-fold cross-validation method.
- The approach scales well.
- The approach selects hyperparameters that perform well with the exact model.
- Hyperparameter selection is now over R^{p+2}, rather than R^2.