SPARSE TENSOR DECOMPOSITION SOFTWARE
Papa S. Diaw, Master's Candidate
Dr. Michael W. Berry, Major Professor

Introduction
• Large data sets
• Nonnegative Matrix Factorization (NMF)
  – Insights into hidden relationships
  – Requires arranging multi-way data into a matrix
• Drawbacks of the matrix approach:
  – Higher memory and CPU demands
  – Only linear relationships in the matrix representation
  – Failure to capture important structural information
  – Slower or less accurate calculations

Introduction (cont'd)
• Nonnegative Tensor Factorization (NTF)
  – A natural fit for high-dimensional data
  – Preserves the original multi-way structure of the data
  – Applications: image processing, text mining

Tensor Toolbox for MATLAB
• Developed at Sandia National Laboratories
• Requires MATLAB licenses
• Proprietary software

Motivation of the PILOT
• Python software for NTF
• Alternative to the Tensor Toolbox for MATLAB
• Incorporation into FutureLens
• Exposure to NTF
• Interest from the open source community

Tensors
• Multi-way array
• Order / mode / ways
• High-order
• Fiber
• Slice
• Unfolding (matricization or flattening)
  – Reordering the elements of an N-th order tensor into a matrix
  – Not unique

Tensors (cont'd)
• Kronecker product (a short NumPy sketch of both products follows the PARAFAC slide below):

  A ⊗ B = [ a11·B  a12·B  …  a1J·B
            a21·B  a22·B  …  a2J·B
              ⋮      ⋮           ⋮
            aI1·B  aI2·B  …  aIJ·B ]

• Khatri-Rao product (column-wise Kronecker product):

  A ⊙ B = [ a1⊗b1  a2⊗b2  …  aJ⊗bJ ]

Tensor Factorization
• Introduced by Hitchcock in 1927 and later developed by Cattell in 1944 and Tucker in 1966
• Rewrite a given tensor as a finite sum of lower-rank tensors
• Two main models: Tucker and PARAFAC
• Choosing the approximation rank is a hard problem

PARAFAC
• Parallel Factor Analysis
• Also known as Canonical Decomposition (CANDECOMP)
• Harshman, 1970; Carroll and Chang, 1970
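As noted above, a minimal NumPy sketch of the two matrix products. This is illustrative only; the khatri_rao helper below is a hypothetical name, not part of the PILOT code.

import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker (Khatri-Rao) product of A (I x J) and B (K x J)."""
    I, J = A.shape
    K, J2 = B.shape
    assert J == J2, "A and B must have the same number of columns"
    # Each column of the result is kron(A[:, j], B[:, j])
    return np.column_stack([np.kron(A[:, j], B[:, j]) for j in range(J)])

A = np.arange(1, 7).reshape(2, 3)      # 2 x 3
B = np.arange(1, 10).reshape(3, 3)     # 3 x 3
print(np.kron(A, B).shape)             # (6, 9) -- Kronecker product
print(khatri_rao(A, B).shape)          # (6, 3) -- Khatri-Rao product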
PARAFAC (cont'd)
• Given a three-way tensor X and an approximation rank R, the factor matrices A, B, and C collect the vectors of the rank-one components:

  X ≈ [[A, B, C]] = Σ_{r=1}^{R} a_r ∘ b_r ∘ c_r

PARAFAC (cont'd)
• Alternating Least Squares (ALS)
• ALS cycles "over all the factor matrices and performs a least-square update for one factor matrix while holding all the others constant." [7]
• NTF can be considered an extension of the PARAFAC model with the added constraint of nonnegativity.

Python
• Object-oriented, interpreted
• Runs on all major platforms
• Gentle learning curve
• Supports object methods (everything in Python is an object)

Python (cont'd)
• Recent interest in the scientific community
• Several scientific computing packages: NumPy, SciPy
• Python is extensible

Data Structures
• Dictionary
  – Stores the tensor data
  – Mutable container that can hold any number of Python objects
  – Pairs of keys and their corresponding values
• Well suited to the sparseness of our tensors
  – VAST 2007 contest data: 1,385,205,184 elements, of which only 1,184,139 are nonzero
  – Stores only the nonzero elements and accounts for the zeros through the dictionary's default value

Data Structures (cont'd)
• NumPy arrays
  – Fundamental package for scientific computing in Python
  – Used for Khatri-Rao products and tensor multiplications
  – Speed

Modules
• SPTENSOR
  – Most important module
  – A class built from the subscripts and values of the nonzero entries
  – Flexible input (NumPy arrays, NumPy matrices, Python lists); stored internally as a dictionary
  – Keeps a few instance variables: size, number of dimensions, Frobenius (Euclidean) norm

Modules (cont'd)
• PARAFAC
  – Coordinates the NTF
  – Implementation of ALS
  – Iterates until convergence or the maximum number of iterations
  – The factor matrices are returned as a Kruskal tensor

Modules (cont'd)
• INNERPROD
  – Inner product between an SPTENSOR and a KTENSOR
  – Used by PARAFAC to compute the residual norm
  – Uses the Kronecker product for matrices
• TTV
  – Multiplies a sparse tensor by a (column) vector and returns a tensor
  – Workhorse of the software package; most of the computation happens here
  – Called by the MTTKRP and INNERPROD modules

Modules (cont'd)
• MTTKRP
  – Khatri-Rao product of all factor matrices except the one being updated
  – Matrix multiplication of the matricized tensor with the Khatri-Rao product obtained above
• KTENSOR
  – Kruskal tensor: the object returned after the factorization is done and the factor matrices are normalized
  – A class with instance variables such as the norm
  – The norm of the KTENSOR plays a big part in determining the residual norm in the PARAFAC module
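To make the dictionary-based storage concrete, here is a minimal, simplified sketch. The class name SparseTensor and its methods are illustrative only, not the actual SPTENSOR implementation.

import math

class SparseTensor:
    """Toy dictionary-backed sparse tensor: only nonzeros are stored."""

    def __init__(self, subs, vals, shape):
        self.shape = tuple(shape)
        self.ndims = len(shape)
        # Dictionary of nonzeros: index tuple -> value; zeros stay implicit
        self.data = {tuple(i): float(v) for i, v in zip(subs, vals) if v != 0.0}

    def __getitem__(self, idx):
        # Anything not stored falls back to the dictionary's default of zero
        return self.data.get(tuple(idx), 0.0)

    def norm(self):
        # Frobenius (Euclidean) norm computed over the stored nonzeros only
        return math.sqrt(sum(v * v for v in self.data.values()))

# Example: a small 3-way tensor with three nonzero entries
X = SparseTensor(subs=[(0, 1, 2), (3, 0, 1), (2, 2, 2)],
                 vals=[1.5, 2.0, -0.5],
                 shape=(4, 3, 3))
print(X[0, 1, 2])   # 1.5 -- stored nonzero
print(X[1, 1, 1])   # 0.0 -- implicit zero
print(X.norm())     # ~2.5495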
Performance
• Python profiler
  – Measures run-time performance
  – Tool for detecting bottlenecks
• Code optimization
  – Negligible improvement overall
  – Efficiency loss in some modules

Performance (cont'd)
• Lists and recursion:

  ncalls       tottime    percall  cumtime    percall  function
  2803         3605.732   1.286    3605.732   1.286    tolist
  1400         1780.689   1.272    2439.986   1.743    return_unique
  9635         1538.597   0.160    1538.597   0.160    array
  814018498    651.952    0.000    651.952    0.000    'get' of 'dict'
  400          101.308    0.072    140.606    0.100    setup_size
  1575/700     81.705     0.052    7827.373   11.182   ttv
  2129         39.287     0.018    39.287     0.018    max

Performance (cont'd)
• NumPy arrays:

  ncalls       tottime    percall  cumtime    percall  function
  1800         15571.118  8.651    16798.194  9.332    return_unique
  12387        2306.950   0.186    2306.950   0.186    array
  1046595156   1191.479   0.000    1191.479   0.000    'get' of 'dict'
  1800         1015.757   0.564    1086.062   0.603    setup_size
  2025/900     358.778    0.177    20589.563  22.877   ttv
  2734         69.638     0.025    69.638     0.025    max

Performance (cont'd)
• After removing recursion:

  ncalls       tottime    percall  cumtime    percall  function
  75           134.939    1.799    135.569    1.808    myaccumarray
  75           7.802      0.104    8.148      0.109    setup_dic
  100          5.463      0.055    151.402    1.514    ttv
  409          2.043      0.005    2.043      0.005    array
  1            1.034      1.034    1.034      1.034    get_norm
  962709       0.608      0.000    0.608      0.000    append
  479975       0.347      0.000    0.347      0.000    item
  3            0.170      0.057    150.071    50.024   mttkrp
  25           0.122      0.005    0.122      0.005    sum (NumPy)
  87           0.083      0.001    0.083      0.001    dot

Floating-Point Arithmetic
• "Binary floating-point cannot exactly represent decimal fractions, so if binary floating-point is used it is not possible to guarantee that results will be the same as those using decimal arithmetic." [12]
• This makes the iterations volatile.

Convergence Issues
• TOL: tolerance on the change in fit
• P = KTENSOR (current model), X = SPTENSOR (data)
• RN = normX² + normP² − 2 · INNERPROD(X, P)
• Fit_i = 1 − (RN / normX)
• ΔFit = |Fit_i − Fit_{i−1}|
• If ΔFit < TOL, stop; otherwise continue (a small Python sketch of this test appears after the references)

Conclusion
• There is still work to do after the NTF itself:
  – Preprocessing of the data
  – Post-processing of results, e.g., with FutureLens
  – Domain expertise
  – Extracting and identifying hidden components
• Future work:
  – Tucker implementation
  – C extensions to increase speed
  – GUI

Acknowledgments
• Mr. Andrey Puretskiy
  – Discussions at all stages of the PILOT
  – Consultancy in text mining
  – Testing
• Tensor Toolbox for MATLAB (Bader and Kolda)
  – Understanding of tensor decomposition
  – PARAFAC

References
1) http://csmr.ca.sandia.gov/~tgkolda/TensorToolbox/
2) Tamara G. Kolda, Brett W. Bader, "Tensor Decompositions and Applications", SIAM Review, June 10, 2008.
3) Andrzej Cichocki, Rafal Zdunek, Anh Huy Phan, Shun-ichi Amari, "Nonnegative Matrix and Tensor Factorizations", John Wiley & Sons, Ltd, 2009.
4) http://docs.python.org/library/profile.html
5) http://www.mathworks.com/access/helpdesk/help/techdoc
6) http://www.scipy.org/NumPy_for_Matlab_Users
7) Brett W. Bader, Andrey A. Puretskiy, Michael W. Berry, "Scenario Discovery Using Nonnegative Tensor Factorization", in J. Ruiz-Schulcloper and W.G. Kropatsch (Eds.): CIARP 2008, LNCS 5197, pp. 791-805, 2008.
8) http://docs.scipy.org/doc/numpy/user/
9) http://docs.scipy.org/doc/
10) http://docs.scipy.org/doc/numpy/user/whatisnumpy.html
11) Tamara G. Kolda, "Multilinear Operators for Higher-Order Decompositions", SANDIA REPORT, April 2006.
12) http://speleotrove.com/decimal/decifaq1.html#inexact

QUESTIONS?
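As referenced on the Convergence Issues slide, a minimal Python sketch of the stopping test. The function name converged and its arguments are hypothetical; only the formulas come from the slide.

def converged(norm_x, norm_p, inner_xp, fit_prev, tol=1e-4):
    """Stopping test sketched on the Convergence Issues slide.

    norm_x   -- Frobenius norm of the data tensor X (SPTENSOR)
    norm_p   -- norm of the current Kruskal model P (KTENSOR)
    inner_xp -- INNERPROD(X, P)
    fit_prev -- fit value from the previous ALS iteration
    """
    rn = norm_x ** 2 + norm_p ** 2 - 2.0 * inner_xp  # residual norm RN, as on the slide
    fit = 1.0 - rn / norm_x                          # Fit_i
    delta = abs(fit - fit_prev)                      # |Fit_i - Fit_{i-1}|
    return delta < tol, fit                          # stop when the change in fit drops below TOL

# Usage inside an ALS loop (sketch):
# stop, fit_prev = converged(norm_x, norm_p, inner_xp, fit_prev, TOL)
# if stop:
#     break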