SPARSE TENSORS DECOMPOSITION SOFTWARE
Papa S. Diaw, Master’s Candidate
Dr. Michael W. Berry, Major Professor
5/29/2016
Introduction
• Large data sets
• Nonnegative Matrix Factorization (NMF)
 Insights on the hidden relationships
 Arrange multi-way data into a matrix
• Higher memory and CPU demands
• Linear relationships in the matrix representation
• Failure to capture important structure information
• Slower or less accurate calculations
Introduction (cont'd)
• Nonnegative Tensor Factorizations (NTF)
 Natural way for high dimensionality
 Original multi-way structure of the data
 Image processing, text mining
Tensor Toolbox For MATLAB
• Sandia National Laboratories
• Licenses
• Proprietary Software
Motivation of the PILOT
• Python Software for NTF
• Alternative to Tensor Toolbox for MATLAB
• Incorporation into FutureLens
• Exposure to NTF
• Interest in the open source community
Tensors
• Multi-way array
• Order/Mode/Ways
• High-order
• Fiber
• Slice
• Unfolding
 Matricization or flattening
 Reordering the elements of an N-th order tensor into a matrix
 Not unique
Tensors (cont’d)
• Kronecker Product

  A ⊗ B = [ a11B  a12B  …  a1JB
            a21B  a22B  …  a2JB
              ⋮     ⋮          ⋮
            aI1B  aI2B  …  aIJB ]

• Khatri-Rao Product

  A ⊙ B = [ a1⊗b1  a2⊗b2  …  aJ⊗bJ ]
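The Khatri-Rao product above is just the column-wise Kronecker product. A minimal Numpy sketch (the function name `khatri_rao` is illustrative, not part of the PILOT's API):

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker (Khatri-Rao) product of A (I x J) and B (K x J)."""
    I, J = A.shape
    K, J2 = B.shape
    assert J == J2, "A and B must have the same number of columns"
    # Column j of the result is the Kronecker product of column j of A
    # with column j of B; einsum builds all columns at once.
    return np.einsum('ij,kj->ikj', A, B).reshape(I * K, J)

A = np.array([[1., 2.], [3., 4.]])
B = np.array([[0., 1.], [1., 0.]])
print(khatri_rao(A, B).shape)   # (4, 2): column-wise, J columns kept
print(np.kron(A, B).shape)      # (4, 4): full Kronecker product
```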
Tensor Factorization
• Introduced by Hitchcock in 1927 and later developed
  by Cattell in 1944 and Tucker in 1966
• Rewrites a given tensor as a finite sum of lower-rank
  tensors
• Tucker and PARAFAC models
• Determining the approximation rank is a hard problem
PARAFAC
• Parallel Factor Analysis
• Canonical Decomposition (CANDECOMP)
• Harshman, 1970; Carroll and Chang, 1970
PARAFAC (cont’d)
• Given a three-way tensor X and an approximation rank R,
  we define the factor matrices A, B, C as the combination
  of the vectors from the rank-one components:

                     R
  X ≈ ⟦A, B, C⟧  =   Σ   ar ∘ br ∘ cr
                    r=1
PARAFAC (cont’d)
• Alternating Least Square (ALS)
• ALS cycles “over all the factor matrices and performs a
  least-square update for one factor matrix while holding all
  the others constant.”[7]
• NTF can be considered an extension of the
PARAFAC model with the constraint of
nonnegativity
Python
• Object-oriented, Interpreted
• Runs on all systems
• Gentle learning curve
• Supports object methods (everything is an object
in Python)
Python (cont’d)
• Recent interest in the scientific community
• Several scientific computing packages
 Numpy
 Scipy
• Python is extensible
Data Structures
• Dictionary
 Store the tensor data
 Mutable type of container that can store any number of
Python objects
 Pairs of keys and their corresponding values
• Suitable for the sparseness of our tensors
• VAST 2007 contest data: 1,385,205,184 elements, of which
  1,184,139 are nonzero
• Stores only the nonzero elements and keeps track of the zeros by
  using the default value of the dictionary
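A minimal sketch of the dictionary-backed idea (assumed design mirroring SPTENSOR; the class name and methods here are illustrative): keys are index tuples, values are the nonzero entries, and any absent key implicitly holds zero.

```python
class SparseTensor:
    """Dictionary-of-nonzeros sparse tensor sketch."""
    def __init__(self, shape):
        self.shape = shape
        self.data = {}                     # {(i, j, k): value}, nonzeros only

    def __setitem__(self, idx, value):
        if value != 0:
            self.data[idx] = value
        else:
            self.data.pop(idx, None)       # never store explicit zeros

    def __getitem__(self, idx):
        return self.data.get(idx, 0)       # zeros come from the default

T = SparseTensor((1000, 1000, 1000))       # a billion cells, no dense storage
T[3, 5, 7] = 2.5
print(T[3, 5, 7], T[0, 0, 0], len(T.data))  # 2.5 0 1
```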
Data Structures (cont’d)
• Numpy Arrays
 Fundamental package for scientific computing in Python
 Khatri-Rao products and tensor multiplications
 Speed
Modules
Modules (cont’d)
• SPTENSOR
 Most important module
 Class (subscripts of the nonzero entries, values)
• Flexibility (Numpy arrays, Numpy matrices, Python lists)
 Dictionary
 Keeps a few instance variables
• Size
• Number of dimensions
• Frobenius norm (Euclidean norm)
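With a dict of nonzeros, the Frobenius norm reduces to the Euclidean norm of the stored values, since the implicit zeros contribute nothing. A one-line sketch (variable names are illustrative):

```python
import math

# Nonzeros of a hypothetical 3-way sparse tensor
nonzeros = {(0, 1, 2): 3.0, (4, 0, 1): 4.0}

# Frobenius norm: sqrt of the sum of squares of all entries;
# only the stored nonzeros matter.
frobenius_norm = math.sqrt(sum(v * v for v in nonzeros.values()))
print(frobenius_norm)  # 5.0
```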
Modules (cont’d)
• PARAFAC
 Coordinates the NTF
 Implementation of ALS
 Runs until convergence or the maximum number of iterations
 Factor matrices are turned into a Kruskal tensor
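The ALS loop can be sketched as follows for a small dense 3-way array (the PILOT itself works on sparse tensors; this is an illustrative skeleton, not the module's actual code, and the nonnegativity is enforced here by simple clipping):

```python
import numpy as np

def als_parafac(X, R, max_iter=100, tol=1e-6):
    """Sketch of nonnegative PARAFAC via ALS on a dense 3-way array X."""
    rng = np.random.default_rng(0)
    factors = [rng.random((n, R)) for n in X.shape]
    fit_old = 0.0
    for _ in range(max_iter):
        for n in range(3):
            # Khatri-Rao product of the two factor matrices not being updated
            others = [factors[m] for m in range(3) if m != n]
            kr = np.einsum('ir,jr->ijr', others[0], others[1]).reshape(-1, R)
            # Matricize X along mode n and solve the least-squares update
            Xn = np.moveaxis(X, n, 0).reshape(X.shape[n], -1)
            A = np.linalg.lstsq(kr, Xn.T, rcond=None)[0].T
            factors[n] = np.maximum(A, 1e-12)    # clip to stay nonnegative
        # Fit = 1 - relative residual norm; stop when it stalls
        approx = np.einsum('ir,jr,kr->ijk', *factors)
        fit = 1 - np.linalg.norm(X - approx) / np.linalg.norm(X)
        if abs(fit - fit_old) < tol:
            break
        fit_old = fit
    return factors
```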
Modules (cont’d)
• INNERPROD
 Inner product between SPTENSOR and KTENSOR
 PARAFAC to compute the residual norm
 Kronecker product for matrices
• TTV
 Multiplies a sparse tensor by a (column) vector along a given mode
 Returns a tensor
• Workhorse of our software package
• Performs most of the computation
• Called by the MTTKRP and INNERPROD modules
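The TTV operation on a dict-of-nonzeros tensor can be sketched as follows (an assumed illustration of the idea, not the module's actual code): contracting a mode with a vector drops that index and accumulates the weighted entries.

```python
from collections import defaultdict

def ttv(nonzeros, v, mode):
    """Tensor-times-vector: contract `mode` of a sparse tensor
    (dict of {index tuple: value}) with vector v."""
    out = defaultdict(float)
    for idx, val in nonzeros.items():
        rest = idx[:mode] + idx[mode + 1:]   # indices of the remaining modes
        out[rest] += val * v[idx[mode]]      # weight entry by v at this mode
    return dict(out)

X = {(0, 0, 1): 2.0, (1, 0, 1): 3.0}
print(ttv(X, [1.0, 10.0], 0))  # {(0, 1): 32.0}
```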
Modules (cont’d)
• MTTKRP
 Khatri-Rao product of all factor matrices except the one
  being updated
 Matrix multiplication of the matricized tensor with the KR
  product obtained above
• Ktensor
 Kruskal tensor
 Object returned after the factorization is done and the factor
  matrices are normalized
 Class
• Instance variables such as the norm
• The norm of the ktensor plays a big part in determining the
  residual norm in the PARAFAC module
Performance
• Python Profiler
 Run time performance
 Tool for detecting bottlenecks
 Code optimization
• negligible improvement
• efficiency loss in some modules
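A sketch of how the Python profiler produces tables like those on the next slides (the `workload` function is a stand-in for the NTF code):

```python
import cProfile
import io
import pstats

def workload():
    # Stand-in for the code being profiled
    return sum(i * i for i in range(100000))

pr = cProfile.Profile()
pr.enable()
workload()
pr.disable()

# Rank functions by total time, like the ncalls/tottime/cumtime tables below
buf = io.StringIO()
pstats.Stats(pr, stream=buf).sort_stats('tottime').print_stats(10)
report = buf.getvalue()
print('workload' in report)  # True
```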
Performance (cont’d)
• Lists and Recursions

ncalls       tottime    percall   cumtime    percall   function
2803         3605.732   1.286     3605.732   1.286     tolist
1400         1780.689   1.272     2439.986   1.743     return_unique
9635         1538.597   0.160     1538.597   0.160     array
814018498    651.952    0.000     651.952    0.000     'get' of 'dict'
400          101.308    0.072     140.606    0.100     setup_size
1575/700     81.705     0.052     7827.373   11.182    ttv
2129         39.287     0.018     39.287     0.018     max
Performance (cont’d)
• Numpy Arrays

ncalls        tottime     percall   cumtime     percall   function
1800          15571.118   8.651     16798.194   9.332     return_unique
12387         2306.950    0.186     2306.950    0.186     array
1046595156    1191.479    0.000     1191.479    0.000     'get' of 'dict'
1800          1015.757    0.564     1086.062    0.603     setup_size
2025/900      358.778     0.177     20589.563   22.877    ttv
2734          69.638      0.025     69.638      0.025     max
Performance (cont’d)
• After removing Recursions

ncalls    tottime   percall   cumtime   percall   function
75        134.939   1.799     135.569   1.808     myaccumarray
75        7.802     0.104     8.148     0.109     setup_dic
100       5.463     0.055     151.402   1.514     ttv
409       2.043     0.005     2.043     0.005     array
1         1.034     1.034     1.034     1.034     get_norm
962709    0.608     0.000     0.608     0.000     append
479975    0.347     0.000     0.347     0.000     item
3         0.170     0.057     150.071   50.024    mttkrp
25        0.122     0.005     0.122     0.005     sum (Numpy)
87        0.083     0.001     0.083     0.001     dot
Floating-Point Arithmetic
• Binary floating-point
 “Binary floating-point cannot exactly represent decimal
fractions, so if binary floating-point is used it is not possible
to guarantee that results will be the same as those using
decimal arithmetic.”[12]
 Makes the iterations volatile
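The quoted point is easy to demonstrate with Python's own `decimal` module:

```python
from decimal import Decimal

# Binary floats cannot represent 0.1, 0.2, or 0.3 exactly, so the
# comparison fails; decimal arithmetic represents them exactly.
print(0.1 + 0.2 == 0.3)                                   # False
print(Decimal('0.1') + Decimal('0.2') == Decimal('0.3'))  # True
```

Near the convergence tolerance, this kind of representation error is what makes the iteration-to-iteration fit comparisons volatile.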
Convergence Issues
TOL ← tolerance on the difference in fit
P ← KTENSOR (current approximation)
X ← SPTENSOR (input tensor)

RN = sqrt(normX^2 + normP^2 − 2 * INNERPROD(X, P))
Fit_i = 1 − (RN / normX)
ΔFit = |Fit_i − Fit_(i−1)|

if (ΔFit < TOL)
    STOP
else
    continue
Conclusion
• There is still work to do after NTF
 Preprocessing of the data
 Post-processing of the results, such as with FutureLens
• Expertise
• Extract and identify hidden components
• Tucker implementation
• C extensions to increase speed
• GUI
Acknowledgments
• Mr. Andrey Puretskiy
 Discussions at all stages of the PILOT
 Consultancy in text mining
 Testing
• Tensor Toolbox for MATLAB (Bader and Kolda)
 Understanding of tensor decomposition
 PARAFAC
References
1) http://csmr.ca.sandia.gov/~tgkolda/TensorToolbox/
2) Tamara G. Kolda, Brett W. Bader, “Tensor Decompositions and Applications”, SIAM Review, June 10, 2008.
3) Andrzej Cichocki, Rafal Zdunek, Anh Huy Phan, Shun-ichi Amari, “Nonnegative Matrix and Tensor Factorizations”, John Wiley & Sons, Ltd, 2009.
4) http://docs.python.org/library/profile.html
5) http://www.mathworks.com/access/helpdesk/help/techdoc
6) http://www.scipy.org/NumPy_for_Matlab_Users
7) Brett W. Bader, Andrey A. Puretskiy, Michael W. Berry, “Scenario Discovery Using Nonnegative Tensor Factorization”, J. Ruiz-Schulcloper and W.G. Kropatsch (Eds.): CIARP 2008, LNCS 5197, pp. 791-805, 2008.
8) http://docs.scipy.org/doc/numpy/user/
9) http://docs.scipy.org/doc/
10) http://docs.scipy.org/doc/numpy/user/whatisnumpy.html
11) Tamara G. Kolda, “Multilinear operators for higher-order decompositions”, SANDIA REPORT, April 2006.
12) http://speleotrove.com/decimal/decifaq1.html#inexact
QUESTIONS?