Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence

Personalized Tag Recommendation through Nonlinear Tensor Factorization Using Gaussian Kernel

Xiaomin Fang and Rong Pan*
Department of Computer Science, Sun Yat-sen University, Guangzhou, China
{fangxmin@mail2, panr@}.sysu.edu.cn

Guoxiang Cao, Xiuqiang He and Wenyuan Dai
Huawei Technologies Co., Ltd., Bantian, Longgang District, Shenzhen, China
{benjamin.cao, hexiuqiang, dai.wenyuan}@huawei.com

* Corresponding author. Copyright © 2015, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Personalized tag recommendation systems recommend a list of tags to a user when he is about to annotate an item, exploiting both the user's individual preferences and the characteristics of the item. Tensor factorization techniques have been applied to many applications, including tag recommendation. Models based on Tucker Decomposition can achieve good performance but require a lot of computation power. On the other hand, models based on Canonical Decomposition run in linear time and are more feasible for online recommendation. In this paper, we propose a novel method for personalized tag recommendation that can be considered a nonlinear extension of Canonical Decomposition. Different from linear tensor factorization, we exploit the Gaussian radial basis function to increase the model's capacity. The experimental results show that our proposed method outperforms the state-of-the-art methods for tag recommendation on real datasets and performs well even with a small number of features, which verifies that our model can make better use of the features.

Introduction

Nowadays, a great amount of information springs up on the Internet, and users struggle to find the items they are really interested in. To tackle this problem, recommendation systems are designed to recommend appropriate items to users according to their habits. Collaborative tagging systems like Delicious, Last.fm and Movielens allow users to upload resources and annotate them. The activity of annotating items, such as movies, songs and pictures, with keywords is called tagging.
A tag can be viewed as an implicit rating and can be used to identify not only the features of the items but also the users' personalities. Tag recommendation is the task of suggesting tags to a user when he is about to annotate an item. Given an item, different users may use different tags to annotate it. Personalized tag recommender systems utilize users' past tagging behaviors to predict their future behaviors. For example, if two users annotated an item with the same tag before, it is likely that they will annotate another item with the same tag in the future.

Some tag recommender systems rank tags by means of tensor factorization techniques. Tensor factorization is a high-order extension of matrix factorization. Tensor factorization-based models decompose the user-item-tag tensor into three matrices to learn the latent features of the users, items and tags. Higher Order Singular Value Decomposition (HOSVD) (Symeonidis, Nanopoulos, and Manolopoulos 2008), Multiverse (Karatzoglou et al. 2010) and RTF (Rendle et al. 2009) are based on Tucker Decomposition (TD), which improves on the state-of-the-art recommenders in prediction quality but requires too much computation power. In contrast, pairwise interaction tensor factorization (PITF) (Rendle and Schmidt-Thieme 2010) is based on another factorization approach, Canonical Decomposition (CD), and is linear in both the dataset size and the feature dimensionality. In this paper, we present a novel approach that builds on Canonical Decomposition and can be considered a nonlinear generalization of CD. We use the Gaussian radial basis function to capture the complex relations between the users, items and tags. We analyze why exploiting the Gaussian function makes better use of the latent features, and we relate our model to the traditional linear tensor factorization techniques. Our experimental results show that our proposed model outperforms other competitive methods and achieves good performance with only a small number of latent features. The contributions of our work are summarized as follows.

• We propose a novel nonlinear approach based on Canonical Decomposition, which employs the Gaussian radial basis function. Its time performance is comparable to PITF, and it runs in linear time.
• We compare our nonlinear approach with the traditional linear methods and analyze why the Gaussian-based method can make better use of the latent features.
• The experimental results demonstrate that our method outperforms other methods on real datasets and performs well even with only a small number of latent features.

Related Work

Tag Recommendation Techniques

There is a lot of previous work on tag recommendation. Krestel, Fankhauser, and Nejdl proposed an approach based on Latent Dirichlet Allocation (LDA): given a document, the top terms composing its latent topics are recommended to the users. In this way, terms that belong to the same topics but appear in other documents also have the opportunity to be presented. Lipczak et al. employed a probabilistic latent topic model for tag recommendation as well; they added a tag variable to LDA so as to link the tags to the topics of the document. Liu et al. ranked tags for images by using random walks. However, the systems mentioned above are not personalized, and different users may prefer different tags for the same item. Guan et al. ranked tags with a graph-based ranking algorithm that treats personalized tag recommendation as a "query and ranking" problem: the query consists of two parts, the user and the item, and the returned tags are related to both. Besides, Durao and Dolog introduced a hybrid approach that makes use of several factors, including tag similarity, tag popularity, tag representativeness and the affinity between user and tag, to provide appropriate tags.

Tensor Factorization Techniques

Tucker Decomposition and Canonical Decomposition are high-order extensions of Singular Value Decomposition (SVD). Nickel, Tresp, and Kriegel applied tensor factorization to collective entity resolution. Some studies (Dunlavy, Kolda, and Acar 2011; Ermiş, Acar, and Cemgil 2012; Spiegel et al. 2012) proposed tensor-based models to solve the problem of link prediction by incorporating time information. Moreover, tensor factorization has been applied to personalized recommendation, including personalized web search (Sun et al. 2005), hyperlink recommendation (Gao et al. 2012) and personalized tag recommendation (Rendle and Schmidt-Thieme 2010; Symeonidis, Nanopoulos, and Manolopoulos 2008; 2010; Rendle et al. 2009; Chi and Zhu 2010). With respect to personalized tag recommendation, Higher Order Singular Value Decomposition (HOSVD) (Symeonidis, Nanopoulos, and Manolopoulos 2008) and RTF (Rendle et al. 2009) are based on Tucker Decomposition (TD), while PITF (Rendle and Schmidt-Thieme 2010) is based on Canonical Decomposition (CD). HOSVD cannot deal with missing values and fills the unknown entries with 0; RTF handles the missing values by adding pairwise ranking constraints. Although TD-based methods outperform other state-of-the-art tag recommendation approaches, they require a lot of computation power and are therefore not feasible for online recommendation. Compared with TD-based methods, CD-based methods have a huge advantage in running time, because they can be trained in linear time. PITF, an extension of CD, splits the ternary relation of users, items and tags into two relations, user-tag and item-tag, and its quality is comparable to that of RTF.
Notations and Preliminaries

Tensor Factorization

A tensor is a multidimensional or multi-way array, and the order of a tensor is the number of its dimensions. Tucker Decomposition (TD) (Tucker 1966) and Canonical Decomposition (CD) (Carroll and Chang 1970; Harshman 1970) can be seen as higher-order extensions of the matrix factorization technique Singular Value Decomposition (SVD).

Tucker Decomposition decomposes a tensor into a core tensor multiplied by a matrix along each mode. Take a three-order tensor $A \in \mathbb{R}^{I \times J \times K}$ for example; it can be written as

$$A_{i,j,k} = \sum_{r_1=1}^{R_1} \sum_{r_2=1}^{R_2} \sum_{r_3=1}^{R_3} C_{r_1,r_2,r_3} \cdot X_{i,r_1} \cdot Y_{j,r_2} \cdot Z_{k,r_3}, \quad (1)$$

where $X \in \mathbb{R}^{I \times R_1}$, $Y \in \mathbb{R}^{J \times R_2}$, $Z \in \mathbb{R}^{K \times R_3}$, and $C \in \mathbb{R}^{R_1 \times R_2 \times R_3}$ is a core tensor. $R_1$, $R_2$ and $R_3$ are the numbers of latent features.

Canonical Decomposition decomposes a tensor into a sum of component rank-one tensors. Let $A \in \mathbb{R}^{I \times J \times K}$ be a tensor; it can be written as

$$A_{i,j,k} = \sum_{r=1}^{R} X_{i,r} \cdot Y_{j,r} \cdot Z_{k,r}, \quad (2)$$

where $X \in \mathbb{R}^{I \times R}$, $Y \in \mathbb{R}^{J \times R}$, $Z \in \mathbb{R}^{K \times R}$, and $R$ is the number of latent features.

$X$, $Y$ and $Z$ in Eqs. (1) and (2) are in analogy with the factor matrices in SVD, and the values in the core tensor $C$ are in analogy with the eigenvalues. Besides, if we compare the running time of TD in Eq. (1) with that of CD in Eq. (2), we find that CD is linear in the feature dimensionality, whereas the time complexity of TD is much higher.
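To make the complexity gap concrete, here is a small sketch (illustrative only, not code from the paper) that reconstructs a single entry $A_{i,j,k}$ under both decompositions, mirroring Eqs. (1) and (2). The names and toy sizes below are assumptions for the demo: the Tucker reconstruction loops over the full core, costing $O(R_1 R_2 R_3)$ per entry, while the CD reconstruction is a single $O(R)$ sum.

```python
import numpy as np

def td_entry(C, X, Y, Z, i, j, k):
    """Tucker Decomposition, Eq. (1): triple sum over the core, O(R1*R2*R3)."""
    R1, R2, R3 = C.shape
    return sum(C[r1, r2, r3] * X[i, r1] * Y[j, r2] * Z[k, r3]
               for r1 in range(R1) for r2 in range(R2) for r3 in range(R3))

def cd_entry(X, Y, Z, i, j, k):
    """Canonical Decomposition, Eq. (2): one sum over the shared rank, O(R)."""
    return float(np.sum(X[i] * Y[j] * Z[k]))

rng = np.random.default_rng(0)
I_, J_, K_, R = 4, 5, 6, 3                       # tiny toy tensor: 4 x 5 x 6, rank 3
C = rng.normal(size=(R, R, R))                   # TD core tensor
X, Y, Z = (rng.normal(size=(n, R)) for n in (I_, J_, K_))
print(td_entry(C, X, Y, Z, 0, 1, 2))             # cost is cubic in the rank
print(cd_entry(X, Y, Z, 0, 1, 2))                # cost is linear in the rank
```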
Personalized Tag Recommendation

Tagging is a process in which a user associates an item with keywords, which are called tags. We use notation similar to that of (Rendle et al. 2009; Rendle and Schmidt-Thieme 2010) for the formulation of personalized tag recommendation. Let $U$ be the set of users, $I$ the set of items, and $T$ the set of tags. The event that user $u$ annotates item $i$ with tag $t$ is symbolized by a triple $(u, i, t)$. Let $S \subseteq U \times I \times T$ denote the tagging history: if a triple $(u, i, t) \in S$, a particular user $u$ has annotated a specific item $i$ with tag $t$. The relation between the users, items and tags for personalized tag recommendation can be represented by a three-order tensor, as depicted in Figure 1.

[Figure 1: Tensor construction and post-based ranking interpretation for personalized tag recommendation. The figure shows the slice of the user-item-tag tensor for one user u, with items as rows and tags as columns; annotated entries are marked and unobserved rows are missing.]

Rendle et al. introduced the post-based ranking interpretation scheme for personalized tag recommendation, and we employ that scheme in this paper. Let $P_S$ represent the set of all user-item pairs $(u, i)$ in $S$:

$$P_S = \{(u, i) \mid \exists t : (u, i, t) \in S\}.$$

The post-based scheme generates positive and negative examples only from the observed posts $(u, i) \in P_S$. Given a post $(u, i) \in P_S$, all of the triples $(u, i, t) \in S$ are taken as positive examples, and the triples $(u, i, t) \notin S$ are taken as negative examples. If $(u, i) \notin P_S$, we do not know that user's preference for that item; thus, all the entries for the unobserved posts are treated as missing values.

We formulate the relevance of $(u, i, t)$ with $f(u, i, t)$. Given a post $(u, i)$, if user $u$ has annotated item $i$ with tag $t^{pos}$ rather than tag $t^{neg}$, it implies that user $u$ prefers tag $t^{pos}$ to tag $t^{neg}$ for that post. In other words, $f(u, i, t^{pos}) > f(u, i, t^{neg})$ iff $(u, i, t^{pos}) \in S$ and $(u, i, t^{neg}) \notin S$. Thus, in this paper, we use the notation $t^{pos}$ for "positive tags" and $t^{neg}$ for "negative tags". As a result, the set of pairwise ranking constraints $D_S$ derived from $S$ is defined as

$$D_S = \{(u, i, t^{pos}, t^{neg}) \mid (u, i, t^{pos}) \in S \wedge (u, i, t^{neg}) \notin S\}.$$

The elements in $D_S$ are used as training examples for tag recommendation. Figure 1 shows the tensor construction and a toy example of the post-based ranking interpretation. User $u$ has labeled item 4 with tag 1 and tag 3; thus, we assume user $u$ prefers tag 1 and tag 3 to tag 2 and tag 4. Besides, since user $u$ has not labeled item 1 and item 3, the values of all the entries in the first row and the third row are missing.
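As a concrete illustration of the post-based scheme, the following sketch (hypothetical toy data, not the authors' code) derives $P_S$ and the training quadruples $D_S$ from a tiny tagging history $S$.

```python
# Toy tagging history S: user u1 annotated item i4 with tags t1 and t3.
S = {("u1", "i4", "t1"), ("u1", "i4", "t3"), ("u2", "i2", "t2")}
all_tags = {"t1", "t2", "t3", "t4"}

# PS = {(u, i) | there exists t with (u, i, t) in S}: the observed posts.
PS = {(u, i) for (u, i, _) in S}

# DS: for each observed post, pair every positive tag with every negative tag.
# Unobserved posts contribute nothing: their tensor entries are missing values.
DS = set()
for (u, i) in PS:
    pos = {t for (u2, i2, t) in S if (u2, i2) == (u, i)}
    DS |= {(u, i, t_pos, t_neg) for t_pos in pos for t_neg in all_tags - pos}

print(sorted(DS))
# [('u1', 'i4', 't1', 't2'), ('u1', 'i4', 't1', 't4'),
#  ('u1', 'i4', 't3', 't2'), ('u1', 'i4', 't3', 't4'), ...]
```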
Nonlinear Tensor Factorization Using Gaussian Kernel for Tag Recommendation

Nonlinear Tensor Factorization Using Gaussian

In this paper, we introduce a nonlinear way to decompose tensors, which we refer to as Nonlinear Tensor Factorization (NLTF). The ternary relation between users, items and tags is represented by a three-order tensor with $|U|$ users, $|I|$ items and $|T|$ tags. The tensor can be decomposed into three components: a user matrix $U \in \mathbb{R}^{|U| \times R}$, an item matrix $I \in \mathbb{R}^{|I| \times R}$ and a tag matrix $T \in \mathbb{R}^{|T| \times R}$, where $R$ is the number of features. The entries in the $i$-th rows of matrices $U$, $I$ and $T$ indicate the latent features of the $i$-th user, $i$-th item and $i$-th tag, respectively. Each column of $U$, $I$ and $T$ can be considered a latent topic.

Analogous to previous work (Rendle and Schmidt-Thieme 2010), we model the three pairwise relations user-item, user-tag and item-tag rather than directly modeling the ternary relation user-item-tag. As mentioned above, an entry in a feature vector stands for a latent topic, and without loss of generality we take the item-tag relation as an example. We first assume that the latent topics are independent of each other. For example, if the items are movies, the first latent topic might be comedy and the second might be action. If tag $t$ is intimately related to item $i$ with respect to the first topic, the value of the first latent feature of tag $t$ should be close to the value of the first latent feature of item $i$. Accordingly, we assume that if item $i$ and tag $t$ are relevant with respect to the $k$-th topic, then $T_{t,k}$ obeys the Gaussian distribution $N(I_{i,k}, \sigma^2)$ and vice versa, where $T_{t,k}$ represents the $k$-th feature of tag $t$ and $I_{i,k}$ represents the $k$-th feature of item $i$. In other words, the distance between the $k$-th feature of tag $t$ and the $k$-th feature of item $i$ obeys a Gaussian distribution, and the smaller the distance is, the more relevant item $i$ and tag $t$ are. It can be written as

$$\|I_{i,k} - T_{t,k}\| \sim N(0, \sigma^2),$$

where $\sigma$ is the standard deviation of the Gaussian distribution. The user-tag and user-item relations are treated in analogy to the item-tag relation. However, the distributions of the latent topics of the user-item, user-tag and item-tag relationships might differ; thus, we define two sets of features for each of the users, items and tags, respectively, and we have

$$\|U^I_{u,k} - I^U_{i,k}\| \sim N(0, \sigma^2), \quad \|U^T_{u,k} - T^U_{t,k}\| \sim N(0, \sigma^2), \quad \|I^T_{i,k} - T^I_{t,k}\| \sim N(0, \sigma^2),$$

where $U^T, U^I \in \mathbb{R}^{|U| \times R}$, $I^U, I^T \in \mathbb{R}^{|I| \times R}$, and $T^U, T^I \in \mathbb{R}^{|T| \times R}$. With these assumptions, given a particular triple $(u, i, t)$, we define its relevance score $f(u, i, t)$ as

$$f(u, i, t) = \sum_k \zeta^{UI}_k \cdot \exp\big(-\tfrac{1}{2}\beta \|U^I_{u,k} - I^U_{i,k}\|^2\big) + \sum_k \zeta^{UT}_k \cdot \exp\big(-\tfrac{1}{2}\beta \|U^T_{u,k} - T^U_{t,k}\|^2\big) + \sum_k \zeta^{IT}_k \cdot \exp\big(-\tfrac{1}{2}\beta \|I^T_{i,k} - T^I_{t,k}\|^2\big), \quad (3)$$

where $\beta = 1/\sigma^2$. $\zeta^{UI}_k$ is the weight, or the importance, of the $k$-th topic for the user-item relation; similarly, $\zeta^{UT}_k$ and $\zeta^{IT}_k$ are the importance of the $k$-th topic for the user-tag and item-tag relations, respectively.

In fact, the NLTF model can be represented by a neural network with a specific activation function and a three-layer structure, as depicted in Figure 2. The network contains three layers: an input layer, a hidden layer and an output layer. The input layer is composed of six blocks and the hidden layer of three blocks; the first two blocks of the input layer and the first block of the hidden layer correspond to the user-item relation, and the user-tag and item-tag relations are represented analogously. For the input representation, a one-of-n encoding scheme is used: given a triple $(u, i, t)$, the $u$-th unit in the first block is set to 1 and all the other units in that block are set to 0, and the settings for the other input blocks are similar. For the hidden layer, we use $\exp(-\tfrac{1}{2}\beta \|p - q\|^2)$ as the activation function, where $p$ and $q$ are the weights of the nonzero input units. Taking the user-item relation as an example, $p$ and $q$ correspond to $U^I_{u,k}$ and $I^U_{i,k}$ in Eq. (3), respectively, and $\zeta^{UI}_k$ in Eq. (3) corresponds to the weight of the $k$-th unit in the first block of the hidden layer.

[Figure 2: Neural network representation for NLTF: one output unit, a hidden layer of three blocks (user-item, user-tag, item-tag), and an input layer of six blocks (user, item; user, tag; item, tag).]

For a single training example $(u, i, t^{pos}, t^{neg}) \in D_S$, we define the objective function for personalized tag recommendation with respect to that single example as

$$J(u, i, t^{pos}, t^{neg}) = \mathrm{sigmoid}\big(f(u, i, t^{pos}) - f(u, i, t^{neg})\big), \quad (4)$$

where $\mathrm{sigmoid}(x) = \frac{1}{1 + \exp(-x)}$. The sigmoid function is monotonically increasing. If a particular user $u$ has annotated item $i$ with tag $t^{pos}$ but not with tag $t^{neg}$, it indicates that he prefers tag $t^{pos}$ to tag $t^{neg}$; the difference between the relevance scores of $(u, i, t^{pos})$ and $(u, i, t^{neg})$ should therefore be large, and we use the sigmoid function to model this. Given a training set $D_S$, the overall objective function is defined as

$$J(D_S) = \sum_{(u, i, t^{pos}, t^{neg}) \in D_S} J(u, i, t^{pos}, t^{neg}). \quad (5)$$

Parameter estimation for the tag recommendation task can then be formulated as the optimization problem of maximizing $J(D_S)$. Note that, given an example $(u, i, t^{pos}, t^{neg})$, the first component of $f(u, i, t)$ is eliminated when calculating $f(u, i, t^{pos}) - f(u, i, t^{neg})$ in Eq. (4); thus, we drop it from $f(u, i, t)$. Besides, when maximizing the objective function, the scale of $\zeta^{UI}_k$, $\zeta^{UT}_k$ and $\zeta^{IT}_k$ grows large, which leads to overfitting. To address this issue, we simply set the values of $\zeta^{UI}_k$, $\zeta^{UT}_k$ and $\zeta^{IT}_k$ to 1. Therefore, given a triple $(u, i, t)$, the relevance score $f(u, i, t)$ is simplified as

$$f(u, i, t) = \sum_k \exp\big(-\tfrac{1}{2}\beta \|U^T_{u,k} - T^U_{t,k}\|^2\big) + \sum_k \exp\big(-\tfrac{1}{2}\beta \|I^T_{i,k} - T^I_{t,k}\|^2\big). \quad (6)$$
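A minimal sketch of the simplified relevance score in Eq. (6) (illustrative, not the authors' code): `UT`, `TU`, `IT`, `TI` stand for the feature matrices $U^T$, $T^U$, $I^T$, $T^I$, the random initialization is only for the demo, and the user-item term of Eq. (3) is absent, matching the simplification above.

```python
import numpy as np

def relevance(UT, TU, IT, TI, u, i, t, beta=1.0):
    """f(u, i, t), Eq. (6): per-topic Gaussian RBF responses, summed over k."""
    user_tag = np.exp(-0.5 * beta * (UT[u] - TU[t]) ** 2).sum()
    item_tag = np.exp(-0.5 * beta * (IT[i] - TI[t]) ** 2).sum()
    return user_tag + item_tag

rng = np.random.default_rng(0)
R, n_users, n_items, n_tags = 8, 3, 4, 5
UT = rng.normal(size=(n_users, R)); TU = rng.normal(size=(n_tags, R))
IT = rng.normal(size=(n_items, R)); TI = rng.normal(size=(n_tags, R))

# Rank all tags for the post (u=0, i=1) by their relevance scores.
scores = {t: relevance(UT, TU, IT, TI, 0, 1, t) for t in range(n_tags)}
print(sorted(scores, key=scores.get, reverse=True))
```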
Parameters Learning

We use stochastic gradient descent (SGD), a widely used technique, to optimize the objective function $J(D_S)$. We randomly select a training example $(u, i, t^{pos}, t^{neg})$ from $D_S$ and perform an update for that example, and this process repeats until convergence. The gradients for the nonlinear tensor factorization model are

$$\frac{\partial f_{u,i,t}}{\partial U^T_{u,k}} = -\beta \cdot \exp\big(-\tfrac{1}{2}\beta \|U^T_{u,k} - T^U_{t,k}\|^2\big)(U^T_{u,k} - T^U_{t,k}),$$
$$\frac{\partial f_{u,i,t}}{\partial I^T_{i,k}} = -\beta \cdot \exp\big(-\tfrac{1}{2}\beta \|I^T_{i,k} - T^I_{t,k}\|^2\big)(I^T_{i,k} - T^I_{t,k}),$$
$$\frac{\partial f_{u,i,t}}{\partial T^U_{t,k}} = \beta \cdot \exp\big(-\tfrac{1}{2}\beta \|U^T_{u,k} - T^U_{t,k}\|^2\big)(U^T_{u,k} - T^U_{t,k}),$$
$$\frac{\partial f_{u,i,t}}{\partial T^I_{t,k}} = \beta \cdot \exp\big(-\tfrac{1}{2}\beta \|I^T_{i,k} - T^I_{t,k}\|^2\big)(I^T_{i,k} - T^I_{t,k}).$$

With these gradients, we can update the parameters of our model: we first initialize all of the parameters and then learn them until convergence.
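One update step can be sketched as follows (our reading of Eqs. (4)-(6) and the gradients above, not the authors' code; the learning rate `lr` is an assumed hyper-parameter). The chain rule contributes a factor $\mathrm{sigmoid}'(\delta) = s(1-s)$, where $\delta = f(u, i, t^{pos}) - f(u, i, t^{neg})$, and since $J(D_S)$ is maximized we ascend the gradient.

```python
import numpy as np

def f(UT, TU, IT, TI, u, i, t, beta):
    """Simplified relevance score of Eq. (6)."""
    return (np.exp(-0.5 * beta * (UT[u] - TU[t]) ** 2).sum()
            + np.exp(-0.5 * beta * (IT[i] - TI[t]) ** 2).sum())

def sgd_step(UT, TU, IT, TI, example, beta=1.0, lr=0.05):
    """One gradient-ascent step on J = sigmoid(f(u,i,t_pos) - f(u,i,t_neg))."""
    u, i, t_pos, t_neg = example
    delta = (f(UT, TU, IT, TI, u, i, t_pos, beta)
             - f(UT, TU, IT, TI, u, i, t_neg, beta))
    s = 1.0 / (1.0 + np.exp(-delta))
    w = s * (1.0 - s)                        # sigmoid'(delta), from the chain rule
    updates = []
    for t, sign in ((t_pos, 1.0), (t_neg, -1.0)):
        d_ut = UT[u] - TU[t]                 # per-topic differences U^T_u - T^U_t
        d_it = IT[i] - TI[t]                 # per-topic differences I^T_i - T^I_t
        g_ut = -beta * np.exp(-0.5 * beta * d_ut ** 2) * d_ut   # df/dUT[u]
        g_it = -beta * np.exp(-0.5 * beta * d_it ** 2) * d_it   # df/dIT[i]
        updates.append((t, sign, g_ut, g_it))
    for t, sign, g_ut, g_it in updates:      # apply after all gradients are taken
        UT[u] += lr * w * sign * g_ut
        TU[t] += lr * w * sign * (-g_ut)     # df/dTU[t] = -df/dUT[u]
        IT[i] += lr * w * sign * g_it
        TI[t] += lr * w * sign * (-g_it)     # df/dTI[t] = -df/dIT[i]
```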
Comparison with Other Tensor Factorization Models

HOSVD (Symeonidis, Nanopoulos, and Manolopoulos 2008), RTF (Rendle et al. 2009) and PITF (Rendle and Schmidt-Thieme 2010) are models based on linear tensor factorization: HOSVD and RTF are TD-based, while PITF is CD-based. Our proposed model, NLTF, can be seen as a nonlinear extension of the CD-based models; we use the Gaussian radial basis function rather than the dot product used in PITF.

Assume the numbers of features of the users, items and tags are all the same, $R$, for the TD-based models. If we exploit stochastic gradient descent (SGD) to learn the parameters of all models, then given a training example, TD-based models take $O(R^3)$ to update the parameters, while CD-based models take $O(R)$. For prediction, given a post (user, item), TD-based models take $O(|T| \cdot R + R^3)$ to rank the tags, while CD-based models take $O(|T| \cdot R)$. Therefore, CD-based models outperform TD-based models in terms of running time and are more feasible for online tag recommendation.

On the other hand, our model, which exploits the Gaussian radial basis function to calculate similarity, is better than models exploiting the dot product: models employing the dot product can be viewed as two-class classifiers, while models employing the Gaussian radial basis function can be seen as multi-class classifiers. For example, suppose we want to provide a set of tags to user $u$ who has just watched movie $i$, and assume there is only one latent feature $I_{i,1}$. With the dot product, if the value of $I_{i,1}$ is positive, it may indicate that the movie is a comedy; otherwise, it is not. Thus, if the value of $I_{i,1}$ is positive and tag $t$ is related to comedies, the value of $T_{t,1}$ should also be positive, so as to make the dot product of $I_{i,1}$ and $T_{t,1}$ large and let tag $t$ rank high for the post $(u, i)$.

For Gaussian-based models, a toy example is depicted in Figure 3, where the horizontal axis represents the value of $I_{i,1}$. If $I_{i,1} \in (-3, 3)$, the movie is probably an action movie; if $I_{i,1} \in (-9, -5)$, it is probably a romance; if $I_{i,1} \in (5, 9)$, it is probably a drama; and if $I_{i,1} \in (-\infty, -20) \cup (20, \infty)$, it does not belong to any of the categories in Figure 3. For instance, suppose tag $t$ is "comedy" and movies $i_1$ and $i_2$ are both annotated with tag $t$. By our assumption, if an item and a tag are related, the distance between the feature of that item and the feature of that tag obeys a Gaussian distribution. That means the distance between the features of tag $t$ and movie $i_1$ is likely to be small, and likewise for tag $t$ and movie $i_2$; thus, the feature of $i_1$ is likely to be close to the feature of $i_2$. In other words, the features of movies that belong to the same category are likely to be close to each other and obey a Gaussian distribution. We can therefore distinguish several categories using only one feature, because different categories fall into different parts of the coordinate axis. In conclusion, Gaussian-based models can deal with more than two classes with only one feature, whereas the dot product can handle only two classes; thus, exploiting the Gaussian function on the radial basis distance makes better use of the latent features than the dot product. The experimental results on real datasets in the next section demonstrate this point.

[Figure 3: A toy example of nonlinear tensor factorization using the Gaussian radial basis function: probability curves for the categories comedy, romance, action, sci-fi and drama plotted against the value of $I_{i,1}$ over the range $[-20, 20]$.]
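The multi-class argument can be replayed numerically. In the sketch below (hypothetical tag-feature centers chosen to echo Figure 3, not data from the paper), a single feature axis with RBF scoring ranks the matching category first for three different movies, while dot-product scoring can never rank the middle category first.

```python
import numpy as np

beta = 0.5
centers = {"romance": -7.0, "action": 0.0, "drama": 7.0}  # assumed T_{t,1} values

def rbf_score(x, c):
    return np.exp(-0.5 * beta * (x - c) ** 2)   # Gaussian RBF similarity

def dot_score(x, c):
    return x * c                                # linear (dot-product) similarity

for x in (-7.0, 0.0, 7.0):                      # one movie feature per category
    rbf_rank = sorted(centers, key=lambda t: -rbf_score(x, centers[t]))
    dot_rank = sorted(centers, key=lambda t: -dot_score(x, centers[t]))
    print(f"I_i1 = {x:+.0f}: RBF ranks {rbf_rank}, dot product ranks {dot_rank}")
# The RBF ranking puts the matching category first in all three cases; the dot
# product never places "action" (center 0) on top, since x * 0 = 0 for every x.
```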
Experimental Results

Experimental Setup

We use four datasets to evaluate the performance of our proposed model: Delicious-small [1], Last.fm [2], Movielens [3] and Delicious-large [4] (Wetzker, Zimmermann, and Bauckhage 2008). The three smaller datasets, Delicious-small, Last.fm and Movielens, come from the 2nd International Workshop on Information Heterogeneity and Fusion in Recommender Systems (HetRec 2011). The larger dataset, Delicious-large, contains the complete bookmarking activity of almost 2 million users from the launch of the social bookmarking website in 2003 to the end of March 2011. For each dataset, we removed the infrequent users, the items occurring in fewer than 5 posts, and the tags occurring in fewer than 5 triples. The statistics of the datasets after removal are described in Table 2.

Table 2: Data statistics
Dataset         | Users | Items | Tags  | Posts  | Triples
Delicious-small | 1.8K  | 35.9K | 9.3K  | 66.7K  | 312.0K
Last.fm         | 1.6K  | 7.3K  | 2.3K  | 61.1K  | 164.5K
Movielens       | 0.7K  | 2.6K  | 1.6K  | 17.8K  | 30.8K
Delicious-large | 1.5M  | 7.5M  | 2.6M  | 198.9M | 675.0M

[1] http://www.delicious.com
[2] http://www.lastfm.com
[3] http://www.grouplens.org
[4] http://www.zubiaga.org/resources/socialbm0311

To avoid randomness in the results, we perform 10-fold cross-validation on the three smaller datasets. For the larger dataset, Delicious-large, we randomly sample about 40 thousand posts for testing, and the remaining data is used for training.

To evaluate our method, we introduce two baseline functions, $N_{UT}(u, t)$ and $N_{IT}(i, t)$. Given user $u$ and tag $t$, we define $N_{UT}(u, t)$ as

$$N_{UT}(u, t) = |\{i \mid (u, i, t) \in S\}|.$$

Besides, given item $i$ and tag $t$, we define $N_{IT}(i, t)$ as

$$N_{IT}(i, t) = |\{u \mid (u, i, t) \in S\}|.$$

We compare our proposed model, NLTF, with several baselines, including PITF, SVD, Top-UT, Top-IT and Top-UT+IT, where SVD only employs the two-dimensional item-tag relation, Top-UT ranks tags by $N_{UT}(u, t)$, Top-IT ranks tags by $N_{IT}(i, t)$, and Top-UT+IT ranks tags by $N_{UT}(u, t) + N_{IT}(i, t)$. Here, Top-UT represents the affinity between users and tags, Top-IT represents tag popularity, and Top-UT+IT can be seen as a simple ensemble of Top-UT and Top-IT. We do not compare NLTF with TD-based models like HOSVD and RTF, because their computation time is unacceptable in practice, especially for the larger dataset Delicious-large; besides, the experimental results in (Rendle and Schmidt-Thieme 2010) show that the performance of PITF is comparable to that of the TD-based models. For each dataset, we tuned the hyper-parameters of all models by grid search to achieve the best results.

Evaluation Methodology

We treat personalized tag recommendation as an information retrieval problem and employ Normalized Discounted Cumulative Gain (NDCG) as our metric. NDCG is widely used in information retrieval and makes the assumption that relevant tags are more valuable if they appear earlier in the recommendation list. Discounted Cumulative Gain (DCG) evaluates the gain of a tag based on its position in the ranked sequence returned by the model. The DCG accumulated at a particular position $p$ is defined as

$$DCG@p = rel(1) + \sum_{k=2}^{p} \frac{rel(k)}{\log_2(k)}, \quad (7)$$

where $rel(k)$ is 1 if the tag at position $k$ is relevant and 0 otherwise. In order to normalize the metric, the Ideal Discounted Cumulative Gain (IDCG) is introduced, which is the DCG of the ideal ranked sequence. Normalized Discounted Cumulative Gain is then defined as

$$NDCG@p = \frac{DCG@p}{IDCG@p}. \quad (8)$$
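For reference, here is a direct transcription of Eqs. (7) and (8) (an illustrative sketch, not the authors' evaluation code); `ranked` is the recommended tag list and `relevant` is the set of tags the user actually used for the post.

```python
import math

def dcg_at_p(ranked, relevant, p):
    """Eq. (7): rel(1) + sum_{k=2..p} rel(k) / log2(k)."""
    gain = 0.0
    for k, tag in enumerate(ranked[:p], start=1):
        rel = 1.0 if tag in relevant else 0.0
        gain += rel if k == 1 else rel / math.log2(k)
    return gain

def ndcg_at_p(ranked, relevant, p):
    """Eq. (8): DCG@p normalized by the DCG of the ideal ranking."""
    hits = min(len(relevant), p)   # the ideal list has all relevant tags first
    idcg = sum(1.0 if k == 1 else 1.0 / math.log2(k) for k in range(1, hits + 1))
    return dcg_at_p(ranked, relevant, p) / idcg if idcg > 0 else 0.0

print(ndcg_at_p(["t1", "t4", "t3"], {"t1", "t3"}, p=3))  # approx. 0.815
```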
Performance on Smaller Datasets

Comparisons with Baselines. Table 3 shows the performance of the different methods on the three smaller datasets, Delicious-small, Last.fm and Movielens. For NLTF and PITF the number of features is R = 64, and for SVD it is R = 256, because NLTF and PITF saturate at 64 features while SVD saturates at 256. It can be seen from Table 3 that NLTF outperforms the other competitive methods on Delicious-small and Last.fm, but PITF performs a little better than our method on Movielens. Since the data in Delicious-small and Last.fm is relatively sparser than the data in Movielens, this suggests that our method has an advantage when the data is sparser. Moreover, our method beats Top-UT+IT, which integrates tag popularity and the affinity between users and tags, on all the datasets, whereas PITF works worse than Top-UT+IT on Delicious-small.

Table 3: NDCG@5 of different methods on the three smaller datasets
Method    | Delicious-small | Last.fm | Movielens
PITF-64   | 0.2518          | 0.6715  | 0.6466
NLTF-64   | 0.2718          | 0.6986  | 0.6309
SVD-256   | 0.1089          | 0.3633  | 0.1673
Top-UT    | 0.2416          | 0.3811  | 0.4082
Top-IT    | 0.1110          | 0.4683  | 0.2851
Top-UT+IT | 0.2607          | 0.4588  | 0.4909

Impact of the Number of Features. The performance of NLTF and PITF depends on the number of features. Table 4 shows the impact of the number of features on the three datasets, Delicious-small, Last.fm and Movielens. As expected, the performance of NLTF and PITF increases as the number of features grows, and both saturate at 64 features on all three datasets. Besides, the performance of NLTF is much better than that of PITF when the number of features is small, because models exploiting the Gaussian function, like NLTF, can make better use of the latent features than models exploiting the dot product, like PITF, just as analyzed in the comparison section above; the effect is more significant when only a small number of features is used.

Table 4: NDCG@5 for different numbers of features on the three smaller datasets
Dataset         | Model | 4     | 8     | 16    | 32    | 64
Delicious-small | PITF  | 0.068 | 0.100 | 0.157 | 0.231 | 0.252
Delicious-small | NLTF  | 0.075 | 0.126 | 0.188 | 0.247 | 0.272
Last.fm         | PITF  | 0.369 | 0.502 | 0.601 | 0.654 | 0.672
Last.fm         | NLTF  | 0.445 | 0.562 | 0.652 | 0.692 | 0.699
Movielens       | PITF  | 0.321 | 0.468 | 0.576 | 0.629 | 0.647
Movielens       | NLTF  | 0.479 | 0.564 | 0.606 | 0.624 | 0.631

Besides, to verify the point that using the Gaussian function makes better use of the features than using the dot product, we carried out another experiment: we trained our proposed model NLTF with only one feature on Movielens, so as to observe whether the tag distribution corresponds to the toy example in Figure 3. To get a better view, the unpopular tags were removed. The tag distribution on Movielens with the number of features R = 1 is depicted in Figure 4, and the tags are described in Table 1. As we can see, NLTF divides the tags into four groups automatically: Oscar and AFI, movie categories, movie content, and adjectives. Tags in the same group are close to each other, and some of their labels overlap. Tags in "others" do not belong to any group, so they are likely of little help to personalized tag recommendation.

[Figure 4: Tag distribution on dataset Movielens with the number of features R = 1. Along the feature axis (about -40 to 30), the groups appear in the order: Oscar and AFI, movie categories, movie content, adjective, others.]

Table 1: Description of tags on dataset Movielens with the number of features R = 1
Oscar and AFI: afi 100, oscar (best directing), oscar (best cinematography), oscar (best supporting actor), afi 100 (thrills), oscar (best actor)
Movie Categories: scifi, fantasy, comedy, action, anime, classic
Movie Content: war, world war ii, surreal, serial killer, time travel, true story
Adjective: atmospheric, disturbing, deliberate, quirky, reflective, visceral
Others: erlends, dvds, my movies, johnny depp, boring, based on a book, franchise

Performance on Large-Scale Dataset

In addition, we compare our method NLTF with PITF on the large-scale dataset Delicious-large, which contains more than 0.6 billion triples and about 0.2 billion posts. To the best of our knowledge, no previous work has conducted personalized tag recommendation experiments on such a large-scale dataset. We ran the experiments on a computer with a 6-core Intel i7 CPU; the performance after 30 iterations is shown in Table 5. From the table, we can see that our method greatly outperforms PITF even on a large-scale dataset.

Table 5: Comparison of NLTF and PITF on the larger dataset
Method | NDCG@1 | NDCG@3 | NDCG@5 | NDCG@10
NLTF   | 0.3143 | 0.3053 | 0.3214 | 0.3579
PITF   | 0.2845 | 0.2758 | 0.2907 | 0.3247
Conclusion and Future Work

We have introduced a nonlinear tensor factorization method based on Canonical Decomposition. Our proposed method exploits the Gaussian radial basis function and can make better use of the features than the dot product: models exploiting the Gaussian function can be considered multi-class classifiers, while models exploiting the dot product can be considered two-class classifiers. Additionally, our experimental results show that our method outperforms the baseline methods and achieves very good performance with only a small number of features. For future work, we plan to leverage the profiles of users and the content of items and tags to make better predictions for tag recommendation.

Acknowledgements

We would like to thank the many referees of the previous version of this paper for their extremely useful suggestions and comments. This work was supported by the Huawei Innovation Research Program (HIRP) and the National Science Foundation of China (61033010).

References

Carroll, J., and Chang, J. 1970. Analysis of individual differences in multidimensional scaling via an n-way generalization of "Eckart-Young" decomposition. Psychometrika 35(3):283–319.
Chi, Y., and Zhu, S. 2010. FacetCube: A framework of incorporating prior knowledge into non-negative tensor factorization. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, 569–578. ACM.
Dunlavy, D. M.; Kolda, T. G.; and Acar, E. 2011. Temporal link prediction using matrix and tensor factorizations. ACM Transactions on Knowledge Discovery from Data (TKDD) 5(2):10.
Durao, F., and Dolog, P. 2010. Extending a hybrid tag-based recommender system with personalization. In Proceedings of the 2010 ACM Symposium on Applied Computing, 1723–1727. ACM.
Ermiş, B.; Acar, E.; and Cemgil, A. T. 2012. Link prediction via generalized coupled tensor factorisation. arXiv preprint arXiv:1208.6231.
Gao, D.; Zhang, R.; Li, W.; and Hou, Y. 2012. Twitter hyperlink recommendation with user-tweet-hyperlink three-way clustering. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, 2535–2538. ACM.
Guan, Z.; Bu, J.; Mei, Q.; Chen, C.; and Wang, C. 2009. Personalized tag recommendation using graph-based ranking on multi-type interrelated objects. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 540–547. ACM.
Harshman, R. A. 1970. Foundations of the PARAFAC procedure: Models and conditions for an "explanatory" multi-modal factor analysis. UCLA Working Papers in Phonetics 16:1.
Karatzoglou, A.; Amatriain, X.; Baltrunas, L.; and Oliver, N. 2010. Multiverse recommendation: N-dimensional tensor factorization for context-aware collaborative filtering. In Proceedings of the Fourth ACM Conference on Recommender Systems, 79–86. ACM.
Krestel, R.; Fankhauser, P.; and Nejdl, W. 2009. Latent Dirichlet allocation for tag recommendation. In Proceedings of the Third ACM Conference on Recommender Systems, 61–68. ACM.
Lipczak, M.; Hu, Y.; Kollet, Y.; and Milios, E. 2009. Tag sources for recommendation in collaborative tagging systems. ECML PKDD Discovery Challenge, 157–172.
Liu, D.; Hua, X.-S.; Yang, L.; Wang, M.; and Zhang, H.-J. 2009. Tag ranking. In Proceedings of the 18th International Conference on World Wide Web, 351–360. ACM.
Nickel, M.; Tresp, V.; and Kriegel, H.-P. 2011. A three-way model for collective learning on multi-relational data. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), 809–816.
Rendle, S., and Schmidt-Thieme, L. 2010. Pairwise interaction tensor factorization for personalized tag recommendation. In Proceedings of the Third ACM International Conference on Web Search and Data Mining, 81–90. ACM.
Rendle, S.; Balby Marinho, L.; Nanopoulos, A.; and Schmidt-Thieme, L. 2009. Learning optimal ranking with tensor factorization for tag recommendation. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 727–736. ACM.
Spiegel, S.; Clausen, J.; Albayrak, S.; and Kunegis, J. 2012. Link prediction on evolving data using tensor factorization. In New Frontiers in Applied Data Mining, 100–110. Springer.
Sun, J.-T.; Zeng, H.-J.; Liu, H.; Lu, Y.; and Chen, Z. 2005. CubeSVD: A novel approach to personalized web search. In Proceedings of the 14th International Conference on World Wide Web, 382–390. ACM.
Symeonidis, P.; Nanopoulos, A.; and Manolopoulos, Y. 2008. Tag recommendations based on tensor dimensionality reduction. In Proceedings of the 2008 ACM Conference on Recommender Systems, 43–50. ACM.
Symeonidis, P.; Nanopoulos, A.; and Manolopoulos, Y. 2010. A unified framework for providing recommendations in social tagging systems based on ternary semantic analysis. IEEE Transactions on Knowledge and Data Engineering 22(2):179–192.
Tucker, L. R. 1966. Some mathematical notes on three-mode factor analysis. Psychometrika 31:279–311.
Wetzker, R.; Zimmermann, C.; and Bauckhage, C. 2008. Analyzing social bookmarking systems: A del.icio.us cookbook. In Mining Social Data (MSoDa) Workshop Proceedings, 26–30.