Personalized Tag Recommendation through Nonlinear Tensor Factorization Using Gaussian Kernel

Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence

Xiaomin Fang and Rong Pan*
Department of Computer Science, Sun Yat-sen University, Guangzhou, China
{fangxmin@mail2, panr@}.sysu.edu.cn

Guoxiang Cao, Xiuqiang He and Wenyuan Dai
Huawei Technologies Co., Ltd., Bantian, Longgang District, Shenzhen, China
{benjamin.cao, hexiuqiang, dai.wenyuan}@huawei.com

*Corresponding author

Abstract
Personalized tag recommendation systems recommend a list of tags to a user when he is about to annotate an item. They exploit the individual preferences of the users and the characteristics of the items. Tensor factorization techniques have been applied to many applications, such as tag recommendation. Models based on Tucker Decomposition can achieve good performance but require a lot of computation power. On the other hand, models based on Canonical Decomposition can run in linear time and are more feasible for online recommendation. In this paper, we propose a novel method for personalized tag recommendation that can be considered a nonlinear extension of Canonical Decomposition. Unlike linear tensor factorization, we exploit the Gaussian radial basis function to increase the model's capacity. The experimental results show that our proposed method outperforms the state-of-the-art methods for tag recommendation on real datasets and performs well even with a small number of features, which verifies that our model makes better use of the features.
Introduction
Nowadays, a great amount of information springs up on the Internet, and users struggle to find the items they are really interested in. To tackle this problem, recommendation systems are designed to recommend appropriate items to users according to their habits. Collaborative tagging systems like Delicious, Last.fm and Movielens allow users to upload resources and annotate them. The activity in which users annotate items, such as movies, songs and pictures, with keywords is called tagging. A tag can be viewed as an implicit rating and can be used to identify not only the features of the items but also the users' personalities. Tag recommendation is the task of suggesting tags to a user when he is about to annotate an item. Given an item, different users may use different tags to annotate it. Personalized tag recommender systems utilize users' past tagging behaviors to predict their future behaviors. For example, if two users annotated an item with the same tag before, it is likely that they will annotate another item with the same tag in the future.
Some tag recommender systems rank tags by means of tensor factorization techniques. Tensor factorization is a high-order extension of matrix factorization: tensor factorization-based models decompose the user-item-tag tensor into three matrices to learn the latent features of the users, items and tags. Higher Order Singular Value Decomposition (HOSVD) (Symeonidis, Nanopoulos, and Manolopoulos 2008), Multiverse (Karatzoglou et al. 2010) and RTF (Rendle et al. 2009) are based on Tucker Decomposition (TD); they can improve on the state-of-the-art recommenders in prediction quality, but require too much computation power. In contrast, pairwise interaction tensor factorization (PITF) (Rendle and Schmidt-Thieme 2010) is based on another factorization approach, Canonical Decomposition (CD), and is linear in both the dataset size and the feature dimensionality. In this paper, we present a novel approach based on Canonical Decomposition that can be considered a nonlinear generalization of CD. We use the Gaussian radial basis function to capture the complex relations between the users, items and tags, and we analyze why exploiting the Gaussian function makes better use of the latent features, as well as its relation to the traditional linear tensor factorization techniques. Our experimental results show that our proposed model outperforms other competitive methods and can achieve good performance with only a small number of latent features.

The contributions of our work are summarized as follows.

• We propose a novel nonlinear approach based on Canonical Decomposition, which employs the Gaussian radial basis function. Its time performance is comparable to PITF, and it runs in linear time.

• We compare our nonlinear approach to the traditional linear methods and analyze why the Gaussian-based method makes better use of the latent features.

• The experimental results demonstrate that our method outperforms other methods on real datasets and performs well even with only a small number of latent features.
Related Work
Tag Recommendation Techniques
There is a lot of previous work on tag recommendation. Krestel, Fankhauser, and Nejdl proposed an approach based on Latent Dirichlet Allocation (LDA): given a document, the top terms composing its latent topics are recommended to the users. In this way, terms that belong to the same topic but appear in other documents have the opportunity to be presented. Lipczak et al. employed a probabilistic latent topic model for tag recommendation as well; they added a tag variable to LDA so as to link the tags to the topics of the document. Liu et al. ranked the tags for images by using random walks. However, the systems mentioned above are not personalized, and different users may prefer different tags for the same item. Guan et al. ranked tags with a graph-based ranking algorithm that treats personalized tag recommendation as a "query and ranking" problem: the query consists of two parts, the user and the item, and the tags are related to both. Besides, Durao and Dolog introduced a hybrid approach that makes use of several factors, including tag similarity, tag popularity, tag representativeness and the affinity between user and tag, to provide appropriate tags.
Figure 1: Tensor construction and post-based ranking interpretation for personalized tag recommendation
Tensor Factorization Techniques

Tucker Decomposition and Canonical Decomposition are high-order extensions of Singular Value Decomposition (SVD). Nickel, Tresp, and Kriegel applied tensor factorization to collective entity resolution. Some studies (Dunlavy, Kolda, and Acar 2011; Ermiş, Acar, and Cemgil 2012; Spiegel et al. 2012) proposed tensor-based models to solve the problem of link prediction by incorporating time information. Moreover, tensor factorization has been applied to personalized recommendation, including personalized web search (Sun et al. 2005), hyperlink recommendation (Gao et al. 2012) and personalized tag recommendation (Rendle and Schmidt-Thieme 2010; Symeonidis, Nanopoulos, and Manolopoulos 2008; 2010; Rendle et al. 2009; Chi and Zhu 2010). With respect to personalized tag recommendation, Higher Order Singular Value Decomposition (HOSVD) (Symeonidis, Nanopoulos, and Manolopoulos 2008) and RTF (Rendle et al. 2009) are based on Tucker Decomposition (TD), while PITF (Rendle and Schmidt-Thieme 2010) is based on Canonical Decomposition (CD). HOSVD cannot deal with missing values and fills the unknown entries with 0; RTF handles the missing values by adding pairwise ranking constraints. Although TD-based methods outperform other state-of-the-art tag recommendation approaches, they require a lot of computation power and are therefore not feasible for online recommendation. Compared with TD-based methods, CD-based methods have a huge advantage in running time, because they can be trained in linear time. PITF, an extension of CD, splits the ternary relation of users, items and tags into two relations, user-tag and item-tag, and its quality is comparable to that of RTF.

Notations and Preliminaries

Tensor Factorization

A tensor is a multidimensional or multi-way array, and the order of a tensor is the number of its dimensions. Tucker Decomposition (TD) (Tucker 1966) and Canonical Decomposition (CD) (Carroll and Chang 1970; Harshman 1970) can be seen as higher-order extensions of matrix factorization, i.e., of Singular Value Decomposition (SVD).

Tucker Decomposition (TD) decomposes a tensor into a core tensor multiplied by a matrix along each mode. Taking a three-order tensor $A \in \mathbb{R}^{I \times J \times K}$ as an example, it can be written as

$$A_{i,j,k} = \sum_{r_1=1}^{R_1} \sum_{r_2=1}^{R_2} \sum_{r_3=1}^{R_3} C_{r_1,r_2,r_3} \cdot X_{i,r_1} \cdot Y_{j,r_2} \cdot Z_{k,r_3}, \qquad (1)$$

where $X \in \mathbb{R}^{I \times R_1}$, $Y \in \mathbb{R}^{J \times R_2}$, $Z \in \mathbb{R}^{K \times R_3}$, and $C \in \mathbb{R}^{R_1 \times R_2 \times R_3}$ is a core tensor. $R_1$, $R_2$ and $R_3$ are the numbers of latent features.

Canonical Decomposition (CD) decomposes a tensor into a sum of rank-one component tensors. Let $A \in \mathbb{R}^{I \times J \times K}$ be a tensor; it can be written as

$$A_{i,j,k} = \sum_{r=1}^{R} X_{i,r} \cdot Y_{j,r} \cdot Z_{k,r}, \qquad (2)$$

where $X \in \mathbb{R}^{I \times R}$, $Y \in \mathbb{R}^{J \times R}$, $Z \in \mathbb{R}^{K \times R}$, and $R$ is the number of latent features.

$X$, $Y$ and $Z$ in Eqs. (1) and (2) are analogous to the factor matrices in Singular Value Decomposition (SVD), and the values in the core tensor $C$ are analogous to the singular values. Besides, if we compare the running time of TD in Eq. (1) with that of CD in Eq. (2), we find that CD is linear in the feature dimensionality, whereas the time complexity of TD is much higher.
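To make the difference concrete, the following sketch (hypothetical NumPy code with made-up dimensions, not code from the paper) reconstructs a single tensor entry under Eq. (1) and Eq. (2); note how the Tucker entry sums over all $R_1 \cdot R_2 \cdot R_3$ core entries, while the CD entry is a single sum over $R$ components:

```python
import numpy as np

# Hypothetical sizes for an I x J x K tensor, just for illustration.
I, J, K = 5, 6, 7
R1, R2, R3, R = 3, 3, 3, 3
rng = np.random.default_rng(0)

# Tucker Decomposition (Eq. 1): three factor matrices plus a core tensor C.
X, Y, Z = rng.normal(size=(I, R1)), rng.normal(size=(J, R2)), rng.normal(size=(K, R3))
C = rng.normal(size=(R1, R2, R3))

def tucker_entry(i, j, k):
    # A_{i,j,k} = sum over r1, r2, r3 of C[r1,r2,r3] * X[i,r1] * Y[j,r2] * Z[k,r3].
    return np.einsum('abc,a,b,c->', C, X[i], Y[j], Z[k])

# Canonical Decomposition (Eq. 2): a sum of R rank-one components, no core tensor.
Xc, Yc, Zc = rng.normal(size=(I, R)), rng.normal(size=(J, R)), rng.normal(size=(K, R))

def cd_entry(i, j, k):
    # A_{i,j,k} = sum over r of X[i,r] * Y[j,r] * Z[k,r] -- linear in R,
    # while the Tucker entry above costs O(R1 * R2 * R3).
    return np.sum(Xc[i] * Yc[j] * Zc[k])

print(tucker_entry(0, 1, 2), cd_entry(0, 1, 2))
```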
Personalized Tag Recommendation
Tagging is a process in which a user associates an item with keywords, which are called tags. We use notation similar to that of (Rendle et al. 2009; Rendle and Schmidt-Thieme 2010) for the formulation of personalized tag recommendation. Let U be the set of users, I the set of items, and T the set of tags. The event that user u annotates item i with tag t is symbolized by a triple (u, i, t). Let S ⊆ U × I × T denote the set of observed tagging triples; (u, i, t) ∈ S means that user u has annotated item i with tag t. The relation between the users, items and tags for personalized tag recommendation can be represented by a three-order tensor, which is depicted in Figure 1.

Rendle et al. introduced the post-based ranking interpretation scheme for personalized tag recommendation, and we employ this scheme in this paper. Let P_S represent the set of all user-item pairs (u, i) in S:

$$P_S = \{(u, i) \mid \exists t : (u, i, t) \in S\}.$$
The post-based scheme generates positive and negative examples only from the observed posts (u, i) ∈ P_S. Given a post (u, i) ∈ P_S, all triples (u, i, t) ∈ S are taken as positive examples and the triples (u, i, t) ∉ S are taken as negative examples. If (u, i) ∉ P_S, we do not know the user's preference for that item, so all entries for unobserved posts are treated as missing values. We formulate the relevance of (u, i, t) as f(u, i, t). Given a post (u, i), if user u has annotated item i with tag t^pos rather than tag t^neg, it implies that user u prefers tag t^pos to tag t^neg for that post; in other words, f(u, i, t^pos) > f(u, i, t^neg) iff (u, i, t^pos) ∈ S and (u, i, t^neg) ∉ S. Thus, in this paper, we use the notation t^pos for "positive tags" and t^neg for "negative tags". As a result, the set of pairwise ranking constraints D_S derived from S is defined as

$$D_S = \{(u, i, t^{pos}, t^{neg}) \mid (u, i, t^{pos}) \in S \wedge (u, i, t^{neg}) \notin S\}.$$

The elements of D_S are used as training examples for tag recommendation. Figure 1 shows the tensor construction and a toy example of the post-based ranking interpretation for personalized tag recommendation. User u has labeled item 4 with tag 1 and tag 3; thus, we assume user u prefers tag 1 and tag 3 to tag 2 and tag 4. Besides, since user u has not labeled item 1 and item 3, the values of all the entries in the first and third rows are missing.
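A minimal sketch of this construction (our own illustration; the triples mirror the Figure 1 toy example, and all variable names are ours):

```python
from itertools import product

# Toy observed triples S: (user, item, tag), as in the Figure 1 example.
S = {(0, 4, 1), (0, 4, 3), (0, 2, 1)}
all_tags = {1, 2, 3, 4}

# Observed posts P_S = {(u, i) | exists t: (u, i, t) in S}.
P_S = {(u, i) for (u, i, t) in S}

# D_S = {(u, i, t_pos, t_neg) | (u, i, t_pos) in S and (u, i, t_neg) not in S}.
D_S = [
    (u, i, t_pos, t_neg)
    for (u, i) in P_S
    for t_pos, t_neg in product(all_tags, all_tags)
    if (u, i, t_pos) in S and (u, i, t_neg) not in S
]

# For post (0, 4): positive tags {1, 3}, negative tags {2, 4}, which yields
# the four constraints (0,4,1,2), (0,4,1,4), (0,4,3,2), (0,4,3,4).
print(sorted(D_S))
```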
Nonlinear Tensor Factorization Using Gaussian Kernel for Tag Recommendation

Nonlinear Tensor Factorization Using Gaussian

In this paper, we introduce a nonlinear way to decompose the tensor, which we refer to as Nonlinear Tensor Factorization (NLTF). The ternary relation between users, items and tags is represented by a three-order tensor with |U| users, |I| items and |T| tags. The tensor can be decomposed into three components: a user matrix U ∈ R^{|U|×R}, an item matrix I ∈ R^{|I|×R} and a tag matrix T ∈ R^{|T|×R}, where R is the number of features. The entries in the i-th row of matrices U, I and T indicate the latent features of the i-th user, i-th item and i-th tag, respectively. Each column of U, I and T can be considered a latent topic.

Analogous to previous work (Rendle and Schmidt-Thieme 2010), we model the three pairwise relations user-item, user-tag and item-tag rather than directly modeling the ternary relation user-item-tag. As mentioned above, an entry in a feature vector stands for a latent topic. Without loss of generality, we take the item-tag relation as an example. We first assume that the latent topics are independent of each other. For example, if the items are movies, the first latent topic might be comedy and the second latent topic might be action. If tag t is intimately related to item i with respect to the first topic, the value of the first latent feature of tag t should be close to the value of the first latent feature of item i. Accordingly, we assume that if item i and tag t are relevant with respect to the k-th topic, then T_{t,k} obeys the Gaussian distribution N(I_{i,k}, σ²) and vice versa, where T_{t,k} represents the k-th feature of tag t and I_{i,k} the k-th feature of item i. In other words, the distance between the value of the k-th feature of tag t and that of item i obeys a Gaussian distribution, and the smaller the distance is, the more relevant item i and tag t are. This can be written as

$$\|I_{i,k} - T_{t,k}\| \sim N(0, \sigma^2),$$

where σ is the standard deviation of the Gaussian distribution. The user-tag and user-item relations are treated analogously to the item-tag relation. However, the distributions of the latent topics of the user-item, user-tag and item-tag relations might differ. Thus, we define two sets of features for each of the users, items and tags, and we have

$$\|U^I_{u,k} - I^U_{i,k}\| \sim N(0, \sigma^2), \quad \|U^T_{u,k} - T^U_{t,k}\| \sim N(0, \sigma^2), \quad \|I^T_{i,k} - T^I_{t,k}\| \sim N(0, \sigma^2),$$

where $U^T, U^I \in \mathbb{R}^{|U| \times R}$, $I^U, I^T \in \mathbb{R}^{|I| \times R}$, and $T^U, T^I \in \mathbb{R}^{|T| \times R}$.

With these assumptions, given a particular triple (u, i, t), we define its relevance score f(u, i, t) as

$$f(u, i, t) = \sum_k \zeta^{UI}_k \exp\left(-\tfrac{1}{2}\beta\|U^I_{u,k} - I^U_{i,k}\|^2\right) + \sum_k \zeta^{UT}_k \exp\left(-\tfrac{1}{2}\beta\|U^T_{u,k} - T^U_{t,k}\|^2\right) + \sum_k \zeta^{IT}_k \exp\left(-\tfrac{1}{2}\beta\|I^T_{i,k} - T^I_{t,k}\|^2\right), \qquad (3)$$

where β = 1/σ². ζ^{UI}_k is the weight, or importance, of the k-th topic for the user-item relation; similarly, ζ^{UT}_k and ζ^{IT}_k are the importance of the k-th topic for the user-tag and item-tag relations, respectively.

In fact, the NLTF model can be represented by a neural network with a specific activation function and a three-layer structure, depicted in Figure 2. The network contains three layers: an input layer, a hidden layer and an output layer. The input layer is composed of six blocks and the hidden layer of three blocks. The first two blocks of the input layer and the first block of the hidden layer correspond to the user-item relation; the user-tag and item-tag relations are represented analogously. For the input representation, one-of-n encoding is used: given a triple (u, i, t), the u-th unit in the first block is set to 1 and all the other units in that block are set to 0. The settings for the other input blocks are similar. For the hidden layer, we use exp(−½β||p − q||²) as the activation function, where p and q are the weights of the nonzero input units. Taking the user-item relation as an example, p and q correspond to U^I_{u,k} and I^U_{i,k} in Eq. (3), respectively, and ζ^{UI}_k corresponds to the weight of the k-th unit in the first block of the hidden layer.

For a single training example (u, i, t^pos, t^neg) ∈ D_S, we define the objective function for personalized tag recommendation with respect to that single example as

$$J(u, i, t^{pos}, t^{neg}) = \mathrm{sigmoid}\left(f(u, i, t^{pos}) - f(u, i, t^{neg})\right), \qquad (4)$$
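The relevance score of Eq. (3) is straightforward to compute. The following sketch (hypothetical NumPy code with made-up dimensions, not the authors' implementation) also makes the Figure 2 reading explicit, since each summand is one Gaussian RBF hidden unit:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, n_tags, R = 10, 20, 15, 8
beta = 1.0  # beta = 1 / sigma^2

# Two feature sets per entity type, as in Eq. (3):
# U_I pairs with I_U (user-item), U_T with T_U (user-tag), I_T with T_I (item-tag).
U_I, U_T = rng.normal(size=(n_users, R)), rng.normal(size=(n_users, R))
I_U, I_T = rng.normal(size=(n_items, R)), rng.normal(size=(n_items, R))
T_U, T_I = rng.normal(size=(n_tags, R)), rng.normal(size=(n_tags, R))

# Per-topic weights zeta; the paper later fixes them to 1 (Eq. 6).
zeta_UI = zeta_UT = zeta_IT = np.ones(R)

def f(u, i, t):
    """Relevance score of triple (u, i, t), Eq. (3).

    Each summand is a Gaussian RBF over a per-topic distance, so it can also
    be read as one hidden unit of the three-layer network in Figure 2.
    """
    s_ui = np.sum(zeta_UI * np.exp(-0.5 * beta * (U_I[u] - I_U[i]) ** 2))
    s_ut = np.sum(zeta_UT * np.exp(-0.5 * beta * (U_T[u] - T_U[t]) ** 2))
    s_it = np.sum(zeta_IT * np.exp(-0.5 * beta * (I_T[i] - T_I[t]) ** 2))
    return s_ui + s_ut + s_it

print(f(0, 1, 2))
```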
Figure 2: Neural network representation for NLTF (six input blocks for the user, item and tag features of the three pairwise relations, three hidden blocks of Gaussian units, and a single output unit)
where sigmoid(x) = 1/(1 + exp(−x)). The sigmoid function is clearly monotonically increasing. If a particular user u has annotated item i with tag t^pos but not with tag t^neg, it indicates that he prefers tag t^pos to tag t^neg; thus, the difference between the relevance scores of (u, i, t^pos) and (u, i, t^neg) should be large, and we use the sigmoid function to model this.

Given a training set D_S, the overall objective function is defined as

$$J(D_S) = \sum_{(u,i,t^{pos},t^{neg}) \in D_S} J(u, i, t^{pos}, t^{neg}). \qquad (5)$$

Parameter estimation for the tag recommendation task can be formulated as the optimization problem of maximizing J(D_S).

Note that, given an example (u, i, t^pos, t^neg), the first component of f(u, i, t) is eliminated when calculating f(u, i, t^pos) − f(u, i, t^neg) in Eq. (4); thus, we drop it from f(u, i, t). Besides, when maximizing the objective function, the scale of ζ^{UI}_k, ζ^{UT}_k and ζ^{IT}_k would grow large, which leads to overfitting. To address this issue, we simply set the values of ζ^{UI}_k, ζ^{UT}_k and ζ^{IT}_k to 1. Therefore, given a triple (u, i, t), the relevance score f(u, i, t) is simplified to

$$f(u, i, t) = \sum_k \exp\left(-\tfrac{1}{2}\beta\|U^T_{u,k} - T^U_{t,k}\|^2\right) + \sum_k \exp\left(-\tfrac{1}{2}\beta\|I^T_{i,k} - T^I_{t,k}\|^2\right). \qquad (6)$$
Parameters Learning

We use stochastic gradient descent (SGD), a widely used technique, to optimize the objective function J(D_S). We randomly select a training example (u, i, t^pos, t^neg) from D_S and perform an update for that example; this process repeats until convergence. The gradients of the nonlinear tensor factorization model are

$$\frac{\partial f_{u,i,t}}{\partial U^T_{u,k}} = -\beta \exp\left(-\tfrac{1}{2}\beta\|U^T_{u,k} - T^U_{t,k}\|^2\right)(U^T_{u,k} - T^U_{t,k}),$$

$$\frac{\partial f_{u,i,t}}{\partial I^T_{i,k}} = -\beta \exp\left(-\tfrac{1}{2}\beta\|I^T_{i,k} - T^I_{t,k}\|^2\right)(I^T_{i,k} - T^I_{t,k}),$$

$$\frac{\partial f_{u,i,t}}{\partial T^U_{t,k}} = \beta \exp\left(-\tfrac{1}{2}\beta\|U^T_{u,k} - T^U_{t,k}\|^2\right)(U^T_{u,k} - T^U_{t,k}),$$

$$\frac{\partial f_{u,i,t}}{\partial T^I_{t,k}} = \beta \exp\left(-\tfrac{1}{2}\beta\|I^T_{i,k} - T^I_{t,k}\|^2\right)(I^T_{i,k} - T^I_{t,k}).$$

With the gradients, we can update the parameters of our model: we first initialize all of the parameters and then learn them until convergence.
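A sketch of one such update (our own illustration, assuming a learning-rate hyper-parameter lr; the chain-rule factor through the sigmoid of Eq. (4), which the gradients above leave implicit, is written out explicitly):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def f_simplified(u, i, t, U_T, T_U, I_T, T_I, beta):
    # Simplified relevance score, Eq. (6): the user-item term cancels in Eq. (4).
    return (np.sum(np.exp(-0.5 * beta * (U_T[u] - T_U[t]) ** 2)) +
            np.sum(np.exp(-0.5 * beta * (I_T[i] - T_I[t]) ** 2)))

def grads_f(u, i, t, U_T, T_U, I_T, T_I, beta):
    # Gradients of f w.r.t. the rows involved, matching the four derivatives above:
    # df/dU_T[u] = -beta * g_ut, df/dT_U[t] = +beta * g_ut, and likewise for item-tag.
    g_ut = np.exp(-0.5 * beta * (U_T[u] - T_U[t]) ** 2) * (U_T[u] - T_U[t])
    g_it = np.exp(-0.5 * beta * (I_T[i] - T_I[t]) ** 2) * (I_T[i] - T_I[t])
    return -beta * g_ut, -beta * g_it, beta * g_ut, beta * g_it

def sgd_step(u, i, t_pos, t_neg, U_T, T_U, I_T, T_I, beta=1.0, lr=0.05):
    # Maximize J = sigmoid(f(u,i,t_pos) - f(u,i,t_neg)) by gradient ascent.
    d = (f_simplified(u, i, t_pos, U_T, T_U, I_T, T_I, beta) -
         f_simplified(u, i, t_neg, U_T, T_U, I_T, T_I, beta))
    w = sigmoid(d) * (1.0 - sigmoid(d))  # chain rule through the sigmoid

    for t, sign in ((t_pos, 1.0), (t_neg, -1.0)):
        dU, dI, dTU, dTI = grads_f(u, i, t, U_T, T_U, I_T, T_I, beta)
        U_T[u] += lr * w * sign * dU
        I_T[i] += lr * w * sign * dI
        T_U[t] += lr * w * sign * dTU
        T_I[t] += lr * w * sign * dTI

# Example: one update on randomly initialized parameters.
rng = np.random.default_rng(0)
U_T, T_U = rng.normal(size=(5, 4)), rng.normal(size=(9, 4))
I_T, T_I = rng.normal(size=(7, 4)), rng.normal(size=(9, 4))
sgd_step(u=0, i=1, t_pos=2, t_neg=3, U_T=U_T, T_U=T_U, I_T=I_T, T_I=T_I)
```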
Comparison with Other Tensor Factorization Models

HOSVD (Symeonidis, Nanopoulos, and Manolopoulos 2008), RTF (Rendle et al. 2009) and PITF (Rendle and Schmidt-Thieme 2010) are models based on linear tensor factorization. In particular, HOSVD and RTF are TD-based, and PITF is CD-based. Our proposed model NLTF can be seen as a nonlinear extension of the CD-based models: we use the Gaussian radial basis function rather than the dot product employed in PITF.

Assume that the numbers of features of users, items and tags are all R for the TD-based models. If we exploit stochastic gradient descent (SGD) to learn the parameters of all the models, then, given a training example, TD-based models take O(R³) to update the parameters, while CD-based models take O(R). For prediction, given a post (user, item), TD-based models take O(|T|·R + R³) to rank the tags, while CD-based models take O(|T|·R). Therefore, CD-based models outperform TD-based models in terms of running time and are more feasible for online tag recommendation.

On the other hand, our model, which exploits the Gaussian radial basis function to calculate similarity, is more expressive than models exploiting the dot product: models employing the dot product can be viewed as two-class classifiers, while models employing the Gaussian radial basis function can be seen as multi-class classifiers. For example, suppose we want to provide a set of tags to user u who has just watched movie i, and assume there is only one latent feature I_{i,1}. With the dot product, if the value of I_{i,1} is positive, it may indicate that the movie is a comedy; otherwise, it is not. Thus, if I_{i,1} is positive and tag t is related to comedies, the value of T_{t,1} should be positive so as to make the dot product of I_{i,1} and T_{t,1} large, so that tag t ranks high for post (u, i). For Gaussian-based models, a toy example is depicted in Figure 3. The horizontal axis represents the value of I_{i,1}: if I_{i,1} ∈ (−3, 3), the movie is probably an action movie; if I_{i,1} ∈ (−9, −5), it is probably a romance; if I_{i,1} ∈ (5, 9), it is probably a drama; and if I_{i,1} ∈ (−∞, −20) ∪ (20, ∞), it does not belong to any of the categories in Figure 3. For instance, suppose tag t is "comedy" and movies i1 and i2 are annotated with tag t. By our assumption, if an item and a tag are related, the distance between the feature of that item and the feature of that tag obeys a Gaussian distribution. That means the distance between the feature of tag t and the feature of movie i1 is likely to be small, and so is the distance between the feature of tag t and the feature of movie i2; thus, the feature of i1 is likely to be close to the feature of i2. In other words, the features of movies that belong to the same category are likely to be close to each other and obey a Gaussian distribution. We can distinguish several categories using only one feature, because different categories fall into different parts of the coordinate axis. In conclusion, Gaussian-based models can deal with more than two classes with only one feature, whereas the dot product can handle only two. Thus, applying the Gaussian function to the radial basis distance makes better use of the latent features than the dot product; the experimental results on real datasets in the next section demonstrate this point.

Figure 3: A toy example of nonlinear tensor factorization using the Gaussian radial basis function (probability curves for the categories comedy, romance, action, scifi and drama along the I_{i,1} axis, from −20 to 20)

Experimental Results

Experimental Setup

We use four datasets to evaluate the performance of our proposed model: Delicious-small[1], Last.fm[2], Movielens[3] and Delicious-large[4] (Wetzker, Zimmermann, and Bauckhage 2008). The three smaller datasets, Delicious-small, Last.fm and Movielens, come from the 2nd International Workshop on Information Heterogeneity and Fusion in Recommender Systems (HetRec 2011). The larger dataset, Delicious-large, contains the complete bookmarking activity of almost 2 million users from the launch of the social bookmarking website in 2003 to the end of March 2011. For each dataset, we removed infrequent users, items occurring in fewer than 5 posts, and tags occurring in fewer than 5 triples. The statistics of the datasets after filtering are given in Table 2. To avoid randomness in the results, we perform 10-fold cross-validation on the three smaller datasets, Delicious-small, Last.fm and Movielens. For the larger dataset, Delicious-large, we randomly sample about 40 thousand posts for testing and use the remaining data for training.

[1] http://www.delicious.com
[2] http://www.lastfm.com
[3] http://www.grouplens.org
[4] http://www.zubiaga.org/resources/socialbm0311

Table 2: Data statistics

Dataset           Users   Items   Tags    Posts    Triples
Delicious-small   1.8K    35.9K   9.3K    66.7K    312.0K
Last.fm           1.6K    7.3K    2.3K    61.1K    164.5K
Movielens         0.7K    2.6K    1.6K    17.8K    30.8K
Delicious-large   1.5M    7.5M    2.6M    198.9M   675.0M

To describe the baselines, we introduce two counting functions, N_UT(u, t) and N_IT(i, t). Given user u and tag t, we define N_UT(u, t) as

N_UT(u, t) = |{i | (u, i, t) ∈ S}|,

and, given item i and tag t, we define N_IT(i, t) as

N_IT(i, t) = |{u | (u, i, t) ∈ S}|.

We compare our proposed model NLTF with several baselines, including PITF, SVD, Top-UT, Top-IT and Top-UT+IT, where SVD employs only the two-dimensional item-tag relation, Top-UT ranks tags by N_UT(u, t), Top-IT ranks tags by N_IT(i, t), and Top-UT+IT ranks tags by N_UT(u, t) + N_IT(i, t). Top-UT represents the affinity between users and tags, Top-IT represents tag popularity, and Top-UT+IT can be seen as a simple ensemble of Top-UT and Top-IT. We do not compare NLTF with TD-based models like HOSVD and RTF, because their computation time is unacceptable in practice, especially for the larger dataset Delicious-large; besides, the experimental results in (Rendle and Schmidt-Thieme 2010) show that the performance of PITF is comparable to that of TD-based models. For each dataset, we tuned the hyper-parameters of all the models by grid search to achieve the best results.
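As a minimal illustration of the counting baselines (our own sketch, not the authors' code; the toy triples are made up):

```python
from collections import Counter

# Toy tagging history S of unique (user, item, tag) triples.
S = {(0, 1, 'rock'), (0, 2, 'rock'), (0, 2, 'jazz'), (1, 2, 'jazz')}

# N_UT(u, t): number of items user u has annotated with tag t.
N_UT = Counter((u, t) for (u, i, t) in S)
# N_IT(i, t): number of users who annotated item i with tag t.
N_IT = Counter((i, t) for (u, i, t) in S)

def top_ut_it(u, i, candidate_tags, n=5):
    """Rank candidate tags for post (u, i) by N_UT(u, t) + N_IT(i, t)."""
    return sorted(candidate_tags,
                  key=lambda t: N_UT[(u, t)] + N_IT[(i, t)],
                  reverse=True)[:n]

print(top_ut_it(0, 2, {'rock', 'jazz', 'pop'}))  # 'rock' and 'jazz' before 'pop'
```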
Evaluation Methodology

We treat personalized tag recommendation as an information retrieval problem and employ Normalized Discounted Cumulative Gain (NDCG) as the metric to evaluate the performance of the models. NDCG is widely used in information retrieval and makes the assumption that relevant tags are more valuable if they appear earlier in the recommendation list.

Discounted Cumulative Gain (DCG) evaluates the gain of a tag based on its position in the ranked sequence returned by the model. The DCG accumulated at a particular position p is defined as

$$DCG@p = rel(1) + \sum_{k=2}^{p} \frac{rel(k)}{\log_2(k)}, \qquad (7)$$

where rel(k) is 1 if the tag at position k is relevant and 0 otherwise. In order to normalize the metric, the Ideal Discounted Cumulative Gain (IDCG) is introduced, which is the DCG of the ideal ranked sequence. Normalized Discounted Cumulative Gain (NDCG) is then defined as

$$NDCG@p = \frac{DCG@p}{IDCG@p}. \qquad (8)$$
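As an illustration (a small self-contained sketch, not the authors' evaluation code), NDCG@p for a binary-relevance tag ranking can be computed as follows:

```python
import math

def dcg_at_p(rel):
    """DCG of a binary relevance list rel(1..p), Eq. (7)."""
    if not rel:
        return 0.0
    return rel[0] + sum(r / math.log2(k) for k, r in enumerate(rel[1:], start=2))

def ndcg_at_p(ranked_tags, relevant_tags, p):
    """NDCG@p, Eq. (8): DCG of the model ranking over DCG of the ideal ranking."""
    rel = [1 if tag in relevant_tags else 0 for tag in ranked_tags[:p]]
    # IDCG@p: the ideal ranking places all relevant tags first.
    n_rel = min(len(relevant_tags), p)
    ideal = [1] * n_rel + [0] * (p - n_rel)
    idcg = dcg_at_p(ideal)
    return dcg_at_p(rel) / idcg if idcg > 0 else 0.0

# Example: the model ranks tags [3, 1, 4, 2, 5]; the true tags for this post are {1, 3}.
print(ndcg_at_p([3, 1, 4, 2, 5], {1, 3}, p=5))  # 1.0: both relevant tags ranked first
```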
Performance on Smaller Datasets
Comparisons with Baselines   Table 3 shows the performance of the different methods on the three smaller datasets, Delicious-small, Last.fm and Movielens. For NLTF and PITF, the number of features is R = 64, and for SVD, R = 256, because NLTF and PITF saturate at 64 features while SVD saturates at 256 features. It can be seen from Table 3 that NLTF outperforms the other competitive methods on Delicious-small and Last.fm, while PITF performs slightly better than our method on Movielens. Since the data in Delicious-small and Last.fm is relatively sparser than the data in Movielens, our method appears to have an advantage when the data is sparser. Moreover, our method beats Top-UT+IT, which integrates tag popularity and the affinity between users and tags, on all the datasets, whereas PITF performs worse than Top-UT+IT on Delicious-small.

Table 3: NDCG@5 of different methods on the three smaller datasets

Method      Delicious-small   Last.fm   Movielens
PITF-64     0.2518            0.6715    0.6466
NLTF-64     0.2718            0.6986    0.6309
SVD-256     0.1089            0.3633    0.1673
Top-UT      0.2416            0.3811    0.4082
Top-IT      0.1110            0.4683    0.2851
Top-UT+IT   0.2607            0.4588    0.4909

Impact of Number of Features   The performance of NLTF and PITF depends on the number of features. Table 4 shows the impact of the number of features on the three datasets Delicious-small, Last.fm and Movielens. As expected, the performance of NLTF and PITF increases as the number of features grows and saturates at 64 features on all three datasets. Besides, the performance of NLTF is much better than that of PITF when the number of features is small: as we analyzed above, models exploiting the Gaussian function, like NLTF, make better use of the latent features than models exploiting the dot product, like PITF, and the effect is more significant when only a small number of features is used.

Table 4: NDCG@5 for different numbers of features on the three smaller datasets

Dataset           Model   4       8       16      32      64
Delicious-small   PITF    0.068   0.100   0.157   0.231   0.252
Delicious-small   NLTF    0.075   0.126   0.188   0.247   0.272
Last.fm           PITF    0.369   0.502   0.601   0.654   0.672
Last.fm           NLTF    0.445   0.562   0.652   0.692   0.699
Movielens         PITF    0.321   0.468   0.576   0.629   0.647
Movielens         NLTF    0.479   0.564   0.606   0.624   0.631

In addition, to verify the point that using the Gaussian function makes better use of the features than using the dot product, we carry out another experiment. We train our proposed model NLTF using only one feature on Movielens, so as to observe whether the tag distribution corresponds to the toy example in Figure 3. To get a better view, unpopular tags are removed. The tag distribution on Movielens with the number of features R = 1 is depicted in Figure 4, and the tags are described in Table 1. As we can see, NLTF divides the tags into four groups automatically: Oscar and AFI, movie categories, movie content and adjectives. Tags in the same group are close to each other, and some of their labels overlap in the figure. Tags in the "others" group do not belong to any of the groups and thus might contribute little to personalized tag recommendation.

Figure 4: Tag distribution on dataset Movielens with the number of features R = 1 (tags spread along the single feature axis, roughly from −40 to 30, clustering into the groups Oscar and AFI, movie categories, movie content, adjectives and others)

Table 1: Description of tags on dataset Movielens with the number of features R = 1

Oscar and AFI                  Movie Categories   Movie Content   Adjective     Others
afi 100                        scifi              war             atmospheric   erlends dvds
oscar (best directing)         fantasy            world war ii    disturbing    my movies
oscar (best cinematography)    comedy             surreal         deliberate    johnny depp
oscar (best supporting actor)  action             serial killer   quirky        boring
afi 100 (thrills)              anime              time travel     reflective    based on a book
oscar (best actor)             classic            true story      visceral      franchise

Performance on Large-Scale Dataset

In addition, we compare our method NLTF with PITF on the large-scale dataset Delicious-large, which contains more than 0.6 billion triples and about 0.2 billion posts. To the best of our knowledge, no previous work has conducted personalized tag recommendation experiments on such a large-scale dataset. We conduct the experiments on a computer with a 6-core Intel i7 CPU, and the performance after 30 iterations is shown in Table 5. From the table, we can see that our method greatly outperforms PITF even on this large-scale dataset.

Table 5: Comparison of NLTF and PITF on the larger dataset

Model   NDCG@1   NDCG@3   NDCG@5   NDCG@10
NLTF    0.3143   0.3053   0.3214   0.3579
PITF    0.2845   0.2758   0.2907   0.3247
Conclusion and Future Work

We have introduced a nonlinear tensor factorization method based on Canonical Decomposition. Our proposed method exploits the Gaussian radial basis function and makes better use of the features than the dot product: models exploiting the Gaussian function can be considered multi-class classifiers, while models exploiting the dot product can be considered two-class classifiers. Additionally, our experimental results show that our method outperforms the baseline methods and achieves very good performance with only a small number of features.

For future work, we plan to leverage the profiles of the users and the content of the items and tags to make better predictions for tag recommendation.

Acknowledgements

We would like to thank the many referees of the previous version of this paper for their extremely useful suggestions and comments. This work was supported by the Huawei Innovation Research Program (HIRP) and the National Science Foundation of China (61033010).

References

Carroll, J., and Chang, J. 1970. Analysis of individual differences in multidimensional scaling via an n-way generalization of "Eckart-Young" decomposition. Psychometrika 35(3):283–319.

Chi, Y., and Zhu, S. 2010. FacetCube: A framework of incorporating prior knowledge into non-negative tensor factorization. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, 569–578. ACM.

Dunlavy, D. M.; Kolda, T. G.; and Acar, E. 2011. Temporal link prediction using matrix and tensor factorizations. ACM Transactions on Knowledge Discovery from Data (TKDD) 5(2):10.

Durao, F., and Dolog, P. 2010. Extending a hybrid tag-based recommender system with personalization. In Proceedings of the 2010 ACM Symposium on Applied Computing, 1723–1727. ACM.

Ermiş, B.; Acar, E.; and Cemgil, A. T. 2012. Link prediction via generalized coupled tensor factorisation. arXiv preprint arXiv:1208.6231.

Gao, D.; Zhang, R.; Li, W.; and Hou, Y. 2012. Twitter hyperlink recommendation with user-tweet-hyperlink three-way clustering. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, 2535–2538. ACM.

Guan, Z.; Bu, J.; Mei, Q.; Chen, C.; and Wang, C. 2009. Personalized tag recommendation using graph-based ranking on multi-type interrelated objects. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 540–547. ACM.

Harshman, R. A. 1970. Foundations of the PARAFAC procedure: Models and conditions for an "explanatory" multi-modal factor analysis. UCLA Working Papers in Phonetics 16:1.

Karatzoglou, A.; Amatriain, X.; Baltrunas, L.; and Oliver, N. 2010. Multiverse recommendation: N-dimensional tensor factorization for context-aware collaborative filtering. In Proceedings of the Fourth ACM Conference on Recommender Systems, 79–86. ACM.

Krestel, R.; Fankhauser, P.; and Nejdl, W. 2009. Latent Dirichlet allocation for tag recommendation. In Proceedings of the Third ACM Conference on Recommender Systems, 61–68. ACM.

Lipczak, M.; Hu, Y.; Kollet, Y.; and Milios, E. 2009. Tag sources for recommendation in collaborative tagging systems. ECML PKDD Discovery Challenge 157–172.

Liu, D.; Hua, X.-S.; Yang, L.; Wang, M.; and Zhang, H.-J. 2009. Tag ranking. In Proceedings of the 18th International Conference on World Wide Web, 351–360. ACM.

Nickel, M.; Tresp, V.; and Kriegel, H.-P. 2011. A three-way model for collective learning on multi-relational data. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), 809–816.

Rendle, S., and Schmidt-Thieme, L. 2010. Pairwise interaction tensor factorization for personalized tag recommendation. In Proceedings of the Third ACM International Conference on Web Search and Data Mining, 81–90. ACM.

Rendle, S.; Balby Marinho, L.; Nanopoulos, A.; and Schmidt-Thieme, L. 2009. Learning optimal ranking with tensor factorization for tag recommendation. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 727–736. ACM.

Spiegel, S.; Clausen, J.; Albayrak, S.; and Kunegis, J. 2012. Link prediction on evolving data using tensor factorization. In New Frontiers in Applied Data Mining, 100–110. Springer.

Sun, J.-T.; Zeng, H.-J.; Liu, H.; Lu, Y.; and Chen, Z. 2005. CubeSVD: A novel approach to personalized web search. In Proceedings of the 14th International Conference on World Wide Web, 382–390. ACM.

Symeonidis, P.; Nanopoulos, A.; and Manolopoulos, Y. 2008. Tag recommendations based on tensor dimensionality reduction. In Proceedings of the 2008 ACM Conference on Recommender Systems, 43–50. ACM.

Symeonidis, P.; Nanopoulos, A.; and Manolopoulos, Y. 2010. A unified framework for providing recommendations in social tagging systems based on ternary semantic analysis. IEEE Transactions on Knowledge and Data Engineering 22(2):179–192.

Tucker, L. R. 1966. Some mathematical notes on three-mode factor analysis. Psychometrika 31:279–311.

Wetzker, R.; Zimmermann, C.; and Bauckhage, C. 2008. Analyzing social bookmarking systems: A del.icio.us cookbook. In Mining Social Data (MSoDa) Workshop Proceedings, 26–30.