LRBM: A Restricted Boltzmann Machine based Approach for Representation Learning on Linked Data

Kang Li∗, Jing Gao†, Suxin Guo‡, Nan Du§, Xiaoyi Li¶ and Aidong Zhang‖
Department of Computer Science and Engineering
The State University of New York at Buffalo
Emails: {kli22∗, jing†, suxinguo‡, nandu§, xiaoyili¶, azhang‖}

Abstract: Linked data consist of both node attributes, e.g., preferences, posts and degrees, and links which describe the connections between nodes. They have been widely used to represent various network systems, such as social networks and biological networks. Knowledge discovery on linked data is of great importance to many real applications. One of the major challenges of learning from linked data is how to effectively and efficiently extract useful information from both node attributes and links. Current studies on this topic either use selected topological statistics to represent network structures, or linearly map node attributes and network structures to a shared latent feature space. However, approaches based on statistics may miss critical patterns in network structure, while approaches based on linear mappings may not be sufficient to capture the non-linear characteristics of nodes and links. To handle the challenge, we propose, to our knowledge, the first deep learning method to learn from linked data. A restricted Boltzmann machine model named LRBM is developed for representation learning on linked data. In LRBM, we aim to extract the latent feature representation of each node from both node attributes and network structures, non-linearly map each pair of nodes to the links, and use hidden units to control the mapping. The details of how to adapt LRBM for link prediction and node classification on linked data are also presented. In the experiments, we test the performance of LRBM as well as other baselines on link prediction and node classification.
Overall, the extensive experimental evaluations confirm the effectiveness of the proposed LRBM model in mining linked data.

I. INTRODUCTION

In many current studies and applications, linked data are used to describe systems consisting of interacting objects. Given that each node represents an object, node attributes in linked data contain features, preferences, and actions, and links describe interactions between nodes. For instance, in a social network like Facebook, each user is viewed as a node. Node attributes may include gender, habits and number of friends (i.e., degree), and links can represent whether two users are friends, whether one comments on another's posts, and other types of interactions. The great expressive power of linked data enables this data format to capture both characteristics and interactions of objects in various systems, such as linked gene mutation databases and e-commerce websites.

Knowledge discovery on linked data is the process of leveraging both node attributes and link structures to learn about the corresponding systems, and it is of great importance to understanding the characteristics and interaction patterns in these systems. For instance, [1] discusses how to predict friendships on social networks. [2] learns from linked drug networks to predict potentially useful drugs for diseases. [3] utilizes link structure in linked data to recommend social groups on social media websites.

One of the major challenges in mining linked data is how to effectively and efficiently utilize information from both node attributes and link structures. To meet this challenge, many existing models represent link structures using selected statistics on networks, and then combine the selected statistics with node attributes. For example, in [4], [5], the authors characterized link structure using recursive egonet-based statistics, and further used the statistics to learn object roles. In [6], researchers detected spammers using degrees of friends.
The primary limitation of these approaches is that the topological statistics in each task are usually subjectively selected. Therefore, such methods may miss critical patterns in linked data. Moreover, when the target tasks are complex, it could be very difficult to select or create relevant topological statistics.

To avoid the limitation of the above methods, other approaches aim to extract a shared representation for both node attributes and link structures. For example, in [7], links are viewed as interactions between latent features of connected nodes. In [8], latent features are extracted w.r.t. the criterion that objects from different clusters are dissimilar while objects in the same cluster are similar. These methods usually rely on linear mappings to capture the relations among node attributes, network structures and the target latent feature representations of nodes. As a result, such approaches are limited by the simplicity of linear mappings and fail to capture non-linear characteristics of nodes and links.

To address the above issues, in this paper, we propose a novel model named LRBM, which stands for Restricted Boltzmann Machines for Latent Feature Learning in Linked Data. Different from the aforementioned methods using graph statistics, the proposed model does not rely on any subjectively selected topological statistics, and is capable of characterizing both node attributes and link structures in a unified framework. At the heart of this model is a shared latent feature representation of each node, which is used to formulate non-linear relations among nodes, links and hidden units. To avoid a large amount of sampling, Contrastive Divergence (CD) [9], [10] is applied to train the model, and other techniques such as fine-tuning and parameter sharing are also used to simplify the calculation. Moreover, the proposed LRBM model can also be "stacked" like a traditional RBM to explore "deep" characteristics of nodes and high-order interaction patterns between nodes.
Furthermore, we present the details of how to apply LRBM to two popular topics on linked data: link prediction and node classification. These two topics are closely related to tasks such as friend recommendation in social networks and informative gene selection in genetic networks.

Overall, the contributions of this work include:
• We propose a novel binary and conditional RBM model to effectively extract latent features of objects in linked data with unweighted interactions.
• We extend the binary model to general cases to extract representations of linked data with weighted interactions.
• We present the details of how to apply tensor factorization, parameter sharing and fine-tuning to reduce the number of parameters of the proposed model, and develop the solution to the proposed LRBM model.
• We also provide the details of implementing the proposed model for link prediction and node classification.
• We experimentally evaluate the proposed method on link prediction and node classification. The overall performance confirms the effectiveness of LRBM.

II. BACKGROUND

Before diving into the details of LRBM, we first introduce two relevant methods: graph factorization models and conditional Restricted Boltzmann Machines (cRBM).

A. Graph Factorization Models

Given a linked data set, suppose the link structure can be represented by an $n \times n$ relational link matrix $L$, where $n$ is the number of objects and $L_{ij}$ stands for the connection between node $i$ and node $j$. In binary graph factorization models, which focus on modeling the existence of links between two arbitrary nodes, entry $L_{ij} = 1$ indicates the presence of a link and $L_{ij} = 0$ indicates the absence of a link between node $i$ and node $j$. Such graph factorization models usually predict the existence of unobserved links (entries of $L$) using observed links as:

$$p(L_{ij} = 1 \mid \mu_i, \mu_j) = \mathcal{F}(\varsigma + \Psi(\mu_i, \mu_j)). \tag{1}$$

In Eq.1, $\mu_i$ and $\mu_j$ are the latent feature representations of node $i$ and node $j$, respectively. $\varsigma$ is a bias. $p(L_{ij} = 1 \mid \mu_i, \mu_j)$ stands for the probability that there is a link between node $i$ and node $j$ given their latent feature representations.

$\mathcal{F}$ is the mapping function that maps the latent factors $\mu_i$ and $\mu_j$ to the conditional probability $p(L_{ij} = 1 \mid \mu_i, \mu_j)$. In related matrix factorization models, $\mathcal{F}$ is usually defined as $\mathcal{F}(x) = x$ for linear mappings. In other related models, the sigmoid function $\mathcal{F}(x) = \frac{1}{1 + e^{-x}}$ and exponential family distributions $\mathcal{F}(x \mid \theta) = \Phi(x) \exp(\phi(\theta) \cdot T(x) - \Xi(\theta))$ are generally used, where $\Phi(x)$, $\phi(\theta)$, $T(x)$ and $\Xi(\theta)$ are known functions and $\theta$ is the hyper-parameter in $\mathcal{F}(x \mid \theta)$.

In Eq.1, $\Psi(\mu_i, \mu_j)$ is a distance function enforcing that two nodes are likely to interact with each other if their latent feature representations are similar, and vice versa. In the latent distance model [11], $\Psi(\mu_i, \mu_j)$ is defined as a negative distance, $\Psi(\mu_i, \mu_j) = -d(\mu_i, \mu_j)$, where $d$ is a distance function. In other models such as the latent eigen model [12] and the latent feature relational model [13], [14], $\Psi(\mu_i, \mu_j)$ is defined as $\Psi(\mu_i, \mu_j) = \mu_i^\top W \mu_j$, where $W$ is a parameter matrix subsuming the interaction patterns between node $i$ and node $j$. $W$ is usually constrained to be symmetric for modeling symmetric (i.e., undirected) linked data and asymmetric for modeling directed linked data.

B. Conditional Restricted Boltzmann Machines

Fig. 1: Illustrations of an RBM in (A) and a cRBM in (B). In (B), hidden layer $h$ relates to both current visible layer $v_t$ and previous visible layer $v_{t-1}$, and $h$, $v_t$ and $v_{t-1}$ are connected with a three-way tensor $W$.

As demonstrated in Fig.1.(A), Restricted Boltzmann Machines (RBM) are undirected graphical models that define a probability distribution over a set of observed or visible variables and a set of unobserved or hidden variables. In the model, the term Restricted indicates that there is no visible-to-visible or hidden-to-hidden interaction. In an RBM, the joint distribution between visible variables $v$ and hidden variables $h$ is defined as $p(v, h) = \frac{1}{Z} \exp(-E(v, h))$, where $Z$ is a constant that normalizes the joint probability distribution $p(v, h)$. $E(v, h)$ is the energy function and it is usually defined as:

$$E(v, h) = -\sum_{ij} W_{ij} v_i h_j - \sum_i a_i v_i - \sum_j b_j h_j. \tag{2}$$

In Eq.2, $W_{ij}$ is the weight connecting the visible unit $v_i$ and the hidden unit $h_j$. $a_i$ and $b_j$ are biases for $v_i$ and $h_j$, respectively. The free energy of the model is:

$$F(v) = -\log \sum_h \exp(-E(v, h)) = -\sum_i v_i^\top a_i - \sum_j \log\left(1 + \exp(b_j + v^\top W_{:,j})\right). \tag{3}$$

To learn a parameter $\theta$ of the model by gradient descent on the negative log-likelihood, we have:

$$-\frac{\partial \log p(v)}{\partial \theta} = \frac{\partial F(v)}{\partial \theta} - \sum_{v'} p(v') \frac{\partial F(v')}{\partial \theta}. \tag{4}$$

In Eq.4, the first term after the symbol "=" is usually referred to as the positive phase and can be computed exactly. The last term in Eq.4 is referred to as the negative phase. This term is an expectation over the model distribution for all possible configurations of input visible variables $v'$, and it is generally intractable to compute directly. To solve the problem, Hinton [15] showed that the negative phase can be approximated using samples obtained by starting a Gibbs chain at a training vector, and named this technique Contrastive Divergence (CD) [9]. The CD process which terminates after $K$ steps of Gibbs sampling is called CD-K. CD-K enables RBM models to be trained efficiently and to be applied to large data sets. Current studies show that RBMs and the corresponding deep belief networks can achieve good performance when the data size is large. Such models are generally used in areas like computer vision and signal processing.

One of the advances of RBM in the past few years is called conditional RBM (cRBM) [16], [17].
In such models, the dependence between unknown parameters and visible inputs is modeled with all-way correlations. Current studies apply cRBM to mining time-series data [18], [19]. As demonstrated in Fig.1, at time $t$, suppose the current visible input is $v_t$ and historical data are $v_{<t}$ (e.g., in Fig.1.(B), $v_{<t}$ is $v_{t-1}$). cRBM defines a joint probability distribution over $v_t$ and the current hidden variable $h_t$, which is conditional on $v_{<t}$ and the model parameter $\theta$:

$$\begin{aligned} p(v_t, h_t \mid v_{<t}, \theta) &= \frac{1}{Z} \exp(-E(v_t, h_t \mid v_{<t}, \theta)), \\ E &= -\sum_{ijk} W_{ijk} v_{i,t} v_{j,<t} h_{k,t} - \sum_i a_i v_{i,t} - \sum_j b_j h_{j,t}. \end{aligned} \tag{5}$$

In Eq.5, $W_{ijk}$ is a three-way tensor that connects historical visible variables $v_{<t}$, current visible variables $v_t$ and current hidden variables $h_t$. $a$ and $b$ are biases on $v_t$ and $h_t$. Such a cRBM model describes how historical data $v_{<t}$ are related to current data $v_t$, and can be trained efficiently using the CD-K technique.

III. RESTRICTED BOLTZMANN MACHINES FOR LINKED DATA

In this section, we first show a latent factor model as the starting point, then extend it to the LRBM model. For simplicity, we develop LRBM from unweighted linked data and then expand the model to more general cases.

A. A Latent Factor Model

Suppose in a linked data set, $A_i$ denotes the node attributes of the $i$-th node and $L_l$ is the weight of the $l$-th link. The index $l$ of a link is one-to-one mapped to a pair of node indices $i_l$ and $j_l$. The weight $L_l$ may contain multiple features, each of which stands for the weight of one type of interaction. Furthermore, if a link points from node $i$ to node $j$, we denote node $i$ as the sender and node $j$ as the receiver of this link. In undirected/bidirectional networks where links are undirected, each node of a link is both the sender and the receiver. For simplicity, we first consider the case when weights of links are either 1 or 0, where 1 represents existence of interactions and 0 represents no interaction.
Inspired by the existing graph factorization models, we seek to use latent feature representations to encode the characteristics of each node, and to learn how these latent representations relate to their attributes $A$ and links $L$. Specifically, we assume that the latent representation of each node contains two parts: sender "behaviors" and receiver "behaviors". Under this assumption, each node contributes to its outgoing links according to its sender "behavior", and contributes to its incoming links according to its receiver "behavior". We explicitly denote the two types of latent representation as $R_i$ and $S_i$ for receiver and sender "behaviors" of node $i$, respectively.

As demonstrated in Fig.2.(A), the node attributes $A_i$ of node $i$ are related to its latent receiver behavior $R_i$ and sender behavior $S_i$. Besides, $S_i$ and $R_j$ decide the interaction from node $i$ to node $j$. To model the interaction of node attributes $A$, links $L$, and the latent representations $S$ and $R$, we propose a latent factor model with the energy function:

$$E(S, R; A, L) = -\sum_i W^s_i S_i A_i - \sum_i W^r_i R_i A_i - \sum_i \sum_j \sum_l W_{ijl} L_l S_i R_j. \tag{6}$$

In Eq.6, the first two terms define how the characteristics $S$ and $R$ of nodes impact the node attributes $A$. In these terms, $W^s$ and $W^r$ are tensors that connect $A$ to $S$ and $R$, respectively. The third term of Eq.6 defines how the sender behavior $S_i$ of node $i$ and the receiver behavior $R_j$ of node $j$ relate to the link $L_l$. $W$ is a three-way tensor that connects the weight $L_l$ and the latent behaviors $S_i$ and $R_j$. Notice that, in this model, $L_l$ does not necessarily connect node $i$ and node $j$. In most recent social network research, it is observed that the connectivity of two nodes is also affected by the nodes close to these two nodes (i.e., neighbor nodes). Therefore, in this general energy function, we consider the relations between each link and every directed pair of nodes.
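To make the three terms of Eq.6 concrete, the following is a minimal sketch in Python, not the authors' MATLAB implementation. It assumes, purely for illustration, that each node has a single scalar attribute and scalar sender and receiver behaviors; the function name `latent_factor_energy` and the nested-list layout of `W` are hypothetical.

```python
def latent_factor_energy(A, S, R, L, Ws, Wr, W):
    """Energy of Eq.6 under an illustrative scalar simplification.

    A, S, R : lists of length n (one scalar attribute / behavior per node)
    L       : list of length m (one weight per link)
    Ws, Wr  : lists of length n (per-node attribute weights)
    W       : nested list with W[i][j][l] linking sender i, receiver j, link l
    """
    n, m = len(A), len(L)
    # First term: sender behaviors against node attributes.
    attr_s = sum(Ws[i] * S[i] * A[i] for i in range(n))
    # Second term: receiver behaviors against node attributes.
    attr_r = sum(Wr[i] * R[i] * A[i] for i in range(n))
    # Third term: every directed pair (i, j) against every link l.
    link = sum(W[i][j][l] * L[l] * S[i] * R[j]
               for i in range(n) for j in range(n) for l in range(m))
    return -attr_s - attr_r - link
```

The triple loop in the third term mirrors the remark above that every link is related to every directed pair of nodes, which is also why the number of entries of $W$ grows quickly with the number of nodes and links.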
Eq.6 defines a mapping relation among $A$, $R$, $S$ and $L$, and the negative energy $-E(S, R; A, L)$ measures the compatibility of this mapping. In detail, the model contains two parts: the relations among $A$, $R$ and $S$, and the relations among $R$, $S$ and $L$. The first part covers two compressions and can be optimized efficiently using off-the-shelf methods. The second part is a linear mapping between the output links $L$ and the two trained latent matrices $R$ and $S$. In many cases, such a simple linear mapping cannot effectively capture relations among $L$, $R$ and $S$. When the number of known links is small, it also tends to overfit the linked data.

To address the above issue, we keep the first part the same, and replace the second part (the mapping from $R$ and $S$ to $L$) with a more effective model based on conditional restricted Boltzmann machines.

B. The Binary Model

Fig. 2: Illustrations of a latent factor model in (A) and the binary LRBM in (B). In (A), node attributes $A$ are independently mapped to latent sender behavior $S$ and latent receiver behavior $R$, and $S$ and $R$ are further mapped to the corresponding links $L$ with a three-way tensor $W$. In (B), a hidden variable $h$ is added to control the mapping.

Given the latent receiver and sender behaviors $R$ and $S$, to model the probability of a link $L$ which is related to $S$ and $R$, we consider the energy function that includes all the components $S$, $R$ and $L$. To explicitly characterize the possible ways in which these components are related, we add a hidden variable $h$. The energy function that captures the interactions of the components $S$, $R$, $L$ and $h$ is:

$$E(L, h; S, R) = -\sum_{ijkl} W_{ijkl} S_i R_j h_k L_l. \tag{7}$$

As demonstrated in Fig.2.(B), in Eq.7, $W_{ijkl}$ is an element of a four-way tensor $W$, and it connects the $i$-th sender $S_i$, the $j$-th receiver $R_j$, the $k$-th hidden variable $h_k$ and the $l$-th link $L_l$. Specifically, $W$ learns from the training triplets of $S$, $R$ and $L$, and it measures how a directed pair of nodes $i$ and $j$ impacts a link $l$.

The negative energy function $-E(L, h; S, R)$ measures the compatibility of the components $S$, $R$, $L$ and $h$. In this function, the hidden variable matrix $h$ controls the way in which $W$ connects the other components. Therefore, if $h$ is fixed, the energy function in Eq.7 is simply a mapping among the components. Consequently, if we can perfectly fix the hidden variable matrix $h$, we obtain an ideal mapping among $S$, $R$ and $L$. To achieve this goal, in practice, we utilize the energy function in Eq.7 to estimate the probability distribution of the hidden variable $h$ and the link matrix $L$, and then marginalize over all the possible mappings.

In order to model affine effects on $h$ and $L$, we further enhance the model with biases. While the most classic restricted Boltzmann machines only include additive biases which "shift" the activity level of each unit, the recent conditional models also include gated biases which "shift" the activity level conditionally. In this paper, the enhanced energy function includes both standard biases and gated biases as:

$$E(L, h; S, R) = -\sum_{ijkl} W_{ijkl} S_i R_j h_k L_l - \sum_{kl} c_{kl} h_k L_l - \sum_k a_k h_k - \sum_l b_l L_l. \tag{8}$$

In Eq.8, $c_{kl}$ is the gated bias that shifts the activity level of $h_k$ and $L_l$ conditionally. $a_k$ and $b_l$ are standard additive biases, and they shift the activity levels of $h_k$ and $L_l$, respectively. Using the energy function in Eq.8, we define the joint probability distribution $p(L, h; S, R)$ over links and hidden variables as:

$$p(L, h; S, R) = \frac{1}{Z(S, R)} \exp(-E(L, h; S, R)), \quad \text{where } Z(S, R) = \sum_{L, h} \exp(-E(L, h; S, R)). \tag{9}$$

In Eq.9, $Z(S, R)$ is a normalizing term that depends on the latent node representations $S$ and $R$. By the joint probability distribution in Eq.9, given $S$ and $R$, the probability of each link can be estimated as:

$$p(L \mid S, R) = \sum_h p(L, h \mid S, R). \tag{10}$$

In the above model, $Z(S, R)$ in Eq.9 and $p(L \mid S, R)$ in Eq.10 cannot be directly calculated in most cases, because both of them need to sum over the exponentially many possible configurations of $h$, and $Z(S, R)$ needs to further sum over all possible links $L$. In practice, however, we do not have to calculate the exact values of $Z(S, R)$ and $p(L \mid S, R)$ in either the training or the testing process. Specifically, in the training process, given $S$ and $R$ and the set of existing links $L$, we can infer the probability of activating each unit of $h$ as:

$$p(h_k = 1 \mid S, R, L) = \frac{1}{1 + \exp\left(-\sum_{ijl} W_{ijkl} S_i R_j L_l - \sum_l c_{kl} L_l - a_k\right)}. \tag{11}$$

Notice that in Eq.11, each unit of $h$ is inferred independently based on the latent node representations and known links, which reflects the absence of hidden-hidden connections in the model. Similarly, in testing, given $S$ and $R$, and the approximated hidden variable matrix $h$, we can infer the probability of activating a link as:

$$p(L_l = 1 \mid S, R, h) = \frac{1}{1 + \exp\left(-\sum_{ijk} W_{ijkl} S_i R_j h_k - \sum_k c_{kl} h_k - b_l\right)}. \tag{12}$$

By Eq.11 and Eq.12, the model estimates the linear dependencies of the components $S$, $R$, $L$ and $h$, and utilizes the linear dependencies in the learning of unknown links.

C. The Factored Model

In general, the model in Eq.8 can be viewed as a regressive model in which a transformation is built between the latent node representations $S$ and $R$ and the links $L$. In the transformation, the number of configurations of the hidden variables $h$ is exponentially large, and $h$ encodes exponentially many possible mappings. Fortunately, we do not have to enumerate the exponentially many settings of $h$, as explained for Eq.11 and Eq.12. However, we have to estimate $W_{ijkl}$, whose number of parameters is quartic when the numbers of node attributes, links and hidden variables are comparable.

Recent studies on tensors and factored conditional restricted Boltzmann machines [17], [19] suggest that multi-way, multiplicative interactions can be captured using far fewer parameters by factorizing the weight tensors. Inspired by these studies, we factorize the four-way tensor $W$ into a product of pairwise interactions as shown in Fig.3. When we apply the factorization to Eq.8, the first term is factorized as:

$$\sum_{ijkl} W_{ijkl} S_i R_j h_k L_l \rightarrow \sum_f \sum_{ijkl} W^S_{if} W^R_{jf} W^h_{kf} W^L_{lf} S_i R_j h_k L_l. \tag{13}$$

In Eq.13, $f$ is the index of a set of pairwise weight matrices $W^S$, $W^R$, $W^h$ and $W^L$. These four pairwise matrices connect the factors to senders, receivers, hidden variables and links, respectively. Therefore, the factors in the factorization act as agents of the four components $S$, $R$, $L$ and $h$ and handle their interactions with the others. By this factorization, the number of parameters is reduced from $O(n^4)$ to $O(n^2)$ when the sizes of the four components are comparable.

Fig. 3: Illustrations of the factored LRBM in (A) and the non-binary LRBM in (B). In (A), the four-way tensor $W$ is factored into $W^S$, $W^R$, $W^h$ and $W^L$, which connect $S$, $R$, $h$ and $L$ through factors. In (B), a Gaussian bias is added to links $L$, which enables the model to handle weights on links.

D. The Non-Binary Model

By Eq.8 and Eq.13, we define a gated, factored and conditional restricted Boltzmann machine on unweighted linked data. However, links between objects are not always unweighted. Learning and predicting link weights are extremely useful in many link-related tasks, such as predicting user-user interaction types in social networks and detecting influence strengths of users in information diffusion networks. Furthermore, due to the diversity of linked data, a binary distribution may not be sufficient to capture the complex characteristics of nodes and links. Therefore, it is also highly desirable to enable the model to cover more general distributions.

To enable modeling continuous weights on links, we keep the hidden variable binary and re-define the energy function with additive Gaussian noise as:

$$E(L, h; S, R) = -\frac{1}{\gamma} \sum_f \sum_{ijkl} W^S_{if} W^R_{jf} W^h_{kf} W^L_{lf} S_i R_j h_k L_l + \frac{1}{2\gamma^2} \sum_l (L_l - b_l)^2 - \frac{1}{\gamma} \sum_{kl} c_{kl} h_k L_l - \sum_k a_k h_k. \tag{14}$$

In Eq.14, $\gamma$ is the standard deviation of the Gaussian noise, so that the variance is $\gamma^2$. The difference between the model in Eq.14 and the model in Eq.8 is that in Eq.14, we change the standard additive biases on links to additive Gaussian noise. Therefore, in Eq.14, instead of shifting the activity level of each unit of the link matrix, the model centers the activity levels with means and variances and models the continuous characteristics of weights in weighted linked data.

With this model, in the training process, the probability distribution $p(h \mid S, R, L)$ stays the same because the additive Gaussian noise cancels out in exponentiating and conditioning [16]. In the testing process, the probability distribution $p(L \mid S, R, h)$ changes into a Gaussian distribution as:

$$p(L_l \mid S, R, h) = \mathcal{N}\left(\gamma \sum_f \sum_{ijk} W^L_{lf} W^S_{if} W^R_{jf} W^h_{kf} S_i R_j h_k + \gamma \sum_k c_{kl} h_k + b_l,\; \gamma^2\right). \tag{15}$$

Notice that in Eq.15, the conditional probability of each $L_l$ is estimated independently of the others. This fact is consistent with the characteristics of restricted Boltzmann machines. The Gaussian distribution in Eq.15 enables regression of the weights $L_l$. In cases when the weights are not Gaussian distributed, this model can also be adapted to other exponential family distributions by modifying the biases.

E. The Solution

In Eq.15, the marginal distribution is a mixture of exponentially many Gaussian distributions and it is intractable to evaluate. Fortunately, it does not need to be explicitly estimated in either the training or the testing process. Following the contrastive divergence as used in [18], [19], we can obtain a good approximation to the gradient of unknown parameters.
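The two conditionals that a contrastive divergence chain alternates between can be sketched as follows. This is an illustrative sketch in pure Python rather than the authors' code: `hidden_probs` evaluates Eq.11 with the four-way tensor replaced by the factorization of Eq.13, and `link_means` evaluates the Gaussian mean of Eq.15. The function names, the nested-list layout (`WS[i][f]`, `WR[j][f]`, `Wh[k][f]`, `WL[l][f]`, `c[k][l]`) and the helper `factor_sums` are assumptions made for the example.

```python
import math

def factor_sums(W, x):
    """For each factor f, return sum_u W[u][f] * x[u]."""
    return [sum(W[u][f] * x[u] for u in range(len(x)))
            for f in range(len(W[0]))]

def hidden_probs(S, R, L, WS, WR, Wh, WL, c, a):
    """p(h_k = 1 | S, R, L) of Eq.11, with W_ijkl factored as in Eq.13."""
    s, r, l = factor_sums(WS, S), factor_sums(WR, R), factor_sums(WL, L)
    probs = []
    for k in range(len(a)):
        # Factored four-way term: sum_f Wh[k][f] * s_f * r_f * l_f.
        act = sum(Wh[k][f] * s[f] * r[f] * l[f] for f in range(len(s)))
        # Gated bias and additive bias of hidden unit k.
        act += sum(c[k][j] * L[j] for j in range(len(L))) + a[k]
        probs.append(1.0 / (1.0 + math.exp(-act)))
    return probs

def link_means(S, R, h, WS, WR, Wh, WL, c, b, gamma):
    """Mean of the Gaussian in Eq.15; its variance is gamma ** 2."""
    s, r, g = factor_sums(WS, S), factor_sums(WR, R), factor_sums(Wh, h)
    means = []
    for l in range(len(b)):
        act = sum(WL[l][f] * s[f] * r[f] * g[f] for f in range(len(s)))
        act += sum(c[k][l] * h[k] for k in range(len(h)))
        means.append(gamma * act + b[l])
    return means
```

Because each component is first projected onto the factors, the four-way tensor never has to be materialized: the cost per call is proportional to the number of factors times the total size of the components involved, rather than to the product of those sizes.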
The learning rules for $W^S$, $W^R$, $W^h$ and $W^L$ take the form:

$$\Delta W_{uf} \propto \langle \alpha_u \beta_{v^1 f} \zeta_{v^2 f} \theta_{v^3 f} \rangle_{data} - \langle \alpha_u \beta_{v^1 f} \zeta_{v^2 f} \theta_{v^3 f} \rangle_{recon}. \tag{16}$$

In Eq.16, $u, v^1, v^2, v^3 \in \{i, j, k, l\}$ are the indices of units. $\alpha_u$ stands for the units connected to $W_{uf}$. The other terms $\beta_{v^1 f}$, $\zeta_{v^2 f}$ and $\theta_{v^3 f}$ correspond to the other three units involved in the four-way tensor. $\langle \cdot \rangle_{data}$ is the expectation w.r.t. the data distribution and $\langle \cdot \rangle_{recon}$ is the expectation w.r.t. the distribution of the "reconstructed" data. Specifically, to estimate $\langle \cdot \rangle_{recon}$, we start a Markov chain at the data distribution, and perform $K$-step alternating Gibbs sampling (i.e., CD-K) in which we iteratively update the hidden variables $h$ and the link weights $L$ according to Eq.11 and Eq.15, respectively.

As an example, the updating rule for $W^S_{if}$ is:

$$\Delta W^S_{if} \propto \Big\langle S_i \sum_j W^R_{jf} R_j \sum_k W^h_{kf} h_k \sum_l W^L_{lf} L_l \Big\rangle_{data} - \Big\langle S_i \sum_j W^R_{jf} R_j \sum_k W^h_{kf} h_k \sum_l W^L_{lf} L_l \Big\rangle_{recon}. \tag{17}$$

Similar to [20], the biases $a_k$, $b_l$ and $c_{kl}$ can be updated using simplified versions of Eq.16. For instance, the updating rule for $a_k$ is:

$$\Delta a_k \propto \langle h_k \rangle_{data} - \langle h_k \rangle_{recon}. \tag{18}$$

By Eq.17 and Eq.18, given $R$, $S$, $L$ and the updated $h$, we can optimize the parameters involved in the model.

As discussed for Eq.6, $R$ and $S$ are independently mapped from the node attributes $A$ as $S_i = A_i W^s_i$ and $R_i = A_i W^r_i$. To update the parameters $W^r$ and $W^s$ in the mapping, we simply backpropagate the gradients obtained from CD-K. According to the chain rule, the updating function for $W^s$ is:

$$\Delta W^s_i \propto \langle A_i^\top C_i \rangle_{data} - \langle A_i^\top C_i \rangle_{recon}, \qquad C_i = \sum_f W^S_{if} \sum_j W^R_{jf} R_j \sum_k W^h_{kf} h_k \sum_l W^L_{lf} L_l. \tag{19}$$

For $W^r$, the updating rule is very similar to Eq.19, thus we omit the details here.

Fig. 4: A unified view of LRBM in (A) and an illustration of deep LRBM in (B). In (A), LRBM can be viewed as a mapping from visible node attributes $A_i$ to hidden node features $S_i \cup R_i$ for node $i$. $W$, $h$ and $L$ are the detailed mapping relations. In (B), $\{S_i \cup R_i\}_{i=1}^N$ is viewed as the visible variables of the upper LRBM, enabling deep learning.

F. Parameter Sharing

In Section III-C, we factorize the four-way tensor to reduce the number of parameters from $O(n^4)$ to $O(n^2)$ when the number of units in each way is comparable to the others. To further improve the efficiency of the factored model and take full advantage of the link structure, in this part, we discuss the effect of tying some sets of parameters together and diffusing the learned parameters across the linked data.

In Eq.19, $W^s_i$ and $W^r_i$ work as mappers that map the node attributes $A_i$ to the latent sender behavior $S_i$ and receiver behavior $R_i$, respectively. To enforce that the mappings are consistent across all the nodes, we can apply the tying functions $W^s_i = W^s_j$ and $W^r_i = W^r_j$ for two arbitrary nodes $i$ and $j$. By this tying operation, node attributes are mapped into $S$ and $R$ in the same way across arbitrary nodes in a linked data set. The updating rules for $W^s$ and $W^r$ are then modified accordingly. For instance, on $W^s$, we can apply:

$$\Delta W^s \propto \frac{1}{N} \sum_i \left( \langle A_i^\top C_i \rangle_{data} - \langle A_i^\top C_i \rangle_{recon} \right). \tag{20}$$

In Eq.20, $N$ is the total number of nodes in the linked data set and $C_i$ is the same as the term in Eq.19.

$S$ and $R$ represent latent characteristics of nodes. In current studies, we usually observe that the characteristic of each node is significantly affected by its neighboring nodes in the linked data. To adapt our model towards such observations, we diffuse $S_i$ and $R_i$ to the directly connected nodes of node $i$ in each updating iteration as:

$$S_i \rightarrow (1 - \eta) S_i + \frac{\eta}{N_i} \sum_{j \in \mathcal{N}_i} S_j. \tag{21}$$

In Eq.21, $\mathcal{N}_i$ is the set of neighbors of node $i$ and $N_i$ is the size of the set. $\eta \in [0, 1]$ is a parameter controlling to what degree each node relies on its neighbors. A similar diffusion is also performed on $R$. In each alternating update, the diffusion starts from the first node in the shuffled node list and ends when all the nodes have been diffused. This diffusion smooths the latent node representations $S_i$ and $R_i$ of node $i$ with those of its neighbors, enhances the impact of local neighbors, and fits the observation that the connectivity of two nodes is affected by neighbor nodes.

The four-way tensor $W_{ijkl}$ connects each pair of nodes $i$ and $j$ to each observed and possible link $L_l$. Similar to the diffusion of $S$ and $R$, we can use fine-tuning to enforce the constraint that each link $L_l$ is only impacted by the directly connected nodes and the nearby nodes instead of all the nodes. In detail, we set $W_{ijkl} = 0$ when the shortest path from node $i$ and node $j$ to link $l$ is longer than a threshold. Such a constraint emphasizes local impacts. Since in real networks the degree (in-degree and out-degree) of each node is usually unrelated to the number of nodes in the network and can be viewed as a constant, this fine-tuning operation can reduce the number of combinations of $S_i$, $R_j$ and $L_l$ from $O(n^4)$ to $O(n^2)$.

G. Time Complexity and Extensibility

Suppose the number of hidden units in $h$, the number of features in node attributes $A$, the number of factors in the tensor factorization, etc. are constants, and the number of links is $O(n)$. In CD-K, sampling the hidden variables $h$ from the others according to Eq.11 has complexity $O(n^2)$. Sampling the links $L$ from the others according to Eq.15 also has complexity $O(n^2)$, given that the sizes of the hidden variables $h$ and the factors are tunable and can be viewed as constants. Therefore, CD-K takes $O(n^2)$ time. After CD-K, the updates of $W^S$, $W^R$, $W^h$ and $W^L$ take at most $O(n^2)$. Besides, the complexity of updating $W^s$ and $W^r$ is $O(n^2)$. Overall, the time complexity is $O(n^2)$ when the number of links is linearly proportional to the number of nodes.
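One of the per-iteration steps counted above is the neighbor diffusion of Eq.21. The following is a minimal sketch in pure Python, assuming each latent representation is stored as a list of floats and the graph as adjacency lists. The function name `diffuse`, the optional `rng` argument, and the choice to leave nodes without neighbors unchanged are assumptions of this example, not details given in the text.

```python
import random

def diffuse(S, neighbors, eta, rng=None):
    """Sequential in-place neighbor diffusion following Eq.21.

    S         : list of latent vectors (lists of floats), one per node
    neighbors : neighbors[i] is the list of neighbor indices of node i
    eta       : value in [0, 1] controlling reliance on the neighbors
    rng       : optional random.Random instance used for shuffling
    """
    order = list(range(len(S)))
    # Visit nodes in shuffled order, as described for Eq.21.
    (rng or random).shuffle(order)
    for i in order:
        nbrs = neighbors[i]
        if not nbrs:
            continue  # assumption: isolated nodes are left unchanged
        dim = len(S[i])
        # Average the (possibly already diffused) neighbor vectors.
        mean = [sum(S[j][d] for j in nbrs) / len(nbrs) for d in range(dim)]
        S[i] = [(1 - eta) * S[i][d] + eta * mean[d] for d in range(dim)]
    return S
```

In this sketch each neighbor list is read once per sweep, so the cost of one sweep is proportional to the total number of neighbor entries times the latent dimension. Because updates are applied in place, nodes visited later see the already diffused values of earlier nodes, so the result can depend on the shuffled order.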
When the connections between objects are much denser than in the above case, the number of links is usually assumed to be smaller than $O(n^{3/2})$, which means each two nodes are either directly connected or have one intermediate node on average. In this case, the time complexity is $O(n^3)$.

Similar to other RBM related models, since each unit in LRBM is sampled and updated independently of the other units, LRBM can be parallelized in the same way as [21]. Due to the space constraint of the paper, we skip the details of parallel implementations of LRBM.

A significant strength of RBM is that it can be “stacked” to support deep learning on visible inputs. We argue that the proposed LRBM can also be extended to support deep learning on linked data as illustrated in Fig.4. In the framework, we can view LRBM as a traditional RBM that maps visible node attributes A to hidden node representations $\{S_i \cup R_i\}_{i=1}^{n}$. The other involved parameters W and h, and the input link structure L, are treated conceptually as the mapping functions between $A_i$ and $S_i \cup R_i$. Therefore, the extracted latent node representation $S_i \cup R_i$ can be used as visible units to train an LRBM stacked on top, and higher LRBMs can be generated in a similar way.

Moreover, the proposed LRBM model can also be applied to pure networks which have no node attributes. In detail, pseudo node attributes can be generated by many off-the-shelf graphical statistics. Such pseudo node attributes work as the starting point in the training, and the trained model can be viewed as a trade-off between the subjectively selected graphical statistics and the objective link structure.

IV. TWO APPLICATIONS OF LRBM

In this section, we discuss how to apply the proposed LRBM model to discover knowledge on linked data. Specifically, we present the details of two applications, which are link prediction and node classification.

A. Link Prediction

Given a snapshot of a linked data set, which contains attributes of nodes and links between nodes, can we infer noisy links at the current state and new interactions in the near future? This problem is defined in [22] and generally recognized as the Link Prediction Problem. Models for the link prediction problem are very meaningful in many areas. For instance, they can be applied for friend recommendation on social networks, and can also be used to filter noisy links on protein-protein interaction networks.

In general, the proposed LRBM model focuses on characterizing the interactions among node attributes A, latent factors h, S and R, and links L. Therefore, to predict the link between two nodes i and j, if we have known links and attributes of the two nodes, we can use the LRBM model to learn how these two nodes interact with the other nodes, and to learn the probability distribution of their link according to Eq.15.

In some applications, we also need to consider the link prediction problem when insufficient knowledge of node attributes or links is provided, such as friend recommendation for new users of a social network. In such cases, knowledge diffusion can be applied to the “unclear” nodes according to a rule similar to the one used in Eq.21. In detail, the attributes of the unclear nodes are enhanced using diffused attributes from neighbors as:

$$A_i \rightarrow (1 - \eta) A_i + \frac{\eta}{N_i} \sum_{j \in N_i} A_j. \qquad (22)$$

The enhanced node attributes $A_i$ are then fed into the trained LRBM model for link prediction.

B. Node Classification

Different from the link prediction problem, which focuses on interactions between nodes, the Node Classification Problem focuses on how each node relates to a set of pre-defined classes. For instance, [23] discusses how to classify users in a social network w.r.t. whether a user is going to buy a product, and [24] explores how to classify users in social networks w.r.t. their social media engagement levels and political interest degrees. In general, given a set of node-class pairs, models for the node classification problem aim to divide nodes into predefined classes.

In the proposed LRBM model, given the node attributes A and the links L, we can extract the sender behavior S and receiver behavior R of each node. R and S, as the latent node representations, encode information from both node attributes A and link structure L, and characterize how each node interacts with the others. To apply the LRBM model, for node i, we intend to learn a mapping from the trained $R_i$ and $S_i$ to the class label $Y_i$. Similar to Eq.8, we add a hidden variable $h^Y$ and biases to control the patterns of mapping. The energy function for the node classification problem is:

$$E(Y, h^{Y}; S, R) = - \sum_{ijk} W^{Y}_{ijk} S_i R_i h^{Y}_j Y_k - \sum_{j} a_j h^{Y}_j - \sum_{k} b_k Y_k - \sum_{jk} c_{jk} h^{Y}_j Y_k. \qquad (23)$$

In Eq.23, the first term defines how node class Y relates to the latent node representations S and R, and to the hidden variables h. The other terms are biases and gated biases of Y and h. This node classification model can be trained in a factoring style similar to Eq.13, and it can also be adapted to continuous class labels in a way similar to Eq.14. Besides, when classifying new nodes, the same technique as in Eq.22 can be applied.

V. EXPERIMENTS

In this part, we experimentally evaluate the proposed LRBM model on link prediction and node classification. The MATLAB code of our program is publicly available.

Datasets

We use the following three datasets in the experiments. The first linked data set, SocialE [25], is a social network of residents in an undergraduate dormitory. This data set was collected from October 2008 to May 2009. It covers information about residents, such as diet, exercise, obesity, eating habits and political opinions, and it also contains interactions between residents, which include phone calls, music sharing and friendships. The data set is available upon request.

The second data set used in the experiments is Robot, which is a social network of users of the website Robot.Net. This data set has been crawled daily from the website since 2008. In the data set, each user is labeled by the others as Observer, Apprentice, Journeyer or Master w.r.t. the performance of the user in their communities. The data set is publicly available.

The third data set investigated in the experiments is Exposure, which is a gene-gene interaction network. This data set is obtained from a study on cigarette smoke inhalation [26]. In the study, rats were exposed to cigarette smoke for eight months. Gene expression data were collected in the process, and molecular changes were found to identify the biological and pathological consequences of cigarette smoke inhalation. In our experiments, we use the genes whose p-values (via t-test) are smaller than 0.05.

A. Link Prediction

Metrics and Baselines

In the task of link prediction, we evaluate the performance of each method on two aspects, which are link existence prediction and link weight prediction. The former focuses on predicting whether a link exists or not, and the latter concerns predicting the weight of each link. In the evaluation, we use Area Under the Curve (AUC) scores for link existence prediction, and Root Mean Square Error (RMSE) for link weight prediction. AUC scores range over [0, 1], and the larger the AUC score is, the better the performance is. For RMSE, the smaller the value is, the more accurate the prediction is. Moreover, to evaluate the stability of each model, we run each experiment multiple times and calculate the standard error (SE) over the multiple results of the same experiment. The smaller the SE is, the more stable the model is.

In the experiments, we compare the proposed LRBM model with four baselines, which are: Common Neighbors (CN) [27], Adamic-Adar (AA) [28], Deep Learning for Link Prediction in Social Service (DLP) [29], and Link Prediction via Factorization (Fact) [7].

Performance Study

To evaluate the model, we perform link prediction on ‘noisy’ linked data and evaluate how well the predictions fit the ‘clean’ linked data. In detail, we ‘flip’ a set of links in the network structure and train on the distorted linked data. In the flipping process, a selected link is removed if it exists in the original network structure, or added with weight 1 if it does not exist. The performance is then evaluated w.r.t. how well the predicted link structure matches the original (unflipped) network. On each data set, the number of flipped links is the same as the number of existing links. We repeat the process 20 times and show the average performance in Table I. In the table, AUC scores are presented in parentheses and RMSE scores are presented outside of the parentheses.

TABLE I: Performance on Link Prediction with RMSE ($\times 10^{2}$) and AUC.

        SocialE         Robot           Exposure
CN      3.35 (0.530)    4.85 (0.795)    14.53 (0.966)
AA      5.71 (0.718)    5.04 (0.744)    14.74 (0.994)
DLP     5.55 (0.803)    7.31 (0.895)    27.11 (0.715)
Fact    3.17 (0.515)    4.89 (0.788)    15.69 (0.940)
LRBM    0.94 (0.876)    3.57 (0.998)    12.20 (0.973)

Fig. 5: Link Prediction on Robot. (a) Experiments on Increasing Flip Rates: AUC scores over increasing flip rate (x-axis: Flip Rate; y-axis: AUC Score; curves: LRBM, CN, AA, DLP, Fact). (b) Experiments on Increasing Number of Hidden Units: AUC and RMSE·10 scores over the number of units in the hidden layer (10 to 640).

On link existence prediction, compared to the four baselines, LRBM performs the best on SocialE and Robot, and performs slightly worse than AA on Exposure. Among the three linked data sets, SocialE and Robot are very sparse, with average degrees 8.4 and 3.95, respectively. Exposure is much denser than the other two sets, with average degree 103.5.
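The flipping protocol and the three measures used in this performance study can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' MATLAB code: links are assumed to be directed (sender, receiver) pairs stored in a dictionary, flipped pairs are drawn uniformly at random, AUC is computed by direct pairwise comparison, and SE is taken to be the standard error of the mean over repeated runs. All function names are hypothetical.

```python
import math
import random
import statistics
from typing import Dict, Sequence, Tuple

Edge = Tuple[int, int]  # (sender, receiver); directed links are an assumption


def flip_links(links: Dict[Edge, float], n: int, num_flips: int, seed: int = 0) -> Dict[Edge, float]:
    """Return a noisy copy of `links` in which `num_flips` distinct node pairs are flipped."""
    rng = random.Random(seed)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    noisy = dict(links)
    for pair in rng.sample(pairs, num_flips):
        if pair in noisy:
            del noisy[pair]      # existing link: remove it
        else:
            noisy[pair] = 1.0    # missing link: add it with weight 1
    return noisy


def rmse(pred: Sequence[float], truth: Sequence[float]) -> float:
    """Root mean square error between predicted and true link weights."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth))


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Fraction of (link, non-link) pairs ranked correctly; ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def standard_error(results: Sequence[float]) -> float:
    """Standard error of the mean over repeated runs of the same experiment."""
    return statistics.stdev(results) / math.sqrt(len(results))


# Toy network with 3 nodes; flip as many pairs as there are existing links.
clean = {(0, 1): 1.0, (1, 2): 1.0, (2, 0): 1.0}
noisy = flip_links(clean, n=3, num_flips=len(clean))
changed = set(clean) ^ set(noisy)  # node pairs whose link status was flipped
```

A model would then be trained on `noisy`, and its predicted scores and weights compared against `clean` with `auc` and `rmse`; repeating this with different seeds gives the values passed to `standard_error`.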
The baselines CN, AA and Fact all tend to perform better on denser linked data and to perform significantly worse on the two sparse sets. Different from these baselines, LRBM performs comparably well on Robot and Exposure, regardless of whether the linked data is sparse or not. Overall, on average, LRBM outperforms the four baselines by 15.9% to 26.9%.

On link weight prediction, among all five investigated approaches, LRBM performs significantly better than the rest on all the data sets. On average, LRBM outperforms the four baselines in the task of link weight prediction by 26.5% to 58.2%.

Flip Rate Test

Besides the above performance study, we also run experiments on each data set using a varying number of flipped links. We denote the ratio of the number of flipped links to the number of existing links in a data set as the Flip Rate. In this section, we vary the flip rate from 0.1 to 50 on Robot, and present the results in Fig.5. In the figure, the SE is plotted as a bar on each point. To better present the SE, we scale the magnitude of these SE scores by 10 times. As indicated in Fig.5(a), as the flip rate increases from 0.1 to 50, the performance of each investigated model decreases. Specifically, the performance of Fact decreases the fastest, and drops below 0.5 when the flip rate is larger than 30. For DLP and AA, the AUC scores decrease in a similar trend, and for CN the decrease is the slowest. For the proposed LRBM model, the performance decreases from 1 to 0.9.
LRBM consistently outperforms the other approaches over all the settings. Moreover, as shown in Fig.5(a), LRBM achieves the lowest SE in all these cases, which indicates that the performance of LRBM is more stable than that of the baselines.

Fig. 6: ROC Curves of Node Classification (x-axis: False Positive Rate; y-axis: True Positive Rate). The model and AUC score of each curve are listed in the legends. (a) Experiments on SocialE: LRBM 0.836, CF 0.651, LAP 0.782, SSC 0.587, ICA 0.618. (b) Experiments on Robot: LRBM 0.897, CF 0.595, LAP 0.67, SSC 0.745, ICA 0.821.

Parameter Sensitivity Test

In the proposed model, the hidden layer h controls the mapping between latent node representations and links. Therefore, the number of hidden units in h is closely related to the learning ability of LRBM. To test the sensitivity of LRBM to h, we vary the number of units of h from 10 to 640 and present the results in Fig.5(b). Over the whole test range, LRBM performs very stably. On link existence prediction, the AUC scores stay nearly the same. On link weight prediction, the RMSE scores slightly decrease when the number of units is larger than 40 and smaller than 320. Theoretically, if the number of units is too small, h cannot encode sufficiently many mappings, and if the number of units is too large, h tends to overfit in training. In our settings, based on the results in Fig.5(b), we set the number of units in h to 100.

Another parameter that is closely related to the performance of LRBM is the number of features in S and R. Since S ∪ R is the target latent feature representation of objects, the number of features in S and R is closely related to the expressive power of S, R and the whole LRBM model.
To test the sensitivity on this parameter, we also vary it from 2 to 128 and obtain a set of results similar to Fig.5(b). In our settings, we set the number of features to 10.

B. Node Classification

Baselines and Metrics

In the task of node classification, we evaluate whether a model can precisely predict the class label of each node. In the evaluation, we use Receiver Operating Characteristic (ROC) curves to demonstrate and AUC scores to quantify the performance of each model. In the experiments, we compare the proposed LRBM model with four baselines, which are: Coordinates Factorization (CF) [24], Learning, Analyzing and Prediction Model (LAP) [24], Supervised Sparse Coding (SSC) [30] and Iterative Class Absorption (ICA) [31].

Performance Study

In the task of node classification, we run experiments on the data sets Robot and SocialE. Since Exposure does not contain labels on node classes, we do not investigate it in this experiment. On Robot, we study whether an object is a Master or not. On SocialE, we utilize the survey on residents’ political interests, and seek to predict whether a resident is Enthusiastic about politics or not. On both data sets, we use 6 labeled nodes (3 in each class) in the training process, and use the remaining nodes in testing. The results are presented in Fig.6. On both data sets, LRBM performs the best among the five investigated approaches. CF does not perform well on either data set since it requires sufficient information. SSC and ICA perform worse on SocialE while achieving relatively better performance on Robot. Such inconsistent performance is partially caused by noisy links in SocialE. Overall, the proposed LRBM model achieves 6.91% to 51.01% better performance than the four baselines on the task of node classification.

VI. RELATED WORK

In Section I, we explained that LRBM significantly differs from the existing studies of linked data.
Besides them, there are also several other approaches considering non-linear characteristics of nodes and links, such as exponential family models [14] and probabilistic matrix factorization models [32], [33]. In such models, node attributes and network structures are deterministically mapped to latent feature representations of nodes. Different from these approaches, the proposed LRBM model adopts the hidden variable h to control the mapping relations, and in the training process, LRBM optimizes over all the possible mapping relations (as in Eq.15) to achieve the most likely mapping relation w.r.t. observable data. Moreover, LRBM is also extensible to support deep learning on linked data, which enables the extraction of deep and high-order characteristics of nodes and interaction patterns.

Although there are several studies using RBM on networks, the proposed LRBM model distinguishes itself from the existing work in both the method and the target task. In [34], RBM as well as cRBM have been applied to the task of learning Drug-Target relations on multidimensional networks. In [29], RBM has been used for more general link prediction tasks in social networks. In [21], an advanced model named ctRBM has been proposed to do link prediction on dynamic networks. In all of these studies, the weights of links from a node to the other nodes are viewed as the visible features of the node. In detail, if a network has n nodes, each node has a feature vector of size n, and each feature is the weight of the link from this node to the corresponding node. Therefore, the number of visible units is the same as the number of nodes in the network. When networks are sparse, in these models, the latent features of each node (i.e. the hidden variables of each input) are dominated by the 0s which stand for no connection to the corresponding nodes, making the latent features indistinguishable from node to node.
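To make the sparsity argument concrete, the following Python sketch builds the adjacency-row features described above for a hypothetical toy network (a directed ring of 10 nodes, not one of the paper's data sets) and measures how many of each node's visible features are 0. The helper names `adjacency_rows` and `zero_fraction` are illustrative assumptions.

```python
from typing import Dict, List, Tuple

Edge = Tuple[int, int]


def adjacency_rows(n: int, links: Dict[Edge, float]) -> List[List[float]]:
    """Visible features of node i: the weights of its links to all n nodes."""
    rows = [[0.0] * n for _ in range(n)]
    for (i, j), w in links.items():
        rows[i][j] = w
    return rows


def zero_fraction(row: List[float]) -> float:
    """Share of a node's visible features that encode 'no connection'."""
    return row.count(0.0) / len(row)


# Hypothetical sparse network: node i links only to node (i + 1) mod n.
n = 10
ring = {(i, (i + 1) % n): 1.0 for i in range(n)}
rows = adjacency_rows(n, ring)
fractions = [zero_fraction(row) for row in rows]
# Every row has exactly one non-zero entry, so nine of its ten features are 0.
```

With a fixed average degree, the share of zeros in each row approaches 1 as n grows, which is the situation the paragraph above describes for sparse networks.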
Different from these approaches, the proposed LRBM does not treat the connections of node i as features. Instead, LRBM treats the connection $L_{ij}$ as the interaction output of node i and node j. Therefore, the results of LRBM are not dominated by 0s. Besides, LRBM integrates not only networks but also node attributes, which enables LRBM to learn on linked data.

Another advanced RBM that considers the interaction between input units is the Semi-restricted Boltzmann machine (SRBM) [35]. Different from the proposed LRBM, which seeks to learn the interaction of nodes in linked data, SRBM focuses on using the interaction between input units to capture neighboring pixels of an image. Moreover, in SRBM, interactions between input units are treated separately from the interactions between input units and hidden variables. Instead, in the proposed LRBM, we use a four-way tensor to subsume both types of interactions. Overall, LRBM differs significantly from SRBM in both the model and the target task.

VII. ACKNOWLEDGEMENT

This work is partially supported by the National Science Foundation under Grants No. 1218393 and No. 1016929.

VIII. CONCLUSIONS

Based on graph factorization models and conditional RBM models, we proposed an RBM model for linked data. At the heart of the proposed LRBM are latent feature representations of objects in linked data. In LRBM, we focused on mapping the latent representations of objects to the corresponding links, and controlling the mapping with hidden variables and four-way tensors. We then applied tensor factorization, parameter sharing and fine-tuning to reduce the number of parameters involved in LRBM. Besides, a Gaussian bias has been applied to links to model the weights of links. The continuous LRBM model scales well w.r.t. the number of links in the linked data and is extensible to be stacked for deep learning.
Finally, the details of how to apply LRBM to link prediction and node classification on linked data were presented. Experiments on real datasets validated that LRBM outperforms the investigated baselines by 15.9% to 58.2% on link prediction and by 6.91% to 51.01% on node classification. Future work will focus on extending LRBM for mining dynamic and time-evolving linked data.

REFERENCES

[1] B. Taskar, M.-F. Wong, P. Abbeel, and D. Koller, “Link prediction in relational data,” Proc. of Neural Information Processing Systems (NIPS’03), 2003.
[2] B. Chen, Y. Ding, and D. J. Wild, “Assessing drug target association using semantic linked data,” PLOS Computational Biology, 2012.
[3] J. Tang and H. Liu, “Unsupervised feature selection for linked social media data,” Proc. of ACM Conference on Knowledge Discovery and Data Mining (KDD’12), 2012.
[4] K. Henderson, B. Gallagher, T. Eliassi-Rad, H. Tong, S. Basu, L. Akoglu, D. Koutra, C. Faloutsos, and L. Li, “RolX: Structural role extraction and mining in large graphs,” Proc. of ACM Conference on Knowledge Discovery and Data Mining (KDD’12), 2012.
[5] K. Henderson, B. Gallagher, L. Li, L. Akoglu, T. Eliassi-Rad, H. Tong, and C. Faloutsos, “It’s who you know: Graph mining using recursive structural features,” Proc. of ACM Conference on Knowledge Discovery and Data Mining (KDD’11), 2011.
[6] M. McCord and M. Chuah, “Spam detection on Twitter using traditional classifiers,” Autonomic and Trusted Computing, 2011.
[7] A. K. Menon and C. Elkan, “Link prediction via matrix factorization,” Proc. of European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD’11), 2011.
[8] L. Tang and H. Liu, “Relational learning via latent social dimensions,” Proc. of ACM Conference on Knowledge Discovery and Data Mining (KDD’09), 2009.
[9] G. Hinton and R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, 2006.
[10] T. Tieleman, “Training restricted Boltzmann machines using approximations to the likelihood gradient,” Proc. of International Conference on Machine Learning (ICML’08), 2008.
[11] P. D. Hoff, A. E. Raftery, and M. S. Handcock, “Latent space approaches to social network analysis,” Journal of the American Statistical Association, 2002.
[12] P. D. Hoff, “Modeling homophily and stochastic equivalence in symmetric relational data,” Proc. of Neural Information Processing Systems (NIPS’07), 2007.
[13] K. T. Miller and T. L. Griffiths, “Nonparametric latent feature models for link prediction,” Proc. of Neural Information Processing Systems (NIPS’09), 2009.
[14] J. Zhu, “Max-margin nonparametric latent feature models for link prediction,” Proc. of International Conference on Machine Learning (ICML’12), 2012.
[15] G. E. Hinton, “Training products of experts by minimizing contrastive divergence,” Neural Computation, 2002.
[16] R. Memisevic and G. Hinton, “Unsupervised learning of image transformations,” Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR’07), 2007.
[17] M. A. Ranzato and G. Hinton, “Factored 3-way restricted Boltzmann machines for modeling natural images,” Artificial Intelligence, 2010.
[18] G. W. Taylor, G. E. Hinton, and S. Roweis, “Modeling human motion using binary latent variables,” Proc. of Neural Information Processing Systems (NIPS’06), 2006.
[19] G. W. Taylor and G. E. Hinton, “Factored conditional restricted Boltzmann machines for modeling motion style,” Proc. of International Conference on Machine Learning (ICML’09), 2009.
[20] K. Cho, A. Ilin, and T. Raiko, “Improved learning of Gaussian-Bernoulli restricted Boltzmann machines,” Proc. of International Joint Conference on Neural Networks (IJCNN’11), 2011.
[21] X. Li, N. Du, H. Li, K. Li, J. Gao, and A. Zhang, “A deep learning approach to link prediction in dynamic networks,” Proc. of SIAM International Conference on Data Mining (SDM’14), 2014.
[22] D. Liben-Nowell and J. Kleinberg, “The link prediction problem for social networks,” Proc. of ACM Conference of Information and Knowledge Management (CIKM’03), 2003.
[23] M. Li, Z. Jiang, B. Luo, J. Tang, Q. Gu, and D. Chen, “Product and user dependent social network models for recommender systems,” Advances in Knowledge Discovery and Data Mining, 2013.
[24] K. Li, N. Du, S. Guo, J. Gao, and A. Zhang, “Learning, analyzing and predicting object roles on dynamic networks,” Proc. of IEEE International Conference on Data Mining (ICDM’13), 2013.
[25] A. Madan, M. Cebrian, S. Moturu, K. Farrahi, and A. Pentland, “Sensing the ‘health state’ of a community,” Pervasive Computing, 2012.
[26] C. Stevenson, C. Docx, R. Webster, C. Battram, et al., “Comprehensive gene expression profiling of rat lung reveals distinct acute and chronic responses to cigarette smoke inhalation,” AJP-LUNG, 2007.
[27] D. Liben-Nowell and J. Kleinberg, “The link-prediction problem for social networks,” JASIST, 2007.
[28] L. Adamic and E. Adar, “Friends and neighbors on the web,” Social Networks, 2003.
[29] F. Liu, B. Liu, C. Sun, M. Liu, and X. Wang, “Deep learning approaches for link prediction in social network services,” Neural Information Processing, 2013.
[30] J. Yang, K. Yu, and T. Huang, “Supervised translation-invariant sparse coding,” Proc. of CVPR’10, 2010.
[31] S. Bhagat, G. Cormode, and S. Muthukrishnan, “Node classification in social networks,” Social Network Data Analytics, 2011.
[32] M. Jamali and M. Ester, “A matrix factorization technique with trust propagation for recommendation in social networks,” Proc. of ACM Recommendation Systems (RecSys’10), 2010.
[33] H. Ma, H. Yang, M. R. Lyu, and I. King, “Sorec: Social recommendation using probabilistic matrix factorization,” Proc. of ACM Conference of Information and Knowledge Management (CIKM’08), 2008.
[34] Y. Wang and J. Zeng, “Predicting drug-target interactions using restricted Boltzmann machines,” Bioinformatics, 2013.
[35] S. Osindero and G. Hinton, “Modeling image patches with a directed hierarchy of Markov random fields,” Proc. of Neural Information Processing Systems (NIPS’08), 2008.