Pseudo Cold Start Link Prediction with Multiple Sources in Social Networks

Liang Ge and Aidong Zhang
Computer Science and Engineering Department, State University of New York at Buffalo, Buffalo, NY 14260, U.S.A.
liangge@cse.buffalo.edu, azhang@cse.buffalo.edu

Abstract

Link prediction is an important task in social networks and data mining for understanding the mechanisms by which social networks form and evolve. Most link prediction research assumes that either a snapshot of the social network or a social network with some missing links is available, and therefore approaches the problem by exploring the topological structure of the social network using only one source of information. In many application domains, however, a number of auxiliary sources of information are available in addition to the social network of interest.

In this work, we introduce pseudo cold start link prediction with multiple sources: the problem of predicting the structure of a social network when only a small subgraph of the social network is known and multiple heterogeneous sources are available. We propose a two-phase supervised method: the first phase provides an efficient feature selection scheme that finds the best feature from each source for predicting the structure of the social network; in the second phase, we propose a regularization method to control the risk of over-fitting induced by the first phase. We assess our method empirically over a large data collection obtained from Youtube. The extensive experimental evaluations confirm the effectiveness of our approach.

1 Introduction

Link prediction, introduced by Nowell and Kleinberg [1], refers to the problem of, given a snapshot of a network at time t and a future time t', predicting the new links that are likely to appear in the network within the interval [t, t']. Link prediction is about understanding to what extent the mechanisms by which networks form and evolve can be modeled by features intrinsic to the network itself [1]. Most existing research therefore tackles this problem by exploring the topological structure of the social network of interest [1, 9, 12, 26]. In many application domains, however, we also have exogenous information in addition to the social network, most interestingly auxiliary networks between the same group of people drawn from heterogeneous sources. Take the Facebook network as an example: besides the friendship relations between users, there are other relations based on group membership, commenting, and so on.

In this paper, we tackle the pseudo cold start (PCS) link prediction with multiple sources problem: we still aim to predict the links, but instead of having a snapshot of the social network, we only have knowledge of a small subgraph. Meanwhile, we also have multiple auxiliary networks between the same users in the social network. We are interested in predicting possible links among the vertices in the social network by exploiting the multiple auxiliary networks. Notice that when the size of the known subgraph is zero, i.e., we have no knowledge of the existing link structure, and only one source of information is available, the problem reduces to the cold start link prediction introduced by Leroy, et al. [2].

Consider, for instance, an online coupon company G (e.g., Groupon) and a company F (e.g., Facebook) that provides general-purpose social network services. Suppose G and F have made the following agreement: 1) F provides a functionality by which users can share and review their favorite online coupons, and these shares and reviews are made available to their contacts; 2) when a user clicks an online coupon, the user is redirected to the corresponding page on G's website; but 3) F keeps the structure of its social network secret (this may be mandatory because of privacy regulations). Thanks to the agreement, users of F might influence each other to purchase online coupons, and the easiest way for them to buy coupons would be via the website of G. In this scenario, G owns information about the purchase history and reviews of its customers. Without violating the privacy regulations of F, G may also have crawled a small subgraph of the social network of F. The question herein is: can G nonetheless infer the social network, using the purchase and review information it owns and the small subgraph it has crawled? This would be quite helpful for G: it would enable G to adopt viral marketing strategies [3], and if G decides in the future to develop its own social networking service, the inferred network can be used to recommend possible friends, thus facilitating the initial growth of that social network.

The PCS problem brings distinct challenges with respect to traditional link prediction. Much existing work on link prediction exploits the topological structure of the social network [1, 9, 12, 26]. In the PCS problem, however, we only know a small subgraph of the social network (e.g., a subgraph containing only 5% of the total nodes). This setting prevents us from applying many existing link prediction approaches. Since we have obtained multiple auxiliary networks from heterogeneous sources, we aim to predict the links in the social network by exploring this auxiliary information. This brings another challenge: the curse of multiple sources. For each auxiliary source, a great number of features or predictors can be explored or derived to predict the links in the social network. Nowell, et al. [1] examined 10 topological features for link prediction, while Leroy, et al. [2] enumerated and inspected 15 features from group membership information alone. Given multiple sources, the number of possible features for link prediction is overwhelming, and the fact that computing each feature on a large-scale social network is non-trivial adds to the burden. The curse of multiple sources urges us to propose an efficient and effective feature selection scheme that finds the best feature for link prediction from the multiple sources. On the other hand, heterogeneous auxiliary sources grant us broader perspectives on the link structure of the social network than any single source. It is thus expected that the more sources we obtain, the more precisely we should be able to predict. How to meet this expectation is yet another challenge in the PCS problem.

In this paper, we propose a two-phase supervised method for the PCS problem. In the first phase, a feature selection approach is proposed to find the feature in each source that best infers the links in the social network. The proposed approach selects the best feature by aligning the feature to the known small subgraph.
The efficiency of the feature selection approach is guaranteed by the fact that the alignment is shown to be equivalent to one-dimensional curve fitting. To control the possible risk of over-fitting in the first phase, we propose a regularization method in the second phase. The regularization is achieved by mapping the original points of the social network to a Hilbert space and mandating that the mapping respect the known small subgraph and a regularizer. The link predictions are derived from the inner products of the mappings. It is noted that the optimal mapping comes from a Reproducing Kernel Hilbert Space (RKHS) induced by the output of the first phase.

We apply our method to a large data collection obtained from Youtube, a popular online video sharing community. We aim to predict the contact links between users and explore four auxiliary networks (see details in Section 2). The extensive empirical evaluation results demonstrate the power of our approach. Our contributions can be summarized as follows:

• We introduce pseudo cold start link prediction (PCS) with multiple sources as the problem of predicting the link structure of a social network assuming a small subgraph and multiple auxiliary networks are available.

• We propose a two-phase supervised method as the solution to the PCS problem, which includes an efficient feature selection scheme in the first phase and a regularization method in the second phase.

• We apply our method to predict the link structure of the Youtube social network with four auxiliary networks, and the experimental results show the power of our approach.

• Our approach answers the challenge of the curse of multiple sources and achieves the intuition that the more sources we obtain, the more precisely we are able to predict.

The rest of the paper is organized as follows: in the next section, we present the formal definition of the problem and describe the characteristics of the data sets. In Sections 3 and 4, the proposed two-phase method is presented. Extensive experimental results are shown in Section 5. Related work is discussed in Section 6, and finally we discuss future research directions and conclude the paper.

2 Problem and Data Sets

2.1 Problem Definition For a set O of n users, we aim to predict the link structure of the social network $A_{n\times n}$. We are given a subgraph of size l ($l \ll n$) together with its corresponding links, as well as auxiliary networks over the same n users: $B_{n\times n}, C_{n\times n}, \cdots$. Predicting the links in A amounts to building a function $h: O \times O \to \{0, 1\}$.

2.2 Data Sets Youtube is a highly popular online video sharing community. In Youtube, a user can add another user as a contact, subscribe to other users' videos, and tag favorite videos. In this work, we focus on link prediction in the contact network. In Youtube, links are directed; however, most of the features we use in the following sections are symmetric, which implies that we predict the same likelihood for the existence of links in both directions.

We obtain the Youtube data sets from Tang, et al. [4]. The data collection is composed of five networks, of which the contact network is the one to predict; we treat the other four networks as auxiliary sources. There are 15088 users in the contact network; after removing the users with zero contacts, 13723 users remain. The description of the data sets is as follows:

• Major network A: one user adds another user as a contact. This is the network whose link structure we will predict.

• Co-contact network B: two users are connected if they share a contact.

• Co-subscription network C: two users are connected if they share any subscription.

• Co-subscribed network D: two users are connected if they are subscribed to by the same user.

• Favorite network E: two users are connected if they share favorite videos.

The four auxiliary networks show distinct and even complementary perspectives on the link structure of A. Roughly speaking, the auxiliary networks may capture the latent reasons that people in Youtube add others as contacts. For example, people may connect to others because they both know the same people, or because they share favorite videos.

Table 1: Data set properties

Network   Avg. degree   Max. degree   Total links
A         11.2          534           153,350
B         1143.2        5637          15,688,290
C         771.0         6517          10,581,208
D         306.2         5686          4,202,604
E         1357.4        7640          18,627,088

The degree properties of the various networks are summarized in Table 1, where the second and third columns show the average and maximum degree of each network, and the last column shows the total number of links in each network. It is easy to see that the major network A is rather sparse while the auxiliary networks are much denser. This simple observation shows that the auxiliary networks are verbose: a user, for instance, may share favorite videos with many people, but only a few of them become his contacts. The direct consequence of this observation is that we cannot infer the links of A by directly using the links of the auxiliary networks.

Almost all social networks are found to follow the scale-free property, i.e., the degree distribution of the social network follows a power law. Similarly skewed degree distributions are observed in networks A, B, C, D and E, as shown in Figures 1 and 2.

[Figure 1: Degree distribution of network A. Figure 2: Degree distributions of networks B, C, D and E.]

In the PCS problem, we only know a subgraph of size l. In this paper, we take l to be 5%, 10% and 15% of n, and randomly sample l nodes and their links from A as the known subgraph. Notice that the subgraph is small: when l is 10% of n, the subgraph on average contains only 0.9% of the total links. Due to space limits, unless otherwise stated, all the empirical results assume that l is 10% of n; we also report the results when l is 5% and 15% of n.
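To make the sampling protocol concrete, here is a minimal sketch (our own illustration, not code from the paper) of how such a known subgraph can be drawn, assuming the major network is held as a dense 0/1 adjacency matrix:

```python
# Hypothetical illustration of the setup above: sample l = fraction * n users
# uniformly at random and keep the induced subgraph of A (their mutual links)
# as the known training subgraph A_ll.
import numpy as np

def sample_known_subgraph(A, fraction=0.10, seed=0):
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    idx = rng.choice(n, size=int(fraction * n), replace=False)
    A_ll = A[np.ix_(idx, idx)]  # links among the sampled users only
    return idx, A_ll
```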
2.3 Result Presentation The link prediction problem is intrinsically a very difficult binary prediction problem due to the sparseness of the social network. Given n nodes, we have a space of $n^2 - n$ possible links, of which only a very small fraction actually exists. In our data, the existing links constitute only about 0.08% of all possible links. This means that accuracy is not a meaningful measure for link prediction: in our data sets, by predicting an empty network (no links exist), we can always achieve an accuracy of 99.92%. For the same reason, comparing different predictive functions by recall is not appropriate either. To compare the performance of different predictive functions, we adopt the Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC) to present our results.

The ROC curve, with the false positive rate on the x-axis and the true positive rate on the y-axis, is a reliable way to compare binary classifiers. As we move along the x-axis, a larger percentage of links is selected (by lowering the threshold on the link occurrence score), and the fraction of actually occurring links among the selected links increases monotonically; when all links are selected, this fraction is 1. Therefore, the ROC curve always starts at (0, 0) and ends at (1, 1). The ROC curve associated with predicting by randomly selecting links to include is a 45-degree line. The standard measure of the quality of an ROC curve is the Area Under the Curve (AUC), which is strictly bounded between 0 and 1: a perfect algorithm has an AUC of 1 and a random algorithm has an AUC of 0.5. We normalize AUC to lie between 0 and 100.
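As an illustration of this evaluation protocol, the following sketch (ours; it assumes scikit-learn is available and that scores and adjacency are dense symmetric matrices) computes the normalized AUC over all node pairs:

```python
# Minimal sketch of ROC/AUC evaluation for a link predictor. `scores` is a
# symmetric n x n matrix of link occurrence scores and `A` the true 0/1
# adjacency matrix; each undirected pair is counted once.
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_0_100(scores, A):
    iu = np.triu_indices(A.shape[0], k=1)  # strict upper triangle
    return 100.0 * roc_auc_score(A[iu], scores[iu])
```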
3 Feature Selection Phase

In the PCS problem, approaches that employ features intrinsic to the social network do not work, because we only know a small subgraph. Naturally, we have to explore the auxiliary networks to infer the possible links. Since a great number of possible features can be derived from the auxiliary networks, we encounter the curse of multiple sources. Furthermore, it has been reported [1, 12] that no feature is consistently superior across all data sets. This observation makes sense because different social networks may have distinct mechanisms of formation and evolution. The massive scale of the social network and the auxiliary networks makes the problem even harder. Therefore, an efficient and effective feature selection strategy has to be developed to tackle the curse of multiple sources.

In this section, we first introduce the pool of candidate features for each source and present the approach for finding the best feature for link prediction in the major network. After that, we show empirical evidence that supports our feature selection strategy. Finally, we show that by properly integrating the best feature from each source, we obtain a result that is better than the best feature from any single source.

3.1 The Pool of Candidate Features For a given auxiliary network, say B, we adopt four mappings to construct link prediction features: 1) $B \to B$, the network itself; 2) $B \to \tilde{B}$, its normalized matrix, where $\tilde{B} = W^{-1/2} B W^{-1/2}$ and W is the diagonal matrix with $W_{ii} = \sum_j B_{ij}$; 3) $B \to L$, its Laplacian, where $L = W - B$; and 4) $B \to \tilde{L}$, its normalized Laplacian, where $\tilde{L} = I - \tilde{B}$ and I is the identity matrix of the same size as B.

The first set of prediction features we examine is the exponential [8] and von Neumann [9] graph kernels, defined as

(3.1) $f_{exp}(B) = \exp(\alpha B)$,

(3.2) $f_{neu}(B) = (I - \alpha B)^{-1}$,

where α is a positive parameter. The exponential and von Neumann kernels are equivalent to path-counting features for link prediction. Path-counting features capture the proximity between two nodes in terms of the number and length of the paths between them, exploiting the fact that $B^r$, the r-th power of the adjacency matrix of an unweighted graph, contains the number of paths of length r connecting each node pair; a pair of nodes connected by many paths is considered closer than a pair connected by fewer paths. Indeed, we have the expansions

(3.3) $\exp(\alpha B) = \sum_{i=0}^{\infty} \frac{\alpha^i}{i!} B^i$,

(3.4) $(I - \alpha B)^{-1} = \sum_{i=0}^{\infty} \alpha^i B^i$.

Notice that the von Neumann graph kernel is also called the Katz index [7], which enjoys much success in practice. When the network is large, the inversion of $I - \alpha B$ becomes too expensive; instead, we can stop counting paths after length r. Moreover, experiments show that a small r usually yields better results than a large r [6]. Notice that when r = 2, this is equivalent to another popular feature that computes the closeness of a pair by counting their common neighbors. We thus define the sum of powers as another choice of feature for link prediction:

(3.5) $f_{sop}(B) = \sum_{i=1}^{r} \alpha^i B^i$.

If we replace B with $\tilde{B}$ in Eqs. 3.1, 3.2 and 3.5, we obtain another three candidate features for link prediction.

The second set of features for link prediction employs the Laplacian and the normalized Laplacian. The Moore-Penrose pseudoinverse of the Laplacian [10] is called the commute time or resistance distance kernel; regularizing the commute time kernel yields the regularized Laplacian kernel [11]. The two are defined as

(3.6) $f_{comm}(L) = L^{+}$,

(3.7) $f_{commr}(L) = (I + \alpha L)^{-1}$.

The commute time kernel takes its name from the average commute time, defined as the average number of steps a random walker, starting from node $i \neq j$, takes before entering node j for the first time and going back to i. The kernel is also known as the resistance distance kernel because of a close analogy with effective resistance in electrical networks. Another candidate feature is the heat diffusion kernel [9], defined as

(3.8) $f_{heat}(L) = \exp(-\alpha L)$.

Again, replacing L with $\tilde{L}$ in Eqs. 3.6, 3.7 and 3.8 yields another three candidate link prediction features. Therefore, for an individual source, we enumerate 12 candidate features to infer the link structure of the major network. The candidate pool is obviously not complete, but it is good enough to produce good performance, and it enables an efficient feature selection strategy.
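The pool can be organized compactly, since every candidate is a scalar function applied to the spectrum of one of the four mappings. The sketch below (our reading of Section 3.1, with placeholder parameter values) builds the mappings and evaluates a feature through the eigendecomposition $f(X) = U f(\Lambda) U^T$:

```python
# Sketch of the candidate pool of Section 3.1. Each feature is a scalar
# function f applied to the eigenvalues of a symmetric mapping X of the
# auxiliary network B. Parameter defaults are illustrative placeholders.
import numpy as np

def mappings(B):
    d = B.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    B_norm = D_inv_sqrt @ B @ D_inv_sqrt   # normalized matrix B~
    L = np.diag(d) - B                     # Laplacian L = W - B
    L_norm = np.eye(len(d)) - B_norm       # normalized Laplacian
    return {"B": B, "B~": B_norm, "L": L, "L~": L_norm}

def apply_feature(X, f):
    lam, U = np.linalg.eigh(X)             # X = U diag(lam) U^T
    return (U * f(lam)) @ U.T              # U diag(f(lam)) U^T

# A few members of the pool, written as functions of the eigenvalues:
f_exp  = lambda lam, a=0.5: np.exp(a * lam)                                    # Eq. 3.1
f_neu  = lambda lam, a=0.1: 1.0 / (1.0 - a * lam)                              # Eq. 3.2 (needs a*lam < 1)
f_sop  = lambda lam, a=0.5, r=3: sum((a * lam) ** i for i in range(1, r + 1))  # Eq. 3.5
f_heat = lambda lam, a=0.5: np.exp(-a * lam)                                   # Eq. 3.8 (applied to L)
```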
3.2 Feature Selection Strategy We notice that each feature in the pool can be represented as a function f(X), where X is one of B, $\tilde{B}$, L or $\tilde{L}$. Since we already know a small subgraph of size l, we propose to select the best feature by minimizing

(3.9) $\min_{f} \| f(X_{l\times l}) - A_{l\times l} \|_F$,

where f is from the pool, X is one of the four mappings, $A_{l\times l}$ is the given subgraph, $X_{l\times l}$ is the mapping restricted to the same nodes as in $A_{l\times l}$, and $\|\cdot\|_F$ denotes the Frobenius norm.

To solve Eq. 3.9, we rely on two observations. First, for any symmetric matrix X, f(X) can be computed as $f(X) = U f(\Lambda) U^T$, where U is the orthogonal matrix whose i-th column is the i-th eigenvector of X, and Λ is the diagonal matrix whose entry $\Lambda_{ii}$ is the i-th eigenvalue of X. Second, the Frobenius norm is invariant under multiplication by an orthogonal matrix. Based on these two observations, we have

(3.10) $\min \| f(X_{l\times l}) - A_{l\times l} \|_F = \| U f(\Lambda) U^T - A_{l\times l} \|_F = \| f(\Lambda) - U^T A_{l\times l} U \|_F \sim \sum_i \big( f(\Lambda_{ii}) - U_{\cdot i}^T A_{l\times l} U_{\cdot i} \big)^2$

In the last step of Eq. 3.10, since f(Λ) is a diagonal matrix, the off-diagonal elements of $U^T A_{l\times l} U$ are irrelevant to f and can be dropped. Therefore, feature selection is equivalent to one-dimensional curve fitting, which can be solved efficiently. Notice that the last step of Eq. 3.10 corresponds to the root mean squared error (RMSE). Furthermore, most of the features in the candidate pool are parameter-laden (all except the commute time kernel); with Eq. 3.10, we can also tune the parameter by minimizing the RMSE.
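In code, the selection rule of Eq. 3.10 amounts to a small search over mappings, features and parameter values. The sketch below is our implementation of that idea, with a hypothetical parameter grid; it returns the candidate with minimal RMSE:

```python
# Sketch of feature selection as one-dimensional curve fitting (Eq. 3.10):
# fit f(Lambda_ii) against the targets U_.i^T A_ll U_.i and keep the
# (mapping, feature, alpha) triple with the smallest RMSE.
import numpy as np

def rmse(X_ll, A_ll, f):
    lam, U = np.linalg.eigh(X_ll)
    targets = np.einsum('ji,jk,ki->i', U, A_ll, U)  # U_.i^T A_ll U_.i
    return np.sqrt(np.mean((f(lam) - targets) ** 2))

def select_best(maps_ll, A_ll, pool, alphas=np.linspace(0.01, 1.0, 100)):
    best = (np.inf, None)
    for m_name, X_ll in maps_ll.items():       # X in {B, B~, L, L~}
        for f_name, f in pool.items():         # f from the 12-feature pool
            for a in alphas:                   # tune alpha by minimal RMSE
                err = rmse(X_ll, A_ll, lambda lam: f(lam, a))
                if err < best[0]:
                    best = (err, (m_name, f_name, a))
    return best
```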
Now we present empirical evidence to justify the benefits of the proposed feature selection strategy. For each auxiliary network, we present only the top-3 features in terms of RMSE, together with the AUC and the ROC curve of each feature. The detailed results are shown in Table 2, where AUC-$A_{l\times l}$ denotes the prediction performance on the known subgraph $A_{l\times l}$ and AUC-A denotes the prediction performance on the major network A. The parameter column gives the optimal value of α selected by our approach. One thing missing from this table is the parameter r of the sum of powers feature, which is selected to be 3. Figures 3 to 6 show the ROC curves of the top-3 features from each auxiliary network.

Table 2: Best features for the auxiliary networks

Auxiliary network   Best predictor   Best mapping   RMSE   Parameter   AUC-A_lxl   AUC-A
B                   Sum of Powers    B~             0.08   0.65        87.3        86.1
B                   Heat Kernel      L              0.22   0.89        73.5        71.2
B                   CommR Kernel     L              0.23   0.92        85.8        80.2
C                   Sum of Powers    C~             0.11   0.43        77.7        74.4
C                   Heat Kernel      L              0.38   0.93        68.1        68.4
C                   Sum of Powers    C              0.39   0.01        73.8        71.8
D                   Sum of Powers    D~             0.08   0.33        77.3        80.2
D                   Sum of Powers    D              0.10   0.01        75.6        77.9
D                   Heat Kernel      L              0.39   0.79        71.1        72.8
E                   Sum of Powers    E~             0.09   0.56        71.4        73.8
E                   Sum of Powers    E              0.26   0.89        59.1        63.5
E                   Heat Kernel      L              0.28   0.95        68.9        60.7

[Figure 3: ROC of the best features from network B. Figure 4: ROC of the best features from network C. Figure 5: ROC of the best features from network D. Figure 6: ROC of the best features from network E.]

Several observations arise from the empirical evaluation. First, our proposed method selects the best feature for each source, i.e., the feature with minimal RMSE corresponds to the best performance both on the known subgraph and on the social network. Second, the prediction power of each source differs: network B has the greatest prediction power. Since network B is the co-contact network, i.e., users are connected if they share a contact, this observation confirms a popular assumption both in daily life and in social network research: "friends of friends are friends". Third, in related work on a similar topic [6], Lu, et al. tackle the problem of multiple sources by mandating that the Katz index is the best feature for each source, which is not flexible and does not respect the fact that different sources have different best features. Our work shows that the best feature for each source in Youtube is the sum of powers of the normalized network. Notice that when the data set changes, the best feature for each source may also vary, yet our proposed approach is still able to capture the best feature in each source. Finally, we observe that although the minimal RMSE corresponds to the best performance, when RMSE values are large, a smaller RMSE does not necessarily indicate better performance. For example, the RMSEs for the heat kernel and the regularized commute kernel of B are 0.22 and 0.23, respectively; however, in terms of AUC, the regularized commute kernel is superior to the heat kernel. This happens because a large RMSE value means that a significant part of the feature does not fit $A_{l\times l}$, and RMSE does not distinguish mismatch on the major part from mismatch on the outliers. Figures 7 and 8 (better seen in color) show the curve fitting plots for the heat kernel and the regularized commute kernel, respectively, with $f(\Lambda_{ii})$ on the x-axis and $U_{\cdot i}^T A_{l\times l} U_{\cdot i}$ on the y-axis. As we can see, the regularized commute kernel fits the major part more closely; its RMSE value is larger than that of the heat kernel because the heat kernel probably fits the outliers more closely. This explains the possible reversal between a large RMSE and its corresponding performance.

[Figure 7: Heat kernel curve fitting. Figure 8: Regularized commute kernel curve fitting.]

3.3 The Complement of Multiple Sources From the empirical evidence, we notice that network B has the greatest prediction power and that the sum of powers of normalized B achieves the best performance. We also observe that the best features from the other three auxiliary networks perform well. As we perceive, people are sophisticated and become friends for a variety of reasons. Therefore, although network B shows the greatest prediction power, which indicates that the latent reason B represents is probably the major reason people become friends in Youtube, we can never safely rule out the other possible reasons that the other three auxiliary networks embody. We believe the four auxiliary networks are to some extent complementary, and that a proper integration of them should shed more light on how people are connected in Youtube.

To exploit the complementarity of multiple sources, we propose two alternatives. The first is to select the best feature from each source and combine them to perform the prediction. The second is to combine the best features overall, i.e., to select the features with top performance; in this case, since network B has the greatest prediction power, several of its features will be selected.

Figures 9 and 10 show the ROC curves of the two alternatives on the major network A. In Fig. 9, B, C, D and E denote the sum of powers of the normalized matrices B, C, D and E, respectively, and B+C+D+E denotes their average ensemble. The ensemble achieves a better result than the best feature from any single source. This supports our assertion that the multiple sources are to some extent complementary. In Fig. 10, on the other hand, B+B+C+E denotes the average ensemble of two features from B (sum of powers and regularized commute kernel) and the best features from C and E. We note that the performance of this ensemble is inferior to the first alternative, but still better than the best feature of each single source.

Figure 11 shows the ensemble of the top-2 features of B, i.e., the sum of powers of normalized B and the regularized commute kernel, compared with the performance of the individual features. As we can see, the performance of this ensemble, although very close to that of the sum of powers of normalized B, is still worse. This indicates that the performance increase in Figure 10 occurs because of the ensemble of features from different sources; it also explains why the ensemble in Figure 10 performs worse than the one in Figure 9.

[Figure 9: ROC curve of the first alternative. Figure 10: ROC curve of the second alternative. Figure 11: Ensemble of two features from B.]

Table 3: Performance of different combinations on the major network A

     B      C      D      E
B    84.1   87.5   87.1   84.7
C    -      72.9   84.2   78.0
D    -      -      80.7   82.6
E    -      -      -      72.7

Table 3 presents the performance of different combinations of the best feature from each source. Each cell denotes the average ensemble of two features (a diagonal cell thus reduces to the individual feature itself); since the table is symmetric, we only show its upper triangle. It is clear that any combination of two features boosts the performance compared with either feature by itself. We also experimented with the combinations B+C+E, B+D+E and C+D+E; these combinations all boost the performance over the individual sources. The results are inspiring in that we achieve the desirable goal that the more sources we obtain, the more precisely we can predict.

4 Regularization

The feature selection phase chooses the feature with the best performance, selected by aligning with the known subgraph $A_{l\times l}$. In the PCS problem, $A_{l\times l}$ is usually small, so the selected feature is prone to over-fitting. It is important to control the over-fitting through a proper regularization. Inspired by the work of Vert, et al. [25], we propose the following regularization strategy:

• In the major network A, construct a mapping h that maps the original nodes of A to a Hilbert space.

• The mapping h must respect the known subgraph $A_{l\times l}$ as much as possible and follow a regularizer.

• The prediction is made by computing the inner product between the mappings of two nodes.

It is noted that the optimal mapping ĥ comes from the Reproducing Kernel Hilbert Space (RKHS) induced by the output of the first phase. The RKHS theory for real-valued functions provides a powerful theoretical framework for regularization; numerous models, including support vector machines, can be derived from the application of the representer theorem with different data-dependent cost functions. In the following, we briefly review the main ingredients of the RKHS theory for vector-valued functions that are needed for our regularization.

4.1 RKHS for Vector-valued Functions For a given Hilbert space $F^\dagger$, $L(F^\dagger)$ denotes the set of bounded linear operators from $F^\dagger$ to itself. Given $Z \in L(F^\dagger)$, $Z^*$ denotes the adjoint of Z.

Definition 1 [13, 14, 19] Let O be a set and $F^\dagger$ a Hilbert space. $K_x: O \times O \to L(F^\dagger)$ is a kernel if:

• $\forall (o, o') \in O \times O$, $K_x(o, o') = K_x(o', o)^*$;

• $\forall m \in \mathbb{N}$, $\forall \{(o_i, y_i)\}_{i=1}^m \subseteq O \times F^\dagger$, $\sum_{i,j=1}^{m} \langle y_i, K_x(o_i, o_j) y_j \rangle_{F^\dagger} \geq 0$.

The kernel defined here is an operator-valued kernel with output in $L(F^\dagger)$ instead of $\mathbb{R}$. The definition parallels that of an ordinary kernel (symmetric, semi positive-definite).

Theorem 1 [13, 14, 19] Let O be a set and $F^\dagger$ a Hilbert space. If $K_x: O \times O \to L(F^\dagger)$ is an operator-valued kernel, then there exists a unique RKHS H which admits $K_x$ as its reproducing kernel, that is,

$\forall o \in O, \forall y \in F^\dagger, \quad \langle h, K_x(o, \cdot)\, y \rangle_H = \langle h(o), y \rangle_{F^\dagger}$.

Theorem 1 is analogous to the Moore-Aronszajn theorem [16], which describes the relation between kernels and RKHSs of real-valued functions: each kernel defines a unique RKHS. With the operator-valued kernel and Theorem 1, we have the following representer theorem.

Theorem 2 [14, 19] Let O be a set and $F^\dagger$ a Hilbert space. Given a labeled sample $S_l = \{(o_i, y_i)\}_{i=1}^{l} \subseteq O \times F^\dagger$ and an RKHS H with reproducing kernel $K_x: O \times O \to L(F^\dagger)$, the minimizer ĥ of the regularized risk

(4.11) $\operatorname{argmin}_{h \in H} J(h), \quad J(h) = \sum_{i=1}^{l} \| h(o_i) - y_i \|^2_{F^\dagger} + \lambda \| h \|^2_H$,

with λ > 0, admits an expansion

$\hat{h}(\cdot) = \sum_{j=1}^{l} K_x(o_j, \cdot)\, c_j$,

where the vectors $c_j \in F^\dagger$, $j \in \{1, \cdots, l\}$, satisfy the equations

$y_j = \sum_{i=1}^{l} \big( K_x(o_i, o_j) + \lambda \delta_{ij} \big)\, c_i$,

where δ is the Kronecker symbol: $\delta_{ii} = 1$ and $\delta_{ij} = 0$ for all $j \neq i$.

Theorem 2 is devoted to penalized least squares. Eq. 4.11 exhibits exactly the property desired for our regularization: respect the known subgraph $A_{l\times l}$ as much as possible while following a regularizer. To benefit from the RKHS theory, we should define a proper operator-valued kernel based on the features selected in the first phase, and find a representation for the labels $\{y_i\}_{i=1}^{l}$ of the nodes in the major network A.

4.2 Regularization First we show how to obtain an operator-valued kernel. If we have a scalar kernel $k_x: O \times O \to \mathbb{R}$, we can define the operator-valued kernel $K_x: O \times O \to L(F^\dagger)$ as

$K_x(o, o') = k_x(o, o') \times I_{F^\dagger}$,

where $I_{F^\dagger}$ is the identity operator on $F^\dagger$. Clearly, $K_x$ is symmetric. Furthermore, the semi positive-definiteness of $k_x$ leads to the semi positive-definiteness of $K_x$: for all $m \in \mathbb{N}$ and $\{(o_i, y_i)\}_{i=1}^m \subseteq O \times F^\dagger$, we have

$\sum_{i,j=1}^{m} \langle y_i, K_x(o_i, o_j)\, y_j \rangle_{F^\dagger} = \sum_{i,j=1}^{m} k_x(o_i, o_j) \langle y_i, y_j \rangle_{F^\dagger} \geq 0$.

The input scalar kernel is chosen to be the best feature K from the first phase, which, in this case, is the average ensemble of the sums of powers of the normalized networks. It is noted that this feature may not be a semi positive-definite matrix, and is thus not a kernel per se. Theoretically, a kernel matrix K must be positive semi-definite; empirically, as a machine learning heuristic, a choice of K that does not satisfy Mercer's condition may still perform reasonably if K at least approximates the intuitive idea of similarity. The feature for link prediction captures the proximity between users, and can therefore be used as a reasonable input for constructing the operator-valued kernel.

In this context, the known subgraph $A_{l\times l}$ is viewed as the training set. But unlike ordinary classification, we do not have labels $\{y_i\}_{i=1}^{l}$ for the nodes of the training set; rather, we have labels for its links. To obtain node labels, we construct a Gram matrix $K_{Y_l}$ from the training set $A_{l\times l}$ using any kernel that encodes the proximities of the nodes in a given graph. In this work, we use the heat diffusion kernel: $K_{Y_l} = \exp(-\beta L_{Y_l})$, where $L_{Y_l}$ is the Laplacian of $A_{l\times l}$. We assume that a semi positive-definite kernel $k_y: O \times O \to \mathbb{R}$ underlies this Gram matrix. By the Mercer theorem [17], there exist a Hilbert space $F^\dagger$ and a mapping $y: O \to F^\dagger$ such that $\forall (o, o') \in O$, $k_y(o, o') = \langle y(o), y(o') \rangle_{F^\dagger}$. In this way, we obtain the label information $\{y_i\}_{i=1}^{l}$ for the nodes of the training set. Intuitively, $y_i$ can be seen as a vector representing the position in the Hilbert space $F^\dagger$ of the training node $o_i$.

Once $K_x$ and $\{y_i\}_{i=1}^{l}$ are defined, the solution to Eq. 4.11 reads $C = Y_l (K_{X_l} + \lambda I_l)^{-1}$, where $Y_l = (y_1, \cdots, y_l)$, $K_{X_l}$ is the $l \times l$ Gram matrix associated with the kernel $k_x$, and $I_l$ is the identity matrix of size l. We can then compute the prediction between nodes $o_i$ and $o_j$ as

$\langle h(o_i), h(o_j) \rangle_{F^\dagger} = \langle C k_x(o_{1,\cdots,l}, o_i), C k_x(o_{1,\cdots,l}, o_j) \rangle_{F^\dagger} = k_x(o_{1,\cdots,l}, o_i)^T\, Q^T K_{Y_l} Q\, k_x(o_{1,\cdots,l}, o_j)$,

where $Q = (K_{X_l} + \lambda I_l)^{-1}$ and $k_x(o_{1,\cdots,l}, o_i)$ is the $l \times 1$ vector whose k-th element is $k_x(o_k, o_i)$ for the training node $o_k$. It is noted that the node labels $\{y_i\}_{i=1}^{l}$ are never explicitly computed.
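Putting Section 4.2 together, the whole second phase reduces to a few dense linear-algebra steps. The following sketch is our own rendering under the stated assumptions: `K` is the n × n scalar kernel produced by the first phase, and `train` indexes the nodes of the known subgraph.

```python
# Sketch of the second phase: given the phase-one feature K as input scalar
# kernel, build K_Yl = exp(-beta * L_Yl) from the known subgraph A_ll and
# score every pair (o_i, o_j) by k_i^T Q^T K_Yl Q k_j with
# Q = (K_Xl + lambda * I)^{-1}.
import numpy as np
from scipy.linalg import expm

def regularized_scores(K, A_ll, train, lam=1.0, beta=1.0):
    K_Xl = K[np.ix_(train, train)]             # l x l Gram matrix of k_x
    L_Yl = np.diag(A_ll.sum(axis=1)) - A_ll    # Laplacian of A_ll
    K_Yl = expm(-beta * L_Yl)                  # label Gram matrix
    Q = np.linalg.inv(K_Xl + lam * np.eye(len(train)))
    M = Q.T @ K_Yl @ Q                         # l x l core matrix
    K_lt = K[train, :]                         # columns are k_x(o_{1..l}, o_i)
    return K_lt.T @ M @ K_lt                   # n x n matrix of <h(o_i), h(o_j)>
```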
Theoretically, any kernel can act as the input scalar kernel for the regularization. However, as our empirical evaluation will show, the quality of the input kernel has a huge influence on the performance of the regularization strategy. Simply put, a kernel that performs better in the first phase will also perform better here than a kernel that performs worse in the first phase. The first phase is therefore justified from another perspective: it serves as kernel selection for the regularization. In the following, we show empirical evidence supporting two claims: 1) regularization is able to boost the performance obtained in the first phase; 2) different input kernels greatly influence the performance of the regularization.

[Figure 12: ROC curves for the regularization. Figure 13: ROC curves for different input kernels.]

In Figure 12, the legends B, C, D, E and B+C+D+E stand for the best feature from each source in the first phase and the average ensemble of those best features; the black curve denotes the performance of the regularization with input kernel B+C+D+E. As seen from this figure, the regularization boosts the performance obtained in the first phase. Figure 13 shows the results of the regularization with different input kernels; here the legends B, C, D, E and B+C+D+E indicate that we choose the sum of powers of the corresponding normalized auxiliary matrix, or their average ensemble, as the input kernel of the regularization. It is clear from the plot that different kernels yield different performance, and that the best input kernel from the first phase achieves the best performance.

5 Empirical Evaluation

In the previous sections, we presented empirical evidence to support our conclusions. To make the empirical evaluation more complete, several further issues need to be discussed.

5.1 Low Rank Approximation In Section 3, we need to compute various features of the auxiliary networks. Those networks are huge and dense, so computing features from them is not trivial. Brute force is seldom an option; instead, we use a low rank approximation of the features. Since all our features in Section 3 can be represented as f(X) with X a symmetric matrix, a rank-k approximation of f(X) is given by truncation, keeping only k eigenvalues and eigenvectors:

$f_{(k)}(X) = U_{(k)} f(\Lambda_{(k)}) U_{(k)}^T$,

where the k largest eigenvalues are used when X is an auxiliary network or a normalized auxiliary network, and the k smallest eigenvalues are used for the Laplacian and the normalized Laplacian. It is to be expected that the parameter k influences the performance.

Table 4: Influence of the low rank approximation

Feature   AUC (k=100)   AUC (k=200)   Time (k=100)   Time (k=200)
Sopb      84.1          86.0          34s            86s
Sopc      74.4          72.9          21s            77s
Sopd      80.2          80.7          19s            36s
Sope      73.8          72.3          61s            97s

In Table 4, the first two columns show the AUC for different values of k and the last two columns show the time needed to compute each feature; Sopb denotes the sum of powers of normalized B, and similarly for the other rows. The table shows the influence of the low rank approximation on both performance and running time. The experiments are carried out on an Intel Core Duo 2.7GHz machine with 4 GB of memory. Clearly, increasing k increases the running time, but a larger k does not necessarily yield better performance. In this work, we take k to be 200.
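The truncation is straightforward with a sparse eigensolver; here is a sketch (ours, assuming SciPy is available) of the rank-k feature computation:

```python
# Sketch of the rank-k approximation f_(k)(X) = U_(k) f(Lambda_(k)) U_(k)^T.
# Use which='LA' (largest algebraic eigenvalues) for B and B~, and
# which='SA' (smallest) for L and L~, as described in Section 5.1.
import numpy as np
from scipy.sparse.linalg import eigsh

def low_rank_feature(X, f, k=200, which='LA'):
    lam, U = eigsh(X, k=k, which=which)  # truncated eigendecomposition
    return (U * f(lam)) @ U.T
```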
5.2 The Known Subgraph In the PCS problem, we know a small subgraph. In the empirical evidence shown in Sections 2 and 3, we took the size l of the subgraph to be 10% of n, where n is the size of A. To fully evaluate the influence of l, we take l to be 5%, 10% and 15% of n, randomly sample l users and their corresponding links from the major network as the known subgraph 10 times, and summarize the results in Table 5.

Table 5: Influence of the known subgraph

Method           5%           10%          15%
Sopb             85.5 ± 0.7   86.1 ± 0.5   86.2 ± 0.5
Ensemble         88.4 ± 0.6   89.0 ± 0.4   89.3 ± 0.5
Regularization   90.2 ± 0.3   90.8 ± 0.3   91.2 ± 0.4

In this table, Sopb denotes the best feature from B (the sum of powers of normalized B), Ensemble denotes the average ensemble of the best feature from each source, and Regularization denotes the performance of the regularization. We notice that as the training size increases, the performance of Sopb and Ensemble hardly increases. This happens because, for a specific data set, the feature from each source that performs best for link prediction in the social network is fixed; the same feature is always selected regardless of the training size. Regularization, by contrast, does improve with the training size, which satisfies the basic machine learning assumption that a bigger training set allows us to do better.

5.3 Parameter Sensitivity The performance of the regularization is sensitive to the parameter λ. We tune this parameter by maximizing the prediction performance on the training set. Figure 14 shows the performance curve over different values of λ in one experiment. Another parameter is the β of the heat diffusion kernel that represents the proximity relations between nodes. We also tune it by maximizing the prediction performance on the training set. We notice that this parameter is only remotely relevant to the prediction performance, so we omit those reports here.

[Figure 14: Sensitivity to the parameter λ.]

6 Related Work

Liben-Nowell and Kleinberg [1] introduced the link prediction problem and showed that simple graph-theoretic measures, such as the number of common neighbors, are able to detect links that are likely to appear in a social network. By adopting more elaborate measures that consider the ensemble of all paths between two nodes (e.g., the Katz index [7]), they further improved the prediction quality. Inspired by this seminal work, a huge amount of attention has been paid to discovering unsupervised measures that can better predict the link structure [9, 22, 26]. Those link prediction methods are ad hoc, and consequently no method is consistently superior on all data sets. This makes sense in that different networks have distinct mechanisms of formation and evolution, as shown in [22]. It is noted, however, that the Katz index [7] and its truncated variant have shown success in many applications [1, 6].

Link prediction can also be viewed as a binary classification problem in which the label of a link is either 0 (non-existing) or 1 (existing). Hasan, et al. [24] proposed to employ various classification approaches to handle link prediction directly. Lichtenwalter, et al. [22] noticed the extremely imbalanced class distribution of the links and proposed an over-sampling strategy to overcome the imbalance. Vert, et al. [25] proposed a supervised model in which the points of the network are first mapped to a Euclidean space and the prediction is made by computing the distance between the points in that space; the mapping is mandated to conform to the existing link structure. Inspired by recent developments in semi-supervised learning [18], Kashima, et al. [20] proposed applying the label propagation framework [27] to the link prediction problem, and were able to bring the time complexity of their algorithm down from $O(n^4)$ to $O(n^3)$.
Brouard, et al. [19] presented a new representer theorem for semi-supervised link prediction and showed that semi-supervision is able to boost the performance of supervised link prediction.

Most research on link prediction assumes that a general link structure of the network is known, either the network at time t or the network with some missing links. Leroy, et al. [2] introduced the cold start link prediction problem by assuming that no link structure is available; their ambition is to rebuild the network from auxiliary information. However, they manually inspect the possible features from a single auxiliary source and find the best feature by comparing with the ground truth. With multiple auxiliary sources, their approach would lose much of its efficiency.

Several researchers have focused on problems related to multiple sources in social networks. Tang, et al. [4] studied the problem of clustering a social network in the context of multiple sources, or multi-dimensional social networks. Du, et al. [5] studied the evolution of a large multi-modal social network from Nokia. In the work most related to ours, Lu, et al. [6] proposed a weighted ensemble model to combine different features from various sources, together with two regularization strategies to control the complexity of the model. One limitation is that they mandate the Katz index to be the best feature for each source; this assumption may break on other data sets, whereas we propose a feature selection strategy that determines the best feature in each source.

7 Future Work and Conclusion

In this work, we introduced the pseudo cold start link prediction with multiple sources problem: given a small subgraph and multiple auxiliary sources, we aim to predict the link structure of the social network. The problem poses great challenges in that traditional link prediction methods are unable to work in this scenario. Multiple auxiliary sources also bring a curse: they imply an explosion of possible features for link prediction. At the same time, multiple sources provide distinct, even complementary, perspectives for predicting the link structure of the social network. To tackle this problem, we proposed a two-phase method. We first proposed an efficient and effective feature selection strategy to cure the curse of multiple sources; the strategy finds the best feature in each source by aligning the candidate features to the known subgraph, and its efficiency is guaranteed by its equivalence to one-dimensional curve fitting. We also discovered that by combining the best feature from each source (which may not be the best features overall), we achieve a better result than the best individual feature among all sources. Considering the risk of over-fitting due to the small size of the known subgraph, in the second phase we proposed a regularization method that maps the points of the social network to a Hilbert space while mandating that the mapping respect the known subgraph and a regularizer. We showed that the proposed regularization increases the performance given the best feature discovered in the first phase. Extensive experimental evaluations verified our assumptions and conclusions, thus showing the benefits of our approach.
In the future, we are interested in exploring more features from each source and designing an efficient, unified framework to find the best feature for link prediction. Also, in this work we used the average ensemble of the best features from each source as the input to the second phase; it is natural to shift to a weighted ensemble of the best features, so an efficient weighting scheme is another goal of interest.

References

[1] D. Liben-Nowell and J. Kleinberg. The link prediction problem for social networks. Journal of the American Society for Information Science and Technology, 2007.
[2] V. Leroy, B. Cambazoglu and F. Bonchi. Cold start link prediction. KDD, 2010.
[3] D. Kempe, J. Kleinberg and E. Tardos. Maximizing the spread of influence through a social network. KDD, 2003.
[4] L. Tang, H. Liu, J. Zhang and Z. Nazeri. Community evolution in dynamic multi-mode networks. KDD, 2008.
[5] N. Du, H. Wang and C. Faloutsos. xSocial: analysis of large multi-modal social networks: patterns and a generator. ECML/PKDD, 2010.
[6] Z. Lu, B. Savas, W. Tang and I. Dhillon. Supervised link prediction using multiple sources. ICDM, 2010.
[7] L. Katz. A new status index derived from sociometric analysis. Psychometrika, 1953.
[8] J. Kandola, J. Shawe-Taylor and N. Cristianini. Learning semantic similarity. NIPS, 2002.
[9] T. Ito, M. Shimbo, T. Kudo and Y. Matsumoto. Application of kernels to link analysis. ICDM, 2005.
[10] F. Fouss, A. Pirotte, J.-M. Renders and M. Saerens. Random-walk computation of similarities between nodes of a graph with application to collaborative recommendation. TKDE, 2007.
[11] A. Smola and R. Kondor. Kernels and regularization on graphs. Proc. Conf. on Learning Theory and Kernel Machines, 2003.
[12] J. Kunegis and A. Lommatzsch. Learning spectral graph transformations for link prediction. ICML, 2011.
[13] E. Senkene and A. Tempel'man. Hilbert spaces of operator-valued functions. Lithuanian Mathematical Journal, 1973.
[14] A. Caponnetto, C. A. Micchelli, M. Pontil and Y. Ying. Universal multi-task kernels. Journal of Machine Learning Research, 2008.
[15] C. A. Micchelli and M. Pontil. On learning vector-valued functions. Neural Computation, 2005.
[16] N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 1950.
[17] J. Mercer. Functions of positive and negative type and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society, 1909.
[18] M. Belkin, P. Niyogi and V. Sindhwani. Manifold regularization: a geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 2006.
[19] C. Brouard, F. d'Alché-Buc and M. Szafranski. Semi-supervised penalized output kernel regression for link prediction. ICML, 2011.
[20] H. Kashima, T. Kato, Y. Yamanishi, M. Sugiyama and K. Tsuda. Link propagation: a fast semi-supervised learning algorithm for link prediction. SDM, 2009.
[21] C. Cortes, M. Mohri and J. Weston. A general regression technique for learning transductions. ICML, 2005.
[22] R. Lichtenwalter, J. Lussier and N. Chawla. New perspectives and methods in link prediction. KDD, 2010.
[23] H. Kautz, B. Selman and M. Shah. ReferralWeb: combining social networks and collaborative filtering. Communications of the ACM, 1997.
[24] M. Hasan, V. Chaoji, S. Salem and M. Zaki. Link prediction using supervised learning. SDM, 2006.
[25] J.-P. Vert and Y. Yamanishi. Supervised graph inference. NIPS, 2004.
[26] Z. Huang. Link prediction based on graph topology: the predictive value of the generalized clustering coefficient. KDD, 2006.
[27] D. Zhou, O. Bousquet, T. N. Lal, J. Weston and B. Schölkopf. Learning with local and global consistency. NIPS, 2004.