
Pseudo Cold Start Link Prediction with Multiple Sources in Social Networks
Liang Ge, Aidong Zhang
Computer Science and Engineering Department, State University of New York at Buffalo
Buffalo, 14260, U.S.A.
liangge@cse.buffalo.edu, azhang@cse.buffalo.edu
Abstract

Link prediction is an important task in social networks and data mining for understanding the mechanisms by which social networks form and evolve. In most link prediction research, it is assumed that either a snapshot of the social network or a social network with some missing links is available. Most existing research therefore approaches this problem by exploring the topological structure of the social network using only one source of information. However, in many application domains, in addition to the social network of interest, there are a number of auxiliary information sources available.

In this work, we introduce pseudo cold start link prediction with multiple sources as the problem of predicting the structure of a social network when only a small subgraph of the social network is known and multiple heterogeneous sources are available. We propose a two-phase supervised method: the first phase is an efficient feature selection scheme that finds the best feature from the multiple sources to be used for predicting the structure of the social network. In the second phase, we propose a regularization method to control the risk of over-fitting induced by the first phase. We assess our method empirically over a large data collection obtained from Youtube. The extensive experimental evaluations confirm the effectiveness of our approach.

1 Introduction

Link prediction, introduced by Liben-Nowell and Kleinberg [1], refers to the following problem: given a snapshot of a network at time t and a future time t', predict the new links that are likely to appear in the network within the interval [t, t']. Link prediction is about understanding to what extent the mechanisms by which networks form and evolve can be modeled by features intrinsic to the network itself [1]. Therefore, most existing research tackles this problem by exploring the topological structure of the social network of interest [1, 9, 12, 26]. In addition to the social network, in many application domains we also have exogenous information, most interestingly auxiliary networks between the same group of people obtained from heterogeneous sources. Take the Facebook network as an example: besides the friendship relations between the users, there are other relations based on group membership, commenting, etc.

In this paper, we tackle the pseudo cold start link prediction (PCS) with multiple sources problem: we still aim to predict the links, but instead of having a snapshot of the social network, we only have knowledge of a small subgraph. Meanwhile, we also obtain multiple auxiliary networks between the same users in the social network. We are interested in predicting possible links among the vertices in the social network by exploiting the multiple auxiliary networks. Notice that when the size of the known subgraph is zero, i.e., we have no knowledge of the existing link structure, and only one source of information is available, the problem reduces to the cold start link prediction introduced by Leroy, et al. [2].

Consider, for instance, an online coupon company G (e.g., Groupon) and a company F (e.g., Facebook) that provides general-purpose social network services. Suppose G and F have made the following agreement: 1) F provides a functionality so that users can share and review their favorite online coupons, and this sharing and these reviews are made available to their contacts; 2) when a user clicks an online coupon, the user is redirected to the corresponding page on G's website; but 3) F keeps the structure of its social network secret (this may be mandatory because of privacy regulations). Thanks to the agreement, users of F might influence each other to purchase online coupons, and the easiest way for them to buy coupons would be via the website of G. In this scenario, G owns information about the purchase history and reviews of its customers. Without violating the privacy regulations of F, G may have crawled a subgraph of the social network of F.

The question herein is: can G nonetheless infer the social network, using the purchase and review information it owns and the small subgraph it crawls? This would be quite helpful for G. It would enable G to adopt viral marketing strategies [3], and if G decides, in the future, to develop its own social networking service, this could be used to recommend possible friends, thus facilitating the initial growth of this social network.

The PCS problem brings distinct challenges with respect to traditional link prediction. Most existing research work in link prediction exploits the topological structure of the social network [1, 9, 12, 26]. However,
in the PCS problem, we only know a small subgraph of
the social network (e.g., the subgraph contains only 5%
of the total nodes). Such setting prevents us from applying many existing approaches of the link prediction.
Since we have obtained multiple auxiliary networks
from heterogeneous sources, we aim to predict the
links in social network by exploring these auxiliary
information. This brings us another challenge: the
curse of multiple sources. For each auxiliary source,
there are a great number of features or predictors that
can be explored or derived to predict the links in
the social network. Nowell, et al. [1] examined 10
topological features for link prediction while Leroy, et
al. [2] enumerated and inspected 15 features from the
group member information alone. Given the multiple
sources available, the possible number of features for
link prediction is overwhelming and the fact that the
computation of each feature in the large scale social
network is not trivial increases our burden. The curse
of multiple sources urges us to propose an efficient
and effective feature selection scheme to find the best
feature for link prediction from the multiple sources.
On the other hand, heterogeneous auxiliary
sources grant us broader perspectives about the link
structure in the social network than any single source.
It is thus expected that the more sources we obtain, the
more precisely we are able to predict. How to achieve
this expectation is yet another challenge in the PCS
problem.
In this paper, we propose a two-phase supervised
method for the PCS problem. In the first phase,
a feature selection approach is proposed to find the
feature in each source that can best infer the links
in the social network. The proposed approach selects
the best feature by aligning the feature to the known
small subgraph. The efficiency of the feature selection
approach is guaranteed by the fact that the alignment
is shown to be equivalent to one-dimensional curve fitting. To control the possible risk of over-fitting in
the first phase, we propose a regularization method in
the second phase. The regularization is achieved by
mapping the original points in the social network to
a Hilbert space and mandating that the mapping respect
the known small subgraph and a regularizer. The
predictions of the links are derived by the inner product
of the mappings. It is noted that the optimal mapping
is from a Reproducing Kernel Hilbert Space (RKHS)
induced by the output of the first phase.
We apply our method to a large data collection obtained from Youtube, a popular online video sharing community. We aim to predict the contact links between users and explore four auxiliary networks (see details in Section 2). The extensive empirical evaluation results demonstrate the power of our approach.
Our contributions can be summarized as follows:
• We introduce pseudo cold start link prediction
(PCS) with multiple sources as the problem of predicting the link structure of a social network assuming a small subgraph and multiple auxiliary networks are available.
• We propose a two-phase supervised method as the
solution to the PCS problem which includes an
efficient feature selection scheme in the first phase
and a regularization method in the second phase.
• We apply our method to predict the link structure of the Youtube social network with four auxiliary networks, and the experimental results show the power of our approach.
• Our approach answers the challenge of the curse of
multiple sources and achieves the intuition that the
more sources we obtain, the more precisely we are
able to predict.
The rest of the paper is organized as follows: in the
next section, we present the formal definition of the
problem and describe the characteristics of the data sets. In
Sections 3 and 4 the proposed two-phase method is
presented. Extensive experimental results are shown
in Section 5. Related work is discussed in Section 6
and finally we discuss future research directions and
conclude the paper.
2 Problem and Data Sets
2.1 Problem Definition  For a set O of n users, we aim to predict the link structure of the social network A_{n×n}. We are given a subgraph of size l (l ≪ n) together with its corresponding links, as well as auxiliary networks over the same n users: B_{n×n}, C_{n×n}, · · · . Predicting the links in A amounts to building a function h: O × O → {0, 1}.
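To make the setting concrete, the following minimal Python sketch (our own illustration; the sizes, densities and variable names are hypothetical and not taken from the paper) sets up a PCS instance: a hidden major network A used only for evaluation, an auxiliary network over the same users, and a small observed subgraph A_ll sampled from A.

import numpy as np

rng = np.random.default_rng(0)
n = 1000                                   # number of users in O

def random_symmetric(density):
    M = rng.random((n, n)) < density
    M = np.triu(M, 1)
    return (M + M.T).astype(float)

A = random_symmetric(0.002)                # major network: unknown, used only for evaluation
B = random_symmetric(0.01)                 # one auxiliary network B_{n x n}; C, D, ... are analogous

l = int(0.10 * n)                          # size of the known subgraph
known = rng.choice(n, size=l, replace=False)
A_ll = A[np.ix_(known, known)]             # the observed subgraph A_{l x l}

# The task is to learn a scoring function h over all user pairs from B, C, ...
# and A_ll; thresholding h(o_i, o_j) yields the 0/1 link prediction for A.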
2.2 Data Sets Youtube is a highly popular online
video sharing community. In Youtube, a user can place
another user as a contact, subscribe to other users' videos and tag favorite videos. In this work, we focus on link prediction in the contact network. In
Youtube, links are directed. Most of the features we use
in the following sections are symmetric, which implies
we predict the same likelihood for the existence of links
in both directions.
We obtain the Youtube data sets from Tang, et al. [4]. The data set is composed of five networks, of which the contact network is the one to predict; we treat the other four networks as auxiliary sources. There are 15088 users in the contact network; we remove the users with zero contacts, after which 13723 users are left. The description of the data sets is as follows:

• Major network A: one user adds another user as a contact. It is the network whose link structure we will predict.

• Co-contact network B: two users are connected if they share a contact.

• Co-subscription network C: two users sharing any subscription are connected.

• Co-subscribed network D: two users are connected if they are subscribed to by the same user.

• Favorite network E: two users are connected if they share favorite videos.

The four auxiliary networks show distinct or even complementary perspectives of the link structure in A. Roughly speaking, these auxiliary networks may reflect the latent reasons that people in Youtube add others as contacts. For example, people may connect to others because they both know the same people or because they share favorite videos, etc.

Table 1: Data set properties

Network   Avg.     Max.   Total
A         11.2     534    153,350
B         1143.2   5637   15,688,290
C         771.0    6517   10,581,208
D         306.2    5686   4,202,604
E         1357.4   7640   18,627,088

The degree properties of the various networks are summarized in Table 1, where the second and third columns show the average and maximum degree of each network and the last column shows the total number of links in each network. It is easy to see that the major network A is rather sparse while the auxiliary networks are much denser. This simple observation shows that the auxiliary networks are verbose, in that a user, for instance, may share favorite videos with many people but only a few of them become his friends (contacts). The direct consequence of this observation is that we cannot infer the links of A by directly using the links in the auxiliary networks.

Almost all social networks are found to follow the scale-free property, i.e., the degree distribution of the social network follows a power-law distribution. Similarly skewed degree distributions are also observed in networks A, B, C, D and E, as shown in Figures 1 and 2.

Figure 1: Degree distribution of Network A. Figure 2: Degree distribution of Networks B, C, D, E.

In the PCS problem, we only know a subgraph of size l. In this paper, we take l to be 5%, 10% and 15% of n and randomly sample l nodes and their links from A as the known subgraph. Notice that the subgraph is small: when l is 10% of n, the subgraph on average contains only 0.9% of the total links. Due to space limits, unless otherwise stated, all the empirical results in this paper assume l is 10% of n. We also report the results when l is 5% and 15% of n.

2.3 Result Presentation  The link prediction problem is intrinsically a very difficult binary prediction problem due to the sparseness of the social network. Given n nodes, we have a space of n² − n possible links, among which only a very small fraction actually exists. In our data, the existing links constitute only approximately 0.08% of all possible links. This means accuracy is not a very meaningful measure in link prediction: in our data sets, by predicting an empty network (no links exist), we can always achieve an accuracy of 99.92%. For the same reason, comparing different functions for link prediction by recall is not appropriate either. In order to compare the performance of different predictive functions, we adopt the Receiver Operating Characteristics (ROC) curve and the Area Under the Curve (AUC) to present our results.

The ROC curve, with the false positive rate on the x-axis and the true positive rate on the y-axis, is a reliable way to compare binary classifiers. As we move along the x-axis, a larger percentage of links is selected (by lowering the link occurrence score threshold), and the percentage of actually occurring links that are included in the selected set monotonically increases. When all links are selected this percentage is 1, so the ROC curve always starts at (0, 0) and ends at (1, 1). The ROC curve associated with a prediction that randomly selects links results in a 45-degree line. The standard measure to assess the quality of the ROC curve is the Area Under the Curve (AUC). The AUC measure is strictly bounded between 0 and 1. The perfect algorithm has an AUC measure of 1 and
the random algorithm has an AUC measure of 0.5. We
normalize AUC to be between 0 and 100.
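As a concrete illustration of this evaluation protocol, the short Python sketch below (our own, not the authors' code; it assumes scikit-learn is available and that the scores and the true adjacency are dense symmetric matrices) computes the AUC over all unordered node pairs and rescales it to the 0-100 range used in this paper.

import numpy as np
from sklearn.metrics import roc_auc_score

def link_prediction_auc(score, A):
    """score, A: n x n symmetric matrices of link scores and true links (0/1)."""
    iu = np.triu_indices(A.shape[0], k=1)   # each undirected pair counted once
    return 100.0 * roc_auc_score(A[iu], score[iu])

# A random scorer gives an AUC around 50, a perfect scorer 100:
# print(link_prediction_auc(np.random.rand(*A.shape), A))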
3 Feature Selection Phase
In the PCS problem, approaches that rely on features intrinsic to the social network do not work because we only know a small subgraph. Naturally, we have
to explore the auxiliary networks to infer the possible
links. Since there are a great number of possible features that can be derived from auxiliary networks, we
encounter the curse of multiple sources. Furthermore,
it has been reported [1, 12] that there is no feature that
is consistently superior in all data sets. This observation makes sense because different social networks may
have distinct mechanisms of formation and evolution.
The massive scale of the social network and auxiliary
networks makes the problem even harder. Therefore,
an efficient and effective feature selection strategy has
to be developed to tackle the curse of multiple sources.
In this section, we first introduce the pool of candidate
features for each source and present the approach to
find the best feature for link prediction in the major
network. After that, we show the empirical evidence to
support our feature selection strategy. Finally, we show that by properly integrating the best feature from each source, we are able to obtain a result that is better than the best feature from any single source.
3.1 The Pool of Candidate Features  For a given auxiliary network, for instance B, we adopt four mappings to construct the link prediction features. They are: 1) B → B, the mapping to itself; 2) B → B̃, the mapping to its normalized matrix, where B̃ = W^{-1/2} B W^{-1/2} and W is the diagonal matrix with W_ii = Σ_j B_ij; 3) B → L, the mapping to its Laplacian, where L = W − B; and 4) B → L̃, the mapping to its normalized Laplacian, where L̃ = I − B̃ and I is the identity matrix of the same size as B.
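For reference, here is a minimal numpy sketch of the four mappings (our own illustration; it assumes a dense symmetric adjacency matrix B and zeroes the degree terms of isolated nodes):

import numpy as np

def four_mappings(B):
    d = B.sum(axis=1)                                        # degrees, W_ii = sum_j B_ij
    d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(d), 0.0)
    B_tilde = B * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]  # W^{-1/2} B W^{-1/2}
    L = np.diag(d) - B                                       # Laplacian L = W - B
    L_tilde = np.eye(B.shape[0]) - B_tilde                   # normalized Laplacian I - B_tilde
    return B, B_tilde, L, L_tilde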
The first set of prediction features we examine is the exponential [8] and von Neumann [9] graph kernels, defined as

(3.1)  $f_{exp}(B) = \exp(\alpha B)$,

(3.2)  $f_{neu}(B) = (I - \alpha B)^{-1}$,

where α is a positive parameter. The exponential and von Neumann kernels are equivalent to path counting features for link prediction. Path counting features capture the proximity between two nodes in terms of the number and length of the paths between them. Exploiting the fact that B^r, the rth power of the network matrix of an unweighted graph, contains the number of paths of length r connecting all node pairs, it is assumed that a pair of nodes connected by many paths should be considered closer than a pair connected by fewer paths. We notice the following:

(3.3)  $\exp(\alpha B) = \sum_{i=0}^{\infty} \frac{\alpha^i}{i!} B^i$,

(3.4)  $(I - \alpha B)^{-1} = \sum_{i=0}^{\infty} \alpha^i B^i$,

Notice that the von Neumann graph kernel is also called the Katz index [7], which enjoys much success in practice. When the network is large, the inversion of I − αB becomes too expensive, so we can stop counting paths after length r. Also, some experiments show that a small r usually yields better results than a large r [6]. Notice that when r = 2, it is equivalent to another popular feature that computes the closeness of a pair by counting the common neighborhood. We thus define the sum of powers as another choice of feature for link prediction:

(3.5)  $f_{sop}(B) = \sum_{i=1}^{r} \alpha^i B^i$,

Notice that if we replace B with B̃ in Eqs. 3.1, 3.2 and 3.5, we obtain another three candidate features for link prediction.

The second set of features for link prediction employs the Laplacian and normalized Laplacian. The Moore-Penrose pseudoinverse of the Laplacian [10] is called the commute time or resistance distance kernel. By regularization of the commute kernel, we obtain the regularized Laplacian kernel [11]. These two are defined as

(3.6)  $f_{comm}(L) = L^{+}$,

(3.7)  $f_{commr}(L) = (I + \alpha L)^{-1}$,
The "commute time" kernel takes its name from the average commute time, which is defined as the average number of steps a random walker, starting from node i ≠ j, will take before entering node j for the first time and going back to i. The kernel is also known as the resistance distance kernel because of a close analogy with the effective resistance in electrical networks.
Another candidate feature is the heat diffusion kernel [9], defined as

(3.8)  $f_{heat}(L) = \exp(-\alpha L)$,

Also, if we replace L with L̃ in Eqs. 3.6, 3.7 and 3.8, we obtain another three candidate link prediction features. Therefore, for an individual source, we can enumerate 12 candidate features to infer the link structure in the major network. The candidate pool is obviously not complete, but it is good enough to produce good performance and it enables an efficient feature selection strategy.
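Because every candidate is a matrix function f(X) of a symmetric mapping X, the whole pool can be generated from a single eigendecomposition of each mapping. The sketch below is our own illustration of this idea (spectral forms of Eqs. 3.1-3.8; the parameter names and the small cutoff used for the pseudoinverse are our assumptions):

import numpy as np

def matrix_function(X, f):
    lam, U = np.linalg.eigh(X)            # X symmetric: X = U diag(lam) U^T
    return (U * f(lam)) @ U.T             # U f(Lambda) U^T

# Scalar (spectral) forms of the candidate kernels:
f_exp   = lambda alpha: (lambda lam: np.exp(alpha * lam))                        # Eq. 3.1
f_neu   = lambda alpha: (lambda lam: 1.0 / (1.0 - alpha * lam))                  # Eq. 3.2 (Katz), needs alpha*lam < 1
f_sop   = lambda alpha, r=3: (lambda lam: sum((alpha * lam) ** i for i in range(1, r + 1)))  # Eq. 3.5
f_comm  = lambda: (lambda lam: np.where(np.abs(lam) > 1e-9, 1.0 / lam, 0.0))     # Eq. 3.6 (pseudoinverse)
f_commr = lambda alpha: (lambda lam: 1.0 / (1.0 + alpha * lam))                  # Eq. 3.7
f_heat  = lambda alpha: (lambda lam: np.exp(-alpha * lam))                       # Eq. 3.8

# e.g. the sum-of-powers feature on the normalized matrix:
# F = matrix_function(B_tilde, f_sop(0.65, r=3))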
3.2 Feature Selection Strategy  We notice that each feature in the pool can be represented as a function f(X), where X is B, B̃, L or L̃. Since we already know a small subgraph of size l, we propose to select the best feature by minimizing

(3.9)  $\min_{f} \| f(X_{l\times l}) - A_{l\times l} \|_F$,

where f is from the pool, X is one of the four mappings, A_{l×l} is the given subgraph, X_{l×l} is the mapping restricted to the same nodes as in A_{l×l}, and ‖·‖_F denotes the Frobenius norm.

To solve Eq. 3.9, we rely on two observations. Firstly, for any symmetric matrix X, f(X) can be computed as f(X) = U f(Λ) U^T, where U is the orthogonal matrix whose ith column is the ith eigenvector of X and Λ is the diagonal matrix whose entry Λ_ii is the ith eigenvalue of X. Secondly, the Frobenius norm is invariant under multiplication by an orthogonal matrix. Based on these two observations, we have

(3.10)  $\min \| f(X_{l\times l}) - A_{l\times l} \|_F = \| U f(\Lambda) U^T - A_{l\times l} \|_F = \| f(\Lambda) - U^T A_{l\times l} U \|_F \sim \sum_i \left( f(\Lambda_{ii}) - U_{\cdot i}^T A_{l\times l} U_{\cdot i} \right)^2$

In the last step of Eq. 3.10, notice that f(Λ) is a diagonal matrix; we can therefore drop all the non-diagonal elements of U^T A_{l×l} U because they are irrelevant to f. The feature selection is thus equivalent to one-dimensional curve fitting, which can be solved efficiently. Notice also that the last step of Eq. 3.10 corresponds to the root mean squared error (RMSE). Furthermore, most of the features in the candidate pool are parameter-laden (except the commute kernel); with Eq. 3.10, we can also tune the parameter by achieving the minimal RMSE.
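The sketch below (our own illustration; the helper names and the parameter grid are assumptions) shows how this alignment reduces to fitting the scalar curve f over the eigenvalues of X_{l×l} against the targets U_{·i}^T A_{l×l} U_{·i} and picking the candidate and parameter with minimal RMSE:

import numpy as np

def alignment_rmse(X_ll, A_ll, f):
    """Last step of Eq. 3.10: RMSE between f(Lambda_ii) and U_{.i}^T A_ll U_{.i}."""
    lam, U = np.linalg.eigh(X_ll)
    targets = np.einsum('ji,jk,ki->i', U, A_ll, U)    # diagonal of U^T A_ll U
    return np.sqrt(np.mean((f(lam) - targets) ** 2))

def select_best_feature(X_ll, A_ll, candidates, alphas=np.linspace(0.01, 0.99, 50)):
    """candidates: dict name -> (alpha -> spectral function). Returns (name, alpha, rmse)."""
    best = None
    for name, make_f in candidates.items():
        for a in alphas:
            err = alignment_rmse(X_ll, A_ll, make_f(a))
            if best is None or err < best[2]:
                best = (name, float(a), err)
    return best

# e.g. candidates = {'heat':  lambda a: (lambda lam: np.exp(-a * lam)),
#                    'commr': lambda a: (lambda lam: 1.0 / (1.0 + a * lam))}
# best = select_best_feature(L[np.ix_(known, known)], A_ll, candidates)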
Now we show some empirical evidence to justify the benefits of the proposed feature selection strategy. For each auxiliary network, we present only the results of the top-3 features in terms of RMSE, together with the AUC and ROC curve of each feature. The detailed results are shown in Table 2, where AUC-A_l×l denotes the performance of the prediction on the known subgraph A_l×l and AUC-A denotes the performance of the prediction on the major network A. The parameter indicates the optimal value of α selected by our approach. One thing missing from this table is the parameter r for the Sum of Power feature, which is selected to be 3. Figures 3 to 6 show the ROC curves for the top-3 features from each auxiliary network.

Table 2: Best features for the auxiliary networks

Auxiliary Network   Best Predictor   Best Mapping   RMSE   Parameter   AUC-A_l×l   AUC-A
B                   Sum of Power     B̃              0.08   0.65        87.3        86.1
B                   Heat Kernel      L              0.22   0.89        73.5        71.2
B                   Commr kernel     L              0.23   0.92        85.8        80.2
C                   Sum of Power     C̃              0.11   0.43        77.7        74.4
C                   Heat Kernel      L              0.38   0.93        68.1        68.4
C                   Sum of Power     C              0.39   0.01        73.8        71.8
D                   Sum of Power     D̃              0.08   0.33        77.3        80.2
D                   Sum of Power     D              0.10   0.01        75.6        77.9
D                   Heat Kernel      L              0.39   0.79        71.1        72.8
E                   Sum of Power     Ẽ              0.09   0.56        71.4        73.8
E                   Sum of Power     E              0.26   0.89        59.1        63.5
E                   Heat Kernel      L              0.28   0.95        68.9        60.7

Figure 3: ROC of the best features from Network B. Figure 4: ROC of the best features from Network C. Figure 5: ROC of the best features from Network D. Figure 6: ROC of the best features from Network E.

Several observations arise from the empirical evaluation. First, our proposed method selects the best feature for each source, i.e., the feature with minimal RMSE
corresponds to the best performance both in the known
subgraph and the social network. Second, the prediction power of each source differs: the network B has
the greatest prediction power. Since network B is a co-contact network, i.e., users are connected if they share a contact, this observation confirms the popular assumption both in our daily life and in social network research: "Friends of friends are friends".
Thirdly, in related work [6] on a similar topic, Lu, et al. tackle the problem of multiple sources by mandating that the Katz index is the best feature for each source, which is not flexible and does not respect the fact that different sources have different best features. Our work shows that the best feature for each source in Youtube is the sum of powers of the normalized network.
Notice that when the data set changes, the best feature for each source may also vary, yet our proposed
approach is still able to capture the best feature in each
source for the link prediction.
Finally, we observe that although the minimal RMSE corresponds to the best performance, when the RMSE value is large, a smaller RMSE does not necessarily indicate better performance. For example, the RMSE for the heat
kernel and regularized commute kernel of B is 0.22 and
0.23, respectively. However, in terms of AUC, regularized commute kernel is superior to the heat kernel. This
happens because large RMSE value means a significant
part of the feature doesn’t fit Al×l . However, RMSE
does not distinguish between mismatch from the major
part and from the outliers.
Figures 7 and 8 (better seen in color) show the curve fitting plots for the heat kernel and the regularized commute kernel, respectively. The x-axis denotes the value of f(Λ_ii) and the y-axis denotes the value of U_·i^T A_l×l U_·i. As we can see, the regularized commute kernel fits the major part more closely. Its RMSE value is bigger than that of the heat kernel because the heat kernel probably fits the outliers more closely. This explains the possible reverse behavior of a large RMSE and its corresponding performance.

Figure 7: Heat kernel curve fitting. Figure 8: CommR kernel curve fitting.
3.3 The Complement of Multiple Sources From
the empirical evidence, we notice that network B has
the greatest prediction power and the sum of powers of normalized B achieves the best performance. We also observe that the best features from the other three auxiliary networks have good performance. As we perceive, people are sophisticated and become friends for a variety of reasons. Therefore, although network B shows the greatest prediction power, which indicates that the latent reason B represents is probably the major reason people become friends in Youtube, we can never safely rule out the other possible reasons that the other three auxiliary networks embody. We believe the four auxiliary networks are to some extent complementary, in that the proper integration of those auxiliary networks should shed more light on how people are connected in Youtube. To explore the complementarity of multiple
sources, we propose two alternatives. One is to select
the best feature from each source and combine them to
perform the prediction. The second is to combine the
best features overall, i.e., select the features that have
top performance. In this case, since network B has the
greatest prediction power, multiple of its features will
be selected.
Figures 9 and 10 show the ROC curves of the two alternatives on the major network A. In Fig. 9, B, C, D, E denote the sums of powers of the normalized matrices B, C, D, E, respectively, and B+C+D+E denotes their average ensemble. The ensemble achieves a better result than the best feature from each source. This supports our assertion that multiple sources are to some extent complementary. In Fig. 10, on the other hand, B+B+C+E denotes the average ensemble of two features from B (the sum of powers and the regularized commute kernel) and the best features from C and E. We note that the performance of this ensemble is inferior to the first alternative but still better than the best feature of each source.

Figure 9: ROC curve of the first alternative. Figure 10: ROC curve of the second alternative.

Figure 11 shows the ensemble of the top-2 features in B, i.e., the sum of powers of normalized B and the regularized commute kernel, compared with the performance of the individual features. As we can see, the performance of this ensemble, although very close to that of the sum of powers of normalized B, is still worse. This indicates that the performance increase in Figure 10 occurs because of the ensemble of features from different sources. It also explains why the ensemble in Figure 10 has worse performance than that in Figure 9.

Figure 11: Ensemble of two features from B.

Table 3: Performance of different combinations on the major network A

      B      C      D      E
B     84.1   87.5   87.1   84.7
C     -      72.9   84.2   78.0
D     -      -      80.7   82.6
E     -      -      -      72.7

Table 3 presents the performance of different combinations of the best feature from each source. Any cell in the table denotes the average ensemble of two features. Since the table is symmetric, we only show its upper triangle. It is clear that any combination of two features boosts the performance compared to the feature by itself. We also experiment with the combinations B+C+E, B+D+E and C+D+E. These combinations all boost the performance over the individual sources. The results are inspiring in that we are able to achieve the desirable goal that the more sources we obtain, the more precisely we can predict.
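A minimal sketch of the average ensemble used above (our own illustration; rescaling each score matrix to [0, 1] before averaging is our assumption, since the paper does not specify a scaling):

import numpy as np

def average_ensemble(score_matrices):
    scaled = []
    for S in score_matrices:
        lo, hi = S.min(), S.max()
        scaled.append((S - lo) / (hi - lo + 1e-12))   # put every source on a comparable scale
    return np.mean(scaled, axis=0)

# e.g. ensemble = average_ensemble([F_B, F_C, F_D, F_E])   # best feature per source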
4 Regularization
The feature selection phase chooses the feature that
has the best performance. The feature is selected by
aligning with the known subgraph Al×l . In the PCS
problem, Al×l is usually small, thus the selected feature
is prone to over-fitting. It is important to control the
over-fitting through a proper regularization.
Inspired by the work of Vert, et al. [25], we propose
the regularization strategy as follows:
• In the major network A, construct a mapping h
that maps the original nodes of A to a Hilbert
space.
• The mapping h must respect the known subgraph
Al×l as much as possible and follow a regularization.
• The prediction is made by computing the inner
product between the mappings of two nodes.
It is noted that the optimal mapping ĥ comes from
the Reproducing Kernel Hilbert Space (RKHS) induced
by the output of the first phase. The RKHS theory
for real valued functions provides a powerful theoretical framework for regularization. Numerous models including support vector machines can be derived from
the application of the representer theorem and different
data-dependent cost functions.
In the following, we briefly review the main ingredients of the RKHS theory for vector-valued functions
that are needed for our regularization.
4.1 RKHS for Vector-valued Functions  For a given Hilbert space F†, L(F†) denotes the set of bounded linear operators from F† to itself. Given Z ∈ L(F†), Z* denotes the adjoint of Z.

Definition 1 [13, 14, 19] Let O be a set and F† a Hilbert space. K_x: O × O → L(F†) is a kernel if:

• $\forall (o, o') \in O \times O,\; K_x(o, o') = K_x(o', o)^*$;

• $\forall m \in \mathbb{N},\; \forall \{(o_i, y_i)\}_{i=1}^{m} \subseteq O \times F_\dagger,\; \sum_{i,j=1}^{m} \langle y_i, K_x(o_i, o_j)\, y_j \rangle_{F_\dagger} \ge 0$.

The kernel defined here is an operator-valued kernel, with output in L(F†) instead of R. The above definition is similar to that of an ordinary scalar kernel (symmetric, semi positive-definite).
Theorem 1 [13, 14, 19] Let O be a set and F† a Hilbert space. If K_x: O × O → L(F†) is an operator-valued kernel, then there exists a unique RKHS H which admits K_x as its reproducing kernel, that is,

$\forall o \in O, \forall y \in F_\dagger, \quad \langle h, K_x(o, \cdot)\, y \rangle_{H} = \langle h(o), y \rangle_{F_\dagger}.$

Theorem 1 is analogous to the Moore-Aronszajn theorem [16] describing the relation between a kernel and the RKHS of real-valued functions: each kernel defines a unique RKHS. With the operator-valued kernel and Theorem 1, we have the following representer theorem.

Theorem 2 [14, 19] Let O be a set and F† a Hilbert space. Given a set of labeled samples $S_l = \{(o_i, y_i)\}_{i=1}^{l} \subseteq O \times F_\dagger$ and an RKHS H with reproducing kernel K_x: O × O → L(F†), the minimizer ĥ of the regularization problem

(4.11)  $\operatorname*{argmin}_{h \in H} J(h) = \sum_{i=1}^{l} \| h(o_i) - y_i \|_{F_\dagger}^2 + \lambda \| h \|_{H}^2$

with λ > 0 admits an expansion

$\hat{h}(\cdot) = \sum_{j=1}^{l} K_x(o_j, \cdot)\, c_j,$

where the vectors $c_j \in F_\dagger$, $j \in \{1, \cdots, l\}$, satisfy the equations

$y_j = \sum_{i=1}^{l} \left( K_x(o_i, o_j) + \lambda \delta_{ij} \right) c_i,$

where δ is the Kronecker symbol: δ_ii = 1 and δ_ij = 0 for all j ≠ i.

Theorem 2 is devoted to penalized least squares. Eq. 4.11 has the desirable property for our regularization: it respects the known subgraph A_l×l as much as possible and follows a regularizer. To benefit from the RKHS theory, we should define a proper operator-valued kernel based on the features selected in the first phase and find a representation {y_i}_{i=1}^l for the labels of the nodes in the major network A.

4.2 Regularization  First we show how to obtain an operator-valued kernel. If we have a scalar kernel k_x: O × O → R, we can define the operator-valued kernel K_x: O × O → L(F†) as follows:

$K_x(o, o') = k_x(o, o') \, I_{F_\dagger},$

where $I_{F_\dagger}$ is the identity matrix whose size is the dimension of F†. Clearly, K_x is symmetric. Furthermore, the semi positive-definiteness of k_x leads to the semi positive-definiteness of K_x: for all m ∈ N and all {(o_i, y_i)}_{i=1}^m ⊆ O × F†, we have

$\sum_{i,j=1}^{m} \langle y_i, K_x(o_i, o_j)\, y_j \rangle_{F_\dagger} = \sum_{i,j=1}^{m} k_x(o_i, o_j) \langle y_i, y_j \rangle_{F_\dagger} \ge 0.$
The input scalar kernel is chosen to be the best
feature K in the first phase, which, in this case, is the
average ensemble of the sum of powers of normalized
networks. It is noted that this feature may not be a
semi positive-definite matrix, thus not a kernel per se.
Theoretically, a kernel matrix K must be positive semidefinite. Empirically, for machine learning heuristics,
choices of K that do not satisfy Mercer’s condition
may still perform reasonably if K at least approximates
the intuitive idea of similarity. The feature for link
prediction captures the idea of proximity between users,
thus can be used as a reasonable input to construct the
operator-valued kernel.
In this context, the known subgraph Al×l is viewed
as the training set. But different from normal classification, we now don’t have labels {yi }li=1 for the nodes
in the training set, rather we have labels for the links
in the training set. To obtain the labels, we construct a
Gram matrix KYl from the training set Al×l using any
kernel that encodes the proximities of nodes in a given
graph. In this work, we use the heat diffusion kernel:
KYl = exp(−βLYl ), where LYl is the Laplacian of Al×l .
We assume that a semi positive-definite kernel ky :
O × O → R underlies this Gram matrix. By Mercer's theorem [17], there exists a Hilbert space F† and a mapping y: O → F† such that $\forall (o, o') \in O \times O$, $k_y(o, o') = \langle y(o), y(o') \rangle_{F_\dagger}$. In this way, we obtain the label information {y_i}_{i=1}^l about the nodes in the training set. Intuitively, y_i can be seen as a vector which represents the
position in the Hilbert space F† of the node oi in the
training set.
When K_x and {y_i}_{i=1}^l are defined, the solution to Eq. 4.11 reads

$C = Y_l (K_{X_l} + \lambda I_l)^{-1},$

where $Y_l = (y_1, \cdots, y_l)$, $K_{X_l}$ is the Gram matrix of size l × l associated with the kernel k_x, and $I_l$ is the identity matrix of size l.

Now we can compute the prediction between nodes o_i and o_j as follows:

$\langle h(o_i), h(o_j) \rangle_{F_\dagger} = \langle C\, k_x(o_{1,\cdots,l}, o_i),\; C\, k_x(o_{1,\cdots,l}, o_j) \rangle_{F_\dagger} = k_x(o_{1,\cdots,l}, o_i)^T Q^T K_{Y_l} Q\, k_x(o_{1,\cdots,l}, o_j),$

where $Q = (K_{X_l} + \lambda I_l)^{-1}$ and $k_x(o_{1,\cdots,l}, o_i)$ is the l × 1 vector whose kth element is $k_x(o_k, o_i)$, with o_k a node in the training set. It is noted that the labels {y_i}_{i=1}^l of the nodes are never explicitly computed.
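Putting the pieces together, the following Python sketch (our own reading of the formulas above; the variable names and the dense-matrix assumption are ours) computes the regularized prediction scores for all node pairs from a first-phase feature matrix used as the scalar kernel k_x:

import numpy as np
from scipy.linalg import expm

def regularized_scores(K_x, known, A_ll, lam=1.0, beta=1.0):
    """K_x: n x n first-phase feature used as the scalar kernel k_x;
    known: indices of the l training nodes; A_ll: the known subgraph A_{l x l}."""
    d = A_ll.sum(axis=1)
    L_Yl = np.diag(d) - A_ll                          # Laplacian of the known subgraph
    K_Yl = expm(-beta * L_Yl)                         # label Gram matrix: heat diffusion kernel
    K_Xl = K_x[np.ix_(known, known)]                  # l x l Gram matrix of k_x on training nodes
    Q = np.linalg.inv(K_Xl + lam * np.eye(len(known)))
    k_train = K_x[known, :]                           # column i is k_x(o_{1..l}, o_i)
    M = Q.T @ K_Yl @ Q
    return k_train.T @ M @ k_train                    # entry (i, j) = <h(o_i), h(o_j)>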
Theoretically, any kernel can act as the input scalar kernel for the regularization. However, as our
empirical evaluation will show, the quality of the input
kernel has a huge influence on the performance of the
regularization strategy. Simply put, the kernel that
performs better in the first phase will have better
performance than the kernel that performs worse in the
first phase. Therefore, the first phase is justified from
another perspective as the proper kernel selection for
the regularization.
In the following, we show empirical evidence that supports two claims: 1) regularization is able to boost the performance from the first phase, and 2) different input kernels have a great influence on the performance of the regularization.

Figure 12: ROC curve for the regularization. Figure 13: ROC curves for different input kernels.

In Figure 12, the legends B, C, D, E and B+C+D+E stand for the best feature from each source in the first phase and the average ensemble of the best features. The black curve denotes the performance of the regularization with input kernel B+C+D+E. As seen from this figure, the regularization boosts the performance from the first phase. In Figure 13, the results are for the regularization with different input kernels. In this plot, the legends B, C, D, E and B+C+D+E indicate that we choose the sum of powers of the corresponding normalized auxiliary matrix, or the average ensemble, as the input kernel to the regularization. It is clear from the diagram that different kernels yield varied performance and that the best input kernel from the first phase achieves the best performance.

5 Empirical Evaluation

In the previous sections, we presented empirical evidence to support our conclusions. To make the empirical evaluation more complete, several issues need to be discussed.

5.1 Low Rank Approximation  In Section 3, we need to compute the various features of the auxiliary networks. Those networks are huge and dense, therefore computing features from them is not trivial. Brute force is seldom an option; instead, we use a low rank approximation of the features. Since all our features in Section 3 can be represented as f(X) with X a symmetric matrix, a rank-k approximation of f(X) is obtained by truncation, keeping only k eigenvalues and eigenvectors:

$f_{(k)}(X) = U_{(k)} f(\Lambda_{(k)}) U_{(k)}^T,$

where the largest k eigenvalues are used when X is an auxiliary network or a normalized auxiliary network, while the smallest eigenvalues are used for the Laplacian and normalized Laplacian. It is envisioned that the parameter k will have an influence on the performance.
Table 4: Influence of Low Rank Approximation

Feature   k = 100   k = 200   Time (k = 100)   Time (k = 200)
Sopb      84.1      86.0      34s              86s
Sopc      74.4      72.9      21s              77s
Sopd      80.2      80.7      19s              36s
Sope      73.8      72.3      61s              97s

In Table 4, the first two columns show the AUC for different k and the last two columns show the time to compute each feature. Sopb denotes the sum of powers of normalized B. In Table 4, we see the influence of the low rank approximation on the performance and the running time. The experiments are carried out on an Intel Duo Core 2.7G machine with 4 GB of memory. Clearly, increasing k increases the running time, but a larger k does not necessarily indicate better performance. In our work, we take k to be 200.
5.2 The Known Subgraph  In our PCS problem, we know a small subgraph. In the empirical evidence shown in Sections 2 and 3, we take the size of the subgraph l to be 10% of n, where n is the size of A. To completely evaluate the influence of l, we take l to be 5%, 10% and 15% of n, randomly sample l users and their corresponding links from the major network 10 times as the known subgraph, and summarize the results in the following table.

Table 5: Influence of the Known Subgraph

Method           5%           10%          15%
Sopb             85.5 ± 0.7   86.1 ± 0.5   86.2 ± 0.5
Ensemble         88.4 ± 0.6   89.0 ± 0.4   89.3 ± 0.5
Regularization   90.2 ± 0.3   90.8 ± 0.3   91.2 ± 0.4

In this table, Sopb denotes the best feature from B (the sum of powers of normalized B), Ensemble denotes the average ensemble of the best feature from each source, and Regularization denotes the regularization's performance. We notice that with the increase of the training size, the performance of Sopb and Ensemble hardly increases. This happens because, for a specific data set, the feature from each source that performs best for link prediction in the social network is fixed; the same feature is always selected despite the varied training size. Regularization does show a performance increase with the training size. This satisfies the basic machine learning assumption that with a bigger training set, we can do better.

5.3 Parameter Sensitivity  The performance of the regularization is sensitive to the parameter λ. We tune the parameter by achieving the best prediction performance on the training set. Figure 14 shows the performance curve over different values of λ for one experiment.

Figure 14: Parameter λ sensitivity.

Another parameter is β in the heat diffusion kernel that represents the proximity relations between nodes. We also tune this parameter by achieving the best prediction performance on the training set. We notice that this parameter is only remotely relevant to the performance of the prediction, so we omit the reports here.

6 Related Work
Liben-Nowell and Kleinberg [1] introduce the link prediction problem and show that simple graph-theoretic
measures, such as the number of common neighbors, are
able to detect links that are likely to appear in a social
network. By adopting more elaborate measures that consider the ensemble of all paths between two nodes (e.g., the Katz index [7]), they further improve the prediction quality.
Inspired by the seminal work of Liben-Nowell and Kleinberg, a great deal of attention has been paid to discovering unsupervised measures that can better predict the link structure [9, 22, 26]. Those link prediction methods are ad hoc, and consequently no method is consistently superior on all data sets. This makes sense in that different networks have distinct mechanisms of formation and evolution, as shown in [22]. However, it is noted that the Katz index [7] and its variant, the truncated Katz index, have shown success in many applications [1, 6].
Link prediction can also be viewed as a binary classification problem, in which the labels of the links are either 0 (non-existing) or 1 (existing). Hasan, et al. [24] propose to employ various classification approaches to handle link prediction directly. Lichtenwalter, et al. [22] notice the extremely imbalanced class distribution of the links and propose an over-sampling strategy to overcome the imbalance. Vert, et al. [25] propose a novel supervised model in which points in the network are first mapped to a Euclidean space and the prediction is made by computing the distance between the points in that space. The mapping is mandated to conform to the existing link structure. Inspired by the recent development of semi-supervised learning in the machine learning community [18], Kashima, et al. [20] propose to apply the label propagation framework [27] to the link prediction problem and are able to bring the time complexity of their algorithm down from O(n^4) to O(n^3). Brouard, et al. [19] present a new representer theorem for semi-supervised link prediction and show that semi-supervision is able to boost the performance of supervised link prediction.
Most of the research work on link prediction assumes that we know a general link structure of the network, either the network at time t or the network with some missing links. In [2], Leroy, et al. introduce the cold start link prediction problem by assuming that there is no known link structure available. Their ambition is to rebuild the network from the auxiliary information. However, they manually inspect the possible features from a single auxiliary source and find the best feature by comparing with the ground truth. When we have multiple auxiliary sources, their approach would lose much of its efficiency.
Several researchers place their focus on problems related to multiple sources in social networks. Tang, et al. [4] study the problem of clustering the social network in the context of multiple sources, or multi-dimensional social networks. Du, et al. [5] study the behavior of the evolution of a large multi-modal social network from Nokia. In the work most related to ours, Lu, et al. [6] propose a weighted ensemble model to combine different features from various sources. They also propose two regularization strategies to control the complexity of the model. One limitation is that they mandate the Katz index to be the best feature from each source. Such an assumption may not hold in other data sets, whereas we propose a feature selection strategy to determine the best feature in each source.
7 Future Work and Conclusion
In this work, we introduced the pseudo cold start link prediction with multiple sources problem, in which, given a small subgraph and multiple auxiliary sources, we aimed to predict the link structure of the social network. The problem posed great challenges in that traditional link prediction methods are unable to work in this scenario. Multiple auxiliary sources also brought a curse: multiple sources imply an explosion of possible features for link prediction. At the same time, multiple sources offer various distinct, even complementary, perspectives for predicting the link structure in the social network.
To tackle this problem, we proposed a two-phase method. We first proposed an efficient and effective feature selection strategy to cure the curse of multiple sources. The strategy finds the best feature in each source by aligning candidate features with the known subgraph; its efficiency is guaranteed by its equivalence to one-dimensional curve fitting. We also discovered that by combining the best feature from each source (they may not be the best features overall), we achieved a better result than the best individual feature among all sources. Considering the risk of over-fitting due to the small size of the known subgraph, in the second phase we proposed a regularization method. The regularization first maps the points in the social network to a Hilbert space and mandates the mapping to respect the known subgraph and a regularizer. We showed that the proposed regularization increases the performance given the best feature discovered in the first phase. Extensive experimental evaluation verified
our assumptions and conclusions, thus showing the
benefits of our approach.
In the future, we are interested in exploring more features from each source and designing an efficient, unified framework to find the best feature for link prediction. Also, in this work we used the average ensemble of the best feature from each source as the input to the second phase; it is natural to shift to a weighted ensemble of the best features, and an efficient weighting scheme is therefore a goal of future interest.
References
[1] D. Liben-Nowell and J. Kleinberg. The link prediction problem for social networks. Journal of the American Society for Information Science and Technology, 2007.
[2] V. Leroy, B. Cambazoglu and F. Bonchi. Cold start link prediction. KDD, 2010.
[3] D. Kempe, J. Kleinberg and E. Tardos. Maximizing the spread of influence through a social network. KDD, 2003.
[4] L. Tang, H. Liu, J. Zhang and Z. Nazeri. Community evolution in dynamic multi-mode networks. KDD, 2008.
[5] N. Du, H. Wang and C. Faloutsos. xSocial: analysis of large multi-modal social networks, patterns and a generator. ECML/PKDD, 2010.
[6] Z. Lu, B. Savas, W. Tang and I. Dhillon. Supervised link prediction using multiple sources. ICDM, 2010.
[7] L. Katz. A new status index derived from sociometric analysis. Psychometrika, 1953.
[8] J. Kandola, J. Shawe-Taylor and N. Cristianini. Learning semantic similarity. NIPS, 2002.
[9] T. Ito, M. Shimbo, T. Kudo and Y. Matsumoto. Application of kernels to link analysis. ICDM, 2005.
[10] F. Fouss, A. Pirotte, J.-M. Renders and M. Saerens. Random-walk computation of similarities between nodes of a graph with application to collaborative recommendation. TKDE, 2007.
[11] A. Smola and R. Kondor. Kernels and regularization on graphs. Proc. Conf. on Learning Theory and Kernel Machines, 2003.
[12] J. Kunegis and A. Lommatzsch. Learning spectral graph transformation for link prediction. ICML, 2011.
[13] E. Senkene and A. Tempel'man. Hilbert spaces of operator-valued functions. Lithuanian Mathematical Journal, 1973.
[14] A. Caponnetto, C. Micchelli, M. Pontil and Y. Ying. Universal multitask kernels. Journal of Machine Learning Research, 2008.
[15] C. Micchelli and M. Pontil. On learning vector-valued functions. Neural Computation, 2005.
[16] N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 1950.
[17] J. Mercer. Functions of positive and negative type and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society, 1909.
[18] M. Belkin, P. Niyogi and V. Sindhwani. Manifold regularization: a geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 2006.
[19] C. Brouard, F. d'Alché-Buc and M. Szafranski. Semi-supervised penalized output kernel regression for link prediction. ICML, 2011.
[20] H. Kashima, T. Kato, Y. Yamanishi, M. Sugiyama and K. Tsuda. Link propagation: a fast semi-supervised learning algorithm for link prediction. SDM, 2009.
[21] C. Cortes, M. Mohri and J. Weston. A general regression technique for learning transductions. ICML, 2005.
[22] R. Lichtenwalter, J. Lussier and N. Chawla. New perspectives and methods in link prediction. KDD, 2010.
[23] H. Kautz, B. Selman and M. Shah. Referral Web: combining social networks and collaborative filtering. Communications of the ACM, 1997.
[24] M. Hasan, V. Chaoji, S. Salem and M. Zaki. Link prediction using supervised learning. SDM, 2006.
[25] J. Vert and Y. Yamanishi. Supervised graph inference. NIPS, 2004.
[26] Z. Huang. Link prediction based on graph topology: the predictive value of the generalized clustering coefficient. KDD, 2006.
[27] D. Zhou, O. Bousquet, T. N. Lal, J. Weston and B. Schölkopf. Learning with local and global consistency. NIPS, 2004.