SimRank

SimRank: A Measure of Structural-Context Similarity
Glen Jeh and Jennifer Widom, Stanford University
ACM SIGKDD 2002

January 19, 2011
Taikyoung Kim, SNU IDB Lab.

Outline
- Introduction
- Basic Graph Model
- SimRank
- Random Surfer-Pairs Model
- Conclusion
- Future Work

Introduction
- Many applications require a measure of "similarity" between objects
  - "Find-similar-document" queries in a search engine
  - Collaborative filtering in a recommender system
- The paper proposes a general approach that exploits the object-to-object relationships found in many domains
  - An algorithm to compute similarity scores between nodes based on their structural context
- Intuition behind the algorithm
  - Similar objects are related to similar objects
  - The base case is that objects are similar to themselves

"Two objects are similar if they are referenced by similar objects"

Basic Graph Model
- G = (V, E) [vertices, edges]
  - Nodes in V: objects in the domain
  - Directed edges in E: relationships between objects
  - <p, q>: an edge from object p to object q
- For a node v, denote:
  - I(v): the set of in-neighbors of v
  - O(v): the set of out-neighbors of v
  - I_i(v): an individual in-neighbor (1 ≤ i ≤ |I(v)|)
  - O_i(v): an individual out-neighbor (1 ≤ i ≤ |O(v)|)
- [Figure: example graph illustrating O(Univ) and I(ProfB)]

SimRank: Motivation
- Two objects are similar if they are referenced by similar objects
- An object is considered maximally similar to itself (similarity score of 1)
- Similar nodes in the example graph: {ProfA, ProfB}, {StudentA, StudentB}, {Univ, ProfB}, …

SimRank: Basic SimRank Equation
- The similarity between objects a and b is s(a, b) ∈ [0, 1]:

$$
s(a,b) =
\begin{cases}
1 & \text{if } a = b \\
\dfrac{C}{|I(a)|\,|I(b)|} \displaystyle\sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} s\bigl(I_i(a), I_j(b)\bigr) & \text{if } a \neq b
\end{cases}
$$

- C is a constant between 0 and 1
  - Called the confidence level or decay factor
  - C gives the rate of decay as similarity flows across edges (since C < 1)
- If a or b has no in-neighbors, s(a, b) = 0
- SimRank scores are symmetric, i.e., s(a, b) = s(b, a)
- The similarity between a and b is the average similarity between the in-neighbors of a and the in-neighbors of b, scaled by the decay factor C
- Similarity can be thought of as "propagating" from pair to pair
  - Consider the derived graph G^2 = (V^2, E^2), where
    - V^2 = V x V; each node of G^2 represents a pair (a, b) of nodes in G
    - An edge from (a, b) to (c, d) exists in E^2 iff the edges <a, c> and <b, d> exist in G

SimRank: Bipartite SimRank
- Bipartite domains consist of two types of objects
- Example: a recommender system
  - People are similar if they purchase similar items
  - Items are similar if they are purchased by similar people
- Bipartite equations
  - Directed edges go from people to items
  - s(A, B) denotes the similarity between persons A and B (A ≠ B):

$$
s(A,B) = \frac{C_1}{|O(A)|\,|O(B)|} \sum_{i=1}^{|O(A)|} \sum_{j=1}^{|O(B)|} s\bigl(O_i(A), O_j(B)\bigr)
$$

  - s(c, d) denotes the similarity between items c and d (c ≠ d):

$$
s(c,d) = \frac{C_2}{|I(c)|\,|I(d)|} \sum_{i=1}^{|I(c)|} \sum_{j=1}^{|I(d)|} s\bigl(I_i(c), I_j(d)\bigr)
$$

  - The similarity between persons A and B is the average similarity between the items they purchased
  - The similarity between items c and d is the average similarity between the people who purchased them

SimRank: Computing SimRank - Naïve Method
- R_k(a, b) gives the score between a and b on iteration k:

$$
R_0(a,b) =
\begin{cases}
0 & \text{if } a \neq b \\
1 & \text{if } a = b
\end{cases}
$$

$$
R_{k+1}(a,b) = \frac{C}{|I(a)|\,|I(b)|} \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} R_k\bigl(I_i(a), I_j(b)\bigr) \quad (a \neq b), \qquad R_{k+1}(a,a) = 1
$$

- The values R_k(*, *) are non-decreasing as k increases
- The iteration converges to the SimRank score:

$$
\lim_{k \to \infty} R_k(a,b) = s(a,b)
$$

- In experiments, R_k converges rapidly (K = 5 iterations)
- Complexity
  - Space: O(n^2) to store the results R_k
  - Time: O(K n^2 d_2), where d_2 is the average of |I(a)||I(b)| over all node pairs (a, b)

SimRank: Computing SimRank - Pruning
- Pruning the logical graph G^2
  - In the naïve method,
    - All n^2 nodes of G^2 are considered
    - Similarity scores are computed for every node pair
  - Nodes far from a node v generally have lower similarity to v than nodes near v
- Pruning
  - Set the similarity between two nodes that are far apart to 0
  - Consider node pairs only for nodes that are near each other, within radius r
  - Complexity
    - Space: O(n d_r), where d_r is the average number of nodes within radius r of a node
    - Time: O(K n d_r d_2)

Random Surfer-Pairs Model
- To give intuition for the similarity scores, the paper provides an intuitive model
  - Based on "random surfers"
  - Shows that the SimRank score s(a, b) measures how soon two random surfers are expected to meet at the same node
- Expected Distance (ED)
  - u and v are nodes in a strongly connected graph
  - The ED from u to v is exactly the expected number of steps a random surfer would take before first reaching v, starting from u:

$$
d(u,v) = \sum_{t:\, u \rightsquigarrow v} P[t]\, l(t)
$$

  - Tour t = <w_1, …, w_k>
  - l(t): length of t
  - P[t]: probability of traveling t
- Expected Meeting Distance (EMD)
  - EMD is symmetric
  - The EMD m(a, b) is simply the expected distance in G^2 from (a, b) to any singleton node (x, x) ∈ V^2:

$$
m(a,b) = \sum_{t:\, (a,b) \rightsquigarrow (x,x)} P[t]\, l(t)
$$

  - [Figure: example graphs annotated with EMD values: m(*,*) = ∞; m(v,w) = 1, m(u,v) = ∞, m(u,w) = ∞; m(*,*) = 3]
- Expected-f Meeting Distance
  - The approach used to circumvent the "infinite EMD" problem
  - Map all distances to a finite interval: instead of computing the expected length l(t) of a tour, compute the expected value of c^{l(t)} for a constant c in (0, 1):

$$
s'(a,b) = \sum_{t:\, (a,b) \rightsquigarrow (x,x)} P[t]\, c^{\,l(t)}
$$

- Equivalence to SimRank
  - s'(*, *) models exactly the original definition of SimRank scores

Conclusion
- Main contributions
  - A formal definition for SimRank similarity scoring over arbitrary graphs, several useful derivatives of SimRank, and an algorithm to compute SimRank
  - A graph-theoretic model for SimRank that gives intuitive mathematical insight into its use and computation
  - Experimental results using an in-memory implementation of SimRank over two real data sets show
the effectiveness and feasibility of SimRank

Future Work
- Address efficiency and scalability issues
  - Including additional pruning heuristics and disk-based algorithms
- Consider ternary (or more) relationships in computing structural-context similarity
- Explore the combination of SimRank with other domain-specific similarity measures
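
As a rough illustration of the naïve method from the "Computing SimRank" slide, the following Python sketch iterates R_k for a fixed number of rounds K. The function name `simrank`, the dictionary-of-in-neighbors input format, and the three-node toy graph are assumptions made for this example; they do not come from the paper or the slides.

```python
from itertools import product


def simrank(in_neighbors, C=0.8, K=5):
    """Naive iterative SimRank (illustrative sketch).

    in_neighbors: dict mapping every node to a list of its in-neighbors
                  (hypothetical input format chosen for this example).
    C: decay factor, a constant between 0 and 1.
    K: number of iterations.
    Returns a dict mapping each ordered pair (a, b) to R_K(a, b).
    """
    nodes = list(in_neighbors)

    # R_0(a, b) = 1 if a == b, else 0
    R = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, repeat=2)}

    for _ in range(K):
        R_next = {}
        for a, b in product(nodes, repeat=2):
            if a == b:
                # An object is maximally similar to itself.
                R_next[(a, b)] = 1.0
                continue
            Ia, Ib = in_neighbors[a], in_neighbors[b]
            if not Ia or not Ib:
                # No in-neighbors on either side: similarity is 0.
                R_next[(a, b)] = 0.0
                continue
            # Average of R_k over all pairs of in-neighbors, scaled by C.
            total = sum(R[(i, j)] for i in Ia for j in Ib)
            R_next[(a, b)] = C * total / (len(Ia) * len(Ib))
        R = R_next
    return R


# Toy graph (an assumption for illustration): edges u -> a and u -> b.
toy_in_neighbors = {"u": [], "a": ["u"], "b": ["u"]}
scores = simrank(toy_in_neighbors, C=0.8, K=5)
print(scores[("a", "b")])
```

In this toy graph a and b share their only in-neighbor u, so the recurrence gives s(a, b) = C · s(u, u) = C. The sketch stores all n^2 pairs, matching the O(n^2) space cost noted for the naïve method; it does not implement the pruning variant.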
