SimRank: A Measure of Structural-Context Similarity
Glen Jeh and Jennifer Widom, Stanford University
ACM SIGKDD 2002

January 19, 2011
Taikyoung Kim, SNU IDB Lab.

Outline
– Introduction
– Basic Graph Model
– SimRank
– Random Surfer-Pairs Model
– Conclusion
– Future Work

Introduction
Many applications require a measure of "similarity" between objects
– "Find-similar-document" queries in a search engine
– Collaborative filtering in a recommender system
Propose a general approach that exploits the object-to-object relationships found in many domains
– An algorithm to compute similarity scores between nodes based on their structural context
Intuition behind the algorithm
– Similar objects are related to similar objects
– The base case is that objects are similar to themselves
"Two objects are similar if they are referenced by similar objects"

Basic Graph Model
G = (V, E) [vertex, edge]
– Nodes in V: objects in the domain
– Directed edges in E: relationships between objects
– <p, q>: an edge from object p to object q
For a node v, denote:
– I(v): the set of in-neighbors of v
– O(v): the set of out-neighbors of v
– $I_i(v)$: an individual in-neighbor ( 1 ≤ i ≤ |I(v)| )
– $O_i(v)$: an individual out-neighbor ( 1 ≤ i ≤ |O(v)| )
[Figure: example graph illustrating O(Univ) and I(ProfB)]

SimRank: Motivation
– Two objects are similar if they are referenced by similar objects
– An object is considered maximally similar to itself (similarity score of 1)
Similar nodes: {ProfA, ProfB}, {StudentA, StudentB}, {Univ, ProfB}, …

SimRank: Basic SimRank Equation
The similarity between objects a and b: s(a, b) ∈ [0, 1]

$$
s(a,b) =
\begin{cases}
1 & \text{if } a = b \\
\dfrac{C}{|I(a)|\,|I(b)|} \displaystyle\sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} s\big(I_i(a), I_j(b)\big) & \text{if } a \neq b
\end{cases}
$$

– C is a constant between 0 and 1, called the confidence level or decay factor
– C gives the rate of decay as similarity flows across edges (since C < 1)
– If a or b has no in-neighbors, s(a,b) = 0
– SimRank scores are symmetric, i.e., s(a,b) = s(b,a)
– The similarity between a and b is the average similarity between the in-neighbors of a and the in-neighbors of b, scaled by C
Similarity can be thought of as "propagating" from pair to pair
– Consider the derived graph $G^2 = (V^2, E^2)$, where $V^2 = V \times V$ and each node of $G^2$ represents a pair (a,b) of nodes in G
– An edge from (a,b) to (c,d) exists in $E^2$ iff the edges <a,c> and <b,d> exist in G

SimRank: Bipartite SimRank
Bipartite domains consist of two types of objects
Example: a recommender system
– People are similar if they purchase similar items
– Items are similar if they are purchased by similar people
Bipartite equation
– Directed edges go from people to items
– s(A,B) denotes the similarity between persons A and B (A ≠ B):

$$
s(A,B) = \frac{C_1}{|O(A)|\,|O(B)|} \sum_{i=1}^{|O(A)|} \sum_{j=1}^{|O(B)|} s\big(O_i(A), O_j(B)\big)
$$

– s(c,d) denotes the similarity between items c and d (c ≠ d):

$$
s(c,d) = \frac{C_2}{|I(c)|\,|I(d)|} \sum_{i=1}^{|I(c)|} \sum_{j=1}^{|I(d)|} s\big(I_i(c), I_j(d)\big)
$$

– The similarity between persons A and B is the average similarity between the items they purchased, scaled by $C_1$
– The similarity between items c and d is the average similarity between the people who purchased them, scaled by $C_2$

SimRank: Computing SimRank - Naïve Method
$R_k(a,b)$ gives the score between a and b on iteration k

$$
R_0(a,b) =
\begin{cases}
0 & \text{if } a \neq b \\
1 & \text{if } a = b
\end{cases}
$$

$$
R_{k+1}(a,b) = \frac{C}{|I(a)|\,|I(b)|} \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} R_k\big(I_i(a), I_j(b)\big) \quad (a \neq b), \qquad R_{k+1}(a,a) = 1
$$

The values $R_k(*,*)$ are non-decreasing as k increases, and

$$
\lim_{k \to \infty} R_k(a,b) = s(a,b)
$$

In experiments, $R_k$ converges rapidly; K = 5 iterations were used
Complexity
– Space: $O(n^2)$ to store the results $R_k$
– Time: $O(K n^2 d_2)$, where $d_2$ is the average of |I(a)||I(b)| over all node pairs (a,b)

SimRank: Computing SimRank - Pruning
Pruning the logical graph $G^2$
– In the naïve method, all $n^2$ nodes of $G^2$ are considered: similarity scores are computed for every node pair
– Nodes far from a node v have lower similarity scores with v than nodes near v
Pruning
– Set the similarity between two nodes that are far apart to 0
– Consider only node pairs whose nodes lie within radius r of each other
– Complexity
  space: $O(n d_r)$, where $d_r$ is the average number of nodes within radius r of a node
  time: $O(K n d_r d_2)$

Random Surfer-Pairs Model
Provides an intuitive model for interpreting similarity scores
– Based on "random surfers"
– Shows that the SimRank score s(a,b) measures how soon two random surfers are expected to meet at the same node
Expected Distance (ED)
– u and v are nodes in a strongly connected graph
– The ED from u to v is exactly the expected number of steps a random surfer would take before first reaching v, starting from u

$$
d(u,v) = \sum_{t:\, u \rightsquigarrow v} P[t]\, l(t)
$$

– Tour t = <$w_1$, …, $w_k$>
– l(t): length of t
– P[t]: probability of traveling t
Expected Meeting Distance (EMD)
– EMD is symmetric
– The EMD m(a,b) is simply the expected distance in $G^2$ from (a,b) to any singleton node (x,x) ∈ $V^2$

$$
m(a,b) = \sum_{t:\, (a,b) \rightsquigarrow (x,x)} P[t]\, l(t)
$$

[Figure: example graphs annotated with EMD values: m(*,*) = ∞; m(v,w) = 1, m(u,v) = ∞, m(u,w) = ∞; m(*,*) = 3]
Expected-f Meeting Distance
– Our approach to circumvent the "infinite EMD" problem
– Map all distances to a finite interval: instead of computing the expected length l(t) of a tour, compute the expected value of $c^{l(t)}$

$$
s'(a,b) = \sum_{t:\, (a,b) \rightsquigarrow (x,x)} P[t]\, c^{l(t)}
$$

Equivalence to SimRank
– s'(*,*) exactly matches our original definition of SimRank scores (taking c = C)

Conclusion
Main contributions
– A formal definition of SimRank similarity scoring over arbitrary graphs, several useful derivatives of SimRank, and an algorithm to compute SimRank
– A graph-theoretic model for SimRank that gives intuitive mathematical insight into its use and computation
– Experimental results using an in-memory implementation of SimRank over two real data sets show the effectiveness and feasibility of SimRank

Future Work
Address efficiency and scalability issues
– Including additional pruning heuristics and disk-based algorithms
Consider ternary (or more) relationships in computing structural-context similarity
Explore the combination of SimRank with other domain-specific similarity measures
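
Appendix: Sketch of the Naïve Method
The naïve iteration from the "Computing SimRank" slide can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name simrank_naive, the dictionary-of-in-neighbors representation, the default C = 0.8, and the toy graph's edges are all assumptions made here for the example. It stores all n² pair scores, so it has the O(n²) space cost noted on that slide and does no pruning.

```python
from itertools import product


def simrank_naive(in_neighbors, C=0.8, K=5):
    """Run K rounds of the naive SimRank iteration.

    in_neighbors maps each node to a list of its in-neighbors I(v).
    Returns a dict mapping every ordered pair (a, b) to R_K(a, b).
    (Hypothetical helper written for this sketch.)
    """
    nodes = list(in_neighbors)
    # R_0: 1 on the diagonal, 0 everywhere else.
    R = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, repeat=2)}
    for _ in range(K):
        R_next = {}
        for a, b in product(nodes, repeat=2):
            if a == b:
                # An object is maximally similar to itself.
                R_next[(a, b)] = 1.0
                continue
            Ia, Ib = in_neighbors[a], in_neighbors[b]
            if not Ia or not Ib:
                # No in-neighbors on either side: score is 0.
                R_next[(a, b)] = 0.0
                continue
            # Sum of previous-round scores over all in-neighbor pairs,
            # averaged and scaled by the decay factor C.
            total = sum(R[(i, j)] for i in Ia for j in Ib)
            R_next[(a, b)] = C * total / (len(Ia) * len(Ib))
        R = R_next
    return R


# Toy graph reusing the node names from the slides; the edges are assumed.
# Edges: Univ->ProfA, Univ->ProfB, ProfA->StudentA, ProfB->StudentB,
#        StudentA->Univ, StudentB->ProfB
toy_in_neighbors = {
    "Univ": ["StudentA"],
    "ProfA": ["Univ"],
    "ProfB": ["Univ", "StudentB"],
    "StudentA": ["ProfA"],
    "StudentB": ["ProfB"],
}

scores = simrank_naive(toy_in_neighbors, C=0.8, K=5)
print(scores[("ProfA", "ProfB")])
```

Keeping R and R_next as separate dictionaries ensures each round reads only scores from the previous round, which matches the definition of R_{k+1} in terms of R_k. A pruned variant would restrict the inner loop to pairs within radius r instead of iterating over all n² pairs.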