Modeling Relationship Strength in Online Social Networks
Rongjian Xiang (1), Jennifer Neville (1), Monica Rogati (2)
(1) Purdue University, (2) LinkedIn
WWW 2010
2010. 08. 13.
Summarized and presented by Sang-il Song, IDS Lab., Seoul National University

Introduction – Social Network
  Homophily
    – The tendency of individuals to associate and bond with similar others
    – "Birds of a feather flock together"
    – Found in many real-world and online social networks
  Research areas
    – Network structure analysis
    – Link prediction: "Who will be my friend?"
    – Community detection
    – Item recommendation

Copyright 2010 by CEBT

Introduction
  Past work has focused on social networks with binary ties
    – e.g., friends or not
  Binary indicators provide only a coarse indication of the relationship
    – Pairs of individuals with strong ties (e.g., close friends) are likely to exhibit greater similarity than those with weak ties (e.g., acquaintances)
    – Treating all relationships as equal increases noise and degrades performance
  Pruning away spurious relationships and highlighting stronger relationships has improved the accuracy of the models

Related Works
  I. Kahanda and J. Neville. Using transactional information to predict link strength in online social networks. ICWSM 09
  E. Gilbert and K. Karahalios. Predicting tie strength with social media. CHI 09
  Binary prediction task
    – Strong ties or weak ties
  Supervised learning
    – Requires human annotation effort
    – Friendship rating
    – Top-friend nomination

Goal
  A model to infer relationship strength
    – Based on profile similarity and interaction activity
  Automatically distinguishing strong relationships from weak ones
    – Relationship strength is represented as a continuous value
    – Unsupervised
  Full spectrum of relationship strength, from weak to strong
  Scalable approach
    – Suitable for online applications

Assumptions of the Model
  The higher the similarity, the stronger the tie
    – 'Yongjin' and I have many features in common, so we have a strong relationship
  Relationship strength directly impacts the nature and frequency of online interactions between a pair of users
    – 'Cheongrim' is close to me if he chats with me often in the messenger
  Interactions are independent of one another given the relationship strength

Variables of the Model
  Profile: the data of a specific user
    – $\mathbf{x}^{(i)}$: profile vector of individual $i$
    – e.g., school, company, region, industry, job of the user
  Interaction: the activity between two users
    – $y_t^{(ij)}$: occurrences of interaction $t$ between $i$ and $j$
    – e.g., reply, retweet (in Twitter)
    – e.g., tagging the person in a picture, posting on one's wall (in Facebook)
  Relationship strength
    – $z^{(ij)}$: latent relationship strength

Graphical Model Representation
  [Figure: graphical model in which $\mathbf{x}^{(i)}$ and $\mathbf{x}^{(j)}$ point to $z^{(ij)}$, and $z^{(ij)}$ points to $y_1^{(ij)}, y_2^{(ij)}, \ldots, y_m^{(ij)}$]
  $\mathbf{x}^{(i)}$: profile vector
  $y_t^{(ij)}$: occurrences of the interaction
  $z^{(ij)}$: latent relationship strength
  $P(z^{(ij)}, \mathbf{y}^{(ij)} \mid \mathbf{x}^{(i)}, \mathbf{x}^{(j)}) = P(z^{(ij)} \mid \mathbf{x}^{(i)}, \mathbf{x}^{(j)}) \prod_{t=1}^{m} P(y_t^{(ij)} \mid z^{(ij)})$

Model Specification
  Inferring relationship strength from user profiles
  Using a similarity vector $\mathbf{s} = [s_1, \ldots, s_n]$
    – e.g., $s_k$: 1 if $i$ and $j$ are in the same company, 0 otherwise
    – e.g., $s_l$: logarithm of the normalized count of common groups that $i$ and $j$ have joined
  Adopting the Gaussian distribution
    $P(z^{(ij)} \mid \mathbf{x}^{(i)}, \mathbf{x}^{(j)}) = \mathcal{N}(\mathbf{w}^T \mathbf{s}(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}),\ v)$
    – $\mathbf{w}^T \mathbf{s}$: weighted sum of similarity measures (weights to be estimated)
  [Figure: density $p$ over $z$; the blue curve represents two similar users, the red curve two dissimilar users]

Model Specification
  Inferring relationship strength from interactions
  Modeling all interactions as binary variables
  Introducing auxiliary variables $\mathbf{a}_t^{(ij)}$
    – Capturing auxiliary causes of the interactions that are independent of the relationship strength
    – e.g., the total number of pictures that a user has tagged represents their intrinsic tendency to tag pictures
  Using the sigmoid function $\sigma(x) = \frac{1}{1 + e^{-x}}$
    $P(y_t^{(ij)} = 1 \mid z^{(ij)}, \mathbf{a}_t^{(ij)}) = \frac{1}{1 + e^{-(\theta_{t,1} a_{t,1}^{(ij)} + \ldots + \theta_{t,l} a_{t,l}^{(ij)} + \theta_{t,l+1} z^{(ij)} + b)}}$
    – Weighted sum of the auxiliary variables and $z$
    – $\boldsymbol{\theta}$ is to be estimated

Model Specification
  [Figure: full graphical model in which $\mathbf{x}^{(i)}$ and $\mathbf{x}^{(j)}$ determine $\mathbf{s}^{(ij)}$ and $z^{(ij)}$, and each $y_t^{(ij)}$ depends on $z^{(ij)}$ and $\mathbf{a}_t^{(ij)}$, for $t = 1, \ldots, m$]
  $P(D, \mathbf{w}, \boldsymbol{\theta}) = P(D \mid \mathbf{w}, \boldsymbol{\theta})\, P(\mathbf{w})\, P(\boldsymbol{\theta}) \propto \prod_{(i,j) \in D} \Big( P(z^{(ij)} \mid \mathbf{x}^{(i)}, \mathbf{x}^{(j)}) \prod_{t=1}^{m} P(y_t^{(ij)} \mid \mathbf{a}_t^{(ij)}, z^{(ij)}) \Big) P(\mathbf{w})\, P(\boldsymbol{\theta})$

Inference
  Find the point estimates $\hat{\mathbf{w}}, \hat{\boldsymbol{\theta}}, \hat{z}$ that maximize $\mathcal{L} = P(D, \mathbf{w}, \boldsymbol{\theta})$
  Using a gradient method
  Using Newton-Raphson updates for the weight updates
    – $x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$

Experiment
  Two datasets are prepared for the experiments
  LinkedIn
    – Business-oriented social network
    – Members can search member profiles and job postings
  Facebook

LinkedIn Dataset
  100 seed users and their two-hop neighborhood (100,000 pairs)
  Overall similarity $\mathbf{s}^{(ij)} = [s_1^{(ij)}, \ldots, s_8^{(ij)}]^T$
    s1: 1 if $i$ and $j$ went to the same school, 0 otherwise
    s2: 1 if $i$ and $j$ work in the same company, 0 otherwise
    s3: 1 if $i$ and $j$ are in the same geographical region, 0 otherwise
    s4: 1 if $i$ and $j$ are in the same industry, 0 otherwise
    s5: 1 if $i$ and $j$ have the same job title, 0 otherwise
    s6: 1 if $i$ and $j$ are in the same functional area, 0 otherwise
    s7: logarithm of the normalized count of common groups that $i$ and $j$ have joined
    s8: logarithm of the normalized count of common connections that $i$ and $j$ share
  Interaction features
    y1: 1 if $i$ and $j$ have established a connection, 0 otherwise
    y2: 1 if $i$ has written a recommendation for $j$, 0 otherwise
    y3: 1 if $i$ has viewed $j$'s profile, 0 otherwise
    y4: 1 if $i$ has included $j$ in his or her online LinkedIn address book, 0 otherwise

Evaluation (LinkedIn Dataset)
  Estimating relationship strength with
    – Job
    – Functional area
    – Geographical region
  Measuring how well the estimated relationship strengths identify held-out feature values (same school, same company, same industry)
    – Measuring the area under the ROC curve (AUC)
  Comparing relationship strength to
    – Recommendation links
    – Profile view links
    – Address book links
    – Connection links
    – Interaction count
    – Profile similarity

Receiver Operating Characteristic (ROC)
  TPR (sensitivity), equivalent to hit rate and recall
    – TPR = TP / P = TP / (TP + FN)
  FPR, equivalent to fall-out
    – FPR = FP / N = FP / (FP + TN)
  AUC (Area Under the ROC Curve)

The Result on the LinkedIn Dataset
  [Figure: results on the LinkedIn dataset]

Facebook Dataset
  5 public Purdue Facebook users and their three-hop neighborhood
    – 4500 nodes and 144,712 pairs
  Overall similarity $\mathbf{s}^{(ij)} = [s_1^{(ij)}, s_2^{(ij)}, s_3^{(ij)}]^T$
    – Not using personal profile data
    s1: logarithm of the normalized count of common networks of which $i$ and $j$ are both members
    s2: logarithm of the normalized count of common groups that $i$ and $j$ have joined
    s3: logarithm of the normalized count of common friends that $i$ and $j$ share
  Interactions
    y1: 1 if $i$ has posted on $j$'s wall, 0 otherwise
    y2: 1 if $i$ has tagged $j$ in a picture, 0 otherwise

Evaluation (Facebook Dataset)
  Comparing the relationship strength of the model to other weighted graphs
    – Friendship graph: strong/weak relationships
    – Top-Friend graph: strong relationships
    – Wall graph: interactions
    – Picture graph: interactions
  Evaluating
    – Autocorrelation improvement
    – Classification improvement

Evaluation (Facebook Dataset)
  Autocorrelation
    – Statistical dependency between the values of the same attribute on related instances
    $\chi^2 = \sum_{i \in K} \sum_{j \in K} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$
    – $K$ is the set of possible categorical values of the attribute
    – $O_{ij}$ is the observed occurrence
    – $E_{ij}$ is the expected occurrence
    – The further the observed occurrences depart from the expected ones, the higher the autocorrelation
    – e.g., in a friendship network the geographical region attribute has higher autocorrelation than the favorite baseball team attribute
  Classification performance
    – The Gaussian Random Field (GRF) model is used for classification

Autocorrelation Improvement
  [Figure: autocorrelation improvement]

Classification Improvement
  [Figure: classification improvement]

Conclusions
  A latent variable model for the task of relationship strength estimation
  The latent variable model captures the causality of the underlying social process
  Hybrid approach combining a generative model and a discriminative model
    – Does not suffer from sparsity of interactions
    – The latent variable is inferred using only the upper level of the model
    – Predicting future interactions is also possible, e.g., predicting new connections
  Experiments show that the estimated relationship strength gives higher autocorrelation and better classification performance

Discussions
  General model to estimate relationship strength
  Easy to apply specific domain knowledge
    – Just define the similarity of two users and the interaction distributions
  However, the experimental evaluation is somewhat odd
  No comparison to other state-of-the-art techniques
    – The only comparison is to the raw data
  The similarity function is too simple
    – Especially considering recent techniques

Thank you
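Supplementary sketch for the Model Specification and Inference slides. The Gaussian prior on z and the sigmoid interaction likelihood can be combined into a per-pair objective, and z can then be updated with Newton-Raphson steps. This is only a minimal illustration under stated assumptions: w, theta, v and b are held fixed rather than estimated, and the function names (`infer_z`, `grad_and_hess`) and all numeric values are hypothetical, not taken from the paper.

```python
import math

def sigmoid(x):
    # Numerically stable logistic function 1 / (1 + e^{-x}).
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def grad_and_hess(z, s, w, v, ys, auxs, thetas, b):
    """First and second derivative, with respect to z, of
    log N(z; w.s, v) + sum_t log P(y_t | a_t, z) for one user pair.
    Each theta_t holds the auxiliary weights followed by the weight on z."""
    g = -(z - dot(w, s)) / v           # Gaussian term
    h = -1.0 / v
    for y, a, theta in zip(ys, auxs, thetas):
        theta_aux, theta_z = theta[:-1], theta[-1]
        p = sigmoid(dot(theta_aux, a) + theta_z * z + b)
        g += (y - p) * theta_z         # logistic term
        h -= theta_z ** 2 * p * (1.0 - p)
    return g, h

def infer_z(s, w, v, ys, auxs, thetas, b, iters=50):
    """Newton-Raphson on the derivative: z <- z - g(z) / h(z).
    Assumption: w, thetas, v, b are fixed; only z is updated here."""
    z = dot(w, s)                      # start at the prior mean
    for _ in range(iters):
        g, h = grad_and_hess(z, s, w, v, ys, auxs, thetas, b)
        step = g / h
        z -= step
        if abs(step) < 1e-12:
            break
    return z

# Hypothetical pair: 3 similarity features, 2 interaction types,
# one auxiliary variable per interaction type.
s, w, v = [1.0, 0.0, 1.0], [0.5, 0.2, 0.3], 0.5
auxs, thetas, b = [[1.0], [2.0]], [[0.1, 1.0], [0.2, 1.5]], -1.0
print(infer_z(s, w, v, [1, 1], auxs, thetas, b))  # both interactions observed
print(infer_z(s, w, v, [0, 0], auxs, thetas, b))  # no interaction observed
```

With positive weights on z, observed interactions pull the estimate above the profile-based mean and absent interactions pull it below, which is the behavior the model assumptions describe. A full implementation would alternate such z updates with updates of w and theta.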
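Supplementary sketch for the ROC slide. The TPR and FPR definitions can be turned into a small threshold sweep, with the AUC obtained by the trapezoidal rule. The function names and the toy scores below are illustrative assumptions; in the evaluation described above, the scores would be estimated relationship strengths and the labels a held-out binary feature such as "same school".

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by lowering the decision
    threshold over the distinct score values. labels are 0 or 1."""
    P = sum(labels)                    # number of positives
    N = len(labels) - P                # number of negatives
    pairs = sorted(zip(scores, labels), key=lambda t: -t[0])
    tp = fp = 0
    points = [(0.0, 0.0)]
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Consume all items tied at this score before emitting a point.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / N, tp / P))  # FPR = FP / N, TPR = TP / P
    return points

def auc(points):
    """Area under the piecewise-linear ROC curve (trapezoidal rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Toy example with hypothetical scores.
scores = [0.9, 0.8, 0.7, 0.6]
labels = [1, 0, 1, 0]
print(auc(roc_points(scores, labels)))
```

An AUC of 1.0 means every positive pair is scored above every negative pair, and 0.5 corresponds to a ranking that carries no information.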
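Supplementary sketch for the autocorrelation slide. Assuming the statistic is the standard Pearson chi-square over a contingency table of attribute values at the two ends of each link, it can be computed as below. The function name and the toy data are hypothetical, and expected counts are taken from the row and column marginals, which is the usual independence assumption rather than something the slides state.

```python
from collections import Counter

def chi_square_autocorrelation(linked_values):
    """linked_values: list of (value_at_i, value_at_j), one per linked pair.
    Returns sum over value pairs of (O - E)^2 / E, where O is the observed
    count and E the count expected if the two ends were independent."""
    observed = Counter(linked_values)
    row = Counter(a for a, _ in linked_values)   # marginal counts at end i
    col = Counter(b for _, b in linked_values)   # marginal counts at end j
    total = len(linked_values)
    chi2 = 0.0
    for a in row:
        for b in col:
            expected = row[a] * col[b] / total
            chi2 += (observed[(a, b)] - expected) ** 2 / expected
    return chi2

# Toy example: friends who always share a region versus friends whose
# regions are unrelated.
same_region = [("east", "east")] * 10 + [("west", "west")] * 10
unrelated = ([("east", "east")] * 5 + [("east", "west")] * 5 +
             [("west", "east")] * 5 + [("west", "west")] * 5)
print(chi_square_autocorrelation(same_region))
print(chi_square_autocorrelation(unrelated))
```

To compare graphs as in the Facebook evaluation, the same function could be applied to the links retained by each graph; a weighted variant would replace the counts with sums of edge weights.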