Social Ties and Link Prediction Kristina Lerman University of Southern California CS 599: Social Media Analysis University of Southern California 1 Link Prediction Will nodes 33 and 28 become friends in the future? What about nodes 27 and 4? Does network structure contain enough information to predict what new links will form in the future? Who to follow Strength of social ties (review) • Strong ties – surrounded by many mutual friends – characterized by lots of shared time together • Weak ties – have few mutual friends – Serve as bridges to diverse parts of the network – Provide access to novel information The Link-Prediction Problem for Social Networks (Liben-Nowell & Kleinberg) To what extent can the evolution of a social network be modeled using features intrinsic to the network itself? • Formalize the link prediction problem – Given a snapshot of a network, infer which new interactions between nodes are likely to occur in the future • Propose link prediction heuristics based on measures for analyzing the “proximity” of nodes in a network. • Evaluate link prediction heuristics on large coauthorship networks. Future coauthorships can be extracted from network topology. The intuition • In many networks, people who are “close” belong to the same social circles and will inevitably encounter one another and become linked themselves. • Link prediction heuristics measure how “close” people are x x y y Red nodes are close to each other Red nodes are more distant Link prediction heuristics • Local x • • • • Common neighbors (CN) Jaccard (JC) Adamic-Adar (AA) Preferential attachment (PA) … • Global y • Katz score • Hitting time • PageRank … Local link prediction heuristics •Link prediction heuristics – Common neighbors (CN) x y CN 3 • Neighborhood overlap – Jaccard (JC) – Adamic-Adar (AA) – Preferential attachment (PA) Local link prediction heuristics • Link prediction heuristics x – Common neighbors (CN) – Jaccard (JC) • Fraction of common neighbors y CN JC dx dy CN – Adamic-Adar (AA) – Preferential attachment (PA) Link prediction heuristics •Link prediction heuristics – Common neighbors (CN) – Jaccard (JC) – Adamic-Adar (AA) x • Nmbr common neighbors, with each neighbor z attenuated by log of its degree y – Preferential attachment (PA) AA zCN 1 log d z Local link prediction heuristics •Link prediction heuristics x y PA d x d y – – – – Common neighbors (CN) Jaccard (JC) Adamic-Adar (AA) Preferential attachment (PA) • Better connected nodes are more likely to form more links Global link prediction heuristics •Link prediction heuristics – Katz score x • Measures number of paths between two nodes, attenuated by their length – Hitting time y • Expected time for a random walk from x to reach y – … Data • Collaboration networks of physicists – Core nodes: authors who published at least 3 papers during the training period and at least 3 papers during test period • Training data: graph G(t0, t0’) of collaborations during time period [t0, t0‘] with V core nodes and Eold edges • Test data: graph G(t1, t1’) of collaborations during a later time period [t1, t1’] with V core nodes and Enew edges Evaluation metric •Link prediction algorithm • Score node pairs using a heuristic p • New links more likely among high scoring pairs • Each link prediction heuristic p outputs a ranked list L of new collaborations: pairs in VxV-Eold. • Focus evaluation on new links Enew* between core nodes • Performance metric: How many of the top n pairs in ranked list L are the actual new nodes in Enew*? Results Heuristics vs random predictor Results Heuristics vs graph distance predictor Summary • Graph-based link prediction heuristics outperform random guess by a factor of 40 • However, they still predict only 16% of new collaborations at best, leaving much room for improvement. CSCI 599 Social Media Analysis Link prediction in complex networks: a survey Presenter: Yuan Shi USC ID: 7678039433 L Lu and T Zhou, “Link prediction in complex networks: a survey”, Physica A 390(6):11501170 (2011) Link Prediction • Estimate the likelihood of the existence of a link between two nodes, based on observed links and the attributes of nodes • Application – Biological networks: costly to identify links between nodes through field/laboratorial experiments – Online social networks: predicting friendship and recommending new friends (predicting future links in evolving networks) Problem Description and Evaluation Metrics • Undirected network G = (V, E) • Universal set U containing |V|(|V|-1)/2 possible links • Task: Find out missing links in U – E. • Evaluation: randomly split E into two sets: training set ET , probe/validation set EP • k-folder cross validation – Randomly partition into k subsets – Each time one subset is selected as probe set, the others as training set – Repeat k times, each with a different probe set Evaluation Metrics • A link prediction algorithm gives a ranking on each link • AUC (area under the receiver operating characteristic curve) – Focus the whole list of ranks – The probability that a randomly chosen missing link is given a higher score than a randomly chosen nonexistent link • Precision – Focus on the top ranks – Take top-L predicted links, among which Lr links are right, the precision is Lr/L Similarity-Based Algorithms • Assign a score sxy to each pair of nodes x and y • The attributes of nodes are generally hidden -> focus on structural similarity: two nodes are linked if they have similar network structure • Similarity indices – Local similarity Indices: only use local information – Global similarity indices: use global information, more accurate but costly – Quasi-local indices: a tradeoff between local and global Local similarity Indices • 10 indices are discussed. • Common neighbors (CN) set of neighbors • Resource Allocation Index (RA) degree of note z Intuition: • Similarity(x, y) = the amount of resource y received from x • x sends some resource to y, with their common neighbors as transmitters • Each transmitter has a unit of resource and will equally distribute it to all its neighbors • Adamic-Adar Index (AA) Local similarity Indices - Evaluation Metric: AUC. Each number averaged by 10 implementations. Real-world networks PPI: protein-protein interaction NS: co-authorship Grid: electrical power-grid PB: US political blogs INT: router-level Internet USAir: US air transportation CN and AA have second best performance RA performs the best Global similarity Indices • 7 indices are discussed. Some examples are: • Katz Index • Average Commute Time Laplacian matrix • Random Walk with Restart (direct application of PageRank algorithm) • Global indices – Pros: more accurate than local indices – Cons: 1) time-consuming; 2) global topological information may not be available Quasi-local Indices • 3 indices are discussed. • Local Path Index (LP) – Outperforms local indices like RA, AA and CN – Performs competitively to global indices with much less computational cost • Local Random Walk (LRW): at time step t, q is initial configuration function, e.g. • Superposed Random Walk (SRW): at time step t, Some experiments show LRW and SRW performs better than LP Maximum Likelihood Methods • Methodology – Assume some organizing principles of the network structure – Rules and parameters are obtained by maximizing the likelihood of the observed structure – Likelihood of any non-observed link can be calculated according to those rules and parameters • Pros: provide valuable insights into the network organization • Cons: Time consuming; Prediction accuracy is not very high Example: Hierarchical Structure Model Assumption: 1. Each internal node r associated with a probability pr 2. Probability of linking a pair of leaves equals to pr’ where r’ is their lowest common ancestor Some statistics of the graph By maximizing the likelihood, Prediction: 1. Sample a large number of dendrograms with probability proportional to their likelihood 2. Compute the link probability by averaging the corresponding probability over all sampled dendrograms Application • Reconstruction of Networks – Not easy to reconstruct the “true” network since generally no one knows how many links are missing – Reliability of a network Global optimization is difficult -> use greedy algorithms • Classification of Partially Labeled Networks – Predict the labels of these unlabeled nodes based on the known labels and the network structure – Approach: add artificial links between every pair of labeled and unlabeled nodes Application • Evaluation of Network Evolving Mechanisms – link prediction algorithm tells the factors resulting in the existence of links – Example: Similarity indices for the Chinese city airline network CN: topological effects DIS: geographical distance POPU: population GDP TI: third sector of GDP, named the tertiary industry Outlook • Link prediction in directed networks • Multi-dimensional networks, where links could have different meanings (e.g. positive/negative) • Hybrid algorithms to combine different similarity indices • Leveraging external information (e.g. attributes) to improve accuracy • Time-series link prediction approach considering the temporal evolutions of link occurrences Romantic partnerships and the dispersion of social ties Romantic Partnerships and the Dispersion of Social Ties (Backstrom & Kleinberg) • Questions – Who are the most important individuals in a person’s social neighborhood? – What are the defining structural signatures of a person’s social neighborhood? • Contributions – Dispersion: a new measure for estimating tie strength – Characterize romantic relationships in terms of network structure – Empirical study of this characteristic across Facebook population Who are the most important people in one’s social neighborhood? • Following Granovetter, researchers use number of mutual friends (embeddedness) to identify strong ties – Close friends, who share much time together – Emotionally intense interactions A-B tie is highly embedded in the network A-B tie is not embedded in the network C D E F C A A D B B E F Romantic ties • Embeddedness is not able to identify “significant others” (romantic relationships, e.g., spouse, partner, boy/girlfriend) • Ego network – social neighborhood of an individual, showing all his/her friends and links between them Ego network of an individual Who is the “significant other”? Social foci • People have large clusters of friends corresponding to well-defined foci of interaction in their lives – These links have high embeddedness but are not very strong ties • In contrast, romantic partners may have lower embeddedness, but they often involve mutual friends from different foci Co-workers Ego network of an individual College friends Embeddedness vs dispersion Embeddedness: u and v have many mutual neighbors. Links u-b, u-c, and u-f have embeddedness 5 Link u-h has embeddedness 4 Dispersion: mutual neighbors of u and v are not wellconnected to one another, and hence u and v are the only intermediaries between these different parts of the network. Link u-h has high dispersion: u and h are the only intermediaries between c and f Link dispersion disp(u,b)=1 s(u,h)=4 Evaluation • Egonetworks of 1.3 million Facebook users, selected uniformly at random from among all users of age at least 20, with between 50 and 2000 friends, who list a spouse or relationship partner in their profile • Rank all friends by importance. Attempt to identify romantic partners • Measure: Precision of the first position, Pr@1 Performance – Pr@1 • How well does dispersion predict the “significant other”? – precision of the top-ranked person in the individual’s egonet – Beats others measures of interaction between users • viewing of profiles, sending of messages, and co-presence at events (photos) Performance as a function of neighborhood size • Performance is best when the neighborhood size is around 100 nodes (56%), & drops moderately (to 33%) as the egonet size increases by an order of magnitude to 1000 • Interaction features are better for larger neighborhoods, due to users with larger neighborhoods being more active Performance as a function of user’s time on site Best performance when combining features • Predict relationship status of users – Ground truth: 60% of users are in a relationship • Demographic features (age, gender, country, and time on site) work better than network-based features (dispersion) • Best performance combining demographic and network features How does performance vary based on age of the relationship? Marriage • Performance of dispersion measures increases as people approach time of their marriage Persistence of relationships • Transition probability from the status ‘in a relationship’ to the status ‘single’ over a 60-day period. The transition probabilities decrease monotonically, and by significant factors, for users with high normalized or recursive dispersion to their respective partners. Summary • Graph structure contains information predictive of individual relationships – New collaborations – Romantic partnerships • In many cases, graph-based algorithms outperform featurebased machine learning algorithms • These suggest complex interactions between personal relationships and global network structure