Romantic Partnerships and the Dispersion of Social Ties: A Network Analysis of Relationship Status on Facebook
By Lars Backstrom, Jon Kleinberg
Presented by: Marina Simakov

Main goal
• How can we identify important people within an individual’s network neighborhood?
• The authors investigate this question for strong ties involving spouses or romantic partners.
• Given all the connections among a person’s friends, can you recognize his or her romantic partner from the network structure alone?

A Network Neighborhood
• An individual’s network neighborhood is the set of people to whom he or she is linked.
• A person’s network neighbors encompass a profoundly diverse set of relationships.
• An important issue for the analysis of on-line social networks is to use features in the available data to recognize this variation across types of relationships.
• Methods that do this effectively can play an important role in many applications at the interface between an individual and the rest of the network.

Tie Strength
• Tie strength informally refers to the ‘closeness’ of a friendship: a spectrum that ranges from strong ties with close friends to weak ties with more distant acquaintances.
• Some fundamental questions connected to the understanding of strong ties:
  • How can we identify the most important individuals in a person’s social network neighborhood using the underlying network structure?
  • What are the defining structural signatures of a person’s strongest ties, and how do we recognize them?

Dataset and Problem Description
• The dataset contains 1.3 million randomly sampled Facebook users who declared a relationship partner in their profile (‘married’, ‘engaged’ or ‘in a relationship’).
• Given a Facebook user with a declared relationship partner, hide the identity of this partner.
• Problem: Given the user’s network neighborhood (the set of all friends and the links among them), how accurately can we identify the relationship partner using this structural information alone?

Embeddedness
• The standard characterization of a tie’s strength is embeddedness.
• Embeddedness: the number of mutual friends two people share.
• This is a key structural feature used for analyzing and estimating tie strength in on-line domains.
• Embeddedness typically increases with tie strength.
• Are there other structural measures that may be more appropriate for characterizing particular types of strong ties?

Embeddedness
• Embeddedness has served as the key definition in structural analyses for the special case of relationship partners.
• It captures how much the two partners’ social circles ‘overlap’.
• A natural predictor for identifying a user u’s partner: select the link from u of maximum embeddedness, and propose the other end v of this link as u’s partner.
[Figure: example network with nodes u, a, b, c, d]

Embeddedness
• The embeddedness-based predictor, and others, are evaluated according to their performance: the fraction of instances on which they correctly identify the partner.
• Embeddedness achieves a performance of 24.7%. This provides evidence about the power of structural information for this task, but also offers a baseline that other approaches can potentially exceed.
• It is possible to achieve more than twice the performance of this embeddedness baseline using a new network measure: dispersion.

Dispersion
• The measure of dispersion looks not just at the number of mutual friends of two people, but also at the network structure on these mutual friends.
• A link between two people has high dispersion when their mutual friends are not well connected to one another.
[Figure: a link of high dispersion next to a link of low dispersion]

Theoretical Basis for Dispersion
• A basic limitation of embeddedness as a predictor follows from the theory of social foci.
• Many individuals have large clusters of friends corresponding to well-defined foci of interaction in their lives (family, co-workers, college, etc.).
• Many people within these clusters know each other.
• The clusters contain links of very high embeddedness, even though they do not necessarily correspond to particularly strong ties.

Theoretical Basis for Dispersion
• The link to a person’s relationship partner is usually characterized by:
  • Lower embeddedness.
  • Mutual neighbors from several different foci.
• For example:
  • A husband who knows several of his wife’s co-workers, family members, and former classmates.
  • These people belong to different foci and do not know each other.

Theoretical Basis for Dispersion
• Thus, rather than high embeddedness, the link between an individual u and his or her partner v should display a ‘dispersed’ structure.
• The mutual neighbors of u and v are not well-connected to one another.
• Hence u and v act jointly as the only intermediaries between these different parts of the network.

Definitions
• $G_u$: the subgraph induced on u and all neighbors of u.
• For a node v in $G_u$ we define $C_{uv}$ to be the set of common neighbors of u and v.
• $d_v$: the distance function on the nodes of $C_{uv}$.
• The absolute dispersion of the u-v link, disp(u,v), is the sum of all pairwise distances between nodes in $C_{uv}$ as measured in $G_u - \{u, v\}$:
$$\mathrm{disp}(u,v) = \sum_{s,t \in C_{uv}} d_v(s,t)$$

The function $d_v$
• Different choices of $d_v$ give rise to different measures of absolute dispersion.
• The best performance was obtained when $d_v(s,t)$ was defined to be:
$$d_v(s,t) = \begin{cases} 1, & \text{if } s \text{ and } t \text{ are not directly linked and also have no common neighbors in } G_u \text{ other than } u \text{ and } v \\ 0, & \text{otherwise} \end{cases}$$
• In the following discussion, we use this distance function as the basis for the measures of dispersion.
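
The definitions of embeddedness, $d_v$ and absolute dispersion can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the graph is stored as a dictionary mapping each node to a set of neighbors, the sum runs over unordered pairs of common neighbors, and the helper names (`build_adj`, `embeddedness`, `d_v`, `dispersion`) as well as the toy edge list are hypothetical, not taken from the paper.

```python
from itertools import combinations


def build_adj(edges):
    """Build an undirected adjacency dict {node: set(neighbors)} from an edge list."""
    adj = {}
    for x, y in edges:
        adj.setdefault(x, set()).add(y)
        adj.setdefault(y, set()).add(x)
    return adj


def embeddedness(adj, u, v):
    """emb(u,v): number of common neighbors of u and v."""
    return len(adj[u] & adj[v])


def d_v(adj, u, v, s, t):
    """1 if s and t are not linked and share no neighbor in G_u besides u and v."""
    if t in adj[s]:
        return 0
    # Restrict shared neighbors to G_u (u's friends), then drop v; u is excluded
    # automatically because adj[u] does not contain u.
    shared_in_gu = (adj[s] & adj[t] & adj[u]) - {v}
    return 0 if shared_in_gu else 1


def dispersion(adj, u, v):
    """disp(u,v): sum of d_v over unordered pairs of common neighbors of u and v."""
    common = sorted(adj[u] & adj[v])
    return sum(d_v(adj, u, v, s, t) for s, t in combinations(common, 2))


# Hypothetical toy neighborhood of u (illustrative data only).
EDGES = [("u", "v"), ("u", "a"), ("u", "b"), ("u", "c"), ("u", "d"),
         ("v", "a"), ("v", "b"), ("v", "c"), ("a", "b"), ("a", "d")]
ADJ = build_adj(EDGES)

print(embeddedness(ADJ, "u", "v"), dispersion(ADJ, "u", "v"))
print(embeddedness(ADJ, "u", "b"), dispersion(ADJ, "u", "b"))
```

In this toy graph, v shares three friends with u, and two of the three pairs among them are disconnected once u and v are removed. Node b shares two friends with u, but those two friends are linked to each other, so the u-b link has no dispersion.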
Example – Embeddedness vs Dispersion
• Embeddedness: emb(u,b) = 5, emb(u,h) = 4
• Dispersion: disp(u,h) = 4, disp(u,b) = 1

Strengthenings of Dispersion
• How can we create a function that predicts whether or not v is the partner of u in terms of the two variables disp(u,v) and emb(u,v)?
• Performance is highest for functions that are monotonically increasing in disp(u,v) and monotonically decreasing in emb(u,v).
• We define: norm(u,v) = disp(u,v)/emb(u,v).
• Predicting u’s partner to be the individual v maximizing norm(u,v) gives the correct answer in 48.0% of all instances.

Strengthenings of Dispersion
• There are two strengthenings of the normalized dispersion that lead to increased performance:
1. The first is to rank nodes by a function of the form
$$\frac{(\mathrm{disp}(u,v) + b)^{\alpha}}{\mathrm{emb}(u,v) + c}$$
Maximum performance of 50.5% is reached when α = 0.61, b = 0, and c = 5.
2. Performance can be strengthened by applying the idea of dispersion recursively:
  • Assign values to the nodes reflecting the dispersion of their links with u.
  • Update these values in terms of the dispersion values associated with other nodes.
  • Specifically, we initially define $x_v = 1$ for all neighbors v of u, and then iteratively update each $x_v$ to be:
$$x_v = \frac{\sum_{w \in C_{uv}} x_w^2 + 2 \sum_{s,t \in C_{uv}} d_v(s,t)\, x_s x_t}{\mathrm{emb}(u,v)}$$

Mathematical Properties of the Recursive Dispersion
• If we assign $x_s = 1$ to each node $s \in G_u$, then:
$$\sum_{s,t \in C_{uv}} d_v(s,t)\, x_s x_t = \mathrm{disp}(u,v)$$
• Hence:
$$\frac{\sum_{s,t \in C_{uv}} d_v(s,t)\, x_s x_t}{\mathrm{emb}(u,v)} = \mathrm{norm}(u,v)$$
• Goal: elevate $x_v$ when u and v act as intermediaries between many node pairs s and t that, recursively, have large values of $x_s$ and $x_t$.

Mathematical Properties of the Recursive Dispersion
• We can define an iteration in which $x_v$ is updated to be
$$x_v \leftarrow \frac{\sum_{s,t \in C_{uv}} d_v(s,t)\, x_s x_t}{\mathrm{emb}(u,v)}$$
• Problem:
  • The numerator is equal to 0 for many nodes v.
  • Even more nodes will acquire a 0 value in subsequent iterations.
  • Very few nodes end up with a positive value $x_v$.
  • The performance in identifying partners would be hurt.
• Thus, it is useful to have a mechanism that continuously introduces non-zero weight into the system.

Mathematical Properties of the Recursive Dispersion
• We want our function to have the following properties:
1. The first iteration produces values $x_v$ whose sorted order agrees with the order of values according to normalized dispersion.
2. Any additional term in the numerator is quadratic, matching the degree of the existing term $\sum_{s,t \in C_{uv}} d_v(s,t)\, x_s x_t$.
• Adding $\sum_{w \in C_{uv}} x_w^2$ to the numerator achieves these two properties.

Mathematical Properties of the Recursive Dispersion
• After the first iteration, $x_v = 1 + 2 \cdot \mathrm{norm}(u,v)$, and hence ranking nodes by $x_v$ after the first iteration is equivalent to ranking nodes by norm(u,v).
• The highest performance is achieved when we rank nodes by the values of $x_v$ after the third iteration.
• Denote the value of $x_v$ after the third iteration by rec(u,v), the recursive dispersion.
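
The recursive update can be sketched as follows. This is a minimal illustration, not the authors' implementation, and it rests on several assumptions that the slides do not state: all $x_v$ are updated synchronously from the previous iteration's values, the sum over s,t runs over unordered pairs, and a neighbor with no common neighbors (emb = 0, where the formula is undefined) is given the value 0. The function names and the small path-shaped toy graph are hypothetical.

```python
from itertools import combinations


def build_adj(edges):
    """Build an undirected adjacency dict {node: set(neighbors)} from an edge list."""
    adj = {}
    for x, y in edges:
        adj.setdefault(x, set()).add(y)
        adj.setdefault(y, set()).add(x)
    return adj


def recursive_dispersion(adj, u, iterations=3):
    """Return {v: x_v} for every neighbor v of u after the given number of updates."""
    friends = adj[u]
    structure = {}
    for v in friends:
        common = sorted(adj[u] & adj[v])  # C_uv
        # Pairs with d_v(s,t) = 1: not linked, and no shared neighbor in G_u
        # other than u and v.
        dispersed = [(s, t) for s, t in combinations(common, 2)
                     if t not in adj[s]
                     and not ((adj[s] & adj[t] & friends) - {v})]
        structure[v] = (common, dispersed)

    x = {v: 1.0 for v in friends}  # initial values x_v = 1
    for _ in range(iterations):
        new_x = {}
        for v, (common, dispersed) in structure.items():
            if not common:
                new_x[v] = 0.0  # assumption: emb(u,v) = 0 is not covered by the formula
                continue
            numerator = (sum(x[w] ** 2 for w in common)
                         + 2 * sum(x[s] * x[t] for s, t in dispersed))
            new_x[v] = numerator / len(common)  # divide by emb(u,v)
        x = new_x  # synchronous update (assumption)
    return x


# Hypothetical toy neighborhood: b links u's otherwise unconnected friends a and c.
ADJ = build_adj([("u", "a"), ("u", "b"), ("u", "c"), ("a", "b"), ("b", "c")])

print(recursive_dispersion(ADJ, "u", iterations=1))
print(recursive_dispersion(ADJ, "u", iterations=3))
```

After one iteration, b has the value (1 + 1 + 2·1)/2 = 2, which equals 1 + 2·norm(u,b) with norm(u,b) = 1/2. After three iterations b reaches (16 + 16 + 2·16)/2 = 32 while a and c stay at 4, so b ranks first.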
Recursive Dispersion - Example
[Figure: example neighborhood of u with neighbors a, b, c, d, e, each labeled with its current value; all values start at 1. The worked values are for one neighbor with emb = 1 and one neighbor with emb = 2.]
• First iteration: (1 + 2·0)/1 = 1 and (2 + 2·1)/2 = 2.
• Second iteration: (4 + 2·0)/1 = 4 and (2 + 2·1)/2 = 2.
• Third iteration: (4 + 2·0)/1 = 4 and (32 + 2·16)/2 = 32.

Performance of Structural and Interaction Measures
• We can compare the structural measures to features derived from a variety of different forms of real-time interaction between users, such as:
  • Profile viewing
  • Sending of messages
  • Co-presence at events
• The use of such ‘interaction features’ is motivated by the way in which tie strength can be estimated from the volume of interaction between two people.

Performance of Structural and Interaction Measures
• Within the category of interaction features, the two that consistently display the best performance are:
  • Rank neighbors of u by the number of photos in which they appear with u.
  • Rank neighbors of u by the total number of times that u has viewed their profile page in the previous 90 days.
• Let’s examine the performance of different measures for identifying spouses and romantic partners.

Performance of Structural and Interaction Measures
• Let’s examine the precision at the first position: the fraction of instances in which the user ranked first by the measure is in fact the true partner.
• The performance of the structural measures is much higher for married users (60.7%) than for users in a relationship (34.4%).
• In contrast, profile viewing achieves higher performance than recursive dispersion for users in a relationship.
• The performance of structural measures is significantly higher for males than for females.

Performance of Structural and Interaction Measures
• For certain more focused subsets of the data, the performance is even stronger:
  • For example, on the subset corresponding to married male Facebook users in the US, the friend with the highest recursive dispersion is the user’s spouse 76.9% of the time.
• We can also evaluate performance on the subset of users in same-sex relationships. Here we focus on users whose status is ‘in a relationship.’
  • For female users, the absolute level of performance is almost identical regardless of whether their listed partner is female or male.
  • For male users, the performance is significantly higher for relationships in which the partner is male (.450, in contrast to .369).

Performance of Structural and Interaction Measures
• When the user v who scores highest under one of the measures is not the partner of u, what role does v play among u’s network neighbors?
  • v is often a family member of u.
• What happens when we ask for the top-ranked friend to be either the partner or a family member?
  • For married users, the friend v that maximizes rec(u,v) is the partner or a family member over 75% of the time.
  • The performance gap between the genders essentially vanishes in the case of married users.
  • Female users are more likely to have their partner/family member at the top of the ranking by recursive dispersion.
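
The evaluation used throughout these slides, precision at the first position, and the relaxed variant that also accepts a family member, can be sketched as below. The data layout (a list of dicts with `scores`, `partner` and `family` keys), the function names and the four toy instances are all hypothetical and serve only to show the computation. Ties are broken by dictionary order, which is an arbitrary choice.

```python
def top_ranked(scores):
    """Return the friend with the highest score (first one wins on ties)."""
    return max(scores, key=scores.get)


def precision_at_first(instances, also_accept_family=False):
    """Fraction of instances whose top-ranked friend is the true partner.

    With also_accept_family=True, a top-ranked family member also counts as a hit.
    """
    hits = 0
    for inst in instances:
        guess = top_ranked(inst["scores"])
        if guess == inst["partner"]:
            hits += 1
        elif also_accept_family and guess in inst["family"]:
            hits += 1
    return hits / len(instances)


# Hypothetical toy instances: each maps a user's friends to some score,
# e.g. a recursive dispersion value.
INSTANCES = [
    {"scores": {"p1": 3.0, "f1": 1.0, "x1": 0.5}, "partner": "p1", "family": {"f1"}},
    {"scores": {"p2": 1.0, "f2": 2.0}, "partner": "p2", "family": {"f2"}},
    {"scores": {"p3": 0.2, "x3": 0.9}, "partner": "p3", "family": set()},
    {"scores": {"p4": 5.0, "x4": 4.0}, "partner": "p4", "family": set()},
]

print(precision_at_first(INSTANCES))
print(precision_at_first(INSTANCES, also_accept_family=True))
```

On the toy data, the strict measure counts two hits out of four, and the relaxed measure also counts the instance where a family member is ranked first.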
A Broader Set of Measures for $d_v$
Since measures of dispersion are based on an underlying distance function $d_v$, we can investigate how the performance depends on the choice of $d_v$:
• We can set a distance threshold r, and declare:
$$d_v(s,t) = \begin{cases} 1, & \text{if } s \text{ and } t \text{ are at least } r \text{ hops apart in } G_u - \{u, v\} \\ 0, & \text{otherwise} \end{cases}$$
• We can declare $d_v(s,t)$ as follows:
$$d_v(s,t) = \begin{cases} 1, & \text{if } s \text{ and } t \text{ belong to different connected components of } G_u - \{u, v\} \\ 0, & \text{otherwise} \end{cases}$$
This follows the first approach, with $r = \infty$.

A Broader Set of Measures for $d_v$
• Divide $G_u$ into communities according to a community detection algorithm, and declare:
$$d_v(s,t) = \begin{cases} 1, & \text{if } s \text{ and } t \text{ belong to different communities} \\ 0, & \text{otherwise} \end{cases}$$
• Let’s examine the performance of each of the distance functions.
• Recursive dispersion using a distance function $d_v$ based on a distance threshold of 3 produces the highest accuracy.

Performance as a Function of Neighborhood Size and Time on Site
• Further important sources of variation among users:
  • The size of their network neighborhoods.
  • The amount of time since they joined Facebook.
• These two properties are related: after a user joins the site, his or her network neighborhood will generally grow monotonically over time.
• How can a large network neighborhood affect the performance?
  • A mature network neighborhood will be more complex, which may hurt performance.
  • It will most likely reflect the user’s off-line relationships in richer detail, which may help performance.

Performance as a Function of Neighborhood Size
• Let’s examine the performance as a function of the neighborhood size. For structural measures:
• Performance is best when the neighborhood size is around 100 nodes (56%), and drops moderately (to 33%) as the neighborhood size increases.
• Recursive dispersion is the highest-performing structural measure for every range of neighborhood sizes except the extremes (50 to 100, and above 1000).
• At the extremes, the normalized dispersion is slightly better.

Performance as a Function of Neighborhood Size
• The benefits of large neighborhoods are clearer when we consider the performance of interaction features.
• Their performance tends to be approximately constant, or even increasing, as a function of neighborhood size.
• What might be the causes of the improved performance?
  • Users with large neighborhoods also tend to be more active.
  • The number of relationships that are actively maintained grows slowly in comparison to the total neighborhood size.
  • The number of candidates for the relationship partner grows more slowly than the neighborhood size.

Performance as a Function of Time on Site
Let’s examine the performance as a function of the time on site: the number of days since the user joined Facebook.
• We consider a subset of users where:
  • The neighborhood size lies between 100 and 150.
  • The time since the relationship was reported lies between 100 and 200 days.
• There is a weak increase in performance as a function of time on site.

Combining Features using Machine Learning
• Different features may capture different aspects of the user’s neighborhood.
• How well can we predict partners when combining information from many structural or interaction features via machine learning?
• For the machine learning experiments, 48 structural features and 72 interaction features are used.
• Can machine learning bring a significant improvement in performance?

Combining Features using Machine Learning
• By combining all of the 48 structural features, we can increase performance from 50.6% to 53.1%.
• Overall, interaction features perform slightly better than structural features (56.0% vs. 53.1%).
• For married users, structural features do much better (62.4% vs. 52.6%).
• On all categories the combination of interaction features and structural features significantly outperforms either on its own.
Machine Learning to Predict Relationship Status
• Our focus is on the problem of identifying relationship partners for users where we know that they are in a relationship.
• How can we estimate whether an arbitrary user is in a relationship or not?
• This latter question is more challenging and requires a different set of techniques.

Machine Learning to Predict Relationship Status
• Why is it more difficult to predict whether a user is in a relationship or not?
• Consider a user u who has a link of high dispersion to a user v.
• If we know that u is in a relationship, then v is a good candidate to be the partner.
• Dispersion is useful to identify individuals with interesting connections to u, in the sense that they have been introduced into multiple foci that u belongs to.
• A user u will generally have such friends even when u is not in a romantic relationship.

Machine Learning to Predict Relationship Status
• An experiment was conducted, taking approximately 129,000 Facebook users, sampled uniformly over all users of age at least 20 with between 50 and 2000 friends.
• 40% of these users were single, while the rest were either in a relationship, engaged, or married.
• Prediction tasks:
1. Determine whether a user is in any sort of relationship.
2. Look only at single and married users, and attempt to determine which category a user belongs to.
• Sets of features used for these tasks:
1. Demographic features (age, gender, country, etc.).
2. Structural features of the network neighborhood.
3. The union of these two sets.

Machine Learning to Predict Relationship Status
• Age is a powerful feature for predicting relationship status; thus, demographic features do well.
• Network features are not as strong, reflecting the notion that even users not in relationships have friends with similar structural properties.
• Despite this, network features add predictive power to demographic features.
Temporal Properties
• How does performance vary based on the time since the relationship was first reported by the user?
• This is an approximation for the age of the relationship itself, since the relationship may have existed for some time before it was reported.
• Let’s examine how this property affects the performance of structural and interaction measures.
• We will test this on users who are married and users who are in a relationship.

Temporal Properties – Married Users
• The structural measures are more accurate on older relationships than on newer ones, while the profile viewing feature is less accurate.
• The structural signature of the relationship needs time to ‘burn in’ to the network, while the interaction level via profile viewing is high almost immediately.
• For married users, recursive dispersion has the highest performance across the full time range.

Temporal Properties – Users in a Relationship
• For users in a relationship, an interesting crossover occurs:
  • For relationships less than a year old, the profile viewing feature produces the highest performance.
  • At approximately one year, recursive dispersion and photo viewing lead to better performance.
• There is a trade-off between a decreasing level of observation as a relationship goes on and an increasing level of dispersion in the network as the link structure adapts around the two individuals.

Temporal Properties
• How do these measures change over time in the period leading up to a change in relationship status?
• Normalized and recursive dispersion rise quickly up to the point of marriage.
• Embeddedness not only has lower performance but also rises more slowly.
• When both embeddedness and recursive dispersion eventually identify the spouse correctly, recursive dispersion does so an average of approximately 80 days sooner.

Temporal Properties
• Are partnerships that are more strongly identified by the measures also more likely to persist over time?
• We consider the users who listed themselves as being in a relationship, and see which of them list their relationship status as ‘single’ 60 days later.
• A user whose partner has a high normalized or recursive dispersion is significantly less likely to transition to ‘single’ status over this time period.

Temporal Properties
• We can view the persistence of relationships by comparing relationships on which recursive dispersion correctly identifies the partner to those on which it does not.
• Relationships on which recursive dispersion fails to correctly identify the partner are significantly more likely to transition to ‘single’ status over a 60 day period.
• This effect holds across all relationship ages and is particularly pronounced for relationships up to 12 months in age.

Beyond Immediate Neighborhoods
• All of the network measures are based on the immediate 1-hop neighborhoods of individuals.
• It is interesting to consider how accurate more expansive methods might be, if they take the broader structure of the network into account.
• Because many individuals have 2-hop neighborhoods with hundreds of thousands of nodes, doing this is computationally challenging, and heuristics are required to make it feasible.
• For example:
1. Take a single structural measure and filter down to an individual’s top 20 friends as ranked by this metric.
2. Compute the network measures in the (1-hop) neighborhoods of each of these 20 people.

Beyond Immediate Neighborhoods
• To evaluate a given friend v as the potential partner of u, we can use the measures computed in u’s neighborhood and also in v’s.
• Taking u’s top 20 friends with respect to rec(u,v), and then ranking them by min(rec(u,v), rec(v,u)), improves performance by about 6% to 0.534.
• This performs almost as well as more complex models.
• This confirms the intuitive result that relationship partners are best found by looking for pairs of people who have high scores in both directions.
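
The two-step heuristic (filter to the top 20 friends by rec(u,v), then rank by the smaller of the two directional scores) can be sketched as below. The sketch assumes the recursive dispersion values have already been computed and stored in a nested dictionary `rec[u][v]`. That layout, the function name `rerank_bidirectional`, the default of 0 for a missing reverse score, and the toy numbers are all hypothetical.

```python
def rerank_bidirectional(rec, u, k=20):
    """Rank u's top-k friends by min(rec(u,v), rec(v,u)), highest first."""
    # Step 1: filter down to the k friends with the highest rec(u, v).
    top_k = sorted(rec[u], key=rec[u].get, reverse=True)[:k]

    # Step 2: score each candidate by the weaker of the two directions.
    def mutual_score(v):
        reverse = rec.get(v, {}).get(u, 0.0)  # assumption: missing score counts as 0
        return min(rec[u][v], reverse)

    return sorted(top_k, key=mutual_score, reverse=True)


# Hypothetical precomputed scores: w ranks first from u's side,
# but u is unimportant in w's own neighborhood.
REC = {
    "u": {"v": 10.0, "w": 12.0, "z": 1.0},
    "v": {"u": 9.0},
    "w": {"u": 2.0},
    "z": {"u": 5.0},
}

print(rerank_bidirectional(REC, "u"))
print(rerank_bidirectional(REC, "u", k=2))
```

Taking the minimum means a candidate only ranks highly when the score is high in both neighborhoods, which is the intuition stated in the last bullet above. In the toy data, v overtakes w for that reason.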
Conclusions
• Understanding the structural roles of a romantic partner in online social networks is a broad question that requires a combination of different approaches.
• Dispersion is a measure which provides a powerful method for recognizing such partners from network data alone.
• Dispersion is a structural means of capturing the notion that a romantic partner spans many contexts in one’s social life.
• This is why it is not only spouses or romantic partners who exhibit high dispersion, but also family members: dispersion identifies people who span foci.