Marina's presentation

Romantic Partnerships and the Dispersion of Social Ties: A Network Analysis of Relationship Status on Facebook
By Lars Backstrom, Jon Kleinberg
Presented by: Marina Simakov

Main goal
• How can we identify important people within an individual’s network neighborhood?
• The authors investigate this question for strong ties involving spouses or romantic partners.
• Given all the connections among a person’s friends, can you recognize his or her romantic partner from the network structure alone?

A Network Neighborhood
• An individual’s network neighborhood - the set of people to whom he or she is linked.
• A person’s network neighbors encompass a profoundly diverse set of relationships.
• An important issue for the analysis of on-line social networks is to use features in the available data to recognize this variation across types of relationships.
• Methods to do this effectively can play an important role for many applications at the interface between an individual and the rest of the network.

Tie Strength
• Tie strength informally refers to the ‘closeness’ of a friendship - a spectrum that ranges from strong ties with close friends to weak ties with more distant acquaintances.
• Some fundamental questions connected to the understanding of strong ties:
  • How can we identify the most important individuals in a person’s social network neighborhood using the underlying network structure?
  • What are the defining structural signatures of a person’s strongest ties, and how do we recognize them?

Dataset and Problem Description
• The dataset contains 1.3 million randomly sampled Facebook users who declared a relationship partner in their profile (‘married’, ‘engaged’ or ‘in a relationship’).
• Given a Facebook user with a declared relationship partner, hide the identity of this partner.
• Problem: Given the user’s network neighborhood - the set of all friends and the links among them - how accurately can we identify the relationship partner using this structural information alone?

Embeddedness
• The standard characterization of a tie’s strength is embeddedness.
• Embeddedness - the number of mutual friends two people share.
• This is a key structural feature used for analyzing and estimating tie strength in on-line domains.
• Embeddedness typically increases with tie strength.
• Are there other structural measures that may be more appropriate for characterizing particular types of strong ties?

Embeddedness
• Embeddedness has served as the key definition in structural analyses for the special case of relationship partners.
• It captures how much the two partners’ social circles ‘overlap’.
• A natural predictor for identifying a user u’s partner: select the link from u of maximum embeddedness, and propose the other end v of this link as u’s partner.
[Figure: example neighborhood of u with neighbors a, b, c, d]

Embeddedness
• The embeddedness-based predictor, and others, are evaluated according to their performance: the fraction of instances on which they correctly identify the partner.
• Embeddedness achieves a performance of 24.7%. This provides evidence about the power of structural information for this task, but also offers a baseline that other approaches can potentially exceed.
• It is possible to achieve more than twice the performance of this embeddedness baseline using a new network measure - dispersion.

Dispersion
• The measure of dispersion looks not just at the number of mutual friends of two people, but also at the network structure on these mutual friends.
• A link between two people has high dispersion when their mutual friends are not well connected to one another.
[Figure: two example links, one with high dispersion and one with low dispersion]

Theoretical Basis for Dispersion
• A basic limitation of embeddedness as a predictor follows from the theory of social foci.
• Many individuals have large clusters of friends corresponding to well-defined foci of interaction in their lives (family, co-workers, college, etc.).
• Many people within these clusters know each other.
• The clusters contain links of very high embeddedness, even though they do not necessarily correspond to particularly strong ties.

Theoretical Basis for Dispersion
• The link to a person’s relationship partner is usually characterized by:
  • Lower embeddedness.
  • Mutual neighbors from several different foci.
• For example:
  • A husband who knows several of his wife’s co-workers, family members, and former classmates.
  • These people belong to different foci and do not know each other.

Theoretical Basis for Dispersion
• Thus, instead of high embeddedness, the link between an individual u and his or her partner v should display a ‘dispersed’ structure.
• The mutual neighbors of u and v are not well-connected to one another.
• Hence u and v act jointly as the only intermediaries between these different parts of the network.

Definitions
• $G_u$ - the subgraph induced on u and all neighbors of u.
• For a node v in $G_u$ we define $C_{uv}$ to be the set of common neighbors of u and v.
• $d_v$ - the distance function on the nodes of $C_{uv}$.
• The absolute dispersion of the u-v link, $\mathrm{disp}(u,v)$ - the sum of all pairwise distances between nodes in $C_{uv}$ as measured in $G_u - \{u, v\}$:
  $$\mathrm{disp}(u,v) = \sum_{s,t \in C_{uv}} d_v(s,t)$$

The function $d_v$
• Different choices of $d_v$ will give rise to different measures of absolute dispersion.
• The best performance was displayed when $d_v(s,t)$ was defined to be:
  $$d_v(s,t) = \begin{cases} 1, & \text{if } s \text{ and } t \text{ are not directly linked and also have no common neighbors in } G_u \text{ other than } u \text{ and } v \\ 0, & \text{otherwise} \end{cases}$$
• In the following discussion, we will use this distance function as the basis for the measures of dispersion.
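
The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the helper names (`build_adjacency`, `neighborhood`, `embeddedness`, `d_v`, `dispersion`) and the toy graph are hypothetical, and the sum in disp(u,v) is assumed to run over unordered pairs of common neighbors.

```python
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbors)} from an edge list."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def neighborhood(adj, u):
    """G_u: the subgraph induced on u and all neighbors of u."""
    nodes = adj[u] | {u}
    return {n: adj[n] & nodes for n in nodes}


def embeddedness(g_u, u, v):
    """emb(u, v): number of common neighbors of u and v, i.e. |C_uv|."""
    return len(g_u[u] & g_u[v])


def d_v(g_u, u, v, s, t):
    """1 if s and t are not linked and share no neighbor in G_u besides u and v, else 0."""
    if t in g_u[s]:
        return 0
    return 0 if (g_u[s] & g_u[t]) - {u, v} else 1


def dispersion(g_u, u, v):
    """disp(u, v): sum of d_v(s, t) over unordered pairs of common neighbors (assumption)."""
    c_uv = sorted(g_u[u] & g_u[v])
    return sum(d_v(g_u, u, v, s, t) for s, t in combinations(c_uv, 2))


# Hypothetical toy data: four co-workers w1..w4 who all know each other,
# plus a partner p who knows two otherwise unrelated friends a and b.
coworkers = ["w1", "w2", "w3", "w4"]
edges = [("u", n) for n in coworkers + ["p", "a", "b"]]
edges += list(combinations(coworkers, 2))
edges += [("p", "a"), ("p", "b")]

g_u = neighborhood(build_adjacency(edges), "u")
friends = sorted(g_u["u"])
by_embeddedness = max(friends, key=lambda v: embeddedness(g_u, "u", v))
by_dispersion = max(friends, key=lambda v: dispersion(g_u, "u", v))
print(by_embeddedness, by_dispersion)
```

In this toy neighborhood the embeddedness predictor picks a co-worker from the dense cluster, while the dispersion predictor picks p, whose mutual friends with u are not connected to one another.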

Example – Embeddedness vs Dispersion
[Figure: example neighborhood of u, including nodes b and h]
• Embeddedness: emb(u,b) = 5, emb(u,h) = 4
• Dispersion: disp(u,h) = 4, disp(u,b) = 1

Strengthenings of Dispersion
• How can we create a function that predicts whether or not v is the partner of u in terms of the two variables disp(u,v) and emb(u,v)?
• Performance is highest for functions that are monotonically increasing in disp(u,v) and monotonically decreasing in emb(u,v).
• We define: norm(u,v) = disp(u,v)/emb(u,v).
• Predicting u’s partner to be the individual v maximizing norm(u,v) gives the correct answer in 48.0% of all instances.

Strengthenings of Dispersion
• There are two strengthenings of the normalized dispersion that lead to increased performance:
1. The first is to rank nodes by a function of the form:
   $$\frac{(\mathrm{disp}(u,v) + b)^{\alpha}}{\mathrm{emb}(u,v) + c}$$
   Maximum performance of 50.5% is reached when α = 0.61, b = 0, and c = 5.

Strengthenings of Dispersion
2. Performance can be strengthened by applying the idea of dispersion recursively:
• Assign values to the nodes reflecting the dispersion of their links with u.
• Update these values in terms of the dispersion values associated with other nodes.
• Specifically, we initially define $x_v = 1$ for all neighbors v of u, and then iteratively update each $x_v$ to be:
   $$x_v = \frac{\sum_{w \in C_{uv}} x_w^2 + 2 \sum_{s,t \in C_{uv}} d_v(s,t)\, x_s x_t}{\mathrm{emb}(u,v)}$$

Mathematical Properties of the Recursive Dispersion
• If we assign $x_s = 1$ to each node $s \in G_u$, then:
   $$\sum_{s,t \in C_{uv}} d_v(s,t)\, x_s x_t = \mathrm{disp}(u,v)$$
• Hence:
   $$\frac{\sum_{s,t \in C_{uv}} d_v(s,t)\, x_s x_t}{\mathrm{emb}(u,v)} = \mathrm{norm}(u,v)$$
• Goal: elevate $x_v$ when u and v act as intermediaries between many node pairs s and t that, recursively, have large values of $x_s$ and $x_t$.
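
As a worked step, substituting the initial values into the full update rule shows what a single iteration computes. This sketch assumes the sums run over unordered pairs s, t, in line with the definition of disp(u,v), and uses $|C_{uv}| = \mathrm{emb}(u,v)$:

```latex
\begin{aligned}
x_v &= \frac{\sum_{w \in C_{uv}} 1^2 + 2 \sum_{s,t \in C_{uv}} d_v(s,t) \cdot 1 \cdot 1}{\mathrm{emb}(u,v)} \\
    &= \frac{\mathrm{emb}(u,v) + 2\,\mathrm{disp}(u,v)}{\mathrm{emb}(u,v)} \\
    &= 1 + 2\,\mathrm{norm}(u,v)
\end{aligned}
```

Because this is an increasing function of norm(u,v), sorting by $x_v$ after one iteration gives the same order as sorting by normalized dispersion.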

Mathematical Properties of the Recursive Dispersion
$$x_v = \frac{\sum_{w \in C_{uv}} x_w^2 + 2 \sum_{s,t \in C_{uv}} d_v(s,t)\, x_s x_t}{\mathrm{emb}(u,v)}$$
• We can define an iteration in which $x_v$ is updated to be:
   $$x_v \leftarrow \frac{\sum_{s,t \in C_{uv}} d_v(s,t)\, x_s x_t}{\mathrm{emb}(u,v)}$$
• Problem:
  • The numerator is equal to 0 for many nodes v.
  • Even more nodes will acquire a 0 value in subsequent iterations.
  • Very few nodes end up with a positive value $x_v$.
  • The performance in identifying partners would be hurt.
• Thus, it is useful to have a mechanism that continuously introduces non-zero weight into the system.

Mathematical Properties of the Recursive Dispersion
• We want our function to have the following properties:
1. The first iteration produces values $x_v$ whose sorted order agrees with the order of values according to normalized dispersion.
2. Any additional terms in the numerator are quadratic, matching the quadratic degree of the existing term in the numerator, $\sum_{s,t \in C_{uv}} d_v(s,t)\, x_s x_t$.
• Adding $\sum_{w \in C_{uv}} x_w^2$ to the numerator achieves these two properties.

Mathematical Properties of the Recursive Dispersion
• After the first iteration, $x_v = 1 + 2 \cdot \mathrm{norm}(u,v)$, and hence ranking nodes by $x_v$ after the first iteration is equivalent to ranking nodes by norm(u,v).
• The highest performance is achieved when we rank nodes by the values of $x_v$ after the third iteration.
• Denote the value of $x_v$ after the third iteration by rec(u,v), the recursive dispersion.
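
The update rule can be sketched as a short synchronous iteration. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, the adjacency map is assumed to already be restricted to $G_u$, pairs s, t are treated as unordered, and a friend with no common neighbors (emb = 0, where the formula is undefined) is simply given the value 0. The toy edges a-b, c-d, d-e among u's friends are inferred from the arithmetic in the deck's recursive dispersion example rather than read from its figure.

```python
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbors)} from an edge list."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def d_v(adj, u, v, s, t):
    """1 if s and t are not linked and share no neighbor besides u and v, else 0."""
    if t in adj[s]:
        return 0
    return 0 if (adj[s] & adj[t]) - {u, v} else 1


def recursive_dispersion(adj, u, iterations=3):
    """Return {v: x_v} after the given number of synchronous updates.

    `adj` is assumed to describe G_u only (u and its friends).
    """
    friends = adj[u]
    x = {v: 1.0 for v in friends}  # initial values x_v = 1
    for _ in range(iterations):
        new_x = {}
        for v in friends:
            c_uv = sorted(adj[u] & adj[v])
            if not c_uv:
                new_x[v] = 0.0  # assumption: emb(u, v) = 0 is scored as 0
                continue
            squares = sum(x[w] ** 2 for w in c_uv)
            cross = sum(
                d_v(adj, u, v, s, t) * x[s] * x[t]
                for s, t in combinations(c_uv, 2)
            )
            new_x[v] = (squares + 2 * cross) / len(c_uv)
        x = new_x  # all nodes are updated from the previous round's values
    return x


# Toy neighborhood inferred from the worked example (an assumption).
edges = [("u", n) for n in "abcde"] + [("a", "b"), ("c", "d"), ("d", "e")]
adj = build_adjacency(edges)

first = recursive_dispersion(adj, "u", iterations=1)
rec = recursive_dispersion(adj, "u", iterations=3)
print(first["d"], rec["c"], rec["d"])
```

On this graph the first iteration gives $x_d = 2 = 1 + 2 \cdot \mathrm{norm}(u,d)$, and the third iteration gives $x_c = 4$ and $x_d = 32$.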

Recursive Dispersion - Example
[Figure: neighborhood of u with neighbors a, b, c, d, e; each node is labeled with its current value $x_v$]
$$x_v = \frac{\sum_{w \in C_{uv}} x_w^2 + 2 \sum_{s,t \in C_{uv}} d_v(s,t)\, x_s x_t}{\mathrm{emb}(u,v)}$$
• Initial values: $x_a = x_b = x_c = x_d = x_e = 1$.
• First iteration: $x_a = \frac{1 + 2 \cdot 0}{1} = 1$, $x_d = \frac{2 + 2 \cdot 1}{2} = 2$.
• Second iteration: $x_c = \frac{4 + 2 \cdot 0}{1} = 4$, $x_d = \frac{2 + 2 \cdot 1}{2} = 2$.
• Third iteration: $x_c = \frac{4 + 2 \cdot 0}{1} = 4$, $x_d = \frac{32 + 2 \cdot 16}{2} = 32$.

Performance of Structural and Interaction Measures
• We can compare the structural measures to features derived from a variety of different forms of real-time interaction between users, such as:
  • Profile viewing
  • Sending of messages
  • Co-presence at events
• The use of such ‘interaction features’ is motivated by the way in which tie strength can be estimated from the volume of interaction between two people.

Performance of Structural and Interaction Measures
• Within the category of interaction features, the two that consistently display the best performance are:
  • Rank neighbors of u by the number of photos in which they appear with u.
  • Rank neighbors of u by the total number of times that u has viewed their profile page in the previous 90 days.
• Let’s examine the performance of different measures for identifying spouses and romantic partners.

Performance of Structural and Interaction Measures
• Let’s examine the precision at the first position: the fraction of instances in which the user ranked first by the measure is in fact the true partner.
• The performance of the structural measures is much higher for married users (60.7%) than for users in a relationship (34.4%).
• In contrast, profile viewing achieves higher performance than recursive dispersion for users in a relationship.
• The performance of structural measures is significantly higher for males than for females.

Performance of Structural and Interaction Measures
• For certain more focused subsets of the data, the performance is even stronger:
  • For example, on the subset corresponding to married male Facebook users in the US, the friend with the highest recursive dispersion is the user’s spouse 76.9% of the time.
• We can also evaluate performance on the subset of users in same-sex relationships. Here we focus on users whose status is ‘in a relationship.’
  • For female users, the absolute level of performance is almost identical regardless of whether their listed partner is female or male.
  • For male users, the performance is significantly higher for relationships in which the partner is male (0.450, in contrast to 0.369).

Performance of Structural and Interaction Measures
• When the user v who scores highest under one of the measures is not the partner of u, what role does v play among u’s network neighbors?
  • v is often a family member of u.
• What happens when we ask for the top-ranked friend to be either the partner or a family member?
  • For married users, the friend v that maximizes rec(u,v) is the partner or a family member over 75% of the time.
  • The performance gap between the genders essentially vanishes in the case of married users.
  • Female users are more likely to have their partner/family member at the top of the ranking by recursive dispersion.

A Broader Set of Measures for $d_v$
Since measures of dispersion are based on an underlying distance function $d_v$, we can investigate how the performance depends on the choice of $d_v$:
• We can set a distance threshold r, and declare:
  $$d_v(s,t) = \begin{cases} 1, & \text{if } s \text{ and } t \text{ are at least } r \text{ hops apart in } G_u - \{u, v\} \\ 0, & \text{otherwise} \end{cases}$$
• We can declare $d_v(s,t)$ as follows:
  $$d_v(s,t) = \begin{cases} 1, & \text{if } s \text{ and } t \text{ belong to different connected components of } G_u - \{u, v\} \\ 0, & \text{otherwise} \end{cases}$$
  This follows the first approach, with $r = \infty$.
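
The threshold family of distance functions can be sketched with a breadth-first search that ignores u and v. This is a minimal illustration with hypothetical names (`hop_distance`, `d_v_threshold`) and a made-up toy graph. With r = 3 it coincides with the earlier definition (not linked and no common neighbor other than u and v), and passing `float("inf")` as r gives the connected-components variant.

```python
from collections import deque


def hop_distance(adj, source, target, removed):
    """Shortest hop count from source to target, skipping nodes in `removed`.

    Returns float("inf") when target is unreachable.
    """
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in adj[node]:
            if nxt in removed or nxt in seen:
                continue
            if nxt == target:
                return dist + 1
            seen.add(nxt)
            queue.append((nxt, dist + 1))
    return float("inf")


def d_v_threshold(adj, u, v, s, t, r):
    """1 if s and t are at least r hops apart in G_u - {u, v}, else 0."""
    return 1 if hop_distance(adj, s, t, {u, v}) >= r else 0


# Hypothetical G_u: s-m-n-t form a path among u's friends; k touches only u and v.
adj = {
    "u": {"v", "s", "m", "n", "t", "k"},
    "v": {"u", "s", "t", "k"},
    "s": {"u", "v", "m"},
    "m": {"u", "s", "n"},
    "n": {"u", "m", "t"},
    "t": {"u", "v", "n"},
    "k": {"u", "v"},
}

inf = float("inf")
print(d_v_threshold(adj, "u", "v", "s", "t", 3))    # s and t are 3 hops apart
print(d_v_threshold(adj, "u", "v", "s", "t", inf))  # same component
print(d_v_threshold(adj, "u", "v", "s", "k", inf))  # different components
```

Here s and t count as dispersed under the threshold r = 3 but not under the connected-components definition, since a path between them survives the removal of u and v.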

A Broader Set of Measures for $d_v$
• Divide $G_u$ into communities according to a community detection algorithm, and declare:
  $$d_v(s,t) = \begin{cases} 1, & \text{if } s \text{ and } t \text{ belong to different communities} \\ 0, & \text{otherwise} \end{cases}$$
• Let’s examine the performance of each of the distance functions.
• Recursive dispersion using a distance function $d_v$ based on a distance threshold of 3 produces the highest accuracy.

Performance as a Function of Neighborhood Size and Time on Site
• Further important sources of variation among users:
  • The size of their network neighborhoods.
  • The amount of time since they joined Facebook.
• These two properties are related - after a user joins the site, his or her network neighborhood will generally grow monotonically over time.
• How can a large network neighborhood affect the performance?
  • A mature network neighborhood will be more complex, which may hurt performance.
  • It will most likely reflect the user’s off-line relationships in richer detail, which may help performance.

Performance as a Function of Neighborhood Size
• Let’s examine the performance as a function of the neighborhood size. For structural measures:
  • Performance is best when the neighborhood size is around 100 nodes (56%), and drops moderately (to 33%) as the neighborhood size increases.
  • Recursive dispersion is the highest-performing structural measure for every range of neighborhood sizes except the extremes (50 to 100, and above 1000).
  • At the extremes, the normalized dispersion is slightly better.

Performance as a Function of Neighborhood Size
• The benefits of large neighborhoods are clearer when we consider the performance of interaction features.
• Their performance tends to be approximately constant, or even increasing, as a function of neighborhood size.
• What might be the causes for the improved performance?
  • Users with large neighborhoods also tend to be more active.
  • The number of relationships that are actively maintained grows slowly in comparison to the total neighborhood size.
  • The number of candidates for the relationship partner grows more slowly than the neighborhood size.

Performance as a Function of Time on Site
Let’s examine the performance as a function of the time on site - the number of days since the user joined Facebook.
• We consider a subset of users where:
  • The neighborhood size lies between 100 and 150.
  • The time since the relationship was reported lies between 100 and 200 days.
• There is a weak increase in performance as a function of time on site.

Combining Features using Machine Learning
• Different features may capture different aspects of the user’s neighborhood.
• How well can we predict partners when combining information from many structural or interaction features via machine learning?
• For the machine learning experiments, 48 structural features and 72 interaction features are used.
• Can machine learning bring a significant improvement in performance?

Combining Features using Machine Learning
• By combining all of the 48 structural features, we can increase performance from 50.6% to 53.1%.
• Overall, interaction features perform slightly better than structural features (56.0% vs. 53.1%).
• For married users, structural features do much better (62.4% vs. 52.6%).
• On all categories the combination of interaction features and structural features significantly outperforms either on its own.

Machine Learning to Predict Relationship Status
• Our focus is on the problem of identifying relationship partners for users where we know that they are in a relationship.
• How can we estimate whether an arbitrary user is in a relationship or not?
• This latter question is more challenging and requires a different set of techniques.

Machine Learning to Predict Relationship Status
• Why is it more difficult to predict whether a user is in a relationship or not?
• Consider a user u who has a link of high dispersion to a user v.
• If we know that u is in a relationship, then v is a good candidate to be the partner.
• Dispersion is useful to identify individuals with interesting connections to u, in the sense that they have been introduced into multiple foci that u belongs to.
• A user u will generally have such friends even when u is not in a romantic relationship.

Machine Learning to Predict Relationship Status
• An experiment was conducted, taking approximately 129,000 Facebook users, sampled uniformly over all users of age at least 20 with between 50 and 2000 friends.
• 40% of these users were single, while the remaining users were either in a relationship, engaged, or married.
• Prediction tasks:
1. Determine whether a user is in any sort of relationship.
2. Look only at single and married users, and attempt to determine which category a user belongs to.
• Sets of features are used for these tasks:
1. Demographic features (age, gender, country, etc.).
2. Structural features of the network neighborhood.
3. The union of these two sets.

Machine Learning to Predict Relationship Status
• Age is a powerful feature for predicting relationship status; thus, demographic features do well.
• Network features are not as strong, reflecting the notion that even users not in relationships have friends with similar structural properties.
• Despite this, network features add predictive power to demographic features.

Temporal Properties
• How does performance vary based on the time since the relationship was first reported by the user?
• This is an approximation for the age of the relationship itself, since the relationship may have existed for some time before it was reported.
• Let’s examine how this property affects the performance of structural and interaction measures.
• We will test this on users who are married and users who are in a relationship.

Temporal Properties – Married Users
• The structural measures are more accurate on older relationships than on newer ones, while the profile viewing feature is less accurate.
• The structural signature of the relationship needs time to ‘burn in’ to the network, while the interaction level via profile viewing is high almost immediately.
• For married users, recursive dispersion has the highest performance across the full time range.

Temporal Properties – Users in a Relationship
• For users in a relationship, an interesting crossover occurs:
  • For relationships less than a year old, the profile viewing feature produces the highest performance.
  • At approximately one year, recursive dispersion and photo viewing lead to better performance.
• There is a trade-off between a decreasing level of observation as a relationship goes on, contrasted with an increasing level of dispersion in the network as the link structure adapts around the two individuals.

Temporal Properties
• How do these measures change over time in the period leading up to a change in relationship status?
• Normalized and recursive dispersion rise quickly up to the point of marriage.
• Embeddedness not only has lower performance but also rises more slowly.
• When both embeddedness and recursive dispersion eventually identify the spouse correctly, recursive dispersion does so an average of approximately 80 days sooner.

Temporal Properties
• Are partnerships that are more strongly identified by the measures also more likely to persist over time?
• We consider the users who listed themselves as being in a relationship, and see which of them list their relationship status as ‘single’ 60 days later.
• A user whose partner has a high normalized or recursive dispersion is significantly less likely to transition to ‘single’ status over this time period.

Temporal Properties
• We can view the persistence of relationships by comparing relationships on which recursive dispersion correctly identifies the partner to those on which it does not.
• Relationships on which recursive dispersion fails to correctly identify the partner are significantly more likely to transition to ‘single’ status over a 60 day period.
• This effect holds across all relationship ages and is particularly pronounced for relationships up to 12 months in age.

Beyond Immediate Neighborhoods
• All of the network measures are based on the immediate 1-hop neighborhoods of individuals.
• It is interesting to consider how accurate more expansive methods might be, if they take the broader structure of the network into account.
• Because many individuals have 2-hop neighborhoods with hundreds of thousands of nodes, doing this is computationally challenging, and heuristics are required to make it feasible.
• For example:
1. Take a single structural measure and filter down to an individual’s top 20 friends as ranked by this metric.
2. Compute the network measures in the (1-hop) neighborhoods of each of these 20 people.

Beyond Immediate Neighborhoods
• To evaluate a given friend v as the potential partner of u, we can use the measures computed in u’s neighborhood and also in v’s.
• Taking u’s top 20 friends with respect to rec(u,v), and then ranking them by min(rec(u,v), rec(v,u)), improves performance by about 6% to 0.534.
• This performs almost as well as more complex models.
• This confirms the intuitive result that relationship partners are best found by looking for pairs of people who have high scores in both directions.

Conclusions
• Understanding the structural roles of a romantic partner in online social networks is a broad question that requires a combination of different approaches.
• Dispersion is a measure which provides a powerful method for recognizing such partners from network data alone.
• Dispersion is a structural means of capturing the notion that a romantic partner spans many contexts in one’s social life.
• This is why it is not only spouses or romantic partners who exhibit high dispersion, but also family members: dispersion identifies people who span foci.
