Marina's presentation

Romantic Partnerships and the Dispersion
of Social Ties: A Network Analysis of
Relationship Status on Facebook
By Lars Backstrom, Jon Kleinberg
Presented by: Marina Simakov
Main goal
• How can we identify important people within an individual’s network
neighborhood?
• The authors investigate this question for strong ties involving spouses or romantic
partners.
• Given all the connections among a person’s friends, can you recognize his or her
romantic partner from the network structure alone?
A Network Neighborhood
• An individual’s network neighborhood - the set of people to whom he or
she is linked.
• A person’s network neighbors encompass a profoundly diverse set of
relationships.
• An important issue for the analysis of on-line social networks is to use features
in the available data to recognize this variation across types of relationships.
• Methods to do this effectively can play an important role for many applications
at the interface between an individual and the rest of the network.
Tie Strength
• Tie strength informally refers to the ‘closeness’ of a friendship - a spectrum
that ranges from strong ties with close friends to weak ties with more distant
acquaintances.
• Some fundamental questions connected to the understanding of strong ties:
• How can we identify the most important individuals in a person’s social network
neighborhood using the underlying network structure?
• What are the defining structural signatures of a person’s strongest ties, and how do
we recognize them?
Dataset and Problem Description
• The dataset contains 1.3 million randomly sampled Facebook users who declared a
relationship partner in their profile (‘married’, ‘engaged’ or ‘in a relationship’).
• Given a Facebook user with a declared relationship partner, hide the identity of this
partner.
• Problem: Given the user’s network neighborhood - the set of
all friends and the links among them - how accurately can we
identify the relationship partner using this structural
information alone?
Embeddedness
• The standard characterization of a tie’s strength is embeddedness.
• Embeddedness - the number of mutual friends two people share.
• This is a key structural feature used for analyzing and estimating tie strength in
on-line domains.
• Embeddedness typically increases with tie strength.
• Are there other structural measures that may be more appropriate for
characterizing particular types of strong ties?
Embeddedness
• Embeddedness has served as the key definition in structural analyses for the
special case of relationship partners.
• It captures how much the two partners’ social circles ‘overlap’.
• A natural predictor for identifying a user u’s partner: select the link from u of
maximum embeddedness, and propose the other end v of this link as u’s partner.
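The embeddedness predictor can be sketched in a few lines of Python. This is only an illustration: the graph, the node names (a1..a4, b1, c1, p) and the helper names are hypothetical and not taken from the paper, and ties are broken alphabetically as an arbitrary assumption.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Turn an undirected edge list into a dict mapping node -> set of neighbors."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return dict(adj)


def embeddedness(adj, u, v):
    """Number of mutual friends shared by u and v."""
    return len(adj[u] & adj[v])


def predict_partner_by_embeddedness(adj, u):
    """Propose the neighbor of u whose link to u has maximum embeddedness.

    Ties are broken alphabetically (an assumption made only for this sketch).
    """
    return max(sorted(adj[u]), key=lambda v: embeddedness(adj, u, v))


# Hypothetical neighborhood of u: a tight cluster a1..a4 (e.g. co-workers),
# two unrelated friends b1 and c1, and a partner p who knows a1, b1 and c1.
EDGES = [
    ("u", "a1"), ("u", "a2"), ("u", "a3"), ("u", "a4"),
    ("u", "b1"), ("u", "c1"), ("u", "p"),
    ("a1", "a2"), ("a1", "a3"), ("a1", "a4"),
    ("a2", "a3"), ("a2", "a4"), ("a3", "a4"),
    ("p", "a1"), ("p", "b1"), ("p", "c1"),
]
adj = build_adjacency(EDGES)
guess = predict_partner_by_embeddedness(adj, "u")
print(guess, embeddedness(adj, "u", guess))
```

In this made-up graph the cluster member a1 has more mutual friends with u than p does, so the baseline proposes a1 rather than p.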
Embeddedness
• The embeddedness-based predictor, like the others, is evaluated by its
performance: the fraction of instances on which it correctly identifies the
partner.
• Embeddedness achieves a performance of 24.7%. This provides evidence
about the power of structural information for this task, but also offers a
baseline that other approaches can potentially exceed.
• It is possible to achieve more than twice the performance of this
embeddedness baseline using a new network measure - dispersion.
Dispersion
• The measure of dispersion looks not just at the number of mutual friends
of two people, but also at the network structure on these mutual friends.
• A link between two people has high dispersion when their mutual friends are
not well connected to one another.
[Figure: two example links, one with high dispersion and one with low dispersion.]
Theoretical Basis for Dispersion
• A basic limitation of embeddedness as a predictor follows from the theory of
social foci.
• Many individuals have large clusters of friends corresponding to well-defined
foci of interaction in their lives (family, co-workers, college, etc.).
• Many people within these clusters know each other.
• The clusters contain links of very high embeddedness, even though
they do not necessarily correspond to particularly strong ties.
Theoretical Basis for Dispersion
• The link to a person’s relationship partner is usually characterized by:
• Lower embeddedness.
• Mutual neighbors from several different foci.
• For example:
• A husband who knows several of his wife’s co-workers, family members, and
former classmates.
• These people belong to different foci and do not know each other.
Theoretical Basis for Dispersion
• Thus, rather than high embeddedness, the link between an individual u and his or her
partner v should display a ‘dispersed’ structure.
• The mutual neighbors of u and v are not well-connected to one another.
• Hence u and v act jointly as the only intermediaries between these different
parts of the network.
Definitions
• $G_u$: the subgraph induced on u and all neighbors of u.
• For a node v in $G_u$, we define $C_{uv}$ to be the set of common neighbors of u
and v.
• $d_v$: a distance function on the nodes of $C_{uv}$.
• The absolute dispersion of the u-v link, disp(u,v), is the sum of all pairwise
distances between nodes in $C_{uv}$, as measured in $G_u - \{u, v\}$:
$$\mathrm{disp}(u,v) = \sum_{s,t \in C_{uv}} d_v(s,t)$$
The function $d_v$
• Different choices of $d_v$ will give rise to different measures of absolute
dispersion.
• The best performance was obtained when $d_v(s,t)$ was defined to be:
$$d_v(s,t) = \begin{cases} 1, & \text{if } s \text{ and } t \text{ are not directly linked and also have no common neighbors in } G_u \text{ other than } u \text{ and } v \\ 0, & \text{otherwise} \end{cases}$$
• In the following discussion, we will use this distance function as the basis for
the measures of dispersion.
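As a rough sketch of how absolute dispersion could be computed with this distance function, assuming the graph is stored as a dict of neighbor sets and the sum runs over unordered pairs of common neighbors. The toy graph and all function names are hypothetical, not taken from the paper.

```python
from collections import defaultdict
from itertools import combinations


def build_adjacency(edges):
    """Turn an undirected edge list into a dict mapping node -> set of neighbors."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return dict(adj)


def d_v(adj, u, v, s, t):
    """1 if s and t are not linked and share no neighbor in G_u besides u and v."""
    if t in adj[s]:
        return 0
    # adj[u] is exactly the node set of G_u minus u itself, so intersecting
    # with it restricts to G_u and drops u; v is removed explicitly.
    shared = (adj[s] & adj[t] & adj[u]) - {v}
    return 0 if shared else 1


def dispersion(adj, u, v):
    """Absolute dispersion: sum of d_v over unordered pairs of common neighbors."""
    common = adj[u] & adj[v]
    return sum(d_v(adj, u, v, s, t) for s, t in combinations(sorted(common), 2))


# Hypothetical neighborhood: cluster a1..a4, isolated friends b1 and c1,
# and a partner p who knows a1, b1 and c1.
EDGES = [
    ("u", "a1"), ("u", "a2"), ("u", "a3"), ("u", "a4"),
    ("u", "b1"), ("u", "c1"), ("u", "p"),
    ("a1", "a2"), ("a1", "a3"), ("a1", "a4"),
    ("a2", "a3"), ("a2", "a4"), ("a3", "a4"),
    ("p", "a1"), ("p", "b1"), ("p", "c1"),
]
adj = build_adjacency(EDGES)
for v in sorted(adj["u"]):
    print(v, dispersion(adj, "u", v))
```

Here the three mutual friends of u and p come from different parts of the neighborhood, so every pair among them contributes 1, while a2 (whose mutual friends all know each other) gets dispersion 0.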
Example – Embeddedness vs Dispersion
• Embeddedness:
emb(u,b) = 5
emb(u,h) = 4
• Dispersion:
disp(u,h) = 4
disp(u,b) = 1
Strengthenings of Dispersion
• How can we create a function that predicts whether or not v is the partner
of u in terms of the two variables disp(u,v) and emb(u,v)?
• Performance is highest for functions that are monotonically increasing in
disp(u,v) and monotonically decreasing in emb(u,v).
• We define: norm(u,v) = disp(u,v)/emb(u,v).
• Predicting u’s partner to be the individual v maximizing norm(u,v) gives
the correct answer in 48.0% of all instances.
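A minimal sketch of the normalized-dispersion predictor, under two assumptions that are not stated in the slides: norm(u,v) is taken to be 0 when u and v have no mutual friends, and ties are broken alphabetically. The graph and names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_adjacency(edges):
    """Turn an undirected edge list into a dict mapping node -> set of neighbors."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return dict(adj)


def emb(adj, u, v):
    """Embeddedness: number of mutual friends of u and v."""
    return len(adj[u] & adj[v])


def disp(adj, u, v):
    """Absolute dispersion with the 'not linked, no shared neighbor' distance."""
    common = adj[u] & adj[v]
    total = 0
    for s, t in combinations(sorted(common), 2):
        linked = t in adj[s]
        shared = (adj[s] & adj[t] & adj[u]) - {v}
        if not linked and not shared:
            total += 1
    return total


def norm(adj, u, v):
    """Normalized dispersion disp/emb; defined as 0.0 when emb is 0 (assumption)."""
    e = emb(adj, u, v)
    return disp(adj, u, v) / e if e else 0.0


def predict_partner(adj, u):
    """Propose the neighbor v of u that maximizes norm(u, v)."""
    return max(sorted(adj[u]), key=lambda v: norm(adj, u, v))


# Hypothetical neighborhood: cluster a1..a4, isolated friends b1 and c1,
# and a partner p who knows a1, b1 and c1.
EDGES = [
    ("u", "a1"), ("u", "a2"), ("u", "a3"), ("u", "a4"),
    ("u", "b1"), ("u", "c1"), ("u", "p"),
    ("a1", "a2"), ("a1", "a3"), ("a1", "a4"),
    ("a2", "a3"), ("a2", "a4"), ("a3", "a4"),
    ("p", "a1"), ("p", "b1"), ("p", "c1"),
]
adj = build_adjacency(EDGES)
print(predict_partner(adj, "u"))
```

In this toy graph p and a1 have the same absolute dispersion, but p has fewer mutual friends with u, so dividing by embeddedness ranks p first.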
Strengthenings of Dispersion
• There are two strengthenings of the normalized dispersion that lead to
increased performance:
1. The first is to rank nodes by a function of
the form:
$$\frac{(\mathrm{disp}(u,v) + b)^{\alpha}}{\mathrm{emb}(u,v) + c}$$
Maximum performance of 50.5% is
reached when α = 0.61, b = 0, and c = 5.
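The parametrized score is easy to express directly. The sketch below takes precomputed disp and emb values as inputs; the function name is hypothetical and the sample numbers are invented for illustration, while the default parameters are the ones quoted above.

```python
def strengthened_score(disp, emb, alpha=0.61, b=0.0, c=5.0):
    """Score (disp + b)^alpha / (emb + c) used to rank the neighbors of u."""
    return (disp + b) ** alpha / (emb + c)


# Invented (disp, emb) pairs for three hypothetical friends of u.
candidates = {"p": (3, 3), "a1": (3, 4), "a2": (0, 3)}
ranking = sorted(candidates, key=lambda v: strengthened_score(*candidates[v]), reverse=True)
print(ranking)
```

Setting alpha = 1, b = 0 and c = 0 recovers the plain ratio disp/emb, so the normalized dispersion is a special case of this family. A positive c also keeps the denominator non-zero for links with no mutual friends.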
Strengthenings of Dispersion
2. Performance can be strengthened by applying the idea of dispersion recursively:
• Assign values to the nodes reflecting the dispersion of their links with u.
• Update these values in terms of the dispersion values associated with other nodes.
• Specifically, we initially define $x_v = 1$ for all neighbors v of u, and then iteratively
update each $x_v$ to be:
$$x_v = \frac{\sum_{w \in C_{uv}} x_w^2 + 2 \sum_{s,t \in C_{uv}} d_v(s,t)\, x_s x_t}{\mathrm{emb}(u,v)}$$
Mathematical Properties of the Recursive Dispersion
• If we assign $x_s = 1$ to each node $s \in G_u$, then:
$$\sum_{s,t \in C_{uv}} d_v(s,t)\, x_s x_t = \mathrm{disp}(u,v)$$
• Hence:
$$\frac{\sum_{s,t \in C_{uv}} d_v(s,t)\, x_s x_t}{\mathrm{emb}(u,v)} = \mathrm{norm}(u,v)$$
• Goal: elevate $x_v$ when u and v act as intermediaries between many node pairs s
and t that, recursively, have large values of $x_s$ and $x_t$.
Mathematical Properties of the Recursive Dispersion
• We can define an iteration in which $x_v$ is updated to be:
$$x_v \leftarrow \frac{\sum_{s,t \in C_{uv}} d_v(s,t)\, x_s x_t}{\mathrm{emb}(u,v)}$$
• Problem:
• The numerator is equal to 0 for many nodes v.
• Even more nodes will acquire a 0 value in subsequent iterations.
• Very few nodes end up with a positive value $x_v$.
• The performance in identifying partners would be hurt.
• Thus, it is useful to have a mechanism that continuously introduces non-zero weight
into the system.
Mathematical Properties of the Recursive Dispersion
• We want our function to have the following properties:
1. The first iteration produces values $x_v$ whose sorted order agrees with the
order of values according to normalized dispersion.
2. Any additional term in the numerator is quadratic, matching the quadratic
degree of the existing term in the numerator, $\sum_{s,t \in C_{uv}} d_v(s,t)\, x_s x_t$.
• Adding $\sum_{w \in C_{uv}} x_w^2$ to the numerator achieves these two properties.
Mathematical Properties of the Recursive Dispersion
• With all initial values equal to 1, the numerator is emb(u,v) + 2 · disp(u,v).
After the first iteration, therefore, $x_v = 1 + 2 \cdot \mathrm{norm}(u,v)$, and ranking
nodes by $x_v$ after the first iteration is equivalent to ranking nodes by
norm(u,v).
• The highest performance is achieved when we rank nodes by the values of
$x_v$ after the third iteration.
• Denote the value of $x_v$ after the third iteration by rec(u,v), the recursive
dispersion.
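The update rule can be sketched as a short loop. Two details below are assumptions that the slides do not state: all values are updated simultaneously from the previous iteration, and a neighbor with no mutual friends is given the value 0. The graph and function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_adjacency(edges):
    """Turn an undirected edge list into a dict mapping node -> set of neighbors."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return dict(adj)


def d_v(adj, u, v, s, t):
    """1 if s and t are not linked and share no neighbor in G_u besides u and v."""
    if t in adj[s]:
        return 0
    return 0 if (adj[s] & adj[t] & adj[u]) - {v} else 1


def recursive_dispersion(adj, u, iterations=3):
    """Return {v: x_v} after the given number of simultaneous updates."""
    x = {v: 1.0 for v in adj[u]}
    for _ in range(iterations):
        new_x = {}
        for v in adj[u]:
            common = adj[u] & adj[v]
            if not common:
                new_x[v] = 0.0  # assumption: emb(u, v) = 0 gives a score of 0
                continue
            squares = sum(x[w] ** 2 for w in common)
            cross = sum(
                d_v(adj, u, v, s, t) * x[s] * x[t]
                for s, t in combinations(sorted(common), 2)
            )
            new_x[v] = (squares + 2 * cross) / len(common)
        x = new_x
    return x


# Hypothetical neighborhood: cluster a1..a4, isolated friends b1 and c1,
# and a partner p who knows a1, b1 and c1.
EDGES = [
    ("u", "a1"), ("u", "a2"), ("u", "a3"), ("u", "a4"),
    ("u", "b1"), ("u", "c1"), ("u", "p"),
    ("a1", "a2"), ("a1", "a3"), ("a1", "a4"),
    ("a2", "a3"), ("a2", "a4"), ("a3", "a4"),
    ("p", "a1"), ("p", "b1"), ("p", "c1"),
]
adj = build_adjacency(EDGES)
x1 = recursive_dispersion(adj, "u", iterations=1)
x3 = recursive_dispersion(adj, "u", iterations=3)
print(max(x3, key=x3.get))
```

After one iteration each value equals 1 + 2 · norm(u,v), which matches the property stated above; the value after three iterations plays the role of rec(u,v).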
Recursive Dispersion - Example
[Figure: example neighborhood of u with neighbors a, b, c, d and e, annotated with each node’s value $x_v$ at every step.]
• Initial values: $x_v = 1$ for every neighbor v of u.
• First iteration: $x_a = \frac{1 + 2 \cdot 0}{1} = 1$ and $x_d = \frac{2 + 2 \cdot 1}{2} = 2$.
• Second iteration: a node whose only mutual friend with u has value 2 is updated to $\frac{4 + 2 \cdot 0}{1} = 4$.
• Third iteration: $x_d = \frac{32 + 2 \cdot 16}{2} = 32$, since the two mutual friends of u and d each have value 4 by then.
Performance of Structural and Interaction Measures
• We can compare the structural measures to features derived from a variety of
different forms of real-time interaction between users such as:
• Profile viewing
• Sending of messages
• Co-presence at events
• The use of such ‘interaction features’ is motivated by the way in which tie strength
can be estimated from the volume of interaction between two people.
Performance of Structural and Interaction Measures
• Within the category of interaction features, the two that consistently display
the best performance are:
• Rank neighbors of u by the number of photos in which they appear with u.
• Rank neighbors of u by the total number of times that u has viewed their
profile page in the previous 90 days.
• Let’s examine the performance of different measures for identifying
spouses and romantic partners.
Performance of Structural and Interaction Measures
• Let’s examine the precision at the first position: the
fraction of instances in which the user ranked first by the
measure is in fact the true partner.
• The performance of the structural measures is much
higher for married users (60.7%) than for users in a
relationship (34.4%).
• In contrast, profile viewing achieves higher performance
than recursive dispersion for users in a relationship.
• The performance of structural measures is significantly
higher for males than for females.
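Precision at the first position, as defined on this slide, amounts to a simple count. The sketch below uses a hypothetical data layout (one ranked friend list per user) and invented user names. It shows only the metric, not the paper's evaluation pipeline.

```python
def precision_at_first_position(rankings, true_partners):
    """Fraction of users whose top-ranked friend is their declared partner.

    rankings: dict mapping user -> list of friends, best candidate first.
    true_partners: dict mapping user -> declared partner.
    """
    hits = sum(
        1
        for user, ranked in rankings.items()
        if ranked and ranked[0] == true_partners[user]
    )
    return hits / len(rankings)


# Invented rankings for three hypothetical users.
rankings = {
    "u1": ["p1", "x"],
    "u2": ["y", "p2"],
    "u3": ["p3"],
}
true_partners = {"u1": "p1", "u2": "p2", "u3": "p3"}
score = precision_at_first_position(rankings, true_partners)
print(round(score, 3))
```

Computing the same quantity separately on subsets of users (for example married users versus users in a relationship) gives per-group figures of the kind quoted above.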
Performance of Structural and Interaction Measures
• For certain more focused subsets of the data, the performance is even stronger:
• For example, on the subset corresponding to married male Facebook users in the US, the
friend with the highest recursive dispersion is the user’s spouse 76.9% of the time.
• We can also evaluate performance on the subset of users in same-sex
relationships. Here we focus on users whose status is ‘in a relationship.’
• For female users, the absolute level of performance is almost identical regardless
of whether their listed partner is female or male.
• For male users, the performance is significantly higher for relationships in which
the partner is male (0.450, in contrast to 0.369).
Performance of Structural and Interaction Measures
• When the user v who scores highest under one of the measures is not the partner
of u, what role does v play among u’s network neighbors?
• v is often a family member of u.
• What happens when we ask for the top-ranked friend
to be either the partner or a family member?
• For married users, the friend v that maximizes rec(u,v)
is the partner or a family member over 75% of the time.
• The performance gap between the genders essentially
vanishes in the case of married users.
• Female users are more likely to have their partner/family
member at the top of the ranking by recursive dispersion.
A Broader Set of Measures for $d_v$
Since measures of dispersion are based on an underlying distance function $d_v$, we can
investigate how the performance depends on the choice of $d_v$:
• We can set a distance threshold r, and declare:
$$d_v(s,t) = \begin{cases} 1, & \text{if } s \text{ and } t \text{ are at least } r \text{ hops apart in } G_u - \{u, v\} \\ 0, & \text{otherwise} \end{cases}$$
• We can declare $d_v(s,t)$ as follows:
$$d_v(s,t) = \begin{cases} 1, & \text{if } s \text{ and } t \text{ belong to different connected components of } G_u - \{u, v\} \\ 0, & \text{otherwise} \end{cases}$$
This follows the first approach, with $r = \infty$.
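The threshold variant can be sketched with a breadth-first search restricted to $G_u - \{u, v\}$. The function names and the small path-shaped graph are hypothetical; unreachable pairs are treated as infinitely far apart, so r = infinity gives the connected-components variant.

```python
import math
from collections import defaultdict, deque
from itertools import combinations


def build_adjacency(edges):
    """Turn an undirected edge list into a dict mapping node -> set of neighbors."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return dict(adj)


def hop_distance(adj, u, v, s, t):
    """Shortest-path length from s to t in G_u - {u, v}; math.inf if disconnected."""
    allowed = adj[u] - {v}  # nodes of G_u other than u and v
    dist = {s: 0}
    queue = deque([s])
    while queue:
        node = queue.popleft()
        if node == t:
            return dist[node]
        for nxt in adj[node] & allowed:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return math.inf


def dispersion_threshold(adj, u, v, r):
    """Absolute dispersion where a pair counts if it is at least r hops apart."""
    common = adj[u] & adj[v]
    return sum(
        1
        for s, t in combinations(sorted(common), 2)
        if hop_distance(adj, u, v, s, t) >= r
    )


# Hypothetical neighborhood: s and t are mutual friends of u and v, joined
# only by the path s - m - n - t once u and v are removed.
EDGES = [
    ("u", "v"), ("u", "s"), ("u", "t"), ("u", "m"), ("u", "n"),
    ("v", "s"), ("v", "t"),
    ("s", "m"), ("m", "n"), ("n", "t"),
]
adj = build_adjacency(EDGES)
for r in (2, 3, 4, math.inf):
    print(r, dispersion_threshold(adj, "u", "v", r))
```

With r = 3 a pair counts exactly when it is neither directly linked nor joined through a shared neighbor, which appears to coincide with the distance function used for the main results.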
A Broader Set of Measures for $d_v$
• Divide $G_u$ into communities according to a
community detection algorithm, and declare:
$$d_v(s,t) = \begin{cases} 1, & \text{if } s \text{ and } t \text{ belong to different communities} \\ 0, & \text{otherwise} \end{cases}$$
• Let’s examine the performance of each of the
distance functions.
• Recursive dispersion using a distance
function $d_v$ based on a distance threshold
of 3 produces the highest accuracy.
Performance as a Function of Neighborhood Size and Time on Site
• Further important sources of variation among users:
• Size of their network neighborhoods
• The amount of time since they joined Facebook.
• These two properties are related - after a user joins the site, his or her network
neighborhood will generally grow monotonically over time.
• How can a large network neighborhood affect the performance?
• A mature network neighborhood will be more complex, which may hurt performance.
• It will most likely reflect the user’s off-line relationships in richer detail, which may help
performance.
Performance as a Function of Neighborhood Size
• Let’s examine the performance as a function of the
neighborhood size.
For structural measures:
• Performance is best when the neighborhood size is
around 100 nodes (56%), and drops moderately (to
33%) as the neighborhood size increases.
• Recursive dispersion is the highest-performing
structural measure for every range of neighborhood
sizes except the extremes (50 to 100, and above 1000).
• At the extremes, the normalized dispersion is
slightly better.
Performance as a Function of Neighborhood Size
• The benefits of large neighborhoods are clearer when
we consider the performance of interaction features.
• Their performance tends to be approximately
constant, or even increasing, as a function of
neighborhood size.
• What might be the causes for the improved
performance?
• Users with large neighborhoods also tend to be more active.
• The number of relationships that are actively maintained
grows slowly in comparison to the total neighborhood size.
• The number of candidates for the relationship partner
grows more slowly than the neighborhood size.
Performance as a Function of Time on Site
Let’s examine the performance as a function of the time on site, the
number of days since the user joined Facebook.
• We consider a subset of users where:
• The neighborhood size lies between 100 and
150.
• The time since the relationship was reported lies
between 100 and 200 days.
• There is a weak increase in performance
as a function of time on site.
Combining Features using Machine Learning
• Different features may capture different aspects of the user’s neighborhood.
• How well can we predict partners when combining information from many
structural or interaction features via machine learning?
• For the machine learning experiments, 48 structural features and 72
interaction features are used.
• Can machine learning bring a significant improvement in
performance?
Combining Features using Machine Learning
• By combining all of the 48 structural features, we can increase
performance from 50.6% to 53.1%.
• Overall, interaction features perform slightly better than structural features
(56.0% vs. 53.1%).
• For married users, structural features do much better (62.4% vs. 52.6%).
• In all categories, the combination of interaction features and structural
features significantly outperforms either on its own.
Machine Learning to Predict Relationship Status
• Our focus is on the problem of identifying relationship partners for
users where we know that they are in a relationship.
• How can we estimate whether an arbitrary user is in a relationship
or not?
• This latter question is more challenging and requires a different set of
techniques.
Machine Learning to Predict Relationship Status
• Why is it more difficult to predict whether a user is in a relationship or not?
• Consider a user u who has a link of high dispersion to a user v.
• If we know that u is in a relationship, then v is a good candidate to be the partner.
• Dispersion is useful to identify individuals with interesting connections to u, in the sense that
they have been introduced into multiple foci that u belongs to.
• A user u will generally have such friends even when u is not in a romantic relationship.
Machine Learning to Predict Relationship Status
• An experiment was conducted, taking approximately 129,000 Facebook users, sampled
uniformly over all users of age at least 20 with between 50 and 2000 friends.
• 40% of these users were single, while the remaining were either in a relationship, engaged, or
married.
• Prediction tasks:
1. Determining whether a user is in any sort of a relationship.
2. Look only at single and married users, and attempt to determine which category a user belongs to.
• Sets of features are used for these tasks:
1. Demographic features (age, gender, country, etc.).
2. Structural features of the network neighborhood.
3. The union of these two sets.
Machine Learning to Predict Relationship Status
• Age is a powerful feature for predicting relationship status, thus,
demographic features do well.
• Network features are not as strong, reflecting the notion that even users not
in relationships have friends with similar structural properties.
• Despite this, network features add predictive power to demographic features.
Temporal Properties
• How does performance vary based on the time since the relationship
was first reported by the user?
• This is an approximation for the age of the relationship itself, since the
relationship may have existed for some time before it was reported.
• Let’s examine how this property affects the performance of structural and
interactional measures.
• We will test this on users who are married and users who are in a
relationship.
Temporal Properties – Married Users
• The structural measures are more accurate on older relationships than on newer
ones, while the profile viewing feature is less accurate.
• The structural signature of the relationship needs time to ‘burn in’ to the
network, while the interaction level via profile viewing is high almost immediately.
• For married users, recursive dispersion has the highest performance across the full
time range.
Temporal Properties – Users in a Relationship
• For users in a relationship, an interesting crossover occurs:
• For relationships less than a year old, the profile viewing feature produces the highest performance.
• At approximately one year, recursive dispersion and photo viewing lead to better performance.
• There is a trade-off between a decreasing level of observation as a relationship goes on
and an increasing level of dispersion in the network as the link structure adapts
around the two individuals.
Temporal Properties
• How do these measures change over time in the period
leading up to a change in relationship status?
• Normalized and recursive dispersion rise quickly up to the
point of marriage.
• Embeddedness not only has lower performance but also
rises more slowly.
• When both embeddedness and recursive dispersion
eventually identify the spouse correctly, recursive dispersion
does so an average of approximately 80 days sooner.
Temporal Properties
• Are partnerships that are more strongly identified by the measures
also more likely to persist over time?
• We consider the users who listed themselves as being in a relationship, and
see which of them list their relationship status as ‘single’ 60 days later.
• A user whose partner has a high normalized or
recursive dispersion is significantly less likely to
transition to ‘single’ status over this time period.
Temporal Properties
• We can view the persistence of relationships by comparing relationships on which
recursive dispersion correctly identifies the partner to those on which it does not.
• Relationships on which recursive dispersion
fails to correctly identify the partner are
significantly more likely to transition to
‘single’ status over a 60-day period.
• This effect holds across all relationship ages
and is particularly pronounced for
relationships up to 12 months in age.
Beyond Immediate Neighborhoods
• All of the network measures are based on the immediate 1-hop neighborhoods of
individuals.
• It is interesting to consider how accurate more expansive methods might be, if they take the
broader structure of the network into account.
• Because many individuals have 2-hop neighborhoods with hundreds of thousands of nodes,
doing this is computationally challenging, and heuristics are required to make it feasible.
• For example:
1. Take a single structural measure and filter down to an individual’s top 20 friends as ranked
by this metric.
2. Compute the network measures in the (1-hop) neighborhoods of each of these 20 people.
Beyond Immediate Neighborhoods
• To evaluate a given friend v as the potential partner of u, we can use the
measures computed in u’s neighborhood and also in v’s.
• Taking u’s top 20 friends with respect to rec(u,v), and then ranking them by
min(rec(u,v),rec(v,u)), improves performance by about 6% to 0.534.
• This performs almost as well as more complex models.
• This confirms the intuitive result that relationship partners are best found by
looking for pairs of people who have high scores in both directions.
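The two-step re-ranking can be sketched as follows, assuming the scores rec(u,v) and rec(v,u) have already been computed. The function name, the dictionary layout and the numbers are invented for illustration, and a missing reverse score is treated as 0 (an assumption).

```python
def rerank_by_mutual_score(rec_from_u, rec_toward_u, top_k=20):
    """Shortlist u's top_k friends by rec(u, v), then sort by min(rec(u, v), rec(v, u)).

    rec_from_u[v] holds rec(u, v), computed in u's neighborhood.
    rec_toward_u[v] holds rec(v, u), computed in v's neighborhood.
    """
    shortlist = sorted(rec_from_u, key=rec_from_u.get, reverse=True)[:top_k]
    return sorted(
        shortlist,
        key=lambda v: min(rec_from_u[v], rec_toward_u.get(v, 0.0)),
        reverse=True,
    )


# Invented scores: the sibling ranks first from u's side, but u is not
# ranked highly from the sibling's side.
rec_from_u = {"p": 30.0, "sibling": 42.0, "coworker": 5.0}
rec_toward_u = {"p": 28.0, "sibling": 6.0, "coworker": 4.0}
print(rerank_by_mutual_score(rec_from_u, rec_toward_u))
```

Taking the minimum rewards only candidates who score highly in both directions, which is the intuition stated in the last bullet.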
Conclusions
• Understanding the structural roles of a romantic partner in online social networks
is a broad question that requires a combination of different approaches.
• Dispersion is a measure which provides a powerful method for recognizing such
partners from network data alone.
• Dispersion is a structural means of capturing the notion that a romantic partner
spans many contexts in one’s social life.
• This is why it is not only spouses or romantic partners who exhibit high dispersion,
but also family members: dispersion identifies people who span foci.