Lada Adamic, HP Labs, Palo Alto, CA
Talk outline
Information flow through blogs
Information flow through email
Search through email networks
Search within the enterprise
Search in an online community
Eytan Adar, Li Zhang, Lada Adamic, & Rajan Lukose
– Record real-world and virtual experiences
– Note and discuss things “seen” on the net
– Great to track “memes” (catchy ideas)
• Patterns of information flow
– How does the popularity of a topic evolve over time?
– Who is getting information from whom?
• Ranking algorithms that take advantage of transmission patterns
[Figure: Slashdot Effect and BoingBoing Effect, plotted against time]
Blogdex, BlogPulse, etc. track the most popular links/phrases of the day
Different kinds of information have different popularity profiles
[Figure: % of hits received on each day since first appearance, in four panels: Slashdot postings; front-page news; major-news site (editorial content), back of the paper; products, etc.]
Micro example: Giant Microbes
– Timings
– Underlying network
[Figure: blogs b1, b2, b3 placed along a time-of-infection axis from t0 to t1]
– Root may be unknown
– Multiple possible paths
– Uncrawled space, alternate media (email, voice)
– No links
[Figure: blogs b1, b2, b3, …, bn with uncertain (?) connections, placed along a time-of-infection axis from t0 to t1]
who is getting info from whom
– Via links are even better
– Use ML algorithm for link inference problem
• Support Vector Machine (SVM)
• Logistic Regression
– What we can use
• Full text
• Blogs in common
• Links in common
• History of infection
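The feature list above can be fed to a simple classifier. The following is a minimal sketch of logistic regression for link inference, not the authors' implementation: the feature encoding, the toy training rows, and the names `train_logistic` and `link_probability` are all assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n_features = len(rows[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        w = [wi - lr * g / len(rows) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(rows)
    return w, b

def link_probability(w, b, x):
    """Estimated probability that blog A got the information from blog B."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Hypothetical, made-up features for blog pairs (A, B), each scaled to [0, 1]:
# [full-text similarity, blogs in common, links in common, history of infection]
TOY_ROWS = [
    [0.9, 0.8, 0.7, 1.0],
    [0.8, 0.6, 0.9, 0.5],
    [0.7, 0.9, 0.6, 1.0],
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.1, 0.3, 0.0],
    [0.3, 0.0, 0.2, 0.0],
]
TOY_LABELS = [1, 1, 1, 0, 0, 0]  # 1 = known infection link, 0 = no link

W, B = train_logistic(TOY_ROWS, TOY_LABELS)
likely = link_probability(W, B, [0.85, 0.7, 0.8, 1.0])
unlikely = link_probability(W, B, [0.15, 0.1, 0.1, 0.0])
print(round(likely, 2), round(unlikely, 2))
```

An SVM could be swapped in for the same feature vectors; logistic regression is used here only because its output reads directly as a link probability.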
http://www-idl.hpl.hp.com/blogstuff
– Using GraphViz (by AT&T) layouts
– If single, explicit link exists, draw it
– Otherwise use ML algorithm
• Pick the most likely explicit link
• Pick the most likely possible link
Giant Microbes epidemic visualization
[Figure legend: blog, via link, explicit link, inferred link]
Find early sources of good information using inferred information paths or timing
[Figure: blogs b1 (true source), b2 (popular site), b3, b4, b5, …, bn]
• Draw a weighted edge for all pairs of blogs that cite the same URL
• higher weight for mentions closer together
• run PageRank
• control for ‘spam’
[Figure: time-of-infection axis from t0 to t1]
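The ranking steps above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' algorithm: edges are assumed to point from the later citer to the earlier one, the weight is assumed to halve for every `half_life_hours` between mentions, and the citation data is made up. Spam control is not modeled.

```python
from collections import defaultdict

def build_edges(citations, half_life_hours=24.0):
    """citations: {url: [(blog, hour_of_mention), ...]}.
    Returns {later_blog: {earlier_blog: weight}}; closer mentions weigh more."""
    edges = defaultdict(lambda: defaultdict(float))
    for mentions in citations.values():
        ordered = sorted(mentions, key=lambda m: m[1])
        for i, (early_blog, t_early) in enumerate(ordered):
            for late_blog, t_late in ordered[i + 1:]:
                if late_blog != early_blog:
                    gap = t_late - t_early
                    edges[late_blog][early_blog] += 0.5 ** (gap / half_life_hours)
    return edges

def pagerank(edges, damping=0.85, iterations=100):
    """Weighted PageRank by power iteration; dangling nodes spread rank evenly."""
    nodes = set(edges)
    for targets in edges.values():
        nodes.update(targets)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            out = edges.get(n, {})
            total = sum(out.values())
            if total == 0.0:
                for m in nodes:
                    new_rank[m] += damping * rank[n] / len(nodes)
            else:
                for m, w in out.items():
                    new_rank[m] += damping * rank[n] * w / total
        rank = new_rank
    return rank

# Hypothetical data: b1 always mentions a URL first, b2 follows a few hours later.
TOY_CITATIONS = {
    "url1": [("b1", 0), ("b2", 5), ("b3", 6), ("b4", 7)],
    "url2": [("b1", 0), ("b2", 10), ("b5", 11)],
}
RANK = pagerank(build_edges(TOY_CITATIONS))
ORDER = sorted(RANK, key=RANK.get, reverse=True)
print(ORDER)
```

In this toy data the early source b1 ranks above the popular site b2, because b2's own rank flows back to the blog it cites after.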
02:00 AM Friday Mar. 05, 2004 PST Wired publishes:
“Warning: Blogs Can Be Infectious.”
7:25 AM Friday Mar. 05, 2004 PST Slashdot posts:
“Bloggers' Plagiarism Scientifically Proven”
9:55 AM Friday Mar. 05, 2004 PST Metafilter announces:
“A good amount of bloggers are outright thieves.”
Fang Wu, Bernardo Huberman, Lada Adamic, Joshua Tyler
Spread of disease is affected by the underlying network
[Figure: social network of mike, mom, a college friend, and three co-workers]
Spread of computer viruses is affected by the underlying network
[Figure: social network of mike, mom, a college friend, and three co-workers]
Difference between information flow and disease/virus spread
Viruses (computer and otherwise) are shared indiscriminately (involuntarily)
Information is passed selectively from one host to another based on knowledge of the recipient’s interests
Spread of information is affected by its content, potential recipients, and network topology
[Figure: social network of mike, mom, a college friend, and three co-workers]
Homophily: individuals with like interests associate with one another
[Figure: personal homepages at Stanford]
The Model:
Decay in transmission probability as a function of the distance m between potential target and originating node
$T^{(m)} = (m+1)^{-\beta}\,T$
Power-law implies slowest decay
[Figure: nodes at distances m = 0, m = 1, and m = 2 from the originating node]
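To get a feel for the decay, the sketch below tabulates the transmission probability at increasing distance m, assuming the power-law form $T^{(m)} = (m+1)^{-\beta}\,T$. The base value `BASE_T` and the choice of β values are hypothetical and chosen only for illustration.

```python
def transmissibility(m, base_t, beta):
    """T^(m) = (m + 1)^(-beta) * T for a target m links from the originating node."""
    if m < 0:
        raise ValueError("distance m must be non-negative")
    return base_t * (m + 1) ** (-beta)

BASE_T = 0.5  # hypothetical transmissibility at the originating node (m = 0)
for beta in (0.0, 1.0, 2.0):
    row = [round(transmissibility(m, BASE_T, beta), 4) for m in range(4)]
    print(f"beta={beta}: {row}")
```

With β = 0 there is no decay, which corresponds to indiscriminate, virus-like spreading; larger β confines the information closer to its source.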
Virus, information transmission on a scale free network
$P(k) = C\,k^{-\alpha}\,e^{-k/\kappa}$
[Figure: outdegree distribution vs. outdegree k, log-log scale, with $\alpha = 2.0$ fit]
Degree distribution of all senders of email passing through the HP email server
Epidemics on scale free graphs
$10^6$ nodes, epidemic if 1% ($10^4$) infected
[Figure: curves from Pastor-Satorras & Vespignani (2001), Newman (2002), and Wu et al. (2004); legend entries include β = 0 and β = 1]
Study of the spread of URLs and attachments
40 participants (30 within HPL, 10 elsewhere in HP & other orgs)
6370 URLs and 3401 attachments cryptographically hashed
Question: How many recipients in our sample did each item reach?
Caveats: messages are deleted (still, the median number of messages > 2000); non-uniform sample
Only forwarded messages are counted
[Figure: example of a forwarded message and forwarded URLs]
Results: average number of recipients = 1.1 for attachments, and 1.2 for URLs
[Figure: distribution of the number of recipients, log-log scale; email attachments fit $x^{-4.1}$, URLs fit $x^{-3.6}$; labeled items: ads at the bottom of hotmail & yahoo messages, short term expense control]
Simulate transmission on email log
Each message has a probability p of transmitting information from an infected individual to the recipient
[Figure: excerpt of the email log for 02/19/2003, 15:45:33 to 15:46:14, showing messages among internal nodes (I-1, I-2, …) and external nodes (E-1, E-2, …)]
Simulation of information transmission on the actual HP Labs email graph
An individual is infected if they receive a particular piece of information
Individuals remain infected for 24 hours
Start by infecting one individual at random
Every time an infected individual sends an email they have a probability p of infecting the recipient
Track epidemic over the course of a week; most run their course in 1-2 days
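The simulation rules above can be sketched in a few lines. This is a minimal illustration, not the code used in the study: the log format `(hour, sender, recipient)`, the function name `simulate`, the rule that nobody is infected twice, and the toy log are all assumptions.

```python
import random

def simulate(log, seed_person, p, infectious_hours=24.0, rng=None):
    """Replay a time-sorted email log and return everyone who got the information.

    log: list of (hour, sender, recipient) tuples sorted by hour.
    The seed is treated as infected at the time of the first logged message.
    """
    rng = rng or random.Random()
    start = log[0][0] if log else 0.0
    infected_at = {seed_person: start}
    for hour, sender, recipient in log:
        t = infected_at.get(sender)
        if t is None or hour - t > infectious_hours:
            continue  # sender never infected, or no longer infectious
        if recipient not in infected_at and rng.random() < p:
            infected_at[recipient] = hour
    return set(infected_at)

# Hypothetical log: hours since the start of the week.
TOY_LOG = [
    (1.0, "A", "B"),
    (2.0, "B", "C"),
    (30.0, "C", "D"),  # C was infected at hour 2, so it is past 24 hours here
    (30.0, "A", "E"),  # A was infected at hour 1, also past 24 hours
]
print(sorted(simulate(TOY_LOG, "A", p=1.0)))
print(sorted(simulate(TOY_LOG, "A", p=0.0)))
```

Repeating the run over many random seeds and values of p gives the outbreak-size curves; passing a seeded `random.Random` makes individual runs reproducible.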
Introduce a decay in the transmission probability based on the hierarchical distance
$p = p_0\,h^{-1.75}$
[Figure: organizational hierarchy with individuals A and B at hierarchical distance $h_{AB} = 5$]
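As a worked example of the decay, assuming the form $p = p_0\,h^{-1.75}$ and taking the hierarchical distance as given, the per-message transmission probability falls off quickly with distance:

```latex
\begin{align*}
h = 1:&\quad p = p_0 \cdot 1^{-1.75} = p_0 \\
h = 2:&\quad p = p_0 \cdot 2^{-1.75} \approx 0.297\,p_0 \\
h = 5:&\quad p = p_0 \cdot 5^{-1.75} \approx 0.060\,p_0
\end{align*}
```

So for a pair such as A and B at $h_{AB} = 5$, a message is roughly 17 times less likely to carry the information than one sent between individuals at distance 1.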
[Figure: outbreak and epidemic sizes, with and without decay, vs. probability of transmission $p_0$; 7119 potential recipients]
Conclusions on info flow in social groups
Information spread typically does not reach epidemic proportions
Information is passed on to individuals with matching properties
The likelihood that properties match decreases with distance from the source
Model gives a finite threshold
Results are consistent with observed URL & attachment frequencies in a sample
Simulations following real email patterns are also consistent
How to search in a small world
[Figure: map with NE and MA marked]
Milgram’s experiment:
Given a target individual and a particular property, pass the message to a person you correspond with who is “closest” to the target.
Small world experiment at Columbia
Dodds, Muhamad, Watts, Science 301 (2003)
Email experiment conducted in 2002
18 targets in 13 different countries
24,163 message chains
384 reached their targets, average path length 4.0
Why study small world phenomena?
Curiosity:
Why is the world small?
How are people able to route messages?
Social Networking as a Business:
Friendster, Orkut, MySpace
LinkedIn, Spoke, VisiblePath
Six degrees of separation - to be expected
Pool and Kochen (1978) - average person has 500-1500 acquaintances
Ignoring clustering, other redundancy …
~ $10^3$ first neighbors, $10^6$ second neighbors, $10^9$ third neighbors
But networks are clustered: my friends’ friends tend to be my friends
Watts & Strogatz (1998) - a few random links in an otherwise clustered graph give an average shortest path close to that of a random graph
But how are people able to find short paths?
How to choose among hundreds of acquaintances?
Strategy:
Simple greedy algorithm - each participant chooses correspondent who is closest to target with respect to the given property
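The greedy strategy can be sketched as below. It is an illustration under assumptions, not a model from the talk: the graph is a hypothetical ring with one shortcut, `ring_distance` stands in for "the given property", and the rule of never revisiting a node is added to keep the walk from looping.

```python
def greedy_route(neighbors, distance, source, target, max_steps=50):
    """Pass the message to the not-yet-visited neighbor closest to the target.

    neighbors: {node: iterable of nodes}; distance(a, b): property distance.
    Returns the path as a list of nodes, or None if the chain dies out.
    """
    path = [source]
    visited = {source}
    current = source
    while current != target and len(path) <= max_steps:
        candidates = [n for n in neighbors.get(current, ()) if n not in visited]
        if not candidates:
            return None  # dead end: everyone this person knows was already tried
        current = min(candidates, key=lambda n: distance(n, target))
        visited.add(current)
        path.append(current)
    return path if current == target else None

RING_SIZE = 10

def ring_distance(a, b):
    gap = abs(a - b)
    return min(gap, RING_SIZE - gap)

# Hypothetical network: a ring of 10 people plus one long-range tie between 0 and 5.
RING = {i: {(i - 1) % RING_SIZE, (i + 1) % RING_SIZE} for i in range(RING_SIZE)}
RING[0].add(5)
RING[5].add(0)

print(greedy_route(RING, ring_distance, source=0, target=6))
```

Each participant uses only local knowledge (their own contacts and the target's property), which is what makes the strategy decentralized.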
Models:
Geography: Kleinberg (2000)
Hierarchical groups: Watts, Dodds, Newman (2001), Kleinberg (2001)
High degree nodes: Adamic, Puniyani, Lukose, Huberman (2001), Newman (2003)
Kleinberg (2000)
Spatial search
“The geographic movement of the [message] from Nebraska to Massachusetts is striking. There is a progressive closing in on the target area as each new person is added to the chain”
S. Milgram, ‘The small world problem’, Psychology Today 1, 61 (1967)
Nodes are placed on a lattice and connect to nearest neighbors
Additional links placed with $f(d) \sim d(u,v)^{-r}$
If $r = 2$, can search in polylog ($< (\log N)^2$) time
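The link-placement rule can be illustrated with a short sampler. This is a sketch assuming a square two-dimensional lattice with Manhattan distance and one extra contact per node; the function names, the lattice size, and the sample count are hypothetical.

```python
import random

def lattice_distance(u, v):
    """Manhattan distance between two lattice points."""
    return abs(u[0] - v[0]) + abs(u[1] - v[1])

def sample_long_range_contact(u, side, r, rng):
    """Pick one extra contact v for node u with probability proportional to d(u, v)^(-r)."""
    others = [(x, y) for x in range(side) for y in range(side) if (x, y) != u]
    weights = [lattice_distance(u, v) ** (-r) for v in others]
    return rng.choices(others, weights=weights, k=1)[0]

def mean_contact_distance(u, side, r, samples, rng):
    total = 0
    for _ in range(samples):
        total += lattice_distance(u, sample_long_range_contact(u, side, r, rng))
    return total / samples

RNG = random.Random(0)
CENTER = (5, 5)
MEAN_R0 = mean_contact_distance(CENTER, side=11, r=0.0, samples=2000, rng=RNG)
MEAN_R2 = mean_contact_distance(CENTER, side=11, r=2.0, samples=2000, rng=RNG)
print(round(MEAN_R0, 2), round(MEAN_R2, 2))
```

With r = 0 the extra contacts are uniformly random, while r = 2 keeps a mix of short and long ties, so the mean contact distance is smaller.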
Kleinberg: searching hierarchical structures
‘Small-World Phenomena and the Dynamics of Information’, NIPS 14, 2001
Hierarchical network models: h is the distance between two individuals in a hierarchy with branching b
$f(h) \sim b^{-\alpha h}$
If $\alpha = 1$, can search in O(log n) steps
Group structure models: q = size of smallest group that two individuals belong to
$f(q) \sim q^{-\alpha}$
If $\alpha = 1$, can search in O(log n) steps
Identity and search in social networks
Watts, Dodds, Newman (2001)
Individuals belong to hierarchically nested groups
Multiple independent hierarchies coexist
$p_{ij} \sim \exp(-\alpha x)$
Identity and search in social networks
Watts, Dodds, Newman (2001)
There is an attrition rate r
Network is ‘searchable’ if a fraction q of messages reach the target
[Figure: curves for N = 102400, N = 204800, and N = 409600]
High degree search
Adamic et al. Phys. Rev. E, 64 46135 (2001)
[Figure: Mary asks “Who could introduce me to Richard Gere?”, with acquaintances Bob and Jane]
[Figure: number of nodes found, power-law graph vs. Poisson graph]
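A high-degree search can be sketched as a walk that always hands the query to the best-connected contact not yet asked. This is an illustrative sketch, not the procedure from the cited paper: the backtracking rule, the stopping rule (the target is the current node or one of its direct contacts), and the toy graph are assumptions.

```python
def high_degree_search(neighbors, source, target, max_steps=1000):
    """Return the number of steps until the target is found, or None.

    neighbors: {node: set of nodes}. At each step the query moves to the
    highest-degree unvisited neighbor, backtracking at dead ends.
    """
    visited = {source}
    stack = [source]
    steps = 0
    while stack and steps <= max_steps:
        current = stack[-1]
        if current == target or target in neighbors[current]:
            return steps
        unvisited = [n for n in neighbors[current] if n not in visited]
        if unvisited:
            nxt = max(unvisited, key=lambda n: len(neighbors[n]))
            visited.add(nxt)
            stack.append(nxt)
        else:
            stack.pop()  # dead end: hand the query back
        steps += 1
    return None

# Hypothetical graph: "hub" knows many people, "a" knows only "s"; "z" is isolated.
TOY_GRAPH = {
    "s": {"a", "hub"},
    "a": {"s"},
    "hub": {"s", "x", "y", "t"},
    "x": {"hub"},
    "y": {"hub"},
    "t": {"hub"},
    "z": set(),
}
print(high_degree_search(TOY_GRAPH, "s", "t"))
```

The walk reaches the hub in one step and the hub already knows the target, which is the intuition behind the strategy working well on power-law graphs.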
Scaling of search time with size of graph
Sharp cutoff at $k \sim N^{1/\tau}$ (τ: power-law exponent of the degree distribution), 2nd degree neighbors
[Figure: search time vs. size of graph, log-log scale; random walk with exponent 0.37 fit, degree sequence with exponent 0.24 fit]
Testing the models on social networks
( w/ Eytan Adar)
Use a well defined network:
HP Labs email correspondence over 3.5 months
Edges are between individuals who sent at least 6 email messages each way
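The edge filter above can be expressed compactly. This is a minimal sketch, assuming the log is available as `(sender, recipient)` pairs; the function name and the toy messages are hypothetical.

```python
from collections import Counter

def reciprocal_edges(messages, min_each_way=6):
    """Return undirected edges for pairs who sent at least
    min_each_way messages in each direction.

    messages: iterable of (sender, recipient) pairs.
    """
    counts = Counter((s, r) for s, r in messages if s != r)
    edges = set()
    for (s, r), n in counts.items():
        if n >= min_each_way and counts.get((r, s), 0) >= min_each_way:
            edges.add(frozenset((s, r)))
    return edges

# Hypothetical log: a and b correspond regularly; a mostly broadcasts to c.
TOY_MESSAGES = (
    [("a", "b")] * 6 + [("b", "a")] * 6 + [("a", "c")] * 10 + [("c", "a")] * 2
)
print(reciprocal_edges(TOY_MESSAGES))
```

Requiring traffic in both directions drops one-way broadcasts and mailing-list style senders, keeping only pairs with an actual correspondence.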
Node properties specified: degree, geographical location, position in organizational hierarchy
Can greedy strategies work?
Strategy 1: High degree search
Degree distribution of all senders of email passing through the HP email server
[Figure: outdegree distribution, log-log scale, with power-law fit of exponent 2.0]
Filtered network (6 messages sent each way)
Degree distribution no longer power-law, but Poisson
450 users
Median degree = 10
Mean degree = 13
Average shortest path = 3
High degree search performance (poor): median # steps = 16, mean = 40
[Figure: distribution of the number of email correspondents, k]
Strategy 2: Geography
Communication across corporate geography
87% of the 4000 links are between individuals on the same floor
[Figure: links among building areas 1U, 1L, 2U, 2L, 3U, 3L, 4U]
Cubicle distance vs. probability of being linked
[Figure: measured probability of being linked vs. distance in feet, log-log scale, compared with $1/r$ and with $1/r^2$ (optimum for search)]
Finding someone in a sea of cubicles
Median = 7, mean = 12
[Figure: histogram of number of steps]
Strategy 3: Organizational hierarchy
[Figure: organizational hierarchy with scrambled email correspondence vs. actual email correspondence]
Example of search path
[Figure: search path with hierarchical distance = 5 and search path distance = 4]
Probability of linking vs. distance in hierarchy
[Figure: observed probability of linking vs. hierarchical distance h, with fit $\exp(-0.92\,h)$]
In the ‘searchable’ regime: $0 < \alpha < 2$ (Watts 2001)
Results
distance    median   mean
search      4        5.7 (4.7)
geodesic    3        3.1
org         6        6.1
random      28       57.4
[Figure: histogram of number of steps in search]
Group size vs. probability of linking
[Figure: observed probability of linking vs. group size g, log-log scale, with fit $g^{-0.74}$, compared with $g^{-1}$, the optimum for search (Kleinberg 2001)]
Search Conclusions
Individuals associate on different levels into groups.
Group structure facilitates decentralized search using social ties.
HP Labs as a social network is searchable but not quite optimal.
Searching using the organizational hierarchy is faster than using physical location
A fraction of ‘important’ individuals are easily findable
Humans may be much more resourceful in executing search tasks: making use of weak ties, using more sophisticated strategies
PeopleFinder 2 – a search engine for HP people
Extract & disambiguate names from publicly available documents
Enrich information available about individuals
Search for them by topic
Identify knowledge communities from co-occurrence of names
Live Demo
If live demo fails:
Current PeopleFinder functionality
PeopleFinder 2 info on a person
Extracted topics for a person
Social network
Social network visualization
Search for individuals by topic
Visualize knowledge network
Find social network paths to experts
To find out more:
(papers, slides, other research in the group)
Information dynamics group (IDL) at HP Labs: http://www.hpl.hp.com/research/idl
List of publications http://www.hpl.hp.com/personal/Lada_Adamic/research.html