Lada Adamic, HP Labs, Palo Alto, CA


Talk outline

Information flow through blogs

Information flow through email

Search through email networks

Search within the enterprise

Search in an online community

Implicit Structure and Dynamics of BlogSpace

Eytan Adar, Li Zhang, Lada Adamic, & Rajan Lukose

• Blog use:

– Record real-world and virtual experiences

– Note and discuss things “seen” on the net

• Blog structure: blog-to-blog linking

• Use + Structure

– Great to track “memes” (catchy ideas)

Approaches and uses of blog analysis

• Patterns of information flow

– How does the popularity of a topic evolve over time?

– Who is getting information from whom?

• Ranking algorithms that take advantage of transmission patterns

Tracking popularity over time

[Figure: popularity over time, showing the Slashdot Effect and the BoingBoing Effect]

Blogdex, BlogPulse, etc. track the most popular links/phrases of the day

Different kinds of information have different popularity profiles

[Figure: % of hits received on each day since first appearance (days 5, 10, 15), in four panels: Slashdot postings; Front-page news; Major-news site (editorial content), back of the paper; Products, etc.]

Micro example: Giant Microbes

Microscale Dynamics

• What do we need to track specific info ‘epidemics’?

– Timings

– Underlying network

[Diagram: blogs b1, b2, b3 placed along a "time of infection" axis running from t0 to t1]

Microscale Dynamics

• Challenges

– Root may be unknown

– Multiple possible paths

– Uncrawled space, alternate media (email, voice)

– No links

[Diagram: blogs b1, b2, b3, ..., bn with uncertain ("?") paths between them, placed along a "time of infection" axis running from t0 to t1]

Microscale Dynamics

who is getting info from whom

• Explicit blog to blog links (easy)

– Via links are even better

• Implicit/Inferred transfer (harder)

– Use ML algorithm for link inference problem

• Support Vector Machine (SVM)

• Logistic Regression

– What we can use

• Full text

• Blogs in common

• Links in common

• History of infection

Visualization

http://www-idl.hpl.hp.com/blogstuff

• Zoomgraph tool

– Using GraphViz (by AT&T) layouts

• Simple algorithm

– If single, explicit link exists, draw it

– Otherwise use ML algorithm

• Pick the most likely explicit link

• Pick the most likely possible link

• Tool lets you zoom around space, control threshold, link types, etc.

Giant Microbes epidemic visualization

[Figure legend: blog, via link, explicit link, inferred link]

iRank

Find early sources of good information using inferred information paths or timing

[Diagram: blogs b1, b2, b3, b4, b5, ..., bn, with the labels "True source" and "Popular site"]

iRank Algorithm

• Draw a weighted edge for all pairs of blogs that cite the same URL

• higher weight for mentions closer together

• run PageRank

• control for ‘spam’

[Diagram: "time of infection" axis running from t0 to t1]
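The iRank steps listed on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the edge direction (from the later blog to the earlier one), the weight `1 / (1 + time gap)`, the damping factor, and the function names are all assumptions, and the ‘spam’ control step is not shown.

```python
from collections import defaultdict
from itertools import combinations


def build_citation_graph(mentions):
    """mentions: iterable of (blog, url, time) tuples, time in hours."""
    by_url = defaultdict(list)
    for blog, url, t in mentions:
        by_url[url].append((blog, t))
    weights = defaultdict(float)
    for cites in by_url.values():
        for (a, ta), (b, tb) in combinations(cites, 2):
            if a == b:
                continue
            # Assumption: the edge points from the later blog to the earlier one.
            later, earlier = (a, b) if ta > tb else (b, a)
            # Assumption: mentions closer together in time get a higher weight.
            weights[(later, earlier)] += 1.0 / (1.0 + abs(ta - tb))
    return weights


def pagerank(weights, damping=0.85, iterations=100):
    """Weighted PageRank by power iteration over a dict of edge weights."""
    nodes = sorted({n for edge in weights for n in edge})
    if not nodes:
        return {}
    out_total = defaultdict(float)
    for (src, _), w in weights.items():
        out_total[src] += w
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[n] for n in nodes if out_total[n] == 0.0)
        base = (1.0 - damping + damping * dangling) / len(nodes)
        new = {n: base for n in nodes}
        for (src, dst), w in weights.items():
            new[dst] += damping * rank[src] * w / out_total[src]
        rank = new
    return rank


# Hypothetical data: four blogs cite the same URL one hour apart.
mentions = [("b1", "u", 0), ("b2", "u", 1), ("b3", "u", 2), ("b4", "u", 3)]
ranks = pagerank(build_citation_graph(mentions))
```

With these assumptions, rank flows toward blogs that mention a URL early, so the earliest blog in the hypothetical data ends up with the highest score.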

Do Bloggers Kill Kittens?

02:00 AM Friday Mar. 05, 2004 PST Wired publishes:

“Warning: Blogs Can Be Infectious.”

7:25 AM Friday Mar. 05, 2004 PST Slashdot posts:

“Bloggers' Plagiarism Scientifically Proven”

9:55 AM Friday Mar. 05, 2004 PST Metafilter announces:

“A good amount of bloggers are outright thieves.”

Information flow in social groups

Fang Wu, Bernardo Huberman, Lada Adamic, Joshua Tyler

Spread of disease is affected by the underlying network

[Diagram: social network with nodes labeled mom, co-worker, college friend, mike]

Spread of computer viruses is affected by the underlying network

[Diagram: social network with nodes labeled mom, co-worker, college friend, mike]

Difference between information flow and disease/virus spread

Viruses (computer and otherwise) are shared indiscriminately (involuntarily)

Information is passed selectively from one host to another based on knowledge of the recipient’s interests

Spread of information is affected by its content, potential recipients, and network topology

[Diagram: social network with nodes labeled mom, co-worker, college friend, mike]

Homophily: individuals with like interests associate with one another

[Figure: data from personal homepages at Stanford; horizontal axis from 0 to 20]

The Model:

Decay in transmission probability as a function of the distance m between potential target and originating node

$T^{(m)} = (m+1)^{-\beta}\, T$

A power law implies the slowest decay

[Diagram: nodes at distances m = 0, m = 1, m = 2 from the originating node]
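As a worked illustration of the decay, assume it takes the power-law form T^(m) = (m+1)^(-β) T, where T is the transmission probability at the originating node and β is the decay exponent. The function name and the values T = 0.5 and β = 1 below are hypothetical, chosen only to show how quickly the probability falls with distance.

```python
def transmission_probability(T, m, beta):
    """Transmission probability at distance m, assuming T^(m) = (m + 1)^(-beta) * T."""
    return T * (m + 1) ** (-beta)


# Hypothetical values: T = 0.5 at the source, decay exponent beta = 1.
for m in range(4):
    print(m, transmission_probability(0.5, m, beta=1.0))
```

At m = 0 the probability equals T itself; each further step away from the source reduces it.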

Virus, information transmission on a scale free network

$P(k) = C k^{-\alpha} e^{-k/\kappa}$

[Figure: outdegree distribution on log-log axes (outdegree k from 10^0 to 10^4), with an $\alpha = 2.0$ fit]

Degree distribution of all senders of email passing through the HP email server

epidemics on scale free graphs

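10^6 nodes, epidemic if 1% (10^4) infected

[Figure: epidemic threshold curves for several parameter settings, including $\beta = 0$ and $\beta = 1$; horizontal axis from 1 to 4, vertical axis from 0 to 1; curves labeled Wu et al. (2004), Newman (2002), and Pastor-Satorras & Vespignani (2001)]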

Study of the spread of URLs and attachments

40 participants (30 within HPL, 10 elsewhere in HP & other orgs)

6370 URLs and 3401 attachments cryptographically hashed

Question: How many recipients in our sample did each item reach?

Caveats: messages are deleted (still, the median number of messages is > 2000); the sample is non-uniform

Only forwarded messages are counted

[Diagram labels: forwarded message, forwarded URLs]

Results

Average number of recipients = 1.1 for attachments and 1.2 for URLs

[Figure: distribution of the number of recipients on log-log axes; fits: email attachments ~ x^{-4.1}, URLs ~ x^{-3.6}; annotated items: ads at the bottom of hotmail & yahoo messages, short term expense control]

Simulate transmission on email log

Each message has a probability p of transmitting information from an infected individual to the recipient

[Figure: excerpt of the anonymized email log (date, time, sender, recipient), e.g. "02/19/2003 15:46:14 I-16 E-5"; I = internal node, E = external node]

Simulation of information transmission on the actual HP Labs email graph

An individual is infected if they receive a particular piece of information

Individuals remain infected for 24 hours

Start by infecting one individual at random

Every time an infected individual sends an email, they have a probability p of infecting the recipient

Track the epidemic over the course of a week; most run their course in 1-2 days
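The simulation rules on this slide can be sketched as a single pass over a time-ordered message log. This is a minimal illustration under stated assumptions, not the code used in the study: the log format `(timestamp_seconds, sender, recipient)`, the function name, the choice to start the seed's infection at the first logged message, and the rule that an individual is infected at most once are all assumptions.

```python
import random

INFECTIOUS_SECONDS = 24 * 3600  # individuals remain infected for 24 hours


def simulate(log, seed_node, p, rng):
    """Replay a time-sorted log of (timestamp_seconds, sender, recipient).

    Returns the set of individuals who ever received the information.
    """
    # Assumption: the seed becomes infected at the time of the first message.
    infected_at = {seed_node: log[0][0] if log else 0}
    for t, sender, recipient in log:
        t0 = infected_at.get(sender)
        if t0 is None or t - t0 > INFECTIOUS_SECONDS:
            continue  # sender does not currently hold the information
        # Assumption: an individual can be infected only once.
        if recipient not in infected_at and rng.random() < p:
            infected_at[recipient] = t
    return set(infected_at)


# Hypothetical three-message log; the seed is picked at random from the senders.
log = [(0, "I-1", "I-2"), (10, "I-2", "E-1"), (20, "I-5", "I-1")]
rng = random.Random(0)
seed_node = rng.choice(sorted({sender for _, sender, _ in log}))
reached = simulate(log, seed_node, p=0.5, rng=rng)
```

Repeating the run over many random seeds and values of p would give the distribution of outbreak sizes as a function of the transmission probability.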

Introduce a decay in the transmission probability based on the hierarchical distance

$p \propto p_0\, h^{-1.75}$

[Diagram: organizational hierarchy with nodes A and B; links along the path are labeled "distance 1" or "distance 2", and $h_{AB} = 5$]

[Figure: outbreak and epidemic sizes versus probability of transmission (0 to 1); vertical axis from 0 to 2500; 7119 potential recipients; legend: outbreak w/ decay, epidemic w/ decay, outbreak w/o decay, epidemic w/o decay]

Conclusions on info flow in social groups

Information spread typically does not reach epidemic proportions

Information is passed on to individuals with matching properties

The likelihood that properties match decreases with distance from the source

Model gives a finite threshold

Results are consistent with observed URL & attachment frequencies in a sample

Simulations following real email patterns also consistent

How to search in a small world

[Map: NE (Nebraska) and MA (Massachusetts)]

Milgram’s experiment:

Given a target individual and a particular property, pass the message to a person you correspond with who is “closest” to the target.

Small world experiment at Columbia

Dodds, Muhamad, Watts, Science 301 (2003)

Email experiment conducted in 2002

18 targets in 13 different countries

24,163 message chains

384 reached their targets; average path length 4.0

Why study small world phenomena?

Curiosity:

Why is the world small?

How are people able to route messages?

Social Networking as a Business:

Friendster, Orkut, MySpace

LinkedIn, Spoke, VisiblePath

Six degrees of separation - to be expected

Pool and Kochen (1978) - average person has 500-1500 acquaintances

Ignoring clustering, other redundancy …

~ 10^3 first neighbors, 10^6 second neighbors, 10^9 third neighbors

But networks are clustered: my friends’ friends tend to be my friends

Watts & Strogatz (1998) - a few random links in an otherwise clustered graph give an average shortest path close to that of a random graph

But how are people able to find short paths?

How to choose among hundreds of acquaintances?

Strategy:

Simple greedy algorithm - each participant chooses correspondent who is closest to target with respect to the given property

Models:

geography: Kleinberg (2000)

hierarchical groups: Watts, Dodds, Newman (2001), Kleinberg (2001)

high degree nodes: Adamic, Puniyani, Lukose, Huberman (2001), Newman (2003)

Kleinberg (2000)

Spatial search

“The geographic movement of the [message] from Nebraska to Massachusetts is striking. There is a progressive closing in on the target area as each new person is added to the chain”

S. Milgram, ‘The small world problem’, Psychology Today 1, 61, 1967

Nodes are placed on a lattice and connect to nearest neighbors

Additional links are placed with probability $f(d) \sim d(u,v)^{-r}$

If $r = 2$, can search in polylog ($< (\log N)^2$) time
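The lattice model and greedy routing can be sketched as below. This is a small illustration under assumptions, not Kleinberg's exact construction: it uses a two-dimensional n x n grid with Manhattan distance, gives each node exactly one long-range link, and all function names are hypothetical.

```python
import random


def lattice_distance(u, v):
    """Manhattan distance between two grid points."""
    return abs(u[0] - v[0]) + abs(u[1] - v[1])


def add_long_range_links(n, r, rng):
    """Give every node of an n x n grid one extra link (an assumption),
    chosen with probability proportional to d(u, v)^(-r)."""
    nodes = [(x, y) for x in range(n) for y in range(n)]
    links = {}
    for u in nodes:
        others = [v for v in nodes if v != u]
        weights = [lattice_distance(u, v) ** (-r) for v in others]
        links[u] = rng.choices(others, weights=weights, k=1)[0]
    return links


def greedy_route(source, target, n, links):
    """Forward to whichever contact is closest to the target; return the step count."""
    current, steps = source, 0
    while current != target:
        x, y = current
        contacts = [(x + dx, y + dy)
                    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                    if 0 <= x + dx < n and 0 <= y + dy < n]
        contacts.append(links[current])
        # A lattice neighbor always gets one step closer, so this terminates.
        current = min(contacts, key=lambda v: lattice_distance(v, target))
        steps += 1
    return steps


rng = random.Random(42)
links = add_long_range_links(10, 2, rng)
steps = greedy_route((0, 0), (9, 9), 10, links)
```

Because some lattice neighbor is always one step closer to the target, the greedy route never takes more steps than the lattice distance between source and target; long-range links can only shorten it.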

Kleinberg: searching hierarchical structures

‘Small-World Phenomena and the Dynamics of Information’, NIPS 14, 2001

Hierarchical network models:

h is the distance between two individuals in a hierarchy with branching b

$f(h) \sim b^{-\alpha h}$

If $\alpha = 1$, can search in O(log n) steps

Group structure models:

q = size of the smallest group that two individuals belong to

$f(q) \sim q^{-\alpha}$

If $\alpha = 1$, can search in O(log n) steps

Identity and search in social networks

Watts, Dodds, Newman (2001)

Individuals belong to hierarchically nested groups

Multiple independent hierarchies coexist

$p_{ij} \sim \exp(-\alpha x)$

Identity and search in social networks

Watts, Dodds, Newman (2001)

There is an attrition rate r

Network is ‘searchable’ if a fraction q of messages reach the target

[Figure legend: N=102400, N=204800, N=409600]

High degree search

Adamic et al. Phys. Rev. E, 64 46135 (2001)

[Diagram: Mary asks "Who could introduce me to Richard Gere?"; other nodes: Bob, Jane]

[Figure: number of nodes found along a search, power-law graph versus Poisson graph]

Scaling of search time with size of graph

Sharp cutoff at $k \sim N^{1/\tau}$, 2nd degree neighbors

[Figure: search time versus size of graph on log-log axes (size of graph from 10^1 to 10^5); random walk (fit exponent 0.37) and degree sequence (fit exponent 0.24)]

Testing the models on social networks

( w/ Eytan Adar)

Use a well defined network:

HP Labs email correspondence over 3.5 months

Edges are between individuals who sent at least 6 email messages each way

Node properties specified: degree, geographical location, position in organizational hierarchy

Can greedy strategies work?

Strategy 1: High degree search

Degree distribution of all senders of email passing through the HP email server

[Figure: outdegree distribution on log-log axes (outdegree from 10^0 to 10^4), with a power-law fit of exponent 2.0]

Filtered network

(6 messages sent each way)

Degree distribution no longer power-law, but Poisson

450 users

median degree = 10

mean degree = 13

average shortest path = 3

High degree search performance (poor): median # steps = 16, mean = 40

[Figure: distribution of the number of email correspondents, k (0 to 80), with a log-scale inset]
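A high-degree search of the kind evaluated here can be sketched as follows. This is a minimal illustration, not the code behind the reported numbers: it assumes an undirected graph stored as a dict of neighbor sets, assumes each node can see whether the target is among its own neighbors, and falls back to already-visited neighbors when no unvisited one is left. The function name and the example graph are hypothetical.

```python
def high_degree_search(graph, source, target, max_steps=1000):
    """graph: dict mapping node -> set of neighbors (undirected).

    Returns the number of steps taken, or None if the search gives up.
    """
    current, visited = source, {source}
    for steps in range(max_steps + 1):
        if current == target:
            return steps
        if target in graph[current]:
            return steps + 1  # hand the message straight to the target
        unvisited = [v for v in graph[current] if v not in visited]
        # Prefer the best-connected unvisited neighbor; if none is left,
        # fall back to the best-connected neighbor overall (an assumption).
        pool = unvisited or list(graph[current])
        if not pool:
            return None
        current = max(sorted(pool), key=lambda v: len(graph[v]))
        visited.add(current)
    return None


# Hypothetical graph: "h" is a hub linking the source's side to the target.
graph = {
    "s": {"a", "h"},
    "a": {"s"},
    "h": {"s", "b", "c", "t"},
    "b": {"h"},
    "c": {"h"},
    "t": {"h"},
}
steps = high_degree_search(graph, "s", "t")
```

The strategy pays off when a few hubs reach much of the graph; when degrees are narrowly distributed, as in the filtered network, the best-connected neighbor offers little advantage.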

Strategy 2: Geography

Communication across corporate geography

87% of the 4000 links are between individuals on the same floor

[Figure: building plan with areas labeled 1U, 1L, 2U, 2L, 3U, 3L, 4U]

Cubicle distance vs. probability of being linked

[Figure: probability of being linked versus distance in feet on log-log axes; legend: measured, 1/r, 1/r^2 (optimum for search)]

Finding someone in a sea of cubicles

median = 7, mean = 12

[Figure: histogram of the number of steps (0 to 20)]

Strategy 3: Organizational hierarchy

[Figure panels: email correspondence scrambled; actual email correspondence]

Example of search path

hierarchical distance = 5, search path distance = 4

[Diagram: organizational tree with links labeled "distance 1" and "distance 2"]
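A greedy search over an organizational hierarchy needs a distance between two people in the org chart. The sketch below assumes plain tree distance (the number of reporting-line steps up to the closest common manager and back down); the study's exact definition of hierarchical distance may differ. The `manager` and `contacts` dictionaries and all function names are hypothetical.

```python
def chain_to_root(manager, person):
    """Return [person, their manager, ..., root]."""
    chain = [person]
    while manager.get(chain[-1]) is not None:
        chain.append(manager[chain[-1]])
    return chain


def tree_distance(manager, a, b):
    """Number of reporting-line steps between a and b (assumed metric)."""
    up_a = {p: i for i, p in enumerate(chain_to_root(manager, a))}
    for j, p in enumerate(chain_to_root(manager, b)):
        if p in up_a:
            return up_a[p] + j
    raise ValueError("a and b are not in the same hierarchy")


def next_hop(manager, contacts, current, target):
    """Greedy step: forward to the email contact closest to the target."""
    return min(sorted(contacts[current]),
               key=lambda c: tree_distance(manager, c, target))


# Hypothetical org chart: two managers under one director.
manager = {"dir": None, "m1": "dir", "m2": "dir", "a": "m1", "b": "m2"}
contacts = {"a": {"m1", "m2"}}
hop = next_hop(manager, contacts, "a", "b")
```

Here "a" forwards to "m2" because "m2" is one step from the target while "m1" is three steps away.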

Probability of linking vs. distance in hierarchy

[Figure: probability of linking versus hierarchical distance h (2 to 10); observed data and fit exp(-0.92*h)]

In the ‘searchable’ regime: $0 < \alpha < 2$ (Watts 2001)

Results

distance    search       geodesic    org    random
median      4            3           6      28
mean        5.7 (4.7)    3.1         6.1    57.4

[Figure: histogram of the number of steps in search (0 to 25)]

Group size vs. probability of linking

[Figure: probability of linking versus group size g on log-log axes; observed data, fit g^{-0.74}, and g^{-1}, the optimum for search (Kleinberg 2001)]

Search Conclusions

Individuals associate on different levels into groups.

Group structure facilitates decentralized search using social ties.

HP Labs as a social network is searchable but not quite optimal.

Searching using the organizational hierarchy is faster than using physical location

A fraction of ‘important’ individuals are easily findable

Humans may be much more resourceful in executing search tasks: making use of weak ties, using more sophisticated strategies

PeopleFinder 2 – a search engine for HP people

Extract & disambiguate names from publicly available documents

Enrich information available about individuals

Search for them by topic

Identify knowledge communities from co-occurrence of names

Live Demo

If live demo fails:

Current PeopleFinder functionality

PeopleFinder 2 info on a person

Extracted topics for a person

Social network

Social network visualization

Search for individuals by topic

Visualize knowledge network

Find social network paths to experts

To find out more:

(papers, slides, other research in the group)

Information dynamics group (IDL) at HP Labs: http://www.hpl.hp.com/research/idl

List of publications http://www.hpl.hp.com/personal/Lada_Adamic/research.html
