Representation of Web Document Spaces Andrew Tomkins, Yahoo! Research

Overview: some issues in web document representation
 “Classical” representation of a document corpus
– Each document (and query) is a point in a corpus-dependent high-dimensional
space
– Many concerns, but no more effective model generally accepted for information
retrieval
 For web search, many other factors intrude:
– “Regions” of a page
– Hyperlinks (some examples to follow)
– Anchortext
– Spam
– Site-level structure (templates, metadata, etc)
 And new document types that don’t fit the current model:
– Message-structured data (a brief discussion here too)
– Evolving documents (a few quick points, time permitting)
Techniques for Exploiting the
Graph
One view of the Internet: Inter-Domain Connectivity
[Figure: a core surrounded by shells 1, 2, and 3]
 Core: maximal clique of high-degree nodes
 Shells: nodes in the 1-neighborhood of the core, or of the previous shell, with degree > 1
 Legs: degree-1 nodes
[Tauro, Palmer, Siganos, Faloutsos, 2001 Global Internet]
Another view of the web: the hyperlink graph
 Each static HTML page = a node
 Each hyperlink = a directed edge
 Currently ~10^10 nodes (mostly junk), 10^11 edges
Breadth-first search from random starts

How many vertices are reachable from a random vertex?
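A minimal sketch of this experiment, assuming the graph fits in memory as a dict that maps each page to a list of its out-links. The names `reachable_count` and `sample_reachability` are illustrative and do not come from the talk.

```python
from collections import deque
import random


def reachable_count(out_links, start):
    """Count the vertices reachable from `start` (including itself) by BFS."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for w in out_links.get(u, ()):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return len(seen)


def sample_reachability(out_links, trials, seed=0):
    """Run BFS from `trials` randomly chosen start vertices."""
    rng = random.Random(seed)
    nodes = sorted(out_links)
    return [reachable_count(out_links, rng.choice(nodes)) for _ in range(trials)]


# Toy graph: a 3-cycle (a, b, c), a page d that links into it, and an isolated page e.
toy = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"], "e": []}
counts = sample_reachability(toy, trials=10)
```

On a real crawl the graph would not fit in a Python dict, so this only illustrates the measurement, not the scale.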
A Picture of (~200M) pages.
[Broder, Kumar, Maghoul, Raghavan, Rajagopalan, Stata, Tomkins, Wiener, 2001]
Some distance measurements
 Pr[u reachable from v] ~ 1/4
 Max distance between 2 SCC nodes: 28
 Max distance between 2 nodes (if there is a path) > 900
 Avg distance between 2 SCC nodes: 16
The Navigational Backbone
Each thematically unified cluster (TUC) contains a large SCC that is well connected to the SCCs of other TUCs
[Dill, Kumar, McCurley, Rajagopalan, Sivakumar, Tomkins 2002]
Overview
 Any document representation must make it possible (efficient?) to perform
various useful operations on the directed graph
 Examples:
– Link-based static ranking (e.g., PageRank)
– Batch mining operations (e.g., communities, dense subgraphs, link spam
detection, etc)
– Online query processing based on the graph (e.g., HITS, connection
subgraphs)
Batch Processing on the Graph:
Mining Communities
[Kumar, Raghavan, Rajagopalan, Tomkins 99]
General approach
 It’s hard (though getting easier) to analyze the content of all pages on the
web
 It’s easier (though still hard) to analyze the graph
 How successfully can we extract useful semantic knowledge (i.e., community
structure) from links alone?
Web Communities
[Figure: two example communities, “Fishing” (with Outdoor Magazine and Bill's Fishing Resources) and “Linux” (with LDP and Linux Links)]
Different communities appear to have very different structure.
But both contain a common “footprint”: two pages that both point to three other pages in common.
Communities and cores
Definition: A "core" K_{i,j} consists of i left nodes, j right nodes, and all left->right edges.
Example: K_{2,3}
Critical facts:
1. Almost all communities contain a core [expected]
2. Almost all cores betoken a community [unexpected]
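To make the definition concrete, here is a brute-force sketch that lists every K_{i,j} in a tiny link graph. It assumes the graph is a dict from page to out-links and that left and right node sets are disjoint. The function name `enumerate_cores` is illustrative, and the approach is only feasible on toy inputs.

```python
from itertools import combinations


def enumerate_cores(out_links, i, j):
    """Yield (left, right) tuples where every left node links to every right node."""
    nodes = sorted(out_links)
    for left in combinations(nodes, i):
        # Pages that all chosen left nodes point to, excluding the left nodes themselves.
        shared = set.intersection(*(set(out_links[u]) for u in left)) - set(left)
        for right in combinations(sorted(shared), j):
            yield left, right


# Toy graph: p1 and p2 both point to x, y, z, so they form a K_{2,3}.
toy_links = {
    "p1": ["x", "y", "z"],
    "p2": ["x", "y", "z", "w"],
    "p3": ["x"],
}
cores = list(enumerate_cores(toy_links, i=2, j=3))
```

Trying all i-subsets of pages is exponential in i, so this is a definition check rather than a practical method.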
Other footprint structures
Newsgroup thread
Corporate partnership
Web ring
Intranet fragment
Subgraph enumeration
 Goal: Given a graph-theoretic "footprint" for structures of
interest, find ALL occurrences of these footprints.
Enumerating cores
 Preprocessing: clean data by removing
– mirrors (true and approximate)
– empty pages, too-popular pages, nepotistic pages
 Inclusion/Exclusion Pruning: a (with out-links b1, b2, b3) belongs to a K_{2,3} if and only if some other node also points to b1, b2, b3
 Postprocessing: when no more pruning is possible, finish using database techniques
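The inclusion/exclusion test can be sketched as follows. This is one reading of the slide, not the authors' implementation: it assumes the test is applied to nodes whose out-degree is exactly j, so that each such node is either reported as part of a core or pruned. The function name `include_or_exclude` and the dict-of-sets graph layout are assumptions.

```python
def include_or_exclude(out_links, i, j):
    """One pass over nodes with out-degree exactly j (assumes j >= 1).

    Returns (cores, pruned): cores found around such nodes, and the nodes
    that cannot belong to any K_{i,j} and may be removed from the graph.
    """
    in_links = {}
    for src, dsts in out_links.items():
        for d in dsts:
            in_links.setdefault(d, set()).add(src)

    cores, pruned = set(), []
    for a, dsts in out_links.items():
        if len(dsts) != j:
            continue
        # Every node (including a itself) that points to all of a's targets.
        common = set.intersection(*(in_links[d] for d in dsts))
        if len(common) >= i:
            cores.add((tuple(sorted(common)), tuple(sorted(dsts))))
        else:
            pruned.append(a)
    return sorted(cores), pruned


toy_links = {
    "a": {"b1", "b2", "b3"},
    "c": {"b1", "b2", "b3", "b4"},
    "d": {"b1", "b2", "x"},
}
cores, pruned = include_or_exclude(toy_links, i=2, j=3)
```

In the toy graph, node a is kept because c also points to b1, b2, and b3, while d is pruned because no other node points to all of its targets.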
The cores are interesting
Explicit communities
 Yahoo!, Excite, Infoseek
 webrings
 news groups
 mailing lists
Implicit communities
 Japanese elementary schools
 Turkish student associations
 oil spills off the coast of Japan
 Australian fire brigades
(1) Implicit communities are defined by cores.
(2) There are an order of magnitude more of these.
(3) Can grow the core to the community using further processing.
Online Processing of the Graph:
Connection Subgraphs
[Faloutsos, McCurley, Tomkins 04]
Informal Problem Statement
 Given a large network and two distinguished vertices s and t, show the
“relationship” between s and t in the network
 Example: show the relationship between “Nicole Kidman” and “Cameron
Diaz” in a social network of people
Standard Approaches
 Standard approach number 1: show an edge if one exists:
Nicole Kidman --(acted in a movie together)-- Cameron Diaz
 Standard approach number 2: if no edge exists, show a path:
Nicole Kidman -- Carmen Electra -- Cameron Diaz
Proposed Approach
 Show a small subgraph that may capture exponentially many paths concisely
[Figure: subgraph connecting Kidman and Diaz]
How big a subgraph?
Given a graph with initial and final vertices s and t, and a budget B, return a B-node subgraph that best connects s and t.
[Figures: example subgraphs for budgets of 3, 5, and 6 nodes]
A larger example: Byron Dom to David Filo
Outline
 Introduction / Motivation
 Survey
 Proposed Method
 Algorithms
 Experiments
 Conclusions
Proposed method for selecting a subgraph
 Part 1: measuring quality of a path
– electrical current / random walks
 Part 2: selecting a subgraph
– dynamic programming
 Part 3: scalability
– heuristics
Path quality
 Why not shortest path?
 Why not network flow?
 Why not plain ‘voltages’?
[Figure: example graph with source s at +1V, sink t at 0V, and a third node f; one build shows an intermediate voltage of +0.5V]
Proposed path quality measure
 Proposed method: voltages with universal sink:
– ~ ‘tax collector’
 Goodness of a path: its electric current(*)!
[Figure: s at +1V; t and the universal sink at 0V; node f]
Electricity – Algorithm
 Voltages/amperages can be computed easily ( O(E) )
 Without universal sink:
$v(i) = \sum_j v(j) \, C(i,j) / C(i,*)$ for $i \neq$ source, sink
$v(\text{source}) = 1$; $v(\text{sink}) = 0$
 With universal sink:
$v(i) = \frac{1}{1+a} \sum_j v(j) \, C(i,j) / C(i,*)$
(~ insensitive to a (= 1))
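The two voltage equations can be solved by repeatedly applying them until the values settle. The sketch below assumes C(i,j) is a symmetric edge conductance and C(i,*) is the total conductance at node i. Repeated relaxation sweeps are a choice made for this example, not necessarily the method used in the talk. The function name `solve_voltages` and the nested-dict layout are illustrative.

```python
def solve_voltages(conductance, source, sink, a=0.0, sweeps=200):
    """Relax v(i) = 1/(1+a) * sum_j v(j) * C(i,j) / C(i,*) for i not in {source, sink}.

    conductance: dict node -> dict neighbor -> C(i, j), assumed symmetric.
    a = 0 gives the plain-voltage equation; a > 0 adds the universal sink.
    """
    v = {node: 0.0 for node in conductance}
    v[source] = 1.0  # v(source) = 1, v(sink) = 0 stay fixed
    for _ in range(sweeps):
        for i, nbrs in conductance.items():
            if i == source or i == sink:
                continue
            total = sum(nbrs.values())  # C(i, *)
            v[i] = sum(v[j] * c for j, c in nbrs.items()) / (total * (1.0 + a))
    return v


# Toy graph with unit conductances: s - a - t, plus a detour a - b - t.
C = {
    "s": {"a": 1.0},
    "a": {"s": 1.0, "t": 1.0, "b": 1.0},
    "b": {"a": 1.0, "t": 1.0},
    "t": {"a": 1.0, "b": 1.0},
}
plain = solve_voltages(C, "s", "t")
taxed = solve_voltages(C, "s", "t", a=1.0)
```

With a = 1, every interior voltage drops, which matches the ‘tax collector’ picture: some current leaks to the universal sink at each node.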
Part 2: From paths to subgraphs
 Using Part 1, compute an s-t flow on the entire graph
 Find a subgraph that “captures” much of this flow
[Figure: example s-t flow with edge currents of 1 and 1/2, and one highlighted path]
 Given the flow above, how good is the specified path?
 “Delivered current”: how many electrons travel from s to t along that path
Delivered current of a subgraph
 All units of flow (i.e., electrons) that travel from s to t via edges in the subgraph
Algorithm for selecting subgraph
 Combinatorial problem: find a B-node subgraph to optimize delivered
current – hard to solve exactly or provide approximation algorithms
 Dynamic program to compute:
– Path which maximizes delivered current per node
 Recursive greedy application
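A rough sketch of the greedy selection, under two assumptions that the slides do not spell out. First, the current entering a node is assumed to split across its outgoing edges in proportion to their currents, so a path's delivered current is the first edge's current times those proportions. Second, brute-force path enumeration stands in for the dynamic program, which only works on small graphs. All function names are illustrative.

```python
def all_paths(flow, s, t, prefix=None):
    """Enumerate simple s-t paths along edges that carry current."""
    prefix = prefix or [s]
    if prefix[-1] == t:
        yield list(prefix)
        return
    for nxt in flow.get(prefix[-1], {}):
        if nxt not in prefix:
            yield from all_paths(flow, s, t, prefix + [nxt])


def delivered(flow, path):
    """Current on the first edge, scaled by the share each later edge takes."""
    current = flow[path[0]][path[1]]
    for u, w in zip(path[1:], path[2:]):
        current *= flow[u][w] / sum(flow[u].values())
    return current


def greedy_subgraph(flow, s, t, budget):
    """Repeatedly add the path with the most delivered current per new node."""
    chosen = {s, t}
    paths = list(all_paths(flow, s, t))
    while True:
        best, best_score = None, 0.0
        for p in paths:
            new_nodes = set(p) - chosen
            if not new_nodes or len(chosen) + len(new_nodes) > budget:
                continue
            score = delivered(flow, p) / len(new_nodes)
            if score > best_score:
                best, best_score = p, score
        if best is None:
            return chosen
        chosen |= set(best)


# flow[u][w] = current on the directed edge u -> w (toy values).
flow = {"s": {"a": 0.6, "b": 0.4}, "a": {"t": 0.6}, "b": {"c": 0.4}, "c": {"t": 0.4}}
small = greedy_subgraph(flow, "s", "t", budget=3)
large = greedy_subgraph(flow, "s", "t", budget=5)
```

With a budget of 3 only the stronger path through a fits; a budget of 5 also admits the two-node detour through b and c.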
Part 3: Scalability
Begin with enormous out-of-core graph
Slowly expand from s and t to find a candidate subgraph for algorithm:
    Begin with nodes s and t in expansion pool
    Until (stoppingCriterion)
        Use pickHeuristic() to pick a node n from expansion pool
        Add n to candidate subgraph
        Add neighbors of n to expansion pool
    Apply electrical flow and dynamic program to candidate subgraph
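The expansion loop might look like the sketch below. The stopping criterion (a cap on candidate size) and the pick heuristic (fewest hops from s or t, then smallest degree) are placeholder assumptions chosen for illustration, since the slides do not give their exact definitions. The function name `expand_candidates` is also hypothetical.

```python
import heapq


def expand_candidates(graph, s, t, max_nodes):
    """Grow a candidate node set outward from s and t.

    graph: dict node -> list of neighbors (in memory here; the real graph is out of core).
    """
    candidate = set()
    # Expansion pool entries: (hops from s or t, degree, node); smallest is picked first.
    pool = [(0, len(graph[s]), s), (0, len(graph[t]), t)]
    heapq.heapify(pool)
    while pool and len(candidate) < max_nodes:  # placeholder stopping criterion
        hops, _, n = heapq.heappop(pool)        # placeholder pick heuristic
        if n in candidate:
            continue
        candidate.add(n)
        for m in graph[n]:
            if m not in candidate:
                heapq.heappush(pool, (hops + 1, len(graph[m]), m))
    return candidate


toy_graph = {
    "s": ["a", "hub"],
    "t": ["a", "hub"],
    "a": ["s", "t"],
    "hub": ["s", "t", "x", "y", "z"],
    "x": ["hub"],
    "y": ["hub"],
    "z": ["hub"],
}
picked = expand_candidates(toy_graph, "s", "t", max_nodes=3)
```

The low-degree node a is added before the high-degree hub; the electrical flow and subgraph selection would then run on the returned candidate set only.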
[Figures: the candidate subgraph grows by successive, careful expansions outward from s (source) and t (sink)]
Pseudo-code
 pickHeuristic() favors nearby nodes with
– Strong connections to source or sink
– Small degree
Experiments
 On a large real graph
– ~15M nodes, ~100M edges, weighted
– ‘who co-appears with whom’ (from 500M web pages)
 Q1: Quality of ‘voltage’ approach?
 Q2: Speed/accuracy trade-off?
Q1: Quality
 Actors (A); Computer-Scientists (CS)
 Kidman-Diaz (A-A)
 Negroponte-Palmisano (CS-CS)
 Turing-Stone (CS-A)
(A-A) Kidman-Diaz
 What are the best paths between ‘Kidman’ and ‘Diaz’?
[Figure: resulting subgraph between Kidman and Diaz, showing a strong, direct link]
CS-CS: Negroponte - Palmisano
 Mainly: CEOs of major computer companies (Dell, Gates, Fiorina, ++)
[Figure: subgraph between NN and SP, including Esther Dyson and Louis Gerstner]
CS-A: Turing - Stone
[Figure: subgraph with nodes Turing, Anderson, and Stone]
Outline
 Introduction / Motivation
 ...
 Experiments
– Q1: quality
– Q2: speed/accuracy trade-off
 Conclusions
Speed/Accuracy Trade-off
[Figure: delivered current versus number of nodes kept (‘b’), for the pairs Kleinberg-Newell, Rivest-Hoffman, Turing-Stone, and Kidman-Diaz]
Speed/accuracy trade-off
 80/20-like rule: the first few nodes/paths contribute the vast majority of ‘delivered current’
 Thus: CandidateGen makes sense
Conclusions
 Defined the problem
 Part 1: Electricity-based method to measure quality
 Part 2: Dynamic programming to spot best paths (‘DisplayGen’)
 Part 3: Scalability with good accuracy (‘CandidateGen’)
 Operational system
Message-Structured Data
[Kumar, Novak, Liben-Nowell, Raghavan 05]
Dataset
 1.3M LiveJournal bloggers, as of February 2004
 500K list a home town in the United States
 Home towns mapped to lat/long
 Granularity of locations: roughly cities
 Extracted self-reported “friends” of each blogger: 4M friendships
 80% of friendships are reciprocal
 ¾ of the network forms a giant strongly connected component
 Clustering coefficient: 0.2
 Lognormal degree distribution
 Each blogger has a profile
– Name, age, …
– Geographic information (city, state, zip, …)
– Friends and friend of
– Interests/communities
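As an illustration of how two of the statistics above could be measured, here is a small sketch over a list of directed friendship pairs. The definitions are assumptions, since the slides do not state which variants were used: reciprocity is the fraction of directed friendships whose reverse also exists, and the clustering coefficient is the average local coefficient over nodes with at least two neighbors, ignoring edge direction.

```python
from itertools import combinations


def reciprocity(friendships):
    """Fraction of directed friendships (u, w) for which (w, u) is also listed."""
    edges = set(friendships)
    mutual = sum(1 for (u, w) in edges if (w, u) in edges)
    return mutual / len(edges)


def clustering_coefficient(friendships):
    """Average local clustering coefficient on the undirected version of the graph."""
    nbrs = {}
    for u, w in friendships:
        if u != w:
            nbrs.setdefault(u, set()).add(w)
            nbrs.setdefault(w, set()).add(u)
    coeffs = []
    for u, ns in nbrs.items():
        if len(ns) < 2:
            continue  # a local coefficient needs at least two neighbors
        linked = sum(1 for a, b in combinations(ns, 2) if b in nbrs[a])
        coeffs.append(linked / (len(ns) * (len(ns) - 1) / 2))
    return sum(coeffs) / len(coeffs) if coeffs else 0.0


# Toy data: a triangle a-b-c (only a<->b is mutual) plus a pendant friend d.
toy = [("a", "b"), ("b", "a"), ("b", "c"), ("c", "a"), ("a", "d")]
r = reciprocity(toy)
cc = clustering_coefficient(toy)
```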
E.g., LiveJournal user “bill”
LJ bloggers in US
[Map; legend bins: < 1K, < 5K, < 10K, < 25K, < 50K, ~ 100K]
LJ bloggers world-wide
[Map; legend bins: < 1K, < 2K, < 5K, ~ 25K, ~ 50K, ~ 75K]
Who are they?
[Figure: % of bloggers by age, with representative interests]
Friendship graph
 Directed
 80% mutual
 Average degree ~ 14
 Power law degrees
 Clustering coeff. ~ 0.2
 Most friendships explained by age, location, interest
[Figure: Venn diagram over Age, Location, and Interest; region labels as extracted: 1%, 5%, 16%, 20%, 16%, 22%]
A More Complex Example:
Evolution of Communities
in Message-Structured Data
A word on evolution
 Phenomenon to characterize: A topic in a temporal stream occurs in a
“burst of activity”
 Model source as multi-state
 Each state has certain emission properties
 Traversal between states is controlled by a Markov Model
 Determine most likely underlying state sequence over time, given
observable output
[Kleinberg02]
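The last step, recovering the most likely state sequence, is the standard Viterbi computation for a hidden Markov model. The sketch below is a generic version rather than the cost formulation of the cited work: it assumes Poisson-distributed message counts per time step, a uniform start, and a single "stay" probability. The names `viterbi`, `rates`, and `stay` are illustrative.

```python
import math


def viterbi(counts, rates, stay=0.9):
    """Most likely state sequence for message counts observed per time step.

    counts: observed number of messages in each time step.
    rates:  expected count per step for each state (low-output first).
    stay:   probability of remaining in the current state.
    """
    n = len(rates)

    def log_emit(k, lam):  # Poisson log-probability of seeing k messages
        return k * math.log(lam) - lam - math.lgamma(k + 1)

    def log_trans(a, b):
        return math.log(stay if a == b else (1.0 - stay) / (n - 1))

    score = [math.log(1.0 / n) + log_emit(counts[0], lam) for lam in rates]
    back = []
    for k in counts[1:]:
        prev, score, ptr = score, [], []
        for b, lam in enumerate(rates):
            a = max(range(n), key=lambda a: prev[a] + log_trans(a, b))
            ptr.append(a)
            score.append(prev[a] + log_trans(a, b) + log_emit(k, lam))
        back.append(ptr)

    # Trace the best final state back to the start.
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]


# Four quiet steps followed by a burst; state 0 is low-output, state 1 is high-output.
states = viterbi([0, 1, 0, 1, 8, 9, 10], rates=[1.0, 8.0])
```

The transition penalty is what keeps the decoded sequence from flipping state on every small fluctuation in the counts.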
Example
 State 1: output rate very low
 State 2: output rate very high
 Transition probabilities between states 1 and 2: 0.01 and 0.005
 Example messages in the stream:
– “I’ve been thinking about your idea with the asparagus…”
– “Uh huh”
– “I think I see…”
– “Uh huh”
– “Yeah, that’s what I’m saying…”
– “So then I said ‘Hey, let’s give it a try’”
– “And anyway she said maybe, okay?”
 Per-step values for each state:
Time step:   1    2    3    4    5    6    7
Pr[1] ~      1    10   5    10   2    1    2
Pr[2] ~      5    2    5    2    7    10   10
 Most likely “hidden” sequence: 1 1 1 1 2 2 2
More bursts
 Infinite chain of increasingly high-output states
 Allows hierarchical bursts
 Example 1: email messages
 Example 2: conference titles
Integrating evolution and graph-based community analysis
[Figure: time series of the number of blog pages that belong to a community, the number of blog communities, and the number of communities identified automatically as exhibiting “bursty” behavior (a measure of cohesiveness of the blogspace). Annotations: Wired magazine publishes an article on weblogs that impacts the tech community; Newsweek magazine publishes an article that reaches the population at large, responding to emergence, and triggering mainstream adoption.]
[Kumar, Novak, Raghavan, Tomkins 03]
Conclusions
 Documents are complex objects whose pieces have many flavors
 For all representations, there exists a problem such that a better
representation exists