Representation of Web Document Spaces
Andrew Tomkins, Yahoo! Research

Overview: some issues in web document representation
“Classical” representation of a document corpus
– Each document (and query) is a point in a corpus-dependent high-dimensional space
– Many concerns, but no more effective model is generally accepted for information retrieval
For web search, many other factors intrude:
– “Regions” of a page
– Hyperlinks (some examples to follow)
– Anchortext
– Spam
– Site-level structure (templates, metadata, etc.)
And new document types that don’t fit the current model:
– Message-structured data (a brief discussion here too)
– Evolving documents (a few quick points, time permitting)

Techniques for Exploiting the Graph

One view of the Internet: Inter-Domain Connectivity
[Figure: core surrounded by shells 1, 2, 3]
Core: maximal clique of high-degree nodes
Shells: nodes in the 1-neighborhood of the core, or of the previous shell, with degree > 1
Legs: degree-1 nodes
[Tauro, Palmer, Siganos, Faloutsos, 2001 Global Internet]

Another view of the web: the hyperlink graph
Each static html page = a node
Each hyperlink = a directed edge
Currently ~10^10 nodes (mostly junk), 10^11 edges

Breadth-first search from random starts
How many vertices are reachable from a random vertex?

A Picture of (~200M) pages
[Broder, Kumar, Maghoul, Raghavan, Rajagopalan, Stata, Tomkins, Wiener, 2001]

Some distance measurements
Pr[u reachable from v] ~ 1/4
Max distance between 2 SCC nodes: 28
Max distance between 2 nodes (if there is a path) > 900
Avg distance between 2 SCC nodes: 16

The Navigational Backbone
Each TUC (thematically unified cluster) contains a large SCC that is well-connected to the SCCs of other TUCs
[Dill, Kumar, McCurley, Rajagopalan, Sivakumar, Tomkins 2002]

Overview
Any document representation must make it possible (efficient?)
to perform various useful operations on the directed graph
Examples:
– Link-based static ranking (e.g., PageRank)
– Batch mining operations (e.g., communities, dense subgraphs, link spam detection, etc.)
– Online query processing based on the graph (e.g., HITS, connection subgraphs)

Batch Processing on the Graph: Mining Communities
[Kumar, Raghavan, Rajagopalan, Tomkins 99]

General approach
It’s hard (though getting easier) to analyze the content of all pages on the web
It’s easier (though still hard) to analyze the graph
How successfully can we extract useful semantic knowledge (i.e., community structure) from links alone?

Web Communities
[Figure labels: Outdoor Magazine, Fishing, Bill's Fishing Resources; LDP, Linux, Linux Links]
Different communities appear to have very different structure.
But both contain a common “footprint”: two pages that both point to three other pages in common.

Communities and cores
Definition: A "core" K_{i,j} consists of i left nodes, j right nodes, and all left->right edges.
Example: K_{2,3}
Critical facts:
1. Almost all communities contain a core [expected]
2. Almost all cores betoken a community [unexpected]

Other footprint structures
Newsgroup thread
Corporate partnership
Web ring
Intranet fragment

Subgraph enumeration
Goal: Given a graph-theoretic "footprint" for structures of interest, find ALL occurrences of these footprints.

Enumerating cores
Preprocessing: clean data by removing
– mirrors (true and approximate)
– empty pages, too-popular pages, nepotistic pages
Inclusion/Exclusion Pruning: a belongs to a K_{2,3} if and only if some node points to b1, b2, b3.
[Figure: nodes a, b1, b2, b3]
Postprocessing: when no more pruning is possible, finish using database techniques

The cores are interesting
Explicit communities:
– Yahoo!, Excite, Infoseek
– webrings
– news groups
– mailing lists
Implicit communities:
– Japanese elementary schools
– Turkish student associations
– oil spills off the coast of Japan
– Australian fire brigades
(1) Implicit communities are defined by cores.
(2) There are an order of magnitude more of these.
(3) Can grow the core to the community using further processing.

Online Processing of the Graph: Connection Subgraphs
[Faloutsos, McCurley, Tomkins 04]

Informal Problem Statement
Given a large network and two distinguished vertices s and t, show the “relationship” between s and t in the network
Example: show the relationship between “Nicole Kidman” and “Cameron Diaz” in a social network of people

Standard Approaches
Standard approach number 1: show an edge if one exists:
  Nicole Kidman - Cameron Diaz (acted in a movie together)
Standard approach number 2: if no edge exists, show a path:
  Nicole Kidman - Carmen Electra - Cameron Diaz

Proposed Approach
Show a small subgraph that may capture exponentially many paths concisely
[Figure labels: Diaz, Kidman]

How big a subgraph?
Given a graph with initial and final vertices s and t, and a budget B, return a B-node subgraph that best connects s and t.
[Figures: example subgraphs for budgets of 3, 5, and 6 nodes]

A larger example: Byron Dom to David Filo

Outline: Introduction / Motivation, Survey, Proposed Method, Algorithms, Experiments, Conclusions

Proposed method for selecting a subgraph
part 1: measuring quality of a path
– electrical current / random walks
part 2: selecting a subgraph
– dynamic programming
part 3: scalability
– heuristics

Path quality, part 1
Why not shortest path?
[Figure labels: s, t, f]

Path quality, part 2
Why not shortest path? Why not network flow?
[Figure labels: s, t, f]

Path quality, part 3
Why not shortest path? Why not network flow? Why not plain ‘voltages’?
[Figure labels: s at +1V, t at 0V, f]
Path quality, part 4
Why not shortest path? Why not network flow? Why not plain ‘voltages’?
[Figure labels: s at +1V, +0.5V, t at 0V, f]

Proposed path quality measure
Proposed method: voltages with universal sink:
– ~ ‘tax collector’
Goodness of a path: its electric current(*)!
[Figure labels: s at +1V, t at 0V, f, universal sink at 0V]

Outline: Introduction / Motivation, Survey, Proposed Method, Algorithms, Experiments, Conclusions

Electricity – Algorithm
Voltages/amperages can be computed easily ( O(E) ) without universal sink:
  $v(i) = \sum_j v(j) \, C(i,j) / C(i,*)$ for $i \neq$ source, sink
  $v(\text{source}) = 1$; $v(\text{sink}) = 0$

Electricity – Algorithm
With universal sink:
  $v(i) = \frac{1}{1+a} \sum_j v(j) \, C(i,j) / C(i,*)$
(~ insensitive to a (=1))

Part 2: From paths to subgraphs
Using Part 1, compute an s-t flow on the entire graph
Find a subgraph that “captures” much of this flow
[Figure: s-t flow with edge values 1 and 1/2]
Given the flow above, how good is the specified path?
“Delivered current”: how many electrons travel from s to t along that path

Delivered current of a subgraph
All units of flow (i.e., electrons) that travel from s to t via edges in the subgraph

Algorithm for selecting subgraph
Combinatorial problem: find a B-node subgraph to optimize delivered current
– hard to solve exactly or provide approximation algorithms
Dynamic program to compute:
– Path which maximizes delivered current per node
Recursive greedy application

Part 3: Scalability
Begin with enormous out-of-core graph
Slowly expand from s and t to find a candidate subgraph for algorithm:
  Begin with nodes s and t in expansion pool
  Until (stoppingCriterion)
    Use pickHeuristic() to pick a node n from expansion pool
    Add n to candidate subgraph
    Add neighbors of n to expansion pool
  Apply electrical flow and dynamic program to candidate subgraph

Part 3: Scalability
By successive, careful expansions
[Figure labels: source, s, t, sink]
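The expansion loop and the universal-sink voltage recurrence above can be sketched together in a few lines of Python. This is only an illustration under stated assumptions, not the system from the talk: the adjacency-dict graph, the names `expand_candidates` and `voltages`, the `max_nodes` stopping rule, and the pick rule (strongest total conductance to already-chosen nodes, ties toward small degree) are all hypothetical stand-ins. C(i,*) is taken over the candidate subgraph only, voltages are found by a fixed number of relaxation sweeps, and the dynamic-programming step is omitted.

```python
from typing import Dict, Hashable, Set

# node -> {neighbor: conductance C(i,j)}; an assumed in-memory representation
Graph = Dict[Hashable, Dict[Hashable, float]]


def expand_candidates(graph: Graph, s, t, max_nodes: int) -> Set:
    """Grow a candidate subgraph outward from s and t."""
    chosen: Set = set()
    pool: Set = {s, t}  # "Begin with nodes s and t in expansion pool"

    def score(n):
        # Hypothetical stand-in for pickHeuristic(): s and t go first, then
        # nodes strongly connected to chosen nodes, then small degree.
        if n in (s, t):
            return (float("inf"), 0)
        link = sum(w for m, w in graph[n].items() if m in chosen)
        return (link, -len(graph[n]))

    while pool and len(chosen) < max_nodes:  # stand-in for stoppingCriterion
        n = max(sorted(pool, key=str), key=score)
        pool.remove(n)
        chosen.add(n)
        pool.update(m for m in graph[n] if m not in chosen)
    return chosen


def voltages(graph: Graph, nodes: Set, s, t, a: float = 1.0,
             sweeps: int = 200) -> Dict:
    """Relax v(i) = 1/(1+a) * sum_j v(j) C(i,j) / C(i,*) with v(s)=1, v(t)=0."""
    v = {n: 0.0 for n in nodes}
    v[s] = 1.0
    for _ in range(sweeps):
        for i in nodes:
            if i in (s, t):
                continue  # source and sink voltages stay fixed
            nbrs = {j: w for j, w in graph[i].items() if j in nodes}
            total = sum(nbrs.values())  # C(i,*) within the candidate subgraph
            if total > 0:
                v[i] = sum(v[j] * w for j, w in nbrs.items()) / ((1 + a) * total)
    return v


# Toy graph: s - x - t is the direct route; y and z form a dead-end branch.
graph = {
    "s": {"x": 1.0, "y": 1.0},
    "x": {"s": 1.0, "t": 1.0},
    "y": {"s": 1.0, "z": 1.0},
    "z": {"y": 1.0},
    "t": {"x": 1.0},
}
cand = expand_candidates(graph, "s", "t", max_nodes=3)
v = voltages(graph, cand, "s", "t", a=1.0)
print(sorted(cand), v["x"])  # x sits midway between s and t, so v(x) = 0.25
```

With a = 0 the same toy graph gives v(x) = 0.5, the plain-voltage value; the universal sink halves it here because half of the conductance at x leaks to ground.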
Part 3: Scalability
[Figures: three further expansion steps around s and t]

Pseudo-code
pickHeuristic() favors nearby nodes with
– Strong connections to source or sink
– Small degree

Outline: Introduction / Motivation, Survey, Proposed Method, Algorithms, Experiments, Conclusions

Experiments on large real graph
– ~15M nodes, ~100M edges, weighted
– ‘who co-appears with whom’ (from 500M web pages)
Q1: Quality of ‘voltage’ approach?
Q2: Speed/accuracy trade-off?

Q1: Quality
Actors (A); Computer-Scientists (CS)
Kidman-Diaz (A-A)
Negroponte-Palmisano (CS-CS)
Turing-Stone (CS-A)

(A-A) Kidman-Diaz
What are the best paths between ‘Kidman’ and ‘Diaz’?
[Figure labels: Diaz, Kidman]
Strong, direct link

CS-CS: Negroponte - Palmisano
[Figure labels: NN, SP]
Mainly: CEOs of major computer companies (Dell, Gates, Fiorina, ++)

CS-CS: Negroponte - Palmisano
[Figure labels: NN, Esther Dyson, Louis Gerstner, SP]

CS-A: Turing - Stone
[Figure labels: Turing, Anderson, Stone]

Outline
Introduction / Motivation
...
Experiments
– Q1: quality
– Q2: speed/accuracy trade-off
Conclusions

Speed/Accuracy Trade-off
[Plot: delivered current vs. number of nodes kept (‘b’); curves for Kleinberg-Newell, Rivest-Hoffman, Turing-Stone, Kidman-Diaz]

Speed/accuracy trade-off
80/20-like rule: the first few nodes/paths contribute the vast majority of ‘delivered current’
Thus: CandidateGen makes sense

Conclusions
Defined the problem
Part 1: Electricity-based method to measure quality
Part 2: Dynamic programming to spot best paths (‘DisplayGen’)
Part 3: Scalability with good accuracy (‘CandidateGen’)
Operational system

Message-Structured Data
[Kumar, Novak, Liben-Nowell, Raghavan 05]
Dataset
1.3M LiveJournal bloggers, as of February 2004
500K list a home town in the United States
Home towns mapped to lat/long
Granularity of locations: roughly cities
Extracted self-reported “friends” of each blogger: 4M friendships
80% of friendships are reciprocal
¾ of the network forms a giant strongly connected component
Clustering coefficient: 0.2
Lognormal degree distribution
Each blogger has a profile
– Name, age, …
– Geographic information (city, state, zip, …)
– Friends and friend of
– Interests/communities

E.g., LiveJournal user “bill”

LJ bloggers in US
[Figure legend: < 1K, < 5K, < 10K, < 25K, < 50K, ~ 100K]

LJ bloggers world-wide
[Figure legend: < 1K, < 2K, < 5K, ~ 25K, ~ 50K, ~ 75K]

Who are they?
[Table columns: Age, %, Representative interests]

Friendship graph
Directed
80% mutual
Average degree ~ 14
Power law degrees
Clustering coeff. ~ 0.2
Most friendships explained by age, location, interest
[Figure labels: Age, 1%, 5%, 16%, Location, 20%, Interest, 16%, 22%]

A More Complex Example: Evolution of Communities in Message-Structured Data

A word on evolution
Phenomenon to characterize: A topic in a temporal stream occurs in a “burst of activity”
Model source as multi-state
Each state has certain emission properties
Traversal between states is controlled by a Markov Model
Determine most likely underlying state sequence over time, given observable output
[Kleinberg02]

Example
State 1: Output rate: very low
State 2: Output rate: very high
[Figure: states 1 and 2 with transition labels 0.01 and 0.005]
[Figure: message stream]
  I’ve been thinking about your idea with the asparagus…
  Uh huh
  I think I see…
  Uh huh
  Yeah, that’s what I’m saying…
  So then I said “Hey, let’s give it a try”
  And anyway she said maybe, okay?
Time (seven successive steps):
  Pr[1] ~  1  10   5  10   2   1   2
  Pr[2] ~  5   2   5   2   7  10  10
Most likely “hidden” sequence: 1 1 1 1 2 2 2
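One way to see how a hidden sequence like this falls out of the numbers is standard Viterbi decoding over a two-state chain. The sketch below is an illustration only, not the procedure from [Kleinberg02]. It makes three assumptions that the slide does not state: the Pr[1] and Pr[2] rows are treated as relative per-step scores for each state, the labels 0.01 and 0.005 are read as the probabilities of switching from state 1 to 2 and from state 2 to 1, and the stream is assumed to start in state 1. The function name `most_likely_states` is hypothetical.

```python
import math
from typing import List, Sequence


def most_likely_states(
    scores: Sequence[Sequence[float]],  # scores[k][t]: relative score of state k at step t
    switch: Sequence[Sequence[float]],  # switch[k][m]: probability of moving from k to m
    start_state: int = 0,               # assumed starting state (0-based)
) -> List[int]:
    """Viterbi decoding in log space; returns 1-based state labels."""
    n_states, n_steps = len(scores), len(scores[0])
    # best[k]: log-score of the best path that ends in state k at the current step.
    best = [float("-inf")] * n_states
    best[start_state] = math.log(scores[start_state][0])
    back: List[List[int]] = []
    for t in range(1, n_steps):
        new_best, pointers = [], []
        for m in range(n_states):
            cands = [best[k] + math.log(switch[k][m]) for k in range(n_states)]
            k_star = max(range(n_states), key=lambda k: cands[k])
            new_best.append(cands[k_star] + math.log(scores[m][t]))
            pointers.append(k_star)
        best = new_best
        back.append(pointers)
    # Trace back from the best final state to recover the whole path.
    state = max(range(n_states), key=lambda k: best[k])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return [k + 1 for k in reversed(path)]


# Values copied from the example; the transition matrix is the assumed reading.
pr1 = [1, 10, 5, 10, 2, 1, 2]
pr2 = [5, 2, 5, 2, 7, 10, 10]
switch = [[0.99, 0.01], [0.005, 0.995]]
hidden = most_likely_states([pr1, pr2], switch)
print(hidden)  # [1, 1, 1, 1, 2, 2, 2]
```

Under these assumptions the decoder stays in state 1 through the fourth step even though Pr[2] is higher at steps 1 and 3, because a switch is expensive; it moves to state 2 only once the high-rate scores persist.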
More bursts
Infinite chain of increasingly high-output states
Allows hierarchical bursts
Example 1: email messages
Example 2: conference titles

Integrating evolution and graph-based community analysis
[Figure; labels follow]
– Number of blog pages that belong to a community
– Number of blog communities
– Number of communities identified automatically as exhibiting “bursty” behavior: a measure of cohesiveness of the blogspace
– Wired magazine publishes an article on weblogs that impacts the tech community
– Newsweek magazine publishes an article that reaches the population at large, responding to emergence, and triggering mainstream adoption
[Kumar, Novak, Raghavan, Tomkins 03]

Conclusions
Documents are complex objects whose pieces have many flavors
For all representations, there exists a problem such that a better representation exists