The web as a graph: structure and interpretation. Sridhar Rajagopalan IBM Almaden

Ravi Kumar, Prabhakar Raghavan, Andrew Tomkins (IBM Almaden)
Andrei Broder, Farzin Maghoul (AltaVista Corp.)
Raymie Stata, Janet Wiener (Compaq SRC)
Eli Upfal (Brown University)

A Picture of (~200M) pages.

Part A: Structure.
• The graph.
• The questions ...
• ... and some answers.
• The picture.

The web graph
• Nodes (or vertices) = web pages.
• Edges = (non-nepotistic) links.
• The graph = all web pages and links.
• Many nodes; estimates range from 500M to over 1B.
• Is very sparse. Average links/page between 5 and 10. Average (links/page | more than 6 links) > 30.
Concentrate on graph structure, ignore content.

Questions about the web graph
• How big is the graph? How many links on a page (outdegree)? How many links to a page (indegree)?
• Can one browse from any web page to any other? How many clicks?
• Can we pick a random page on the web?
• How different is browsing from a “random walk”?
• Can we exploit the structure of the web graph for searching and mining?
• What does the web graph reveal about the social processes which result in its creation and dynamics?

Power laws: How many pages point to a random page on the web?
• Indegrees: $I(u) = |\{v : v \to u\}|$, with $\Pr_u(I(u) = k) \propto k^{-2.1}$.
Slope = 2.1.

How many links on a page?
Slope = 2.7.

Yule/Pareto/Zipf and power laws.
• Inverse polynomial tail.
• Word frequency in text. Yule (later Mandelbrot): Statistical study of the literary vocabulary [Yule, 1944].
• Citation analysis [Lotka, 1926].
• Zipf: Human behavior and the principle of least effort [Zipf, 1949].
• Pareto: Cours d’économie politique [Pareto, 1897].
• Network graph [Faloutsos-Faloutsos-Faloutsos, 1999].
• Oligonucleotide sequences [Martindale-Konopka, 1996].
• Many other instances.

More Germane
• Access statistics for web pages.
(From server logs) [Glassman97]
• User behavior (by instrumenting browsers and proxies) [Lukose-Huberman 98, Crovella and others, 97-99]
• Earliest analytical model [Herb Simon, 1955].

Co-citation and Bibliographic coupling: Signature of a community.
• Bipartite cores: small “complete” bipartite subgraphs.
• Bibliographic coupling, co-citation analysis.
• Hubs and Authorities.
$K(3,3)$
Uses:
• Web searching (HITS/Clever).
• Mining communities (Campfire project).
• Backlink browsing, “find similar.”

Small world.
• (Small World Prediction) [Barabasi and Albert 99, Albert-Jeong-Barabasi 99]. Based on a simple model, predict that most pages are within 19 links of each other. Justify the model by crawling nd.edu.

Facts (about the crawl).
• Most of the time (75%) a random page u is not reachable from another random page v.
• Indegree and outdegree distributions satisfy the power law. Consistent over time and scale.

Component sizes.
• Component sizes follow a power-law distribution.

Reachability
• How many vertices are reachable from a random vertex?

A Picture of (~200M) pages.

Part B: Interpretation
• Random graph theory.
• Application 1: The Campfire Project.
• Application 2: Classical IR/Learning.

Random Graphs
• Erdős and Rényi’s $G_{n,p}$ model [Bollobás].
• Graph with n vertices.
• Each of the n(n-1) arcs appears with probability p.
• Graphical evolution [Palmer]: study properties of the resulting random graph as p is increased from 0 to 1.
• [Shelah and Spencer] 0-1 law: Most properties exhibit a threshold, “phase change”-like behavior.

Facts about the Erdős-Rényi model
• A random graph with average degree 4 has a giant connected component containing almost all (90%) of the vertices.
• Indegrees and outdegrees are concentrated around the mean and have exponentially declining tails.
• Most vertices in the graph are close to most others (small world).

A new random graph model.
Content creation hypothesis
• Some page creators create content without regard to what exists on the web.
• Many create pages which are inspired by pre-existing content.
• Effectively, some links are random, others are copied from pre-existing pages.

Probabilistic analysis: Evolving graphs.
• Creation and deletion processes for nodes and edges.
  – e.g. at each time step, a new node is created with a fixed probability $p_v$
  – at each time step, a new edge is created with probability $p_e$
    • links two random nodes with probability $1 - \alpha$
    • links to a node in proportion to its indegree with probability $\alpha$ (copy a random link).
  – At each time step a node (resp. edge) is deleted with probability $p_u$ (resp. $p_d$)
Simple model: creation probabilities are 1 and deletion probabilities are 0.

Theory
1. The indegree distribution has an inverse polynomial tail; moreover, the exponent of the tail depends only on $\alpha$ and can be expressed in the formula: $\Pr_u(I(u) = k) \propto k^{-(2 \ldots 1)}$
2. Almost surely, the distribution converges to this in the limit.
3. Almost surely, the graph has a linear-sized SCC.

Why study models?
• Good predictors of macroscopic behavior.
  – Degree distributions. Existence and number of cores. [WWW8]
• Algorithmic advantages (speed and accuracy).
  – Better and analyzable algorithmic methods. Inclusion-Exclusion pruning. [VLDB]
  – Applications to Data Mining.
• Better understanding of the data/corpus.
  – What is “surprising” depends on what is typical. To find interesting stuff, you must know what is expected.

Be careful about...
• Predicting and analyzing microscopic properties.
  – Microscopic: properties which can be changed by the addition/deletion of a few nodes/edges/features.
    • Examples: Diameter and girth, rare terms and features.
    • Very susceptible to noise and to systematic but small inconsistencies in the model.
  – Macroscopic: major dataset surgery required to significantly alter the property.
    • Examples: Degree distributions. Connectivity.
    • Law of large numbers or equivalent applies.

Application 1: The campfire project.

Co-citation: Signature of a community.
• Bipartite cores: small “complete” bipartite subgraphs.
$K(3,3)$
Fact: Let $K(i,j)$ denote a core with $i$ left-hand vertices and $j$ right-hand vertices. If $i + j < ij$, then a sparse $G_{n,p}$ random graph almost surely contains no $K(i,j)$ (the expected number of copies is at most $n^{i+j} p^{ij}$, which tends to 0 when $p = O(1/n)$).

Campfire project
• Automatically find and organize communities on the web.
• Approach:
  – Find all cores.
  – Grow cores into the full community.
  – Do IR/Categorization/Clustering etc. to organize the community space.
[KRRT] WWW8, and [KRRT] VLDB’99.

The cores are interesting.
Explicit communities:
• Yahoo!, Excite, Infoseek
• webrings
• news groups
• mailing lists
(1) Implicit communities are defined by cores.
(2) There are an order of magnitude more of these. There are efficient heuristics to compute all cores.
(3) Can grow the core to the community using Clever.

Implicit communities
• hotels in Costa Rica
• clipart
• Japanese elementary schools
• Turkish student associations
• oil spills off the coast of Japan
• Australian fire brigades
• aviation/aircraft vendors
• guitar manufacturers

Costa Rican Hotels.
• The Costa Rica Inte...ion on arts, busi...
• Informatica Interna...rvices in Costa Rica
• Cocos Island Research Center
• Aero Costa Rica
• Hotel Tilawa - Home Page
• COSTA RICA BY INTER@MERICA
• tamarindo.com
• Costa Rica
• New Page 5
• The Costa Rica Internet Directory.
• Costa Rica, Zarpe Travel and Casa Maria
• Si Como No Resort Hotels & Villas
• Apartotel El Sesteo... de San José, Cos...
• Spanish Abroad, Inc. Home Page
• Costa Rica's Pura V...ry - Reservation ...
• YELLOW\RESPALDO\HOTELES\Orquide1
• Costa Rica - Summary Profile
• COST RICA, MANUEL A...EPOS: VILLA

• Hotels and Travel in Costa Rica
• Nosara Hotels & Res...els & Restaurants...
• Costa Rica Travel, Tourism & Resorts
• Association Civica de Nosara
• http://www...ca/hotels/mimos.html
• Costa Rica, Healthy...t Pura Vida
• Domestic & International Airline
• HOTELES / HOTELS - COSTA RICA
• tourgems
• Hotel Tilawa - Links
• Costa Rica Hotels T...On line Reservations
• Yellow pages Costa ...Rica Export
• INFOHUB Costa Rica Travel Guide
• Hotel Parador, Manuel Antonio, Costa Rica
• Destinations

Elementary Schools in Japan
• The American School in Japan
• The Link Page
• ‰ª• èŽ s—§ˆä“c ¬Šw Zƒz [ƒ ƒy [ƒW
• Kids' Space
• ˆÀ• éŽ s—§ˆÀ é ¼•” ¬Šw Z
• ‹{ é‹³ˆç‘åŠw• ‘® ¬Šw Z
• KEIMEI GAKUEN Home Page ( Japanese )
• Shiranuma Home Page
• fuzoku-es.fukui-u.ac.jp
• welcome to Miasa E&J school
• _“Þ ìŒ§ E‰¡•l s—§’† ì¼ ¬Šw Z‚Ìƒy
• http://www...p/~m_maru/index.html
• fukui haruyama-es HomePage
• Torisu primary school
• goo
• Yakumo Elementary,Hokkaido,Japan
• FUZOKU Home Page
• Kamishibun Elementary School...

• schools
• LINK Page-13
• “ú–{‚ÌŠw• Z
• a‰„ ¬Šw Zƒz [ƒ ƒy [ƒW
• 100 Schools Home Pages (English)
• K-12 from Japan 10/...rnet and Education )
• http://www...iglobe.ne.jp/~IKESAN
• ‚l‚f‚j ¬Šw Z‚U”N‚P‘g•¨Œê
• ÒŠ—’¬—§ ÒŠ—“Œ ¬Šw Z
• Koulutus ja oppilaitokset
• TOYODA HOMEPAGE
• Education
• Cay's Homepage(Japanese)
• –y“ì ¬Šw Z‚Ìƒz [ƒ ƒy [ƒW
• UNIVERSITY
• ‰J—³ ¬Šw Z DRAGON97-TOP
• Â‰ª ¬Šw Z‚T”N‚P‘gƒz [ƒ ƒy [ƒW
• ¶µ°é¼ÂÁ© ¥á¥Ë¥å¡¼ ¥á¥Ë¥å¡¼

Application 2: Classical Learning/IR.

Vector space and other classical models.
• Document is a vector in a real-valued space with dimensions identified with “features.” [Salton]
  $\vec{x} \in \mathbb{R}^F$, $F = \{\text{features}\}$
Some notion of similarity, usually cosine or dot-product.
  $D(\vec{x}, \vec{y}) = \cos(\vec{x}, \vec{y})$
Built-in assumption: Features are independent.

Uses of the Vector Space model.
• Search, Clustering, Classification.
• Term weighting. [Salton, Dumais, Sparck-Jones]
• SVD (for instance, LSI [Deerwester et al.]).
• Gaussian assumption and classification (for instance, [Koller-Sahami], [Chakrabarti et al.]).
• Many ad-hoc methods and heuristics, some of which work remarkably well. [Modha et al.]
• Clustering. [Drineas et al.]
• Dimensionality reduction. Feature selection. [Johnson-Lindenstrauss, Koller-Sahami, Chakrabarti et al. and others]

Two (new?) ingredients.
• Hypertext -- the graph.
• Zipfian distributions on term occurrences.

Hypertext Classification/Clustering.
• Class of a page is a function of text + class of neighbor set.
  – Classification problem -- Markov Random fields. [Chakrabarti-Dom-Indyk]
  – Clustering problem -- [Modha]

Research Issue
• Rework applications in these new (rather old) contexts.
OR
• Explain why the standard algorithms continue to work despite the sometimes questionable assumptions behind their derivation.
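
Illustration: simulating the evolving-graph model.
The simple evolving-graph model from Part B (creation probabilities 1, deletion probabilities 0) can be explored with a small simulation. What follows is a minimal sketch under stated assumptions, not the authors' code: the function name `simulate_copying_graph`, its parameters, and the seeds are hypothetical; the copy probability is written `alpha`; exactly one node and one edge are added per step; and only edge destinations are tracked, because the quantity of interest is the indegree.

```python
import random
from collections import Counter


def simulate_copying_graph(steps, alpha, seed=0):
    """Grow a graph one node and one edge per step (hypothetical helper).

    With probability alpha the new edge copies the destination of a
    uniformly chosen existing edge, which selects a node in proportion
    to its indegree. Otherwise the destination is a uniformly random
    node. Edge sources are not recorded (assumption: only indegrees
    are studied here). Self-loops are allowed for simplicity.
    """
    rng = random.Random(seed)
    num_nodes = 1          # start from a single node with no edges
    edge_targets = []      # destination node of every edge created so far
    for _ in range(steps):
        num_nodes += 1     # node creation probability is 1
        if edge_targets and rng.random() < alpha:
            target = rng.choice(edge_targets)   # copy a random link
        else:
            target = rng.randrange(num_nodes)   # uniformly random node
        edge_targets.append(target)
    indegree = Counter(edge_targets)            # node -> indegree (> 0 only)
    return num_nodes, indegree


def indegree_histogram(num_nodes, indegree):
    """Return {k: number of nodes with indegree k}, including k = 0."""
    hist = Counter(indegree.values())
    hist[0] = num_nodes - len(indegree)
    return dict(sorted(hist.items()))


if __name__ == "__main__":
    for alpha in (0.0, 0.8):
        n, deg = simulate_copying_graph(20000, alpha, seed=1)
        hist = indegree_histogram(n, deg)
        print("alpha =", alpha, "max indegree =", max(hist))
```

With `alpha` near 0 the indegrees stay small, while a larger `alpha` produces a few nodes with very large indegree, which is the qualitative heavy-tail behavior the Theory slide describes. Plotting the histogram on log-log axes is one way to inspect the tail.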
