Extracting large-scale knowledge bases from the web*

Ravi Kumar    Prabhakar Raghavan    Sridhar Rajagopalan    Andrew Tomkins
IBM Almaden Research Center

Abstract

The subject of this paper is the creation of knowledge bases by enumerating and organizing all web occurrences of certain subgraphs. We focus on subgraphs that are signatures of web phenomena such as tightly-focused topic communities, webrings, taxonomy trees, keiretsus, etc. For instance, the signature of a webring is a central page with bidirectional links to a number of other pages. We develop novel algorithms for such enumeration problems. A key technical contribution is the development of a model for the evolution of the web graph, based on experimental observations derived from a snapshot of the web. We argue that our algorithms run efficiently in this model, and use the model to explain some statistical phenomena on the web that emerged during our experiments. Finally, we describe the design and implementation of Campfire, a knowledge base of over one hundred thousand web communities.

* IBM Almaden Research Center, 650 Harry Road, San Jose, CA 95120. Email: {ravi, pragh, sridhar, tomkins}@almaden.ibm.com

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment. Proceedings of the 25th VLDB Conference, Edinburgh, Scotland, 1999.

1 Overview

The subject of this paper is the creation of knowledge bases by enumerating and organizing all web occurrences of chosen subgraphs. For example, consider enumerating all cliques of size four or more, where a clique is a set of web pages each of which links to all the others. Each such clique could represent an alliance between the creators of these pages: business partners, keiretsus, members of a family, etc. If we could enumerate all cliques on the web, then organize and annotate them into a usable structure, we would have created a knowledge base of all such alliances—patent as well as latent—as evident in the link structure of the web.

More generally, we consider the creation of knowledge bases from an analysis of the web graph using the following paradigm: (1) identify a signature subgraph that is likely to arise in every element to be represented in the knowledge base; (2) devise a method for enumerating every instance of this subgraph in the web graph; (3) reconstruct, from each enumerated subgraph, the associated element of the knowledge base; and (4) annotate and index the elements to make the knowledge base usable. In the example above, step (1) could consist of identifying a clique of size four as likely to be present in every keiretsu (say); step (2) would require an efficient method for enumerating cliques of size 4; step (3) would require assembling all pages in a keiretsu given the portion represented by the 4-clique; and step (4) could consist of extracting and indexing statistically significant keywords from the assembled pages. While the first of these steps is specific to the elements that will populate the knowledge base, the other three share some common challenges that will be our focus here.
Foremost among these challenges: enumerating subgraphs on large graphs is, in the worst case, infeasible—from a complexity-theoretic as well as a practical standpoint. Clearly, we must exploit the fact that the web is not a "worst-case" graph. To this end, we develop a stochastic model of the web graph that exhibits good agreement with statistics from the web, and show that a traditional random graph model could not exhibit such agreement. We develop an algorithmic paradigm for subgraph enumeration problems, run a concrete instance on the web, and show that its good performance is predicted by our web graph model. We describe the ongoing Campfire project, in which we enumerate, annotate, and index over 100,000 web communities generated using our methods.

Locally dense regions and communities. We begin with the following motivating example.

Example 1. Consider the set of web pages that point to both www.boeing.com and www.airbus.com. There are more than two thousand such pages on today's web, including personal pages, aircraft museums, vendors, legal services, and so forth. Almost all of these pages represent some type of resource list of airplane manufacturers. The subgraph induced by these pages, and the pages they point to, has a specific form: a number of resource lists all point to some subset of a set of resources, in this case Boeing, Airbus, and dozens of other aircraft manufacturers. Moreover, pages within the subgraph frequently—but not always—cross-reference each other. (See Figure 1.)

A knowledge base containing all structures such as the one in Figure 1—aircraft manufacturers, long distance phone companies, US national parks, etc.—would clearly be of tremendous value. This motivates the definition of structures we call bipartite cores: a bipartite core in a graph consists of two (not necessarily disjoint) sets of nodes L and R, such that every node in L links to every node in R. Note that links from R to L, or within R or L, may or may not be present.

[Figure 1: A bipartite core. The centers shown in the figure include www.boeing.com, www.airbus.com, and www.embraer.com.]

Indeed, one may envision building knowledge bases from enumerations of many different interesting structures—bipartite cores, cliques, webrings (which manifest themselves as star-shaped graphs with bidirectional links on the spokes), pages in a hierarchically organized website, or newsgroups and newsgroup discussion threads (which manifest themselves as bidirectional paths). The reasons for doing so include: (1) Such knowledge bases represent a better starting point for deeper analyses and mining than raw web data. Indeed, this is the goal of the Campfire project. (2) Structure can be used more effectively for searching and navigation. For instance, an agent responsible for searching a database containing dense bipartite graphs could pay more attention to text surrounding the relevant links, for their annotative value. (3) Fine-grained structures provide a basis for targeted market segmentation. (4) Studying these enumerated structures over time gives us insight into the sociological evolution of the web.
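To make these signatures concrete, the following minimal sketch (our illustration, not code from the paper) tests a set of pages for the 4-clique signature and a pair of sets (L, R) for the bipartite-core signature, on a toy adjacency-set representation of the link graph; the data and function names are assumptions made for the example.

from itertools import permutations

# Toy link graph: page -> set of pages it links to (illustrative data only).
links = {
    "a": {"b", "c", "d"}, "b": {"a", "c", "d"},
    "c": {"a", "b", "d"}, "d": {"a", "b", "c"},
    "list1": {"www.boeing.com", "www.airbus.com", "www.embraer.com"},
    "list2": {"www.boeing.com", "www.airbus.com", "www.embraer.com"},
}

def is_clique(pages, links):
    # Every page links to every other page: the "alliance" signature above.
    return all(q in links.get(p, set()) for p, q in permutations(pages, 2))

def is_bipartite_core(L, R, links):
    # Every page in L links to every page in R; links within L or R, or from
    # R back to L, may or may not be present.
    return all(r in links.get(l, set()) for l in L for r in R)

print(is_clique({"a", "b", "c", "d"}, links))                          # True
print(is_bipartite_core({"list1", "list2"},
                        {"www.boeing.com", "www.airbus.com"}, links))  # True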
Challenges and approaches. From an algorithmic perspective, the naive "search" algorithm for enumeration suffers from two fatal problems. First, the size of the search space is far too large—using the naive algorithm to enumerate all bipartite cores with two web pages pointing to three pages would require examining approximately 10^40 possibilities on a graph such as the web, with 10^8 nodes. Second, and more practically, the algorithm requires random access to edges in the graph, which implies that a large fraction of the graph must effectively reside in main memory to avoid the overhead of a disk seek on every edge access.

Although these obstacles appear insurmountable, we exploit additional structure latent in the web graph. For instance, the average number of links out of a page is small. However, the web is not a random sparse graph—it is a sparse graph containing many structures (cliques, for instance) that arise only in far denser random graphs. The reason the web contains these structures is that, despite its overall sparseness, local regions of the web are dense. We propose novel algorithms that exploit this structure of the web graph to overcome these challenges.

To understand the performance of our algorithms, we develop a stochastic model for "web-like" graphs. The model exhibits two desirable properties: (1) it agrees with a number of statistical observations about the web graph, and (2) it is a useful tool in algorithm design, suggesting both explanations for the performance of our algorithms and future directions for other efficient web analyses. Here is a preview of a key technical element of our stochastic web graph model: intuitively, new pages are created by borrowing random fragments of existing pages.

Guided tour of this paper. In Section 2 we detail a number of measurements we have made of a snapshot of the web graph; these include the distribution of in- and out-degrees of nodes, and the numbers of bipartite cores. In Section 3 we motivate and develop our stochastic model for the web graph. We give analyses showing that our model—besides being a plausible high-level process for the creation of the web graph—explains our measurements from Section 2 in ways that traditional random graph models could not. Section 4 describes our three main algorithms: the elimination/generation paradigm for subgraph enumeration, an extension of a link-based web search algorithm for extending bipartite cores into communities, and an index extraction algorithm. In Section 5 we give some results of the Campfire project on organizing web communities.

1.1 Related previous work

Link analysis. A number of web search projects have used links to enhance the quality and reliability of search results; see for instance HITS [19], its variants [5, 9, 10], and Google [6]. The connectivity server [4] also provides a fast index to linkage information on the web. Dean and Henzinger [12] combine heuristic improvements from [5] and [10] and apply these to the problem of finding related pages on the web.

Sociometrics. Statistical analysis of the structure of the academic citation graph has been the subject of much work in the sociometrics community. As we discuss below, Zipf distributions seem to characterize web citation frequency. Interestingly, the same distributions have also been observed for citations in the academic literature. This fact, known as Lotka's law, was demonstrated by Lotka in 1926 [23]. Gilbert [16] presents a probabilistic model explaining Lotka's law, which is similar in spirit to our proposal, though different in details and application.

Enumerating bipartite cores. In recent work [20] we reported enumerating over 200,000 bipartite cores from a snapshot of the web.
The contributions reported in that paper were: (1) a special case of the algorithmic paradigm introduced here; (2) preliminary statistical observations about the web that we extend and explain here, using the web graph model developed in this paper; and (3) a random sampling experiment showing that barely 5% of the cores arose coincidentally. The large scale and high quality of such enumerated structures provide a compelling basis for building knowledge bases out of them. A number of these results, including Kleinberg's HITS algorithm [19], are discussed in the survey paper [21]. Henzinger et al. [17] study algorithmic and memory bottleneck issues in related graph computations from a theoretical perspective, primarily to derive impossibility results. For a survey of database techniques on the web and the relationships between them, see Florescu, Levy, and Mendelzon [13].

Data mining. Traditional data mining research (see for instance Agrawal and Srikant [1]) focuses largely on algorithms for finding association rules and related statistical correlation measures in a given dataset. However, efficient methods such as a priori [1], or even more general methodologies such as query flocks [29], do not scale to the number of "items" (pages) in the web dataset. This number is currently around four hundred million, which is two to three orders of magnitude more than the number of items in a typical market basket analysis. Further, the graph-theoretic structures we seek could correspond to association rules with very small support and confidence. Our conviction that these structures are interesting comes from additional insight about the web graph, rather than from traditional support and confidence measures. In the case of bipartite cores, the relation we are interested in is co-citation. Co-citation is effectively the join of the web citation relation with its transpose, the web "cited by" relation. The size of this relation is potentially much larger than the original citation relation. Thus, we need methods that work with the original, without explicitly computing the co-citation relation. The work of Mendelzon and Wood [25] is an instance of structural methods in mining. They argue that the traditional SQL query interface to databases is inadequate in its power to specify several structural queries that are interesting in the context of the web. We note that this is not the case here. The signature graphs we are interested in are relatively easy to specify in SQL.

2 Measurements of the web graph

In this section we describe a set of measurements generated from a crawl of the web from 1997, provided by Alexa, Inc. As our primary focus is the linkage patterns between pages, we consider only the interconnection patterns of hyperlinks, and abstract away the textual content of each page. We view each page as a node of a directed graph, and each hyperlink as a directed edge. These measurements provide crucial insights for our web graph model (Section 3) as well as for our efficient enumeration algorithms (Section 4).

In-degree. We say that a hyperlink is an out-link of its source page, and an in-link of its destination page. We call the number of out-links and in-links of a page the out-degree and in-degree of the page, respectively. We begin by counting the in-degree of each page.
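Counting degrees from an edge list is itself straightforward; the sketch below is our in-memory illustration (not the measurement code used for the paper), computing both degree histograms and the exponent of a fitted inverse polynomial, with an assumed list-of-pairs edge format.

import math
from collections import Counter

def degree_histograms(edges):
    # edges: iterable of (source, destination) pairs.
    indeg, outdeg = Counter(), Counter()
    for src, dst in edges:
        outdeg[src] += 1
        indeg[dst] += 1
    # Map each degree value to the number of pages having that degree.
    return Counter(indeg.values()), Counter(outdeg.values())

def zipf_exponent(hist):
    # Least-squares slope of log(count) versus log(degree); its negation is
    # the exponent of the fitted inverse polynomial (e.g., roughly 2.1 below).
    pts = [(math.log(d), math.log(c)) for d, c in hist.items() if d > 0]
    mx = sum(x for x, _ in pts) / len(pts)
    my = sum(y for _, y in pts) / len(pts)
    num = sum((x - mx) * (y - my) for x, y in pts)
    den = sum((x - mx) ** 2 for x, _ in pts)
    return -num / den

in_hist, out_hist = degree_histograms([("p1", "p2"), ("p3", "p2"), ("p3", "p1")])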
Since the average out-degree is around eight, and since each edge contributes equally to the total in- and out-degree, the average in-degree must also be around eight. Our interest is in understanding not just the average, but the complete distribution of in-degrees. Figure 2 shows a log-log plot of the number of pages that have in-degree i as a function of i. The linearity of the curve indicates that the distribution is an inverse polynomial. Fitting an inverse polynomial to the data, we find the probability that a page has i in-links to be roughly proportional to i^-2.1. We will also refer to inverse polynomial distributions as Zipfian distributions [30]. An important characteristic of Zipfian distributions is that they have high probability of deviating significantly from the mean. Thus, although the mean in-degree of a page is about 8, there is a significant probability that a page will have 1000 in-links (for example, approximately 100K pages on the web have >= 1000 in-links).

[Figure 2: In-degree distribution.]

Out-degree. Our next observation concerns the roughly Zipfian distribution of out-degrees in the web graph. Figure 3 shows a log-log plot of i versus the number of pages that have out-degree i. This curve also follows a Zipfian distribution. Fitting an inverse polynomial to the data, we note that a random web page has out-degree i with probability approximately i^-2.38.

[Figure 3: Out-degree distribution.]

Ci,j counts. In Section 1 we described bipartite cores, which consist of two sets of pages L and R, such that every page in L links to every page in R. If L contains i pages, and R contains j pages, we refer to the core as a Ci,j.

During the exhaustive enumeration [20] of bipartite cores, we created a subgraph of the web consisting of all pages remaining after the algorithm prunes away all nodes of degree less than four. We enumerated all Ci,j's in the resulting graph, for all i in {2, ..., 7} and j in {3, ..., 20}. We now analyze the trends in that dataset. We begin by looking at the dependence of the number of Ci,j's on i. Figure 4 shows a number of curves representing fixed values of j and four values of i—we display only counts for i >= 4 since the pruning removed nodes of degree less than 4. To avoid clutter, we show values of j ranging from 3 to 6, and 20—intermediate values have the same form.

[Figure 4: Number of Ci,j's as a function of i.]
As the figure shows, the number of Ci,j's drops exponentially as a function of i in this range. Next, Figure 5 shows the curves representing fixed values of i for various values of j. The graph shows a log-log plot with linear behavior. Fitting an inverse polynomial to the data, the number of Ci,j's as a function of j drops as j^-c, where the value of c is between 1.09 and 1.4.

[Figure 5: Number of Ci,j's as a function of j.]

3 Models for web-like graphs

In this section we present our model for web-like graphs, motivated by the following goals:

1. Understand various structural properties of the web graph—in- and out-degrees, neighborhood structures, etc. Of particular interest to us is the distribution of web structures that are signatures of the components of a knowledge base (e.g., the bipartite cores we are interested in).

2. Perform a more realistic analysis of algorithms on the web graph—this is of particular interest because worst-case analysis of many algorithms is particularly pessimistic and unrealistic when applied to the web graph.

3. Predict properties of the web graph based on the model—this is of interest because it can lead to better algorithmic and structural insight.

We seek to model the linkage structure of the web graph. In particular, our models do not describe textual content. We begin with a discussion of the properties such a model should have, and then present a general framework for a class of graph models called copying models. Next, we give a simple concrete instance of a copying model, and analyze the concrete model to predict the parameters measured in Section 2. We show that measurements and predicted behavior exhibit strong agreement.

3.1 Desiderata for a web graph model

1. Simplicity: It should have a succinct and fairly natural description.

2. Plausibility: It should be rooted in a plausible macro-level process for the creation of content on the web. We cannot hope to model the detailed behavior of the many users creating web content. Instead, we only desire that the aggregate formation of web structure be captured well by our graph model. Thus, while the model is described as a stochastic process for the creation of individual pages, we are really only concerned with the aggregate consequence of these individual actions. Therefore we seek a model that is plausible at this aggregate level.

3. Topics and communities: It should provide structure corresponding to the strongly-linked "topics" that have emerged on the actual web, but it should not do so by requiring some a priori static set of topics that are part of the model description—the evolution of interesting topics and communities should instead be an emergent feature of the model. (In particular, we avoid models of the form "Assume each node is some combination of topics, and add an edge from one page to another with probability dependent on some function of their respective combinations.") Such a model has several advantages:
  - Viability: It is extremely difficult to characterize the set of topics on the web; thus it would be useful to draw statistical conclusions without such a characterization.
  - Dynamism: The set of topics reflected in web content has proven to be fairly dynamic. Thus, the shifting landscape of actual topics will need to be addressed in any topic-aware model of time-dependent growth.

4. Statistics: We would like the model to reflect many of the structural phenomena we have observed in the web graph. These include:
  (a) Zipfian distributions: A Zipfian distribution [30] for the number of links to a web page; in particular, the number of pages with i in-links is well-approximated [20] by 1/i^2. A similar phenomenon has been observed in the study of scholarly citations [11, 23], and is the basis of studies in the sociology of scientific communities [16].
  (b) Locally dense, globally sparse structure: Although the web graph is relatively sparse (the average number of links out of a page is roughly 7.2), the graph contains well over one hundred thousand bipartite cores with at least six nodes in them, even after mirrors are deleted. This is because the web graph, though globally sparse, has many locally dense regions.

5. Evolution: We should capture the phenomenon that the web graph has nodes and edges appearing and disappearing with time.
Interestingly and unfortunately, Items 3, 5, and all the criteria of Item 4 fail to hold for traditional models [7] of random graph theory. For instance, traditional random graph models would predict in- and out-degrees that are Poisson distributed, rather than the significantly heavier-tailed Zipfian distributions we have observed. It is easy to extend traditional random graph models to capture evolution, but this has not been addressed, primarily because in standard models it is unlikely that interesting mathematical phenomena arise from evolution.

3.2 Intuition for our models

Items 1 and 2 of the desiderata listed above ask that our model encapsulate a simple, plausible notion of content creation on the web. Our notion is based on the following intuition:
  - some page creators on today's web may create content and link to other sites without regard to the topics that are already represented on the web, but
  - many page creators will be drawn to existing topics of interest to them, and will link to pages within some of these existing topics.

Consider a user intent on creating a page about recreational sailing. Like most content creators, this user would probably wish to incorporate some links that would be of interest to potential visitors to the page. In order to gather together such links, the user would probably begin to browse around, perhaps using the many excellent resource lists already available about recreational sailing (by resource lists we mean pages that collect links on one or more topics, ranging in scope from a carefully-researched node of Yahoo to the small personal page of an individual who has collected several links about lateen-rigged sailing vessels), and choosing links based on his or her particular preferences within the topic. In the end, the resulting page would be another resource list about the topic, albeit one with a new personal spin.

We draw two lessons from this example: first, if a number of users have created links to a page, a new user will be more likely to link to that page than to a random page (partly because the page is probably of higher quality, and partly because the page is easier to find). And second, a user who has added a link to a sailing page is more likely to add another sailing link than an arbitrary link. Or, said another way, if a user links to a page, and some existing resource list also links to that page, the user may link to something else appearing on the resource list. Rephrasing these observations in purely graph-theoretic terms:

1. A new page is more likely to link to pages with higher in-degree.
2. A new page is more likely to link to two pages that co-occur on some resource list than to two random pages.

We therefore propose the following intuition for a simple process of link creation, which results in behavior obeying the previous observations: a new page adds links by picking an existing page, and copying some links from that page to itself.

We reiterate that this process is not meant to reflect individual user behavior on the web; rather, it is a local procedure which in aggregate works well in describing page creation on the web. The model also implicitly captures topic creation as follows: First, a few scattered pages begin to appear about the topic.
Then, as users interested in the topic reach critical mass, they begin linking these scattered pages together, and other interested users are able to discover and link to the topic more easily. This creates a "locally dense" subgraph around the topic of interest. Furthermore, copying is a powerful mechanism for giving rise to bipartite cores: an author creates a resource list, and others create pages pointing to many pages on this resource list. This intuitive view summarizes the process from a page-creator's standpoint; we now recast this formulation in terms of a random graph model.

3.3 A class of graph models

We model the web as an evolving graph, in which nodes and edges appear and disappear with time. Our models are described by four stochastic processes—creation processes C_v and C_e for node- and edge-creation, and deletion processes D_v and D_e for node- and edge-deletion. These processes are discrete-time processes. Each process is a function of the time step, and of the current graph. Consider, for instance, the following node creation and deletion process: at time step t, independent of all earlier events, create a node with probability alpha_c(t). We could have a similar model with parameter alpha_d(t) for node deletion. Deleting a node also deletes all its incident edges. Clearly, we would tailor these probabilities to reflect the growth rates of the web and the half-life of pages respectively.

We present edge processes ranging from simple to complex to model the web with increasing fidelity. At this stage, we state a complex model we believe to be largely realistic. In Section 3.4, we show that even a greatly simplified process induces a Zipf distribution on in-degrees. At each step we choose (possibly by random sampling) a node v to add edges out of, and a number of edges k that will be added to v. With probability beta we add k edges from v to nodes chosen independently and uniformly at random. With probability 1 - beta, we choose another vertex u at random, and copy k edges from u to v. That is, after choosing a node u at random, we create k (directed) edges (v, w) such that (u, w) is a random edge incident at u. One might reasonably expect that much of the time u will not have out-degree larger than k; if the out-degree of u exceeds k, we pick a random subset of size k. If, on the other hand, the out-degree of u is less than k, we first copy the edges out of u, then pick another random node to copy from, and so on until we have enough edges. Such a copying process is not unnatural, and is consistent with the qualitative intuition at the beginning of this section. As a simple example of an edge deletion process, at time t, delete a random edge with probability delta(t).
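To make the copying step concrete, here is a minimal sketch (our illustration, not code from the paper) of a single edge-creation step of the process just described; the choices of v and k are left to the caller, since the paper leaves those distributions open, and the dictionary-of-lists representation is an assumption.

import random

def add_edges_by_copying(out_links, v, k, beta, rng=random):
    # out_links: node -> list of destinations.  Adds k out-edges to node v.
    nodes = list(out_links)
    if rng.random() < beta:
        # With probability beta: k destinations chosen uniformly at random.
        out_links[v].extend(rng.choice(nodes) for _ in range(k))
        return
    # With probability 1 - beta: copy (a random subset of) the out-edges of a
    # randomly chosen node u, moving on to further nodes until k edges exist.
    added = 0
    while added < k:
        u = rng.choice(nodes)
        if not out_links[u]:          # assumes some node already has out-links
            continue
        take = rng.sample(out_links[u], min(k - added, len(out_links[u])))
        out_links[v].extend(take)
        added += len(take)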
3.4 The alpha and (alpha, beta) models

We illustrate these ideas with a very simple special case that captures destination copying. We show that even in this simple case, the induced in-degree distribution is Zipfian. A node is created at every step. Nodes and edges are never deleted, so the graph keeps on growing. This corresponds to setting alpha_c(t) to 1, and alpha_d(t) and delta(t) to 0. We now concentrate on the edge creation process. As each node comes into being, with probability alpha in (0, 1) it adds an edge to itself, and with probability 1 - alpha it picks a random edge and copies that edge onto itself; i.e., it chooses a random edge (u, w) created earlier and adds the edge (v, w).

We now argue that the in-degree distribution of nodes in the alpha model is Zipfian. Let p_{i,t} be the fraction of nodes at time t with in-degree i. Assume t is sufficiently large that p_{i,t} = p_{i,t+1}. Then at time t there are t p_{i,t} nodes with in-degree i, and at time t + 1 there are (t + 1) p_{i,t+1} such nodes, so the expected increase in the number of nodes with in-degree i between the two steps is (t + 1) p_{i,t} - t p_{i,t} = p_{i,t}. The probability that a node with in-degree i - 1 garners the single new edge is (1 - alpha)(i - 1) p_{i-1,t} (there are t p_{i-1,t} such nodes, each with i - 1 in-links, and the copied edge is chosen uniformly from the roughly t edges present), and likewise the probability that a node with in-degree i gains the new edge, and therefore moves to in-degree i + 1, is (1 - alpha) i p_{i,t}. It must therefore be the case that

  (1 - alpha)(i - 1) p_{i-1,t} - (1 - alpha) i p_{i,t} = p_{i,t},

which for large i is approximately

  (1 - alpha) i (p_{i-1,t} - p_{i,t}) = p_{i,t}.

Approximating the difference on the left by a derivative in i,

  -(1 - alpha) i dp_{i,t}/di = p_{i,t},
  (1 - alpha) dp_{i,t}/p_{i,t} = -di/i,
  (1 - alpha) ln p_{i,t} = -ln i,

and hence

  p_{i,t} = 1/i^{1/(1-alpha)}.

As an example, set alpha = 1/2. In this process, each new edge flips a fair coin, and on heads points to the newest page, but on tails points to the destination of a uniformly-chosen random in-link. This process will then have an in-degree distribution given by 1/i^2, following our observations of the web.

Now, consider an extension to the alpha model yielding the (alpha, beta) model. Each time a new page arrives, a single edge is added as follows. The process flips two independent coins with probabilities alpha and beta of coming up heads. The new edge (u, v) is built as follows: the "destination" v is set based on the alpha coin: if it comes up heads, v is the newest page, and otherwise v is the destination of a random link. The "source" u is set based on the beta coin: if it comes up heads, u is the newest page, and otherwise u is the source of a random link. Since the original alpha model chooses edge destinations without reference to edge sources, the in-degree distribution of the (alpha, beta) model will also be given by p_{i,t} = 1/i^{1/(1-alpha)}. And since the same analysis applies if we flip the definitions of "source" and "destination," the out-degree distribution q is given by q_{i,t} = 1/i^{1/(1-beta)} for large enough t. For example, setting alpha = 0.52 and beta = 0.58, the model generates a graph with the following properties:

1. Pr[a node has in-degree i] is proportional to i^-2.1.
2. Pr[a node has out-degree i] is proportional to i^-2.38.

Both of these values match the web.
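A small simulation of the (alpha, beta) process (again our illustration, not the authors' code) makes this easy to check empirically: with alpha = 0.52 and beta = 0.58, the log-log slope of the in-degree histogram of the generated graph approaches roughly -2.1, and that of the out-degree histogram roughly -2.4, for large t.

import random
from collections import Counter

def simulate_alpha_beta(t, alpha, beta, rng=random):
    # One new node and one new edge per step; node 0 and a seed edge (0, 0)
    # exist at the start so that there is always a random link to copy from.
    edges = [(0, 0)]
    for new in range(1, t):
        v = new if rng.random() < alpha else rng.choice(edges)[1]   # destination
        u = new if rng.random() < beta else rng.choice(edges)[0]    # source
        edges.append((u, v))
    return edges

edges = simulate_alpha_beta(200_000, alpha=0.52, beta=0.58, rng=random.Random(0))
in_hist = Counter(Counter(v for _, v in edges).values())    # in-degree -> count
out_hist = Counter(Counter(u for u, _ in edges).values())   # out-degree -> count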
3.5 Resource list copying for bipartite cores

The (alpha, beta) model of Section 3.4 induces Zipfian in- and out-degrees, but edges are added one at a time. In the general framework of Section 3.3, a model with more topic focus would copy several links from the same resource list. In this section, we explore the impact of copying multiple links on Ci,j formation. Recall that a Ci,j is a bipartite core whose set L contains i nodes, and whose set R contains j nodes. We will refer to the elements of L as fans, and the elements of R as centers. In [20] it is shown that the web contains over 133,000 C3,3 cores. We consider first a traditional random graph model, showing that such a model will not predict this large number of C3,3's. Next, we show that our model will in fact predict such a large number.

Example 2. Let n = 100,000,000 = 10^8, and consider a traditional random graph model where every link is independently present with probability 10^-7 (so that the average out-degree is 10, somewhat higher than reality). A 6-tuple of nodes L = {a, b, c} and R = {x, y, z} forms a C3,3 if all nine edges from L to R are present, an event occurring with probability (10^-7)^9 = 10^-63. There are approximately n^6/720 = 10^47/72 ways of choosing such 6-tuples. Thus the expected number of C3,3's is approximately 10^-63 x 10^47/72 < 10^-17, far less than 1.

Turning this calculation around, we can compute how large an out-degree an average page must have in the traditional random graph model in order for 133,000 C3,3's to arise in the web: roughly 1200, a number that is again inconsistent with the web. A similar calculation with our model yields an expected number of roughly 100,000 C3,3's.

Consider the model of Section 3.3, with no deletion processes; thus we allow the copying of a random subset of links from a resource list. We omit the details, but the key idea is the following: the probability that a and b both copy the same 3 links from a single resource list c is Omega(1/n^2), even though the out-degree of a node is a small constant independent of n. When they both copy the same 3 links from the same source, a C3,3 results. There are some technicalities that stand in the way of this being a mathematically provable statement; chief of these is the fact that in the early stages of copying, we may attempt to copy from a resource list that does not have 3 links to copy. However, in the "steady state" this should not be a significant effect; proving this rigorously remains an interesting open direction in random graph theory.
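Returning to Example 2, the expectation computed there can be checked directly with a back-of-the-envelope script (ours, for illustration only):

n, p = 10**8, 1e-7
tuples = n**6 / 720                 # approximate number of candidate 6-tuples
print(tuples * p**9)                # about 1.4e-18: far below 1, versus the
                                    # more than 133,000 C3,3's actually observed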
4 Algorithms

We identify three steps in the construction of a knowledge base of communities on the web: (i) efficiently enumerating the subgraphs (i.e., cores) of interest; (ii) collecting additional web pages related to each enumerated core to build a community; and (iii) extracting statistically significant information from each such community, leading to a searchable index of these communities. Accordingly, in this section we describe three algorithms that we view as central to each of these steps: (i) the elimination/generation paradigm for efficient subgraph enumeration on web-scale graphs, using a relatively lean computational infrastructure; (ii) an extension of Kleinberg's HITS algorithm [19] that can take as input only a set of URLs present in a core (and no query terms), to produce authoritative web pages on the community centered around this core; and (iii) a method for extracting statistically significant index terms with which to index such communities. Note that the first and the third steps are generic to building indexable knowledge bases of communities from subgraph enumeration. The second is specified in terms of bipartite cores for concreteness, although the reader may readily see its extension to other subgraphs.

4.1 Enumerating cores

An algorithm in the elimination/generation paradigm performs a number of sequential passes over the web graph. The graph is stored as a binary relation on disk, and thus is not available for random access. During each pass, the algorithm writes a modified version of the dataset to disk for the next pass. It also collects some metadata in main memory, which serves as state during the next pass. Passes over the data are interleaved with sort operations, which change the order in which the data is scanned, and constitute the bulk of the processing cost. During each pass over the data, elimination and generation operations are interleaved. The details are given below.

Elimination. There are often easy necessary (though not sufficient) conditions that have to be satisfied in order for a node to participate in a subgraph of interest to us. Consider for instance the problem of enumerating cliques of size four. We can prune any node whose in-degree or out-degree—at any stage of the process—is less than 3 (the significance of this will become apparent below). Consider next the example of C4,4's. Any node with in-degree 3 or smaller cannot participate on the right side of a C4,4; thus, edges directed into such nodes, and the nodes themselves, can be pruned from the graph. Likewise, nodes with out-degree 3 or smaller cannot participate on the left side of a C4,4. We refer to these necessary conditions as elimination filters.

Generation. Generation is a counterpoint to elimination. Nodes that barely qualify for potential membership in an interesting subgraph can easily be established as either belonging to such a subgraph or not. Consider again the example of a C4,4. Let u be a node of in-degree exactly 4. Then u can belong to a C4,4 if and only if the 4 nodes that point to it have a neighborhood intersection of size at least 4. It is possible to test this property relatively cheaply. We define a generation filter to be a procedure that identifies barely-qualifying nodes and, for all such nodes, either outputs a subgraph or proves that such a subgraph cannot exist. If the test embodied in the generation filter is successful, we have identified an interesting subgraph. Furthermore, regardless of the outcome, this node can be marked for pruning, since all potentially interesting subgraphs containing it have already been enumerated. Similar schemes can be developed for other structures.

Example 3. Consider enumerating all subgraphs in which four web pages all point to one another. Then any node with fewer than three links out of it can be eliminated. Likewise, any node with fewer than three links into it can be eliminated. Thus, the elimination filter for such an enumeration finds nodes with fewer than three in-links or out-links. The generation filter finds nodes with exactly three in-links or out-links, and checks whether the resulting set of four nodes, namely the original and the three adjacent nodes, is a clique. Notice that the generation filter is substantially cheaper in this case, since no exhaustive enumeration of subsets of size 4 is required, as would have been the case if the in- and out-degrees had each been substantially larger than 3.

Note that if edges appear in an arbitrary order, it is not clear that the elimination filter can be easily applied. If, however, the edges are sorted by source (resp. destination), the elimination filter can be applied to the out-links (resp. in-links) in a single scan. The generation filter is slightly more complicated to implement efficiently. We first explain how this can be done in three sub-passes over the data, in the setting of Example 3 above. In the first sub-pass, we construct a list of nodes with in-degree or out-degree 3, along with their in- or out-neighborhoods. We assume that this list fits in main memory during the second sub-pass; if not, we can break the second sub-pass into several phases, each processing a main-memory-sized chunk of candidates. In the second sub-pass, we verify that each node in the neighborhood points to all other nodes in the neighborhood. At the end of the second sub-pass we output all the interesting subgraphs, and in the third sub-pass we delete all the candidate nodes. Notice that the first and the third sub-passes can be overlapped with the preceding and subsequent elimination filtering passes. Thus, the effective cost of the generation filter is just one pass over the data.
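A minimal sketch of one elimination pass for the 4-clique filter of Example 3 follows (our in-memory illustration; the paper's implementation streams sorted edge files from disk, and the generation filter is omitted here):

from collections import defaultdict

def elimination_pass(edges, min_in=3, min_out=3):
    # Keep only nodes with at least min_in in-links and min_out out-links,
    # and drop every edge incident to an eliminated node.
    indeg, outdeg = defaultdict(int), defaultdict(int)
    for src, dst in edges:
        outdeg[src] += 1
        indeg[dst] += 1
    nodes = set(indeg) | set(outdeg)
    keep = {v for v in nodes if indeg[v] >= min_in and outdeg[v] >= min_out}
    return [(s, d) for s, d in edges if s in keep and d in keep]

def iterate_elimination(edges, min_in=3, min_out=3):
    # Removing a node lowers its neighbors' degrees, so repeat to a fixed point.
    while True:
        pruned = elimination_pass(edges, min_in, min_out)
        if len(pruned) == len(edges):
            return pruned
        edges = pruned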
Also, note that even the first round of the elimination/generation paradigm will eliminate a large fraction of the destination nodes in the graph (see Example 4 for details). After one elimination/generation phase, the remaining nodes have fewer neighbors than before in the residual graph, which may present new opportunities during the next pass. We can continue to iterate until we do not make significant progress. Depending on the filters, one of two things could happen: either we repeatedly remove nodes from the graph until nothing is left, or, after several passes, the benefits of elimination/generation "tail off" as fewer and fewer nodes are deleted at each phase. If, for instance, our elimination filter were "delete all nodes with fewer than 100 in-links from .com pages," we would eliminate almost all pages on the web. On the other hand, if the elimination filter were "delete all nodes with fewer than 3 out-links," we would reach a fixed point at which repeated elimination passes do not reduce the size of the residual graph substantially (though this may not happen after the first such pass).

Why should such algorithms run fast? We make a number of observations about their behavior:

1. The in/out-degree of every node drops monotonically during each elimination/generation phase.

2. During each generation test, we either remove a node u from further consideration (by developing a proof that it can belong to no instance of the subgraph of interest), or we output a subgraph that contains u. Thus, the total work in generation is linear in the size of the graph plus the number of subgraphs enumerated, assuming that each generation test runs in constant time. This is the case if we are enumerating a constant-sized subgraph; thus our algorithm is output sensitive.

3. As shown in Example 4 below, elimination phases rapidly eliminate most nodes in the web graph. A complete mathematical analysis of iterated elimination is beyond current techniques, but the example below shows that just the first elimination phase by itself is quite powerful.

Example 4. Consider the enumeration of C3,3's. By the inverse quadratic law for in-degrees, eliminating nodes on the basis of in-degree prunes away any node with in-degree <= 2. As described in Section 2, out-degrees also have a (different) Zipfian distribution, and we can prune away any node with out-degree <= 2. Finally, applying a generation phase means that we also remove nodes with in/out-degree equal to 3. From simple calculations with the corresponding probabilities (noting for instance that the sum over i >= 1 of 1/i^2 is pi^2/6), we determine that these elimination/generation steps would account for nearly 90% of the nodes in just the first iteration. Further iterations are less dramatic—both in theory and in practice.

It is thus clear that the insights from our measurements and model drive the efficiency of the elimination/generation paradigm. A more mathematically precise analysis of the running time of elimination/generation in our model appears to be beyond the reach of current random graph theory, and is a worthwhile goal.
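To make the estimate in Example 4 concrete, the in-degree part of the calculation looks as follows (an approximation under the idealized 1/i^2 law; it ignores the additional out-degree and generation pruning, which push the total toward the figure quoted above):

from math import pi

z = pi**2 / 6                                   # sum over i >= 1 of 1/i**2
p_in_at_most_3 = sum(1 / (i * i) for i in (1, 2, 3)) / z
print(round(p_in_at_most_3, 2))                 # about 0.83 of all nodes fail
                                                # the in-degree test alone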
4.2 From cores to communities

We now describe an extension of Kleinberg's HITS algorithm [19] to expand a core Ci,j into its surrounding community. For brevity we only detail the novel aspects of our extension; the reader is referred to [19] for details of the basic HITS algorithm. To make our description succinct, we use the notation p -> q to denote that a web page p has a hyperlink to a web page q.

Given a query, the basic HITS algorithm assembles a small set of pages R (the root set) using a traditional text-based search engine, and then constructs the expanded set R' by adding all pages that point to, or are pointed to by, pages in the root set; i.e.,

  R' = R U {p | p -> q, q in R} U {q | p -> q, p in R}.

Each page p is associated with a pair of scores h(p), a(p), which are initially set to 1. The algorithm iteratively updates these scores (with appropriate normalization) as

  h(p) = sum over q with p -> q of a(q),  and  a(p) = sum over q with q -> p of h(q).

The top-scoring pages based on a(.) (resp. h(.)) are termed "authorities" (resp. "hubs").

In our case, we have no text query; instead we have a core Ci,j with fan set L and center set R. Since there is no query, the root set is constructed using the following rule: it consists of (i) the core Ci,j, (ii) all pages pointed to by the nodes in L, and (iii) all pages that point to at least two nodes in R; i.e., the root set is

  L U R U {p | q -> p, q in L} U {p | p -> q1, p -> q2 for distinct q1, q2 in R}.

We apply the basic HITS algorithm to this root set (making the assumption that the community around the core overlaps significantly with the root set). The result is a set of authoritative sources of information for that community, together with a set of hubs that collect together and annotate these authoritative pages.

4.3 Extracting index terms

Having enumerated the cores and run the generalized HITS algorithm above to expand each core into a full community, we must extract from each community index terms to build the knowledge base, and a summary that can be used to identify the contents of the community. We make the following observation: the title of a page is likely to contain terms describing the page. Using this heuristic, we implement the following algorithm for identifying keywords that are useful for both indexing and summarization:

1. Extract the titles of all web pages in a community (which comprises the top 50 hubs and 50 authorities as returned by the algorithm of Section 4.2).
2. Eliminate stop words (e.g., articles, "html", "home", "page", "web", etc.).
3. Rank the resulting terms by frequency across the entire set of pages.
4. Return the top ten most frequent words.

We use the resulting set of 10 terms to index the community, and build a search utility against this index. We are investigating more sophisticated strategies for the future. For instance, anchortext—the text in the vicinity of the "href" tags at the tails of hyperlinks—often represents a description of the contents of the linked-to page. Thus, the anchortext of good hubs around links pointing to good authorities may be important for the generation of index terms and summaries. We also plan to use phrase extraction techniques instead of simple keyword extraction.
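The two steps of Sections 4.2 and 4.3 can be summarized in a compact sketch (our in-memory illustration under simplifying assumptions—no page fetching, a toy stop list, plain L2 normalization—not the Campfire implementation; names such as expand_core and page_titles are ours):

from collections import Counter

STOP = {"the", "a", "of", "and", "home", "page", "web", "html"}  # illustrative stop list

def expand_core(L, R, out_links, in_links, iters=20):
    # Query-free root set: the core itself, pages pointed to by nodes in L,
    # and pages pointing to at least two nodes of R (Section 4.2).
    root = set(L) | set(R)
    for q in L:
        root |= set(out_links.get(q, ()))
    cited = Counter(p for r in R for p in in_links.get(r, ()))
    root |= {p for p, c in cited.items() if c >= 2}

    hub = dict.fromkeys(root, 1.0)
    auth = dict.fromkeys(root, 1.0)
    for _ in range(iters):
        auth = {p: sum(hub[q] for q in in_links.get(p, ()) if q in root) for p in root}
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        hub = {p: sum(auth[q] for q in out_links.get(p, ()) if q in root) for p in root}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth                  # top scorers are the hubs and authorities

def index_terms(pages, page_titles, k=10):
    # Section 4.3: rank title words of the community's top pages by frequency,
    # after dropping stop words, and keep the k most frequent as index terms.
    words = Counter()
    for p in pages:
        words.update(w for w in page_titles.get(p, "").lower().split() if w not in STOP)
    return [w for w, _ in words.most_common(k)]

In this sketch, the top 50 pages by hub score and the top 50 by authority score would be passed to index_terms, mirroring step 1 of the procedure above.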
5 Campfire: a knowledge base of communities

In this section we describe Campfire, a knowledge base of communities. We spell out the details of our experimental setup and give some examples of the communities indexed in Campfire.

5.1 Experimental setup

The communities in the Campfire knowledge base were constructed from a 1997 crawl of the web. In addition to extracting cores from the tape, we also extracted the state of Yahoo at the time of the crawl. As a reference point, Yahoo contained slightly over 16,000 topics in this crawl. The trawling algorithms were run on a PC with a 333 MHz Intel Pentium II processor. The machine had 256M of main memory and a SCSI chain with multiple disk drives.

The cost of running the trawling algorithm has two major components: (i) the data cleaning portion, and (ii) the pruning-based mining algorithm. Since two mirrored copies of a page would generate a spurious C3,j, we removed mirrors using the shingling method of Broder et al. [8]. Our algorithm shingles 3100 pages/second. This implies that the entire set of potential fan pages (2 million pages in our case) is shingled in about 10 minutes. Eliminating duplicates (pages which have the same shingle) runs at about the same speed (about 3000 pages/second). Thus, initial data cleaning takes only about half an hour. The major cost in the pruning step is in sorting the edge data. There are 60M edges in the dataset after shingling. The time to sort the edge set is 5 minutes per pass (on average). Our algorithms make 30 passes over the data (for i = j = 3) in the worst case. The total time to trawl all C3,3 cores was about 1.5 hours.

Having enumerated the cores, we then expanded them into communities using the algorithm of Section 4.2. To construct the root set, we augmented the fans and centers by following links on today's web. However, since the cores date back to 1997, and the half-life of a web page is of the order of a few weeks, many of these fans and centers no longer exist. Therefore, we only expanded those cores at least half of whose fans were still alive on today's web. For example, in the case of C4,j cores, 41% passed this test. Interestingly, the fraction of centers that are still alive is around 54%, suggesting that centers might have a longer half-life than fans. Running HITS typically expands a core into a community of between 100 and 4000 pages, with the average around 1300. The number of hyperlinks between these pages varies from 200 to 15,000, with the average around 3500. Parsing each page takes a few milliseconds, but running the expansion algorithm takes between 2 and 10 seconds.

The final task is to index these communities. To accomplish this, we run the index term extraction algorithm on each of these communities. Given that we already have a parsed version of these pages, running this algorithm takes under a millisecond per page. Using the keywords extracted, we index each community into Campfire.

5.2 Sample communities

We now provide five sample communities automatically generated by expanding C4,j cores. We present the top five hubs and authorities of each community along with indexing keywords, and a brief (manually generated) annotation for the benefit of the reader. A more detailed list of pages can be found at www.almaden.ibm.com/cs/k53/campfire.html

To validate our conclusions further, we conducted the following experiment on the extracted communities. We took a random sample of about 100 communities and manually applied our resource compilation tools [10] to generate, for each community, a high-quality set of relevant pages. We then computed the overlap between these carefully compiled resource lists and their counterparts in Campfire. The average overlap of sites was around 60%, suggesting that the quality and purity of the communities extracted by our algorithm is fairly high.
6 Further directions

Our work raises a number of interesting directions and questions.

1. What are other instances of subgraph structures (possibly in conjunction with text patterns) that can be used to build valuable knowledge bases?

2. Our web graph model leads to very different areas for further investigation: (a) What are the dominant modes of content creation that different kinds of users exhibit, and how can these be reconciled (in an aggregate sense) with our model? (b) We have only scratched the surface of the rigorous mathematical analysis of our graph model (in contrast to the analysis of simpler, more traditional random graph models). Explicating the structure of graphs generated by our model is far more challenging than for these traditional models, but a worthwhile goal in being able to predict the evolution of the web graph.

3. We have given illustrative examples and some general principles for devising elimination/generation algorithms for several interesting subgraph structures. Are there systematic ways for developing elimination/generation algorithms for any web subgraph enumeration problem? How do we exploit the interplay between the nature of the web graph and the elimination/generation phases? What are systematic techniques for tuning this methodology for given resource constraints?

4. In Campfire, we have a knowledge base built from web communities by extending the cores we enumerate from the web. We have not elicited significant semantic content (beyond extracting title words and indexing them for text search). Are there new paradigms for annotating and organizing these knowledge atoms in ways that are more valuable to users?

References

[1] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proceedings of VLDB, Santiago, Chile, September 1994.
[2] G.O. Arocena, A.O. Mendelzon, and G.A. Mihaila. Applications of a Web query language. In Proceedings of the 6th International World Wide Web Conference, 1997.
[3] A.E. Bayer, J.C. Smart, and G.W. McLaughlin. Mapping intellectual structure of scientific subfields through author co-citations. Journal of the American Society for Information Science, 41 (1990), pp. 444-452.
[4] K. Bharat, A. Broder, M.R. Henzinger, P. Kumar, and S. Venkatasubramanian. The connectivity server: fast access to linkage information on the web. In Proceedings of the 7th International World Wide Web Conference (WWW7), Brisbane, Australia, April 1998.
[5] K. Bharat and M.R. Henzinger. Improved algorithms for topic distillation in a hyperlinked environment. In Proceedings of ACM SIGIR, 1998.
[6] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In Proceedings of the 7th International World Wide Web Conference (WWW7), 1998.
[7] B. Bollobás. Random Graphs. Academic Press, 1985.
[8] A. Broder, S. Glassman, M. Manasse, and G. Zweig. Syntactic clustering of the Web. In Proceedings of the 6th International World Wide Web Conference, April 1997, pp. 391-404.
[9] S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, P. Raghavan, and S. Rajagopalan. Automatic resource compilation by analyzing hyperlink structure and associated text. In Proceedings of the 7th International World Wide Web Conference (WWW7), 1998.
[10] S. Chakrabarti, B. Dom, S. Ravi Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Experiments in topic distillation. In SIGIR Workshop on Hypertext Information Retrieval, 1998.
[11] H.T. Davis. The Analysis of Economic Time Series. Principia Press, 1941.
[12] J. Dean and M.R. Henzinger. Finding related pages in the World Wide Web. In Proceedings of the 8th International World Wide Web Conference, 1999.
[13] D. Florescu, A. Levy, and A. Mendelzon. Database techniques for the World-Wide Web: a survey. SIGMOD Record, 27(3):59-74, 1998.
[14] E. Garfield. Citation analysis as a tool in journal evaluation. Science, 178 (1972), pp. 471-479.
[15] E. Garfield. The impact factor. Current Contents, June 20, 1994.
[16] N. Gilbert. A simulation of the structure of academic science. Sociological Research Online, 2(2), 1997.
[17] M.R. Henzinger, P. Raghavan, and S. Rajagopalan. Computing on data streams. AMS-DIMACS Series, special issue on computing on very large datasets, 1998.
[18] M.M. Kessler. Bibliographic coupling between scientific papers. American Documentation, 14 (1963), pp. 10-25.
[19] J. Kleinberg. Authoritative sources in a hyperlinked environment. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, 1998. Also appears as IBM Research Report RJ 10076(91892), May 1997.
[20] S. Ravi Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Trawling emerging cyber-communities automatically. In Proceedings of the 8th International World Wide Web Conference, 1999.
[21] J.M. Kleinberg, S. Ravi Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. The web as a graph: measurements, models and methods. Invited paper in Proceedings of the Fifth Annual International Computing and Combinatorics Conference, 1999.
[22] R. Larson. Bibliometrics of the World Wide Web: an exploratory analysis of the intellectual structure of cyberspace. In Annual Meeting of the American Society for Information Science, 1996.
[23] A.J. Lotka. The frequency distribution of scientific productivity. Journal of the Washington Academy of Sciences, 16, p. 317, 1926.
[24] M. Marchiori. The quest for correct information on the Web: hyper search engines. In Proceedings of the 6th International World Wide Web Conference (WWW6), 1997.
[25] A. Mendelzon and P. Wood. Finding regular simple paths in graph databases. SIAM Journal on Computing, 24(6):1235-1258, 1995.
[26] P. Pirolli, J. Pitkow, and R. Rao. Silk from a sow's ear: extracting usable structures from the Web. In Proceedings of the ACM SIGCHI Conference on Human Factors in Computing, 1996.
[27] E. Rivlin, R. Botafogo, and B. Shneiderman. Navigating in hyperspace: designing a structure-based toolbox. Communications of the ACM, 37(2):87-96, 1994.
[28] E. Spertus. ParaSite: mining structural information on the Web. In Proceedings of the 6th International World Wide Web Conference, 1997.
[29] D. Tsur, J. Ullman, S. Abiteboul, C. Clifton, R. Motwani, S. Nestorov, and A. Rosenthal. Query flocks: a generalization of association rule mining. In Proceedings of ACM SIGMOD, 1998.
[30] G.K. Zipf. Human Behaviour and the Principle of Least Effort. Hafner, New York, 1949.

Sample communities (Section 5.2)

Homeschooling.
Authorities: www.midnightbeach.com/hs; www.home-school.com; www.comenius.org/chnpage.htm; users.aol.com/WERHSFAM/humor.html; www.sound.net/~ejcol/confer.html
Hubs: user.pa.net/~rtregl/schools.html; larch-www.lcs.mit.edu:8001/~raymie/linklist.html; www.phonet.com/bsimon/educ-nat.html; home.sol.no/~hunwww/Elinker.htm; www.thefamily.net/educational.html
Indexing keywords: homeschool homeschooling school education resource

Hot and spicy foods.
Authorities: www.netimages.com/~chile; www.webring.org/cgi-bin/webring?list&ring=chile; www.firegirl.com; www.azstarnet.com/~coriel; www.chilegod.com
Hubs: www.chetbacon.com/hotlinks.html; bigsun.wbs.net/homepages/g/g/h/gghosey/links.html; www.bizkid.com/food.htm; www.catechnologies.com/cuisinesites.html; allison.clark.net/pub/yoda
Indexing keywords: hot food sauce chile cooking recipes spicy peppers

Taiwanese machine shops.
Authorities: manufacture.com.tw; www.rack.com.tw; www.mold.net.tw; www.pack.com.tw; www.or.com.tw
Hubs: www.aunet.com.tw/steel.htm; www.commerce.com.tw/c/metal; www.acer.net/search/ee.htm; www.hinet.net/source/business/steel/3.htm; www.hinet.net/source/business/steel/1.htm
Indexing keywords: taiwan corporation products international machinery
Sustainable energy resources.
Authorities: www.eren.doe.gov; www.doe.gov; www.epri.com; www.epa.gov; www.ashrae.org
Hubs: www.physic.ut.ee/~janro/ab.html; www.csmi.com/oaatweb/links.htm; www.azstarnet.com/~dcat/Rec list.htm; www.beasley.com.au/news.htm; wwwvms.utexas.edu/~whcii/energy/bldglnks.htm
Indexing keywords: energy wind renewable sustainable

Cancer.
Authorities: cancer.med.upenn.edu; wwwicic.nci.nih.gov; www.cancer.org; www.nci.nih.gov; www.nlm.nih.gov
Hubs: www.allny.com/health/oncology.html; micf.mic.ki.se/Diseases/c4.html; www.medmark.org/onco/onco.html; www.meds.com/cancerlinks.html; www.mic.ki.se/Diseases/c4.html
Indexing keywords: cancer breast oncology