Searching and Integrating Information on the Web

Professor Chen Li
Department of Computer Science
University of California, Irvine

About these seminars
• Speaker: Chen Li (chenli AT ics.uci.edu)
• http://www.ics.uci.edu/~chenli/tsinghua04/
• Topics: recent research on (Web) data management
  – Search
  – Data extraction
  – Integration
  – Information quality
• Format: 4 seminars, 3 hours each
• Questions/comments are always welcome!

Seminar 1

Today’s topics
• Topic 1: Web Search
  – Earlier search engines
  – PageRank in Google
  – HITS algorithm
• Topic 2: Web-data extraction

Topic 1: Web Search
• How did earlier search engines work?
• How does PageRank, used by Google, work?
• HITS algorithm
• Readings:
  – Lawrence and Giles, Searching the World Wide Web, Science, 1998.
  – Brin and Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine, WWW7/Computer Networks 30(1-7): 107-117, 1998.
  – Jon M. Kleinberg, Authoritative Sources in a Hyperlinked Environment, Journal of the ACM 46(5): 604-632, 1999.

Earlier Search Engines
• Hotbot, Yahoo, Alta Vista, Northern Light, Excite, Infoseek, Lycos …
• Main technique: “inverted index”
  – Conceptually: use a matrix to represent how many times a term appears in each page
  – # of columns = # of pages (huge!)
  – # of rows = # of terms (also huge!)

              Page1  Page2  Page3  Page4  …
  ‘car’         1      0      1      0
  ‘toyota’      0      2      0      1
  ‘honda’       2      1      0      0
  …

  E.g., page 2 mentions ‘toyota’ twice.

Search by Keywords
• If the query has one keyword, just return all the pages that have the word
  – E.g., “toyota” → all pages containing “toyota”: page2, page4, …
  – There could be a great many pages!
  – Solution: return the pages with the highest frequency of the word first

Multi-keyword Search
• For each keyword W, find the set of pages mentioning W
• Intersect all the sets of pages
  – Assuming an “AND” operation over the keywords
• Example:
  – A search for “toyota honda” will return all the pages that mention both “toyota” and “honda”

Observations
• The “matrix” can be huge:
  – Now the Web has 4.2 billion pages!
  – There are many “terms” on the Web. Many of them are typos.
  – It’s not easy to do the computation efficiently:
    – Given a word, find all the pages…
    – Intersect many sets of pages…
• For these reasons, search engines never store this “matrix” so naively.

Problems
• Spamming:
  – People want their pages to be ranked at the very top of a word search (e.g., “toyota”), so they repeat the word many, many times
  – Such pages may be unimportant compared to www.toyota.com, even if the latter mentions “toyota” only once (or not at all).
• Search engines can be easily “fooled”

Closer look at the problems
• Lacking the concept of “importance” of each page on each topic
  – E.g.: My homepage is not as “important” as Yahoo’s main page.
• A link from Yahoo is more important than a link from a personal homepage
• But how do we capture the importance of a page?
  – A guess: # of hits? But where would we get that info?
  – # of inlinks to a page → Google’s main idea.
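
The keyword search used by the earlier engines (an inverted index, an “AND” intersection for multi-keyword queries, and ranking by word frequency) can be sketched roughly as follows. This is a minimal illustration, not how any real engine stored its index: the function names `build_index` and `search`, the whitespace tokenization, and the four toy pages (chosen to reproduce the counts in the matrix on the “Earlier Search Engines” slide) are all assumptions made for this example.

```python
from collections import Counter, defaultdict

# Toy pages (hypothetical) whose term counts match the slide's matrix.
PAGES = {
    "page1": "car honda honda",
    "page2": "toyota toyota honda",
    "page3": "car",
    "page4": "toyota",
}


def build_index(pages):
    """Map each term to {page_id: number of occurrences in that page}."""
    index = defaultdict(dict)
    for page_id, text in pages.items():
        for term, count in Counter(text.lower().split()).items():
            index[term][page_id] = count
    return index


def search(index, query):
    """AND-query: pages containing every keyword, most frequent first."""
    keywords = query.lower().split()
    if not keywords:
        return []
    # Intersect the page sets of all keywords.
    hits = set(index.get(keywords[0], {}))
    for word in keywords[1:]:
        hits &= set(index.get(word, {}))

    # Rank by total keyword frequency; ties are broken by page id.
    def score(page):
        return sum(index[word][page] for word in keywords)

    return sorted(hits, key=lambda page: (-score(page), page))


INDEX = build_index(PAGES)
print(search(INDEX, "toyota"))        # single-keyword query
print(search(INDEX, "toyota honda"))  # multi-keyword AND query
```

Storing only the non-zero entries per term avoids materializing the huge term-by-page matrix. Note that the frequency-based ranking is exactly what the spamming problem exploits: repeating a word raises a page's score.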
Google’s History
• Started at the Stanford DB Group as a research project (Brin and Page)
• Used to be at: google.stanford.edu
• Very soon many people started liking it
• Incorporated in 1998: www.google.com
• The “largest” search engine now
• Started other businesses: froogle, gmail, …

PageRank
• Intuition:
  – The importance of each page should be decided by what other pages “say” about this page
  – One naïve implementation: count the # of pages pointing to each page (i.e., # of inlinks)
• Problem:
  – We can easily fool this technique by generating many dummy pages that point to our class page

Details of PageRank
• At the beginning, each page has weight 1
• In each iteration, each page propagates its current weight W to all its N forward neighbors. Each of them gets weight W/N
• Meanwhile, a page accumulates the weights from its backward neighbors
• Iterate until all weights converge. Usually 6-7 iterations are good enough.
• The final weight of each page is its importance.
• NOTICE: Google currently uses many other techniques/heuristics to do search. Here we just cover some of the initial ideas.

Example: MiniWeb
• Our “MiniWeb” has only three web sites: Netscape (Ne), Amazon (Am), and Microsoft (MS).
• Their weights are represented as a vector (n, m, a); the matrix columns correspond to Ne, MS, Am:

\[
\begin{pmatrix} n \\ m \\ a \end{pmatrix}_{\text{new}}
=
\begin{pmatrix} 1/2 & 0 & 1/2 \\ 0 & 0 & 1/2 \\ 1/2 & 1 & 0 \end{pmatrix}
\begin{pmatrix} n \\ m \\ a \end{pmatrix}_{\text{old}}
\]

For instance, in each iteration, half of the weight of Am goes to Ne, and half goes to MS.
Materials by courtesy of Jeff Ullman

Iterative computation

\[
\begin{pmatrix} n \\ m \\ a \end{pmatrix} =
\begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix},
\begin{pmatrix} 1 \\ 1/2 \\ 3/2 \end{pmatrix},
\begin{pmatrix} 5/4 \\ 3/4 \\ 1 \end{pmatrix},
\begin{pmatrix} 9/8 \\ 1/2 \\ 11/8 \end{pmatrix},
\begin{pmatrix} 5/4 \\ 11/16 \\ 17/16 \end{pmatrix},
\ldots,
\begin{pmatrix} 6/5 \\ 3/5 \\ 6/5 \end{pmatrix}
\]

Final result:
• Netscape and Amazon have the same importance, and twice the importance of Microsoft.
• Does it capture the intuition? Yes.

Observations
• We cannot get absolute weights:
  – We can only know (and we are only interested in) the relative weights of the pages
• The matrix is stochastic (sum of each column is 1).
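  So the iterations converge, and they compute the principal eigenvector of the following matrix equation:

\[
\begin{pmatrix} n \\ m \\ a \end{pmatrix}
=
\begin{pmatrix} 1/2 & 0 & 1/2 \\ 0 & 0 & 1/2 \\ 1/2 & 1 & 0 \end{pmatrix}
\begin{pmatrix} n \\ m \\ a \end{pmatrix}
\]

Problem 1 of algorithm: dead ends!

\[
\begin{pmatrix} n \\ m \\ a \end{pmatrix}_{\text{new}}
=
\begin{pmatrix} 1/2 & 0 & 1/2 \\ 0 & 0 & 1/2 \\ 1/2 & 0 & 0 \end{pmatrix}
\begin{pmatrix} n \\ m \\ a \end{pmatrix}_{\text{old}}
\]

(columns: Ne, MS, Am)
• MS does not point to anybody
• Result: weights of the Web “leak out”

\[
\begin{pmatrix} n \\ m \\ a \end{pmatrix} =
\begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix},
\begin{pmatrix} 1 \\ 1/2 \\ 1/2 \end{pmatrix},
\begin{pmatrix} 3/4 \\ 1/4 \\ 1/2 \end{pmatrix},
\begin{pmatrix} 5/8 \\ 1/4 \\ 3/8 \end{pmatrix},
\begin{pmatrix} 1/2 \\ 3/16 \\ 5/16 \end{pmatrix},
\ldots,
\begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}
\]

Problem 2 of algorithm: spider traps

\[
\begin{pmatrix} n \\ m \\ a \end{pmatrix}_{\text{new}}
=
\begin{pmatrix} 1/2 & 0 & 1/2 \\ 0 & 1 & 1/2 \\ 1/2 & 0 & 0 \end{pmatrix}
\begin{pmatrix} n \\ m \\ a \end{pmatrix}_{\text{old}}
\]

(columns: Ne, MS, Am)
• MS only points to itself
• Result: all weights go to MS!

\[
\begin{pmatrix} n \\ m \\ a \end{pmatrix} =
\begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix},
\begin{pmatrix} 1 \\ 3/2 \\ 1/2 \end{pmatrix},
\begin{pmatrix} 3/4 \\ 7/4 \\ 1/2 \end{pmatrix},
\begin{pmatrix} 5/8 \\ 2 \\ 3/8 \end{pmatrix},
\begin{pmatrix} 1/2 \\ 35/16 \\ 5/16 \end{pmatrix},
\ldots,
\begin{pmatrix} 0 \\ 3 \\ 0 \end{pmatrix}
\]

Google’s solution: “tax each page”
• Like people paying taxes, each page pays some weight into a public pool, which will be distributed to all pages.
• Example: assume a 20% tax rate in the “spider trap” example.

\[
\begin{pmatrix} n \\ m \\ a \end{pmatrix}
= 0.8
\begin{pmatrix} 1/2 & 0 & 1/2 \\ 0 & 1 & 1/2 \\ 1/2 & 0 & 0 \end{pmatrix}
\begin{pmatrix} n \\ m \\ a \end{pmatrix}
+
\begin{pmatrix} 0.2 \\ 0.2 \\ 0.2 \end{pmatrix}
\]

\[
n = 7/11, \quad m = 21/11, \quad a = 5/11
\]

The War of Search Engines
• More companies are realizing the importance of search engines
• More competitors in the market: Microsoft, Yahoo!, etc.

• Next: the HITS algorithm

Hubs and Authorities
• Motivation: find web pages related to a topic
  – E.g.: “find all web sites about automobiles”
• “Authority”: a page that offers info about a topic
  – E.g.: DBLP is a page about papers
  – E.g.: google.com, aj.com, teoma.com, lycos.com
• “Hub”: a page that doesn’t provide much info, but tells us where to find pages about a topic
  – E.g.: www.searchenginewatch.com is a hub of search engines
  – http://www.ics.uci.edu/~ics214a/ points to many bibliosearch engines

Two values of a page
• Each page has a hub value and an authority value.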
  – In PageRank, each page has one value: “weight”
• Two vectors:
  – H: hub values
  – A: authority values

\[
H = \begin{pmatrix} h_1 \\ h_2 \\ \vdots \end{pmatrix}, \qquad
A = \begin{pmatrix} a_1 \\ a_2 \\ \vdots \end{pmatrix}
\]

HITS algorithm: find hubs and authorities
• First step: find pages related to the topic (e.g., “automobile”), and construct the corresponding “focused subgraph”
  – Find the set S of pages containing the keyword (“automobile”)
  – Find all pages that the pages in S point to, i.e., their forward neighbors
  – Find all pages that point to pages in S, i.e., their backward neighbors
  – Compute the subgraph of these pages
[Figure: the root set inside the focused subgraph]

Step 2: computing H and A
• Initially: set the hub and authority values to 1
• In each iteration, the hub value of a page is the total authority value of its forward neighbors (after normalization)
• The authority value of each page is the total hub value of its backward neighbors (after normalization)
• Iterate until convergence
[Figure: hubs pointing to authorities]

Example: MiniWeb (Ne, MS, Am)

\[
M = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}, \qquad
A = \begin{pmatrix} a_n \\ a_m \\ a_a \end{pmatrix}, \qquad
H = \begin{pmatrix} h_n \\ h_m \\ h_a \end{pmatrix}
\]

\[
H_{\text{new}} = \lambda \, M \, A_{\text{old}}, \qquad
A_{\text{new}} = \mu \, M^T \, H_{\text{old}}
\]

Normalization! (λ and μ denote the normalization constants.) Therefore:

\[
H_{\text{new}} = \lambda\mu \, M M^T \, H_{\text{old}}, \qquad
A_{\text{new}} = \lambda\mu \, M^T M \, A_{\text{old}}
\]

Example: MiniWeb

\[
M = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}, \quad
M^T = \begin{pmatrix} 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}, \quad
M M^T = \begin{pmatrix} 3 & 1 & 2 \\ 1 & 1 & 0 \\ 2 & 0 & 2 \end{pmatrix}, \quad
M^T M = \begin{pmatrix} 2 & 2 & 1 \\ 2 & 2 & 1 \\ 1 & 1 & 2 \end{pmatrix}
\]

\[
H = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix},
\begin{pmatrix} 6 \\ 2 \\ 4 \end{pmatrix},
\begin{pmatrix} 28 \\ 8 \\ 20 \end{pmatrix},
\begin{pmatrix} 132 \\ 36 \\ 96 \end{pmatrix},
\ldots,
\begin{pmatrix} 2 + \sqrt{3} \\ 1 \\ 1 + \sqrt{3} \end{pmatrix}
\]

\[
A = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix},
\begin{pmatrix} 5 \\ 5 \\ 4 \end{pmatrix},
\begin{pmatrix} 24 \\ 24 \\ 18 \end{pmatrix},
\begin{pmatrix} 114 \\ 114 \\ 84 \end{pmatrix},
\ldots,
\begin{pmatrix} 1 + \sqrt{3} \\ 1 + \sqrt{3} \\ 2 \end{pmatrix}
\]

Topic 2: Web-Data Extraction
• Sergey Brin, Extracting Patterns and Relations from the World Wide Web, WebDB Workshop, 1998.
• Anand Rajaraman and Jeffrey D. Ullman, Querying Websites Using Compact Skeletons, PODS 2001.
Goal: Extract rich data from the Web

Motivation
• Many applications need to get rich data from the Web
  – Data integration
  – Domain-specific applications
• Challenges:
  – Data is embedded in HTML pages
  – It is laid out for display purposes, and is not easily extractable

Brin’s approach
• Example:
  – Book(title, author)
  – Find as many book titles and authors as possible
[Figure: cycle of sample data → find patterns → find data]

Details
1. Start with sample tuples, e.g., five book titles and authors.
2. Find where the tuples appear on the Web. Accept a “pattern” if:
   a) it identifies several examples of known tuples, and
   b) it is sufficiently specific that it is unlikely to identify too much.
3. Given a set of accepted patterns, find data that appears in these patterns, and add it to the set of known data.
4. Repeat steps (2) and (3) several times.

Pattern
A pattern consists of five elements:
1. The order, i.e., whether the title appears prior to the author in the text, or vice versa.
   a) In a more general case, where tuples have more than 2 components, the order would be the permutation of components.
2. The URL prefix.
3. The prefix of text, just prior to the first of the title or author.
4. The middle: text appearing between the two data elements.
5. The suffix of text following the second of the two data elements.
Both the prefix and suffix were limited to 10 characters.

<ul>
<li><i>Database Systems: The Complete Book</i>, by Hector Garcia-Molina, Jeffrey Ullman, and Jennifer Widom.<br>
<li><i>Data Mining: Concepts and Techniques</i>, by Jiawei Han and Micheline Kamber.<br>
<li><i>Principles of Data Mining</i>, by David J. Hand, Heikki Mannila, and Padhraic Smyth, Cambridge, MA: MIT Press, 2001.<br>
</ul>

Example
1. Order: title then author.
2. URL prefix: www.ics.uci.edu/~ics215/
3. Prefix, middle, and suffix of the following form: <li><i>title</i> by author<br>
   a) The prefix is “<li><i>”, the middle is “</i> by ” (including the blank after “by”), and the suffix is “<br>”.
   b) The title is whatever appears between the prefix and middle; the author is whatever appears between the middle and suffix.
   c) The intuition behind why this pattern might be good is that there are probably lots of reading lists among the class pages at UCI ICS.

Constraints on patterns
1. Pattern specificity:
   (a) Is the product of the lengths of the prefix, middle, suffix, and URL prefix.
   (b) It indicates how likely we are to find the pattern: the higher the specificity, the fewer occurrences we expect.
2. To make sure a pattern is likely to be accurate, it must meet two conditions:
   (a) There must be >= 2 known data items appearing in it.
   (b) The product of the pattern’s specificity and the number of occurrences of data items in the pattern must exceed a certain threshold T.

Data Occurrences
• A data occurrence in a pattern consists of:
  – The particular title and author.
  – The complete URL, not just the prefix as for a pattern.
  – The order, prefix, middle, and suffix of the pattern in which the title and author occurred.
• The same title and author might appear in several different patterns

Finding Data Occurrences Given Data
• Given known title-author pairs, to find new patterns, search the Web to see where these titles and authors occur.
  – Assume there is a Web index
  – Given a word, we can find (pointers to) all pages containing that word.
• The method used is essentially a-priori:
  – Find (pointers to) pages containing any known author. Since author names generally consist of 2 words, use the index for each first name and last name, and check that the occurrences are consecutive in the document.
  – Find (pointers to) pages containing any known title. Start by finding pages with each word of a title, and then check that the words appear in order on the page.
  – Intersect the sets of pages that have an author and a title on them. Only these pages need to be searched to find the patterns in which a known title-author pair is found.
• For the prefix and suffix, take the 10 surrounding characters, or fewer if there are not as many as 10.

Building Patterns from Data Occurrences
1. Group data occurrences according to their order and middle.
   – E.g., one group in the “group-by” might correspond to the order “title-then-author” and the middle “</I> by”.
2. For each group, find the longest common prefix, suffix, and URL prefix.
3. If the specificity test for the pattern is met, accept it.
4. Otherwise,
   – try to split the group into two by extending the length of the URL prefix by one character, and repeat from step (2).
   – If it is impossible to split the group (because there is only one URL), then we fail to produce a pattern from the group.

Example
• Suppose our group contains three URLs:
  – www.ics.uci.edu/~ics184/
  – www.ics.uci.edu/~ics214/
  – www.ics.uci.edu/~ics215/
• The common prefix is www.ics.uci.edu/~ics
• If we have to split the group, then the next character, “1” versus “2”, breaks the group into two:
  – the data occurrences on the first page (could be many) go into one group,
  – the occurrences on the other two pages go into another.

Finding Occurrences Given Patterns
• Find all URLs that match the URL prefix in at least one pattern.
• For each of those pages, scan the text using a regular expression built from the pattern's prefix, middle, and suffix.
• From each match, extract the title and author, according to the order specified in the pattern.

Results
• 24M pages, 147GB
• Start with 5 (book, author) pairs
• First round: 199 occurrences, 3 patterns, 4047 unique (book, author) pairs
• After four rounds, found about 15,000 tuples.
• About 95% were true title-author pairs. Data quality is good.

RU’s approach (Rajaraman and Ullman)
• Model data values as a graph
• Compute “skeletons” from the graph
• Use skeletons to extract data to populate tables
• Used in Junglee, a legendary database startup

Example
[Figure]

Data graph G
• A DAG
• Each node is an information element:
  – E.g., A(ddress), T(itle), S(alary)
  – Can be extracted using predefined regular expressions
  – Can have a value, or NULL.
• An edge represents a relationship between two elements
  – Could be between pages (web link)
  – Or could be within one page

Relation schema
• The table we want to populate
• Has a given set of attributes X
• Each attribute A has a domain Dom(A)
• Problem formulation:
  – Given a data graph G, and a relation R over attribute set X
  – Use the data graph to populate the table

Skeleton
• A tree
• Each node is an attribute in X (could be NULL)
• Intuition: a pattern/layout of data in the graph
• Overlay of a skeleton on the data graph:
  – A skeleton node matches a graph node, i.e., they have the same attribute
  – May use an overlay to extract tuples
• Perfect skeleton: a skeleton K is perfect for data graph G if for each edge e in G, there is an overlay using K that includes e.

Perfect skeletons
[Figure: a data graph with two skeletons, K1 (good) and K2 (bad)]
• K1 tends to give us “right” tuples
• K2 can give us “wrong” tuples
• Intuitively, in K1, information elements are closer

Compact skeletons
[Figure: a data graph, a skeleton that is not compact, and a compact skeleton]
• K is a compact skeleton for G if:
  – For each node u in G, there is a node v in K, such that for any overlay from K to G in which u participates, u is mapped to v.
• Perfect compact skeleton (PCS): perfect & compact
  – That’s what we want!

Computing PCS’s
• Not every graph has a PCS
• A PCS, if it exists, is unique
• An algorithm for computing PCS
• Complexity: O(km|V_G|)
  – k: # of attributes in the relation
  – m: # of nodes in the largest subgraph that has a PCS

Other results
• Partially PCS (PPCS)
  – Deal with incomplete information
  – An algorithm for computing PPCS
• Use PCS or PPCS to populate the relation, and answer queries
• Deal with noisy data graphs
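
Returning to Brin’s approach from earlier in this topic: the “Finding Occurrences Given Patterns” step (filter pages by URL prefix, then scan with a regular expression built from the prefix, middle, and suffix) can be sketched roughly as below. This is only an illustration under stated assumptions, not Brin’s implementation. The names `Pattern` and `find_occurrences`, the page URL ending in `readings.html`, and the `www.example.org` page are hypothetical. The middle is taken as “</i>, by ” and the suffix as “.<br>” so that the pattern matches the sample reading list shown earlier, which has a comma before “by” and a period before the line break.

```python
import re
from typing import NamedTuple


class Pattern(NamedTuple):
    """The five pattern elements described on the 'Pattern' slide."""
    title_first: bool  # order: True if the title precedes the author
    url_prefix: str
    prefix: str
    middle: str
    suffix: str


def find_occurrences(pattern, pages):
    """Return (title, author) pairs that `pattern` matches in `pages`.

    `pages` maps a URL to the page text; only URLs starting with the
    pattern's URL prefix are scanned.
    """
    # Two non-greedy groups capture the text between prefix and middle,
    # and between middle and suffix.
    regex = re.compile(
        re.escape(pattern.prefix) + "(.+?)"
        + re.escape(pattern.middle) + "(.+?)"
        + re.escape(pattern.suffix)
    )
    pairs = []
    for url, text in pages.items():
        if not url.startswith(pattern.url_prefix):
            continue
        for first, second in regex.findall(text):
            # Use the order element to decide which group is the title.
            pairs.append((first, second) if pattern.title_first else (second, first))
    return pairs


# Assumed pattern, adapted to the punctuation of the sample list.
PATTERN = Pattern(True, "www.ics.uci.edu/~ics215/", "<li><i>", "</i>, by ", ".<br>")

# Hypothetical pages: two entries of the sample list, plus an unrelated site.
PAGES = {
    "www.ics.uci.edu/~ics215/readings.html": (
        "<ul>\n"
        "<li><i>Database Systems: The Complete Book</i>, by Hector Garcia-Molina, "
        "Jeffrey Ullman, and Jennifer Widom.<br>\n"
        "<li><i>Data Mining: Concepts and Techniques</i>, by Jiawei Han and "
        "Micheline Kamber.<br>\n"
        "</ul>"
    ),
    "www.example.org/books.html": "<li><i>Some Title</i>, by Some Author.<br>",
}

PAIRS = find_occurrences(PATTERN, PAGES)
for title, author in PAIRS:
    print(title, "|", author)
```

The page on the other site has the same markup but is skipped because its URL does not match the URL prefix, which is how the URL prefix keeps a pattern specific. Applied to the third entry of the sample list, the same pattern would also capture the publisher text as part of the author, which is one source of the noisy tuples that the specificity constraints are meant to limit.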