CS 440 Database Management Systems
Web Data Management

How is the Web different from a database of documents?
• Hypertext vs. text: a lot of additional clues
  – graph vs. set
  – anchor text vs. text: what do others say about you?
• Geographically distributed vs. centralized
  – so you need to build a crawler
• Precision is valued more than recall
  – quality is more important than quantity, especially for “broad” queries
• Spamming
• Hoaxes and more …
• Web scale is super-huge
  – scalability is the key

Web data and query
• Data model
  – directed graph
  – nodes: Web pages
  – links: hyperlinks
  – all nodes belong to the same type
• Query is a set of terms
• Answer
  – ranked list of relevant and important pages
  – quantifying a subjective quality
• This is the basic data/query model
  – more complex models exist, e.g., assigning types to pages

Web search before Google
• Web as a set of documents
• Relevance: content-based retrieval
  – documents match queries by their contents
  – e.g., for the query q: ’clinton’, pages containing more occurrences of ‘clinton’ rank higher
• Importance???
  – contents: what documents say about themselves
  – a lot of spam and unreliable information in the results
• Directory services were used
  – Yahoo! was one of the leaders
  – Google co-founders were told “nobody will use a keyword interface”

Google: PageRank
• From the Stanford Digital Libraries project 1996-98
• Published the paper in 1998: S. Brin, L. Page: The Anatomy of a Large-Scale Hypertextual Web Search Engine. WWW7 / Computer Networks 30(1-7): 107-117 (1998)
• Tried to sell to Infoseek in 1997
• Founded in 1998 by Brin and Page

Web: Adjacency Matrix
• Web: G = {V, E}
  – V = {x, y, z}, |V| = n
  – E = {(x, x), (x, y), (x, z), (y, z), (z, x), (z, y)}
  – A: n x n matrix: Aij = 1 if page i links to page j, 0 if not
  – rows are source nodes, columns are target nodes, both in the order x, y, z

$$
A = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}
$$

Transposed Adjacency Matrix
• Adjacency matrix A:
  – what does row j represent?
• Transpose A^T:
  – what does row j represent?
$$
A = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}, \qquad
A^T = \begin{pmatrix} 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}
$$
(rows and columns in the order x, y, z)

PageRank: importance of pages
• PageRank (or importance) is defined recursively
  – a page P is important if important pages link to it
  – importance of P:
    • contributed proportionally by the pages that link to it (its back-links)
• Example (graph with nodes x, y, z):
  – rx = 1/2 rx + 1/2 rz
  – ry = 1/2 rz
  – rz = 1/2 rx + 1 ry
• Random-surfer interpretation:
  – surfer randomly follows links to navigate
  – PageRank = the probability that the surfer will visit the page

Computing PageRank
• Importance-propagation equation:

$$
r = \begin{pmatrix} 1/2 & 0 & 1/2 \\ 0 & 0 & 1/2 \\ 1/2 & 1 & 0 \end{pmatrix} r
$$

• linked-from (A^T) or links-to matrix (A)?
• column-normalized:
  – column x is all that x points to
  – sum of column = 1
• Computation: by relaxation

  r     1     2     3     …     fixpoint
  x     1     1     5/4   …     6/5
  y     1     1/2   3/4   …     3/5
  z     1     3/2   1     …     6/5

Problems: Dead Ends
• Dead ends:
  – a page without successors has nowhere to send its importance
  – eventually, what would happen to r?
• Example (a links to b; b has no successors):
  – ra = 0 ra + 0 rb
  – rb = 1 ra + 0 rb

Problems: Spider Trap
• Spider traps:
  – a group of pages without out-of-group links will trap a spider inside
  – what would happen to r?
• Example (a links to a and b; b links only to itself):
  – ra = 1/2 ra + 0 rb
  – rb = 1/2 ra + 1 rb
• Solutions??

Solutions: surfer’s random jump
• Surfer can randomly jump to a new page
  – without following links

$$
PR(A) = (1-d) + d \left( \frac{PR(T_1)}{C(T_1)} + \dots + \frac{PR(T_n)}{C(T_n)} \right)
$$

  – T1 … Tn: the pages that link to A; C(T): the number of links going out of T
  – d: damping factor (set to .85 in paper)
    • the (1-d) term models the probability of randomly jumping to this page
    • another interpretation:
      – “tax” the importance of each page and distribute it to all pages
• Teleportation

Anti-Spamming
• Spamming:
  – attempt to create artifacts to “please” search engines
  – so that ranking will be high
  – e.g., commercial “search engine optimization service”
• Google anti-spam device:
  – unlike other search engines, tends to believe what others say about you
    • by links and anchor texts
  – recursive importance also works:
    • importance (not just links) propagates
  – still, not a perfect solution

PageRank influence
• A basic building block for modern link analysis algorithms
• Web, social networks, biological networks, …
  – information network, graph DB
• Typical problems
  – finding similar nodes (items)
  – community detection / node clustering
  – keyword search
  – …

Web as a database
Active and challenging research area
• Information extraction
  – finding entities and relationships from pages
• Information integration
  – integrating data from multiple websites
• Easier-to-use query interfaces
  – natural-language queries / question answering

What you should know
• Web data and query model
• PageRank formula and algorithm
• Dead ends and spider traps
• Teleportation
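
To review the last three points, the relaxation computation and the two failure cases can be tried out in a few lines of Python. This is only a minimal sketch, not the method from the paper: the function names (propagate, relax, relax_with_teleport) and the matrix names are made up for illustration, the matrices are the column-normalized ones from the slide examples, and exact fractions are used so that the first steps can be compared with the relaxation table.

```python
from fractions import Fraction as F


def propagate(M, r):
    """One propagation step r' = M r for a column-normalized matrix M."""
    n = len(r)
    return [sum(M[i][j] * r[j] for j in range(n)) for i in range(n)]


def relax(M, r, steps):
    """Repeat the propagation step a fixed number of times."""
    for _ in range(steps):
        r = propagate(M, r)
    return r


def relax_with_teleport(M, r, d, steps):
    """Damped update r' = (1 - d) + d * (M r), applied to every page."""
    for _ in range(steps):
        r = [(1 - d) + d * v for v in propagate(M, r)]
    return r


# Slide example: x -> {x, z}, y -> {z}, z -> {x, y}.
# Column j lists where page j sends its importance.
M_web = [[F(1, 2), F(0), F(1, 2)],
         [F(0),    F(0), F(1, 2)],
         [F(1, 2), F(1), F(0)]]

# Dead end: a -> b, b has no successors (its column is all zeros).
M_dead = [[F(0), F(0)],
          [F(1), F(0)]]

# Spider trap: a -> {a, b}, b -> {b}.
M_trap = [[F(1, 2), F(0)],
          [F(1, 2), F(1)]]

ones3 = [F(1)] * 3
ones2 = [F(1)] * 2

print(relax(M_web, ones3, 1))    # first relaxation step
print(relax(M_web, ones3, 2))    # second relaxation step
print([float(v) for v in relax(M_web, ones3, 200)])  # approaches the fixpoint
print(relax(M_dead, ones2, 2))   # dead end: all importance leaks out
print(relax(M_trap, ones2, 10))  # spider trap: importance piles up on b
print([float(v) for v in relax_with_teleport(M_trap, ones2, F(17, 20), 200)])
```

With the dead end, r drops to all zeros after two steps; with the spider trap, ra keeps halving while rb approaches 2. With the random jump (d = 17/20, i.e. .85) page a keeps a nonzero share even though b never links out.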