Web Data Model

CS 540 Database Management Systems
Lecture 5: Web Data Management
(some slides from Kevin Chang’s CS511)

Announcement
• Project proposal due tonight at 11:59 pm on TEACH.
• Assignment 1 is posted
  – due on January 29th at 11:59 pm.
• Many reviews have very good questions
  – some reviews do not include any question.
    • they will lose some points.
• No class on Thursday.
• Arash’s office hour on Thursday is canceled.
  – make-up office hour on Wednesday 1/28, 3:00-4:00 pm

How is the Web different from a database of documents?
• Hypertext vs. text: a lot of additional clues
  – graph vs. set
  – anchor text vs. text: what do others say about you?
• Geographically distributed vs. centralized
  – so you need to build a crawler
• Precision is valued more than recall
  – quality is more important than quantity, especially for “broad” queries
• Spamming
• Hoaxes and more …
• Web scale is super-huge and evolving
  – scalability is the key

Web data and query
• Data model
  – directed graph
  – nodes: Web pages
  – links: hyperlinks
  – all nodes belong to the same type.
• Query is a set of terms
• Answer
  – ranked list of relevant and important pages
  – quantifying a subjective quality
• Basic data model
  – more complex models exist, e.g., assigning types to pages.

Web search before Google
• Web as a set of documents
• Relevance: content-based retrieval
  – documents match queries by contents
  – q: ‘clinton’ → rank higher the pages with more ‘clinton’
• Importance???
  – contents: what documents say about themselves
  – much spam and unreliable information in the results.
• Directory services were used
  – Yahoo! was one of the leaders!
  – Google co-founders were told “nobody will use a keyword interface”.

Hubs and Authorities
• An intuitive/informal definition:
  – authorities: highly regarded, authoritative pages
  – hubs: pages that refer you to authorities
• A recursive/formal definition: mutually reinforcing relationships
  – hub: a page that links to many authorities
  – authority: a page that is linked to by many hubs

Web: Adjacency Matrix
• Web: $G = (V, E)$
  – $V = \{x, y, z\}$, $|V| = n$
  – $E = \{(x, x), (x, y), (x, z), (y, z), (z, x), (z, y)\}$
  – $A$: $n \times n$ matrix: $A_{ij} = 1$ if page $i$ links to page $j$, 0 if not
• Rows are source nodes, columns are target nodes (both in the order x, y, z):

$$A = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}$$

Transposed Adjacency Matrix
• Adjacency matrix $A$:
  – what does row $j$ represent?
• Transpose $A^T$:
  – what does row $j$ represent?

$$A = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}, \qquad A^T = \begin{pmatrix} 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}$$

Hubbiness and Authority
[Figure: the example graph on pages x, y, z]
• Hubbiness: a vector $h$
  – $h_i$ is a value representing the “hubbiness” of page $i$
• Authority: a vector $a$
  – $a_i$ is a value representing the “authority” of page $i$
• Mutually recursive definition: in terms of $h$ and $a$
  – $h_x = ?$
  – $a_x = ?$

Hubbiness
• Hubbiness:
  – $h_x = a_x + a_y + a_z$
  – $h_y = a_z$
  – $h_z = a_x + a_y$
• $h = \alpha A a$
  – $A$: links-to nodes
  – $a$: their authority weights
  – $\alpha$: scaling factor to normalize

Authority
• Authority:
  – $a_x = h_x + h_z$
  – $a_y = h_x + h_z$
  – $a_z = h_x + h_y$
• $a = \beta A^T h$
  – $A^T$: linked-from nodes
  – $h$: their hub weights
  – $\beta$: scaling factor

Finding Hubbiness and Authority
• Recursive definition:
  – $a = \beta A^T h$, $h = \alpha A a$
• Authority: $a = \alpha\beta (A^T A) a$
  – $a$ is an eigenvector of $A^T A$
• Hubbiness: $h = \alpha\beta (A A^T) h$
  – $h$ is an eigenvector of $A A^T$

Computing Hubbiness and Authority
• Computation: by “relaxation”
  – start from some initial values of $a$ and $h$
    • $z = (1, 1, \dots, 1)$
    • $a_0 = z$; $h_0 = z$
  – repeat until fixpoint: apply the equations
    • $a_i = \alpha\beta (A^T A) a_{i-1}$
    • $h_i = \alpha\beta (A A^T) h_{i-1}$
    • fixpoint: $a_i \approx a_{i-1}$, $h_i \approx h_{i-1}$
• Convergence:
  – for $a$: $A^T A$ is symmetric (and $z$ is “right”, i.e., not orthogonal to the principal eigenvector) → relaxation will converge to the principal eigenvector of $A^T A$
  – for $h$: similarly, the principal eigenvector of $A A^T$

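The relaxation described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not code from the lecture: the helper names (`transpose`, `mat_vec`, `unit`, `hits`), the tolerance, and the iteration cap are assumptions, and rescaling each vector to unit length stands in for the scaling factors α and β.

```python
import math

# Adjacency matrix from the slides, rows/columns in the order x, y, z;
# A[i][j] = 1 if page i links to page j.
A = [
    [1, 1, 1],  # x -> x, y, z
    [0, 0, 1],  # y -> z
    [1, 1, 0],  # z -> x, y
]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def mat_vec(m, v):
    """Multiply matrix m by column vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def unit(v):
    """Rescale v to unit length (this plays the role of alpha and beta)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def hits(adj, tol=1e-10, max_iter=1000):
    """Hypothetical helper: relax a and h until they stop changing."""
    n = len(adj)
    adj_t = transpose(adj)
    a = [1.0] * n  # a_0 = z
    h = [1.0] * n  # h_0 = z
    for _ in range(max_iter):
        new_a = unit(mat_vec(adj_t, h))    # a = beta * A^T h
        new_h = unit(mat_vec(adj, new_a))  # h = alpha * A a
        change = max(abs(p - q) for p, q in zip(new_a + new_h, a + h))
        a, h = new_a, new_h
        if change < tol:  # fixpoint reached
            break
    return a, h


authority, hub = hits(A)
print([round(x, 2) for x in authority])  # [0.63, 0.63, 0.46]
print([round(x, 2) for x in hub])        # [0.79, 0.21, 0.58]
```

Alternating the two updates is equivalent to repeatedly applying $A^T A$ to $a$ and $A A^T$ to $h$, so the vectors settle on the principal eigenvectors named on the slide.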
Computing Hubbiness and Authority
• Assume $\alpha = 1$, $\beta = 1$, initial $h = a = (1, 1, 1)$
  – note: $A^T A$ and $A A^T$ are both symmetric matrices

$$a = (A^T A)\,a, \qquad A^T A = \begin{pmatrix} 2 & 2 & 1 \\ 2 & 2 & 1 \\ 1 & 1 & 2 \end{pmatrix}$$

  iteration   1    2    3     4
  a_x         1    5    24    114
  a_y         1    5    24    114
  a_z         1    4    18    84

$$h = (A A^T)\,h, \qquad A A^T = \begin{pmatrix} 3 & 1 & 2 \\ 1 & 1 & 0 \\ 2 & 0 & 2 \end{pmatrix}$$

  iteration   1    2    3     4
  h_x         1    6    28    132
  h_y         1    2    8     36
  h_z         1    4    20    96

• Will converge, e.g., with some scaling:
  – $a \to (1.37, 1.37, 1)$ (or $(0.63, 0.63, 0.46)$ as a unit vector)

Google: PageRank
• Reference: http://www7.scu.edu.au/
  – S. Chakrabarti, B. Dom, P. Raghavan, S. Rajagopalan, D. Gibson, J. M. Kleinberg: Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text. WWW7 / Computer Networks 30(1-7): 65-74 (1998)
  – S. Brin, L. Page: The Anatomy of a Large-Scale Hypertextual Web Search Engine. WWW7 / Computer Networks 30(1-7): 107-117 (1998)
• Google.com:
  – in the Stanford Digital Libraries project 1996-98
    • around the same time as Kleinberg’s paper
  – tried to sell to Infoseek in 1997
  – founded in 1998 by Brin and Page

PageRank: importance of pages
• PageRank (or importance): recursively
  – a page P is important if important pages link to it
  – importance of P:
    • proportionally contributed by the back-linked pages
• Example: [Figure: example graph on pages x, y, z]
  – $r_x = \tfrac{1}{2} r_x + \tfrac{1}{2} r_z$
  – $r_y = \tfrac{1}{2} r_z$
  – $r_z = \tfrac{1}{2} r_x + 1\, r_y$
• Random-surfer interpretation:
  – surfer randomly follows links to navigate
  – PageRank = the probability that the surfer will visit the page

Computing PageRank
• Importance-propagation equation:

$$r = \begin{pmatrix} 1/2 & 0 & 1/2 \\ 0 & 0 & 1/2 \\ 1/2 & 1 & 0 \end{pmatrix} r$$

• linked-from ($A^T$) or links-to matrix ($A$)?
• column-normalized:
  – column x is all that x points to
  – sum of column = 1
• Computation: by relaxation

  iteration   1    2     3     …   fixpoint
  r_x         1    1     5/4   …   6/5
  r_y         1    1/2   3/4   …   3/5
  r_z         1    3/2   1     …   6/5

Problems: Dead Ends
• Dead ends:
  – a page without successors has nowhere to send its importance
  – eventually, what would happen to r?
• Example: [Figure: two-page example with pages a and b]
  – $r_a = 0\, r_a + 0\, r_b$
  – $r_b = 1\, r_a + 0\, r_b$

Problems: Spider Trap
• Spider traps:
  – a group of pages without out-of-group links will trap a spider inside
  – what would happen to r?
• Example: [Figure: two-page example with pages a and b]
  – $r_a = \tfrac{1}{2} r_a + 0\, r_b$
  – $r_b = \tfrac{1}{2} r_a + 1\, r_b$
• Solutions??

Solutions: surfer’s random jump
• Surfer can randomly jump to a new page
  – without following links

$$PR(A) = (1 - d) + d \left( \frac{PR(T_1)}{C(T_1)} + \dots + \frac{PR(T_n)}{C(T_n)} \right)$$

  – $T_1, \dots, T_n$: the pages that link to A; $C(T)$: the number of links going out of page T
  – $d$: damping factor (set to 0.85 in the paper)
    • $1 - d$ models the probability of randomly jumping to a page
  – another interpretation:
    • “tax” the importance of each page and distribute it to all pages
• Teleportation

Anti-Spamming
• Spamming:
  – attempt to create artifacts to “please” search engines
  – so that ranking will be high
  – e.g., commercial “search engine optimization service”
• Google anti-spam device:
  – unlike other search engines, tends to believe what others say about you
    • by links and anchor texts
  – recursive importance also works:
    • importance (not just links) propagates
  – Still not a perfect solution: suggestions?

Hub/Authority versus PageRank
• As a “refining service”, given extra time to process
• As an add-on to existing search engines

PageRank and Hub/Authority influence
• Connected DB/DM with link analysis
  – Rumored that the Google paper was rejected for “not being original”!
  – Basic building blocks of modern link analysis algorithms
    • Web, social networks, biological networks, …
  – information networks, graph DBs
• Typical problems
  – finding similar nodes (items)
  – community detection / node clustering
  – …
More in SIGMOD, VLDB, ICDE, KDD, EDBT conferences…

Web as a database
Active and challenging research area
• Information extraction
  – finding entities and relationships from pages
• Information integration
  – integrating data from multiple websites
• Easier-to-use query interfaces
  – natural-language queries / question answering
More in WWW, SIGMOD, VLDB, ICDE, …

Your questions
• Other factors, such as the location of the link in the page
• How to be fair toward new pages?
• Losing information by eliminating dangling pages
• Idea of PageRank for other data models
• Dealing with evolution of Web structure
• Dynamic Web pages (JavaScript, …)
• How to store the link structure?

What you should know
• Web data and query model
• PageRank formula and algorithm
• Dead ends and spider traps
• Teleportation

Next
• Database system implementation
  – DBMS architecture, storage, and access methods
• You have two papers to review
  – rather short papers!
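
As a recap of the PageRank formula, dead ends, spider traps, and teleportation, the following plain-Python sketch runs the relaxation on the three small examples from this lecture. It is an illustration under stated assumptions, not the lecture's own code: the function names `propagate` and `relax`, the constant names, and the fixed step count of 100 are hypothetical, and the update is a direct reading of the formula PR(A) = (1-d) + d(…) with every rank started at 1.

```python
def propagate(m, r):
    """One step r' = M r for a column-normalized matrix M."""
    return [sum(w * x for w, x in zip(row, r)) for row in m]


def relax(m, steps, d=1.0):
    """Hypothetical helper: apply r' = (1 - d) + d * (M r), `steps` times.

    d = 1.0 is plain importance propagation (no random jump);
    d = 0.85 is the damping factor quoted from the paper.
    """
    r = [1.0] * len(m)
    for _ in range(steps):
        r = [(1.0 - d) + d * x for x in propagate(m, r)]
    return r


# Column-normalized matrices read off the lecture's equations
# (column j lists where page j sends its importance).
M_EXAMPLE = [[0.5, 0.0, 0.5],
             [0.0, 0.0, 0.5],
             [0.5, 1.0, 0.0]]          # pages x, y, z
M_DEAD_END = [[0.0, 0.0],
              [1.0, 0.0]]              # a -> b; b has no successors
M_SPIDER_TRAP = [[0.5, 0.0],
                 [0.5, 1.0]]           # a -> a, b; b -> b only

print(relax(M_EXAMPLE, 100))              # approaches (6/5, 3/5, 6/5)
print(relax(M_DEAD_END, 100))             # all importance leaks away
print(relax(M_SPIDER_TRAP, 100))          # b absorbs all importance
print(relax(M_SPIDER_TRAP, 100, d=0.85))  # with random jumps, a keeps some rank
```

Without the random jump, the dead end drains every rank to zero and the spider trap collects the total in page b. With d = 0.85 the same trap leaves page a a small but nonzero rank.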
