(A taste of) Data Management Over the Web

Web R&D
• The web has revolutionized our world
  – Relevant research areas include databases, networks, security…
  – Data structures and architecture, complexity, image processing, security, natural language processing, user interface design…
• Lots of research in each of these directions
  – Specialized conferences for web research
  – Lots of companies
• This course will focus on Web Data

Web Data
• The web has revolutionized our world
• Data is everywhere
  – Web pages, images, movies, social data, likes and dislikes…
• Constitutes a great potential
• But also a lot of challenges
  – Web data is huge, not structured, dirty…
• Just the ingredients of a fun research topic!

Ingredients
• Representation & Storage
  – Standards (HTML, HTTP), compact representations, security…
• Search and Retrieval
  – Crawling, inferring information from text…
• Ranking
  – What's important and what's not
  – Google PageRank, Top-K algorithms, recommendations…

Challenges
• Huge
  – Over 14 billion pages indexed in Google
• Unstructured
  – But we do have some structure, such as HTML links and friendships in social networks…
• Dirty
  – A lot of the data is incorrect, inconsistent, contradictory, or just irrelevant…

Course Goal
• Introducing a selection of fun topics in web data management
• Allowing you to understand some state-of-the-art notions, algorithms, and techniques
• As well as the main challenges and how we approach them

Course outline
• Ranking: HITS and PageRank
• Data representation: XML; HTML
• Crawling
• Information Retrieval and Extraction, Wikipedia example
• Aggregating ranks and Top-K algorithms
• Recommendations, Collaborative Filtering for recommending movies in Netflix
• Other topics (time permitting): Deep Web, advertisements…

The course is partly based on Web Data Management and Distribution by Serge Abiteboul, Ioana Manolescu, Philippe Rigaux, Marie-Christine Rousset, and Pierre Senellart, and on a course by Pierre Senellart (and others) at Télécom ParisTech.

Course requirement
• A small final project
• Will involve understanding of 2 or 3 of the subjects studied and some implementation
• Will be given next Monday

Ranking

Why Ranking?
• Huge number of pages
• Huge even if we filter according to relevance
  – Keep only pages that include the keywords
• A lot of the pages are not informative
  – And anyway it is impossible for users to go through 10K results

How to rank?
• Observation: links are very informative!
• Instead of a collection of web pages, we have a web graph!
• This is important for discovering new sites (see crawling), but also for estimating the importance of a site
• CNN.com has more links to it than my homepage…

Authority and Hubness
• Authority: a site is very authoritative if it receives many citations; citations from important sites weigh more than citations from less important sites. A(v) = the authority of v
• Hubness measures the importance of a site as a hub: a good hub is a site that links to many authoritative sites. H(v) = the hubness of v

HITS (Kleinberg '99)
• Recursive dependency:
  a(v) = Σ h(u) over all pages u that link to v
  h(v) = Σ a(u) over all pages u that v links to
• Normalize according to the sum of authority / hubness values
• We can show that a(v) and h(v) converge
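To make the recursion concrete, here is a minimal sketch of the HITS iteration in Python; the tiny hand-made link graph and the fixed number of iterations are arbitrary choices for illustration, not part of the course material:

```python
# Minimal HITS sketch: alternate authority/hub updates on a toy web graph.
# The graph and the iteration count below are illustrative assumptions.

graph = {                      # page -> pages it links to
    "cnn.com": ["bbc.com"],
    "bbc.com": ["cnn.com"],
    "myhomepage": ["cnn.com", "bbc.com"],
}

pages = list(graph)
auth = {p: 1.0 for p in pages}
hub = {p: 1.0 for p in pages}

for _ in range(50):            # fixed number of iterations for simplicity
    # a(v) = sum of h(u) over pages u that link to v
    auth = {v: sum(hub[u] for u in pages if v in graph[u]) for v in pages}
    # h(v) = sum of a(u) over pages u that v links to
    hub = {v: sum(auth[u] for u in graph[v]) for v in pages}
    # normalize by the sum so the scores stay bounded
    a_norm = sum(auth.values()) or 1.0
    h_norm = sum(hub.values()) or 1.0
    auth = {v: a / a_norm for v, a in auth.items()}
    hub = {v: h / h_norm for v, h in hub.items()}

print(sorted(auth.items(), key=lambda kv: -kv[1]))   # most authoritative first
```

The normalization by the sum simply mirrors the slide; implementations often normalize by the L2 norm instead, which changes the scale of the scores but not their ranking.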
Random Surfer Model
• Consider a "random surfer"
• At each point, the surfer chooses a link and clicks on it
• P(W) = P(W1)*(1/O(W1)) + … + P(Wn)*(1/O(Wn))
  where W1…Wn are the pages linking to W and O(Wi) is the number of out-edges of Wi

Recursive definition
• PageRank reflects the probability of being at a web page (PR(W) = P(W))
• Then: PR(W) = PR(W1)*(1/O(W1)) + … + PR(Wn)*(1/O(Wn))
• How to solve?

Eigenvector!
• PR (a row vector) is the left eigenvector of the stochastic transition matrix
  – I.e., the adjacency matrix normalized so that every row sums to 1
• The Perron-Frobenius theorem ensures that such a vector exists
• It is unique if the matrix is irreducible
  – Can be guaranteed by small perturbations

Problems
• A random surfer may get stuck in one component of the graph
• May get stuck in loops
• "Rank Sink" problem
  – Many web pages have no outlinks

Damping Factor
• Add some probability d of "jumping" to a random page
• Now P(W) = (1-d)*[P(W1)*(1/O(W1)) + … + P(Wn)*(1/O(Wn))] + d*(1/N)
  where N is the number of pages in the index

How to compute PR?
• Analytical methods
  – Can we solve the equations?
  – In principle yes, but the matrix is huge!
  – Not a realistic solution for web scale
• Approximations

A random surfer algorithm
• Start from an arbitrary page
• Toss a coin to decide whether to follow a link or to jump to a random page
• Then toss another coin to decide which link to follow / which page to jump to
• Keep a record of the frequency of the web pages visited
• The frequency for each page converges to its PageRank

Power method
• Start with some arbitrary rank row vector R0
• Compute Ri = Ri-1 * A
• If we happen to reach the eigenvector, we stay there
• Theorem: the process converges to the eigenvector!
• Convergence is in practice pretty fast (~100 iterations)

Other issues
• Accelerating computation
• Distributed PageRank
• Mixed model (incorporating "static" importance)
• Personalized PageRank

XML and HTML

HTML (HyperText Markup Language)
• Used for presentation
• Standardized by the W3C (1999)
• Describes the structure and content of a (web) document
• HTML is an open format
  – Can be processed by a variety of tools

HTTP
• Application protocol
• Client request:
  GET /MarkUp/ HTTP/1.1
  Host: www.google.com
• Server response:
  HTTP/1.1 200 OK
• Two main HTTP methods: GET and POST

GET
• URL: http://www.google.com/search?q=BGU
• Corresponding HTTP GET request:
  GET /search?q=BGU HTTP/1.1
  Host: www.google.com

POST
• Used for submitting forms:
  POST /php/test.php HTTP/1.1
  Host: www.bgu.ac.il
  Content-Type: application/x-www-form-urlencoded
  Content-Length: 100
  …

Status codes
• An HTTP response always starts with a status code followed by a human-readable message (e.g., 200 OK)
• The first digit indicates the class of the response:
  1 Information
  2 Success
  3 Redirection
  4 Client-side error
  5 Server-side error

Authentication
• Simple authentication can be carried in a request header:
  GET ... HTTP/1.1
  Authorization: Basic dG90bzp0aXRp
• HTTPS is a variant of HTTP that adds encryption, cryptographic authentication, session tracking, etc.
• It can be used instead of plain HTTP to transmit sensitive data

Cookies
• Key/value pairs that a server asks a client to store and retransmit with each HTTP request (for a given domain name)
• Can be used to keep information on users between visits
• Often what is stored is a session ID
  – Connected, on the server side, to all session information
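To make the request/response exchange and the cookie mechanism above concrete, here is a minimal sketch using Python's standard http.client module; the host, path, and user-agent string are placeholders, not endpoints from the slides:

```python
# Minimal sketch of an HTTP GET over HTTPS, reading the status line and any
# Set-Cookie header the server asks us to store. Host/path are placeholders.
import http.client

conn = http.client.HTTPSConnection("www.example.com")
conn.request("GET", "/search?q=BGU", headers={"User-Agent": "course-demo"})
resp = conn.getresponse()

print(resp.status, resp.reason)                # e.g. 200 OK (first digit = class)
session_cookie = resp.getheader("Set-Cookie")  # often a session ID
body = resp.read()
conn.close()

# On a later request to the same domain, the client retransmits the cookie:
# conn.request("GET", "/", headers={"Cookie": session_cookie or ""})
```

A browser does this bookkeeping automatically: it stores the Set-Cookie value and attaches it, as in the trailing comment, to every subsequent request to the same domain.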
Crawling

Basics of Crawling
• Crawlers, (Web) spiders, (Web) robots: autonomous agents that retrieve pages from the Web
• Basic crawling algorithm (a runnable sketch appears at the end of this section):
  1. Start from a given URL or set of URLs
  2. Retrieve and process the corresponding page
  3. Discover new URLs (next slide)
  4. Repeat on each found URL
• Problem: the Web is huge!

Discovering new URLs
• Browse the "internet graph" (following, e.g., hyperlinks)
• Referrer URLs
• Site maps (sitemap.org)

The internet graph
• At least 14.06 billion nodes = pages
• At least 140 billion edges = links

Graph-browsing algorithms
• Depth-first
• Breadth-first
• Combinations…

Duplicates
• Identifying duplicates or near-duplicates on the Web to prevent multiple indexing
• Trivial duplicates: same resource at the same canonized URL:
  http://example.com:80/toto
  http://example.com/titi/../toto
• Exact duplicates: identification by hashing
• Near-duplicates (timestamps, tip of the day, etc.): more complex!

Near-duplicate detection
• Edit distance
  – Good measure of similarity
  – Does not scale to a large collection of documents (unreasonable to compute the edit distance for every pair!)
• Shingles: two documents are similar if they mostly share the same succession of k-grams (see the shingling sketch below)

Crawling ethics
• robots.txt at the root of a Web server:
  User-agent: *
  Allow: /searchhistory/
  Disallow: /search
• Per-page exclusion (de facto standard):
  <meta name="ROBOTS" content="NOINDEX,NOFOLLOW">
• Per-link exclusion (de facto standard):
  <a href="toto.html" rel="nofollow">Toto</a>
• Avoid denial of service (DoS): wait 100 ms to 1 s between two successive requests to the same Web server
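Putting the basic crawling loop, the politeness delay, and the robots.txt rules together, here is a minimal breadth-first sketch built on Python's standard library; the seed URL, user agent, page budget, and one-second delay are illustrative assumptions, and a real crawler would add per-host queues, proper HTML parsing, and duplicate filtering:

```python
# Minimal breadth-first crawler sketch that honors robots.txt and waits
# between requests. Seed URL, user agent, and limits are illustrative.
import re
import time
import urllib.request
import urllib.robotparser
from collections import deque
from urllib.parse import urljoin, urlparse

SEED = "https://example.com/"       # placeholder seed URL
AGENT = "course-demo-bot"           # illustrative user-agent string

# Fetch and parse the server's robots.txt once, up front
robots = urllib.robotparser.RobotFileParser()
robots.set_url(urljoin(SEED, "/robots.txt"))
robots.read()

frontier = deque([SEED])            # breadth-first: FIFO queue of URLs to visit
seen = {SEED}

while frontier and len(seen) < 20:  # small page budget for the sketch
    url = frontier.popleft()
    if not robots.can_fetch(AGENT, url):
        continue                    # excluded by robots.txt
    req = urllib.request.Request(url, headers={"User-Agent": AGENT})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            html = resp.read().decode("utf-8", errors="ignore")
    except OSError:
        continue                    # skip pages that fail to load
    # Discover new URLs: crude href extraction; a real crawler would parse HTML
    for href in re.findall(r'href="([^"]+)"', html):
        link = urljoin(url, href)
        if urlparse(link).netloc == urlparse(SEED).netloc and link not in seen:
            seen.add(link)
            frontier.append(link)
    time.sleep(1)                   # politeness: wait between requests
```

Real crawlers track the time of the last request per host, so the delay applies to each server rather than to the crawl as a whole.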
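Returning to near-duplicate detection, here is a minimal shingling sketch: each document is represented by its set of word k-grams, and two documents are compared by the Jaccard similarity of these sets (the comparison measure, the value of k, the threshold, and the toy documents are arbitrary choices for the example):

```python
# Minimal shingling sketch: two documents are near-duplicates if their sets
# of word k-grams (shingles) overlap heavily. k and the threshold are arbitrary.

def shingles(text, k=4):
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 1.0

doc1 = "the quick brown fox jumps over the lazy dog today"
doc2 = "the quick brown fox jumps over the lazy dog yesterday"

sim = jaccard(shingles(doc1), shingles(doc2))
print(f"similarity = {sim:.2f}")
print("near-duplicates" if sim > 0.5 else "different documents")
```

Unlike pairwise edit distance, shingle sets can be hashed and indexed, which is what makes this style of comparison usable on large collections.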