(A taste of)
Data Management Over the Web
Web R&D
• The web has revolutionized our world
– Relevant research areas include databases, networks,
data structures and architecture, complexity, image
processing, security, natural language processing, user
interface design…
• Lots of research in each of these directions
– Specialized conferences for web research
– Lots of companies
• This course will focus on Web Data
Web Data
• The web has revolutionized our world
• Data is everywhere
– Web pages, images, movies, social data, likes…
• This constitutes great potential
• But also a lot of challenges
– Web data is huge, unstructured, dirty…
• Just the ingredients of a fun research topic!
• Representation & Storage
– Standards (HTML, HTTP), compact
representations, security…
• Search and Retrieval
– Crawling, inferring information from text…
• Ranking
– What's important and what's not
– Google PageRank, Top-K algorithms…
• Huge
– Over 14 billion pages indexed by Google
• Unstructured
– But we do have some structure, such as HTML links,
friendships in social networks…
• Dirty
– A lot of the data is incorrect, inconsistent,
contradictory, or just irrelevant…
Course Goal
• Introducing a selection of fun topics in web
data management
• Allowing you to understand some state-of-the-art notions, algorithms, and techniques
• As well as the main challenges and how
we approach them
Course outline
Ranking: HITS and PageRank
Data representation: XML; HTML
Information Retrieval and Extraction, Wikipedia example
Aggregating ranks and Top-K algorithms
Recommendations, Collaborative Filtering for recommending
movies in Netflix
Other topics (time permitting): Deep Web, Advertisements…
The course is partly based on:
Web Data Management and Distribution,
Serge Abiteboul, Ioana Manolescu, Philippe Rigaux, Marie-Christine
Rousset, Pierre Senellart
And on a course by Pierre Senellart (and others) at Télécom ParisTech
Course requirement
• A small final project
• Will involve understanding of 2 or 3 of the subjects
studied and some implementation
• Will be given next Monday
Why Ranking?
• Huge number of pages
• Huge even if we filter according to relevance
– Keep only pages that include the keywords
• A lot of the pages are not informative
– And anyway it is impossible for users to go
through 10K results
How to rank?
• Observation: links are very informative!
• Instead of a collection of Web pages, we have
a Web graph!
• This is important for discovering new sites (see
crawling), but also for estimating the
importance of a site
• An important site has more links to it than my personal homepage
Authority and Hubness
• Authority: a site is very authoritative if it receives
many citations. Citations from important sites
weigh more than citations from less important ones
A(v) = the authority of v
• Hubness measures how good a site is as a directory of links. A good
hub is a site that links to many authoritative sites
H(v) = the hubness of v
HITS (Kleinberg ’99)
• Recursive dependency:
a(v) = Σ_{u→v} h(u)   (sum over the pages u linking to v)
h(v) = Σ_{v→u} a(u)   (sum over the pages u that v links to)
• Normalize according to the sum of
authority / hubness values
• We can show that a(v) and h(v) converge
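A minimal sketch of this iteration in Python (the dict-of-lists graph format and the fixed iteration count are illustrative assumptions, not part of the original algorithm statement):

# Toy HITS: repeat the authority/hub updates, normalizing each round.
def hits(graph, iterations=50):
    # graph: dict mapping each node to the list of nodes it links to
    nodes = list(graph)
    hub = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        # a(v) = sum of h(u) over all pages u that link to v
        auth = {v: sum(hub[u] for u in nodes if v in graph[u]) for v in nodes}
        # h(v) = sum of a(u) over all pages u that v links to
        hub = {v: sum(auth[u] for u in graph[v]) for v in nodes}
        # normalize by the sum of authority / hubness values
        a_sum, h_sum = sum(auth.values()), sum(hub.values())
        auth = {v: x / a_sum for v, x in auth.items()}
        hub = {v: x / h_sum for v, x in hub.items()}
    return auth, hub

# Example: a small graph where every page cites page "a"
g = {"a": ["b"], "b": ["a"], "c": ["a"]}
print(hits(g))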
Random Surfer Model
• Consider a "random surfer"
• At each step, chooses an outgoing link at random and clicks on it
P(W) = P(W1)*(1/O(W1)) + … + P(Wn)*(1/O(Wn))
where W1,…,Wn are the pages linking to W, and O(Wi) is
the number of out-edges of Wi
Recursive definition
• PageRank reflects the probability of being in a
web-page (PR(w) = P(w))
PR(W) = PR(W1)*(1/O(W1)) + … + PR(Wn)*(1/O(Wn))
• How to solve?
• PR (row vector) is the left eigenvector of the
stochastic transition matrix
– I.e. the adjacency matrix normalized so that the
sum of every row is 1
• The Perron–Frobenius theorem ensures that
such a vector exists
• Unique if the matrix is irreducible
– Can be guaranteed by small perturbations
• A random surfer may get stuck in one
component of the graph
• May get stuck in loops
• “Rank Sink” Problem
– Many Web pages have no outlinks
Damping Factor
• Add some probability d for "jumping" to a
random page
• Now P(W) = (1-d)*[P(W1)*(1/O(W1)) + … + P(Wn)*(1/O(Wn))] + d*(1/N)
where N is the number of pages in the index
How to compute PR?
• Analytical methods
– Can we solve the equations?
– In principle yes, but the matrix is huge!
– Not a realistic solution for web scale
• Approximations
A random surfer algorithm
• Start from an arbitrary page
• Toss a coin to decide if you want to follow a
link or to randomly choose a new page
• Then toss another coin to decide which link to
follow / which page to go to
• Keep a record of the frequency of the web pages visited
• The frequency for each page converges to its PageRank
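A minimal Monte Carlo sketch of this surfer, assuming a dict-of-lists graph and a 0.15 jump probability (both illustrative):

import random
from collections import Counter

def random_surfer(graph, steps=100_000, jump_prob=0.15):
    # graph: dict mapping each page to the list of pages it links to
    pages = list(graph)
    visits = Counter()
    page = random.choice(pages)            # start from an arbitrary page
    for _ in range(steps):
        visits[page] += 1
        out = graph[page]
        if not out or random.random() < jump_prob:
            page = random.choice(pages)    # jump to a random page
        else:
            page = random.choice(out)      # follow a random out-link
    # the visit frequency of each page estimates its PageRank
    return {p: c / steps for p, c in visits.items()}

g = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(random_surfer(g))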
Power method
• Start with some arbitrary rank row vector R0
• Compute Ri = Ri-1* A
• If we happen to reach the eigenvector, we stay there (it is a fixed point)
• Theorem: The process converges to the eigenvector!
• Convergence is in practice pretty fast (~100 iterations)
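A sketch of the power method with the damping factor from the previous slides folded into the transition matrix (numpy and the toy adjacency matrix are assumptions for illustration):

import numpy as np

def pagerank_power(adj, d=0.15, iterations=100):
    # adj: 0/1 adjacency matrix, adj[i, j] = 1 iff page i links to page j
    n = adj.shape[0]
    out_deg = adj.sum(axis=1, keepdims=True)
    # row-stochastic transition matrix: each row sums to 1
    A = adj / np.where(out_deg == 0, 1, out_deg)
    A[out_deg.ravel() == 0] = 1.0 / n      # dangling pages jump uniformly
    G = (1 - d) * A + d / n                # damped transition matrix
    r = np.full(n, 1.0 / n)                # arbitrary rank row vector R0
    for _ in range(iterations):
        r = r @ G                          # Ri = Ri-1 * A
    return r

adj = np.array([[0, 1, 1],
                [0, 0, 1],
                [1, 0, 0]])
print(pagerank_power(adj))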
Other issues
• Accelerating Computation
• Distributed PageRank
• Mixed model (incorporating "static" scores)
• Personalized PageRank
HTML (HyperText Markup Language)
• Used for presentation
• Standardized by W3C (1999)
• Describes the structure and content of a (Web) document
• HTML is an open format
– Can be processed by a variety of tools
HTTP (HyperText Transfer Protocol)
• Application protocol
Client request:
GET /MarkUp/ HTTP/1.1
Server response:
HTTP/1.1 200 OK
• Two main HTTP methods: GET and POST
• GET: parameters are part of the URL; corresponding HTTP GET request:
GET /search?q=BGU HTTP/1.1
• POST: used for submitting forms; parameters are sent in the request body:
POST /php/test.php HTTP/1.1
Content-Type: application/x-www-form-urlencoded
Content-Length: 100
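A hedged sketch of both methods with Python's standard http.client (the host and paths merely mirror the example requests above):

import http.client
from urllib.parse import urlencode

# GET: parameters are encoded in the URL
conn = http.client.HTTPConnection("www.example.com")
conn.request("GET", "/search?q=BGU")
resp = conn.getresponse()
print(resp.status, resp.reason)            # e.g., 200 OK

# POST: parameters travel in the request body
body = urlencode({"name": "toto", "value": "titi"})
headers = {"Content-Type": "application/x-www-form-urlencoded"}
conn = http.client.HTTPConnection("www.example.com")
conn.request("POST", "/php/test.php", body, headers)
print(conn.getresponse().status)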
Status codes
• An HTTP response always starts with a status code
followed by a human-readable message (e.g., 200 OK)
• The first digit indicates the class of the response:
1xx Informational
2xx Success
3xx Redirection
4xx Client-side error
5xx Server-side error
• HTTPS is a variant of HTTP that includes
encryption, cryptographic authentication,
session tracking, etc.
• It can be used instead of HTTP to transmit sensitive
information, e.g., authentication credentials:
GET ... HTTP/1.1
Authorization: Basic dG90bzp0aXRp
Cookies
• Key/value pairs that a server asks a client to store
and retransmit with each HTTP request (for a given
domain name)
• Can be used to keep information on users between
visits
• Often what is stored is a session ID
– Connected, on the server side, to all session
information
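A minimal sketch using Python's standard cookie jar, which stores Set-Cookie values and retransmits them on later requests to the same domain (the URL is a placeholder):

import urllib.request
from http.cookiejar import CookieJar

# The jar stores cookies from Set-Cookie headers and resends them
jar = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

opener.open("http://www.example.com/")     # server may set a session ID
for cookie in jar:
    print(cookie.name, cookie.value)       # e.g., the stored session ID
opener.open("http://www.example.com/")     # cookie sent back automatically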
Basics of Crawling
• Crawlers, (Web) spiders, (Web) robots: autonomous agents
that retrieve pages from the Web
• Basic crawling algorithm:
1. Start from a given URL or set of URLs
2. Retrieve and process the corresponding page
3. Discover new URLs (next slide)
4. Repeat on each found URL
Problem: The web is huge!
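A minimal breadth-first sketch of this loop using only the standard library; the regex link extraction is a deliberate simplification (a real crawler would use an HTML parser):

import re
import urllib.request
from urllib.parse import urljoin
from collections import deque

def crawl(seed, max_pages=20):
    frontier, seen = deque([seed]), {seed}
    while frontier and len(seen) <= max_pages:
        url = frontier.popleft()           # steps 1-2: take and fetch a URL
        try:
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue
        print("fetched", url)
        # step 3: discover new URLs (naive regex over href attributes)
        for link in re.findall(r'href="([^"#]+)"', html):
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)  # step 4: repeat on each found URL

crawl("http://www.example.com/")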
Discovering new URLs
• Browse the "internet graph" (following, e.g.,
hyperlinks)
• Referrer URLs
• Site maps (sitemap.xml)
The internet graph
• At least 14.06 billion nodes = pages
• At least 140 billion edges = links
Graph-browsing algorithms
• Depth-first
• Breadth-first
• Combinations…
• Identifying duplicates or near-duplicates on the
Web to prevent multiple indexing
• Trivial duplicates: same resource at the same
canonicalized URL
• Exact duplicates: identification by hashing
• Near-duplicates: documents differing only in details
(timestamps, tip of the day, etc.); more complex!
Near-duplicate detection
• Edit distance
– A good measure of similarity
– Does not scale to a large collection of documents
(unreasonable to compute the edit distance for
every pair!)
• Shingles: two documents are similar if they mostly share the
same succession of k-grams
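A sketch of shingle-based comparison: each document becomes a set of word k-grams, and two documents are near-duplicates when the Jaccard similarity of those sets is high (the documents and the 0.9 threshold are illustrative):

def shingles(text, k=3):
    # the set of all successions of k consecutive words
    words = text.split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(s1, s2):
    # Jaccard similarity: size of intersection over size of union
    return len(s1 & s2) / len(s1 | s2)

d1 = "the quick brown fox jumps over the lazy dog"
d2 = "the quick brown fox jumped over the lazy dog"
sim = jaccard(shingles(d1), shingles(d2))
print(sim, "near-duplicates" if sim > 0.9 else "different")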
Crawling ethics
• robots.txt at the root of a Web server
• User-agent: *
Allow: /searchhistory/
Disallow: /search
• Per-page exclusion (de facto standard).
<meta name="ROBOTS" content="NOINDEX,NOFOLLOW">
• Per-link exclusion (de facto standard).
<a href="toto.html" rel="nofollow">Toto</a>
• Avoid Denial of Service (DoS): wait 100 ms to 1 s between two
successive requests to the same Web server
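Python's standard urllib.robotparser can check these rules before fetching; a minimal sketch (the host is a placeholder):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()                                  # fetch and parse the exclusion rules
print(rp.can_fetch("*", "http://www.example.com/search"))          # False if disallowed
print(rp.can_fetch("*", "http://www.example.com/searchhistory/"))  # True if allowed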