CS 440 Database Management Systems: Web Data Management

How is the Web different from a database of documents?
• Hypertext vs. text: a lot of additional clues
– graph vs. set
– anchor text vs. text: what do others say about you?
• Geographically distributed vs. centralized
– so you need to build a crawler
• Precision is valued more than recall
– quality is more important than quantity, especially for “broad” queries
• Spamming
• Hoaxes and more …
• Web scale is super-huge
– scalability is the key
Web data and query
• Data model
– directed graph
– nodes: Web pages
– links: hyperlinks
– all nodes belong to the same type
• Query is a set of terms
• Answer
– ranked list of relevant and important pages
– quantifying a subjective quality
• This is the basic data/query model
– more complex models exist, e.g., assigning types to pages
Web search before Google
• Web as a set of documents
• Relevance: content-based retrieval
– documents match queries by contents
– q: ‘clinton’ → pages with more occurrences of ‘clinton’ rank higher
• Importance???
– contents: what documents say about themselves
– much spam and unreliable information in the results
• Directory services were used
– Yahoo! was one of the leaders
– Google co-founders were told “nobody will use a keyword interface”.
Google: PageRank
• From the Stanford Digital Libraries project, 1996-98
• Paper published in 1998:
S. Brin, L. Page: The Anatomy of a Large-Scale Hypertextual Web Search Engine. WWW7 / Computer Networks 30(1-7): 107-117 (1998)
• Tried to sell to Infoseek in 1997
• Google founded in 1998 by Brin and Page
Web: Adjacency Matrix
• Web: G = {V, E}
– V = {x, y, z}, |V| = n
– E = {(x, x), (x, y), (x, z), (y, z), (z, x), (z, y)}
– A: n x n matrix: Aij = 1 if page i links to page j, 0 if not
– rows are source nodes, columns are target nodes, both in the order x, y, z:

\[
A = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}
\]
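Transposed Adjacency Matrix
• Adjacency matrix A:
– what does row j represent?
• Transpose At:
– what does row j represent?

\[
A = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}
\qquad
A^{t} = \begin{pmatrix} 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}
\]
(rows and columns in the order x, y, z)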
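
The two matrix slides can be checked with a few lines of Python. This is only a sketch: the edge set E is the one given on the slide, while the helper names (`adjacency_matrix`, `transpose`, `row_members`) are made up for illustration. It also suggests one way to answer the "row j" questions: row j of A lists the pages that j links to, and row j of At lists the pages that link to j.

```python
# Edge set E from the slide; PAGES fixes the row/column order x, y, z.
PAGES = ["x", "y", "z"]
E = {("x", "x"), ("x", "y"), ("x", "z"), ("y", "z"), ("z", "x"), ("z", "y")}


def adjacency_matrix(pages, edges):
    """A[i][j] = 1 if page i links to page j, else 0."""
    return [[1 if (src, dst) in edges else 0 for dst in pages] for src in pages]


def transpose(matrix):
    """Swap rows and columns."""
    return [list(column) for column in zip(*matrix)]


def row_members(matrix, page):
    """Pages whose entry is 1 in the row that belongs to `page`."""
    row = matrix[PAGES.index(page)]
    return [p for p, bit in zip(PAGES, row) if bit]


A = adjacency_matrix(PAGES, E)
At = transpose(A)

out_links_of_y = row_members(A, "y")    # pages that y links to
back_links_of_y = row_members(At, "y")  # pages that link to y
print(out_links_of_y, back_links_of_y)
```

For page y the two rows differ (y links only to z, but both x and z link to y), which is why the choice between A and At matters when importance flows along back-links.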
PageRank: importance of pages
• PageRank (or importance) is defined recursively
– a page P is important if important pages link to it
– importance of P:
• contributed proportionally by the pages that link to it (its back-links)
• Example (x links to x and z; y links to z; z links to x and y):
– rx = 1/2 rx + 1/2 rz
– ry = 1/2 rz
– rz = 1/2 rx + 1 ry
• Random-surfer interpretation:
– surfer randomly follows links to navigate
– PageRank = the probability that the surfer will visit the page
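
As a rough illustration of "proportionally contributed", the sketch below derives the three example equations from a table of out-links. The out-link table is an assumption read off the equations themselves (the original figure is not available here), and `propagation_equations` is a hypothetical helper name.

```python
from fractions import Fraction

# Assumed out-links, inferred from the example equations.
OUT_LINKS = {"x": ["x", "z"], "y": ["z"], "z": ["x", "y"]}


def propagation_equations(out_links):
    """Return {page: {source: weight}}: each source page splits its
    importance equally among the pages it links to."""
    equations = {page: {} for page in out_links}
    for src, targets in out_links.items():
        if not targets:
            continue  # a page without out-links contributes nothing
        share = Fraction(1, len(targets))
        for dst in targets:
            equations[dst][src] = equations[dst].get(src, 0) + share
    return equations


equations = propagation_equations(OUT_LINKS)
for page, terms in equations.items():
    rhs = " + ".join(f"{weight} r{src}" for src, weight in sorted(terms.items()))
    print(f"r{page} = {rhs}")
```

The printed lines match the three equations of the example: a page with two out-links passes half of its importance along each link, and a page with one out-link passes all of it.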
Computing PageRank
• Importance-propagation equation (rows and columns in the order x, y, z):

\[
r = \begin{pmatrix} 1/2 & 0 & 1/2 \\ 0 & 0 & 1/2 \\ 1/2 & 1 & 0 \end{pmatrix} r
\]

• linked-from (At) or links-to matrix (A)?
• column-normalized:
• column x is all that x points to
• sum of column = 1
• Computation: by relaxation

iteration    1     2     3     …    fixpoint
rx           1     1     5/4   …    6/5
ry           1     1/2   3/4   …    3/5
rz           1     3/2   1     …    6/5
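
The relaxation can be traced with exact fractions. The following is a minimal sketch, assuming the column-normalized matrix of the x, y, z example and a starting vector of all ones; the names `step` and `relax` are illustrative, not taken from the slides.

```python
from fractions import Fraction

HALF = Fraction(1, 2)

# Column-normalized matrix of the example; rows and columns in the order x, y, z.
M = [
    [HALF, 0, HALF],  # rx = 1/2 rx + 1/2 rz
    [0, 0, HALF],     # ry = 1/2 rz
    [HALF, 1, 0],     # rz = 1/2 rx + 1 ry
]


def step(matrix, r):
    """One relaxation step: r_new = matrix * r."""
    return [sum(m * v for m, v in zip(row, r)) for row in matrix]


def relax(matrix, r, iterations):
    """Apply `step` repeatedly and keep every intermediate vector."""
    history = [r]
    for _ in range(iterations):
        r = step(matrix, r)
        history.append(r)
    return history


history = relax(M, [Fraction(1)] * 3, 50)
fixpoint = [Fraction(6, 5), Fraction(3, 5), Fraction(6, 5)]

for r in history[:3]:
    print([str(v) for v in r])
print([round(float(v), 4) for v in history[-1]])
```

The first three vectors match the first three columns of the table, and applying `step` to (6/5, 3/5, 6/5) returns the same vector, which is what makes it the fixpoint.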
Problems: Dead Ends
• Dead ends:
– page without successors has nowhere to send its importance
– eventually, what would happen to r?
• Example (pages a and b; a links only to b, and b has no successors):
– ra = 0 ra + 0 rb
– rb = 1 ra + 0 rb
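
To see what happens to r, the two dead-end equations can be iterated directly. This is a small sketch that assumes a starting vector of (1, 1); `step` is a hypothetical helper name.

```python
def step(r):
    """One relaxation step for the dead-end example."""
    ra, rb = r
    return (0 * ra + 0 * rb, 1 * ra + 0 * rb)


r = (1, 1)
trace = [r]
for _ in range(3):
    r = step(r)
    trace.append(r)

print(trace)
```

After two steps both values are 0 and stay there: the importance that reaches b has nowhere to go, so it leaks out of the system.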
Problems: Spider Trap
• Spider traps:
– a group of pages without out-of-group links will trap a spider inside
– what would happen to r?
• Example (pages a and b):
– ra = 1/2 ra + 0 rb
– rb = 1/2 ra + 1 rb
• Solutions??
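
Before turning to solutions, the spider-trap equations above can be iterated to see where the importance ends up. This is a sketch assuming a starting vector of (1, 1); `step` is a hypothetical helper name.

```python
from fractions import Fraction


def step(r):
    """One relaxation step for the spider-trap example."""
    ra, rb = r
    return (Fraction(1, 2) * ra + 0 * rb, Fraction(1, 2) * ra + 1 * rb)


r = (Fraction(1), Fraction(1))
trace = [r]
for _ in range(20):
    r = step(r)
    trace.append(r)

print([str(v) for v in trace[1]], [float(v) for v in trace[-1]])
```

Unlike the dead-end case, the total importance is preserved here, but ra is halved at every step, so b gradually absorbs all of it.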
Solutions: surfer’s random jump
• Surfer can randomly jump to a new page
– without following links
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
– T1, ..., Tn: pages that link to A; C(T): number of out-links of T
– d: damping factor (set to .85 in paper)
• 1-d models the probability of randomly jumping to this page
• another interpretation:
– “tax” importance of each page and distribute to all pages
• Teleportation
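
The formula can be tried on a spider-trap graph to see the effect of the random jump. This is a minimal sketch, not the paper's implementation: the graph (a links to a and b, b links only to itself) is an assumed example, and the function name `pagerank`, the starting value of 1.0, and the fixed count of 200 iterations are illustrative choices.

```python
def pagerank(out_links, d=0.85, iterations=200):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over all pages T that link to A."""
    pr = {page: 1.0 for page in out_links}
    for _ in range(iterations):
        new_pr = {}
        for page in out_links:
            incoming = sum(
                pr[src] / len(targets)  # C(src) = number of out-links of src
                for src, targets in out_links.items()
                if page in targets
            )
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Assumed spider-trap graph: a links to a and b, b links only to itself.
SPIDER_TRAP = {"a": ["a", "b"], "b": ["b"]}
ranks = pagerank(SPIDER_TRAP)
print(ranks)
```

With d = .85, page a keeps a positive score instead of draining to 0, because every page receives the (1-d) share regardless of its in-links.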
Anti-Spamming
• Spamming:
– attempts to create artifacts that “please” search engines
– so that ranking will be high
– e.g., commercial “search engine optimization” services
• Google anti-spam device:
– unlike other search engines, it tends to believe what others say about you
• through links and anchor texts
– recursive importance also helps:
• importance (not just links) propagates
– still, not a perfect solution
PageRank influence
• A basic building block for modern link analysis algorithms
• Web, social networks, biological networks, …
– information network, graph DB
• Typical problems
– finding similar nodes (items)
– community detection / node clustering
– keyword search
–…
Web as a database
Active and challenging research area
• Information extraction
– finding entities and relationships from pages
• Information integration
– integrating data from multiple websites
• Easier-to-use query interfaces
– natural-language queries / question answering
What you should know
• Web data and query model
• PageRank formula and algorithm
• Dead ends and spider traps
• Teleportation