Web Data Model

CS 540
Database Management Systems
Lecture 5: Web Data Management
some slides from Kevin Chang’s CS511
Announcement
• Project proposal due tonight 11:59 pm on TEACH.
• Assignment 1 is posted
– Due on January 29th at 11:59 pm.
• Many reviews have very good questions
– some reviews do not include any questions
• they will lose some points.
• No class on Thursday.
• Arash’s office hour on Thursday is canceled.
– make-up office hour on Wednesday 1/28, 3:00-4:00 pm
How is the Web different from a database of documents?
How is the Web different from a database of documents?
• Hypertext vs. text: a lot of additional clues
– graph vs. set
– anchor text vs. text: what others say about you
• Geographically distributed vs. centralized
– so you need to build a crawler
• Precision more valued than recall
– quality is more important than quantity, especially for “broad” queries
• Spamming
• Hoaxes and more …
• Web scale is super-huge and evolving
– scalability is the key
Web data and query
• Data model
– directed graph
– nodes: Web pages
– links: hyperlinks
– all nodes belong to the same type
• Query is a set of terms
• Answer
– ranked list of relevant and important pages
– quantifying a subjective quality
• Basic data model
– more complex models exist, e.g., assigning types to pages
Web search before Google
• Web as a set of documents
• Relevance: content-based retrieval
– documents match queries by contents
– q: ‘clinton’ → rank higher pages with more ‘clinton’
• Importance???
– contents: what documents say about themselves
– much spam and unreliable information in the results
• Directory services were used
– Yahoo! was one of the leaders!
– Google co-founders were told “nobody will use a keyword interface”.
Hubs and Authorities
• An intuitive/informal definition:
– authorities: highly-regarded, authoritative pages
– hubs: pages that refer you to authorities
• A recursive/formal definition: mutually reinforcing relationships
– hub: a page that links to many authorities
– authority: a page that is linked to by many hubs
Web: Adjacency Matrix
• Web: G = {V, E}
– V = {x, y, z}, |V| = n
– E = {(x, x), (x, y), (x, z), (y, z), (z, x), (z, y)}
– A: n x n matrix: Aij = 1 if page i links to page j, 0 if not
– rows: source node; columns: target node (in order x, y, z)

        1 1 1
  A =   0 0 1
        1 1 0
Transposed Adjacency Matrix
• Adjacency matrix A:
– what does row j represent?
• Transpose At:
– what does row j represent?

        1 1 1            1 0 1
  A =   0 0 1     At =   1 0 1
        1 1 0            1 1 0
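The adjacency matrix and its transpose for this three-page example can be built directly from the edge set. A minimal numpy sketch (the `pages` mapping and variable names are illustrative, not from the slides):

```python
import numpy as np

# Pages x, y, z mapped to indices 0, 1, 2 (an assumed encoding)
pages = {"x": 0, "y": 1, "z": 2}
edges = [("x", "x"), ("x", "y"), ("x", "z"),
         ("y", "z"),
         ("z", "x"), ("z", "y")]

n = len(pages)
A = np.zeros((n, n), dtype=int)
for src, dst in edges:
    A[pages[src], pages[dst]] = 1  # Aij = 1 if page i links to page j

print(A)    # row i of A: the pages that i links to
print(A.T)  # row j of A^T: the pages that link to j
```

Row j of A lists j's outgoing links; row j of Aᵗ lists j's incoming links, which is exactly the distinction the questions above probe.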
Hubbiness and Authority
• Hubbiness: a vector h
– hi is a value representing the “hubbiness” of page i
• Authority: a vector a
– ai is a value representing the “authority” of page i
• Mutually recursive definition: in terms of h and a
– hx = ?
– ax = ?
(example graph with nodes x, y, z)
Hubbiness
• Hubbiness:
– hx = ax + ay + az
– hy = az
– hz = ax + ay

        1 1 1
  A =   0 0 1
        1 1 0

• h = αAa
– A: links-to matrix
– a: their authority weights
– α: scaling factor to normalize
Authority
• Authority:
– ax = hx + hz
– ay = hx + hz
– az = hx + hy

         1 0 1
  At =   1 0 1
         1 1 0

• a = βAth
– At: linked-from matrix
– h: their hub weights
– β: scaling factor
Finding Hubbiness and Authority
• Recursive definition:
– a = βAth, h = αAa
• Authority: a = αβ(AtA)a
– a is an eigenvector of AtA
• Hubbiness: h = αβ(AAt)h
– h is an eigenvector of AAt
Computing Hubbiness and Authority
• Computation: by “relaxation”
– start from some initial values of a and h
• z = (1, 1, …, 1)
• a0 = z; h0= z
– repeat until fixpoint: apply the equations
• ai = αβ (AtA)ai-1
• hi = αβ (AAt)hi-1
• fixpoint: ai ≈ ai-1, hi ≈ hi-1
• Convergence:
– for a: AtA is symmetric (and z is “right”) →
relaxation will converge to the principal eigenvector of AtA
– for h: similarly, the principal eigenvector of AAt
Computing Hubbiness and Authority
• Assume α = 1, β = 1, initial h = a = (1, 1, 1)
– note: AtA and AAt are both symmetric matrices

          2 2 1                  3 1 2
  AtA =   2 2 1          AAt =   1 1 0
          1 1 2                  2 0 2

  iteration:    0          1          2             3
  a:         (1, 1, 1)  (5, 5, 4)  (24, 24, 18)  (114, 114, 84)
  h:         (1, 1, 1)  (6, 2, 4)  (28, 8, 20)   (132, 36, 96)

• Will converge: e.g., with some scaling:
– a → 1.36, 1.36, 1 (or 0.63, 0.63, 0.46 as a unit vector)
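The relaxation above is easy to reproduce. A sketch of the unnormalized iteration with α = β = 1 on the example graph (three steps, matching the table):

```python
import numpy as np

A = np.array([[1, 1, 1],
              [0, 0, 1],
              [1, 1, 0]])

a = np.ones(3)  # initial authority vector (1, 1, 1)
h = np.ones(3)  # initial hub vector (1, 1, 1)

for _ in range(3):
    a, h = A.T @ A @ a, A @ A.T @ h  # a_i = (AtA)a_{i-1}, h_i = (AAt)h_{i-1}

print(a)  # after 3 steps: (114, 114, 84)
print(h)  # after 3 steps: (132, 36, 96)

# With normalization, a converges to the principal eigenvector of AtA
a_unit = a / np.linalg.norm(a)
print(np.round(a_unit, 2))  # approx (0.63, 0.63, 0.46)
```

Without the scaling factors the raw values blow up, which is why α and β (or a per-step normalization) are needed in practice; the direction of the vector is what converges.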
Google: PageRank
• Reference: http://www7.scu.edu.au/
– S. Chakrabarti, B. Dom, P. Raghavan, S. Rajagopalan, D. Gibson, J. M. Kleinberg: Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text. WWW7 / Computer Networks 30(1-7): 65-74 (1998)
– S. Brin, L. Page: The Anatomy of a Large-Scale Hypertextual Web Search Engine. WWW7 / Computer Networks 30(1-7): 107-117 (1998)
• Google.com:
– in the Stanford Digital Libraries project 1996-98
• around the same time as Kleinberg’s paper
– tried to sell to Infoseek in 1997
– founded in 1998 by Brin and Page
PageRank: importance of pages
• PageRank (or importance): defined recursively
– a page P is important if important pages link to it
– importance of P:
• proportionally contributed by the back-linked pages
• Example (x → {x, z}, y → {z}, z → {x, y}):
– rx = 1/2 rx + 1/2 rz
– ry = 1/2 rz
– rz = 1/2 rx + 1 ry
• Random-surfer interpretation:
– surfer randomly follows links to navigate
– PageRank = the probability that the surfer will visit the page
Computing PageRank
• Importance-propagation equation: r = M r

        1/2  0  1/2
  M =    0   0  1/2
        1/2  1   0

• linked-from (At) or links-to matrix (A)?
• column-normalized:
– column x is all that x points to
– sum of each column = 1
• Computation: by relaxation

  iteration:    0          1              2          …  fixpoint
  r:         (1, 1, 1)  (1, 1/2, 3/2)  (5/4, 3/4, 1)   (6/5, 3/5, 6/5)
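The relaxation for this example can be sketched in a few lines of numpy (iteration count is an arbitrary choice; in practice one stops when r changes less than some tolerance):

```python
import numpy as np

# Column-normalized links-to matrix: column x says how x splits its importance
M = np.array([[0.5, 0.0, 0.5],
              [0.0, 0.0, 0.5],
              [0.5, 1.0, 0.0]])

r = np.ones(3)        # start from r = (1, 1, 1)
for _ in range(100):  # relax until (approximately) a fixpoint
    r = M @ r

print(np.round(r, 4))  # (1.2, 0.6, 1.2), i.e., the fixpoint (6/5, 3/5, 6/5)
```

Because every column of M sums to 1, each step only redistributes importance; the total (here 3) is preserved, and 6/5 + 3/5 + 6/5 = 3 as expected.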
Problems: Dead Ends
• Dead ends:
– a page without successors has nowhere to send its importance
– eventually, what would happen to r?
• Example (a → b, b has no outgoing links):
– ra = 0 ra + 0 rb
– rb = 1 ra + 0 rb
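The leak is visible numerically: with a dead end, the column for b is all zeros, the matrix is no longer column-stochastic, and repeated relaxation drains all importance out of the system. A small sketch of the a → b example:

```python
import numpy as np

# a -> b; b has no outgoing links, so column b is all zeros
M = np.array([[0.0, 0.0],
              [1.0, 0.0]])

r = np.array([1.0, 1.0])
for step in range(5):
    r = M @ r
    print(step + 1, r)
# after step 1: r = (0, 1); after step 2 onward: r = (0, 0)
```

Everything ends up at the dead end for one step and then vanishes, since b sends its importance nowhere.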
Problems: Spider Trap
• Spider traps:
– a group of pages without out-of-group links will trap a spider inside
– what would happen to r?
• Example (a → {a, b}, b → {b}):
– ra = 1/2 ra + 0 rb
– rb = 1/2 ra + 1 rb
• Solutions??
Solutions: surfer’s random jump
• Surfer can randomly jump to a new page
– without following links
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
– d: damping factor (set to 0.85 in the paper)
• models the probability of randomly jumping to this page
• another interpretation:
– “tax” the importance of each page and distribute it to all pages
• Teleportation
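Applying the damped formula to the spider-trap example from the previous slide (a → {a, b}, b → {b}) shows how teleportation fixes the trap: without damping all importance drains into b, but with d = 0.85 both pages keep a nonzero score. A sketch (the iteration count and dict representation are implementation choices, not from the paper):

```python
d = 0.85                     # damping factor, as in the paper
pr = {"a": 1.0, "b": 1.0}    # initial PageRank values

for _ in range(100):
    # PR(A) = (1-d) + d * sum over in-links of PR(T)/C(T)
    new_a = (1 - d) + d * (pr["a"] / 2)             # a's only in-link: a itself, C(a) = 2
    new_b = (1 - d) + d * (pr["a"] / 2 + pr["b"])   # b gets from a (C=2) and itself (C=1)
    pr = {"a": new_a, "b": new_b}

print(round(pr["a"], 3), round(pr["b"], 3))  # 0.261 1.739
```

The trap page b still dominates, but the (1-d) "tax" guarantees every page a floor of 0.15 importance, so a no longer decays to zero.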
Anti-Spamming
• Spamming:
– attempts to create artifacts to “please” search engines
– so that ranking will be high
– e.g., commercial “search engine optimization” services
• Google’s anti-spam device:
– unlike other search engines, it tends to believe what others say about you
• via links and anchor texts
– recursive importance also works:
• importance (not just links) propagates
– still not a perfect solution: suggestions?
Hub/Authority versus PageRank
• As a “refining service” given extra time to process
• As an add-on to existing search engines
PageRank and Hub/Authority influence
• Connected DB/DM with link analysis
– Rumor has it that the Google paper was rejected for “not being original”!
– Basic building blocks of modern link-analysis algorithms
• Web, social networks, biological networks, …
– information networks, graph DBs
• Typical problems
– finding similar nodes (items)
– community detection / node clustering
– …
More in SIGMOD, VLDB, ICDE, KDD, EDBT conferences …
Web as a database
An active and challenging research area
• Information extraction
– finding entities and relationships in pages
• Information integration
– integrating data from multiple websites
• Easier-to-use query interfaces
– natural-language queries / question answering
More in WWW, SIGMOD, VLDB, ICDE, …
Your questions
• Other factors, such as the location of the link in the page
• How to be fair toward new pages?
• Losing information by eliminating dangling pages
• The idea of PageRank for other data models
• Dealing with the evolution of Web structure
• Dynamic Web pages (JavaScript, …)
• How to store the link structure?
What you should know
• Web data and query model
• PageRank formula and algorithm
• Dead ends and spider traps
• Teleportation
Next
• Database system implementation
– DBMS architecture, storage, and access methods
• You have two papers to review
– rather short papers!