Link Analysis

Mining Massive Datasets
Wu-Jun Li
Department of Computer Science and Engineering
Shanghai Jiao Tong University
Lecture 7: Link Analysis
Link Analysis Algorithms
• PageRank
• Hubs and Authorities
• Topic-Sensitive PageRank
• Spam Detection Algorithms
• Other interesting topics we won’t cover
  • Detecting duplicates and mirrors
  • Mining for communities (community detection)
    (Refer to Chapter 10 of the textbook)
Outline
 PageRank
 Topic-Sensitive PageRank
 Hubs and Authorities
 Spam Detection
PageRank
Ranking web pages
 Web pages are not equally “important”
 www.joe-schmoe.com v www.stanford.edu
 Inlinks as votes
 www.stanford.edu has 23,400 inlinks
 www.joe-schmoe.com has 1 inlink
 Are all inlinks equal?
 Recursive question!
Simple recursive formulation
 Each link’s vote is proportional to the importance of
its source page
 If page P with importance x has n outlinks, each link
gets x/n votes
 Page P’s own importance is the sum of the votes on
its inlinks
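Simple “flow” model
The web in 1839
[Figure: three-page web graph. Yahoo (y) links to itself and to Amazon; Amazon (a) links to Yahoo and to M’soft; M’soft (m) links to Amazon. Each link carries its source's importance divided by the source's out-degree: y/2, y/2, a/2, a/2, m.]
y = y/2 + a/2
a = y/2 + m
m = a/2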
Solving the flow equations
 3 equations, 3 unknowns, no constants
 No unique solution
 All solutions equivalent modulo scale factor
 Additional constraint forces uniqueness
 y+a+m = 1
 y = 2/5, a = 2/5, m = 1/5
 Gaussian elimination method works for small
examples, but we need a better method for large
graphs
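As a small illustration of the Gaussian elimination route, the sketch below solves the three flow equations exactly. The helper name `solve_linear_system` is hypothetical, and the system is set up under one assumption: the (redundant) equation for a is replaced by the constraint y + a + m = 1.

```python
from fractions import Fraction


def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination over exact fractions."""
    n = len(rhs)
    # Augmented matrix [matrix | rhs] in exact rational arithmetic.
    aug = [[Fraction(v) for v in row] + [Fraction(b)] for row, b in zip(matrix, rhs)]
    for col in range(n):
        # Pick a row at or below the diagonal with a nonzero pivot.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Normalize the pivot row, then clear this column in every other row.
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n] for row in aug]


half = Fraction(1, 2)
# Unknowns are ordered (y, a, m).
coefficients = [
    [half, -half, 0],  # y = y/2 + a/2  ->  y/2 - a/2 = 0
    [0, -half, 1],     # m = a/2        ->  -a/2 + m = 0
    [1, 1, 1],         # y + a + m = 1 (replaces the redundant equation for a)
]
constants = [0, 0, 1]
y, a, m = solve_linear_system(coefficients, constants)
print(y, a, m)  # 2/5 2/5 1/5
```

Elimination costs on the order of N^3 operations on a dense N×N system, which is why it only suits small examples like this one.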
Matrix formulation
• Matrix M has one row and one column for each web page
• Suppose page j has n outlinks
  • If j → i, then $M_{ij} = 1/n$
  • Else $M_{ij} = 0$
• M is a column stochastic matrix
  • Columns sum to 1
• Suppose r is a vector with one entry per web page
  • $r_i$ is the importance score of page i
  • Call it the rank vector
  • $|r| = 1$
Example
Suppose page j links to 3 pages, including i
[Figure: the product r = Mr. Column j of M has the entry 1/3 in row i, so page j contributes $r_j/3$ to $r_i$.]
Eigenvector formulation
 The flow equations can be written
r = Mr
 So the rank vector is an eigenvector of the stochastic
web matrix
 In fact, its first or principal eigenvector, with corresponding
eigenvalue 1
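Example
[Figure: the Yahoo / Amazon / M’soft graph.]
Matrix M (column = source page, row = destination page):
       y     a     m
 y    1/2   1/2    0
 a    1/2    0     1
 m     0    1/2    0
r = Mr:
$\begin{pmatrix} y \\ a \\ m \end{pmatrix} = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{pmatrix} \begin{pmatrix} y \\ a \\ m \end{pmatrix}$
y = y/2 + a/2
a = y/2 + m
m = a/2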
Power Iteration method
• Simple iterative scheme (aka relaxation)
• Suppose there are N web pages
• Initialize: $r^0 = [1/N, \ldots, 1/N]^T$
• Iterate: $r^{k+1} = M r^k$
• Stop when $|r^{k+1} - r^k|_1 < \epsilon$
  • $|x|_1 = \sum_{1 \le i \le N} |x_i|$ is the L1 norm
  • Can use any other vector norm, e.g., Euclidean
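Power Iteration Example
[Figure: the Yahoo / Amazon / M’soft graph.]
Matrix M (column = source page, row = destination page):
       y     a     m
 y    1/2   1/2    0
 a    1/2    0     1
 m     0    1/2    0
Iterates of r = (y, a, m), starting from [1/3, 1/3, 1/3]:
 y =  1/3   1/3   5/12   3/8     . . .   2/5
 a =  1/3   1/2   1/3    11/24   . . .   2/5
 m =  1/3   1/6   1/4    1/6     . . .   1/5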
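The iteration above can be sketched in a few lines of plain Python. This is a minimal dense version for the three-page example only; the function name `power_iteration` and the default values for `epsilon` and `max_iterations` are assumptions for illustration.

```python
def power_iteration(m, epsilon=1e-10, max_iterations=1000):
    """Iterate r <- M r from the uniform vector until the L1 change is below epsilon."""
    n = len(m)
    r = [1.0 / n] * n
    for _ in range(max_iterations):
        r_next = [sum(m[i][j] * r[j] for j in range(n)) for i in range(n)]
        # L1 norm of the difference between successive iterates.
        if sum(abs(x - y) for x, y in zip(r_next, r)) < epsilon:
            return r_next
        r = r_next
    return r


# Column-stochastic matrix for the Yahoo (y), Amazon (a), M'soft (m) example.
M = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 1.0],
    [0.0, 0.5, 0.0],
]
rank = power_iteration(M)
print([round(x, 4) for x in rank])  # [0.4, 0.4, 0.2]
```

The result agrees with the exact solution y = 2/5, a = 2/5, m = 1/5 of the flow equations.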
Random Walk Interpretation
 Imagine a random web surfer
 At any time t, surfer is on some page P
 At time t+1, the surfer follows an outlink from P uniformly
at random
 Ends up on some page Q linked from P
 Process repeats indefinitely
 Let p(t) be a vector whose ith component is the
probability that the surfer is at page i at time t
 p(t) is a probability distribution on pages
The stationary distribution
 Where is the surfer at time t+1?
 Follows a link uniformly at random
 p(t+1) = Mp(t)
 Suppose the random walk reaches a state such that
p(t+1) = Mp(t) = p(t)
 Then p(t) is called a stationary distribution for the random
walk
 Our rank vector r satisfies r = Mr
 So it is a stationary distribution for the random surfer
Existence and Uniqueness
A central result from the theory of random walks (aka Markov
processes):
For graphs that satisfy certain conditions, the
stationary distribution is unique and eventually will
be reached no matter what the initial probability
distribution at time t = 0.
Spider traps
 A group of pages is a spider trap if there are no links
from within the group to outside the group
 Random surfer gets trapped
 Spider traps violate the conditions needed for the
random walk theorem
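Microsoft becomes a spider trap
[Figure: the same graph, except that M’soft now links only to itself.]
Matrix M (column = source page, row = destination page):
       y     a     m
 y    1/2   1/2    0
 a    1/2    0     0
 m     0    1/2    1
Iterates of r = (y, a, m), starting from [1, 1, 1]:
 y =  1   1     3/4   5/8   . . .   0
 a =  1   1/2   1/2   3/8   . . .   0
 m =  1   3/2   7/4   2     . . .   3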
Random teleports
• The Google solution for spider traps
• At each time step, the random surfer has two options:
  • With probability $\beta$, follow a link at random
  • With probability $1-\beta$, jump to some page uniformly at random
  • Common values for $\beta$ are in the range 0.8 to 0.9
• Surfer will teleport out of spider trap within a few time steps
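Random teleports ($\beta$ = 0.8)
[Figure: the spider-trap graph with teleports. Each link weight is scaled by 0.8 (e.g., 0.8 × 1/2), and every page gains a teleport link of weight 0.2 × 1/3 to each page.]
Rows and columns ordered y, a, m:
$A = 0.8 \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 1 \end{pmatrix} + 0.2 \begin{pmatrix} 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \end{pmatrix} = \begin{pmatrix} 7/15 & 7/15 & 1/15 \\ 7/15 & 1/15 & 1/15 \\ 1/15 & 7/15 & 13/15 \end{pmatrix}$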
Random teleports ($\beta$ = 0.8)
Matrix A = 0.8 M + 0.2 [1/3] (rows and columns ordered y, a, m):
       y      a      m
 y    7/15   7/15   1/15
 a    7/15   1/15   1/15
 m    1/15   7/15   13/15
Iterates of r = (y, a, m), starting from [1, 1, 1]:
 y =  1   1.00   0.84   0.776   . . .   7/11
 a =  1   0.60   0.60   0.536   . . .   5/11
 m =  1   1.40   1.56   1.688   . . .   21/11
Matrix formulation
• Suppose there are N pages
  • Consider a page j, with set of outlinks O(j)
  • We have $M_{ij} = 1/|O(j)|$ when j → i and $M_{ij} = 0$ otherwise
• The random teleport is equivalent to
  • adding a teleport link from j to every other page with probability $(1-\beta)/N$
  • reducing the probability of following each outlink from $1/|O(j)|$ to $\beta/|O(j)|$
  • Equivalent: tax each page a fraction $(1-\beta)$ of its score and redistribute evenly
PageRank
• Construct the N×N matrix A as follows
  • $A_{ij} = \beta M_{ij} + (1-\beta)/N$
• Verify that A is a stochastic matrix
• The PageRank vector r is the principal eigenvector of this matrix
  • satisfying r = Ar
• Equivalently, r is the stationary distribution of the random walk with teleports
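The construction of A can be checked on the spider-trap example. The sketch below is illustrative only: `teleport_matrix` and `M_TRAP` are hypothetical names, exact fractions are used for the stochasticity check, and floats for the iteration.

```python
from fractions import Fraction


def teleport_matrix(m, beta):
    """Return A with A[i][j] = beta * M[i][j] + (1 - beta) / N."""
    n = len(m)
    return [[beta * m[i][j] + (1 - beta) / n for j in range(n)] for i in range(n)]


half = Fraction(1, 2)
# Spider-trap example, pages ordered (y, a, m); M'soft links only to itself.
M_TRAP = [
    [half, half, 0],
    [half, 0, 0],
    [0, half, 1],
]
A = teleport_matrix(M_TRAP, Fraction(4, 5))  # beta = 0.8

# A is column stochastic: every column sums to exactly 1.
column_sums = [sum(A[i][j] for i in range(3)) for j in range(3)]

# Power iteration on A, starting from the uniform distribution.
r = [1.0 / 3] * 3
for _ in range(100):
    r = [sum(float(A[i][j]) * r[j] for j in range(3)) for i in range(3)]
print([round(x, 4) for x in r])  # [0.2121, 0.1515, 0.6364]
```

The limit is (7/33, 5/33, 21/33), which is the slide's (7/11, 5/11, 21/11) rescaled so that the entries sum to 1 instead of 3.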
Dead ends
 Pages with no outlinks are “dead ends” for the
random surfer
 Nowhere to go on next step
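Microsoft becomes a dead end
[Figure: the same graph, except that M’soft now has no outlinks.]
Rows and columns ordered y, a, m:
$A = 0.8 \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 0 \end{pmatrix} + 0.2 \begin{pmatrix} 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \end{pmatrix} = \begin{pmatrix} 7/15 & 7/15 & 1/15 \\ 7/15 & 1/15 & 1/15 \\ 1/15 & 7/15 & 1/15 \end{pmatrix}$
Nonstochastic! (The m column sums to 3/15, not 1.)
Iterates of r = (y, a, m), starting from [1, 1, 1]:
 y =  1   1     0.787   0.648   . . .   0
 a =  1   0.6   0.547   0.430   . . .   0
 m =  1   0.6   0.387   0.333   . . .   0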
Dealing with dead ends
• Teleport
  • Follow random teleport links with probability 1.0 from dead ends
  • Adjust matrix accordingly
• Prune and propagate
  • Preprocess the graph to eliminate dead ends
  • Might require multiple passes
  • Compute PageRank on reduced graph
  • Approximate values for dead ends by propagating values from reduced graph
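The teleport option can be sketched as follows: every all-zero column of M (a dead end) is replaced by a uniform column, so the surfer leaves a dead end by teleporting with probability 1.0. The names `fix_dead_ends` and `pagerank` are hypothetical, and the code is a dense toy version for the three-page example rather than a scalable implementation.

```python
def fix_dead_ends(m):
    """Replace every all-zero column (dead end) with a uniform column of 1/N."""
    n = len(m)
    fixed = [row[:] for row in m]
    for j in range(n):
        if all(m[i][j] == 0 for i in range(n)):
            for i in range(n):
                fixed[i][j] = 1.0 / n
    return fixed


def pagerank(m, beta=0.8, iterations=100):
    """Iterate r <- A r with A = beta * M + (1 - beta) / N, from the uniform vector."""
    n = len(m)
    a = [[beta * m[i][j] + (1 - beta) / n for j in range(n)] for i in range(n)]
    r = [1.0 / n] * n
    for _ in range(iterations):
        r = [sum(a[i][j] * r[j] for j in range(n)) for i in range(n)]
    return r


# Pages ordered (y, a, m); M'soft has no outlinks, so its column is all zeros.
M_DEAD = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 0.0],
    [0.0, 0.5, 0.0],
]
leaky = pagerank(M_DEAD)                 # rank mass leaks away toward 0
rank = pagerank(fix_dead_ends(M_DEAD))   # rank mass is preserved
print(round(sum(leaky), 6), round(sum(rank), 6))
```

Without the fix the total rank decays toward zero, as in the dead-end example; with the fix the vector remains a probability distribution.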
Computing PageRank
• Key step is matrix-vector multiplication
  • $r^{new} = A r^{old}$
• Easy if we have enough main memory to hold A, $r^{old}$, $r^{new}$
• Say N = 1 billion pages
  • We need 4 bytes for each entry (say)
  • 2 billion entries for vectors, approx 8GB
  • Matrix A has $N^2$ entries
    • $10^{18}$ is a large number!
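Rearranging the equation
r = Ar, where $A_{ij} = \beta M_{ij} + (1-\beta)/N$
$r_i = \sum_{1 \le j \le N} A_{ij} r_j$
$r_i = \sum_{1 \le j \le N} [\beta M_{ij} + (1-\beta)/N] \, r_j$
$\quad = \beta \sum_{1 \le j \le N} M_{ij} r_j + \frac{1-\beta}{N} \sum_{1 \le j \le N} r_j$
$\quad = \beta \sum_{1 \le j \le N} M_{ij} r_j + \frac{1-\beta}{N}$, since $|r| = 1$
$r = \beta M r + [(1-\beta)/N]_N$
where $[x]_N$ is an N-vector with all entries x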
Sparse matrix formulation
• We can rearrange the PageRank equation:
  • $r = \beta M r + [(1-\beta)/N]_N$
  • $[(1-\beta)/N]_N$ is an N-vector with all entries $(1-\beta)/N$
• M is a sparse matrix!
  • 10 links per node, approx 10N entries
• So in each iteration, we need to:
  • Compute $r^{new} = \beta M r^{old}$
  • Add a constant value $(1-\beta)/N$ to each entry in $r^{new}$
Sparse matrix encoding
• Encode sparse matrix using only nonzero entries
  • Space proportional roughly to number of links
  • say 10N, or 4*10*1 billion = 40GB
  • still won’t fit in memory, but will fit on disk
 source node | degree | destination nodes
 0           | 3      | 1, 5, 7
 1           | 5      | 17, 64, 113, 117, 245
 2           | 2      | 13, 23
Basic Algorithm
• Assume we have enough RAM to fit $r^{new}$, plus some working memory
  • Store $r^{old}$ and matrix M on disk
• Basic Algorithm:
  • Initialize: $r^{old} = [1/N]_N$
  • Iterate:
    • Update: Perform a sequential scan of M and $r^{old}$ to update $r^{new}$
    • Write out $r^{new}$ to disk as $r^{old}$ for next iteration
    • Every few iterations, compute $|r^{new} - r^{old}|$ and stop if it is below threshold
      • Need to read in both vectors into memory
Update step
Initialize all entries of r^new to (1-β)/N
For each page p (out-degree n):
    Read into memory: p, n, dest_1, …, dest_n, r^old(p)
    for j = 1..n:
        r^new(dest_j) += β * r^old(p) / n
[Figure: the vectors r^new and r^old (entries 0 to 6 shown) and the link table scanned during the update.]
 src | degree | destination
 0   | 3      | 1, 5, 6
 1   | 4      | 17, 64, 113, 117
 2   | 2      | 13, 23
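The update step translates almost line for line into Python. In the sketch below an in-memory dict stands in for the on-disk link table (an assumption for illustration), `update_step` is a hypothetical name, and the graph is the three-page spider-trap example, which has no dead ends.

```python
def update_step(links, r_old, beta):
    """One pass over the sparse link table: r_new = beta * M * r_old + (1 - beta) / N."""
    n = len(r_old)
    # Initialize all entries of r_new to (1 - beta) / N.
    r_new = [(1.0 - beta) / n] * n
    # Sequential scan: one record (source, destinations) per page.
    for source, destinations in links.items():
        degree = len(destinations)
        for dest in destinations:
            r_new[dest] += beta * r_old[source] / degree
    return r_new


# Pages 0 = Yahoo, 1 = Amazon, 2 = M'soft (M'soft links only to itself).
links = {0: [0, 1], 1: [0, 2], 2: [2]}

first = update_step(links, [1.0 / 3] * 3, beta=0.8)

rank = [1.0 / 3] * 3
for _ in range(100):
    rank = update_step(links, rank, beta=0.8)
print([round(x, 4) for x in rank])  # [0.2121, 0.1515, 0.6364]
```

Only the nonzero entries of M are ever touched, so the work per iteration is proportional to the number of links rather than to N^2.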
Analysis
 In each iteration, we have to:
 Read rold and M
 Write rnew back to disk
 IO Cost = 2|r| + |M|
 What if we had enough memory to fit both rnew and
rold?
 What if we could not even fit rnew in memory?
 10 billion pages
Strip-based update
Problem: thrashing
Block Update algorithm
[Figure: block update example showing r^new (entries 0 to 3), the link table with columns src, degree, destination, and r^old (entries 0 to 3).]
• Some additional overhead
  • But usually worth it
• Cost per iteration
  • $|M|(1+\epsilon) + (k+1)|r|$, where k is the number of blocks of $r^{new}$ and $\epsilon$ is the relative overhead of splitting M
Topic-Sensitive PageRank
Some problems with PageRank
 Measures generic popularity of a page
 Biased against topic-specific authorities
 Ambiguous queries e.g., jaguar
 Uses a single measure of importance
 Other models e.g., hubs-and-authorities
 Susceptible to link spam
 Artificial link topographies created in order to boost page
rank
Topic-Sensitive PageRank
 Instead of generic popularity, can we measure popularity
within a topic?
 E.g., computer science, health
 Bias the random walk
 When the random walker teleports, he picks a page from a set S of
web pages
 S contains only pages that are relevant to the topic
 E.g., Open Directory (DMOZ) pages for a given topic (www.dmoz.org)
 For each teleport set S, we get a different rank vector rS
Matrix formulation
• $A_{ij} = \beta M_{ij} + (1-\beta)/|S|$ if i is in S
• $A_{ij} = \beta M_{ij}$ otherwise
• Show that A is stochastic
• We have weighted all pages in the teleport set S equally
  • Could also assign different weights to them
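Example
Suppose S = {1}, $\beta$ = 0.8
[Figure: four-node graph. Node 1 links to nodes 2 and 3 (probability 0.5 each, 0.4 after scaling by $\beta$); node 2 links to node 1, node 3 links to node 4, and node 4 links to node 3 (probability 1 each, 0.8 after scaling); every teleport (probability 0.2) lands on node 1.]
 Node | Iteration 0 | 1   | 2    | … | stable
 1    | 1.0         | 0.2 | 0.52 | … | 0.294
 2    | 0           | 0.4 | 0.08 | … | 0.118
 3    | 0           | 0.4 | 0.08 | … | 0.327
 4    | 0           | 0   | 0.32 | … | 0.261
Note how we initialize the PageRank vector differently from the
unbiased PageRank case.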
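The table can be reproduced with a short sketch. The edge list below is an assumption: it is read off the edge weights in the figure and is consistent with every column of the table. The function name `topic_sensitive_step` is hypothetical.

```python
def topic_sensitive_step(links, r_old, teleport_set, beta):
    """One iteration of PageRank where teleports land only on pages in teleport_set."""
    r_new = {page: 0.0 for page in links}
    # Follow links with probability beta.
    for source, destinations in links.items():
        for dest in destinations:
            r_new[dest] += beta * r_old[source] / len(destinations)
    # Teleport with probability 1 - beta, split evenly over the teleport set.
    for page in teleport_set:
        r_new[page] += (1.0 - beta) / len(teleport_set)
    return r_new


# Assumed edges: 1 -> 2, 1 -> 3, 2 -> 1, 3 -> 4, 4 -> 3.
links = {1: [2, 3], 2: [1], 3: [4], 4: [3]}
S = {1}

# All initial rank sits on the teleport set, as in iteration 0 of the table.
r = {1: 1.0, 2: 0.0, 3: 0.0, 4: 0.0}
history = [r]
for _ in range(200):
    r = topic_sensitive_step(links, r, S, beta=0.8)
    history.append(r)

print({page: round(score, 3) for page, score in r.items()})
# {1: 0.294, 2: 0.118, 3: 0.327, 4: 0.261}
```

Changing `S` to another teleport set yields a different rank vector, one per topic.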
How well does TSPR work?
 Experimental results [Haveliwala 2000]
 Picked 16 topics
 Teleport sets determined using DMOZ
 E.g., arts, business, sports,…
 “Blind study” using volunteers
 35 test queries
 Results ranked using PageRank and TSPR of most closely
related topic
 E.g., bicycling using Sports ranking
 In most cases volunteers preferred TSPR ranking
Which topic ranking to use?
 User can pick from a menu
 Use Bayesian classification schemes to classify query
into a topic
 Can use the context of the query
 E.g., query is launched from a web page talking about a
known topic
 History of queries e.g., “basketball” followed by “jordan”
 User context e.g., user’s My Yahoo settings,
bookmarks, …
Hubs and Authorities
 Suppose we are given a collection of documents on
some broad topic
 e.g., stanford, evolution, iraq
 perhaps obtained through a text search
 Can we organize these documents in some manner?
 PageRank offers one solution
 HITS (Hypertext-Induced Topic Selection) is another
 proposed at approx the same time (1998)
HITS Model
• Interesting documents fall into two classes
• Authorities are pages containing useful information
  • course home pages
  • home pages of auto manufacturers
• Hubs are pages that link to authorities
  • course bulletin
  • list of US auto manufacturers
Idealized view
[Figure: a set of hubs, each linking to pages in a set of authorities.]
Mutually recursive definition
 A good hub links to many good authorities
 A good authority is linked from many good hubs
 Model using two scores for each node
 Hub score and Authority score
 Represented as vectors h and a
Transition Matrix A
• HITS uses a matrix A[i, j] = 1 if page i links to page j, 0 if not
• $A^T$, the transpose of A, is similar to the PageRank matrix M, but $A^T$ has 1’s where M has fractions
Example
[Figure: graph on Yahoo, Amazon, and M’soft.]
A (row = source page, column = destination page):
      y   a   m
 y    1   1   1
 a    1   0   1
 m    0   1   0
Hub and Authority Equations
• The hub score of page P is proportional to the sum of the authority scores of the pages it links to
  • $h = \lambda A a$
  • Constant $\lambda$ is a scale factor
• The authority score of page P is proportional to the sum of the hub scores of the pages it is linked from
  • $a = \mu A^T h$
  • Constant $\mu$ is a scale factor
Iterative algorithm
• Initialize h, a to all 1’s
• $h = Aa$
• Scale h so that its max entry is 1.0
• $a = A^T h$
• Scale a so that its max entry is 1.0
• Continue until h, a converge
Example
$A = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}, \quad A^T = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}$
 a(yahoo)  =  1   1   1     1      . . .   1
 a(amazon) =  1   1   4/5   0.75   . . .   0.732
 a(m’soft) =  1   1   1     1      . . .   1
 h(yahoo)  =  1   1     1      1      . . .   1.000
 h(amazon) =  1   2/3   0.71   0.73   . . .   0.732
 h(m’soft) =  1   1/3   0.29   0.27   . . .   0.268
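The iterative algorithm can be sketched directly on the 3×3 example. This is a minimal dense version; the function name `hits` and the fixed iteration count are assumptions (a convergence test could replace the count).

```python
def hits(adjacency, iterations=50):
    """Alternate h = A a and a = A^T h, rescaling each so its max entry is 1.0."""
    n = len(adjacency)
    hubs = [1.0] * n
    authorities = [1.0] * n
    for _ in range(iterations):
        # h = A a: a hub's score sums the authority scores of the pages it links to.
        hubs = [sum(adjacency[i][j] * authorities[j] for j in range(n)) for i in range(n)]
        top = max(hubs)
        hubs = [score / top for score in hubs]
        # a = A^T h: an authority's score sums the hub scores of the pages linking to it.
        authorities = [sum(adjacency[i][j] * hubs[i] for i in range(n)) for j in range(n)]
        top = max(authorities)
        authorities = [score / top for score in authorities]
    return hubs, authorities


# A[i][j] = 1 if page i links to page j; pages ordered (yahoo, amazon, m'soft).
A = [
    [1, 1, 1],
    [1, 0, 1],
    [0, 1, 0],
]
h, a = hits(A)
print([round(x, 3) for x in h])  # [1.0, 0.732, 0.268]
print([round(x, 3) for x in a])  # [1.0, 0.732, 1.0]
```

The limiting value 0.732 is $\sqrt{3} - 1$, the positive root of $x^2 + 2x - 2 = 0$ obtained by substituting a = (1, x, 1) into the two update equations.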
Existence and Uniqueness
$h = \lambda A a$
$a = \mu A^T h$
$h = \lambda \mu A A^T h$
$a = \lambda \mu A^T A a$
Under reasonable assumptions about A,
the dual iterative algorithm converges to vectors
h* and a* such that:
• h* is the principal eigenvector of the matrix $A A^T$
• a* is the principal eigenvector of the matrix $A^T A$
Bipartite cores
[Figure: hubs linking to authorities in two bipartite cores: the most densely-connected core (primary core) and a less densely-connected core (secondary core).]
Secondary cores
• A single topic can have many bipartite cores
  • corresponding to different meanings, or points of view
  • abortion: pro-choice, pro-life
  • evolution: darwinian, intelligent design
  • jaguar: auto, Mac, NFL team, panthera onca
• How to find such secondary cores?
Non-primary eigenvectors
• $A A^T$ and $A^T A$ have the same set of eigenvalues
  • An eigenpair is the pair of eigenvectors with the same eigenvalue
  • The primary eigenpair (largest eigenvalue) is what we get from the iterative algorithm
• Non-primary eigenpairs correspond to other bipartite cores
  • The eigenvalue is a measure of the density of links in the core
Finding secondary cores
 Once we find the primary core, we can remove its
links from the graph
 Repeat HITS algorithm on residual graph to find the
next bipartite core
 Technically, not exactly equivalent to non-primary
eigenpair model
Creating the graph for HITS
 We need a well-connected graph of pages for HITS
to work well
PageRank and HITS
 PageRank and HITS are two solutions to the same
problem
 What is the value of an inlink from S to D?
 In the PageRank model, the value of the link depends on
the links into S
 In the HITS model, it depends on the value of the other
links out of S
 The destinies of PageRank and HITS post-1998 were
very different
 Why?
Spam Detection
Web Spam
 Search has become the default gateway to the web
 Very high premium to appear on the first page of
search results
 e.g., e-commerce sites
 advertising-driven sites
What is web spam?
• Spamming = any deliberate action taken solely to boost a web page’s position in search engine results, incommensurate with the page’s real value
• Spam = web pages that are the result of spamming
• This is a very broad definition
  • SEO industry might disagree!
  • SEO = search engine optimization
• Approximately 10-15% of web pages are spam
Web Spam Taxonomy
• We follow the treatment by Gyongyi and Garcia-Molina [2004]
• Boosting techniques
  • Techniques for achieving high relevance/importance for a web page
• Hiding techniques
  • Techniques to hide the use of boosting
    • From humans and web crawlers
Boosting techniques
 Term spamming
 Manipulating the text of web pages in order to appear
relevant to queries
 Link spamming
 Creating link structures that boost page rank or hubs and
authorities scores
Term Spamming
 Repetition
 of one or a few specific terms e.g., free, cheap, viagra
 Goal is to subvert TF.IDF ranking schemes
 Dumping
 of a large number of unrelated terms
 e.g., copy entire dictionaries
 Weaving
 Copy legitimate pages and insert spam terms at random positions
 Phrase Stitching
 Glue together sentences and phrases from different sources
Link spam
 Three kinds of web pages from a spammer’s point of
view
 Inaccessible pages
 Accessible pages
 e.g., web log comments pages
 spammer can post links to his pages
 Own pages
 Completely controlled by spammer
 May span multiple domain names
Link Farms
 Spammer’s goal
 Maximize the page rank of target page t
 Technique
 Get as many links from accessible pages as possible to
target page t
 Construct “link farm” to get page rank multiplier effect
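Link Farms
[Figure: the web divided into inaccessible, accessible, and own pages. Links from accessible pages point to the target page t; t links to each of the M farm pages (1, 2, …, M), and each farm page links back to t.]
One of the most common and effective organizations for a link farm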
Analysis
[Figure: the link farm, with accessible pages linking to target page t and own farm pages 1, 2, …, M.]
Suppose rank contributed by accessible pages = x
Let page rank of target page = y
Rank of each “farm” page = $\beta y/M + (1-\beta)/N$
$y = x + \beta M [\beta y/M + (1-\beta)/N] + (1-\beta)/N$
$\quad = x + \beta^2 y + \beta(1-\beta) M/N + (1-\beta)/N$ (last term very small; ignore)
$y = x/(1-\beta^2) + c M/N$ where $c = \beta/(1+\beta)$
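The closed form can be sanity-checked numerically. In the sketch below the function names and the sample values of x, M, and N are hypothetical; the check confirms that the closed form satisfies the fixed-point equation once the small (1-β)/N term is dropped.

```python
def target_rank(x, farm_size, total_pages, beta):
    """Closed form y = x / (1 - beta^2) + c * M / N with c = beta / (1 + beta)."""
    c = beta / (1 + beta)
    return x / (1 - beta ** 2) + c * farm_size / total_pages


def fixed_point_rhs(y, x, farm_size, total_pages, beta):
    """Right-hand side x + beta * M * [beta * y / M + (1 - beta) / N], small term dropped."""
    farm_page_rank = beta * y / farm_size + (1 - beta) / total_pages
    return x + beta * farm_size * farm_page_rank


beta = 0.85
multiplier = 1 / (1 - beta ** 2)
print(round(multiplier, 2))  # 3.6

# Hypothetical numbers: external contribution x, farm of 10,000 pages, 1 billion pages total.
x, farm_size, total_pages = 0.001, 10_000, 1_000_000_000
y = target_rank(x, farm_size, total_pages, beta)
residual = abs(fixed_point_rhs(y, x, farm_size, total_pages, beta) - y)
```

Rank acquired from accessible pages is amplified by the factor 1/(1-β²), while the farm itself adds a term that grows linearly in M.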
Analysis
• $y = x/(1-\beta^2) + cM/N$ where $c = \beta/(1+\beta)$
• For $\beta$ = 0.85, $1/(1-\beta^2)$ = 3.6
  • Multiplier effect for “acquired” page rank
• By making M large, we can make y as large as we want
Detecting Spam
 Term spamming
 Analyze text using statistical methods e.g., Naïve Bayes
classifiers
 Similar to email spam filtering
 Also useful: detecting approximate duplicate pages
 Link spamming
 Open research area
 One approach: TrustRank
TrustRank idea
 Basic principle: approximate isolation
 It is rare for a “good” page to point to a “bad” (spam) page
 Sample a set of “seed pages” from the web
 Have an oracle (human) identify the good pages and
the spam pages in the seed set
 Expensive task, so must make seed set as small as possible
Trust Propagation
 Call the subset of seed pages that are identified as
“good” the “trusted pages”
 Set trust of each trusted page to 1
 Propagate trust through links
 Each page gets a trust value between 0 and 1
 Use a threshold value and mark all pages below the trust
threshold as spam
Rules for trust propagation
 Trust attenuation
 The degree of trust conferred by a trusted page decreases
with distance
 Trust splitting
 The larger the number of outlinks from a page, the less
scrutiny the page author gives each outlink
 Trust is “split” across outlinks
Simple model
• Suppose trust of page p is t(p)
  • Set of outlinks O(p)
• For each q in O(p), p confers the trust
  • $\beta \, t(p)/|O(p)|$ for $0 < \beta < 1$
• Trust is additive
  • Trust of p is the sum of the trust conferred on p by all its inlinked pages
• Note similarity to Topic-Sensitive PageRank
  • Within a scaling factor, trust rank = biased page rank with trusted pages as teleport set
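Using the biased PageRank view, trust propagation can be sketched as below. Everything concrete here is hypothetical: the five-page graph, the page names, the single trusted seed, and the threshold of 0.05. Trust is normalized to sum to 1 rather than set to 1 per trusted page, which only changes the scale.

```python
def trust_rank(links, trusted, beta=0.85, iterations=100):
    """Biased PageRank whose teleport set is the set of trusted seed pages."""
    pages = list(links)
    trust = {p: (1.0 / len(trusted) if p in trusted else 0.0) for p in pages}
    for _ in range(iterations):
        nxt = {p: 0.0 for p in pages}
        # Trust splitting and attenuation: p confers beta * t(p) / |O(p)| on each outlink.
        for p, outlinks in links.items():
            for q in outlinks:
                nxt[q] += beta * trust[p] / len(outlinks)
        # Teleports land only on trusted pages.
        for p in trusted:
            nxt[p] += (1.0 - beta) / len(trusted)
        trust = nxt
    return trust


# Hypothetical graph: good pages g1..g3 never link to the spam pages s1, s2.
links = {
    "g1": ["g2", "g3"],
    "g2": ["g1"],
    "g3": ["g1"],
    "s1": ["s2", "g1"],
    "s2": ["s1"],
}
trust = trust_rank(links, trusted={"g1"})

THRESHOLD = 0.05  # hypothetical cutoff
flagged = {p for p, t in trust.items() if t < THRESHOLD}
print(sorted(flagged))  # ['s1', 's2']
```

Because no good page links into the spam cluster in this toy graph, the spam pages receive exactly zero trust; on real data the separation is only approximate and the threshold matters.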
Picking the seed set
 Two conflicting considerations
 Human has to inspect each seed page, so seed set must be
as small as possible
 Must ensure every “good page” gets adequate trust rank,
so need make all good pages reachable from seed set by
short paths
Approaches to picking seed set
 Suppose we want to pick a seed set of k pages
 PageRank
 Pick the top k pages by page rank
 Assume high page rank pages are close to other highly
ranked pages
 We care more about high page rank “good” pages
Inverse page rank
 Pick the pages with the maximum number of outlinks
 Can make it recursive
 Pick pages that link to pages with many outlinks
 Formalize as “inverse page rank”
 Construct graph G’ by reversing each edge in web graph G
 Page rank in G’ is inverse page rank in G
 Pick top k pages by inverse page rank
Spam Mass
 In the TrustRank model, we start with good pages
and propagate trust
 Complementary view: what fraction of a page’s page
rank comes from “spam” pages?
 In practice, we don’t know all the spam pages, so we
need to estimate
Spam mass estimation
r(p) = page rank of page p
$r^+(p)$ = page rank of p with teleport into “good” pages only
$r^-(p) = r(p) - r^+(p)$
Spam mass of p = $r^-(p)/r(p)$
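A small sketch of the estimate follows, under one loudly stated assumption: for $r^+$ the teleport vector keeps weight (1-β)/N on each good page and 0 elsewhere (it is not renormalized over the good set), so that $r^+(p) \le r(p)$ and the spam mass lies between 0 and 1. The graph, the page names, and the helper `pagerank` are hypothetical.

```python
def pagerank(links, teleport, beta=0.85, iterations=200):
    """Iterate r <- beta * M * r + teleport, where teleport maps page -> mass per step."""
    rank = dict(teleport)
    for _ in range(iterations):
        nxt = dict(teleport)
        for p, outlinks in links.items():
            for q in outlinks:
                nxt[q] += beta * rank[p] / len(outlinks)
        rank = nxt
    return rank


# Hypothetical graph: good pages g1, g2; spam target t with farm pages f1, f2.
# The good page g2 has one link to t (e.g., a comment posted by the spammer).
links = {
    "g1": ["g2"],
    "g2": ["g1", "t"],
    "t": ["f1", "f2"],
    "f1": ["t"],
    "f2": ["t"],
}
good = {"g1", "g2"}
beta = 0.85
n = len(links)

uniform = {p: (1 - beta) / n for p in links}
good_only = {p: ((1 - beta) / n if p in good else 0.0) for p in links}

r = pagerank(links, uniform, beta)         # r(p)
r_plus = pagerank(links, good_only, beta)  # r+(p)
spam_mass = {p: (r[p] - r_plus[p]) / r[p] for p in links}
print({p: round(m, 2) for p, m in spam_mass.items()})
```

In this toy graph the good pages have spam mass 0, while most of the target's rank comes from teleports into the farm, so its spam mass is well above one half.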
Good pages
 For spam mass, we need a large set of “good” pages
 Need not be as careful about quality of individual pages as
with TrustRank
 One reasonable approach
 .edu sites
 .gov sites
 .mil sites
Acknowledgement
 Slides are from
 Prof. Jeffrey D. Ullman
 Dr. Anand Rajaraman