Page Quality: In Search of an Unbiased Web Ranking
Seminar on Databases and the Internet
Hebrew University of Jerusalem
Winter 2008
Ofir Cooper
(ofir.cooper@gmail.com)
References
 "Impact of Search Engines on Page Popularity"
Junghoo Cho, Sourashis Roy
UCLA Computer Science Department (May 2004)
 "Page Quality: In Search of an Unbiased Web Ranking"
Junghoo Cho, Sourashis Roy, Robert E. Adams
UCLA Computer Science Department (June 2005)
Overview
 The ranking algorithm search engines currently use.
 Motivation for improvement.
 The proposed method.
 Implementation.
 Experimental results.
 Conclusions (problems & future work).
Search engines today
 Search engines today use a variant of the PageRank rating system to sort relevant results.
 PageRank (PR) tries to measure the "importance" of a page by measuring its popularity.
What is PageRank?
 Based on the random-surfer model of web use:
 A person starts surfing the web at a random page.
 The person advances by clicking on links in the page (selected at random).
 At each step, there is a small chance the person will jump to a new, random page.
 This model does not take search engines into account.
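The random-surfer model above can be sketched as a short Monte Carlo simulation. The three-page graph, the jump probability d = 0.15, and the step count below are illustrative assumptions, not values from the slides:

```python
import random

# Monte Carlo sketch of the random-surfer model on a hypothetical
# three-page graph. With probability d the surfer jumps to a random
# page; otherwise they follow a random outgoing link. The fraction of
# time spent on each page approximates its PageRank.

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}   # page -> outgoing links

def surf(links, steps=200_000, d=0.15, seed=1):
    rng = random.Random(seed)
    pages = list(links)
    page = rng.choice(pages)
    visits = {p: 0 for p in pages}
    for _ in range(steps):
        visits[page] += 1
        if rng.random() < d or not links[page]:   # random jump
            page = rng.choice(pages)
        else:                                     # follow a random link
            page = rng.choice(links[page])
    return {p: v / steps for p, v in visits.items()}
```

On this toy graph, page c (two incoming links) attracts a larger share of visits than page b (one incoming link).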

What is PageRank?
 PageRank (PR) measures the probability that the random surfer is at page p at any given time.
 It is computed by this formula:

PR(p_i) = d + (1 - d)·(PR(p_1)/c_1 + … + PR(p_n)/c_n)

where pages p_1…p_n link to page p_i, c_j is the number of outgoing links from p_j, and d is a constant called the damping factor.
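The formula can be evaluated by fixed-point iteration. A minimal sketch on a hypothetical three-page graph, using the slide's form of the formula with an assumed d = 0.15:

```python
# Fixed-point iteration for PR(p_i) = d + (1-d) * sum(PR(p_j)/c_j)
# on a toy graph. d = 0.15 and the graph are illustrative assumptions.

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}   # page -> outgoing links

def pagerank(links, d=0.15, iters=100):
    pages = list(links)
    pr = {p: 1.0 for p in pages}                    # initial guess
    for _ in range(iters):
        pr = {
            p: d + (1 - d) * sum(pr[q] / len(links[q])
                                 for q in pages if p in links[q])
            for p in pages
        }
    return pr
```

Page c, with two incoming links, ends up with the highest score.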
The problem with PR
 PageRank rating creates a "rich-get-richer" phenomenon.
 Popular pages become even more popular over time.
 Unpopular pages are hardly ever visited, because they remain unknown to users. They are doomed to obscurity.
The problem with PR
 This was observed in an experiment:
 The Open Directory (http://dmoz.org) was sampled twice within a seven-month period.
 The change in the number of incoming links to each page was recorded.
 Pages were divided into popularity groups, and the results…
The bias against low-PageRank pages
(Charts: change in incoming links over the seven months, by popularity group.)
 In their study, Cho and Roy show that in a search-dominant world, the discovery time of new pages rises by a factor of 66!
What can be done to remedy the situation?
 Ranking should reflect the quality of pages.
 Popularity is not a good-enough measure of quality, because there are many good, yet unknown, pages.
 We want to give new pages an equal opportunity (if they are of high quality).
How to define "quality"?
 Quality is a very subjective notion.
 Let's try to define it anyway…
 Page quality – the probability that an average user will like the page when he/she visits it for the first time.
How to estimate quality?
 Quality could be measured exactly if we showed all users the page and asked their opinion.
 It's impractical to ask users their opinion on every page they visit.
 (PageRank would be a good measure of quality if all pages had been given an equal opportunity to be discovered. That is no longer the case.)
How to estimate quality?
 We want to estimate quality from measurable quantities.
 We can talk about these quantities:
 Q(p) – page quality. The probability that a user will like page p when exposed to it for the first time.
 P(p,t) – page popularity. The fraction of users who like p at time t.
 V(p,t) – visit popularity. The number of "visits" page p receives in a unit time interval at time t.
 A(p,t) – page awareness. The fraction of web users who are aware of page p at time t.
Estimating quality
Lemma 1
P(p,t) = A(p,t)·Q(p)  ⇒  Q(p) = P(p,t) / A(p,t)
Proof: follows from the definitions.
This is not sufficient – we can't measure awareness.
We can measure page popularity, P(p,t).
How do we estimate Q(p) only from P(p,t)?
Estimating quality
 Observation 1 – popularity (as measured in incoming links) measures quality well for pages of the same age.
 Observation 2 – the popularity of new, high-quality pages will increase faster than the popularity of new, low-quality pages.
In other words, the time derivative of popularity is also a measure of quality.
Estimating quality
 We need a web-user model to link popularity and quality.
 We start with these two assumptions:
1. Visit popularity is proportional to popularity: V(p,t) = r·P(p,t).
2. Random-visit hypothesis: a visit to page p can come from any user with equal probability.
Estimating quality
Lemma 2
A(p,t) can be computed from past popularity:

A(p,t) = 1 - exp{ -(r/n) · ∫₀ᵗ P(p,u) du }

* n is the number of web users.
Estimating quality
Proof:
By time t, page p was visited k = ∫₀ᵗ V(p,u) du = r·∫₀ᵗ P(p,u) du times.
We compute the probability that some user, u, is not aware of p after p was visited k times.
Estimating quality
 Pr(the i-th visitor to p is not u) = 1 - 1/n
 Pr(u didn't visit p | p was visited k times)
  = (1 - 1/n)^k = (1 - 1/n)^(r·∫₀ᵗ P(p,u)du)
  = [(1 - 1/n)^n]^((r/n)·∫₀ᵗ P(p,u)du) ≈ exp{ -(r/n)·∫₀ᵗ P(p,u)du }
Hence 1 - A(p,t) = exp{ -(r/n)·∫₀ᵗ P(p,u)du }.
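Lemma 2 can be evaluated numerically from a sampled popularity history. A sketch with the integral approximated by the trapezoid rule; n, r, and the history below are illustrative assumptions:

```python
import math

# Numeric sketch of Lemma 2: awareness from past popularity,
#   A(p,t) = 1 - exp{ -(r/n) * integral_0^t P(p,u) du }.
# n = 10_000 users and r = 500 are assumed constants.

def awareness(pop_history, dt, n=10_000, r=500):
    # Trapezoid-rule approximation of the popularity integral.
    integral = sum(0.5 * (a + b) * dt
                   for a, b in zip(pop_history, pop_history[1:]))
    return 1.0 - math.exp(-(r / n) * integral)
```

For a page whose popularity holds constant at 0.2 over ten time units, this gives A = 1 - e^(-0.1) ≈ 0.095: only about 9.5% of users have become aware of it.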
Estimating quality
 We can combine Lemmas 1 and 2 to get popularity as a function of time.
 Theorem:

P(p,t) = Q(p) / ( 1 + (Q(p)/P(p,0) - 1) · exp{ -(r/n)·Q(p)·t } )

 The proof is a bit long; we won't go into it (available in hard copy, to those interested).
Estimating quality
This is popularity vs. time, as predicted by our formula:
(This trend was seen in practice by companies such as NetRatings.)
Estimating quality
 Important fact: popularity converges to quality over a long period of time.

P(p,t) = Q(p) / ( 1 + (Q(p)/P(p,0) - 1) · exp{ -(r/n)·Q(p)·t } ) → Q(p)  as t → ∞

 We will use this fact to check estimates of quality later.
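Both the closed form and the convergence P(p,t) → Q(p) can be checked directly. A sketch; Q, P(p,0), n, and r below are illustrative values:

```python
import math

# The closed-form popularity curve from the theorem:
#   P(p,t) = Q / (1 + (Q/P(p,0) - 1) * exp{ -(r/n) Q t }).
# At t = 0 this reduces to P(p,0); as t grows it approaches Q.

def popularity(t, Q, P0, n=10_000, r=500):
    return Q / (1 + (Q / P0 - 1) * math.exp(-(r / n) * Q * t))
```

With Q = 0.5 and P(p,0) = 0.01 the curve starts at 0.01, rises in an S-shape, and flattens at 0.5.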
Estimating quality
Lemma 3

Q(p) = (n/r) · (dP(p,t)/dt) / ( P(p,t) · (1 - A(p,t)) )

Proof:
We differentiate the equation P(p,t) = A(p,t)·Q(p) with respect to time, plug in the expression we found for A(p,t) in Lemma 2, and that's it.
Estimating quality
We define the "relative popularity increase function":

I(p,t) = (n/r) · (dP(p,t)/dt) / P(p,t)
Estimating quality
Theorem
Q(p) = I(p,t) + P(p,t) at all times.
Proof:
1. dA(p,t)/dt = (r/n) · (1 - A(p,t)) · P(p,t)   (from Lemma 2)
2. Q(p) · dA(p,t)/dt = (r/n) · (Q(p) - Q(p)·A(p,t)) · P(p,t)   (multiply by Q(p))
3. dP(p,t)/dt = (r/n) · (Q(p) - P(p,t)) · P(p,t)   (since P = A·Q)
⇒ (n/r) · (dP(p,t)/dt) / P(p,t) = Q(p) - P(p,t), i.e. Q(p) = I(p,t) + P(p,t).
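The theorem suggests a direct estimator from two popularity measurements: approximate dP/dt by a finite difference and add back P. A sketch, with assumed constants n and r:

```python
# Discrete sketch of Q(p) = I(p,t) + P(p,t): estimate quality from two
# popularity snapshots taken dt apart. n and r are assumed constants.

def estimate_quality(p_prev, p_next, dt, n=10_000, r=500):
    p_mid = 0.5 * (p_prev + p_next)     # popularity at the midpoint
    dp_dt = (p_next - p_prev) / dt      # finite-difference derivative
    return (n / r) * dp_dt / p_mid + p_mid
```

Fed two points from a logistic popularity curve with true Q = 0.5, the estimate lands near 0.5 even while the measured popularity itself is still only about 0.03.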
Estimating quality
 We can now estimate the quality of a page by measuring only popularity.
 What happens if quality changes over time? Is our estimate still good?
Quality change over time
 In reality, the quality of pages changes:
 Web pages change.
 Users' expectations rise as better pages appear all the time.
 Will the model handle changing quality well?
Quality change over time
Theorem
If quality changed at time T (from Q1 to Q2), then for t > T the estimate for quality is still:

Q2 = (n/r) · (dP(p,t)/dt) / P(p,t) + P(p,t)
Quality change over time
Proof:
After time T, we put users into three groups:
(1) Users who visited the page before T (group u1).
(2) Users who visited the page after T (group u2).
(3) Users who never visited the page.
Quality change over time
The fraction of users who like the page at time t > T:

P(p,t) = Q1·|u1 - u2(t)| + Q2·|u2(t)|

* After time T, group u2 expands, while u1 remains the same. We will have to compute |u2(t)|.
Quality change over time
From the proof of Lemma 2 (the calculation of awareness at time t) it is easy to see that:

|u2(t)| = 1 - exp{ -(r/n) · ∫_T^t P(p,u) du }
Quality change over time
Size of |u1 - u2(t)|:

|u1 - u2(t)| = |u1| - |u1 ∩ u2(t)| = |u1| - |u1|·|u2(t)|

* The size of the intersection of u1 and u2 is the product of their sizes, because they are independent. (According to the random-visit hypothesis, the probability that a user visits page p at time t is independent of their past visit history.)
Quality change over time

P(p,t) = Q1·|u1 - u2(t)| + Q2·|u2(t)|
  = Q1·|u1| - Q1·|u1|·|u2(t)| + Q2·|u2(t)|
  = Q1·|u1| + (Q2 - Q1·|u1|) · |u2(t)|

dP(p,t)/dt = (Q2 - Q1·|u1|) · d|u2(t)|/dt
  = (Q2 - Q1·|u1|) · (r/n) · P(p,t) · (1 - |u2(t)|)
  = (r/n) · P(p,t) · (Q2 - P(p,t))

(The last step uses Q2 - P(p,t) = (Q2 - Q1·|u1|)·(1 - |u2(t)|), which follows from the expression for P(p,t) above.)
Quality change over time

Q2 = (n/r) · (dP(p,t)/dt) / P(p,t) + P(p,t)

Q.E.D.
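The result can be sanity-checked by integrating dP/dt = (r/n)·P·(Q - P) with Q jumping from Q1 to Q2 at time T: the estimator applied to samples taken after T recovers Q2. All constants below are illustrative assumptions:

```python
# Euler-integration sketch: quality jumps from Q1 to Q2 at time T, and
# the estimator (n/r)*(dP/dt)/P + P still recovers Q2 for t > T.

def simulate(Q1, Q2, T, t_end, dt=0.01, P0=0.01, n=10_000, r=500):
    P, t, hist = P0, 0.0, [(0.0, P0)]
    while t < t_end - 1e-9:
        Q = Q1 if t < T else Q2             # quality switches at time T
        P += (r / n) * P * (Q - P) * dt     # dP/dt = (r/n) P (Q - P)
        t += dt
        hist.append((t, P))
    return hist
```

Estimating from the last two samples of a run with Q1 = 0.3 and Q2 = 0.6 returns approximately 0.6, even though the popularity itself is still far below 0.6.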
Implementation
Implementation
The implementation of a quality-estimator system is very simple:
1. "Sample" the web at different times.
2. Compute the popularity (PageRank) of each page, and the change in popularity.
3. Estimate the quality of each page.
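Steps 2–3 above can be sketched for a whole crawl: given two popularity snapshots keyed by page, estimate every page's quality. The dictionaries, n, and r are hypothetical:

```python
# Sketch of the estimation step over full crawl snapshots: pop1 and pop2
# map page -> popularity at times t1 < t2. n and r are assumed constants.

def estimate_all(pop1, pop2, dt, n=10_000, r=500):
    quality = {}
    for page in pop1.keys() & pop2.keys():      # pages seen in both crawls
        p_mid = 0.5 * (pop1[page] + pop2[page])
        dp_dt = (pop2[page] - pop1[page]) / dt
        quality[page] = (n / r) * dp_dt / p_mid + p_mid
    return quality
```

A page whose popularity is flat gets Q ≈ P, while a fast-growing page is credited well above its current popularity.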
Implementation
But there are problems with this implementation:
1. Approximation error – we sample at discrete time points, not continuously.
2. Quality change between samples makes the estimate inaccurate.
3. We will have a time lag: the quality estimate will never be up to date.
Implementation
Examining approximation error
(Plot: Q = 0.5, ∆t = 1; time units not specified.)
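The approximation error can be probed by applying the finite-difference estimator to the exact logistic popularity curve with different sampling intervals ∆t. Q = 0.5 matches the slide's setting; the other constants are assumed:

```python
import math

# Discrete-sampling error: apply the estimator (n/r)*(P2-P1)/(dt*P) + P
# to the exact logistic popularity curve for a small and a large dt.
# Q = 0.5 as on the slide; P0, n, r are assumed.

def logistic(t, Q=0.5, P0=0.01, n=10_000, r=500):
    return Q / (1 + (Q / P0 - 1) * math.exp(-(r / n) * Q * t))

def estimate(t, dt, n=10_000, r=500):
    p1, p2 = logistic(t), logistic(t + dt)
    p_mid = 0.5 * (p1 + p2)
    return (n / r) * (p2 - p1) / (dt * p_mid) + p_mid
```

At t = 50, ∆t = 1 gives an estimate within about 0.001 of the true Q = 0.5, while ∆t = 50 is off by several hundredths: sparser sampling degrades the estimate.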
Implementation
Examining slow change in quality
(Plots: Q(p,t) = 0.4 + 0.0006t and Q(p,t) = 0.5 + ct.)
Implementation
Examining rapid change in quality
The Experiment
The Experiment
Evaluating a web metric such as quality is difficult:
 Quality is subjective.
 There is no standard corpus.
 Doing a user survey is not practical.
The Experiment
 The experiment is based on the observation that popularity converges to quality (assuming quality is constant).
 If we estimate the quality of pages and wait some time, we can check our estimates against the eventual popularity.
The Experiment
 The test was done on 154 web sites obtained from the Open Directory (http://dmoz.org).
 All pages of these web sites were downloaded (~5 million).
 Four snapshots were taken: the first three were used to estimate quality, and the fourth was used to check the prediction.
The Experiment
 Quality is taken to be PR(t=4).
 The quality estimator is measured against PR(t=3).
The Experiment
The results:
 The quality-estimator metric seems better than PageRank.
 Its average error is smaller:
 Average error of the Q3 estimator = 45%
 Average error of the P3 estimator = 74%
 The distribution of error is also better.
Summary & Conclusions
Summary
 We saw the bias created by search engines.
 A more desirable ranking would rank pages by quality, not popularity.
Summary
 We can estimate quality from the link structure of the web (popularity and the evolution of popularity).
 Implementation is feasible, and only slightly different from the current PageRank system.
Summary
 Experimental results show that the quality estimator is better than PageRank.
Conclusions
Problems & future work:
 Statistical noise is not negligible for pages with low popularity.
 The experiment was done on a small scale. It should be tried on a large scale.
Conclusions
Problems & future work:
 Can we use the number of "visits" to pages to estimate popularity increase, instead of the number of incoming links?
 The theory is based on a web-user model that doesn't take search engines into account. That is unrealistic in this day and age.
Follow-up suggestions
 Many more interesting publications on Junghoo Cho's website: http://oak.cs.ucla.edu/~cho/
 Such as:
 Estimating Frequency of Change
 Shuffling the deck: Randomizing search results
 Automatic Identification of User Interest for Personalized Search
Other algorithms for ranking
 Extra material can be found at: http://www.seoresearcher.com/category/linkpopularity-algorithms/
 Algorithms such as:
 Hub and Authority
 HITS
 HUBAVG