Egalitarian engines?
S. Fortunato, A. Flammini, F. Menczer & A. Vespignani
Outline
• Search engines
• The Google revolution: PageRank
• Popularity bias
• The feared scenario: Googlearchy!
• Empirical test: Googlocracy?
• The importance of query topics
• Outlook

Search Engines
“A search engine is a program designed to help
find information stored on a computer system
such as the World Wide Web …” Wikipedia
First search engine: Archie (1990, Internet)
First WWW search engine: Wandex (1993)
Timeline

Year  Engine              Event
1993  Aliweb              Launch
1994  WebCrawler          Launch
1994  Infoseek            Launch
1994  Lycos               Launch
1995  AltaVista           Launch (part of DEC)
1995  Excite              Launch
1996  Dogpile             Launch
1996  Inktomi             Founded
1996  Ask Jeeves          Founded
1997  Northern Light      Launch
1998  Google              Launch
1999  AlltheWeb           Launch
1999  Baidu               Founded
2000  Teoma               Founded
2003  Objects Search      Launch
2004  Yahoo! Search       Final launch (first original results)
2004  MSN Search          Beta launch
2005  MSN Search          Final launch
2005  FinQoo Meta Search  Final launch
2006  Quaero              Founded
The Google revolution
Invented by S. Brin and L. Page (1998).
Novelty: for the first time, a search engine
ranks pages according to their relevance in the
graph topology of the Web!
Web pages → Nodes
Hyperlinks → Edges
[Figure: degree distribution of the Web graph]
PageRank
It is the prestige measure used by Google
to rank Web pages.

p(i) = (1 - q)/N + q Σ_{j → i} p(j)/k_out(j)

(the sum runs over all pages j linking to i; k_out(j) is the out-degree of j, q is the damping factor, and N is the number of pages)
p(i) ~ probability that a user browsing the
Web by clicking from one page to another
(i.e. by following hyperlinks) visits page i.
Theoretical/empirical result: the PageRank of
a page is approximately proportional to the
number of incoming links of the page (link
popularity or in-degree)
Google recipe: Web pages are ranked
according to their in-degree.
Other factors play a role in the ranking, but PageRank is the only factor that treats Web pages as nodes of a graph, regardless of their semantic features.
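
To make the Google recipe concrete, here is a minimal Python sketch, an illustration rather than Google's implementation: it iterates the PageRank equation above on a toy random graph and then compares the scores with in-degree. The damping value q = 0.85, the graph size, and the out-degrees are assumptions for the demo.

    import random

    def pagerank(out_links, q=0.85, iters=100):
        # out_links: dict mapping each page to the list of pages it points to
        nodes = list(out_links)
        N = len(nodes)
        p = {i: 1.0 / N for i in nodes}
        for _ in range(iters):
            new = {i: (1.0 - q) / N for i in nodes}   # teleportation term
            for j in nodes:
                targets = out_links[j]
                if targets:
                    share = q * p[j] / len(targets)   # p(j) / k_out(j)
                    for i in targets:
                        new[i] += share
                else:                                 # dangling page: jump anywhere
                    for i in nodes:
                        new[i] += q * p[j] / N
            p = new
        return p

    # Toy Web: 200 pages, each linking to a few random others.
    random.seed(0)
    N = 200
    web = {i: random.sample(range(N), random.randint(1, 5)) for i in range(N)}
    p = pagerank(web)

    # Compare with in-degree (the approximate proportionality quoted above).
    indeg = {i: 0 for i in range(N)}
    for j in web:
        for i in web[j]:
            indeg[i] += 1
    for i in sorted(p, key=p.get, reverse=True)[:5]:
        print(f"page {i}: PageRank {p[i]:.4f}, in-degree {indeg[i]}")

Sorting the output by score illustrates the slide's point: the pages with the largest in-degree tend to carry the largest PageRank.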
How attractive are Web pages for users?
Traffic
Traffic is related to the frequency of visits
of Web pages by users.
Operational definition: the traffic t to a page is the fraction of times the page is clicked within some period.
Question: how does the traffic t grow with the
link popularity (in-degree) k of a page?
Null model: in a world where people navigate the Web only by browsing, the traffic t to a page is just the probability of visiting the page during this process → PageRank ~ in-degree.
Null model prediction → t ~ k
In the real Web, navigation by searching is
replacing navigation by browsing.
What are the consequences for the relation between t and k? Do search engines introduce a popularity bias?
There are three possible scenarios:
• t ~ k → no bias;
• t ~ k^α with α > 1 → googlearchy;
• t ~ k^α with α < 1 → googlocracy.
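
Which scenario holds can be decided by fitting α on (k, t) pairs in log-log space. Below is a hedged sketch of such a fit in Python: the synthetic data and the "true" exponent 0.8 are stand-ins for real in-degree and traffic measurements, not values from the study.

    import math
    import random

    random.seed(1)
    true_alpha = 0.8                      # assumed, only to generate demo data
    data = []
    for _ in range(1000):
        k = random.randint(1, 10000)                          # in-degree
        t = k ** true_alpha * random.lognormvariate(0, 0.3)   # noisy traffic
        data.append((k, t))

    # Ordinary least squares on (log k, log t): the slope estimates alpha.
    xs = [math.log(k) for k, _ in data]
    ys = [math.log(t) for _, t in data]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
            / sum((x - mx) ** 2 for x in xs)

    print(f"fitted alpha = {slope:.2f}")
    if slope > 1.05:                      # thresholds are illustrative
        print("-> googlearchy (bias toward popular pages)")
    elif slope < 0.95:
        print("-> googlocracy (bias is mitigated)")
    else:
        print("-> no bias (t ~ k)")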

The feared scenario: Googlearchy
Search Dominant Model
All users discover and navigate the Web by
submitting queries to search engines and
looking at the results.
Two empirical ingredients:
• Distribution of clicks on the hits of a hit list;
• Relation between the rank of a hit in a hit list and its PageRank/in-degree.
The fraction of clicks on a hit is our traffic t.
Hits are identified by their rank r in the list.
By ordering all Web pages by decreasing in-degree, a page with in-degree k will have rank

r ~ k^(-1.1)   (from the cumulative degree distribution)

Combining this with the click distribution over hit-list ranks,

t ~ r^(-1.6) ~ (k^(-1.1))^(-1.6) ~ k^(1.8)
Googlearchy: search engines boost the
popularity of the most popular pages much
faster than simply surfing on the Web!
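
The exponent arithmetic behind this prediction can be sanity-checked in a few lines; the snippet below merely composes the two power laws quoted above, with constant prefactors dropped.

    import math

    # r ~ k^(-1.1) composed with t ~ r^(-1.6) gives t ~ k^(1.1 * 1.6) = k^1.76,
    # which the slides round to k^1.8.
    for k in (10, 100, 1000, 10000):
        r = k ** -1.1    # rank of a page with in-degree k (up to a constant)
        t = r ** -1.6    # click share received at that rank
        print(f"k = {k:6d}: t ~ k^{math.log(t) / math.log(k):.2f}")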
Empirical test of popularity bias
• 28,124 sites
• Traffic from Alexa
• In-degree from Google and Yahoo
• Analysis repeated after 2 months
[Figure: data vs. models]
Googlocracy?
What are we missing?
t ~ r_L^(-1.6)   (at the hit-list level)
r_G ~ k^(-1.1)   (overall)
The two relations cannot be combined!
t ≠ f(r_G)
The importance of query topics
Hit lists depend on the interests of the users
and can be of various sizes.
In particular, very specific queries lead to small hit lists, which often contain unpopular Web sites/pages.
Similarly, it is unlikely that small hit lists contain very popular sites/pages.
[Figure: hit-list size distribution]
Our model
• "Artificial" Web with N pages, labeled from 1 to N;
• At each step, a hit list is created such that:
  1) all pages have the same probability to appear in the hit list;
  2) the size of the hit list is taken from the empirical distribution;
• For each hit list, clicks are distributed among the hits according to the empirical distribution t ~ r_L^(-1.6);
• After a sufficient number of hit lists has been created, we check how many clicks went to a page with label/rank r: t = f(r_G).
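
Below is a minimal Python sketch of this model, not the authors' code. The empirical hit-list size distribution is not reproduced here, so a heavy-tailed Pareto stand-in is assumed; N and the number of simulated queries are likewise illustrative.

    import random

    random.seed(3)
    N = 10000              # artificial Web: pages labeled 0..N-1,
    clicks = [0.0] * N     # where the label plays the role of global rank r_G

    for _ in range(20000):                 # simulated queries
        # 1) hit-list size from a heavy-tailed stand-in distribution
        size = min(N, int(random.paretovariate(1.1)))
        # 2) every page is equally likely to enter the hit list; within the
        #    list, the engine ranks hits by global rank, i.e. by label
        hits = sorted(random.sample(range(N), size))
        # 3) clicks fall on the hit at list rank r_L as r_L^(-1.6)
        weights = [(r + 1) ** -1.6 for r in range(size)]
        total = sum(weights)
        for r, page in enumerate(hits):
            clicks[page] += weights[r] / total

    # t = f(r_G): average accumulated clicks per page, by global-rank band
    for lo, hi in [(0, 10), (10, 100), (100, 1000), (1000, N)]:
        avg = sum(clicks[lo:hi]) / (hi - lo)
        print(f"global rank {lo + 1:5d}-{hi:5d}: avg clicks {avg:.5f}")

With these settings the top-ranked pages should still collect the most clicks, but traffic should fall off with global rank far more gently than the k^1.8 googlearchy prediction, because small hit lists regularly push low-ranked pages into top positions.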
[Figure: data vs. "semantically correct" model]
Conclusions
• The use of search engines partially mitigates the rich-get-richer nature of the Web, giving new sites an increased chance of being discovered (compared to surfing alone), as long as they are about specific topics that match the interests of users.
• The combination of (i) how search engines index and rank results, (ii) what queries users submit, and (iii) how users view the results leads to an egalitarian effect ("Googlocracy").
Reactions
"Looks scientific, but actually biased, and not right!"
From a blog post titled "A research floats 'The Egalitarian Effects of Search Engines'":
"Being on good terms with Google and googling people, including bloggers and blogging people, still did not stop me from thinking. Streets are much better than the rough roads of the past centuries, too! That is what I thought after reading the full text of the research paper. I do not think the survey methods were right, though they looked very scientific. I have made an experiment with Google page ranking; here is a look at how 'egalitarian' Google really is."