Wikipedia and Preferential Attachment in Social Networks

STATISTICAL PROPERTIES
OF THE WIKIGRAPH
arXiv:physics/0602026
A. Capocci, V. Servedio, F. Colaiori,
D. Donato, L.S. Buriol, S. Leonardi, GC
Centro “E. Fermi”
• Introduction
Wikipedia in other languages
You may read and edit articles in many different languages:
Wikipedia encyclopedia languages with over 100,000 articles
Deutsch (German) · Français (French) · Italiano (Italian) · 日本語 (Japanese) · Nederlands (Dutch) · Polski (Polish) ·
Português (Portuguese) · Svenska (Swedish)
Wikipedia encyclopedia languages with over 10,000 articles
العربية (Arabic) · Български (Bulgarian) · Català (Catalan) · Česky (Czech) · Dansk (Danish) · Eesti (Estonian) ·
Español (Spanish) · Esperanto · Galego (Galician) · עברית (Hebrew) · Hrvatski (Croatian) · Ido · Bahasa Indonesia
(Indonesian) · 한국어 (Korean) · Lietuvių (Lithuanian) · Magyar (Hungarian) · Bahasa Melayu (Malay) · Norsk bokmål
(Norwegian) · Norsk nynorsk (Norwegian) · Română (Romanian) · Русский (Russian) · Slovenčina (Slovak) ·
Slovenščina (Slovenian) · Српски (Serbian) · Suomi (Finnish) · Türkçe (Turkish) · Українська (Ukrainian) · 中文
(Chinese)
Wikipedia encyclopedia languages with over 1,000 articles
Alemannisch (Alemannic) · Afrikaans · Aragonés (Aragonese) · Asturianu (Asturian) · Azərbaycan (Azerbaijani) · Bânlâm-gú (Min Nan) · Беларуская (Belarusian) · Bosanski (Bosnian) · Brezhoneg (Breton) · Чăваш чěлхи (Chuvash) ·
Corsu (Corsican) · Cymraeg (Welsh) · Ελληνικά (Greek) · Euskara (Basque) · فارسی (Persian) · Føroyskt (Faroese) ·
Frysk (Western Frisian) · Gaeilge (Irish) · Gàidhlig (Scots Gaelic) · हिन्दी (Hindi) · Interlingua · Íslenska (Icelandic) ·
Basa Jawa (Javanese) · ქართული (Georgian) · ಕನ್ನಡ (Kannada) · Kurdî / كوردی (Kurdish) · Latina (Latin) · Latviešu
(Latvian) · Lëtzebuergesch (Luxembourgish) · Limburgs (Limburgish) · Македонски (Macedonian) · मराठी (Marathi) ·
Napulitana (Neapolitan) · Occitan · Ирон (Ossetic) · Plattdüütsch (Low Saxon) · Scots · Sicilianu (Sicilian) · Simple
English · Shqip (Albanian) · Sinugboanon (Cebuano) · Srpskohrvatski/Српскохрватски (Serbo–Croatian) · தமிழ்
(Tamil) · Tagalog · ภาษาไทย (Thai) · Tatarça (Tatar) · తెలుగు (Telugu) · Tiếng Việt (Vietnamese) · Walon (Walloon)
Complete list · Multilingual coordination · Start a Wikipedia in another language
A Nature investigation examined whether Wikipedia is as authoritative a source of
information as established references such as the Encyclopaedia Britannica.
Among the 42 entries tested, the difference in accuracy was not particularly great:
• the average science entry in Wikipedia contained around four inaccuracies;
• the one in Britannica, about three.
On the other hand, Wikipedia articles are on average longer than those of
Britannica, which implies a lower rate of errors per unit of text in Wikipedia.
In a survey of more than 1,000 Nature authors:
• 70% had heard of Wikipedia;
• of those, 17% consulted it on a weekly basis;
• less than 10% help to update it.
(Nature 438, 900-901; 2005)
Actually, things are a little bit more complicated
There is not only “control” by users, but also conflicts of interest.
As a result, it is sometimes not possible to modify 100% of the structure,
since some pages are locked.
One of the biggest scandals concerned the biography of the journalist John
Seigenthaler, which falsely suggested that he had been involved in the murder of
President J.F. Kennedy.
Some topics and languages are monitored more closely than others. In an
experiment, the Italian newspaper “L’espresso” deliberately introduced
errors into two entries:
• one in the career of the football player Rui Costa (said to have been part of an
Italian team in the early 90’s);
• one creating an entry for a non-existent philosopher.
As one might expect:
• the error about the football player was corrected after 30 minutes;
• the philosopher remained in place until the experiment was published
(at least two weeks).
WHY STUDY WIKIPEDIA?
• sociological reasons: the encyclopedia collects pages written by a
number of independent and heterogeneous individuals. Each of them
decides autonomously on the content of the articles, the only
constraint being a fixed layout. This autonomy is a common feature of
content creation on the Web. The Wikipedia authors’ community is formed
by members whose only wish is to make available to the world concepts
and topics that they consider meaningful. In some sense, tracing the
evolution of the Wikipedia subsets should mirror the development of significant
trends within each linguistic community.
• generation over time: Wikipedia provides time information associated with
nodes. Moreover, it keeps historical information: the creation time and the
modification times of each page in the dataset.
• independence from external links: Wikipedia articles link mainly to articles in
the same dataset.
• variety of graph sizes: one graph can be collected per language, and the
graph sizes vary from a few hundred pages up to half a million pages.
Summarizing:
• The whole history of growth is available, so we can study the evolution
• We have an example of a “social” network of huge size
• We can compare the systems produced by users of different languages, thereby
measuring the effect of different cultures.
• We can study Wikipedia as a case study for the World Wide Web
WE RECOVER A PREFERENTIAL ATTACHMENT MECHANISM FROM
THE DATA.
DIFFERENT LANGUAGES PRODUCE SIMILAR STRUCTURES
WE FIND A SYSTEM SIMILAR TO THE WWW EVEN IF THE
MICROSCOPIC RULE OF GROWTH IS VERY DIFFERENT.
• Data
The dataset of each language is available as two self-extracting
files for a MySQL database. The table cur contains the current on-line
articles, whereas the table old contains all previous versions of
each current article. Old versions of an article are identified by
sharing the same title, not the same id. The dataset dumps are
updated almost weekly, so the current graph is usually not more
than a week old.
To generate a graph from the link structure of a dataset, each
article is considered a node and each hyperlink between articles is
a link in this graph. In the Wikipedia datasets, each webpage is a
single article. An article might also contain some external links that
point to pages outside the dataset. Usually Wikipedia articles have no
external links, or just a few of them. These links are not
considered when generating the wikigraphs, since we want to restrict
the graph to pages within the set being analyzed.
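The construction described above can be sketched in a few lines of Python. This is only an illustration: the input format (a mapping from article title to the list of link targets found in its text) and the function name `build_wikigraph` are assumptions, not the format of the actual dumps.

```python
def build_wikigraph(articles):
    """Build a directed graph from a mapping {title: [linked titles]}.

    Links whose target is not itself an article of the dataset are
    treated as external links and dropped.
    """
    graph = {title: [] for title in articles}
    for title, links in articles.items():
        for target in links:
            if target in graph:  # keep only links that stay inside the dataset
                graph[title].append(target)
    return graph


# Hypothetical toy dataset, for illustration only.
toy = {
    "Rome": ["Italy", "http://example.org/rome"],
    "Italy": ["Rome", "Europe"],
    "Europe": [],
}
wikigraph = build_wikigraph(toy)
```

Here the link to `http://example.org/rome` is discarded because it does not point to a page of the dataset.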
We generated six wikigraphs, wikiEN, wikiDE, wikiFR, wikiES,
wikiIT and wikiPT, from the English, German, French,
Spanish, Italian and Portuguese datasets, respectively. The
graphs were obtained from an old dump of June 13, 2004. We
are not using the current data because of disk space restrictions: the
English dataset of June 2005 is more than 36 GB compressed,
that is about 200 GB uncompressed.
The most visited page was the main page for wikiEN,
wikiDE, wikiFR and wikiES, whereas for the wikiIT and
wikiPT datasets no visit counts were associated with the pages.
• Topology
The bow-tie structure, found in the WWW
(Broder et al., Comp. Net. 33, 309, 2000):
• SCC (Strongly Connected Component) includes pages that are mutually
reachable by traveling on the graph.
• IN component is the region from which one can reach the SCC.
• OUT component encompasses the pages that can be reached from the SCC.
• TENDRILS are pages reachable from the IN component that do not point to
the SCC or OUT region. TENDRILS also include pages that point to the OUT
region and do not belong to any of the other defined regions.
• TUBES directly connect the IN and OUT regions.
• DISCONNECTED regions are those isolated from the rest.
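As a rough sketch of how the main bow-tie regions could be extracted, the Python code below uses two breadth-first searches from a seed page that is assumed to lie in the SCC (for instance the main page). The function names and the seed-based approach are illustrative assumptions; tendrils, tubes and disconnected pages are simply lumped together as "OTHER".

```python
from collections import deque


def reachable(adj, start):
    """Return the set of nodes reachable from `start` following edges in `adj`."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def reverse(adj):
    """Return the graph with every edge direction flipped."""
    rev = {u: [] for u in adj}
    for u, targets in adj.items():
        for v in targets:
            rev.setdefault(v, []).append(u)
    return rev


def bow_tie(adj, seed):
    """Split the nodes into SCC, IN, OUT and OTHER relative to the SCC of `seed`."""
    rev = reverse(adj)
    forward = reachable(adj, seed)    # pages reachable from the seed
    backward = reachable(rev, seed)   # pages from which the seed is reachable
    scc = forward & backward
    nodes = set(adj) | set(rev)
    return {
        "SCC": scc,
        "IN": backward - scc,
        "OUT": forward - scc,
        "OTHER": nodes - forward - backward,  # tendrils, tubes, disconnected
    }
```

Choosing a seed outside the giant component would return the regions relative to a different, smaller SCC, so the seed has to be picked with care.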
The size of the Wikigraph for the various languages.
The percentage of the various components of the Wikigraph for the various languages.
The degree distributions show fat tails that can be approximated
by a power-law function of the kind
$P(k) \sim k^{-\gamma}$
where the exponent is the same for both in-degree and out-degree.
In the case of the WWW, $2 \le \gamma_{in} \le 2.1$.
In-degree (empty) and out-degree (filled) occurrence distributions for the Wikigraph
in English (○) and Portuguese ().
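For readers who want to reproduce this kind of plot, here is a minimal Python sketch that computes in- and out-degrees from an edge list and estimates the exponent of a power law of the kind P(k) ~ k^(-γ). The estimator is the standard approximate maximum-likelihood formula for discrete power laws, γ ≈ 1 + n / Σ ln(k_i / (k_min − 1/2)); it is not necessarily the fitting procedure used for the figure, and the function names and the `k_min` cut-off are assumptions.

```python
import math
from collections import Counter


def in_out_degrees(edges):
    """Return (in_degree, out_degree) counters for a list of (source, target) edges."""
    in_deg, out_deg = Counter(), Counter()
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
    return in_deg, out_deg


def degree_histogram(degrees):
    """Map each degree k to the number of vertices having that degree."""
    return Counter(degrees.values())


def powerlaw_exponent(degree_values, k_min=1):
    """Approximate maximum-likelihood estimate of gamma, using degrees >= k_min."""
    tail = [k for k in degree_values if k >= k_min]
    log_sum = sum(math.log(k / (k_min - 0.5)) for k in tail)
    return 1.0 + len(tail) / log_sum
```

A call such as `powerlaw_exponent(in_deg.values(), k_min=5)` would give an estimate for the in-degree tail; the result depends on the chosen `k_min`.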
As regards assortativity (measured by the average degree of
the neighbours of a vertex with degree k), there is no
evidence of any assortative behaviour.
The average neighbours’ in-degree, computed along incoming edges, as a function
of the in-degree for the English (○) and Portuguese ()
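The quantity plotted here can be sketched as follows in Python, under the assumption that the "neighbours along incoming edges" of a page are the pages linking to it. The function name and the edge-list input are illustrative choices.

```python
from collections import defaultdict


def average_neighbour_in_degree(edges):
    """Return {k: average in-degree of the in-neighbours of vertices with in-degree k}."""
    in_deg = defaultdict(int)
    sources = defaultdict(list)  # vertex -> pages that link to it
    for src, dst in edges:
        in_deg[dst] += 1
        sources[dst].append(src)

    totals = defaultdict(float)
    counts = defaultdict(int)
    for vertex, preds in sources.items():
        k = in_deg[vertex]
        # Average in-degree of the pages pointing to this vertex.
        totals[k] += sum(in_deg.get(p, 0) for p in preds) / k
        counts[k] += 1
    return {k: totals[k] / counts[k] for k in totals}
```

A flat curve as a function of k corresponds to the absence of assortative behaviour described above.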
The pagerank distribution for wikiEN is a power-law function with γ =
2.1. Previous measurements on webgraphs exhibit the same behaviour
for the pagerank distribution.
We list the number of visits of the top-ranked pages just to show that
this value is not related to the pagerank values. We confirm that very
little correlation was found between the link-analysis characteristics and
the actual number of visits.
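For reference, pagerank can be computed by power iteration, as in the minimal Python sketch below. The damping factor 0.85, the fixed number of iterations and the uniform redistribution of the rank of pages without outgoing links are common conventions assumed here; the slides do not say which settings were used.

```python
def pagerank(adj, damping=0.85, iterations=100):
    """Power-iteration pagerank for a graph given as {page: [linked pages]}."""
    nodes = set(adj)
    for targets in adj.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread uniformly.
        dangling = sum(rank[u] for u in nodes if not adj.get(u))
        new_rank = {u: (1.0 - damping) / n + damping * dangling / n for u in nodes}
        for u, targets in adj.items():
            if targets:
                share = damping * rank[u] / len(targets)
                for v in targets:
                    new_rank[v] += share
        rank = new_rank
    return rank
```

The resulting values can then be compared with per-page visit counts, for example through a rank correlation, to check how weakly the two are related.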
• Dynamics
Given the history of growth, one can verify the hypothesis of preferential
attachment. This is done by means of the histogram P(k), which gives the
number of vertices of degree k that acquire new connections at time t.
This quantity is weighted by the factor N(t)/n(k,t).
English (○) and Portuguese (). White = in-degree, filled = out-degree.
We find preferential attachment for both in-degree and out-degree.
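The measurement can be sketched in Python for the in-degree case. The sketch assumes that the edges are available in chronological order, that N(t) is the number of vertices present at time t and that n(k,t) is the number of vertices with in-degree k at that time; these readings of the symbols, and the function name, are assumptions rather than definitions taken from the slides.

```python
from collections import Counter


def attachment_histogram(edges):
    """Weighted histogram of the in-degree of the vertices that receive new edges.

    `edges` is an iterable of (source, target) pairs in chronological order.
    Each event at in-degree k contributes N(t) / n(k, t).
    """
    in_deg = {}           # current in-degree of every vertex seen so far
    n_k = Counter()       # n(k, t): number of vertices with in-degree k
    histogram = Counter()
    for src, dst in edges:
        for node in (src, dst):
            if node not in in_deg:  # first appearance of a vertex
                in_deg[node] = 0
                n_k[0] += 1
        k = in_deg[dst]
        histogram[k] += len(in_deg) / n_k[k]  # weight N(t) / n(k, t)
        n_k[k] -= 1
        n_k[k + 1] += 1
        in_deg[dst] = k + 1
    return dict(histogram)
```

A histogram growing linearly with k is what linear preferential attachment would produce; the out-degree case follows by tracking the source of each edge instead of its target.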
In our opinion this preferential attachment is an effective description
rather than the real driving force of the phenomenon.
In other words, the linear preferential attachment can originate from a
copying procedure (new vertices are introduced by copying old ones
and keeping most of their edges). We could also have a sort of fitness for
the various entries (but in this case one has a multidimensional series of
quantities describing the importance of a page).
Apart from the interpretation, the data show a rather clear
LINEAR PREFERENTIAL ATTACHMENT
Other power laws related to the dynamics need to be explained.
For example, the number of updates also follows a power law.
Each point represents the number of nodes (y axis)
that were updated exactly x times.
This feature is time invariant.
• Modelling
From these data it seems that a model in the spirit of the Barabási-Albert (BA)
model could reproduce most of the features of the system.
Actually:
1) This network is directed.
2) The preferential attachment in Wikipedia has a somewhat
different nature. Here, most of the time, edges are added
between existing vertices, unlike in the BA model. For
instance, in the English version of Wikipedia a largely
dominant fraction 0.883 of new edges is created between two
existing pages, while a smaller fraction of edges points to or
leaves a newly added vertex (0.026 and 0.091 respectively).
We introduced an evolution rule, similar to other rewiring models
already considered*.
• At each time step, a vertex is added to the network. It is
connected to the existing vertices by M directed edges; the
direction of each edge is drawn at random:
• with probability R1 the edge leaves the new vertex and points to
an existing one, chosen with probability proportional to its in-degree;
• with probability R2 the edge points to the new vertex, and
the source vertex is chosen with probability proportional to its
out-degree;
• finally, with probability R3 = 1 − R1 − R2 the edge is added
between existing vertices: the source vertex is chosen with
probability proportional to its out-degree, while the destination
vertex is chosen with probability proportional to its in-degree.
* See for example Krapivsky, Rodgers and Redner, PRL 86, 5401 (2001)
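The evolution rule can be simulated with a short Python sketch. Several details are assumptions not specified on the slide: the initial graph (two vertices linked in both directions), the fact that self-loops and repeated edges are not filtered out, and the fact that, within one time step, the new vertex can already be picked as an "existing" one once it has gained an edge. The function name `grow_wikigraph` is hypothetical.

```python
import random


def grow_wikigraph(n_vertices, m, r1, r2, seed=None):
    """Grow a directed graph with the three-way rule; return its edge list."""
    rng = random.Random(seed)
    # Assumed initial condition: two vertices linked in both directions.
    edges = [(0, 1), (1, 0)]
    by_out = [0, 1]  # each vertex appears once per outgoing edge
    by_in = [1, 0]   # each vertex appears once per incoming edge
    for new in range(2, n_vertices):
        for _ in range(m):
            x = rng.random()
            if x < r1:
                # Edge leaves the new vertex; target chosen by in-degree.
                src, dst = new, rng.choice(by_in)
            elif x < r1 + r2:
                # Edge points to the new vertex; source chosen by out-degree.
                src, dst = rng.choice(by_out), new
            else:
                # Edge between existing vertices (probability R3 = 1 - R1 - R2).
                src, dst = rng.choice(by_out), rng.choice(by_in)
            edges.append((src, dst))
            by_out.append(src)
            by_in.append(dst)
    return edges
```

Drawing uniformly from `by_in` or `by_out` selects a vertex with probability proportional to its in- or out-degree, because each vertex is listed once per edge. For example, `grow_wikigraph(10000, m=10, r1=0.026, r2=0.091, seed=1)` would produce a graph whose degree distributions can be compared with the empirical ones.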
The model can be solved analytically:
$P(k_{in}) \sim k_{in}^{-\gamma_{in}}, \quad \gamma_{in} = 1 + \frac{1}{1 - R_2}$
$P(k_{out}) \sim k_{out}^{-\gamma_{out}}, \quad \gamma_{out} = 1 + \frac{1}{1 - R_1}$
For the model we can use the empirical values already measured for the
English version of the Wikigraph:
R1 = 0.026
R2 = 0.091
R3 = 0.883
These give
$\gamma_{in} \approx 2.100$
$\gamma_{out} \approx 2.027$
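As a quick numerical check, the exponents can be evaluated from the measured fractions, taking them as positive numbers in a decay of the kind P(k) ~ k^(-γ). The function name is only illustrative.

```python
def model_exponents(r1, r2):
    """Return (gamma_in, gamma_out) of the model for given R1 and R2."""
    gamma_in = 1.0 + 1.0 / (1.0 - r2)
    gamma_out = 1.0 + 1.0 / (1.0 - r1)
    return gamma_in, gamma_out


gamma_in, gamma_out = model_exponents(r1=0.026, r2=0.091)
print(f"gamma_in = {gamma_in:.3f}, gamma_out = {gamma_out:.3f}")
# prints: gamma_in = 2.100, gamma_out = 2.027
```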
The model can be solved analytically:
$K_{nn}^{in}(k_{in}) \sim M N^{1-R_1} \frac{R_1 R_2}{R_3}$ ($R_3 \neq 0$)
$K_{nn}^{in}(k_{in}) \sim M R_1 R_2 \ln N$ ($R_3 = 0$)
In both cases the value is constant, i.e. it does not depend on $k_{in}$.
The value of the constant also depends on the initial
conditions. The two lines refer to two realizations of the model;
in one of them the first 0.5% of the vertices has been removed.
• Conclusions
• We have a structure that resembles the bow-tie of the WWW.
• We have a power-law decay for the degree distributions and also
a power-law decay for the number of updates per page.
• Preferential attachment in the rewiring seems to be the driving force
in the evolution of the system.
• The microscopic structure of the rewiring is very different from that of the WWW:
in principle a user can change any set of edges and add as many
pages as wanted. Still, most of the quantities are similar.
The finding that the pagerank of the pages is not related to the
number of visits opens a very interesting scenario for further research.
Since, by definition, pagerank should give us the visit time of
the page, and since it is actually completely independent of the number
of visits, we wonder whether pagerank is a good measure of the
authoritativeness of the pages in wikigraphs, and which modifications
should be introduced in order to tune its performance.