The COSIN project, data in complex networks

Data and networks
GIACS Conference Palermo 9-4-08
•Networks
GIACS PALERMO 9-4-08
•Networks as an instrument of Data Filtering
Correlation based Minimal Spanning Tree
1071 stocks traded at NYSE between 1987 and 1998
Different colours refer to different SIC sectors
Correlation based Minimal Spanning Tree
Artificial market of 1071 stocks
according to the one-factor model.
Different colours refer to different SIC sectors
Topology of correlation based minimal spanning trees in real and model markets
G. Bonanno, G. Caldarelli, F. Lillo, R. Mantegna
Physical Review E 68 046130 (2003).
Networks of equities in financial markets
G. Bonanno, GC, F. Lillo, S. Miccichè, N. Vandewalle, R. N. Mantegna
European Physical Journal B 38 363-372 (2004).
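As a rough illustration of the filtering idea, the sketch below builds a minimal spanning tree from a correlation matrix. It assumes the commonly used distance d_ij = sqrt(2(1 - rho_ij)) and Kruskal's algorithm; the function name and the toy matrix are hypothetical, and this is not necessarily the exact procedure of the papers cited above.

```python
import math

def correlation_mst(corr):
    """Return MST edges (i, j, distance) from a symmetric correlation matrix."""
    n = len(corr)
    # Assumed metric: d_ij = sqrt(2 * (1 - rho_ij)); high correlation -> short distance.
    edges = sorted(
        (math.sqrt(2.0 * (1.0 - corr[i][j])), i, j)
        for i in range(n) for j in range(i + 1, n)
    )
    parent = list(range(n))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # keep the edge only if it joins two separate clusters
            parent[ri] = rj
            tree.append((i, j, d))
    return tree

# Toy 4-stock correlation matrix (made-up numbers).
toy_corr = [
    [1.0, 0.8, 0.1, 0.2],
    [0.8, 1.0, 0.3, 0.1],
    [0.1, 0.3, 1.0, 0.7],
    [0.2, 0.1, 0.7, 1.0],
]
mst = correlation_mst(toy_corr)
```

Of the N(N-1)/2 correlations only N-1 links survive, which is what makes the tree usable as a data filter.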
•The Cosin project
COSIN (official number IST-2001-33555)
was a research project financed by the European
Commission
through the Fifth Framework Programme.
COSIN is part of the actions taken by the
Future and Emerging Technologies (FET)
in the priority area of research of
Information Society Technologies (IST)
(http://www.cordis.lu/IST/FET)
Documents at http://www.cosinproject.org
•The Cosin project
COSIN involves
7 different nodes in 5 countries
A. (Ph + CS) Roma, Italy
B. (Ph) Barcelona, Spain
C. (Ph) Lausanne, Switzerland
D. (Ph) ENS, Paris, France
E. (CS) Karlsruhe, Germany
F. (Ph) Upsud, Paris, France
[Map legend: EU countries (2001), non-EU countries (2001), EU COSIN participants, non-EU COSIN participants]
•Some of the Cosin people
G. Bonanno, G. Caldarelli, F. Colaiori, G. Di Battista, D. Donato, S. Leonardi, R. Mantegna, A. Marchetti-Spaccamela, M. Patrignani, L. Pietronero, V. Servedio
A. Arenas, M. Boguña, A. Díaz-Guilera, R. Ferrer i Cancho, M.A. Muñoz, M.A. Serrano, R. Pastor-Satorras
G. Bianconi, A. Capocci, P. De Los Rios, T. Erlebach, T. Petermann, Y.-C. Zhang
A. Barrat, S. Battiston, P. Nadal, A. Vespignani, G. Weisbuch
U. Brandes, M. Gaertler, M. Kaufmann, D. Wagner
•The Cosin project
1. To develop a unified set of Complex Systems theoretical methodologies for the characterization of Complex Networks.
2. To develop statistical models for network growth and evolution.
3. To collect data, mainly for the Internet and the World Wide Web.
4. To extend the analysis to social and economic networks.
5. To develop visualization tools for large-scale systems.
6. To disseminate results through publications, conferences and the project web site.
•A Cosin summary
1. After three years of activity we have a common ground of methodologies and tools, at least between computer scientists and physicists (and some economists). More effort would be needed to integrate social scientists.
2. We provided a class of models for network growth and evolution; moreover, we addressed the statistical properties of weighted networks.
3. Data collection for the Internet and the World Wide Web proved much more difficult than expected. In the meantime, larger consortia have been funded specifically for this task. Thanks to external collaborations we still found the data needed to validate the models we produced.
4. In economic and financial networks, COSIN people are at the forefront of this very new field of research. This new approach attracted the interest of the community up to the level of Nobel laureates. The impact on social science has been less successful. The impact on biology (botany, zoology) has been unexpected and very successful.
5. The standard visualization problem is to keep the whole graph structure and present it suitably. Some progress has been made on this point; it is worth mentioning that several ideas are now under consideration for the visualization of "simplified graphs".
6. The project had a considerable impact on the scientific community in terms of citations, visibility, conferences, schools, books and data downloads from the site. More work could perhaps be done for the general public.
The graph of scientific collaborations on scale-free networks in statistical physics
M.E.J. Newman, PRE 69 026113 (2004)
•Dissemination
• More than 150 refereed papers (some of them in Nature, PNAS, PRL, LNCS)
• Lectures and talks at various international conferences (for physics, STATPHYS and APS Meetings) and invited talks at various institutions
• Books
The Sitges Conference published the proceedings of the most interesting talks in a special volume:
Statistical Mechanics of Complex Networks
Series: Lecture Notes in Physics, Vol. 625
Pastor-Satorras, Romualdo; Rubi, Miguel; Diaz-Guilera, Albert (Eds.)
2003, XII, 206 p., Hardcover
ISBN: 3-540-40372-8
The Rome Conference published its proceedings in a special issue of the European Physical Journal B
•Web site
•What about data?
Unsurprisingly, access to data was crucial for the project.
In some cases we found very good datasets and could work on them:
1. Internet (AS topology)
2. Wikipedia
With poor or no data we obtained, of course, only partial results:
1. Liquidity shocks
2. River networks
STATISTICAL PROPERTIES OF THE WIKIGRAPH
L.S. Buriol, A. Capocci, F. Colaiori, D. Donato, S. Leonardi, F. Rao, V. Servedio, GC
1. Taxonomy and clustering in collaborative systems: the case of the on-line encyclopedia Wikipedia
A. Capocci, F. Rao, GC
Europhysics Letters 81 28008 (arXiv:0710.3058) (2008)
2. Preferential attachment in the growth of social networks: the Internet encyclopedia Wikipedia
A. Capocci, V.D.P. Servedio, F. Colaiori, L.S. Buriol, D. Donato, S. Leonardi, GC
Physical Review E 74 036116 (2006).
Centro “E. Fermi”
•Wikipedia intro
Wikipedia in other languages
You may read and edit articles in many different languages:
Wikipedia encyclopedia languages with over 100,000 articles
Deutsch (German) · Français (French) · Italiano (Italian) · 日本語 (Japanese) · Nederlands (Dutch) · Polski
(Polish) · Português (Portuguese) · Svenska (Swedish)
Wikipedia encyclopedia languages with over 10,000 articles
‫( العربية‬Arabic) · Български (Bulgarian) · Català (Catalan) · Česky (Czech) · Dansk (Danish) · Eesti
(Estonian) · Español (Spanish) · Esperanto · Galego (Galician) · ‫( עברית‬Hebrew) · Hrvatski (Croatian) ·
Ido · Bahasa Indonesia (Indonesian) · 한국어 (Korean) · Lietuvių (Lithuanian) · Magyar (Hungarian) ·
Bahasa Melayu (Malay) · Norsk bokmål (Norwegian) · Norsk nynorsk (Norwegian) · Română (Romanian)
· Русский (Russian) · Slovenčina (Slovak) · Slovenščina (Slovenian) · Српски (Serbian) · Suomi (Finnish)
· Türkçe (Turkish) · Українська (Ukrainian) · 中文 (Chinese)
Wikipedia encyclopedia languages with over 1,000 articles
Alemannisch (Alemannic) · Afrikaans · Aragonés (Aragonese) · Asturianu (Asturian) · Azərbaycan
(Azerbaijani) · Bân-lâm-gú (Min Nan) · Беларуская (Belarusian) · Bosanski (Bosnian) · Brezhoneg
(Breton) · Чăваш чěлхи (Chuvash) · Corsu (Corsican) · Cymraeg (Welsh) · Ελληνικά (Greek) · Euskara
(Basque) · ‫( فارسی‬Persian) · Føroyskt (Faroese) · Frysk (Western Frisian) · Gaeilge (Irish) · Gàidhlig
(Scots Gaelic) · हिन्दी (Hindi) · Interlingua · Íslenska (Icelandic) · Basa Jawa (Javanese) · ქართული
(Georgian) · ಕನ್ನಡ (Kannada) · Kurdî / ‫( كوردی‬Kurdish) · Latina (Latin) · Latviešu (Latvian) ·
Lëtzebuergesch (Luxembourgish) · Limburgs (Limburgish) · Македонски (Macedonian) · मराठी (Marathi)
· Napulitana (Neapolitan) · Occitan · Ирон (Ossetic) · Plattdüütsch (Low Saxon) · Scots · Sicilianu
(Sicilian) · Simple English · Shqip (Albanian) · Sinugboanon (Cebuano) · Srpskohrvatski/Српскохрватски
(Serbo–Croatian) · தமிழ் (Tamil) · Tagalog · ภาษาไทย (Thai) · Tatarça (Tatar) · తెలుగు (Telugu) · Tiếng Việt
(Vietnamese) · Walon (Walloon)
•Wikipedia intro
The datasets of each language are available as two self-extracting
files for a MySQL database. The table cur contains the current online articles, whereas the table old contains all previous versions
of each current article. Old versions of an article are identified by
having the same title, not the same id. The dataset dumps are
updated almost weekly, so the current graph is usually not more
than a week old.
To generate a graph from the link structure of a dataset, each
article is considered a node and each hyperlink between articles
is a link in the graph. In the Wikipedia datasets, each webpage is
a single article. An article might also contain external links
that point to pages outside the dataset. Usually Wikipedia articles
have no external links, or just a few of them. These links
are not considered when generating the wikigraphs, since we want
to restrict the graph to pages in the set being analyzed.
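The construction described above can be sketched as follows. This is a minimal illustration, assuming the links have already been extracted as (source, target) pairs; the function name and the input format are hypothetical and are not taken from the actual dump-processing code.

```python
def build_wikigraph(links, articles):
    """Build a directed adjacency map restricted to articles inside the dataset.

    links    -- iterable of (source_title, target) pairs (hypothetical input format)
    articles -- collection of titles belonging to the dataset
    """
    article_set = set(articles)
    graph = {title: set() for title in article_set}
    for source, target in links:
        # Keep only links whose two ends are articles of the dataset;
        # external links are dropped.
        if source in article_set and target in article_set:
            graph[source].add(target)
    return graph

toy_articles = ["Rome", "Italy", "Europe"]
toy_links = [
    ("Rome", "Italy"),
    ("Italy", "Europe"),
    ("Italy", "Rome"),
    ("Rome", "http://example.org/external"),  # external link, dropped
]
wikigraph = build_wikigraph(toy_links, toy_articles)
```

Storing successors as sets also collapses repeated links between the same pair of articles into a single edge, which is a modelling choice made here for simplicity.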
•Wikipedia interests
• sociological reasons: the encyclopedia collects pages written by a number of
independent and heterogeneous individuals. Each of them autonomously decides
the content of the articles, with the only constraint of a predefined layout. This
autonomy is a common feature of content creation on the Web. The Wikipedia
authors’ community is formed by members whose only wish is to make available to
the world concepts and topics that they consider meaningful. In some sense, tracing
the evolution of the Wikipedia subsets should mirror the development of significant trends
within each linguistic community.
• generation in time: Wikipedia provides time information associated with nodes.
Moreover, it provides historical information: the times of creation and
modification of each page in the dataset.
• independence from external links: Wikipedia articles link mainly to articles in the
same dataset.
• variety of graph sizes: one graph can be collected per language, and the graph
sizes vary from a few hundred pages up to half a million pages.
•Results
Summarizing:
• The whole history of growth is available, so that we can study the evolution
• We have an example of a “social” network of huge size
• We can compare the systems produced by users of different languages, thereby
measuring the effect of different cultures.
• We can study Wikipedia as a case study for the World Wide Web
WE RECOVER A PREFERENTIAL ATTACHMENT MECHANISM FROM
THE DATA.
DIFFERENT LANGUAGES PRODUCE SIMILAR STRUCTURES
WE FIND A SYSTEM SIMILAR TO THE WWW EVEN IF THE
MICROSCOPIC RULE OF GROWTH IS VERY DIFFERENT.
•The Wiki graphs
We generated six wikigraphs, wikiEN, wikiDE, wikiFR, wikiES,
wikiIT and wikiPT, from the English, German, French,
Spanish, Italian and Portuguese datasets, respectively. The
graphs were obtained from an old dump of June 13, 2004. We
are not using the current data because of disk space restrictions: the
English dataset of June 2005 is more than 36 GB compressed,
which is about 200 GB expanded.
The most visited page was the main page for wikiEN,
wikiDE, wikiFR and wikiES, while for the wikiIT and
wikiPT datasets no visits were associated with the pages.
• SCC (Strongly Connected Component) includes pages that are mutually reachable by traveling on the graph
• IN component is the region from which one can reach the SCC
• OUT component encompasses the pages reachable from the SCC
• TENDRILS are pages reachable from the IN component that do not point to the SCC or the OUT region. TENDRILS also include pages that point to the OUT region and do not belong to any of the other defined regions
• TUBES directly connect the IN and OUT regions
• DISCONNECTED regions are those isolated from the rest
The bow-tie structure found in the WWW (Broder et al., Comp. Net. 33, 309, 2000)
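The main bow-tie regions can be computed with two breadth-first searches, since the strongly connected component of a node is the intersection of the set it can reach and the set that can reach it. The sketch below is a simplified illustration: it assumes a seed node known to lie in the largest SCC, and it lumps tendrils, tubes and disconnected pages into a single OTHER set. All names are hypothetical.

```python
from collections import deque

def reachable(graph, sources):
    """Breadth-first search: all nodes reachable from `sources`."""
    seen = set(sources)
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def reverse(graph):
    """Return the graph with every edge direction flipped."""
    rev = {node: set() for node in graph}
    for node, targets in graph.items():
        for t in targets:
            rev.setdefault(t, set()).add(node)
    return rev

def bow_tie(graph, seed):
    """Split nodes into SCC, IN, OUT and OTHER, given a seed assumed to be in the largest SCC."""
    nodes = set(graph) | {t for ts in graph.values() for t in ts}
    forward = reachable(graph, [seed])             # seed plus everything it reaches
    backward = reachable(reverse(graph), [seed])   # seed plus everything reaching it
    scc = forward & backward
    return {
        "SCC": scc,
        "IN": backward - scc,
        "OUT": forward - scc,
        "OTHER": nodes - forward - backward,  # tendrils, tubes, disconnected
    }

toy_graph = {"a": {"b"}, "b": {"c"}, "c": {"b", "d"}, "d": set(), "e": set()}
regions = bow_tie(toy_graph, seed="b")
```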
•The Wikigraphs
The size of the Wikigraph for the various languages.
The percentage of the various components of the Wikigraph for the various languages.
•Power laws (what else?)
The degree distribution shows fat tails that can be approximated by a power-law function of the kind $P(k) \sim k^{-\gamma}$, where the exponent is the same for both in-degree and out-degree.
In the case of the WWW, $2 \le \gamma_{in} \le 2.1$.
Occurrence distributions of the in-degree (empty symbols) and out-degree (filled symbols) for the Wikigraph in English (○) and Portuguese.
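One common way to estimate the exponent of such a fat tail is the maximum-likelihood (Hill) estimator, gamma = 1 + n / sum(ln(x_i / x_min)). The sketch below uses the continuous approximation and checks it on synthetic data; it is only an illustration, not the fitting procedure used for the plots, and the function names and the choice of x_min are assumptions.

```python
import math
import random

def powerlaw_exponent_mle(values, x_min):
    """Maximum-likelihood estimate of gamma for P(x) ~ x^(-gamma), x >= x_min
    (continuous approximation)."""
    tail = [x for x in values if x >= x_min]
    log_sum = sum(math.log(x / x_min) for x in tail)
    return 1.0 + len(tail) / log_sum

def sample_powerlaw(gamma, x_min, size, rng):
    """Inverse-transform sampling from a continuous power law (synthetic test data)."""
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (gamma - 1.0)) for _ in range(size)]

rng = random.Random(42)
synthetic = sample_powerlaw(gamma=2.1, x_min=1.0, size=20000, rng=rng)
gamma_hat = powerlaw_exponent_mle(synthetic, x_min=1.0)
```

For real degree sequences, which are integers, the estimate depends on the cut-off x_min, so it is worth checking how stable the result is when x_min is varied.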
•Correlations
As regards assortativity (measured by the average degree of the neighbours of a vertex with degree k), there is no evidence of any assortative behaviour.
The average neighbours’ in-degree, computed along incoming edges, as a function of the in-degree for the English (○) and Portuguese Wikigraphs.
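The quantity in the caption can be computed as in the sketch below: for every vertex, average the in-degree of the vertices that point to it, then average that value over all vertices with the same in-degree k. The edge-list input format and the function name are hypothetical.

```python
from collections import defaultdict

def average_neighbour_in_degree(edges):
    """For each in-degree k, the mean in-degree of the sources of incoming edges."""
    in_degree = defaultdict(int)
    for source, target in edges:
        in_degree[target] += 1
        in_degree[source] += 0  # make sure every vertex is present
    sums = defaultdict(float)   # per-vertex sum of the neighbours' in-degrees
    for source, target in edges:
        sums[target] += in_degree[source]
    by_k = defaultdict(list)
    for vertex, k in in_degree.items():
        if k > 0:
            by_k[k].append(sums[vertex] / k)
    return {k: sum(vals) / len(vals) for k, vals in sorted(by_k.items())}

toy_edges = [("a", "c"), ("b", "c"), ("c", "d"), ("a", "d"), ("d", "a")]
knn = average_neighbour_in_degree(toy_edges)
# knn == {1: 2.0, 2: 1.0}
```

A curve that stays flat in k corresponds to the absence of assortative behaviour mentioned above.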
•PageRank
The PageRank distribution for wikiEN is a power-law function
with γ = 2.1. Previous measurements on webgraphs exhibit the
same behaviour for the PageRank distribution.
We list the number of visits of the top-ranked pages just to show
that this value is not related to the PageRank values. We
confirm that very little correlation was found between the link
analysis characteristics and the actual number of visits.
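For reference, PageRank can be computed by power iteration as in the minimal sketch below. The damping factor 0.85 and the uniform redistribution of rank from pages without out-links are conventional assumptions, not values stated in the slides; the toy graph is made up.

```python
def pagerank(graph, damping=0.85, iterations=100):
    """Power-iteration PageRank on a dict {node: set_of_successors}."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by dangling nodes (no out-links) is spread uniformly.
        dangling = sum(rank[node] for node in nodes if not graph[node])
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n for node in nodes}
        for node in nodes:
            targets = graph[node]
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank

toy_graph = {"a": {"b", "c"}, "b": {"c"}, "c": {"a"}, "d": {"c"}}
toy_rank = pagerank(toy_graph)
```

The resulting values form a probability distribution over pages (the stationary distribution of a random surfer), which is why one might expect them to relate to visit counts.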
•Preferential attachment
Given the history of growth, one can verify the hypothesis of preferential attachment. This is done by means of the histogram P(k), which gives the number of vertices of degree k acquiring new connections at time t. This quantity is weighted by the factor N(t)/n(k,t), where N(t) is the number of vertices at time t and n(k,t) is the number of vertices of degree k at that time.
English (○) and Portuguese. White = in-degree, filled = out-degree.
We find preferential attachment for both in-degree and out-degree.
•Updates’ statistics
Other power laws related to the dynamics need to be explained.
For example, the number of updates also follows a power law.
Each point represents the number of nodes (y axis) that were updated exactly x times.
•Wikipedia growth model
We introduced an evolution rule, similar to other rewiring
models already considered*:
• At each time step, a vertex is added to the network. It is
connected to the existing vertices by M oriented edges; the
direction of each edge is drawn at random:
• with probability R1 the edge leaves the new vertex pointing to
an existing one chosen with probability proportional to its in–
degree;
• with probability R2, the edge points to the new vertex, and
the source vertex is chosen with probability proportional to its
out–degree.
• Finally, with probability R3 = 1 − R1 − R2 the edge is added
between existing vertices: the source vertex is chosen with
probability proportional to the out–degree, while the destination
vertex is chosen with probability proportional to the in–degree.
* See for example Krapivsky, Rodgers and Redner, PRL 86 5401 (2001)
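The growth rule above can be simulated with a short sketch like the following. Picking a random existing edge and taking its target (or source) selects a vertex with probability proportional to its in-degree (or out-degree). The two-vertex seed graph, the value M = 10 and the function name are hypothetical choices; self-loops and repeated edges are not filtered out in this simplified version.

```python
import random

def grow_network(steps, M, R1, R2, rng):
    """Simulate the growth rule; returns the list of directed edges (source, target)."""
    # Hypothetical seed: two vertices linked both ways, so that in- and
    # out-degrees are non-zero when the first preferential choices are made.
    edges = [(0, 1), (1, 0)]
    n_vertices = 2
    for _ in range(steps):
        new = n_vertices
        n_vertices += 1
        existing = len(edges)  # edges present before this time step
        for _ in range(M):
            # Endpoints are drawn among edges that existed before this step,
            # so they are always "existing" vertices, never the new one.
            by_in_degree = edges[rng.randrange(existing)][1]
            by_out_degree = edges[rng.randrange(existing)][0]
            u = rng.random()
            if u < R1:
                edges.append((new, by_in_degree))            # new -> existing
            elif u < R1 + R2:
                edges.append((by_out_degree, new))           # existing -> new
            else:
                edges.append((by_out_degree, by_in_degree))  # existing -> existing
    return edges

rng = random.Random(0)
edges = grow_network(steps=2000, M=10, R1=0.026, R2=0.091, rng=rng)
```

With R3 = 1 - R1 - R2 close to 1, most new edges connect existing vertices, so a newly added vertex may receive no edge at all in its own time step.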
•Wikipedia growth model
From these data it seems that a model in the spirit of the BA (Barabási-Albert) model could
reproduce most of the features of the system.
However:
1) This network is oriented.
2) The preferential attachment in Wikipedia has a somewhat
different nature. Here, most of the time, edges are added
between existing vertices, unlike in the BA model. For
instance, in the English version of Wikipedia a largely
dominant fraction 0.883 of new edges is created between two
existing pages, while a smaller fraction of edges leaves or
points to a newly added vertex (0.026 and 0.091 respectively).
•Wikipedia growth model
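The model can be solved analytically:
$P(k_{in}) \sim k_{in}^{-\gamma_{in}}$, with $\gamma_{in} = 1 + \frac{1}{1 - R_2}$
$P(k_{out}) \sim k_{out}^{-\gamma_{out}}$, with $\gamma_{out} = 1 + \frac{1}{1 - R_1}$
$\gamma_{in} \approx 2.100$
$\gamma_{out} \approx 2.027$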
For the model we can use the empirical values R1 = 0.026, R2 = 0.091, R3 = 0.883, already measured for the English version of the Wikigraph.
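As a quick arithmetic check, the exponents quoted for the model follow from these empirical values when the degree distributions are written as P(k) ~ k^(-gamma), with gamma_in = 1 + 1/(1 - R2) and gamma_out = 1 + 1/(1 - R1). The function name below is hypothetical.

```python
def degree_exponents(R1, R2):
    """Exponents of P(k) ~ k^(-gamma) for in- and out-degree in the growth model."""
    gamma_in = 1.0 + 1.0 / (1.0 - R2)
    gamma_out = 1.0 + 1.0 / (1.0 - R1)
    return gamma_in, gamma_out

gamma_in, gamma_out = degree_exponents(R1=0.026, R2=0.091)
print(f"gamma_in = {gamma_in:.3f}, gamma_out = {gamma_out:.3f}")
# prints: gamma_in = 2.100, gamma_out = 2.027
```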
•Wikipedia growth model
The model can be solved analytically:
$K_{nn}^{in}(k_{in}) \sim M N^{1-R_1} \frac{R_1 R_2}{R_3}$ for $R_3 \neq 0$
$K_{nn}^{in}(k_{in}) \sim M R_1 R_2 \ln N$ for $R_3 = 0$
In both cases $K_{nn}^{in}$ is constant, i.e. independent of $k_{in}$.
The value of the constant also depends on the initial conditions. The two lines refer to two realizations of the model; in one of them the first 0.5% of the vertices has been removed.
•Wikipedia growth model
• We have a structure that resembles the bow-tie of the WWW
• We have a power-law decay for the degree distributions and also
a power-law decay for the number of updates per page
• Preferential attachment in the rewiring seems to be the driving force
in the evolution of the system
• The microscopic structure of rewiring is very different from that of the WWW:
in principle a user can change any series of edges and add as many
pages as wanted. Still, most of the quantities are similar
•Wikipedia growth model
The finding that the PageRank of the pages is not related to the
number of visits opens a very interesting scenario for further research.
Since, by definition, PageRank should give the visit time of
a page, and since it is actually completely independent of the number
of visits, we wonder whether PageRank is a good measure of the
authoritativeness of pages in the wikigraphs, and which modifications
should be introduced to tune its performance.
•River Networks
From satellite images one gets Digital Elevation Models (DEM)
From DEM a spanning tree is computed (via steepest descent)
From the spanning tree, the number of points uphill is computed
Example on a 3 × 3 grid.
Elevation:
156.4   132.4   111.4
170.8   161.3   108.2
182.4   154.5   106.0
Number of points uphill:
2   3   4
1   1   6
1   2   9
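The three steps above can be sketched on the 3 × 3 example of this slide. The sketch assumes that each cell drains to the lowest of its four nearest neighbours and that the count of points uphill includes the cell itself; under these assumptions it reproduces the counts shown on the slide. The function name is hypothetical.

```python
def drainage_areas(dem):
    """Steepest-descent spanning tree on a grid and number of uphill points per cell."""
    rows, cols = len(dem), len(dem[0])

    def downhill(r, c):
        # Assumption: 4 nearest neighbours; flow goes to the lowest one, if lower.
        best, best_h = None, dem[r][c]
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < rows and 0 <= cc < cols and dem[rr][cc] < best_h:
                best, best_h = (rr, cc), dem[rr][cc]
        return best

    area = [[1] * cols for _ in range(rows)]  # each cell counts itself
    # Visit cells from highest to lowest, so that upstream areas are complete
    # before being passed downstream.
    cells = sorted(((r, c) for r in range(rows) for c in range(cols)),
                   key=lambda rc: dem[rc[0]][rc[1]], reverse=True)
    for r, c in cells:
        target = downhill(r, c)
        if target is not None:
            area[target[0]][target[1]] += area[r][c]
    return area

dem = [
    [156.4, 132.4, 111.4],
    [170.8, 161.3, 108.2],
    [182.4, 154.5, 106.0],
]
areas = drainage_areas(dem)
# areas == [[2, 3, 4], [1, 1, 6], [1, 2, 9]]
```

The lowest cell (106.0) is the outlet and collects all nine points of the grid.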
•River Networks
HACK’S LAW
$L_{\parallel} \sim A^{h}$
•River Networks
Data on Mars topography were
collected through the Mars Orbiter
Laser Altimeter (MOLA)
•River Networks
The result is that we can distinguish regions whose DEM
networks have properties similar to river networks on Earth.
For rivers on Earth,
$P(A) \propto A^{-1.43}$
THE LIQUIDITY MARKET
[Diagram labels: Monetary Policy, ECB, Reserves]
Banks get liquidity from the ECB through auctions
Monetary policy is carried out by the ECB to control interest rates
BANKS MANAGE THEIR LIQUIDITY IN THE INTERBANK MARKET
The Market
Money Market
• The EUROPEAN CENTRAL BANK provides LIQUIDITY to European banks
through weekly auctions.
• EVERY BANK must DEPOSIT with its NATIONAL CENTRAL BANK 2% of all
deposits and debts issued in the last two years. These reserves are
supposed to help in the case of liquidity shocks.
• The 2% value fluctuates in time and is recomputed every month.
Banks sell and buy liquidity to adjust to their liquidity needs and at
the same time tend to reduce the value of their reserves.
The Market
Market Data
The interbank markets are basically managed by each European country. These markets are in almost all cases phone-based, meaning that each bank has brokers doing their transactions by phone. The only exception is the Italian market, which is fully screen-based, so that each bank's operator can see real-time quotes from all other banks and carry out transactions.
The recent paper by Boss et al. investigates the network of overall credit relationships in the Austrian interbank market. In their study the authors analyze all the liabilities among 900 banks for ten quarterly single-month periods between 2000 and 2003. They find a power-law distribution of contract sizes, and a power-law decay of the distribution of incoming and outgoing links (a link between two banks exists if the banks have an overall exposure with each other). Furthermore, they show that the most vulnerable vertices are those with the highest centrality (measured by the number of paths that go through them).
A different issue has been explored by Cocco et al., who investigated the nature of lending relationships in the fragmented Portuguese interbank market over the period 1997-2001. In fragmented markets the amount and the interest rate of each loan are agreed on a one-to-one basis between borrowing and lending institutions. Other banks do not have access to the same terms, and no public information regarding the loan is available. The authors showed that frequent and repeated interactions between the same banks appear with a probability higher than that expected for random matching. In addition, they found that during illiquid periods, and in particular during the Russian financial crisis, preferential lending relationships increased.
The Market
Market Data
Italian Interbank Money Market
The Italian market has been fully electronic for interbank deposits since 1990 (e-Mid):
*) Daily volume: 18 billion Euros
*) 200 participants
We report here the analysis of 196 Italian banks (plus 18 banks from abroad that interact with them), which made 85202 transactions in 2000.
INTRODUCTION
Time activity shows two time scales: the day and the one-month maintenance period.
Statistical Properties
Market Data
The network shows a rather peculiar architecture
The banks form a disassortative network where
large banks interact mostly with small ones.
Statistical Properties
Market Data
Actually, the banks form different groups roughly related to their “size” when considering the average volume of money exchanged.
Statistical Properties
Degree Distributions
Using the latter quantity we can divide banks into four groups (the same number of classes as in the Bank of Italy classification): Group 1 with volume in the range 0-23 million Euro per day, Group 2 in the range 23-70 million Euro per day, Group 3 in the range 70-165 million Euro per day, and Group 4 over 165 million Euro per day. In this way we find an overlap of more than 90% between the two classifications.
Communities
Separation of business
Two main communities emerge: many small banks and a few large banks.
Second eigenvector of the normal matrix
Modelling
Model of bank network
We assign to each of the N nodes (N is the size of the system) a
value drawn from the previous distribution. The origin and destination
vertices of an edge are chosen with a probability
pij proportional to the sum of their respective sizes
vi and vj. In formulas:
$$p_{ij} = \frac{v_i + v_j}{\sum_{k,\, l>k} (v_k + v_l)}$$
$$\sum_{k,\, l>k} (v_k + v_l) = \frac{1}{2} \sum_{k,\, l \neq k} (v_k + v_l) = 2(N-1)\,V_{tot}, \qquad V_{tot} = \frac{1}{2} \sum_i v_i$$
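A minimal sketch of this edge-assignment rule is given below. It draws pairs of banks with probability proportional to the sum of their sizes and checks the normalisation, using the fact that the sum of (v_i + v_j) over all pairs i < j equals (N - 1) times the sum of all sizes. The sizes in the example are made-up numbers, the function names are hypothetical, and repeated edges between the same pair are allowed in this simplified version.

```python
import random

def sample_edges(sizes, n_edges, rng):
    """Draw n_edges pairs (i, j), i < j, with probability proportional to v_i + v_j."""
    n = len(sizes)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    weights = [sizes[i] + sizes[j] for i, j in pairs]
    return rng.choices(pairs, weights=weights, k=n_edges)

def pair_probability(sizes, i, j):
    """Normalised p_ij, using sum_{i<j} (v_i + v_j) = (N - 1) * sum_i v_i."""
    n = len(sizes)
    return (sizes[i] + sizes[j]) / ((n - 1) * sum(sizes))

toy_sizes = [1.0, 2.0, 3.0, 10.0]  # hypothetical bank sizes
rng = random.Random(1)
toy_edges = sample_edges(toy_sizes, n_edges=1000, rng=rng)
total_probability = sum(
    pair_probability(toy_sizes, i, j)
    for i in range(len(toy_sizes)) for j in range(i + 1, len(toy_sizes))
)
```

Because the probability depends on the sum of the sizes, a large bank is likely to be linked to any partner, including small ones, which is consistent with the disassortative structure reported for the data.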
Modelling
Market Data
MODELLING
Model and clustering
To quantify the agreement between experimental and simulated networks, we define an overlap parameter m specifying how well the model reproduces the observed clustering. We proceed in the following way.
We define a weighted 4 × 4 matrix E, where the weights represent the number of connections between groups.
To measure the overlap between the matrices obtained from the data and from the computer model, we define a distance based on the differences between the elements of the matrices.
$$d = \sum_{g,\, k \ge g} \left| E^{ex}_{g,k} - E^{num}_{g,k} \right|$$
We can thus define a distance d between the numbers of intergroup edges in the experimental data and in the numerical simulation.
The sum of all elements is equal to Etot in both cases:
$$\sum_{g,\, k \ge g} E^{ex}_{g,k} = \sum_{g,\, k \ge g} E^{num}_{g,k} = E_{tot}$$
Therefore the maximum possible difference is 2Etot. This happens when all the links are between two groups in one case and between two other groups in the other. We use this maximum value to normalize the above expression, and we then define the overlap parameter m:
$$m = 1 - \frac{d}{2E_{tot}}$$
WE HAVE AN OVERLAP m = 98%
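The overlap parameter can be computed as in the sketch below, which sums absolute differences over the upper triangle (k ≥ g) of the two group-to-group matrices and normalises by 2Etot. The 2 × 2 matrices are a hypothetical example, not the data behind the 98% figure, and the function name is an assumption.

```python
def overlap(E_ex, E_num):
    """Overlap m = 1 - d / (2 * E_tot) between two symmetric group-to-group
    edge-count matrices with the same total number of edges."""
    groups = range(len(E_ex))
    upper = [(g, k) for g in groups for k in groups if k >= g]
    e_tot = sum(E_ex[g][k] for g, k in upper)
    if sum(E_num[g][k] for g, k in upper) != e_tot:
        raise ValueError("both networks must have the same total number of edges")
    d = sum(abs(E_ex[g][k] - E_num[g][k]) for g, k in upper)
    return 1.0 - d / (2.0 * e_tot)

# Hypothetical 2-group example (not the data from the slides).
E_data = [[10, 5], [5, 5]]
E_model = [[8, 6], [6, 6]]
m = overlap(E_data, E_model)
# d = 2 + 1 + 1 = 4 and E_tot = 20, so m = 1 - 4/40 = 0.9
```

Identical matrices give m = 1, while two networks whose edges fall in entirely different group pairs give m = 0.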
To evaluate the relevance of the division into classes, we compare the value of $E_{g,k}$ with the corresponding quantity $E^{null}_{g,k}$ for a network with no division into classes (null hypothesis). The analytical expression for the null case is $E^{null}_{g,k} = E_{tot}/10$, where 10 is the number of possible couplings between the 4 groups. The comparison between the two networks shows that the division into groups emerges in the real case: the table reports the value $E_{g,k}/E_{tot}$, in percent, for each possible combination of groups. In the null case, each element of the same matrix would be equal to 10.
Eg,k/Etot (%)
Group    1    2    3    4
1        0    6    4    8
2        6    3    8   17
3        4    8    5   27
4        8   17   27   22
CONCLUSIONS
Market Data
Financial networks can help
1. In distinguishing the behaviour of different markets
2. In visualizing important features such as the business role
3. In testing the validity of market models
They might be an example of scale-free networks even more general
than those described by growth and preferential attachment.
CONCLUSIONS
Thanks to the two Giulias:
Giulia De Masi,
Dep. Economics
Università delle Marche
Italy
Giulia Iori,
Department of Economics,
School of Social Science
City University, London UK