Challenge 1 - EDBT Summer School 2015

EDBT Summer School 2015
Research Challenge 1
Graphs with target clustering coefficient and given input degree
sequences
Speaker: Arnau Prat Pérez
Context and motivation:
Generating graphs with realistic structural characteristics is of high importance for
benchmarking graph processing systems and graph databases. One such characteristic is
the clustering coefficient of the graph, which is the number of closed triangles in the network
divided by the total number of possible triangles. Many approaches have been proposed to
generate graphs with a target clustering coefficient. Many of them are based on well-known
graph generation models, such as the “configuration model” or the “preferential attachment
model”. However, none of them is designed with the following assumptions in mind:
• A predefined degree sequence is given.
• The position of a node in the degree sequence establishes a similarity relation between
the nodes of the sequence. Nodes that are close in the sequence are nodes that are
conceptually similar (people with similar interests, proteins with similar metabolic
functions, etc.).
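As a rough illustration of the clustering coefficient described above, the following is a minimal sketch that assumes the global (transitivity-style) reading of the definition: closed connected triples divided by all connected triples. The function name `global_clustering_coefficient` and the edge-list input format are assumptions made for this example and are not part of the challenge statement.

```python
from itertools import combinations


def global_clustering_coefficient(edges):
    """Closed connected triples / all connected triples of an undirected simple graph."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    closed = 0   # triples whose two end nodes are also linked
    triples = 0  # all paths of length two, counted at their centre node
    for node, neigh in adj.items():
        k = len(neigh)
        triples += k * (k - 1) // 2
        for a, b in combinations(neigh, 2):
            if b in adj[a]:
                closed += 1
    return closed / triples if triples else 0.0
```

For example, a triangle with one extra pendant edge, `[(0, 1), (1, 2), (0, 2), (2, 3)]`, has five connected triples, three of which are closed. If the average local clustering coefficient is the intended measure instead, the per-node ratios would have to be averaged rather than pooled.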
The goal of this challenge is to generate a synthetic graph with a target clustering coefficient,
with a degree distribution as close as possible to the one observed in the degree sequence
given a priori, and where nodes have a higher probability of being connected if they are close in
the degree sequence. The reason for this last requirement is what is known as the homophily
principle, which states that similar entities have a larger probability of being connected, thus
forming triangles among them.
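To make the constraints concrete, here is a deliberately naive baseline, not a solution to the challenge: it links each node to the nearest later positions in the degree sequence that still have unused degree. It never exceeds the given degrees and favours short edge distances, but it makes no attempt to hit a target clustering coefficient. The function name `locality_first_graph` and the optional `window` parameter are hypothetical and introduced only for this sketch.

```python
def locality_first_graph(degrees, window=None):
    """Greedy baseline: connect each position to the closest later positions with free stubs.

    degrees: list where degrees[i] is the desired degree of the node at position i.
    window:  optional cap on the edge distance (assumption, not part of the challenge).
    """
    n = len(degrees)
    remaining = list(degrees)  # unused degree ("stubs") per position
    edges = set()
    max_gap = n - 1 if window is None else window
    for u in range(n):
        gap = 1
        # Scan forward from the nearest neighbour in the sequence.
        while remaining[u] > 0 and gap <= max_gap and u + gap < n:
            v = u + gap
            if remaining[v] > 0:
                edges.add((u, v))
                remaining[u] -= 1
                remaining[v] -= 1
            gap += 1
    return edges
```

Because each node only looks forward, every pair is considered at most once, so no duplicate edges or self-loops are produced. Some nodes may end up with a smaller degree than requested, which is exactly the kind of deviation a real solution would have to keep small.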
Research/Design challenges
• Develop, either individually or in teams, an algorithm to generate synthetic graphs under
the aforementioned restrictions that fulfills the following requirements:
    • The clustering coefficient of the generated graph is close to the target one for
    different input values (with an error of less than 5%). For example, an algorithm might
    work well for clustering coefficients in the range 0 to 0.3 but not above. We will
    rank first those algorithms that keep the error below 5% over the largest range of
    clustering coefficients. In case of a tie, we will prefer the algorithm with the
    smaller error.
    • The degree distribution of the resulting graph is close to the original one (fewer
    than 5% of the nodes have a degree different from the original one, and no node can
    have a larger degree than the original).
    • We will appreciate solutions in which nodes form triangles with nodes close to them
    in the degree sequence (this does not mean they cannot form triangles with distant
    nodes). Solutions should minimize the sum of the edge distances of the graph (the
    distance of an edge is the difference between the positions, in the degree sequence,
    of the two nodes it connects).
• We will appreciate innovative solutions that tackle the above requirements in the
most elegant way and that work well in practice; that is, efficiency will also be
appreciated.
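The requirements above can be turned into simple checks. The sketch below is one possible reading, under these assumptions: nodes are identified by their 0-based position in the degree sequence, the 5% clustering error is interpreted as a relative error, and the degree comparison is done node by node. The function names are hypothetical and chosen for this example only.

```python
def degree_mismatch_fraction(edges, target_degrees):
    """Fraction of positions whose degree differs from the given degree sequence."""
    degree = [0] * len(target_degrees)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    # The challenge forbids any node from exceeding its original degree.
    if any(d > t for d, t in zip(degree, target_degrees)):
        raise ValueError("a node exceeds its original degree")
    if not target_degrees:
        return 0.0
    differing = sum(1 for d, t in zip(degree, target_degrees) if d != t)
    return differing / len(target_degrees)


def total_edge_distance(edges):
    """Sum over all edges of the positional distance between the two endpoints."""
    return sum(abs(u - v) for u, v in edges)


def relative_error(achieved, target):
    """Relative clustering-coefficient error (assumed reading of 'error below 5%')."""
    return abs(achieved - target) / target
```

A candidate graph would then pass the stated thresholds when `relative_error(...) < 0.05` and `degree_mismatch_fraction(...) < 0.05`, with `total_edge_distance(...)` used to compare otherwise acceptable solutions. Note that `relative_error` is undefined for a target of exactly 0; an absolute error would be needed in that case.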
Technical prerequisites of participants:
• Basic programming skills in any language the student is familiar with, e.g. Python, Java,
etc.
• Recommended: familiarity with a graph library such as Sparksee, Neo4j, NetworkX (Python),
or SNAP.
Technical support provided to participants:
• All needed resources can be obtained from the Internet.