EDBT Summer School 2015

Research Challenge 1: Graphs with target clustering coefficient and given input degree sequences

Speaker: Arnau Prat Pérez

Context and motivation:

Generating graphs with realistic structural characteristics is highly important for benchmarking graph processing systems and graph databases. One such characteristic is the clustering coefficient of the graph, which is the fraction of the possible triangles in the network that are actually closed. Many approaches have been proposed to generate graphs with a target clustering coefficient. Many of them are based on well-known graph generation models, such as the “configuration model” or the “preferential attachment model”. However, none of them is designed with the following assumptions in mind:

- A predefined degree sequence is given.
- The position of a node in the degree sequence establishes a similarity relation between the nodes of the sequence. Nodes that are close in the sequence are conceptually similar (people with similar interests, proteins with similar metabolic functions, etc.).

The goal of this challenge is to generate a synthetic graph with a target clustering coefficient, with a degree distribution as close as possible to the one observed in the degree sequence given a priori, and where nodes have a higher probability of being connected if they are close in the degree sequence. The reason for this last requirement is the homophily principle, which states that similar entities have a higher probability of being connected, thus forming triangles among them.

Research/Design challenges:

Develop, either individually or in teams, an algorithm that generates synthetic graphs under the aforementioned restrictions and fulfills the following requirements:

- The clustering coefficient of the generated graph is close to the target one for different input values (with an error of less than 5%).
An algorithm might work well for clustering coefficients in the range 0 to 0.3 but not above. We will first favor algorithms that keep the error of the resulting clustering coefficient below 5% over a larger range of target values. In case of a tie, we will prefer the algorithm with the smaller error.
- The degree distribution of the resulting graph is close to the original one: fewer than 5% of the nodes have a degree different from the original one, and no node may have a larger degree than in the original sequence.
- We will appreciate nodes forming triangles with nodes that are close in the degree sequence (although this does not mean they cannot form triangles with distant nodes). Solutions should minimize the sum of the edge distances of the graph, where the distance of an edge is the difference between the positions in the degree sequence of the two nodes it connects.
- We will appreciate innovative solutions that tackle the above requirements in the most elegant way and that work well in practice; that is, efficiency will also be appreciated.

Technical prerequisites of participants:

- Basic programming skills in any language the student is familiar with, e.g. Python, Java, etc.
- Recommended: familiarity with a graph library such as Sparksee, Neo4j, Python NetworkX or SNAP.

Technical support provided to participants:

All needed resources can be obtained from the Internet.
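
As an illustration of the requirements listed under Research/Design challenges, the following is a minimal sketch of how a candidate graph could be checked. It is not an official scoring script. It assumes that nodes are identified by their position (0 to n-1) in the degree sequence, that the clustering coefficient is the global one (closed triples divided by connected triples), and that the 5% error is a relative error; the challenge text does not fix any of these choices. The names `build_adjacency`, `global_clustering` and `evaluate` are hypothetical.

```python
from itertools import combinations


def build_adjacency(n, edges):
    """Build an undirected simple graph as a list of neighbor sets."""
    adj = [set() for _ in range(n)]
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)
    return adj


def global_clustering(adj):
    """Closed triples divided by connected triples (assumed definition)."""
    closed = 0
    triples = 0
    for nbrs in adj:
        k = len(nbrs)
        triples += k * (k - 1) // 2  # triples centered on this node
        closed += sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return closed / triples if triples else 0.0


def evaluate(edges, degree_sequence, target_cc):
    """Report the quantities the challenge asks to keep small."""
    n = len(degree_sequence)
    adj = build_adjacency(n, edges)
    cc = global_clustering(adj)
    degrees = [len(nbrs) for nbrs in adj]
    pairs = list(zip(degrees, degree_sequence))
    mismatched = sum(1 for d, t in pairs if d != t)
    # Relative error is an assumption; fall back to absolute for target 0.
    error = abs(cc - target_cc) / target_cc if target_cc > 0 else abs(cc)
    return {
        "clustering": cc,
        "clustering_error": error,
        "mismatched_fraction": mismatched / n if n else 0.0,
        "degree_exceeded": any(d > t for d, t in pairs),
        # Edge distance: difference between the positions of the endpoints.
        "edge_distance_sum": sum(
            abs(u - v) for u in range(n) for v in adj[u] if u < v
        ),
    }


if __name__ == "__main__":
    # Toy input: a triangle on nodes 0-2 plus one edge between nodes 3 and 4.
    report = evaluate([(0, 1), (1, 2), (0, 2), (3, 4)], [2, 2, 2, 1, 1], 1.0)
    for name, value in report.items():
        print(name, value)
```

Under these assumptions, a solution would aim for `clustering_error` and `mismatched_fraction` both below 0.05, `degree_exceeded` equal to `False`, and an `edge_distance_sum` that is as small as possible.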