A Brief Overview on Some Recent Study of Graph Data
Yunkai Liu, Ph.D., Gannon University

Outlines
• Graph Database vs. Traditional Database
  – Data structure
  – Some frequently-used measurements
  – Overview of Graph Databases
• Graph Data on Social Networks
  – Case study
• Graph Data on Biology
  – Case study
• Graph Data on other areas

What is the specialty of graph data in application
• Basic Data Structure
  – G = (N, E)
• Edges are sometimes also called links
• Some differences / limitations
  – Graphs are directed
  – Nodes contain a large number of attribute categories
  – Edges contain a limited number of attribute categories
  – Adjacency matrices are rarely used; hash tables and indices are widely used
• Example
  – The social network (SN) between us

Some frequently-addressed graph properties
• Homophily is the tendency to relate to people with similar characteristics (status, beliefs, etc.)
  – It leads to the formation of homogeneous groups (clusters) where forming relations is easier
  – Extreme homogenization can act counter to innovation and idea generation (heterophily is thus desirable in some contexts)
  – Homophilous ties can be strong or weak

Some frequently-addressed graph properties
• Transitivity is a property of ties: if there is a tie between A and B and one between B and C, then in a transitive network A and C will also be connected
  – Strong ties are more often transitive than weak ties; transitivity is therefore evidence for the existence of strong ties (but not a necessary or sufficient condition)
  – Transitivity and homophily together lead to the formation of cliques (fully connected clusters)
  – How do we decide on a reasonable degree of transitivity in graph models?
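
As a rough illustration of the hash-table storage and the transitivity property described above, here is a minimal Python sketch. The class name `Graph`, the function `transitivity_ratio`, and the sample names are assumptions made for this example only; they do not come from any particular graph database. The ratio shown (closed two-step paths over all two-step paths) is one simple way to quantify transitivity, not necessarily the one a given model should use.

```python
from collections import defaultdict


class Graph:
    """Directed graph G = (N, E) kept in hash tables instead of a matrix."""

    def __init__(self):
        self.node_attrs = {}                 # node -> attribute dict
        self.out_edges = defaultdict(dict)   # source -> {target: attribute dict}

    def add_node(self, node, **attrs):
        self.node_attrs.setdefault(node, {}).update(attrs)

    def add_edge(self, source, target, **attrs):
        self.add_node(source)
        self.add_node(target)
        self.out_edges[source][target] = dict(attrs)

    def has_edge(self, source, target):
        # .get avoids creating empty entries in the defaultdict
        return target in self.out_edges.get(source, {})


def transitivity_ratio(graph):
    """Share of two-step paths A->B->C (A, B, C distinct) closed by an edge A->C."""
    closed = total = 0
    for a, targets in graph.out_edges.items():
        for b in targets:
            if b == a:
                continue  # ignore self-loops
            for c in graph.out_edges.get(b, {}):
                if c == a or c == b:
                    continue
                total += 1
                if graph.has_edge(a, c):
                    closed += 1
    return closed / total if total else 0.0


# Hypothetical four-person network: only ann->bob->cat is closed (by ann->cat).
g = Graph()
g.add_node("ann", status="faculty")
g.add_edge("ann", "bob", kind="email")
g.add_edge("bob", "cat", kind="email")
g.add_edge("ann", "cat", kind="email")
g.add_edge("cat", "dan", kind="email")
ratio = transitivity_ratio(g)
```

Nodes carry arbitrary attribute dictionaries while edges carry only a few, which mirrors the difference noted on the data-structure slide.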

Some frequently-addressed graph properties
• Bridges are nodes and edges that connect across groups
  – They facilitate inter-group communication, increase social cohesion, and help spur innovation
  – They are usually weak ties, but not every weak tie is a bridge

Some frequently-addressed graph properties – Degree centrality
• A node’s in-degree or out-degree is the number of links that lead into or out of the node
• In an undirected graph the two are identical
• Often used as a measure of a node’s degree of connectedness and hence also of its influence and/or popularity
• Useful in assessing which nodes are central with respect to spreading information and influencing others in their immediate ‘neighborhood’

Some frequently-addressed graph properties – Paths
• A path between two nodes is any sequence of non-repeating nodes that connects the two nodes
• The shortest path between two nodes is the path that connects them with the fewest edges (its length is also called the distance between the nodes)
  – All shortest paths
  – K-th shortest path

Some frequently-addressed graph properties – Betweenness centrality
• The number of shortest paths that pass through a node divided by the number of all shortest paths in the network
• Sometimes normalized such that the highest value is 1
• Shows which nodes are more likely to be on communication paths between other nodes
• Also useful in determining points where the network would break apart

Some frequently-addressed graph properties – Closeness centrality
• The mean length of all shortest paths from a node to all other nodes in the network (i.e. how many hops on average it takes to reach every other node)
• It is a measure of reach, i.e.
how long it will take to reach other nodes from a given starting node
• Useful in cases where speed of information dissemination is the main concern
• Lower values are better when higher speed is desirable

Some frequently-addressed graph properties – Eigenvector centrality
• A node’s eigenvector centrality is proportional to the sum of the eigenvector centralities of all nodes directly connected to it
• In other words, a node with a high eigenvector centrality is connected to other nodes with high eigenvector centrality
• This is similar to how Google ranks web pages: links from highly linked-to pages count more
• Useful in determining who is connected to the most connected nodes

Other measurements
• Reciprocity (degree of)
  – The ratio of the number of relations that are reciprocated (i.e. there is an edge in both directions) to the total number of relations in the network
  – A useful indicator of the degree of mutuality and reciprocal exchange in a network, which relate to social cohesion
  – Only makes sense in directed graphs

Other measurements
• Density
  – A network’s density is the ratio of the number of edges in the network to the total number of possible edges between all pairs of nodes (which is n(n-1)/2, where n is the number of vertices, for an undirected graph)
  – It is a common measure of how well connected a network is (in other words, how closely knit it is); a perfectly connected network is called a clique and has density = 1
  – With the same number of edges, a directed graph has half the density of its undirected equivalent, because there are twice as many possible edges, i.e. n(n-1)
  – Density is useful in comparing networks against each other, or in doing the same for different regions within a single network

Other measurements
• Clustering
  – A node’s clustering coefficient is the density of its neighborhood (i.e.
the network consisting only of this node and all other nodes directly connected to it)
  – The clustering coefficient for an entire network is the average of the coefficients of all its nodes
  – Clustering is indicative of the presence of different (sub-)communities in a network

Other measurements
• Average and longest distance
  – The longest shortest path (distance) between any two nodes in a network is called the network’s diameter
  – It also indicates how long it will take at most to reach any node in the network (sparser networks will generally have greater diameters)
  – The average of all shortest paths in a network is also interesting because it indicates how far apart any two nodes will be on average (average distance)

What is a Graph Database
• Graph databases started in the 1970s
• They have grown quickly in recent years thanks to advances in computing technology
  – Some graph databases (GD) claim that they can represent millions of nodes and billions of edges
• GD belong to the NoSQL family of databases

Social Network Analysis (SNA)
• News
  – In early 2013, Facebook announced its new “Graph Search” feature
• Major questions
  – Networks: How to represent various social networks
  – Tie Strength: How to identify strong/weak ties in the network
  – Key Players: How to identify key/central nodes in the network
  – Cohesion: How to characterize a network’s structure
• Major applications
  – Social study
  – National security
  – Micro-advertisement
  – …

Some of my projects
• Meth-Hunter
• Graph Data Management system
• Graph Data warehouse protocol

NodeXL - emails

NodeXL - Facebook

Graph Metric                                 Value
Graph Type                                   Undirected
Vertices                                     67
Unique Edges                                 165
Edges With Duplicates                        0
Total Edges                                  165
Self-Loops                                   0
Reciprocated Vertex Pair Ratio               Not Applicable
Reciprocated Edge Ratio                      Not Applicable
Connected Components                         8
Single-Vertex Connected Components           0
Maximum Vertices in a Connected Component    29
Maximum Edges in a Connected Component       102
Maximum Geodesic Distance (Diameter)         4
Average Geodesic Distance                    1.878997
Graph Density                                0.074626866
Modularity                                   0.564555

Graph Data in Biology
• Multiple classes of bionetwork models exist, such as metabolic, protein-gene, or protein-protein interactions
  – In metabolic networks, nodes are metabolites and edges are enzymes that facilitate a specific reaction within the body or in nature.
  – Protein-gene interactions involve understanding and mapping gene expression.
  – Like metabolic and gene-expression networks, protein-protein interaction networks are modeled as graphs, with proteins as the nodes.

Graph Data in Biology
• The structure of a bionetwork is important for understanding nature
• The analysis is similar to SNA
  – Clique finding is important and may be related to tumors.

One case study – bionetwork alignment
• Two previous models are Graemlin (General and robust alignment of multiple large interaction networks) and PHUNKEE (Pairing subgrapHs Using NetworK Environment Equivalence)
  – Whereas Graemlin considers the entire network spectrum, the PHUNKEE algorithm considers only the most conserved portions between two graphs

One case study – bionetwork alignment
• Graemlin has the advantage that it can align multiple networks quickly; however, all nodes and edges are considered whether or not they are similar to each other.
• In contrast, PHUNKEE considers only the most conserved portions of two graphs, taking into account that insertions and deletions may occur over time. However, the algorithm is slow because it works in a step-by-step manner.

One case study – bionetwork alignment
• We realized that one method is not enough to determine the relationship between two graphs, because of various factors in the data. We therefore created a comprehensive package for pairwise graph comparison.
  – The package includes two interfaces: one for global alignment and one for local alignment.
  – The transitivity property is also considered, to handle missing nodes or missing edges.

The bionetworks of four species in our experiment.
                   Rattus norvegicus   Mus musculus   Saccharomyces cerevisiae   Homo sapiens
Number of Nodes    1212                3214           4906                       11713
Number of Edges    241746              343605         383008                     1332225

The comparisons between three species and Homo sapiens.

                                                      Rattus norvegicus   Mus musculus       Saccharomyces cerevisiae
                                                      vs Homo sapiens     vs Homo sapiens    vs Homo sapiens
Number of Shared Nodes                                1124 (92.74%)       2928 (91.10%)      537 (10.94%)
Number of Shared Edges                                23233 (9.61%)       17422 (5.07%)      1308 (0.34%)
Inner Global Similarity                               0.6461              0.8816             0.9045
Outer Global Similarity                               -0.9850             -0.8877            -0.9978
Left Global Similarity Biased on the Three Species    0.4158              0.5616             -0.9771
Left Global Similarity Biased on Homo sapiens         -0.9848             -0.8824            -0.9959

[Figure: A Cladogram for Rattus norvegicus, Mus musculus and Saccharomyces cerevisiae]

Some Weird Part
• Normalizing the data is a big challenge. It is easy to reach a wrong conclusion, for example that yeast is closer to human than mouse is.
• This is just one example of graph mining in bioinformatics

Other areas of Graph Data
• GIS
• Financial / business
  – Public spending
• Gaming
• Some challenges of GD in CS
  – Cloud apps and cloud computing
  – Visualization
  – Integrating with other databases
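
To make the measurement slides (degree, density, reciprocity, closeness, diameter) more concrete, here is a minimal Python sketch. All function names are assumptions chosen for this example and are not NodeXL's or any graph database's API. The adjacency structure is a plain dictionary, in line with the earlier remark that hash tables are preferred over adjacency matrices. The density check reuses the vertex and edge counts from the NodeXL - Facebook table (67 vertices, 165 edges).

```python
from collections import deque


def density(num_nodes, num_edges, directed=False):
    """Edges present divided by edges possible: n(n-1)/2, or n(n-1) if directed."""
    possible = num_nodes * (num_nodes - 1)
    if not directed:
        possible //= 2
    return num_edges / possible if possible else 0.0


def reciprocity(edges):
    """Share of directed edges (s, t) whose reverse edge (t, s) also exists."""
    edge_set = {(s, t) for s, t in edges if s != t}  # drop self-loops
    if not edge_set:
        return 0.0
    mutual = sum(1 for s, t in edge_set if (t, s) in edge_set)
    return mutual / len(edge_set)


def degree(adj, node):
    """Number of links leaving `node` in an adjacency dictionary."""
    return len(adj.get(node, ()))


def bfs_distances(adj, start):
    """Hop counts from `start` to every reachable node (unweighted shortest paths)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adj.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def closeness(adj, node):
    """Mean shortest-path length from `node` to the other nodes it can reach."""
    others = [d for n, d in bfs_distances(adj, node).items() if n != node]
    return sum(others) / len(others) if others else float("inf")


def diameter(adj):
    """Longest shortest path; unreachable pairs are ignored in this sketch."""
    return max(max(bfs_distances(adj, n).values()) for n in adj)


# Undirected path a - b - c - d, stored with both directions of each link.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}

facebook_density = density(67, 165)  # counts taken from the NodeXL table
path_diameter = diameter(path)
closeness_b = closeness(path, "b")
mutual_share = reciprocity([("a", "b"), ("b", "a"), ("a", "c")])
```

With 67 vertices there are 67·66/2 = 2211 possible undirected edges, so 165/2211 reproduces the Graph Density value reported by NodeXL. Note that the closeness function follows the slide's definition (mean distance, lower is better) rather than the inverse form some libraries use.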