An Event-based Framework for Characterizing the Evolutionary Behavior of Interaction Graphs
Sitaram Asur, Srinivasan Parthasarathy and Duygu Ucar
Department of Computer Science
The Ohio State University
Copyright 2006, Data Mining Research Laboratory

Motivation
• Interaction Networks
  – Represent scientific data from various domains
  – Nodes represent entities
  – Edges represent interactions among entities
  – Examples:
    • Biological networks: Protein-Protein Interaction (PPI) networks, gene expression networks
    • Collaboration networks
    • Social networks, online communities, blog networks
[Figure: Protein-protein interactions in yeast (Jeong et al, 2001)]
[Figure: Physicist collaboration network (Newman and Girvan, 2004)]

Motivation
• Mining interaction networks important
  – Gain insight into structure, properties and behavior of these networks [Newman, 2001]
• Modular nature of interaction networks important
  – Co-expression networks: dense components -> functional modules
  – Social networks: clusters -> community structure

Motivation
• A large number of earlier approaches focused on mining static interaction networks
• Many important real-world networks are dynamic
[Figure: Temporal protein interaction network of the yeast mitotic cell cycle. Ulrik de Lichtenberg, et al. Science 307, 724 (2005)]
Motivation
• Dynamic Interaction Networks
  – Nodes and interactions change over time
  – Structure changes in the network
• Need for a structured method to characterize and model evolution
  – Understand nature of change (evolution) in networks
  – Consider evolution of individuals and communities
  – Develop models for reasoning and inference of future events

Workflow
[Figure: Evolving Graph -> Temporal Snapshots (S_i, S_i+1) -> Clustering (C_i, C_i+1) -> Event Detection -> Behavioral Patterns -> Analysis and Inference, iterating over i]

Temporal Snapshots
• Split the graph data into non-overlapping temporal snapshots
  – Each snapshot corresponds to a graph
  – Consists of all nodes and interactions active in that time period
  – Nodes are active if they have an interaction in a particular time period
[Figure: example snapshots T1 and T2 over nodes A to G]

Clustering
• Represent the snapshot graphs using clusters
  – Clusters of a graph can provide structure information
  – Examine the evolution of clusters over time
  – Can provide insight on corresponding changes to the graph
[Figure: clusters of the example snapshots T1 and T2 over nodes A to G]
  – MCL clustering algorithm employed in this work
  – Ensemble clustering approaches can be employed to obtain robust clusters (Asur et al, ISMB 2007)

Community-based Event Detection
• Continue
• Merge
• Split
• Form
• Dissolve
[Figure: clusters over snapshots T=1 to T=6 illustrating the community-based events]

Entity-based Event Detection
• Appear
• Disappear
• Join
• Leave
[Figure: nodes A, B and C across clusters over snapshots T=1 to T=4 illustrating the entity-based events]

Event Detection
• Represent each set of snapshot clusters as a k × N binary cluster-membership matrix
• Use bitwise operators to compute the events between
  each successive pair of matrices (snapshots)
• Example: Continue event
  Continue(C_j, C_k) = ( AND(S_i(j), S_i+1(k)) == OR(S_i(j), S_i+1(k)) )
• Event Detection algorithm linear in the number of nodes in the graph: O(N)

Temporal Analysis
• Use critical events for analysis
• Form and Dissolve events
  – Used to study group formation and dissipation
• Merge and Split events
  – Evolution of groups
• Continue events
  – Stability of clusters/groups
  – Evolution of topics in a collaboration network

Behavioral Analysis
• Use entity-based critical events discovered to compose incremental measures for capturing behavioral patterns
• Behavioral measures can then be used to analyze evolutionary behavior of nodes and clusters
• Four behavioral measures
  – Stability Index
  – Sociability Index
  – Popularity Index
  – Influence Index

Case Study 1: DBLP Collaboration Network
• Data from 28 key conferences in databases/data mining/AI over 10 years
• Authors (nodes) connected by collaborations (edges)
• 23136 nodes and 54989 edges
• Collaboration networks display many of the structural features of social networks (Kempe, Kleinberg and Tardos 2003, Newman 2001)

Case Study 2: Clinical Trials Network
• Clinical Trials
  – Can provide information on risks, benefits and optimal dosage levels.
  – Consists of observations of patients under drug use as well as some under placebo
  – Generally represented as a set of multivariate time series
• Evolving clinical trials network
  – Nodes representing patients
  – Correlations among patients modeled as edges
  – Edges change over time as correlations change
• Motivation: Use evolution of correlation to identify potential toxic effects of drugs

Stability Index
• Propensity of a node to interact with the same group of people over time
• Stability for a node over time incrementally computed based on the stability of the clusters it belongs to

Stability for Clinical Trials data
• Nodes with low Stability Index values represent patients with fluctuating correlation values (outliers)
• Null Hypothesis:
  – If the drug does not result in toxicity, then outliers are likely to be flagged at random from each group (drug and placebo).
• Experiment on clinical trials network for diabetes patients
  – 19 nodes (patients) found having Stability Index below threshold. 18 out of the 19 were on the drug!
  – The drug under study was discontinued due to possible toxic effects.

Sociability Index
• Incremental measure of the different interactions a node participates in
• Opposite of the Stability Index
• Does not represent degree!

Sociability Index for Community Prediction
• Goal: To identify future cluster co-occurrences based on history data for the DBLP dataset
• Key Intuition: If two authors have high sociability, and they have not yet collaborated (not been clustered together), there is a high chance they will.
• Setup: Use the data for 1997-2001 to predict cluster co-occurrences for 2002-2006

Experimental Results
• Comparison with other measures (Liben-Nowell and Kleinberg, CIKM 2003)
  – Common Neighbor
  – Adamic-Adar
  – Jaccard

Popularity Index
• Measure of attraction of nodes to a cluster
• Influence measure of a cluster
• Does not reflect the size of the cluster
• DBLP dataset
  – Can be used to identify hot topics
  – If a large number of nodes join a cluster and they are all working on a similar topic, it indicates a buzz around that topic for that year

Application of Popularity Index
• Example: XML
• Year 1999: 3 authors (XML and web applications)
• Year 2000: 50 joins
  – 30 of these authors published papers on XML

Influence Index
• Measure of influence of a node on others
• Influence in terms of participation in critical events
• Influence of a node initially computed as [equation not preserved]
• Follower nodes need to be pruned!
  unless [...]

Top Influential authors: DBLP dataset

Diffusion Models
• Study the spread of information in an evolving interaction network (Kempe et al, 2003, 2005)
  – Nodes activated with information
  – Newly activated nodes become contagious briefly
  – Information propagates through the network
  – Activation function maps weights of the links of a node to determine if it is activated
    • SUM Activation: If sum of weights > threshold, activate
    • MAX Activation: If any single weight > threshold, activate
[Figure: diffusion over time steps t1 to t4]

Diffusion Models: Influence Maximization
• Influence Maximization Problem: Find the initial set of nodes that can activate the largest number of nodes over a time period
  – Critical in applications such as viral marketing and for epidemiological research
  – Complicated in the case of dynamic interaction networks as the network changes over time
• Need for dynamic measures that reflect the current status of the network
  – Sociability Index used to weight links
    • Highly sociable nodes have high propensity to pass on information
  – Influence Index to determine initial set of active nodes
  – Comparison with random choice of nodes and degree-based selection (Wasserman and Faust, 1994)

Conclusions
• Most real-world graphs are dynamic in nature
  – Need for analysis, reasoning and inference
  – Proposed an event-based framework
    • Clusters to capture structure at different snapshots
    • Critical events over clusters to identify dynamic properties of graphs
    • Behavioral patterns incrementally composed from critical events
  – Proposed method useful in many application domains
    • Protein function prediction, drug design, recommender systems, viral marketing, epidemiology
[Figure: workflow: Temporal Snapshots -> Clustering -> Event Detection -> Behavioral Patterns -> Analysis and Inference]
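As an illustration of the SUM and MAX activation rules from the Diffusion Models slide, here is a minimal sketch. It assumes that the weights passed in are those of a node's links to currently active neighbours; the function names, the example weights and the threshold value are hypothetical and do not come from the talk.

```python
from typing import Iterable


def sum_activation(weights: Iterable[float], threshold: float) -> bool:
    """SUM rule: activate if the total link weight exceeds the threshold."""
    return sum(weights) > threshold


def max_activation(weights: Iterable[float], threshold: float) -> bool:
    """MAX rule: activate if any single link weight exceeds the threshold."""
    return any(w > threshold for w in weights)


# Hypothetical weights on a node's links to its active neighbours.
example_weights = [0.3, 0.4, 0.2]
example_threshold = 0.5

sum_result = sum_activation(example_weights, example_threshold)
max_result = max_activation(example_weights, example_threshold)
print(sum_result, max_result)
```

With these made-up numbers the two rules disagree: the combined weight passes the threshold while no single link does, which shows why the choice of activation function matters for how far information spreads.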
Future Directions
• Extensions to large interaction graphs
• Use of semantic information for reasoning and inference
  – Merge and Split events
    • If two clusters have high semantic similarity, probability of a Merge is high
  – Continue events
    • Track the evolution of topics
    • Sequences of Form, Continue, Continue …
• Multi-scale temporal modeling
  – Analyze snapshots of different granularity

Thanks!
• Poster # 36, this evening (Mon 13th Aug, 6:15 – 9:15 pm)
• This work was supported by the following grants:
  – DOE Early Career Principal Investigator Award No. DE-FG02-04ER25611
  – NSF CAREER Grant IIS-0347662
• Contacts:
  – Sitaram Asur: asur@cse.ohio-state.edu
  – Dr Srinivasan Parthasarathy: srini@cse.ohio-state.edu
  – Duygu Ucar: ucar@cse.ohio-state.edu
• Group Webpage: http://dmrl.cse.ohio-state.edu

Event Detection
[Figures: two backup slides on event detection]
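The bitwise Continue test from the Event Detection slide can be sketched as follows. This is only an illustration under stated assumptions: each row of the k × N binary membership matrix is stored as a Python integer bitmask, the helper names `membership_rows` and `continue_events` are hypothetical, and the two example snapshots are invented. The other events are not sketched because the slides do not give their definitions.

```python
from typing import Dict, List, Set, Tuple


def membership_rows(clusters: List[Set[str]], node_index: Dict[str, int]) -> List[int]:
    """Encode each cluster as one row of the binary membership matrix.

    A row is an int bitmask: bit n is set when node n belongs to the cluster.
    """
    rows = []
    for cluster in clusters:
        mask = 0
        for node in cluster:
            mask |= 1 << node_index[node]
        rows.append(mask)
    return rows


def continue_events(rows_i: List[int], rows_next: List[int]) -> List[Tuple[int, int]]:
    """Return (j, k) pairs where cluster j of snapshot i continues as cluster k.

    Continue holds when AND(row_j, row_k) == OR(row_j, row_k), i.e. both
    clusters contain exactly the same nodes. Empty rows are skipped.
    """
    events = []
    for j, a in enumerate(rows_i):
        for k, b in enumerate(rows_next):
            if a and (a & b) == (a | b):
                events.append((j, k))
    return events


# Invented example: seven nodes, two consecutive snapshots.
nodes = ["A", "B", "C", "D", "E", "F", "G"]
node_index = {name: i for i, name in enumerate(nodes)}
snapshot_1 = [{"A", "B", "C"}, {"D", "E"}, {"F", "G"}]
snapshot_2 = [{"D", "E"}, {"A", "B"}, {"C", "F", "G"}]

rows_1 = membership_rows(snapshot_1, node_index)
rows_2 = membership_rows(snapshot_2, node_index)
events = continue_events(rows_1, rows_2)
print(events)  # [(1, 0)]
```

In this example only the cluster {D, E} is unchanged, so the single Continue event pairs cluster 1 of the first snapshot with cluster 0 of the second.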