Co-evolution of network structure and content
Lada Adamic
School of Information & Center for the Study of Complex Systems
University of Michigan

Outline
Co-evolution of network structure and content
  Can the structure of Twitter and virtual world interactions reveal something about their content?
    http://arxiv.org/abs/1107.5543
  Can the structure of a commodity futures trading network reveal something about information flowing into the market?
    http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1361184

What is the relationship between network structure and information diffusion?
  Is information flowing over the network?
  Or is information shaping the network?

Can the shape of the network reveal properties of information?
  Big news! Giant microbes!
  Little news. How's the weather?

Related work on time evolving graphs
  Densification over time (Leskovec et al. 2005)
  Community structure over time (Leicht et al. 2007, Mucha et al. 2010)
  Change in structure (ability to "compress" network) signals events (Graphscope by Sun et al. 2007)
  Disease propagation & timing (Moody 2002, Liljeros 2010)
  Enron email (B. Aven, 2011)

What's different here
  We look at network dynamics at relatively short time scales and construct time series
  A range of network metrics, instead of just community structure
  Information novelty and diversity, as opposed to tracking single events / pieces of information

Can the network reveal…
  If everyone is talking about the same thing, or if there is just background chatter?
  If what they are talking about is novel?

1st context: virtual worlds
  Networks: asset transfers (gestures, landmarks) and transactions (e.g.
rent, object purchases)
  Content: assets being transferred
  Study transfers in the context of the 100 groups with the highest numbers of transfers

Second context: Twitter
  Microblogging: up to 140 characters per tweet
  Network links read from tweets
    Reply or mention: putting @ in front of the username
    Retweet: repeating something someone else wrote on Twitter, preceded by the letters RT and @ in front of their username

Selecting Twitter communities to track
  http://wefollow.com/twitter/researcher
  For each "researcher", gather tweets of the accounts they follow

Highly dynamic networks
  Segmentation:
    Twitter: every 800 tweets, median segment duration 1.5 days
    SecondLife: every 50 asset transfers, median segment duration 8.4 days
  [Figure: percentage of repeated edges (y-axis, 0.00 to 0.25) vs. number of segments elapsed (x-axis, 1 to 8), for SecondLife and Twitter]

Conductance: capturing potential for information flow
  [Figure: three example networks connecting nodes A and B, with low, medium, and high conductance]

  $$C_{ij} = \sum_{\text{paths } i \to j} \; \prod_{\text{edges } (k,l) \text{ on path}} \frac{w_{kl}}{\deg(k)}$$

  Temporal conductance (summed over all pairs): high if pairs of nodes share edges, or many short, indirect paths
  Koren, North, Volinsky, KDD, 2006

Network expectedness
  Define expectedness: average conductance of all neighbor pairs at time t, based on the conductance of each pair at time t-1

  $$X_t = \frac{1}{E_t} \sum_{\text{edges } (i,j)} C_{ij}^{t-1}$$

  [Figure: example networks labeled "expected" and "unexpected"]

Conductance and expectedness as a toy network evolves
  Network configuration at t=0: conductance = 4
  Possible configurations at t=1:
    conductance = 4,   expectedness = 1.5,    edge jaccard = 1
    conductance = 4.5, expectedness = 1.3333, edge jaccard = 0.6667
    conductance = 6,   expectedness = 0.5,    edge jaccard = 0.25

SecondLife: network structure and content
  Standard network metrics are not indicative of information properties
  Conductance and expectedness are indicative
  [Figure; column labels: overlap (t-1,t), overlap (t,t+1), Δ diversity (t-1,t), Δ diversity (t,t+1)]

Conductance & diversity of information
  High conductance brings higher content diversity
  Repeat network patterns
bring less diversity and less novelty
  but… similarity and novelty are positively correlated (r = 0.19)

[Figure: social and transaction network of top sellers in SL]

Twitter: textual diversity and novelty
Semantic metrics
  Contemporary metrics (average cosine similarity of words in tweets), computed:
    between connected node pairs in the graph
    between indirectly-connected node pairs, i.e., non-neighbors with an undirected path of length > 1 between them
    between isolated pairs (in different components)
  Novelty metric (language model distance), computed:
    between two sets of tweets associated with Twitter networks captured at different times

Twitter: network structure and information diversity
  Rows: network structure at time T. Columns: content similarity by node-pair class.

  network structure (T)   all-pairs   unconnected   indirectly-connected   connected
  # nodes                  -0.584      -0.632         0.305                  0.030
  # edges                  -0.537      -0.601         0.348                  0.058
  reciprocity              -0.160      -0.179         0.176                  0.128
  clustering coef.         -0.198      -0.240         0.181                  0.030
  centralization           -0.121      -0.176         0.158                  0.062
  edge deg cor.             0.027      -0.155         0.113                  0.054
  av. degree               -0.287      -0.353         0.323                  0.093
  sd. degree               -0.212      -0.277         0.251                  0.048
  WCC size                  0.317       0.303        -0.126                  0.038
  conductance              -0.444      -0.506         0.369                  0.121
  expectedness             -0.145      -0.161         0.234                  0.092

Inferring Network Semantic Information
  Question: Does network structural information help improve prediction of the characteristics of the information exchanged?
  [Diagram: semantic variables + topological variables -> kernel regression prediction model -> semantic variables]

Example: Inferring the average similarity score between isolated pairs
  [Figure: R^2 in predicting the average similarity score (ASS) between isolated nodes (y-axis, 0 to 1); curves c1: X1 = {connected}, c2: X2 = {indirectly-connected}, c3: X3 = {# nodes}, c4: X4 = {# edges}]
  [Figure x-axis, variables in order: connected, indirectly-connected, # nodes, # edges, reciprocity, clustering coef., centralization, edge deg cor., av. degree, sd. degree, WCC size, conductance, expectedness]
  The input variables of curve ci start from Xi and grow at each step by adding the variable labeled on the x-axis.
  No need to use other textual variables (e.g. similarity between indirectly connected pairs) when sufficient topological information is available
  Reason: topological variables account for much of the pattern in the text!

Network structure and information novelty
  Greater novelty in edges corresponds to greater novelty in content shared
  For nodes that are interacting (citing or being cited): higher conductance and expectedness correlate with less information novelty

  network structure          LMdist_allNodes(T-1,T)   LMdist_NodesWithNeighbors(T-1,T)
  # nodes(T-1,T)              0.124                    -0.050
  # edges(T-1,T)              0.171                    -0.117
  reciprocity(T-1,T)          0.042                    -0.004
  clustering coef.(T-1,T)     0.149                    -0.197
  centralization(T-1,T)      -0.018                     0.038
  edge deg cor.(T-1,T)       -0.111                     0.101
  av. degree(T-1,T)           0.066                    -0.044
  sd. degree(T-1,T)           0.083                    -0.119
  WCC size(T-1,T)             0.085                    -0.101
  edge jaccard(T-1,T)        -0.233                    -0.230
  conductance(T-1,T)          0.202                    -0.225
  expectedness(T-1)           0.171                    -0.143
  expectedness(T)             0.093                    -0.273

Information in trading networks
  CFTC = Commodity Futures Trading Commission
  Stated mission: protect market users and the public from fraud, manipulation, and abusive practices
  Futures contracts started out as contracts for agricultural products, but expanded to more exotic contracts, including index futures
  Collaboration with Celso Brunetti, Jeff Harris, and Andrei Kirilenko
  http://papers.ssrn.com/sol3/papers.cfm?abstract_id=

Data
  6.3 million transactions in Aug. 2008 in the Sept.
E-mini S&P futures contract
  Price discovery for the index occurs mostly in this contract (Hasbrouck (2003))
  Data includes: date & time, executing broker, opposite broker, buy or sell, price, quantity
  Sample in transaction windows of 240 transactions
  [Diagram: executing broker -> opposite broker, quantity: 10, price: $171.25]
  [Figure: matching algorithm and limit order book. Sell orders: 10 contracts at $171.25, 5 contracts at $171.75, 20 contracts at $172.00. Buy orders: 50 contracts at $171.00, plus two orders of 20 and 30 contracts at $171.25 and $171.50 that are overprinted in the source]
  The trading network is not social, not intentional, not persistent

Financial variables
  Rate of return: last price to first price in logs (close-to-open)
  Volatility: range, the log difference between max and min price
  Duration: total period duration, the time in seconds between the start and end of each sampling period; a proxy for the arrival of new information
  Volume: trading volume, the number of contracts traded

What can we learn from network structure? e.g. centralization?
  [Figure: example networks with low in-centralization and high in-centralization; node labels: low outdegree, low indegree, high outdegree, high indegree]

Overview of network variables
  # nodes, # edges
  clustering coefficient, LSCC, reciprocity
  CEN = gini_{in-degree} - gini_{out-degree}
  INOUT = r(indegree of node, outdegree of same node)
  AI (asymmetric information)

Correlations between network and financial variables
  High centralization: market dominance, a dominant trader buys from many small sellers; low duration, low volume
  Negative assortativity: large sellers sell to small buyers and vice versa; low duration, higher volume
  High av. degree & largest strongly connected component: no news, many buyers and sellers; high duration, high volume
  Rate of return: positive correlation with centralization
  Volatility & duration: Volume: correlated with standard deviation of degree, average deg.
and the total number of edges (E). Correlated with a few network variables, sign varies.

Conclusion
  Network structure alone is revealing of the diversity and novelty of the information content being transmitted
  Results depend on the scope and relative position of the activity in the network

Future work
  Sensitivity to inclusion of non-interactive or across-community interactions
  Applying novelty & conductance metrics to financial time series
  Continuous formulation of novelty and other network metrics (because segmentation is problematic)
  Roles of individual nodes

Thanks:
  Edwin Teng
  Liuling Gong
  Avishay Livne
  Information Network Academic Research Center

Questions?
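
Supplementary sketch: the edge Jaccard and expectedness measures defined earlier in the talk can be illustrated in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation. It assumes unweighted, undirected edges (every w_kl = 1), takes E_t to be the number of edges at time t, gives a pair zero conductance when either node was absent at t-1, and truncates the path sum at a hypothetical max_len parameter that the slides do not specify. All function names are illustrative. The example graphs are made up, so the printed values are not those of the toy network in the slides.

```python
def edge_jaccard(edges_prev, edges_curr):
    """Jaccard overlap between two undirected edge sets."""
    prev = {frozenset(e) for e in edges_prev}
    curr = {frozenset(e) for e in edges_curr}
    union = prev | curr
    return len(prev & curr) / len(union) if union else 1.0


def adjacency(edges):
    """Build an undirected adjacency map {node: set of neighbors}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def conductance(adj, i, j, max_len=3):
    """Sum over simple paths i -> j (at most max_len edges) of the
    product of 1/deg(k) over the edges (k, l) on the path.
    The cutoff max_len is an assumption of this sketch."""
    if i not in adj or j not in adj:
        return 0.0
    total = 0.0
    # Each stack entry: (current node, visited nodes, path weight, edges used).
    stack = [(i, frozenset([i]), 1.0, 0)]
    while stack:
        node, visited, weight, length = stack.pop()
        if length == max_len:
            continue
        step = weight / len(adj[node])
        for nxt in adj[node]:
            if nxt == j:
                total += step  # a path ends as soon as it reaches j
            elif nxt not in visited:
                stack.append((nxt, visited | {nxt}, step, length + 1))
    return total


def expectedness(edges_prev, edges_curr, max_len=3):
    """Average, over edges (i, j) at time t, of the conductance
    of that pair in the network at time t-1."""
    edges_curr = list(edges_curr)
    if not edges_curr:
        return 0.0
    adj_prev = adjacency(edges_prev)
    total = sum(conductance(adj_prev, i, j, max_len) for i, j in edges_curr)
    return total / len(edges_curr)


# Hypothetical example: a triangle at t-1, then one repeated edge and one new edge.
prev = [("a", "b"), ("b", "c"), ("a", "c")]
curr = [("a", "b"), ("c", "d")]
print(edge_jaccard(prev, curr))   # 0.25
print(expectedness(prev, curr))   # 0.375
```

In the example, the repeated edge (a, b) has conductance 1/2 + 1/4 = 0.75 in the triangle (one direct path and one two-step path), while the new edge (c, d) contributes 0, so expectedness is 0.375. Note that the path product uses the degree of the node each edge leaves, so in general the value depends on the orientation in which a pair is given.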