Modelling Communication Dynamics in Social Networks Amber Tomas September 28, 2011 Key Points I Aim is to understand the factors influencing communication patterns I Model communications as events on edges of a network I Allow for factors influencing communication to change over time I Jointly model events on edges and nodes Questions: I What are the patterns of communication? I How can we explain them? Questions: I What are the patterns of communication? I How can we explain them? Statistical modelling allows for: I Analysis of behaviour (testing hypotheses) I Improved understanding of communication systems I Monitoring of communication patterns for changes Network Event Data At any time t I There exists a set of actors who may communicate I There exists a set of possible communications between pairs of actors An “event” is an instantaneous communication between two actors. Network Event Data Network Event Data Network Event Data Network Event Data Network Event Data Network Event Data Network Event Data Every event is recorded as a tuple (a, b, k, t): a − sender b − receiver k − type of event t − time of event Every event is recorded as a tuple (a, b, k, t): a − sender b − receiver k − type of event t − time of event A dataset consists of 1. A sequence of recorded events 2. Information about each of the actors Example - Enron Email Corpus I I 1.8 million emails sent from employees of Enron between November 1998 and September 2002 Data includes I I I I I I Time stamp sender recipients type of recipient (to, cc, bcc) subject message content Models for Event Data Butts (2008) proposed a “relational event framework” for modelling network event data. This models the rate of events between actors conditional on event history and actor characteristics. Events are assumed to be independent conditional on the past event history. Models for Event Data The likelihood of the event history up to the most recent event is given by ## " t " Y Y S(ti − ti−1 |i , e1:t−1 ) p(e1:t ) = h(ti |ei , e1:i ) × i=1 i Models for Event Data The likelihood of the event history up to the most recent event is given by ## " t " Y Y S(ti − ti−1 |i , e1:t−1 ) p(e1:t ) = h(ti |ei , e1:i ) × i=1 i It is assumed the hazard is constant given past history, so h(t) = λ S(∆t) = e −λ∆t Models for Event Data The rate function is modelled as n o λ(et , e1:t−1 , X , θ) = exp λ0 + θT u(et , e1:t−1 , X ) Models for Event Data The rate function is modelled as n o λ(et , e1:t−1 , X , θ) = exp λ0 + θT u(et , e1:t−1 , X ) Considering only the ordering of data leads to the likelihood p(e1:t |θ, X ) = t Y i=1 exp θT u(et , e1:t−1 , X ) P T i exp {θ u(i , e1:t−1 , X )} Models for Event Data The rate function is modelled as n o λ(et , e1:t−1 , X , θ) = exp λ0 + θT u(et , e1:t−1 , X ) Considering only the ordering of data leads to the likelihood p(e1:t |θ, X ) = t Y i=1 exp θT u(et , e1:t−1 , X ) P T i exp {θ u(i , e1:t−1 , X )} An important aspect of fitting this model is choosing the statistics u(et , e1:t−1 , X ). Choice of Statistics 1. Characteristics of sender a and receiver b 2. Tendency for a to send messages 3. Tendency for b to receive messages 4. Tendency for communications between a and b 5. Communications involving joint contacts of a and b Choice of Statistics 1. Characteristics of sender a and receiver b 2. Tendency for a to send messages (node) 3. Tendency for b to receive messages (node) 4. Tendency for communications between a and b (dyad) 5. Communications involving joint contacts of a and b (triad etc) Including network statistics allows for incorporation of complex dependencies. Some Possible Statistics To simplify computation, an event network is used to compute statistics. The weight of an edge is an exponentially weighted aggregation of past events (Brandes et al). Some Possible Statistics Model Fitting Model fitting can proceed as follows: 1. For every event 1.1 calculate relevant statistics 1.2 calculate contribution to likelihood 1.3 Update weighted event network 2. Estimate parameters using maximum likelihood Model Fitting Model fitting can proceed as follows: 1. For every event 1.1 calculate relevant statistics 1.2 calculate contribution to likelihood 1.3 Update weighted event network 2. Estimate parameters using maximum likelihood Alternatively, it is natural to update the parameter estimates sequentially. Model Fitting The main difficulty is in specifying appropriate statistics, and correctly interpreting the fitted model. Example - Enron Data ENRON Example - Enron Data ENRON SUCCESSFUL COMPANY Example - Enron Data ENRON SUCCESSFUL COMPANY REALITY Example - Enron Data ENRON SUCCESSFUL COMPANY REALITY Accounting Shenanigans Example - Enron Data Considered a subset of email messages I Only consider addresses ending in @enron.com I Only consider messages sent to <= 5 receivers I Only consider actors who send and receive at least 2 messages I Leaves 134 actors Aggregated Data Aggregated Data Aggregated Data Aggregated Data 800 400 600 count 300 200 0 200 100 0 count 400 1000 Rate of Sending 1 4 7 10 14 time 18 22 M T W T day F S S 60 40 20 0 receiver 80 100 120 Candidate Events 0 200 400 600 800 time (days) 1000 1200 Choice of Statistics ●● 300 ● Mark Taylor 200 ● 50 100 ● ● ●● ● ●●●● ● ●● ●● ●● ●● ● ●● ● ●● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ●●● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● 0 ● John Lavorato ● ● 0 number sent 400 Tana Jones Jeff Dasovich Richard Shapiro ● ● ● ● ● ● 100 ● Louise Kitchen ● 150 200 250 number received 300 350 Choice of Statistics ●● 300 ● Mark Taylor 200 ● 50 0 ● ● ●● ● ●●●● ● ●● ●● ●● ●● ● ●● ● ●● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ●●● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● 0 ● John Lavorato ● ● 100 number sent 400 Tana Jones Jeff Dasovich Richard Shapiro ● ● ● ● ● ● 100 ● Louise Kitchen ● 150 200 250 300 350 number received Need to control for actor tendency to send/receive messages Choice of Statistics I “Inertia” wab I “Reciprocity” wba I Sender “out-degree” I Triad effects: P h a wah h b Transitivity h a h b Shared Triad a b Choice of Statistics I role (manager, lawyer, executive, other) I role homophily I new member 50 Results 0 10 20 E[βi] 30 40 inertia reciprocity sender.out.degree receiver.out.degree sender.in.degree receiver.in.degree 0 1000 2000 3000 time 4000 5000 6000 Results 0 10 20 E[βi] 30 40 50 inertia reciprocity sender.out.degree receiver.out.degree sender.in.degree receiver.in.degree 0 1000 2000 3000 time 4000 5000 6000 Results −10 0 10 E[βi] 20 30 inertia reciprocity sender.out.degree receiver.out.degree sender.in.degree receiver.in.degree transitivity shared.triad joint.triad 0 1000 2000 3000 time 4000 5000 6000 Example - Hospital Acquired Infections Data is available for a large hospital network, including I Patient ward transfers (time, from ward, to ward, patient condition) I Tests for hospital acquired infections (bug, time of test, time of result, result, patient info) Example - Hospital Acquired Infections Data is available for a large hospital network, including I Patient ward transfers (time, from ward, to ward, patient condition) I Tests for hospital acquired infections (bug, time of test, time of result, result, patient info) I 14 000 transfers (11 000 discharged or died) I 22 000 bug tests I 2 700 bug positives Example - Hospital Acquired Infections Data is available for a large hospital network, including I Patient ward transfers (time, from ward, to ward, patient condition) I Tests for hospital acquired infections (bug, time of test, time of result, result, patient info) I 14 000 transfers (11 000 discharged or died) I 22 000 bug tests I 2 700 bug positives We might be interested in the joint dynamics of infections and ward transfers. Hospital Data M K D F A J H O N C B E G P I L Hospital Data Can model the joint dynamics as p(e1:E , v1:V |θ, γ) = p(e1:E |θ)p(v1:V |γ), where p(v1:V |γ) = Y i exp γ T u 0 (et<i , vt<i ) P T 0 ν exp {γ u (et<i , νt<i )} Hospital Data Statistics might include I number recently admitted to ward I number on ward I infection history of ward I infection history of source wards I transfer patterns and rates between wards Conclusions Network event data is becoming more common. Analysis can help improve understanding of the factors driving the communication dynamic. Conclusions Network event data is becoming more common. Analysis can help improve understanding of the factors driving the communication dynamic. Possible applications include I Business and emergency communications I political dynamics I personal communications New studies should be designed with the possibilities for network analysis in mind. Network Models I The appropriate model for any data set will depend on what we are interested in. I Suppose interested in the probability of an edge between two actors, given personal characteristics, network data. The probability of a tie between actors A and B might depend on I I I which other ties are present (degree, reciprocity, transitivity, etc) actor characteristics (age, sex, role, etc) Network Models - ERGM P(X = x) ∝ exp{ X βk sk (x)}, k where β is a vector of parameters and s(x) is a vector of statistics. I implies edge probabilities I allows testing of hypotheses regarding significance of actor or network statistics I normalizing constant causes computational difficulties I requires a single observation of the network ERGMs and Event Data We could aggregate all events to produce a “communication network”, i.e. who had communications with whom at any point in time. This allows investigation of questions such as I Is transitivity a significant influence on the structure of the network? I How many contacts do people have? I Are communications reciprocated? I Are communications more likely among actors of similar gender/age.ethnicity etc? But this tells us nothing about the temporal aspects. Network Models - SAOMs Stochastic actor-oriented network models (SAOMs) have been developed to model longitudinal network data. We could aggregate event data in sequential time intervals to produce such data. I Allows testing of temporal hypotheses, eg did A become a manager because he was communicating with manager B, or did A only begin to communicate with B after obtaining a promotion independently? I Results may be sensitive to the periods of aggregation I Tells us little about commnuication rates or patterns over short time intervals.