Data Mining: Concepts and Techniques
Chapter 6: Network Mining (Social Networks)
Jianlin Cheng
Department of Computer Science
University of Missouri, Columbia
©2006 Jiawei Han and Micheline Kamber. All rights reserved.
Acknowledgements: Based on the slides by Sangkyum Kim and Chen Chen
March 19, 2016

Network Mining
- Social Network Introduction
- Statistics and Probability Theory
- Models of Social Network Generation
- Networks in Biological System
- Mining on Social Network
- Summary

Society
- Nodes: individuals
- Links: social relationships (family, work, friendship, etc.)
- S. Milgram (1967); “Six Degrees of Separation” (John Guare)
- Social networks: many individuals with diverse social interactions between them.

Communication networks
- The Earth is developing an electronic nervous system, a network whose diverse nodes and links include:
  - computers
  - phone lines
  - routers
  - TV cables
  - satellites
  - EM waves
- Communication networks: many non-identical components with diverse connections between them.

Complex systems
- Made of many non-identical elements connected by diverse interactions.
- NETWORK

“Natural” Networks and Universality
- Consider many kinds of networks: social, technological, business, economic, content, …
- These networks tend to share certain informal properties:
  - large scale; continual growth
  - distributed, democratic growth: vertices “decide” who to link to
  - mixture of local and long-distance connections
  - abstract notions of distance: geographical, content, social, …
- Do natural networks share more quantitative universals?
- What would these “universals” be?
- How can we make them precise and measure them?
- How can we explain their universality?
- This is the domain of social network theory
- Sometimes also referred to as link analysis

Some Interesting Quantities
- Connected components:
  - how many, and how large?
- Network diameter:
  - maximum (worst-case) or average?
  - exclude infinite distances? (disconnected components)
  - the small-world phenomenon
- Clustering:
  - to what extent do links tend to cluster “locally”?
  - what is the balance between local and long-distance connections?
  - what roles do the two types of links play?
- Degree distribution:
  - what is the typical degree in the network?
  - what is the overall distribution?

A “Canonical” Natural Network has…
- Few connected components:
  - often only 1 or a small number, independent of network size
- Small diameter:
  - often a constant independent of network size (like 6)
  - or perhaps growing only logarithmically with network size, or even shrinking?
  - typically exclude infinite distances
- A high degree of clustering:
  - considerably more so than for a random network
  - in tension with small diameter
- A heavy-tailed degree distribution:
  - a small but reliable number of high-degree vertices
  - often of power law form

Probabilistic Models of Networks
- All of the network generation models we will study are probabilistic or statistical in nature
- They can generate networks of any size
- They often have various parameters that can be set:
  - size of network generated
  - average degree of a vertex
  - fraction of long-distance connections
- The models generate a distribution over networks
- Statements are always statistical in nature:
  - with high probability, diameter is small
  - on average, degree distribution has heavy tail
- Thus, we’re going to need some basic statistics and probability theory

Social Network Analysis: Social Network Introduction; Statistics and Probability Theory; Models of Social Network Generation; Networks in Biological System; Mining on Social Network; Summary

The Normal Distribution
- The normal or Gaussian density:
  - applies to continuous, real-valued random variables
  - characterized by mean (average) μ and standard deviation σ
  - density at x is defined as $f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$
  - peaks at x = μ, then dies off exponentially rapidly
  - the classic “bell-shaped curve”
  - examples: exam scores, human body temperature
- Remarks:
  - can control mean and standard deviation independently
  - can make as “broad” as we like, but always have finite variance

[Figure: The Normal Distribution]

The Binomial Distribution
- coin with Pr[heads] = p, flip n times
- probability of getting exactly k heads: $\binom{n}{k} p^k (1-p)^{n-k}$
- for large n and p fixed:
  - approximated well by a normal with μ = np, σ = sqrt(np(1-p))
  - σ/μ → 0 as n grows
  - leads to strong large deviation bounds

[Figure: The Binomial Distribution. How many wins in 21 bets? Source: www.professionalgambler.com/binomial.html]

The Poisson Distribution
- like the binomial, applies to variables taking on nonnegative integer values (0, 1, 2, …)
- often used to model counts of events:
  - number of phone calls placed in a given time period
  - number of times a neuron fires in a given time period
- single free parameter λ
- probability of exactly x events: $e^{-\lambda} \lambda^x / x!$
- mean and variance are both λ
- binomial distribution with n large, p = λ/n (λ fixed) converges to Poisson with mean λ

[Figure: The Poisson Distribution. Single photoelectron distribution]

[Figures: Heavy-tailed Distributions]

Distributions vs. Data
- All these distributions are idealized models
- In practice, we do not see distributions, but data
- Thus, there will be some largest value we observe
- Also, it can be difficult to “eyeball” data and choose a model
- So how do we distinguish between Poisson, power law, etc.?
- Typical procedure:
  - might restrict our attention to a range of values of interest
  - accumulate counts of observed data into equal-sized bins
  - look at counts on a log-log plot
- Note that:
  - power law: log(Pr[X = x]) = log(1/x^α) = -α log(x); linear, slope -α
  - Normal: log(Pr[X = x]) = log(a exp(-x^2/b)) = log(a) - x^2/b; non-linear, concave near mean
  - Poisson: log(Pr[X = x]) = log(exp(-λ) λ^x / x!); also non-linear

Zipf’s Law
- Look at the frequency of English words:
  - “the” is the most common, followed by “of”, “to”, etc.
  - claim: frequency of the n-th most common ~ 1/n (power law, α = 1)
- General theme:
  - rank events by their frequency of occurrence
  - resulting distribution often is a power law!
- Other examples:
  - North American city sizes
  - personal income
  - file sizes

Zipf’s Law
The same data plotted on linear and logarithmic scales.
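As a rough numerical counterpart to the log-log procedure described above, the following sketch fits a straight line to log(frequency) against log(rank). It uses idealized Zipf frequencies (exactly 1/n, not real word counts), and the helper name `loglog_slope` and the Poisson parameter λ = 5 are assumptions made for illustration only.

```python
import math

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx = sum(lx) / n
    my = sum(ly) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    var = sum((a - mx) ** 2 for a in lx)
    return cov / var

# Idealized Zipf data: frequency of the n-th ranked item is 1/n (alpha = 1).
ranks = list(range(1, 301))
zipf = [1.0 / n for n in ranks]
zipf_slope = loglog_slope(ranks, zipf)      # a power law is a line of slope -alpha

# Poisson probabilities (lambda = 5 is an arbitrary choice): not linear on a
# log-log plot, so the fitted slope changes from one range of x to another.
lam = 5.0
xs = list(range(1, 21))
poisson = [math.exp(-lam) * lam ** x / math.factorial(x) for x in xs]
slope_low = loglog_slope(xs[:5], poisson[:5])     # x = 1..5, rising part
slope_high = loglog_slope(xs[9:], poisson[9:])    # x = 10..20, falling tail
```

A single fitted slope is only a first check; on real data the binning and the range of values chosen can change the estimate noticeably.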
Both plots show a Zipf distribution with 300 data points.
[Figure panels: linear scales on both axes; logarithmic scales on both axes]

Social Network Analysis: Social Network Introduction; Statistics and Probability Theory; Models of Social Network Generation; Networks in Biological System; Mining on Social Network; Summary

Some Models of Network Generation
- Random graphs (Erdős-Rényi models):
  - gives few components and small diameter
  - does not give high clustering and heavy-tailed degree distributions
  - is the mathematically most well-studied and understood model
- Watts-Strogatz models:
  - give few components, small diameter and high clustering
  - does not give heavy-tailed degree distributions
- Scale-free Networks:
  - gives few components, small diameter and heavy-tailed distribution
  - does not give high clustering
- Hierarchical networks:
  - few components, small diameter, high clustering, heavy-tailed
- Affiliation networks:
  - models group-actor formation

Models of Social Network Generation
- Random Graphs (Erdős-Rényi models)
- Watts-Strogatz models
- Scale-free Networks

The Erdős-Rényi (ER) Model (Random Graphs)
- All edges are equally probable and appear independently
- Network size N > 1 and probability p: distribution G(N,p)
  - each edge (u,v) chosen to appear with probability p
  - N(N-1)/2 trials of a biased coin flip
- The usual regime of interest is when p ~ 1/N, N is large
  - e.g. p = 1/(2N), p = 1/N, p = 2/N, p = 10/N, p = log(N)/N, etc.
  - in expectation, each vertex will have a “small” number of neighbors
  - will then examine what happens when N → infinity
  - can thus study properties of large networks with bounded degree
- Degree distribution of a typical G drawn from G(N,p):
  - draw G according to G(N,p); look at a random vertex u in G
  - what is Pr[deg(u) = k] for any fixed k?
  - Poisson distribution with mean degree λ = p(N-1) ~ pN
  - sharply concentrated; not heavy-tailed
- Especially easy to generate networks from G(N,p)

Erdős-Rényi Model (1960)
- Connect with probability p
- Example: p = 1/6, N = 10, mean degree k ~ 1.5
- Poisson degree distribution
- Democratic
- Random
[Photo: Pál Erdős (1913-1996)]

A Closely Related Model
- For any fixed m <= N(N-1)/2, define distribution G(N,m):
  - choose uniformly at random from all graphs with exactly m edges
  - G(N,m) is “like” G(N,p) with p = m/(N(N-1)/2) ~ 2m/N^2
  - this intuition can be made precise, and is correct
  - if m = cN then p = 2c/(N-1) ~ 2c/N
  - mathematically trickier than G(N,p)

Another Closely Related Model
- Graph process model:
  - start with N vertices and no edges
  - at each time step, add a new edge
  - choose new edge randomly from among all missing edges
- Allows study of the evolution or emergence of properties:
  - as the number of edges m grows in relation to N
  - equivalently, as p is increased

Evolution of a Random Network
- We have a large number n of vertices
- We start randomly adding edges one at a time
- At what time t will the network:
  - have at least one “large” connected component?
  - have a single connected component?
  - have “small” diameter?
  - have a “large” clique?
- How gradually or suddenly do these properties appear?

Combining and Formalizing Familiar Ideas
- Explaining universal behavior through statistical models
  - our models will always generate many networks
  - almost all of them will share certain properties (universals)
- Explaining tipping through incremental growth
  - we gradually add edges, or gradually increase edge probability p
  - many properties will emerge very suddenly during this process
[Figures: crime rate vs. size of police force; probability that the network is connected vs. number of edges]

So Which Properties Tip?
- Just about all of them!
- The following properties all have threshold functions:
  - having a “giant component”
  - being connected
  - having a perfect matching (N even)
  - having “small” diameter
- With remarkable consistency (N = 50):
  - giant component (~ 40 edges), connected (~ 100 edges), small diameter (~ 180 edges)

[Figure: Ever More Precise…]

Erdős-Rényi Summary
- A model in which all connections are equally likely
  - each of the N(N-1)/2 edges chosen randomly and independently
- As we add edges, a precise sequence of events unfolds:
  - graph acquires a giant component
  - graph becomes connected
  - graph acquires small diameter
- Many properties appear very suddenly (tipping, thresholds)
- All statements are mathematically precise
- But is this how natural networks form?
- If not, which aspects are unrealistic?
  - maybe all edges are not equally likely!

The Clustering Coefficient of a Network
- Let nbr(u) denote the set of neighbors of u in a graph
  - all vertices v such that the edge (u,v) is in the graph
- The clustering coefficient of u:
  - let k = |nbr(u)| (i.e., number of neighbors of u)
  - choose(k,2) = k*(k-1)/2: max possible # of edges between vertices in nbr(u)
  - c(u) = (actual # of edges between vertices in nbr(u))/choose(k,2)
  - 0 <= c(u) <= 1; measure of cliquishness of u’s neighborhood
- Clustering coefficient of a graph:
  - average of c(u) over all vertices u
- Example [figure: vertex u with k = 4 neighbors]: choose(k,2) = 6, c(u) = 4/6 = 0.666…

The Clustering Coefficient of a Network
- Clustering: my friends will likely know each other!
- Probability to be connected: C >> p
- C = (# of links between the n neighbors) / (n(n-1)/2)
- L: diameter?
- Networks are clustered [large C(p)] but have a small characteristic path length [small L(p)].
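The definition of c(u) above can be sketched directly in code. The example graph below is hypothetical: it is built only to match the slide's numbers (k = 4 neighbors, 4 of the 6 possible edges present). The G(N,p) sampler follows the model definition given earlier; the function names, the seed, and the convention c(u) = 0 when k < 2 are assumptions of this sketch.

```python
import random
from itertools import combinations

def from_edges(vertices, edges):
    """Build an undirected adjacency map {vertex: set of neighbors}."""
    adj = {v: set() for v in vertices}
    for x, y in edges:
        adj[x].add(y)
        adj[y].add(x)
    return adj

def clustering_coefficient(adj, u):
    """c(u) = (# edges among neighbors of u) / choose(k, 2), with k = |nbr(u)|."""
    nbrs = adj[u]
    k = len(nbrs)
    if k < 2:
        return 0.0  # assumed convention: choose(k, 2) = 0 leaves c(u) undefined
    links = sum(1 for v, w in combinations(nbrs, 2) if w in adj[v])
    return links / (k * (k - 1) / 2)

def average_clustering(adj):
    """Clustering coefficient of the graph: average of c(u) over all vertices."""
    return sum(clustering_coefficient(adj, u) for u in adj) / len(adj)

def erdos_renyi(n, p, rng):
    """Sample G(N, p): each of the N(N-1)/2 edges appears independently."""
    verts = range(n)
    edges = [(u, v) for u, v in combinations(verts, 2) if rng.random() < p]
    return from_edges(verts, edges)

# Hypothetical graph: u has neighbors a, b, c, d; 4 of the 6 possible edges exist.
example = from_edges("uabcd", [("u", "a"), ("u", "b"), ("u", "c"), ("u", "d"),
                               ("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")])
c_u = clustering_coefficient(example, "u")        # 4/6

# Random-graph baseline: the average clustering should be close to the density p.
g = erdos_renyi(200, 0.1, random.Random(0))
c_random = average_clustering(g)
```

Comparing a measured C against the value from a random graph of the same density is the same comparison that the C versus Crand columns of a clustering table make.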
Network        C          Crand     L          N
WWW            0.1078     0.00023   3.1        153127
Internet       0.18-0.3   0.001     3.7-3.76   3015-6209
Actor          0.79       0.00027   3.65       225226
Coauthorship   0.43       0.00018   5.9        52909
Metabolic      0.32       0.026     2.9        282
Foodweb        0.22       0.06      2.43       134
C. elegans     0.28       0.05      2.65       282

Erdős-Rényi: Clustering Coefficient
- Generate a network G according to G(N,p)
- Examine a “typical” vertex u in G
  - choose u at random among all vertices in G
  - what do we expect c(u) to be?
- Answer: exactly p.
- In G(N,m), expect c(u) to be 2m/(N(N-1))
- Both cases: c(u) entirely determined by overall density
- Baseline for comparison with “more clustered” models
  - Erdős-Rényi has no bias towards clustered or local edges

Models of Social Network Generation
- Random Graphs (Erdős-Rényi models)
- Watts-Strogatz models
- Scale-free Networks

Caveman and Solaria
- Erdős-Rényi:
  - sharing a common neighbor makes two vertices no more likely to be directly connected than two very “distant” vertices
  - every edge appears entirely independently of existing structure
- But in many settings, the opposite is true:
  - you tend to meet new friends through your old friends
  - two web pages pointing to a third might share a topic
  - two companies selling goods to a third are in related industries
- Watts’ Caveman world:
  - overall density of edges is low
  - but two vertices with a common neighbor are likely connected

Making it (Somewhat) Precise: the α-model
- The α-model has the following parameters or “knobs”:
  - N: size of the network to be generated
  - k: the average degree of a vertex in the network to be generated
  - p: the default probability two vertices are connected
  - α: adjustable parameter dictating bias towards local connections
- For any vertices u and v:
  - define m(u,v) to be the number of common neighbors (so far)
- Key quantity: the propensity R(u,v) of u to connect to v
  - if m(u,v) >= k, R(u,v) = 1 (share too many friends not to connect)
  - if m(u,v) = 0, R(u,v) = p (no mutual friends, no bias to connect)
  - else, R(u,v) = p + (m(u,v)/k)^α (1-p)
- Generate the network incrementally using R(u,v) as the edge probability; details omitted
- Note: α = infinity is “like” Erdős-Rényi (but not exactly)

Small Worlds and Occam’s Razor
- For small α, should generate large clustering coefficients
  - we “programmed” the model to do so
  - Watts claims that proving precise statements is hard…
- But we do not want a new model for every little property
  - Erdős-Rényi → small diameter
  - α-model → high clustering coefficient
- In the interests of Occam’s Razor, we would like to find
  - a single, simple model of network generation…
  - … that simultaneously captures many properties
- Watts’ small world: small diameter and high clustering

Meanwhile, Back in the Real World…
- Watts examines three real networks as case studies:
  - the Kevin Bacon graph
  - the Western states power grid
  - the C. elegans nervous system
- For each of these networks, he:
  - computes its size, diameter, and clustering coefficient
  - compares diameter and clustering to the best Erdős-Rényi approximation
  - shows that the best α-model approximation is better
  - important to be “fair” to each model by finding best fit
- Overall moral:
  - if we care only about diameter and clustering, α is better than p

Case 1: Kevin Bacon Graph
- Vertices: actors and actresses
- Edge between u and v if they appeared in a film together
- Kevin Bacon:
  - No. of movies: 46
  - No. of actors: 1811
  - Average separation: 2.79
- Is Kevin Bacon the most connected actor? NO!

Rank  Name               Average distance  # of movies  # of links
1     Rod Steiger        2.537527          112          2562
2     Donald Pleasence   2.542376          180          2874
3     Martin Sheen       2.551210          136          3501
4     Christopher Lee    2.552497          201          2993
5     Robert Mitchum     2.557181          136          2905
6     Charlton Heston    2.566284          104          2552
7     Eddie Albert       2.567036          112          3333
8     Robert Vaughn      2.570193          126          2761
9     Donald Sutherland  2.577880          107          2865
10    John Gielgud       2.578980          122          2942
11    Anthony Quinn      2.579750          146          2978
12    James Earl Jones   2.584440          112          3787
…
876   Kevin Bacon        2.786981          46           1811

[Figure: #1 Rod Steiger; #2 Donald Pleasence; #3 Martin Sheen; #876 Kevin Bacon]

Case 2: New York State Power Grid
- Vertices: generators and substations
- Edges: high-voltage power transmission lines and transformers
- Line thickness and color indicate the voltage level
  - red 765 kV, 500 kV; brown 345 kV; green 230 kV; grey 138 kV

Case 3: C. Elegans Nervous System
- Vertices: neurons in the C. elegans worm
- Edges: axons/synapses between neurons

Two More Examples
- M. Newman on scientific collaboration networks
  - coauthorship networks in several distinct communities
  - differences in degrees (papers per author)
  - empirical verification of:
    - giant components
    - small diameter (mean distance)
    - high clustering coefficient
- Alberich et al. on the Marvel Universe
  - purely fictional social network
  - two characters linked if they appeared together in an issue
  - “empirical” verification of:
    - heavy-tailed distribution of degrees (issues and characters)
    - giant component
    - rather small clustering coefficient

One More (Structural) Property…
- A properly tuned α-model can simultaneously explain
  - small diameter
  - high clustering coefficient
- But what about heavy-tailed degree distributions?
  - α-model and simple variants will not explain this
  - intuitively, no “bias” towards large degree
  - all vertices are created equal
- As always, we want a “natural” model

Models of Social Network Generation
- Random Graphs (Erdős-Rényi models)
- Watts-Strogatz models
- Scale-free Networks

World Wide Web
- Nodes: WWW documents
- Links: URL links
- 800 million documents (S. Lawrence, 1999)
- ROBOT: collects all URLs found in a document and follows them recursively
- R. Albert, H. Jeong, A.-L. Barabasi, Nature 401, 130 (1999)

World Wide Web
- Expected result:
  - ⟨k⟩ ~ 6
  - P(k=500) ~ 10^-99
  - N_WWW ~ 10^9, so N(k=500) ~ 10^-90
- Real result:
  - $P_{out}(k) \sim k^{-\gamma_{out}}$ with $\gamma_{out} = 2.45$
  - $P_{in}(k) \sim k^{-\gamma_{in}}$ with $\gamma_{in} = 2.1$
  - P(k=500) ~ 10^-6
  - N_WWW ~ 10^9, so N(k=500) ~ 10^3
- J. Kleinberg et al., Proceedings of the ICCC (1999)

World Wide Web
- [Figure: example graph on nodes 1-7 with shortest paths l_15 = 2 (path 1, 2, 5) and l_17 = 4 (path 1, 3, 4, 6, 7)]
- ⟨l⟩ = ?? where l is the length of shortest paths
- Finite size scaling: create a network with N nodes with P_in(k) and P_out(k)
  - ⟨l⟩ = 0.35 + 2.06 log(N)
- 19 degrees of separation
  - ⟨l⟩ based on 800 million webpages [S. Lawrence et al., Nature (1999)]
- R. Albert et al., Nature (1999) (nd.edu); A. Broder et al., WWW9 (2000) (IBM)

What does that mean?
- Poisson distribution: exponential network
- Power-law distribution: scale-free network

Scale-free Networks
- The number of nodes (N) is not fixed
  - networks continuously expand by the addition of new nodes
  - WWW: addition of new documents
  - Citation: publication of new papers
- The attachment is not uniform
  - a node is linked with higher probability to a node that already has a large number of links
  - WWW: new documents link to well-known sites (CNN, Yahoo, Google)
  - Citation: well-cited papers are more likely to be cited again

Scale-Free Networks
- Start with (say) two vertices connected by an edge
- For i = 3 to N:
  - for each 1 <= j < i, d(j) = degree of vertex j so far
  - let Z = Σ d(j) (sum of all degrees so far)
  - add new vertex i with k edges back to {1, …, i-1}:
    - i is connected back to j with probability d(j)/Z
- Vertices j with high degree are likely to get more links!
- “Rich get richer”
- Natural model for many processes:
  - hyperlinks on the web
  - new business and social contacts
  - transportation networks
- Generates a power law distribution of degrees
  - exponent depends on value of k

Scale-Free Networks
- Preferential attachment explains
  - heavy-tailed degree distributions
  - small diameter (~log(N), via “hubs”)
- Will not generate high clustering coefficient
  - no bias towards local connectivity, but towards hubs

Case 1: Internet Backbone
- Nodes: computers, routers
- Links: physical lines
- (Faloutsos, Faloutsos and Faloutsos, 1999)

Case 2: Actor Connectivity
- Nodes: actors
- Links: cast jointly
- N = 212,250 actors
- ⟨k⟩ = 28.78
- $P(k) \sim k^{-\gamma}$, γ = 2.3
- [Figure: example films Days of Thunder (1990), Far and Away (1992), Eyes Wide Shut (1999)]

Case 3: Science Citation Index
- Nodes: papers
- Links: citations
- $P(k) \sim k^{-\gamma}$ (γ = 3)
- (S. Redner, 1998)
- [Figure labels: 25; Witten-Sander PRL 1981; 1736 PRL papers (1988); 2212]

Case 4: Science Coauthorship
- Nodes: scientists (authors)
- Links: write paper together
- (Newman, 2000; H. Jeong et al., 2001)

Case 5: Food Web
- Nodes: trophic species
- Links: trophic interactions
- R. Sole (cond-mat/0011195); R.J. Williams, N.D. Martinez, Nature (2000)

Case 6: Sex-Web
- Nodes: people (females; males)
- Links: sexual relationships
- 4781 Swedes; ages 18-74; 59% response rate
- Liljeros et al., Nature 2001

Robustness of Random vs. Scale-Free Networks
- The accidental failure of a number of nodes in a random network can fracture the system into noncommunicating islands.
- Scale-free networks are more robust in the face of such failures.
- Scale-free networks are highly vulnerable to a coordinated attack against their hubs.

Social Network Analysis: Social Network Introduction; Statistics and Probability Theory; Models of Social Network Generation; Networks in Biological System; Mining on Social Network; Summary

Bio-Map
- GENOME
  - protein-gene interactions
- PROTEOME
  - protein-protein interactions
- METABOLISM
  - bio-chemical reactions
- [Figure: Citrate Cycle]

Metabolic Network
- METABOLISM: bio-chemical reactions
- [Figure: Citrate Cycle]

Metabolic Network
- Nodes: chemicals (substrates)
- Links: bio-chemical reactions

Metabolic Network
- [Figure panels: Archaea; Bacteria; Eukaryotes]
- Organisms from all three domains of life are scale-free networks!
- H. Jeong, B. Tombor, R. Albert, Z.N. Oltvai, and A.L.
  Barabasi, Nature 407, 651 (2000)

Bio-Map
- GENOME
  - protein-gene interactions
- PROTEOME
  - protein-protein interactions
- METABOLISM
  - bio-chemical reactions
- [Figure: Citrate Cycle]

Protein Network
- PROTEOME: protein-protein interactions

Yeast Protein Network
- Nodes: proteins
- Links: physical interactions (binding)
- P. Uetz et al., Nature 403, 623-7 (2000)

Topology of the Protein Network
- $P(k) \sim (k + k_0)^{-\gamma} \exp\left(-\frac{k + k_0}{k_\tau}\right)$
- H. Jeong, S.P. Mason, A.-L. Barabasi, Z.N. Oltvai, Nature 411, 41-42 (2001)

p53 Network
- Nature 408, 307 (2000): … “One way to understand the p53 network is to compare it to the Internet. The cell, like the Internet, appears to be a ‘scale-free network’.”

[Figure: p53 Network (mammals)]

Social Network Analysis: Social Network Introduction; Statistics and Probability Theory; Models of Social Network Generation; Networks in Biological System; Mining on Social Network; Summary

Information on the Social Network
- Heterogeneous, multi-relational data represented as a graph or network
  - Nodes are objects
    - may have different kinds of objects
    - objects have attributes
    - objects may have labels or classes
  - Edges are links
    - may have different kinds of links
    - links may have attributes
    - links may be directed, are not required to be binary
- Links represent relationships and interactions between objects: rich content for mining

What is New for Link Mining Here
- Traditional machine learning and data mining approaches assume:
  - a random sample of homogeneous objects from a single relation
- Real-world data sets:
  - multi-relational, heterogeneous and semi-structured
- Link Mining
  - newly emerging research area at the intersection of research in social network and link analysis, hypertext and web mining, graph mining, relational learning and inductive logic programming

A Taxonomy of Common Link Mining Tasks
- Object-Related Tasks
  - Link-based object ranking
  - Link-based object classification
  - Object clustering (group detection)
  - Object identification (entity resolution)
- Link-Related Tasks
  - Link prediction

What Is a Link in Link Mining?
- Link: relationship among data
- Two kinds of linked networks: homogeneous vs. heterogeneous
- Homogeneous networks
  - single object type and single link type
  - single model social networks (e.g., friends)
  - WWW: a collection of linked Web pages
- Heterogeneous networks
  - multiple object and link types
  - medical network: patients, doctors, diseases, contacts, treatments
  - bibliographic network: publications, authors, venues

Link-Based Object Ranking (LBR)
- LBR: exploit the link structure of a graph to order or prioritize the set of objects within the graph
  - focused on graphs with a single object type and a single link type
- This is a primary focus of the link analysis community
- Web information analysis
  - PageRank and HITS are typical LBR approaches
- In social network analysis (SNA), LBR is a core analysis task
  - objective: rank individuals in terms of “centrality”
  - degree centrality vs. eigenvector/power centrality
  - rank objects relative to one or more relevant objects in the graph vs. rank objects over time in dynamic graphs

PageRank: Capturing Page Popularity (Brin & Page’98)
- Intuitions
  - links are like citations in literature
  - a page that is cited often can be expected to be more useful in general
- PageRank is essentially “citation counting”, but improves over simple counting
  - considers “indirect citations” (being cited by a highly cited paper counts a lot…)
  - smoothing of citations (every page is assumed to have a non-zero citation count)
- PageRank can also be interpreted as random surfing (thus capturing popularity)

The PageRank Algorithm (Brin & Page’98)
- Random surfing model: at any page,
  - with prob. α, randomly jump to a page
  - with prob. (1 - α), randomly pick a link to follow
- [Figure: example graph of four pages d1-d4 with its 4 x 4 “transition matrix” M, whose entries are 0, 1/2 and 1]
- Update rule:
  - $p_{t+1}(d_i) = (1-\alpha) \sum_{d_j \in IN(d_i)} m_{ji}\, p_t(d_j) + \alpha \sum_k \frac{1}{N}\, p_t(d_k)$
  - the second term is the same as α/N (why?)
- Stationary (“stable”) distribution, so we ignore time:
  - $p(d_i) = \sum_k \left[\frac{\alpha}{N} + (1-\alpha)\, m_{ki}\right] p(d_k)$
  - in matrix form: $p = (\alpha I + (1-\alpha) M)^T p$, with $I_{ij} = 1/N$
- Initial value p(d) = 1/N; iterate until convergence

Link-Based Object Classification (LBC)
- Predicting the category of an object based on its attributes, its links and the attributes of linked objects
- Web: predict the category of a web page, based on words that occur on the page, links between pages, anchor text, HTML tags, etc.
- Citation: predict the topic of a paper, based on word occurrence, citations, co-citations
- Epidemics: predict disease type based on characteristics of the patients infected by the disease
- Communication: predict whether a communication contact is by email, phone call or mail

Group Detection
- Cluster the nodes in the graph into groups that share common characteristics
  - Web: identifying communities
  - Citation: identifying research communities

Entity Resolution
- Predicting when two objects are the same, based on their attributes and their links
- Also known as: deduplication, reference reconciliation, coreference resolution, object consolidation
- Applications
  - Web: predict when two sites are mirrors of each other
  - Citation: predicting when two citations are referring to the same paper
  - Epidemics: predicting when two disease strains are the same
  - Biology: learning when two names refer to the same protein

Entity Resolution Methods
- Earlier viewed as a pair-wise resolution problem: resolved based on the similarity of their attributes
- Importance of considering links
  - coauthor links in bib data, hierarchical links between spatial references, co-occurrence links between name references in documents
- Use of links in resolution
  - collective entity resolution: one resolution decision affects another if they are linked
  - propagating evidence over links in a dependency graph
  - probabilistic models interact with different entity recognition decisions

Link Prediction
- Predict whether a link exists between two entities, based on attributes and other observed links
- Applications
  - Web: predict if there will be a link between two pages
  - Citation: predicting if a paper will cite another paper
  - Epidemics: predicting who a patient’s contacts are
- Methods
  - often viewed as a binary classification problem
  - local conditional probability model, based on structural and attribute features
  - difficulty: sparseness of existing links
  - collective prediction, e.g., Markov random field model

Link Cardinality Estimation
- Predicting the number of links to an object
  - Web: predict the authority of a page based on the number of in-links; identifying hubs based on the number of out-links
  - Citation: predicting the impact of a paper based on the number of citations
  - Epidemics: predicting the number of people that will be infected based on the infectiousness of a disease
- Predicting the number of objects reached along a path from an object
  - Web: predicting the number of pages retrieved by crawling a site
  - Citation: predicting the number of citations of a particular author in a specific journal

Social Network Analysis: Social Network Introduction; Statistics and Probability Theory; Models of Social Network Generation; Networks in Biological System; Mining on Social Network; Summary

Ref: Mining on Social Networks
- D. Liben-Nowell and J. Kleinberg. The Link Prediction Problem for Social Networks. CIKM’03.
- P. Domingos and M. Richardson. Mining the Network Value of Customers. KDD’01.
- M. Richardson and P. Domingos. Mining Knowledge-Sharing Sites for Viral Marketing. KDD’02.
- D. Kempe, J. Kleinberg, and E. Tardos. Maximizing the Spread of Influence through a Social Network. KDD’03.
- P. Domingos. Mining Social Networks for Viral Marketing. IEEE Intelligent Systems, 20(1), 80-82, 2005.
- S. Brin and L. Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. WWW7.
- S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, S.R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Mining the Link Structure of the World Wide Web. IEEE Computer’99.
- D. Cai, X. He, J. Wen, and W. Ma. Block-level Link Analysis. SIGIR’2004.
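
Supplementary sketch: the random-surfer iteration from the PageRank slides (start from p(d) = 1/N and iterate until convergence) might look as follows in plain Python. The three-page graph, the jump probability 0.15, the tolerance, and the function name `pagerank` are all assumptions chosen for illustration; the sketch also assumes every page has at least one out-link, so each row of M sums to 1.

```python
def pagerank(M, alpha=0.15, tol=1e-12, max_iter=1000):
    """Iterate p <- (alpha*I + (1 - alpha)*M)^T p, where I_ij = 1/N.

    M[j][i] is the probability of following a link from page j to page i.
    """
    n = len(M)
    p = [1.0 / n] * n                          # initial value p(d) = 1/N
    for _ in range(max_iter):
        # The random-jump term is alpha/N because the entries of p sum to 1.
        new = [alpha / n + (1 - alpha) * sum(M[j][i] * p[j] for j in range(n))
               for i in range(n)]
        if sum(abs(a - b) for a, b in zip(new, p)) < tol:
            return new
        p = new
    return p

# Hypothetical graph: page 0 links to pages 1 and 2, page 1 links to page 2,
# and page 2 links back to page 0.
M = [[0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0]]
ranks = pagerank(M)
```

In this toy graph page 2 receives links from both other pages and ends up with the largest score, while page 1, cited only by half of page 0's links, ends up with the smallest.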