NETWORK MINING
Jerry Scripps

Overview
- What is network mining?
- Motivation
- Preliminaries: definitions, metrics, network types
- Network mining techniques

What is Network Mining?
- Sits at the intersection of statistics, mathematics, and computer science, drawing on data mining, pattern recognition, machine learning, social network analysis, and graph theory.

What is Network Mining? – Border Disciplines
- Statistics, computer science, physics, math, psychology, law enforcement, sociology, military, biology, medicine, chemistry, business

What is Network Mining? – Examples
- Discovering communities within collaboration networks
- Finding authoritative web pages on a given topic
- Selecting the most influential people in a social network

Network Mining – Motivation: Emerging Data Sets
- World wide web
- Social networking
- Collaboration databases
- Customer or employee sets
- Genomic data
- Terrorist sets
- Supply chains
- Many more…

Network Mining – Motivation: Direct Applications
- What is the community around msu.edu?
- What are the authoritative pages?
- Who has the most influence?
- Who is a likely member of a terrorist cell?
- Is this news story about crime, politics, or business?

Network Mining – Motivation: Indirect Applications
- Convert ordinary data sets into networks
- Integrate network mining techniques into other techniques

Preliminaries
- Definitions
- Metrics
- Network types

Definitions
- Community
- Node (vertex, point, object)
- Link (edge, arc)

Metrics
- Node: degree, closeness, betweenness, clustering coefficient
- Node pair: graph distance, min-cut, common neighbors, Jaccard's coefficient, Adamic/Adar, preferential attachment, Katz, hitting time, rooted PageRank, SimRank, bibliographic metrics
- Network: characteristic path length, clustering coefficient, min-cut, diameter

Network Types – Random

Network Types – Small World
[Figure: Watts & Strogatz rewiring spectrum – regular, small world, random]

Network Types – Scale-free (Barabási & Bonabeau)
[Figure: degree distributions of the GVSU Facebook network, linear and log scale; x-axis Degree, y-axis Counts]
- Degree follows a power law ~ 1/k^n
- Found in a wide variety of real-world networks

Network Recap

Network type   Clustering coefficient   Characteristic path length   Power law
Random         Low                      Low                          No
Regular        High                     High                         No
Small world    High                     Low                          ?
Scale-free     ?                        ?                            Yes

Techniques
- Link-Based Classification
- Link Prediction
- Ranking
- Influential Nodes
- Community Finding

Link-Based Classification
- Classify an unlabeled node ("?") by including features from linked objects
- Options: build a single model on all features, or fuse separate link and attribute models

Link-Based Classification (Chakrabarti et al.)
- Copying data from neighboring web pages actually reduced accuracy
- Using the label from a neighboring page improved accuracy
[Figure: small example graph with per-node feature vectors; one node unlabeled ("?"), its neighbors labeled A and B]

Link-Based Classification (Lu & Getoor)
- Define vectors for attributes and links:
  - attribute data OA(X)
  - link data LD(X), constructed using:
    - mode (single feature: the plurality class among the neighbors)
    - count (one feature per class: the count of neighbors with that class)
    - binary (one feature per class: 0/1 indicating whether such a neighbor exists)
[Figure: example node with attribute vector OA and count-based link vector LD feeding two separate models]

Link-Based Classification (Lu & Getoor)
- Define probabilities for both views:
  - attribute: P(c | w_o, OA(X)) = 1 / (1 + exp(-w_o^T OA(X) c))
  - link: P(c | w_l, LD(X)) = 1 / (1 + exp(-w_l^T LD(X) c))
- Class estimation: C_hat(X) = argmax_c P(c | w_o, OA(X)) · P(c | w_l, LD(X))
- (A small sketch of this combination follows this section.)

Collective Classification
- Uses both attributes and links
- Iteratively updates the unlabeled instances (message passing, loopy belief nets, etc.)
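To make the Lu & Getoor-style combination above concrete, here is a minimal Python sketch (not the authors' code): it builds a "count" link-feature vector for each node, fits separate logistic models on the attribute and link views, and multiplies their class probabilities. The toy graph, labels, and the use of scikit-learn are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# toy data: 6 nodes, 3 binary attributes each, class labels A/B, and an edge list
attrs = np.array([[1, 0, 1],
                  [1, 1, 0],
                  [0, 1, 1],
                  [0, 0, 1],
                  [1, 1, 1],
                  [0, 1, 0]])
labels = np.array(["A", "A", "B", "B", "A", "B"])
edges = [(0, 1), (0, 2), (1, 4), (2, 3), (3, 5), (4, 5)]
classes = ["A", "B"]

# adjacency sets for the undirected toy graph
neighbors = {i: set() for i in range(len(attrs))}
for u, v in edges:
    neighbors[u].add(v)
    neighbors[v].add(u)

def count_link_features(neighbor_labels):
    """'count' construction: one feature per class = number of neighbors with that class."""
    return [sum(1 for lbl in neighbor_labels if lbl == c) for c in classes]

link_feats = np.array([count_link_features([labels[j] for j in neighbors[i]])
                       for i in range(len(attrs))])

# separate logistic models for the attribute view OA(X) and the link view LD(X)
attr_model = LogisticRegression().fit(attrs, labels)
link_model = LogisticRegression().fit(link_feats, labels)

def classify(x_attr, x_link):
    """Combined estimate: argmax_c P(c | OA(X)) * P(c | LD(X))."""
    p_attr = attr_model.predict_proba([x_attr])[0]
    p_link = link_model.predict_proba([x_link])[0]
    return attr_model.classes_[np.argmax(p_attr * p_link)]

print(classify(attrs[0], link_feats[0]))
```

In practice the unlabeled node's link features would be recomputed as its neighbors' labels are estimated, which is what the collective-classification loop above iterates on.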
Link-Based Classification – Summary
- Using the class of neighbors improves accuracy
- Using separate models for attribute and link data further improves accuracy
- Other considerations:
  - improvements are possible by using community information
  - knowledge of the network type could also benefit the classifier

Techniques
- Link-Based Classification
- Link Prediction
- Ranking
- Influential Nodes
- Community Finding

Link Prediction (Liben-Nowell and Kleinberg)
- Tested node-pair metrics:
  - graph distance
  - neighborhood-based: common neighbors, Jaccard's coefficient, Adamic/Adar, preferential attachment
  - ensemble-of-paths: Katz, hitting time, rooted PageRank, SimRank

Link Prediction – Results
[Figure: comparison of the predictors' results]

Link Prediction – Newer Methods
- maximum likelihood
- stochastic block model
- probabilistic

Link Prediction – Summary
- There is room for growth: the best predictor has an accuracy of only around 9%
- Predicting collaborations is difficult
- A new problem could be to predict the direction of a link
- (A sketch of several neighborhood-based scores follows this section.)

Techniques
- Link-Based Classification
- Link Prediction
- Ranking
- Influential Nodes
- Community Finding
- Link Completion

Ranking – Markov Chain Based
- Random-surfer analogy
- Problem with cycles
- PageRank uses a random-jump vector

Ranking – Summary
- Other methods such as HITS and SALSA are also based on Markov chains
- Ranking has been applied in other areas: text summarization, anomaly detection
- (A PageRank power-iteration sketch follows this section.)

Techniques
- Link-Based Classification
- Link Prediction
- Ranking
- Influential Nodes
- Community Finding

Influence Maximization
- Problem: find the best nodes to activate
- Approaches:
  - degree: fast but not effective
  - greedy: effective but slow
  - improvements to greedy: degree heuristics and Shapley value; use communities
  - cost/benefit: probabilistic approach

Maximizing Influence – Model-Based
- Problem: find the k best nodes to activate so as to maximize the number of nodes activated
- Models:
  - independent cascade: when activated, a node has a one-time chance to activate each neighbor with probability p_ij
  - linear threshold: a node becomes activated when the fraction of its activated neighbors crosses a threshold

Maximizing Influence – Model-Based
- A spread function f: S → S* can be created using either model
- The functions are evaluated by Monte Carlo simulation and optimized by hill climbing
- The spread functions are submodular: f(S ∪ {v}) − f(S) ≥ f(T ∪ {v}) − f(T) where S ⊆ T
- The problem is proven in other work to be NP-complete, but a greedy hill-climbing solution gets within 1 − 1/e of the optimum
- (A greedy independent-cascade sketch follows this section.)

Maximizing Influence – Cost/Benefit
- Assumptions:
  - product x sells for $100
  - a discount of 10% can be offered to various prospective customers
  - if the customer purchases, the profit is 90 if the discount is offered and 100 if it is not
- Expected lift in profit (ELP) from offering the discount: 90·P(buy | discount) − 100·P(buy | no discount)

Maximizing Influence – Cost/Benefit
- ELP(X^k, Y, M) = Σ_{i=1..n} [ r_1·P(X_i = 1 | X^k, Y, f_i^1(M)) − r_0·P(X_i = 1 | X^k, Y, f_i^0(M)) ] − c
- Goal: find the marketing plan M that maximizes the global ELP
- Three approximations used: single pass, greedy, hill climbing
- Notation:
  - X_i is the decision of customer i to buy (X^k denotes the customers whose decisions are known)
  - Y is the vector of product attributes
  - M is the vector of marketing decisions
  - f_i^1(M) and f_i^0(M) set the i-th element of M to 1 or 0
  - r_1 and r_0 are the revenue gained with and without the discount
  - c is the cost of marketing

Comparison of Approaches

Approach       Size of starting set            Uses attributes   Probabilities
Cost/benefit   variable (based on max. lift)   yes               extracted from the data set
Model-based    fixed                           no                assigned to links

- An extension would be to spread influence across the largest number of communities
- Improvements can be made in speed
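As referenced in the link-prediction summary, below is a small sketch of a few of the neighborhood-based node-pair scores from the Liben-Nowell and Kleinberg list (common neighbors, Jaccard's coefficient, Adamic/Adar, preferential attachment). The toy graph and function names are illustrative assumptions, not material from the talk.

```python
import math

graph = {                       # adjacency sets for a small undirected toy graph
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "e"},
    "d": {"a"},
    "e": {"c"},
}

def common_neighbors(g, x, y):
    return len(g[x] & g[y])

def jaccard(g, x, y):
    union = g[x] | g[y]
    return len(g[x] & g[y]) / len(union) if union else 0.0

def adamic_adar(g, x, y):
    # weight each shared neighbor inversely by the log of its degree
    return sum(1.0 / math.log(len(g[z])) for z in g[x] & g[y] if len(g[z]) > 1)

def preferential_attachment(g, x, y):
    return len(g[x]) * len(g[y])

for name, score in [("common neighbors", common_neighbors),
                    ("Jaccard", jaccard),
                    ("Adamic/Adar", adamic_adar),
                    ("pref. attachment", preferential_attachment)]:
    print(name, score(graph, "b", "e"))
```

A link predictor then ranks all non-adjacent pairs by one of these scores and proposes the top-ranked pairs as future links.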
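For the ranking section, this is a minimal power-iteration sketch of PageRank with a uniform random-jump vector, which is how the random-surfer model escapes cycles and dead ends. The toy link structure and the damping factor of 0.85 are assumptions chosen for illustration.

```python
def pagerank(out_links, d=0.85, iterations=50):
    """Power iteration for PageRank; out_links maps each node to the set of nodes it links to."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # mass from dangling nodes (no out-links) is spread uniformly
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {}
        for v in nodes:
            incoming = sum(rank[u] / len(out_links[u])
                           for u in nodes if v in out_links[u])
            new_rank[v] = (1 - d) / n + d * (incoming + dangling / n)
        rank = new_rank
    return rank

web = {"p1": {"p2", "p3"}, "p2": {"p3"}, "p3": {"p1"}, "p4": {"p3"}}
for page, score in sorted(pagerank(web).items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```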
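For the model-based influence-maximization discussion, the sketch below shows greedy hill climbing under the independent cascade model, estimating the spread f(S) by Monte Carlo simulation. The toy graph, the uniform activation probability p_ij = 0.2, and the simulation count are illustrative assumptions.

```python
import random

def simulate_cascade(graph, seeds, p=0.2):
    """One independent-cascade run: each activated node gets one chance to activate each neighbor."""
    active, frontier = set(seeds), list(seeds)
    while frontier:
        node = frontier.pop()
        for nbr in graph[node]:
            if nbr not in active and random.random() < p:
                active.add(nbr)
                frontier.append(nbr)
    return len(active)

def estimate_spread(graph, seeds, p=0.2, runs=200):
    return sum(simulate_cascade(graph, seeds, p) for _ in range(runs)) / runs

def greedy_seeds(graph, k, p=0.2, runs=200):
    """Each round, add the node that gives the largest estimated spread of S ∪ {v}."""
    seeds = set()
    for _ in range(k):
        best_node, best_spread = None, -1.0
        for v in graph:
            if v in seeds:
                continue
            spread = estimate_spread(graph, seeds | {v}, p, runs)
            if spread > best_spread:
                best_node, best_spread = v, spread
        seeds.add(best_node)
    return seeds

graph = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 4], 3: [1, 5], 4: [2, 5], 5: [3, 4]}
print(greedy_seeds(graph, k=2))
```

Because the spread function is submodular, this greedy loop is the hill-climbing procedure that carries the 1 − 1/e approximation guarantee mentioned above.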
Techniques
- Link-Based Classification
- Link Prediction
- Ranking
- Influential Nodes
- Community Finding

Communities (Gibson, Kleinberg and Raghavan)
- Send a query to a search engine to obtain a root set
- Base set: add forward and back links
- Use HITS to find the top 10 hubs and authorities

Communities (Flake, Lawrence and Giles)
- Uses min-cut
- Start with a seed set, add linked nodes, and find nodes from outgoing links
- Create a virtual source node; add a virtual sink linking it to all nodes
- Find the min-cut between the virtual source and sink

Community Finding
- Girvan and Newman: repeatedly remove the edges with the highest betweenness
- Clauset et al.: agglomerative, uses modularity
- Shi & Malik: spectral clustering
- (A Girvan-Newman-style sketch follows this section.)

Communities – Summary
- There are many options for building communities around a small group of nodes
- Possible future directions:
  - finding communities in networks having different link types
  - impact of network type on community-finding techniques
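As a companion to the community-finding slide, here is a small Girvan-Newman-style sketch that repeatedly removes the edge with the highest betweenness until the graph first splits, using networkx and its built-in karate-club example. Stopping at the first split is a simplification chosen for illustration, not a rule from the talk.

```python
import networkx as nx

def first_split_communities(graph):
    """Remove the highest-betweenness edge until the graph breaks into two components."""
    g = graph.copy()
    while nx.number_connected_components(g) == 1:
        betweenness = nx.edge_betweenness_centrality(g)
        u, v = max(betweenness, key=betweenness.get)  # the most "between" edge
        g.remove_edge(u, v)
    return list(nx.connected_components(g))

G = nx.karate_club_graph()
for i, community in enumerate(first_split_communities(G)):
    print(f"community {i}: {sorted(community)}")
```

Continuing the removals and keeping the partition with the best modularity would give the full Girvan-Newman dendrogram rather than just the first split.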