Mining social networks for knowledge management
Prabhakar Raghavan

Overview
• Sampling of social network studies
• The knowledge management challenge
  › Enterprise complications
• Extended social networks
  › Network and tensor-based models
• Power law behaviors
  › Why, and what they mean for mining
• A research agenda

Milgram’s experiments
• Began with volunteers from Omaha, NE.
• Each was asked to get a letter to a physician near Boston.
• A letter could be sent only to a first-name acquaintance, who would forward it in the same way, and so on.
• Median path length of successful deliveries was 6.
• Led to the famous “6 degrees of separation” folklore.

Email cliques: Schwartz/Wood
• Studied the email (sub)graph.
• Proposed metrics for identifying groups of people who share interests; cluster analysis.
• Qualitatively “good” results.
• Raised issues of ethical use of data and privacy.

Various other projects
• PHOAKS
  › Extracting heavily cited resources in newsgroups, etc.
• Call graphs
  › Discerning home, business and fax lines
  › Calling circles.
• Recommendation systems
  › Input: users’ product endorsements.
  › Output: product recommendations to each user.

Trawling bipartite cliques
• Take the (directed) Web link graph.
• Enumerate all (small) bipartite cliques.
[Figure: example bipartite clique; node labels Alice, Bob, AT&T, Sprint, MCI]
R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins (1998).

Insights from hubs
[Figure: node labels Authority, Hub]
Link-based hypothesis: a dense bipartite subgraph indicates a Web community.
[Figure: node labels Fans, Centers]

Communities from cores
• What is a “dense bipartite subgraph”?
• Define an (i,j)-core: a complete bipartite subgraph with i nodes, all of which point to each of j others.
• Enumerate (i,j)-cores for various small i,j.
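The (i,j)-core definition above can be illustrated with a brute-force sketch. This is not the trawling algorithm from the talk, which relies on elimination/generation pruning to scale to the Web graph; it only shows what counts as a core on a toy graph. The function name `enumerate_cores`, the dictionary representation of the link graph, and the extra node "Carol" are assumptions made for illustration.

```python
from itertools import combinations


def enumerate_cores(out_links, i, j):
    """Return all (i, j)-cores as (fans, centers) pairs of sorted tuples.

    out_links maps each node to the set of nodes it points to.
    Brute force: try every i-subset of candidate fans and take
    j-subsets of the nodes that all of them point to.
    """
    # A fan must point to at least j nodes (a simple pruning step).
    candidates = sorted(n for n, targets in out_links.items() if len(targets) >= j)
    cores = []
    for fans in combinations(candidates, i):
        # Centers must be pointed to by every fan.
        common = set.intersection(*(set(out_links[f]) for f in fans))
        common -= set(fans)  # keep the fan and center sides disjoint
        for centers in combinations(sorted(common), j):
            cores.append((fans, centers))
    return cores


# Toy directed graph; "Carol" is a hypothetical extra node.
toy_links = {
    "Alice": {"AT&T", "Sprint", "MCI"},
    "Bob": {"AT&T", "Sprint", "MCI"},
    "Carol": {"AT&T"},
}

cores = enumerate_cores(toy_links, 2, 3)
# One (2,3)-core: fans (Alice, Bob), centers (AT&T, MCI, Sprint).
print(cores)
```

The cost grows combinatorially with the number of candidate fans, which is why pruning matters at Web scale.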
[Figure: a (2,3) core, with Fans on one side and Centers on the other]

Results for cores
[Figure: number of cores found by Elimination/Generation, in thousands (vertical axis 0 to 100, horizontal axis 3 to 9); curves for i = 3, 4, 5, 6]
[Figure: number of cores found during postprocessing, in thousands (vertical axis 0 to 80, horizontal axis 3 to 9); curves for i = 3, 4]

Japanese Elementary Schools
Fans
• schools
• LINK Page-13
• 100 Schools Home Pages (English)
• K-12 from Japan 10/...rnet and Education )
• http://www...iglobe.ne.jp/~IKESAN
• Koulutus ja oppilaitokset
• TOYODA HOMEPAGE
• Education
• Cay's Homepage(Japanese)
• UNIVERSITY
• DRAGON97-TOP
• (further Japanese-language page titles, mostly elementary school home pages)
Centers
• The American School in Japan
• The Link Page
• Kids' Space
• KEIMEI GAKUEN Home Page ( Japanese )
• Shiranuma Home Page
• fuzoku-es.fukui-u.ac.jp
• welcome to Miasa E&J school
• http://www...p/~m_maru/index.html
• fukui haruyama-es HomePage
• Torisu primary school
• goo
• Yakumo Elementary,Hokkaido,Japan
• FUZOKU Home Page
• Kamishibun Elementary School...
• (further Japanese-language page titles, mostly elementary school home pages)

Yenta: Forman
• Analyzes documents “associated” with each user.
• Distils significant “interests” for each.
• Matches/clusters groups of users with overlapping interests.
• Decentralized; aims for privacy protection.
• Elements of peer-to-peer operation.

ReferralWeb: Kautz/Selman
• Establishes links between people, e.g.,
  › co-authorship
  › colleagues in an organization
• Allows search through this social network, e.g.,
  › find me someone within distance 2 who will referee a paper on xyz ...

Who can I ask to review a paper on “expander graphs”?
[Figure]
Source: H. Kautz & B. Selman

Paths to Experts
[Figure]
Source: H. Kautz & B. Selman

Observations
Source: H. Kautz & B. Selman
• Official company hierarchy only a sparse subset of the corporate social network
• Shortest (and often best) paths involve a combination of official and unofficial links
  › Conditions for trust and evaluation may greatly differ
  › Global social network is the union of many different kinds of sub-networks
• Search greatly aided when user can choose different views of the network
  › types of edge
  › strength of edge

Knowledge management
• The big challenge: Increase productivity in knowledge workers by getting them the expertise they need at all times
  › the right information (documents?)
  › the right experts.
• Enterprise: group of people engaged in a collective endeavour, typically with proprietary content.

Enterprise knowledge mgmt.
• Examples - Schwartz/Wood, ReferralWeb, Yenta, PHOAKS all have some applicability.
  › ReferralWeb was originally devised and deployed at AT&T Labs.
• Enterprise knowledge management introduces some novel challenges.

Challenges in the enterprise
• Information resides in heterogeneous
  › formats (email, pdf, word, …)
  › repositories (Lotus, Exchange, Documentum, databases …)
  › applications (HR, ERP, Siebel, …)
• Need to combine structured relations (from applications) with learning.

Challenges in the enterprise
• Data security: information units have many different access classes.
  › e.g., compound documents have pieces, each with its own access lists.
  › My search should hit the doc only if it hits the pieces I can see.
• Knowledge security is the deeper issue.
  › Learning: consider class models learned from security-limited information.
  › Inferences (what does a recommendation tell me about confidential data?)

General formulation
• How do we combine different sources of content and context?
  › terms in docs
  › links between docs
  › users’ access patterns
  › users’ profile information.

General formulation
• Every item of interest - each term, query, doc, person, treated as a node.
• Impose similarity metric between pairs of nodes.
• Need to be able to measure proximity from sets of nodes (a person + a doc they’re viewing + a query they’ve issued) to nodes of a target type (a person).

Issues in formulation
• If a user is close to two docs d1 and d2, are the docs d1 and d2 close to each other?
• How do you measure proximity from a set of nodes?
• How do you capture collaborative (as opposed to content- and context-based) filtering?
• How do you succinctly represent and manipulate similarities?

Graph-based models
• Each node is an entity, associated with a set of features.
• Pairwise similarities based on feature matches.
• Issues:
  › Not easy to do proximity from sets of nodes.
  › Have to maintain (quadratic) pairwise information.
  › Consistency.

Tensor-based models
• Turn every entity into a vector.
• Axes are terms, profile features, …
• Combination of user, context++ becomes a tensor.
• Measure proximity to tensors of a certain type (e.g., user, doc recommendation).

Context with content
• Docs’ content captured in term axes.
• Other attributes (user profile, etc.) captured in other axes.
• A probe consists of
  1. a tensor t (say, a user vector plus a query)
  2. a type of vector to be retrieved (say, a user).
• Result = vectors of the chosen type closest to t.

“Standard” mining tricks
• Dimensionality reduction - for collaborative filtering.
• Hierarchical clustering - for fast near-neighbor search.
• Incremental indexes - real-time updates.

Upshot
• Verity social networks project.
[Screenshot]
• Security issues remain thorny.
• What aspects of social behavior can we exploit in the algorithms?
  › Power laws

Power laws in mining

Recurring phenomena
• Many interesting distributions
  › term frequencies in a corpus
  › citations
  › in-links to web pages
  › document access frequencies
  … follow an inverse polynomial function.

Zipf versus power laws
• We call a distribution on the positive integers
  › a power law if it’s of the form p(i) ~ 1/i^α.
  › a Zipf law if p(i) ~ 1/j^α, where j is the rank of i.
• Typically α > 1.
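A small numerical sketch of the power-law form may help: if p(i) is proportional to 1/i^alpha, then log p(i) plotted against log i is a straight line with slope -alpha. The helper names `power_law_pmf` and `loglog_slope`, the truncation at 1000, and the exponent value 2.1 are illustrative assumptions; fitting real, noisy data this way is a rough heuristic rather than a rigorous estimator.

```python
import math


def power_law_pmf(alpha, n_max):
    """Probabilities p(i) proportional to 1 / i**alpha for i = 1..n_max."""
    weights = [1.0 / i ** alpha for i in range(1, n_max + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    var = sum((a - mean_x) ** 2 for a in lx)
    return cov / var


alpha = 2.1   # illustrative exponent (an assumption, not fitted to data)
n_max = 1000  # truncation point, also an assumption
pmf = power_law_pmf(alpha, n_max)

# On exact power-law data the fitted slope recovers -alpha.
slope = loglog_slope(range(1, n_max + 1), pmf)
print(round(slope, 6))
```

Because this distribution is decreasing in i, the value i also has rank i, so in this toy case the power-law and Zipf forms coincide.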
Other Zipf/power laws
• Populations of US cities
• Degrees of internet nodes
• See http://www.cs.berkeley.edu/~christos/games/powerlaw.ps

What leads to power laws?
• “Scale free” growth.
• “Highly optimized tolerance”.
• Behavioral models.
  › Model behavior of individuals in social network.

In-degrees on the Web graph
• Web in-degrees are distributed as p(i) ~ 1/i^2.1.
  › Consistently across many independent studies.
• Erdős–Rényi random graphs would not lead to such a power law.
• Need a new stochastic model for such graphs.

Random replication graphs
• Central thesis - random replication in an evolving graph.
  › Some page creators create content without regard to what exists on the web.
  › Many are inspired by pre-existing content.
  › i.e., some links are random, others are copied from pre-existing pages.

Model details
• Evolution: Nodes are created in a sequence of discrete time steps
  › e.g. at each time step, a new node is created with d=O(1) out-links
• Probabilistic copying
  › with some fixed probability, links go to random nodes
  › with the complementary probability, copy d links from a random “existing” node

Theoretical Results
• New model yields
  › convergence to power-law in-degrees;
  › number of bipartite cliques that grows with time;
  › evolution without copying would not yield these phenomena.
R. Kumar, P. Raghavan, S. Rajagopalan, R. Sivakumar, A. Tomkins, E. Upfal (2000).

Compound structures
• “First order” structures (term frequencies, in-degrees, citations) exhibit power laws.
• What about “higher order” structures (pairs of terms, bipartite cliques, etc.)?
• Motivations:
  › Criteria for mining interesting higher order structures.
  › Tuning algorithms for higher order mining.

Pair frequencies for terms
• Analyzed several corpora of news items.
• Studied frequencies of k-tuples of terms (k=1, 2, …) in
  › corpus
  › documents
  › sentences
  › windows of width w.
Ongoing work with P. Tsaparas.

Sentence log-rank vs. log-frequency
[Figure]

Pair distributions
• Based on term frequencies, compute pair frequencies under an independence assumption.
• Measure actual pair frequencies.
  › Outliers under mutual information measure.
• Higher order outliers: useful for building clusters/concept maps in corpora.

Pairs: independence vs. actual
[Figure]

Computational speedup
• Inspired by pruning algorithms in trawling.
• As higher order associations are built up, keep discarding obviated terms.
  › Docs keep getting shorter.
  › Fit in memory quickly.
  › Not an issue with relational tables?

A research agenda
• New ways of combining content, context and collaboration in the social network.
• Analyze and model structures in social networks.
  › Tune algorithms on models.
  › Build on “standard” mining paradigms: associations, clustering ...
• Incorporate enterprise constraints:
  › Roles and profile information from apps
  › Security and Privacy!