SLIDES: Mining Social Networks for Knowledge Management

Mining social networks for
knowledge management
Prabhakar Raghavan
Overview
• Sampling of social network studies
• The knowledge management challenge
› Enterprise complications
• Extended social networks
› Network and tensor-based models
• Power law behaviors
› Why, and what they mean for mining
• A research agenda
Milgram’s experiments
• Began with volunteers from Omaha, NE.
• Asked to get a letter to a physician near
Boston.
• Each holder could only send it to a first-name
acquaintance, who forwarded it in the same way.
• Median path length of successful
deliveries was 6.
• Led to famous “6 degrees of
separation” folklore.
eMail cliques: Schwartz/Wood
• Studied eMail (sub)graph.
• Proposed metrics for finding groups of people
who share interests; cluster analysis.
• Qualitatively “good” results.
• Raised issues of ethical use of data and
privacy.
Various other projects
• PHOAKS
› Extracting heavily cited resources in
newsgroups, etc.
• Call graphs
› Discerning home, business and fax lines
› Calling circles.
• Recommendation systems
› Input: users’ product endorsements.
› Output: product recommendations to each
user.
Trawling bipartite cliques
• Take the (directed) Web link graph.
• Enumerate all (small) bipartite cliques.
[Figure: example bipartite clique with nodes Alice, Bob and AT&T, Sprint, MCI.]
R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins (1998).
Insights from hubs
[Figure labels: Hub, Authority]
Link-based hypothesis:
Dense bipartite subgraph ⇒ Web community.
[Figure labels: Fans, Centers]
Communities from cores
• What is a “dense bipartite subgraph”?
• Define (i,j)-core: complete bipartite subgraph
with i nodes all of which point to each of j
others.
• Enumerate (i,j)-cores for various small i,j.
[Figure: a (2,3) core, with 2 fans each pointing to all 3 centers.]
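The (i,j)-core definition can be made concrete with a brute-force sketch. This is not the Elimination/Generation procedure used in trawling, only a direct reading of the definition that is practical for tiny graphs; the function name `enumerate_cores` and the toy graph are assumptions made for illustration.

```python
from itertools import combinations

def enumerate_cores(out_links, i, j):
    """Return all (i, j)-cores as (fans, centers) pairs of frozensets.

    out_links maps each node to the set of nodes it points to.
    Brute force: only sensible for small graphs and small i, j.
    """
    cores = []
    for fans in combinations(sorted(out_links), i):
        # Centers must be pointed to by every fan in the candidate set.
        common = set.intersection(*(set(out_links[f]) for f in fans))
        common -= set(fans)  # keep the two sides disjoint
        for centers in combinations(sorted(common), j):
            cores.append((frozenset(fans), frozenset(centers)))
    return cores

# Hypothetical toy graph: two pages that each link to the same three sites.
toy_links = {
    "alice": {"att", "sprint", "mci"},
    "bob": {"att", "sprint", "mci", "other"},
    "carol": {"att"},
}
cores_2_3 = enumerate_cores(toy_links, 2, 3)
```

On this toy graph the only (2,3) core has fans alice and bob and centers att, sprint and mci. The cost grows combinatorially with graph size, which is why pruning matters at Web scale.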
Results for cores
[Plot: Number of cores found by Elimination/Generation, in thousands; curves for i = 3, 4, 5, 6; horizontal axis ticks from 3 to 9.]
[Plot: Number of cores found during postprocessing, in thousands; curves for i = 3 and i = 4; horizontal axis ticks from 3 to 9.]
Japanese Elementary Schools
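Fans
• schools
• LINK Page-13
• [Japanese title, roughly “Schools of Japan”]
• [Japanese title: an elementary school home page; school name lost in extraction]
• 100 Schools Home Pages (English)
• K-12 from Japan 10/...rnet and Education)
• http://www...iglobe.ne.jp/~IKESAN
• [Japanese title: an elementary school, grade 6 class 1 story; school name garbled in extraction]
• [Japanese title: a town-run elementary school; name lost in extraction]
• Koulutus ja oppilaitokset
• TOYODA HOMEPAGE
• Education
• Cay's Homepage(Japanese)
• [Japanese title: an elementary school's home page; school name lost in extraction]
• UNIVERSITY
• [Japanese title: an elementary school] DRAGON97-TOP
• [Japanese title: an elementary school, grade 5 class 1 home page; school name lost in extraction]
• [Japanese title, garbled in extraction]
Centers
• The American School in Japan
• The Link Page
• [Japanese title: home page of a municipal elementary school; name lost in extraction]
• Kids' Space
• [Japanese title: a municipal elementary school; name lost in extraction]
• [Japanese title: an elementary school attached to a university of education; name lost in extraction]
• KEIMEI GAKUEN Home Page (Japanese)
• Shiranuma Home Page
• fuzoku-es.fukui-u.ac.jp
• welcome to Miasa E&J school
• [Japanese title: page of a municipal elementary school in Yokohama, Kanagawa; name lost in extraction]
• http://www...p/~m_maru/index.html
• fukui haruyama-es HomePage
• Torisu primary school
• goo
• Yakumo Elementary,Hokkaido,Japan
• FUZOKU Home Page
• Kamishibun Elementary School...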
Yenta: Forman
• Analyzes documents “associated” with
each user.
• Distils significant “interests” for each.
• Matches/clusters groups of users with
overlapping interests.
• Decentralized; aims for privacy
protection.
• Elements of peer-to-peer operation.
ReferralWeb: Kautz/Selman
• Establishes links between people, e.g.,
› co-authorship
› colleagues in an organization
• Allows search through this social
network, e.g.,
› find me someone within distance 2 who
will referee a paper on xyz ...
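A distance-limited search of this kind can be sketched as a breadth-first search over the people graph. This is not ReferralWeb's implementation; the function `experts_within`, the link data and the expertise table are all hypothetical.

```python
from collections import deque

def experts_within(graph, start, topic, expertise, max_dist=2):
    """Return {person: distance} for people within max_dist of start
    whose listed expertise includes topic.

    graph maps a person to the set of people they are linked to
    (co-authors, colleagues, ...); expertise maps a person to a set of topics.
    """
    dist = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if dist[person] == max_dist:
            continue  # do not expand beyond the distance limit
        for neighbor in graph.get(person, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[person] + 1
                queue.append(neighbor)
    return {p: d for p, d in dist.items()
            if d > 0 and topic in expertise.get(p, ())}

# Hypothetical data, for illustration only.
links = {
    "me": {"ann", "raj"},
    "ann": {"me", "lee"},
    "raj": {"me"},
    "lee": {"ann", "kim"},
    "kim": {"lee"},
}
topics = {"lee": {"expander graphs"}, "kim": {"expander graphs"},
          "raj": {"databases"}}
found = experts_within(links, "me", "expander graphs", topics)
```

Here `found` contains only lee, at distance 2; kim has the right expertise but sits at distance 3 and is excluded.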
Who can I ask to review a paper on “expander graphs”?
Source: H. Kautz & B. Selman
Paths to Experts
Source: H. Kautz & B. Selman
Observations
Source: H. Kautz & B. Selman
• Official company hierarchy only a sparse
subset of the corporate social network
• Shortest (and often best) paths involve a
combination of official and unofficial links
› Conditions for trust and evaluation may greatly
differ
› Global social network is the union of many
different kinds of sub-networks
• Search is greatly aided when the user can choose different views
of the network:
› types of edge
› strength of edge
Knowledge management
• The big challenge:
Increase productivity of knowledge
workers by getting them the expertise
they need at all times
› the right information (documents?)
› the right experts.
• Enterprise: group of people engaged in
a collective endeavour, typically with
proprietary content.
Enterprise knowledge mgmt.
• Examples - Schwartz/Wood,
ReferralWeb, Yenta, PHOAKS all have
some applicability.
› ReferralWeb was originally devised and
deployed at AT&T Labs.
• Enterprise knowledge management
introduces some novel challenges.
Challenges in the enterprise
• Information resides in heterogeneous
› formats (email, pdf, word, …)
› repositories (Lotus, Exchange,
Documentum, databases …)
› applications (HR, ERP, Siebel, …)
• Need to combine structured relations
(from applications) with learning.
Challenges in the enterprise
• Data security: information units have
many different access classes.
› e.g., compound documents have pieces,
each with its own access lists.
› My search should hit the doc only if it hits
the pieces I can see.
• Knowledge security is the deeper issue.
› Learning: consider class models learned
from security-limited information.
› Inferences (what does a recommendation
tell me about confidential data?)
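The data-security rule for compound documents (a search should hit a doc only if it hits a piece the searcher can see) can be sketched as follows. The data layout, group names and the function `visible_hits` are assumptions made for illustration, and the sketch does not address the deeper knowledge-security issue of what learned models or recommendations may leak.

```python
def visible_hits(documents, query, user_groups):
    """Return ids of documents that match query in a piece the user may read.

    documents maps a doc id to a list of (text, allowed_groups) pieces;
    user_groups is the set of groups the searching user belongs to.
    """
    hits = []
    for doc_id, pieces in documents.items():
        for text, allowed in pieces:
            readable = bool(user_groups & allowed)
            if readable and query.lower() in text.lower():
                hits.append(doc_id)
                break  # one readable matching piece is enough
    return hits

# Hypothetical compound documents with per-piece access lists.
docs = {
    "plan": [("public roadmap summary", {"staff"}),
             ("merger budget details", {"finance"})],
    "memo": [("budget reminder for all teams", {"staff"})],
}
staff_hits = visible_hits(docs, "budget", {"staff"})
finance_hits = visible_hits(docs, "budget", {"finance"})
```

A staff user searching for "budget" sees only the memo, because the matching piece of the plan is restricted to finance; a finance-only user sees only the plan.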
General formulation
• How do we combine different sources
of content and context?
› terms in docs
› links between docs
› users’ access patterns
› users’ profile information.
General formulation
• Every item of interest - each term,
query, doc, person, treated as a node.
• Impose similarity metric between pairs
of nodes.
• Need to be able to measure proximity
from sets of nodes (a person+a doc
they’re viewing+a query they’ve issued)
to nodes of a target type (a person).
Issues in formulation
• If a user is close to two docs d1 and d2,
are the docs d1 and d2 close to each
other?
• How do you measure proximity from a
set of nodes?
• How do you capture collaborative (as
opposed to content- and context-based)
filtering?
• How do you succinctly represent and
manipulate similarities?
Graph-based models
• Each node an entity, associated with a
set of features.
• Pairwise similarities based on feature
matches.
• Issues:
› Not easy to do proximity from sets of
nodes.
› Have to maintain (quadratic) pairwise
information.
› Consistency.
Tensor-based models
• Turn every entity into a vector.
• Axes are terms, profile features, …
• Combination of user, context++
becomes a tensor.
• Measure proximity to tensors of a
certain type (e.g., user, doc
recommendation).
Context with content
• Docs’ content captured in term axes.
• Other attributes (user profile, etc.)
captured in other axes.
• A probe consists of
› 1: a tensor t (say, a user vector plus a
query)
› 2: a type of vector to be retrieved (say, a
user).
• Result = vectors of chosen type closest
to t.
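A probe of this kind can be sketched with sparse vectors. The slides do not say how a user vector and a query are combined or which distance is used; adding the vectors axis by axis and ranking by cosine similarity are assumptions here, as are all names and data.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {axis: weight}."""
    dot = sum(w * v.get(axis, 0.0) for axis, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def combine(*vectors):
    """Add sparse vectors axis by axis to form one probe."""
    total = {}
    for vec in vectors:
        for axis, w in vec.items():
            total[axis] = total.get(axis, 0.0) + w
    return total

def probe(entities, t, wanted_type, k=1):
    """Return the k entity names of wanted_type closest to probe t.

    entities maps a name to (type, vector).
    """
    scored = [(cosine(t, vec), name)
              for name, (kind, vec) in entities.items() if kind == wanted_type]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]

# Hypothetical entities; axes mix term weights and profile features.
entities = {
    "ann": ("user", {"term:graphs": 1.0, "dept:research": 1.0}),
    "raj": ("user", {"term:databases": 1.0, "dept:research": 1.0}),
    "doc1": ("doc", {"term:graphs": 1.0}),
}
me = {"dept:research": 1.0}
query = {"term:graphs": 2.0}
closest_user = probe(entities, combine(me, query), "user")
```

With this data the closest user is ann, who shares both the query term and the profile axis; asking for type "doc" with the same probe would return doc1 instead.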
“Standard” mining tricks
• Dimensionality reduction - for
collaborative filtering.
• Hierarchical clustering - for fast near-neighbor search.
• Incremental indexes - real-time
updates.
Upshot
• Verity social networks project
[Screenshot]
• Security issues remain thorny.
• What aspects of social behavior can we
exploit in the algorithms?
› Power laws
Power laws in mining
Recurring phenomena
• Many interesting distributions
› term frequencies in a corpus
› citations
› in-links to web pages
› document access frequencies …
follow an inverse polynomial function.
Zipf versus power laws
• We call a distribution on the positive
integers
› a power law if it’s of the form p(i) ~ 1/i^α.
› a Zipf law if p(i) ~ 1/j^β, where j is the rank
of i.
• Typically α > 1.
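One rough way to check a rank-frequency (Zipf-style) fit on data is to regress log frequency on log rank. This least-squares slope is a crude estimator chosen for brevity, not a method taken from the slides, and the synthetic data below is an assumption made for illustration.

```python
import math
from collections import Counter

def rank_frequency(items):
    """Return frequencies sorted from most to least common (rank 1 first)."""
    return sorted(Counter(items).values(), reverse=True)

def loglog_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Synthetic frequencies proportional to 1/rank^1.5, for illustration only.
synthetic = [1000.0 / rank ** 1.5 for rank in range(1, 51)]
estimated_exponent = -loglog_slope(synthetic)
```

On the synthetic data the estimate recovers the exponent 1.5 up to rounding. On a real corpus, `rank_frequency(tokens)` would supply the frequencies, and the tail of the distribution is usually noisy enough that the estimate should be read with care.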
Other Zipf/power laws
• Populations of US cities
• Degrees of internet nodes
• See
http://www.cs.berkeley.edu/~christos/games/powerlaw.ps
What leads to power laws?
• “Scale free” growth.
• “Highly optimized tolerance”.
• Behavioral models.
› Model behavior of individuals in social
network.
In-degrees on the Web graph
• Web in-degrees are distributed as p(i) ~
1/i^2.1.
› Consistently across many independent
studies.
• Erdős-Rényi random graphs would not
lead to such a power law.
• Need a new stochastic model for such
graphs.
Random replication graphs
• Central thesis - random replication in an
evolving graph.
› Some page creators create content without
regard to what exists on the web.
› Many are inspired by pre-existing content.
› i.e., some links are random, others are
copied from pre-existing pages.
Model details
• Evolution: nodes are created in a
sequence of discrete time steps
› e.g., at each time step, a new node is
created with d = O(1) out-links
• Probabilistic copying
› links go to random nodes with probability α
› copy d links from a random “existing”
node with probability 1 − α
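The evolving-graph model can be simulated in a few lines. This is a sketch under stated assumptions: the random-link probability is called `alpha` in the code, the coin is flipped separately for each of the d links (one possible reading of the slide), and the small fully linked starting graph is an arbitrary choice.

```python
import random

def copying_graph(n_nodes, d=3, alpha=0.5, seed=0):
    """Grow a directed graph; return a list of out-link lists, one per node.

    Each new node picks a random existing prototype. Each of its d links
    is uniform at random with probability alpha, otherwise it copies the
    prototype's link in the same position.
    """
    rng = random.Random(seed)
    # Starting graph (an assumption): d + 1 nodes, each linking to the d others.
    out = [[v for v in range(d + 1) if v != u] for u in range(d + 1)]
    while len(out) < n_nodes:
        prototype = rng.randrange(len(out))
        links = []
        for slot in range(d):
            if rng.random() < alpha:
                links.append(rng.randrange(len(out)))
            else:
                links.append(out[prototype][slot])
        out.append(links)
    return out

def in_degrees(out):
    """Count incoming links for every node."""
    deg = [0] * len(out)
    for links in out:
        for target in links:
            deg[target] += 1
    return deg

graph = copying_graph(2000, d=3, alpha=0.5, seed=1)
degrees = in_degrees(graph)
```

Tabulating `degrees` with a counter and plotting the counts on log-log axes is one way to inspect the tail. Setting `alpha=1.0` removes copying entirely, which gives a baseline for comparison.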

Theoretical Results
• New model yields
› convergence to power-law in-degrees;
›
› number of bipartite cliques that grows with
time;
› evolution without copying would not yield
these phenomena.
R. Kumar, P. Raghavan, S. Rajagopalan, R. Sivakumar, A. Tomkins, E. Upfal (2000).
Compound structures
• “First order” structures (term
frequencies, in-degrees, citations)
exhibit power laws.
• What about “higher order” structures
(pairs of terms, bipartite cliques, etc.)?
• Motivations:
› Criteria for mining interesting higher order
structures.
› Tuning algorithms for higher order mining.
Pair frequencies for terms
• Analyzed several corpora of news items.
• Studied frequencies of k-tuples of terms
(k=1, 2, …) in
› corpus
› documents
› sentences
› windows of width w.
Ongoing work with P. Tsaparas.
Sentence log-rank vs. log-frequency
Pair distributions
• Based on term frequencies, compute
pair frequencies under independence
assumption.
• Measure actual pair frequencies.
› Outliers under mutual information
measure.
• Higher order outliers: useful for building
clusters/concept maps in corpora.
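The comparison of expected and actual pair frequencies can be sketched as below. The slides mention a mutual information measure without giving a formula; pointwise mutual information (PMI) per sentence is used here as an assumption, and the corpus is a made-up example.

```python
import math
from collections import Counter
from itertools import combinations

def pair_outliers(sentences):
    """Score term pairs by pointwise mutual information (PMI).

    sentences is a list of term lists. Frequencies are the fraction of
    sentences containing a term or a pair. Returns {pair: pmi}.
    """
    n = len(sentences)
    term_count = Counter()
    pair_count = Counter()
    for sentence in sentences:
        terms = sorted(set(sentence))
        term_count.update(terms)
        pair_count.update(combinations(terms, 2))
    scores = {}
    for (a, b), count in pair_count.items():
        actual = count / n
        # Expected pair frequency if the two terms occurred independently.
        expected = (term_count[a] / n) * (term_count[b] / n)
        scores[(a, b)] = math.log2(actual / expected)
    return scores

# Hypothetical four-sentence corpus.
corpus = [
    ["social", "network", "mining"],
    ["social", "network"],
    ["power", "law"],
    ["mining", "law"],
]
pmi = pair_outliers(corpus)
```

In this corpus the pair (network, social) scores 1.0, meaning it occurs twice as often as independence predicts, while (law, mining) scores 0.0, exactly as predicted. Pairs with large scores are the candidate outliers.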
Pairs: independence vs. actual
Computational speedup
• Inspired by pruning algorithms in
trawling.
• As higher order associations are built
up, keep discarding obviated terms.
› Docs keep getting shorter.
› Fit in memory quickly.
› Not an issue with relational tables?
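The idea of discarding obviated terms while building higher order associations can be sketched in an Apriori-style form. This is an illustrative stand-in, not the authors' algorithm; the support threshold `min_count` and the toy documents are assumptions.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs_with_pruning(docs, min_count):
    """Find term pairs occurring in at least min_count docs, pruning as we go.

    Returns (frequent_pairs, pruned_docs); pruned_docs keeps only the terms
    that can still take part in a frequent pair.
    """
    docs = [set(doc) for doc in docs]
    # Pass 1: a term rarer than min_count cannot be in a frequent pair.
    term_count = Counter(term for doc in docs for term in doc)
    docs = [{t for t in doc if term_count[t] >= min_count} for doc in docs]
    # Pass 2: count pairs over the shortened documents.
    pair_count = Counter(p for doc in docs for p in combinations(sorted(doc), 2))
    frequent = {p for p, c in pair_count.items() if c >= min_count}
    # Terms in no frequent pair are obviated for any larger tuple.
    alive = {t for pair in frequent for t in pair}
    docs = [doc & alive for doc in docs]
    return frequent, docs

# Hypothetical toy corpus.
toy_docs = [
    ["web", "graph", "hub", "the"],
    ["web", "graph", "the"],
    ["power", "law", "the"],
]
pairs, shorter = frequent_pairs_with_pruning(toy_docs, 2)
```

After pruning, the third toy document shrinks to a single term, which illustrates how documents keep getting shorter as each pass removes terms that can no longer contribute.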
A research agenda
• New ways of combining content,
context and collaboration in the social
network.
• Analyze and model structures in social
networks.
› Tune algorithms on models.
› Build on “standard” mining paradigms:
associations, clustering ...
• Incorporate enterprise constraints:
› Roles and profile information from apps
› Security and Privacy!