On Node Classification in Dynamic Content-based Networks

Charu Aggarwal (IBM T. J. Watson Research Center, charu@us.ibm.com)
Nan Li (University of California, Santa Barbara, nanli@cs.ucsb.edu)

Presented by Nan Li
Motivation

[Figure: snapshot of a co-authorship network in year 2001. Authors (Ke Wang, Jian Pei, Jiawei Han, Kenneth A. Ross) are annotated with paper-title keywords such as "Mining", "Sequential Pattern", "Data Mining", "Association Rules", and "Algorithms".]
Motivation

[Figure: the same co-authorship network in year 2002. The author set has changed (Ke Wang, Jian Pei, Jiawei Han, Marianne Winslett, Xifeng Yan, Philip S. Yu), and so have the keywords, now including "Ranking", "Stream", "Web", and "Semantics".]
Motivation

[Figure: the network again in year 2003, now including Charu Aggarwal, with keywords such as "Indexing", "XML", "Graph", and "Wireless". Both the participants and their associated text keep changing from year to year.]
Motivation

Networks annotated with an increasing amount of text
• Citation networks, co-authorship networks, product databases with large amounts of text content, etc.
• Highly dynamic

Node classification problem
• Often arises in network scenarios in which the underlying nodes are associated with content.
• A subset of the nodes in the network may be labeled.
• Can we use these labeled nodes, in conjunction with the network structure, to classify the nodes that are not currently labeled?

Applications
Challenges

Information networks are very large
• Scalable and efficient

Many such networks are dynamic
• Updatable in real time
• Self-adaptable and robust

Such networks are often noisy
• Intelligent and selective

Heterogeneous correlations in such networks

[Figure: a small example network whose nodes carry class labels A, B, and C, illustrating heterogeneous label correlations.]
Outline

Related Work
DYCOS: DYnamic Classification algorithm with cOntent and Structure
• Semi-bipartite content-structure transformation
• Classification using a series of text and link-based random walks
• Accuracy analysis
Experiments
• NetKit-SRL
Conclusion
Related Work

Link-based classification (Bhagat et al., WebKDD 2007)
• Local iterative
• Global nearest neighbor

Content-only classification (Nigam et al., Machine Learning 2000)
• Each object's own attributes only

Relational classification (Sen et al., Technical Report 2004)
• Each object's own attributes
• Attributes and known labels of the neighbors

Collective classification (Macskassy & Provost, JMLR 2007; Sen et al., Technical Report 2004; Chakrabarti, SIGMOD 1998)
• Local classification
  • Flexible: ranging from a decision tree to an SVM
• Approximate inference algorithms
  • Iterative classification
  • Gibbs sampling
  • Loopy belief propagation
  • Relaxation labeling
Outline

Related Work
DYCOS: DYnamic Classification algorithm with cOntent and Structure
• Semi-bipartite content-structure transformation
• Classification using a series of text and link-based random walks
• Accuracy analysis
Experiments
• NetKit-SRL
Conclusion
DYCOS in a Nutshell

Node classification in a dynamic environment
• Dynamic network: the entire network at time t is denoted by Gt = (Nt, At, Tt), where Nt is the node set, At the edge set, and Tt the set of currently labeled nodes.
• Problem statement: classify the unlabeled nodes (Nt \ Tt) using both the content and the structure of the network, for all time stamps, in an efficient and accurate manner (a data-structure sketch follows below).

[Figure: three network snapshots at times t, t+1, and t+2.]

Both the structure and the content of the network change over time!
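As one concrete reading of this notation, here is a minimal Python sketch of a single snapshot; the class and field names are illustrative (not from the paper), and taking At as the edge set and Tt as the labeled subset follows the problem statement above.

```python
from dataclasses import dataclass

@dataclass
class Snapshot:
    """One snapshot Gt = (Nt, At, Tt) of the dynamic network."""
    nodes: set    # Nt: all node ids at time t
    edges: set    # At: undirected edges as (u, v) tuples
    labels: dict  # Tt: labeled nodes mapped to their class

    def unlabeled(self):
        # The classification targets at time t: Nt \ Tt.
        return self.nodes - self.labels.keys()
```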
Semi-bipartite Transformation

Text-augmented representation
• Leveraged for a random walk-based classification model that uses both links and text
• Two partitions: structural nodes and word nodes (see the sketch below)
• Semi-bipartite: one partition of nodes is allowed to have edges either within the set, or to nodes in the other partition.

Efficient updates upon dynamic changes
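A minimal sketch of this construction in Python, assuming node text is already tokenized into word lists; the function and variable names are illustrative, not from the paper.

```python
from collections import defaultdict

def build_semi_bipartite(edges, node_text, keywords):
    """Augment the structural graph with one word node per keyword.

    Structural nodes keep their original edges (edges within that partition
    are allowed); word nodes link only to the structural nodes containing
    that word, which makes the graph semi-bipartite, not strictly bipartite.
    """
    adj = defaultdict(set)
    for u, v in edges:                      # structural-structural edges
        adj[u].add(v)
        adj[v].add(u)
    for node, words in node_text.items():   # structural-word edges
        for w in set(words) & keywords:
            word_node = ("word", w)         # tag word nodes to avoid id clashes
            adj[node].add(word_node)
            adj[word_node].add(node)
    return adj
```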
Random Walk-Based Classification

Random walks over the augmented structure
• Starting node: the unlabeled node to be classified.
• Structural hop: a random jump from a structural node to one of its neighbors.
• Content-based multi-hop: a jump from a structural node to another through implicit common word nodes.
• Structural parameter ps: the probability of taking a structural hop rather than a content-based multi-hop.

Classification
• Classify the starting node with the most frequently encountered class label during the random walks (see the sketch below).
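A hedged sketch of the walk-and-vote procedure, reusing the adjacency built above; the default values of ps, the walk count, and the hop count are placeholders, and the helper is_word is an assumption tied to the earlier sketch, not part of the paper's API.

```python
import random
from collections import Counter

def is_word(n):
    # Word nodes are the ("word", w) tuples from build_semi_bipartite above.
    return isinstance(n, tuple) and len(n) == 2 and n[0] == "word"

def classify(start, adj, labels, ps=0.7, walks=100, hops=5):
    """Vote over `walks` random walks of length `hops` from `start`.

    Each step is a structural hop with probability ps (move to a structural
    neighbor) and otherwise a content-based multi-hop (through a shared word
    node to another structural node containing that word).
    """
    votes = Counter()
    for _ in range(walks):
        node = start
        for _ in range(hops):
            if random.random() < ps:        # structural hop
                nxt = [n for n in adj[node] if not is_word(n)]
            else:                           # content multi-hop via word nodes
                words = [n for n in adj[node] if is_word(n)]
                nxt = [n for w in words for n in adj[w] if n != node]
            if not nxt:
                break
            node = random.choice(nxt)
            if node in labels:              # tally labels seen along the walk
                votes[labels[node]] += 1
    return votes.most_common(1)[0][0] if votes else None
```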
Gini-Index & Inverted Lists

Discriminative keywords
• A set Mt of the top m words with the highest discriminative power is used to construct the word-node partition.
• Gini-index: for a word w and classes 1..k,

  p_i(w) = n_i(w) / Σ_{j=1..k} n_j(w)

  G(w) = Σ_{j=1..k} p_j(w)^2

• The value of G(w) lies in the range (0, 1).
• Words with a higher value of gini-index are more discriminative for classification purposes.

Inverted lists
• Inverted list of keywords for each node
• Inverted list of nodes for each keyword (see the sketch below)
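A sketch of keyword selection and both inverted lists, assuming n_j(w) counts the labeled nodes of class j that contain word w; the function names are illustrative.

```python
from collections import Counter, defaultdict

def gini(class_counts):
    """G(w) = sum_j p_j(w)^2, with p_j(w) = n_j(w) / sum_j n_j(w)."""
    total = sum(class_counts.values())
    return sum((n / total) ** 2 for n in class_counts.values())

def build_keyword_index(node_text, labels, m):
    """Keep the m highest-gini words, then build the two inverted lists."""
    counts = defaultdict(Counter)              # word -> {class: n_j(w)}
    for node, words in node_text.items():
        if node in labels:
            for w in set(words):
                counts[w][labels[node]] += 1
    top = set(sorted(counts, key=lambda w: gini(counts[w]), reverse=True)[:m])
    words_of = {n: [w for w in set(ws) if w in top]  # keywords per node
                for n, ws in node_text.items()}
    nodes_of = defaultdict(list)               # nodes per keyword
    for n, ws in words_of.items():
        for w in ws:
            nodes_of[w].append(n)
    return top, words_of, nodes_of
```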
Analysis

Why do we care?
• DYCOS essentially uses Monte-Carlo sampling to sample various paths from each unlabeled node.
• Advantage: fast approach
• Disadvantage: loss of accuracy
• Can we present analysis on how accurate DYCOS sampling is?

Probabilistic bound: bi-class classification
• Two classes C1 and C2
• E[Pr[C1]] = f1, E[Pr[C2]] = f2, f1 − f2 = b ≥ 0
• Pr[mis-classification] ≤ exp{−l·b²/2}, where l is the number of sampled walks

Probabilistic bound: multi-class classification
• k classes {C1, C2, …, Ck}
• b-accurate (see the definition in the backup slides)
• Pr[b-accurate] ≥ 1 − (k−1)·exp{−l·b²/2} (a numeric check follows below)
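To make the bound concrete, a quick sanity check; the numbers are illustrative, not from the paper.

```python
import math

def accuracy_bound(l, b, k=2):
    """(k-1) * exp(-l * b**2 / 2): with k = 2 this is the bi-class
    mis-classification bound; one minus it lower-bounds Pr[b-accurate]."""
    return (k - 1) * math.exp(-l * b ** 2 / 2)

# Illustrative numbers: with l = 100 walks and a probability gap b = 0.3,
# the bi-class mis-classification bound is exp(-4.5), roughly 0.011.
print(accuracy_bound(100, 0.3))   # ~0.0111
```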
Outline

Related Work
DYCOS: DYnamic Classification algorithm with cOntent and Structure
• Semi-bipartite content-structure transformation
• Classification using a series of text and link-based random walks
• Accuracy analysis
Experiments
• NetKit-SRL
Conclusion
Experimental Results

Data sets
• CORA: a set of research papers and the citation relations among them.
  • Each node is a paper and each edge is a citation relation.
  • A total of 12,313 English words are extracted from the paper titles.
  • We segment the data into 10 synthetic time periods.
• DBLP: a set of authors and their collaborations.
  • Each node is an author and each edge is a collaboration.
  • A total of 194 English words in the domain of computer science are used.
  • We segment the data into 36 annual graphs from 1975 to 2010.

Name   Node #    Edge #      Class #   Labeled Node #
CORA   19,396    75,021      5         14,814
DBLP   806,635   4,414,135   5         18,999
Experimental Results

NetKit-SRL toolkit
• An open-source, publicly available toolkit for statistical relational learning in networked data (Macskassy and Provost, 2007).
• Instantiations of previous relational and collective classification algorithms
• Configuration
  • Local classifier: domain-specific class prior
  • Relational classifier: network-only multinomial Bayes classifier
  • Collective inference: relaxation labeling

Parameters
• 1) The number of most discriminative words, m; 2) the size constraint of the inverted list for each keyword, a; 3) the number of top content-hop neighbors, q; 4) the number of random walks, l; 5) the length of each random walk, h; 6) the structural parameter, ps (collected in the sketch below).

The results demonstrate that DYCOS improves classification accuracy over NetKit by 7.18% to 17.44%, while reducing the runtime to only 14.60% to 18.95% of NetKit's.
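For reference, one way to bundle the six parameters in code; the concrete values below are assumptions loosely echoing the settings shown in the sensitivity plots (a=30, ps=70%, l=3, h=5), not recommended defaults.

```python
# Hypothetical DYCOS parameter bundle; names follow the slide, values are
# assumptions echoing the sensitivity-plot settings, not tuned defaults.
DYCOS_PARAMS = {
    "m": 100,   # number of most discriminative words
    "a": 30,    # size constraint of each keyword's inverted list
    "q": 10,    # number of top content-hop neighbors
    "l": 3,     # number of random walks per unlabeled node
    "h": 5,     # length of each random walk
    "ps": 0.7,  # probability of a structural hop vs. a content multi-hop
}
```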
Experimental Results

[Figures: classification accuracy comparison and classification time comparison, DYCOS vs. NetKit on CORA.]
Experimental Results

[Figures: parameter sensitivity of DYCOS — sensitivity to m, l, and h (a=30, ps=70%) on CORA data; sensitivity to a, m, and ps (l=3, h=5) on DBLP data.]
Experimental Results

Dynamic Updating Time: CORA

Time Period         1      2      3      4      5      6      7      8      9      10
Update Time (Sec.)  0.019  0.013  0.015  0.013  0.023  0.015  0.014  0.014  0.013  0.011

Dynamic Updating Time: DBLP

Time Period         1975-1989   1990-1999   2000-2010
Update Time (Sec.)  0.03107     0.22671     1.00154
Outline

Related Work
DYCOS: DYnamic Classification algorithm with cOntent and Structure
• Semi-bipartite content-structure transformation
• Classification using a series of text and link-based random walks
• Accuracy analysis
Experiments
• NetKit-SRL
Conclusion
Conclusion

We propose an efficient, dynamic, and scalable method for node classification in dynamic networks.

We provide analysis of how accurate the proposed method is expected to be in practice.

We present experimental results on real data sets and show that our algorithms are more effective and efficient than competing algorithms.
Challenges

Information networks are very large
• Scalable and efficient

Many such networks are dynamic
• Updatable in real time
• Self-adaptable and robust

Such networks are often noisy
• Intelligent and selective

Heterogeneous correlations in such networks
• The correlations between the label of o and the content of o
• The correlations between the label of o and the contents and labels of objects in the neighborhood of o
• The correlations between the label of o and the unobserved labels of objects in the neighborhood of o
Analysis

Lemma. Consider two classes with expected visit probabilities f1 and f2 respectively, such that b = f1 − f2 > 0. Then the probability that the class visited most often during the sampled random walk process is reversed to class 2 is at most

  exp{−l·b²/2}

Definition. Consider the node classification problem with a total of k classes. We define the sampling process to be b-accurate if none of the classes whose expected visit probability is at least b below that of the class with the largest expected visit probability turns out to have the largest sampled visit probability.

Theorem. The probability that the sampling process results in a b-accurate reported majority class is at least

  1 − (k−1)·exp{−l·b²/2}
Experimental Results

[Figures: classification accuracy comparison and classification time comparison, DYCOS vs. NetKit on DBLP.]
Experimental Results

[Figures: additional parameter sensitivity plots — sensitivity to m, l, and h; to a, l, and h; to m, a, and ps; and to a, m, and ps.]