Active Learning for Streaming Networked Data
Zhilin Yang, Jie Tang, Yutao Zhang
Computer Science Department, Tsinghua University
Introduction
Mining streaming data has become an important topic.
 Challenge 1: labeled data is expensive to obtain.
Related work: active learning for streaming data [28, 6, 5, 29]
 Challenge 2: data instances are correlated through the network.
Related work: active learning for networked data [23, 25, 3, 4, 10, 27, 8, 22]
 A novel problem: active learning for streaming networked data,
which addresses challenges 1 and 2 simultaneously.
Problem Formulation
Streaming Networked Data
When a new instance $x_i$ arrives, new edges are added to connect it to the existing instances.
Problem Formulation
Active Learning for Streaming Networked Data

Our output is a data stream $\{\hat{y}_i\}_{i \ge 0}$. At any time, we maintain a classifier based on the instances that have arrived so far.
At any time $t_i$, we go through the following steps:
1. Predict the label $\hat{y}_i$ for $x_i$ based on the current model.
2. Decide whether to query for the true label $y_i$.
3. Update the model with the new instance and, if queried, its label.
Our goal is to use a small number of queries to control (minimize) the cumulative error rate.
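As a concrete sketch of this protocol (a minimal sketch only; `model`, `should_query`, and `oracle` are hypothetical stand-ins, not the paper's API):

```python
# A minimal sketch of the streaming protocol. `model`, `should_query`,
# and `oracle` are illustrative stand-ins for the paper's components.
def run_stream(stream, model, should_query, budget):
    n_queries, n_errors, n_labeled = 0, 0, 0
    for i, (x_i, edges_i, oracle) in enumerate(stream):
        model.add_instance(i, x_i, edges_i)   # connect x_i to the network
        y_hat = model.predict(i)              # step 1: predict the label
        if n_queries < budget and should_query(model, i):
            y_i = oracle()                    # step 2: query the true label
            n_queries += 1
            n_errors += int(y_hat != y_i)     # error measurable once labeled
            n_labeled += 1
            model.update(i, y_i)              # step 3: update the model
    return n_errors, n_labeled, n_queries
```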
Challenges
Concept drift.
The distribution of the input data and the network structure change over time as the stream evolves. How can we adapt to concept drift?
Network correlation.
Instances in networked data are correlated with one another. How can we model this correlation in a streaming setting?
Online query.
We must decide whether to query an instance at the moment it arrives, which makes it infeasible to optimize a global objective function. How can we design online query algorithms?
Modeling Networked Data
Time-Dependent Network
At any time $t_i$, we can construct a time-dependent network $G_i$ based on all instances that have arrived at or before time $t_i$.
$G_i = (X_i, E_i, \mathbf{y}_i^L, \mathbf{y}_i^U)$, where
$X_i$: a matrix whose element $x_{ij}$ is the $j$-th feature of instance $x_i$
$E_i$: the set of all edges between instances
$\mathbf{y}_i^L$: the set of labels of instances that we have already actively queried
$\mathbf{y}_i^U$: the set of unknown labels of all the other instances
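One possible in-memory representation of $G_i$ (a sketch under assumed types; the field and method names are mine, not the authors'):

```python
# Minimal container for the time-dependent network G_i = (X_i, E_i, y^L_i, y^U_i).
from dataclasses import dataclass, field

@dataclass
class TimeDependentNetwork:
    X: dict = field(default_factory=dict)    # node id -> feature vector
    E: list = field(default_factory=list)    # edges (j, k, edge_type c_l)
    y_L: dict = field(default_factory=dict)  # queried (known) labels
    y_U: set = field(default_factory=set)    # ids with unknown labels

    def add_instance(self, i, x_i, new_edges):
        self.X[i] = x_i
        self.E.extend(new_edges)             # connect to existing instances
        self.y_U.add(i)                      # label unknown until queried

    def label(self, i, y_i):
        self.y_U.discard(i)
        self.y_L[i] = y_i
```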
Modeling Networked Data
The Basic Model: Markov Random Field
Given the graph Gi , we can write the energy as
$Q_{G_i}(\mathbf{y}_i^L, \mathbf{y}_i^U; \theta) = \sum_{y_j \in \mathbf{y}_i^L \cup \mathbf{y}_i^U} f(x_j, y_j, \lambda) + \sum_{e_l \in E_i} g(e_l, \beta)$

where $\mathbf{y}_i^L$ holds the true labels of the queried instances, $f(x_j, y_j, \lambda)$ is the energy defined for instance $x_j$, and $g(e_l, \beta)$ is the energy associated with the edge $e_l = (y_j, y_k, c_l)$.
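For a full label configuration, the energy is a simple sum. The sketch below assumes a linear node energy and a disagreement-based edge energy weighted by edge type; the paper's actual $f$ and $g$ may be parameterized differently. It reuses the TimeDependentNetwork container sketched above:

```python
import numpy as np

def node_energy(x_j, y_j, lam):
    # f(x_j, y_j, lambda): assumed linear in the features,
    # with lam an (n_labels x n_features) weight matrix
    return float(lam[y_j] @ x_j)

def edge_energy(y_j, y_k, c_l, beta):
    # g(e_l, beta): assumed to penalize label disagreement,
    # weighted per edge type c_l
    return beta[c_l] * (0.0 if y_j == y_k else 1.0)

def energy(G, labels, lam, beta):
    """Total energy Q of a full label configuration `labels` on network G."""
    Q = sum(node_energy(G.X[j], labels[j], lam) for j in G.X)
    Q += sum(edge_energy(labels[j], labels[k], c, beta) for (j, k, c) in G.E)
    return Q
```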
Modeling Networked Data
Model Inference
We try to assign labels $\mathbf{y}_i^U$ so as to minimize the following energy:
$\min_{\mathbf{y}_i^U} Q_{G_i}(\mathbf{y}_i^L, \mathbf{y}_i^U; \theta)$
i
Directly solving the above problem is usually intractable. We apply dual decomposition [17] to decompose the original problem into a set of tractable subproblems. The dual optimization problem is as follows:
$L_{G_i} = \max_{\sigma} \sum_{e_l} \min_{\mathbf{y}^U \mid \mathbf{y}^L} \big( g(e_l, \beta) + \sigma_l(y_j) + \sigma_l(y_k) \big)$

Each per-edge minimization is a local optimization; the dual variables $\sigma_l$ are subject to a global constraint that couples the subproblems. We can solve the above objective function with the projected subgradient method [13].
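Below is a sketch of edge-wise dual decomposition with projected subgradient. Splitting each node energy across its incident edges and keeping each node's duals zero-mean across its subproblems is the standard recipe for this construction; the paper's exact subproblems and constraint may differ. It reuses `node_energy` and `edge_energy` from the sketch above:

```python
import numpy as np
from collections import defaultdict

def dd_inference(G, lam, beta, y_L, n_labels, iters=50, step=0.1):
    """Approximate min-energy labeling via edge-wise dual decomposition."""
    deg = defaultdict(int)
    for j, k, _ in G.E:
        deg[j] += 1
        deg[k] += 1
    # one dual vector per (edge, endpoint), over labels
    sigma = [{j: np.zeros(n_labels), k: np.zeros(n_labels)}
             for (j, k, _) in G.E]

    def allowed(node):
        # queried instances are clamped to their known label
        return [y_L[node]] if node in y_L else list(range(n_labels))

    picks = {}
    for _ in range(iters):
        picks = defaultdict(list)  # node -> [(edge index, chosen label)]
        for l, (j, k, c) in enumerate(G.E):
            best, arg = float("inf"), None
            for yj in allowed(j):
                for yk in allowed(k):
                    v = (edge_energy(yj, yk, c, beta)
                         + node_energy(G.X[j], yj, lam) / deg[j]
                         + node_energy(G.X[k], yk, lam) / deg[k]
                         + sigma[l][j][yj] + sigma[l][k][yk])
                    if v < best:
                        best, arg = v, (yj, yk)
            picks[j].append((l, arg[0]))
            picks[k].append((l, arg[1]))
        # projected subgradient: push duals toward agreement while keeping
        # each node's duals zero-mean across its subproblems
        for node, chosen in picks.items():
            mean = np.zeros(n_labels)
            for _, y in chosen:
                mean[y] += 1.0 / len(chosen)
            for l, y in chosen:
                grad = -mean.copy()
                grad[y] += 1.0
                sigma[l][node] += step * grad
    # decode: majority vote over the final subproblem solutions
    y_hat = dict(y_L)
    for node, chosen in picks.items():
        if node not in y_hat:
            counts = np.bincount([y for _, y in chosen], minlength=n_labels)
            y_hat[node] = int(np.argmax(counts))
    for node in G.X:  # isolated nodes: minimize the node energy alone
        if node not in y_hat:
            y_hat[node] = min(range(n_labels),
                              key=lambda y: node_energy(G.X[node], y, lam))
    return y_hat
```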
Modeling Networked Data
Model Learning
Applying the max-margin learning paradigm, the objective function for parameter learning minimizes a regularizer plus a slack variable $\xi_\theta$, where

$\xi_\theta = \max_{\mathbf{y}^L, \mathbf{y}^U} \big\{ Q_{G_i}(\mathbf{y}_i^L, \mathbf{y}_i^U; \theta) - Q_{G_i}(\mathbf{y}^L, \mathbf{y}^U; \theta) + \Delta_{\mathbf{y}}(\mathbf{y}_i, \mathbf{y}) \big\}$

Here $\xi_\theta$ is a slack variable, the difference between the two energies is the margin between two configurations, and $\Delta_{\mathbf{y}}$ is a dissimilarity measure between two configurations.
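For intuition, the slack can be computed exactly on a toy graph by loss-augmented enumeration. The sketch below assumes a Hamming dissimilarity for $\Delta_{\mathbf{y}}$ and a hinge at zero (both assumptions; the paper may choose differently), and reuses the `energy` helper sketched earlier:

```python
# Exhaustive slack computation; feasible only on tiny graphs, for intuition.
from itertools import product

def slack(G, y_true, lam, beta, n_labels):
    nodes = sorted(G.X)
    Q_true = energy(G, y_true, lam, beta)
    worst = 0.0  # hinge: slack is non-negative
    for combo in product(range(n_labels), repeat=len(nodes)):
        y = dict(zip(nodes, combo))
        delta = sum(y[n] != y_true[n] for n in nodes)  # assumed Hamming loss
        worst = max(worst, Q_true - energy(G, y, lam, beta) + delta)
    return worst
```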
Modeling Networked Data
Model Learning
Applying dual decomposition in the same way, we obtain a dual optimization objective over a new set of dual variables; the resulting optimization problem can again be solved with the projected subgradient method.
Streaming Active Query
Structural Variability
Intuition: control the gap between the energy of the inferred configuration and that of any other possible
configuration.
We define the structural variability in terms of the gap between the energy of the inferred configuration and the energy of any other possible configuration.
Streaming Active Query
Properties of Structural Variability
1. Monotonicity. Suppose $\mathbf{y}_1^L$ and $\mathbf{y}_2^L$ are two sets of instance labels. Given $\theta$, if $\mathbf{y}_1^L \subseteq \mathbf{y}_2^L$, then the structural variability under $\mathbf{y}_2^L$ is no larger than that under $\mathbf{y}_1^L$. In other words, the structural variability will not increase as we label more instances in the MRF.
2. Normality. If $\mathbf{y}_i^U = \emptyset$, the structural variability is zero. If we label all instances in the graph, we incur no structural variability at all.
Streaming Active Query
Properties of Structural Variability
3. Centrality
Under certain circumstances, minimizing structural variability leads to querying instances with high
network centrality.
Streaming Active Query
Decrease Function
We define a decrease function $\phi_i$ for each instance $y_i$: the structural variability before querying $y_i$ minus the structural variability after querying $y_i$.
The second term is in general intractable, since $y_i$ is unknown before the query. We estimate it by its expectation over the true label probability, which we in turn approximate with the probability given by the current model.
Streaming Active Query
Decrease Function
The first term (the structural variability before querying $y_i$) can be computed by dual decomposition, in the same way as model inference.
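Putting the two terms together, here is a sketch of the decrease function and the query rule. `structural_variability` is a hypothetical helper standing in for the dual-decomposition computation, and `p_i` is the approximate label distribution from the current model:

```python
def decrease(G, i, y_L, p_i, params, n_labels, structural_variability):
    """phi_i = V(before querying y_i) - E_{y_i ~ p_i}[V(after querying y_i)]."""
    v_before = structural_variability(G, y_L, params)
    v_after = 0.0
    for y in range(n_labels):
        y_L_try = dict(y_L)
        y_L_try[i] = y                        # pretend the oracle answered y
        v_after += p_i[y] * structural_variability(G, y_L_try, params)
    return v_before - v_after

def should_query(phi_i, kappa):
    # the streaming rule from the next slide: query iff phi_i >= kappa
    return phi_i >= kappa
```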
Streaming Active Query
The algorithm
Given a constant threshold $\kappa$, we query $y_i$ if and only if $\phi_i \ge \kappa$.
Analysis
Enhancement by Network Sampling
Basic Idea
Maintain an instance reservoir of a fixed size, and update the reservoir sequentially on the arrival
of streaming data.
Which instances should we discard when the reservoir size is exceeded?
Simply discarding early-arrived instances may deteriorate the network correlation. Instead, we consider the loss of discarding an instance along two dimensions:
1. Spatial dimension: the loss in a snapshot graph based on network correlation deterioration
2. Temporal dimension: integrating the spatial loss over time
Enhancement by Network Sampling
Spatial Dimension
We use the dual variables as indicators of network correlation. The violation for an instance measures how much the optimization constraint is violated after removing that instance; the spatial loss is then defined in terms of this violation.
Intuition
1. The dual variables can be viewed as messages sent from the edge factors to each instance.
2. The more seriously the optimization constraint is violated, the more we need to adjust the dual variables.
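One way to instantiate this, assuming the zero-mean dual constraint from the inference sketch above (the slide's exact violation formula is not shown, so this is an assumption): removing an instance deletes its incident edges, and each neighbor then loses one dual term from its constraint, so the leftover imbalance measures the induced violation.

```python
import numpy as np

def spatial_loss(G, sigma, node):
    """Assumed spatial loss: constraint violation induced at the neighbors
    of `node` if it is removed. `sigma` is indexed like G.E, as in
    dd_inference above."""
    loss = 0.0
    for l, (j, k, _) in enumerate(G.E):
        if node == j:
            loss += float(np.linalg.norm(sigma[l][k]))
        elif node == k:
            loss += float(np.linalg.norm(sigma[l][j]))
    return loss
```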
Enhancement by Network Sampling
Temporal Dimension
The streaming network is evolving dynamically, we should not only consider the current spatial loss.
To proceed, we assume that for a given instance y j, dual variables of its neighbors
distribution with an expectation m j and that the dual variables are independent.
We obtain an unbiased estimator for
s kl (yk ) have a
mj
Integrating the spatial loss over time, we obtain
Suppose edges are added according to preferential attachment [2], the loss function is written as
Enhancement by Network Sampling
The algorithm
At time $t_i$, we receive a new instance from the data stream and update the graph.
If the number of instances exceeds the reservoir size, we remove the instance with the smallest loss, together with its associated edges, from the MRF model.
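A sketch of the reservoir update, with `loss` a callable implementing the combined spatial and temporal loss (the combination itself is a stand-in; the slide's closed form is not reproduced here). It reuses the TimeDependentNetwork fields from the earlier sketch:

```python
def reservoir_update(G, reservoir, new_id, capacity, loss):
    """Admit new_id; if over capacity, evict the least-loss instance."""
    reservoir.add(new_id)
    if len(reservoir) > capacity:
        victim = min(reservoir, key=loss)   # smallest spatial+temporal loss
        reservoir.discard(victim)
        # remove the instance and its incident edges from the MRF
        G.E = [e for e in G.E if victim not in (e[0], e[1])]
        G.X.pop(victim, None)
        G.y_L.pop(victim, None)
        G.y_U.discard(victim)
```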
Interpretation
The first term
 Enables us to leverage the spatial loss function in the network.
 Instances that are important to the current model are also likely to remain important at subsequent time stamps.
The second term
 Instances with larger $t_j$ (later arrivals) are retained.
 Our sampling procedure implicitly handles concept drift, because later-arrived instances are more relevant to the current concept [28].
The Framework
Step 1: MRF-based inference
Step 2: Streaming active query
Step 3: MRF-based parameter update
Step 4: Network sampling
Experiments
Datasets
 Weibo [26] is the most popular microblogging service in China.
 View the retweeting flow as a data stream.
 Predict whether a user will retweet a microblog.
 3 types of edge factors: friends; sharing the same user; sharing the same tweet
 Slashdot is an online social network for sharing technology-related news.
 Treat each follow relationship as an instance.
 Predict “friends” or “foes”.
 3 types of edge factors: appearing in the same post; sharing the same follower; sharing
the same followee.
 IMDB is an online database of information related to movies and TV shows.
 Each movie is treated as an instance.
 Classify movies into categories such as romance and animation.
 Edges indicate common-star relationships.
 ArnetMiner [19] is an academic social network.
 Each publication is treated as an instance.
 Classify publications into categories such as machine learning and data mining.
 Edges indicate co-author relationships.
Experiments
Datasets
Experiments
Active Query Performance
We suppress network sampling by setting the reservoir size to infinity, and compare different streaming active query algorithms.
(F1 score vs. labeling rate)
Experiments
Concept Drift
First row: data stream
Second row: shuffled data
(F1 score vs. data chunk index)
1. We find clear evidence of the existence of concept drift.
2. Our algorithm is robust: it not only adapts better to concept drift (first row) but also performs well without concept drift (second row).
Experiments
Streaming Network Sampling
Speedup performance (running time vs. reservoir size)
Decreasing the reservoir size causes a minor drop in performance but yields significantly less running time.
(F1 score vs. labeling rate, with varied reservoir sizes)
Experiments
Streaming Network Sampling
We fix the labeling rate, and compare different streaming network sampling algorithms with varied
reservoir sizes.
Experiments
Performance of Hybrid Approach
We fix the labeling rate and reservoir size, and compare different combinations of active query
algorithms and network sampling algorithms.
Conclusions
 Formulate a novel problem of active learning for streaming networked
data
 Propose a streaming active query algorithm based on the structural
variability
 Design a network sampling algorithm to handle large volumes of streaming data
 Empirically evaluate the effectiveness and efficiency of our algorithm
Thanks
Zhilin Yang, Jie Tang, Yutao Zhang
Computer Science Department, Tsinghua University