Decentralized Online Learning: Take Benefits from Others' Data without Sharing Your Own to Track Global Trend

Yawei Zhao*,1,2  Chen Yu*,2  Peilin Zhao,3  Hanlin Tang,2  Shuang Qiu,4  Ji Liu2

1 National University of Defense Technology, Changsha, China
2 University of Rochester, Rochester, USA
3 Tencent AI Lab, Shenzhen, China

zhaoyawei@nudt.edu.cn, yuchen92328@gmail.com, peilinzhao@hotmail.com, htang14@ur.rochester.edu, ji.liu.uwisc@gmail.com

arXiv:1901.10593v4 [cs.LG] 28 May 2019

May 30, 2019
Abstract
Decentralized online learning (online learning in decentralized networks) has attracted more and more attention, since it is believed that decentralized online learning can help data providers cooperatively solve their online problems better without sharing their private data with a third party or other providers. Typically, the cooperation is achieved by letting the data providers exchange their models, e.g., recommendation models, with their neighbors. However, the best known regret bound for a decentralized online learning algorithm is $O(n\sqrt{T})$, where $n$ is the number of nodes (or users) and $T$ is the number of iterations. This bound is unsatisfactory, since it can be achieved without any communication in the network. This prompts a fundamental question: can people really benefit from decentralized online learning by exchanging information? In this paper, we study when and why communication can help decentralized online learning reduce the regret. Specifically, each loss function is characterized by two components: the adversarial component and the stochastic component. Under this characterization, we show that decentralized online gradient (DOG) enjoys a regret bound of $O\left(\sqrt{n^2 T G^2 + n T \sigma^2}\right)$, where $G$ measures the magnitude of the adversarial component in the private data (or equivalently the local loss function) and $\sigma$ measures the randomness within the private data. This regret suggests that people can benefit from the randomness in the private data by exchanging information. Another important contribution of this paper is to consider the dynamic regret, a more practical regret that tracks users' interest dynamics. Empirical studies are also conducted to validate our analysis.
1 Introduction
Decentralized online learning has received extensive attention in recent years [Shahrampour and Jadbabaie, 2018, Kamp et al., 2014, Koppel et al., 2018, Zhang et al., 2018a, 2017b, Xu et al., 2015, Akbari et al., 2017, Lee et al., 2016, Nedić et al., 2015, Lee et al., 2018, Benczúr et al., 2018, Yan et al., 2013]. It assumes that computational nodes in a network can communicate with their neighbors to minimize an overall cumulative regret. Each computational node, which could be a user in practice, receives a stream of online losses that are usually determined by a sequence of examples arriving sequentially. Formally, we denote by $f_{i,t}$ the loss received by the $i$-th computational node in the network at the $t$-th iteration. The goal of decentralized online learning is usually to minimize the static regret, which is defined as the difference between the cumulative loss (the sum of all online losses over all nodes and iterations) suffered by the learning algorithm and that of the best model chosen with all loss functions observed beforehand.
* Equal contribution.
Decentralized online learning has recently attracted more and more attention, mainly because it is believed by the community that it enjoys the following advantages for real-world large-scale applications:
• (Utilize all computational resources) It can utilize the computational resources of edge devices by avoiding collecting all the loss functions (or equivalently, the data) at one central node and putting all the computational burden on a single node.
• (Protect data privacy) It can help many data providers collaborate to better minimize their cumulative loss, while at the same time protecting data privacy as much as possible.
However, the current theoretical study does not explain why people need to use decentralized online learning, since the best known regret result for decentralized online learning, $O(n\sqrt{T})$ for convex loss functions [Hosseini et al., 2013, Yan et al., 2013], equals the overall regret obtained when each node (user) only runs local online gradient descent without any communication with others¹. Therefore, this prompts a fundamental question: can people really benefit, with respect to the regret, from decentralized online learning by exchanging information?

¹ $n$ is the number of nodes or users and $T$ is the total number of iterations. The regret of an online algorithm is $O(\sqrt{T})$ for convex loss functions [Hazan, 2016, Shalev-Shwartz, 2012]. Therefore, the overall regret is $n\sqrt{T}$ if all users do not communicate.
In this paper, we mainly study when communication can really help decentralized online learning minimize its regret. Specifically, we distinguish two components in the loss function $f_{i,t}$: the adversarial component and the stochastic component. We then prove that decentralized online gradient achieves a static regret bound of $O\left(\sqrt{n^2TG^2 + nT\sigma^2}\right)$ ($G$ represents the bound on the gradient and $\sigma$ measures the randomness of the private data), where the first term of the bound is due to the adversarial loss while the second term is due to the stochastic loss. Moreover, if a dynamic sequence of models with a budget $M$ is used as the reference points, the dynamic regret of decentralized online gradient is $O\left(\sqrt{(n^2TG^2 + nT\sigma^2)(M+1)}\right)$. The dynamic regret is a more suitable performance metric for real-world applications where the optimal model changes over time; for example, people's favorite style of pop music usually changes along with the global environment. These results show that communication helps to reduce the regret caused by the stochastic losses, rather than by the adversarial losses. This is further verified empirically by extensive experiments.
Notations. In this paper, we use the following notation.
• For any $i \in [n]$ and $t \in [T]$, the random variable $\xi_{i,t}$ follows a distribution $D_{i,t}$, that is, $\xi_{i,t} \sim D_{i,t}$. The set of random variables $\Xi_{n,T}$ and their corresponding distributions $D_{n,T}$ are defined by $\Xi_{n,T} = \{\xi_{i,t}\}_{1\le i\le n,\,1\le t\le T}$ and $D_{n,T} = \{D_{i,t}\}_{1\le i\le n,\,1\le t\le T}$, respectively. For brevity, we write $\Xi_{n,T} \sim D_{n,T}$ to mean that $\xi_{i,t} \sim D_{i,t}$ holds for all $i \in [n]$ and $t \in [T]$. $\mathbb{E}$ denotes mathematical expectation.
• $\nabla$ denotes the gradient operator. $\|\cdot\|$ denotes the $\ell_2$ norm by default. $\lesssim$ means "less than or equal to, up to a constant factor". $\mathcal{A}$ denotes the set of all online algorithms. $\mathbf{1}$ and $\mathbf{0}$ denote the all-ones and all-zeros vectors, respectively.
2 Related work
Online learning has been studied for decades. An online convex optimization method can achieve a static regret bound of order $O(\sqrt{T})$ and $O(\log T)$ for convex and strongly convex loss functions, respectively [Hazan, 2016, Shalev-Shwartz, 2012, Bubeck, 2011].
Decentralized online learning. Online learning in a decentralized network has been studied in [Shahrampour and Jadbabaie, 2018, Kamp et al., 2014, Koppel et al., 2018, Zhang et al., 2018a, 2017b, Xu et al., 2015, Akbari et al., 2017, Lee et al., 2016, Nedić et al., 2015, Lee et al., 2018, Benczúr et al., 2018, Yan et al., 2013].
Shahrampour and Jadbabaie [2018] provides a dynamic regret (defined in Eq. (1)) bound of $O\left(n\sqrt{nTM}\right)$ for decentralized online mirror descent, where $n$, $T$, and $M$ represent the number of nodes in the network, the number of iterations, and the budget of dynamics, respectively. When the Bregman divergence in decentralized online mirror descent is chosen appropriately, the method becomes identical to decentralized online gradient descent. In this paper, we achieve a better dynamic regret bound of $O\left(n\sqrt{TM}\right)$ for a decentralized online gradient descent method, which mainly benefits from a tighter bound on the network error (see Lemma 5). Moreover, Kamp et al. [2014] presents a static regret of $O\left(\sqrt{nT}\right)$ for decentralized online prediction. However, it assumes that all the loss functions are generated from an unknown identical distribution; this assumption is too strong to be practical in a dynamic environment or to apply to a general online learning task. Additionally, many decentralized online optimization methods have been proposed, for example, decentralized online multi-task learning [Zhang et al., 2018a], decentralized online ADMM [Xu et al., 2015], decentralized online gradient descent [Akbari et al., 2017], decentralized continuous-time online saddle-point methods [Lee et al., 2016], decentralized online Nesterov's primal-dual method [Nedić et al., 2015, Lee et al., 2018], and online distributed dual averaging [Hosseini et al., 2013]. However, these previous works only studied static regret bounds of $O(\sqrt{T})$ for decentralized online learning algorithms, and did not provide any theoretical analysis for dynamic environments. Besides, Yan et al. [2013] provides necessary and sufficient conditions to preserve privacy for decentralized online learning methods, which will be studied to extend our method in future work.
Dynamic regret. The dynamic regret of online learning algorithms has been widely studied for decades [Zinkevich, 2003, Hall and Willett, 2015, 2013, Jadbabaie et al., 2015, Yang et al., 2016, Bedi et al., 2018, Zhang et al., 2017a, Mokhtari et al., 2016, Zhang et al., 2018c, György and Szepesvári, 2016, Wei et al., 2016, Zhao et al., 2018]. The first dynamic regret is defined as $\sum_{t=1}^{T}\left(f_t(x_t) - f_t(x^*_t)\right)$ subject to $\sum_{t=1}^{T-1}\|x^*_{t+1} - x^*_t\| \le M$, where $M$ is a budget for the change of the reference points [Zinkevich, 2003]. For this definition, online gradient descent can achieve a dynamic regret of $O\left(\sqrt{T}M + \sqrt{T}\right)$ by selecting an appropriate learning rate. Later, other types of dynamic regret were introduced by using different types of reference points. For example, Hall and Willett [2015, 2013] choose the reference points $\{x^*_t\}_{t=1}^T$ satisfying $\sum_{t=1}^{T-1}\|x^*_{t+1} - \Phi(x^*_t)\| \le M$, where $\Phi(x^*_t)$ is the predicted optimal model. When the function $\Phi$ predicts accurately, the budget $M$ can decrease significantly, so the dynamic regret effectively decreases. Jadbabaie et al. [2015], Yang et al. [2016], Bedi et al. [2018], Zhang et al. [2017a], Mokhtari et al. [2016], and Zhang et al. [2018c] choose the reference points $\{y^*_t\}_{t=1}^T$ with $y^*_t = \operatorname{argmin}_{z\in\mathcal{X}} f_t(z)$, where $f_t$ is the loss function at the $t$-th iteration. György and Szepesvári [2016] provides a new analysis framework, which achieves $O\left(\sqrt{TM}\right)$ dynamic regret² for all the above reference points. Recently, the lower bound of the dynamic regret was shown to be $\Omega\left(\sqrt{TM}\right)$ [Zhang et al., 2018b, Zhao et al., 2018], which indicates that the above algorithms are optimal in terms of dynamic regret. In this paper, we propose a new definition of dynamic regret, which covers all the previous ones as special cases.

In some literature, the regret in a dynamic environment is measured by the number of changes of the reference points over time, which is usually termed shifting regret or tracking regret [Herbster and Warmuth, 1998, György et al., 2005, Gyorgy et al., 2012, György and Szepesvári, 2016, Mourtada and Maillard, 2017, Adamskiy et al., 2016, Wei et al., 2016, Cesa-Bianchi et al., 2012, Mohri and Yang, 2018, Jun et al., 2017]. Both the shifting regret and the tracking regret are usually studied in the setting of "learning with expert advice", while the dynamic regret is more often studied in the general setting of online learning.

² György and Szepesvári [2016] uses the term "shifting regret" instead of "dynamic regret". In this paper, we keep using "dynamic regret", as in most previous literature.
3 Problem formulation

In decentralized online learning, the topological structure of the network can be represented by an undirected graph $G = ([n], E)$ with vertex set $[n] = \{1, \ldots, n\}$ and edge set $E \subset [n] \times [n]$. In real applications, each node $i \in [n]$ is associated with a separate learner, for example a mobile device of one user, which maintains a local predictive model. Users would like to cooperatively minimize their regret without sharing their private data. They typically share their private models with their neighbors (or friends), which are the directly adjacent nodes in $G$.

Let $x_{i,t}$ denote the local model of user $i$ at iteration $t$. In iteration $t$, user $i$ predicts with the local model $x_{i,t}$ for an unknown loss, and then receives the loss $f_{i,t}(\cdot;\xi_{i,t})$. As a result, the decentralized online learning algorithm suffers an instantaneous loss $f_{i,t}(x_{i,t};\xi_{i,t})$. The $\xi_{i,t}$'s are independent of each other over $i$ and $t$, characterizing the stochastic component in the function $f_{i,t}(\cdot;\xi_{i,t})$, while the subscripts $i$ and $t$ of $f$ indicate the adversarial component, for example, the user's profile, location, local time, etc. The stochastic component in the function is usually caused by the potential relation among local models. For example, users' preference for music may be influenced by a popular trend on the Internet at the same time.
To measure the efficacy of a decentralized online learning algorithm $A \in \mathcal{A}$, a commonly used performance measure is the static regret, defined as
$$\widetilde{R}^A_T := \mathbb{E}_{\Xi_{n,T}\sim D_{n,T}}\left[\sum_{i=1}^{n}\sum_{t=1}^{T}\left(f_{i,t}(x_{i,t};\xi_{i,t}) - f_{i,t}(x^*;\xi_{i,t})\right)\right],$$
where $x^* = \operatorname{argmin}_{x}\,\mathbb{E}_{\Xi_{n,T}\sim D_{n,T}}\sum_{i=1}^{n}\sum_{t=1}^{T} f_{i,t}(x;\xi_{i,t})$. The static regret essentially assumes that the optimal model does not change over time. However, in many practical online learning scenarios, the optimal model may evolve over time. For example, when we want to recommend music to a user, the user's preference may change over time with his/her situation. Thus, the optimal model $x^*$ should change over time, which leads to dynamics of the optimal recommendation model. Therefore, for any online algorithm $A \in \mathcal{A}$, we choose to use the dynamic regret as the metric:
$$R^A_T := \mathbb{E}_{\Xi_{n,T}\sim D_{n,T}}\left[\sum_{i=1}^{n}\sum_{t=1}^{T} f_{i,t}(x_{i,t};\xi_{i,t})\right] - \min_{\{x^*_t\}_{t=1}^{T}\in\mathcal{L}^T_M}\;\mathbb{E}_{\Xi_{n,T}\sim D_{n,T}}\left[\sum_{i=1}^{n}\sum_{t=1}^{T} f_{i,t}(x^*_t;\xi_{i,t})\right], \qquad (1)$$
where $\mathcal{L}^T_M$ is defined by $\mathcal{L}^T_M = \left\{\{z_t\}_{t=1}^{T} : \sum_{t=1}^{T-1}\|z_{t+1}-z_t\| \le M\right\}$. $\mathcal{L}^T_M$ restricts how much the optimal model may change over time. Obviously, $R^A_T$ degenerates to $\widetilde{R}^A_T$ when $M = 0$.
4 Decentralized Online Gradient (DOG) algorithm

In this section, we introduce the DOG algorithm, followed by an analysis of its dynamic regret.
Algorithm 1 DOG: Decentralized Online Gradient method.
Require: Learning rate η, number of iterations T, and the confusion matrix W.
1: Initialize x_{i,1} = 0 for all i ∈ [n];
2: for t = 1, 2, ..., T do
3:   // For all users (say the i-th node, i ∈ [n])
4:   Query the neighbors' local models {x_{j,t}}_{j ∈ user i's neighbor set};
5:   Apply the local model and suffer the loss f_{i,t}(x_{i,t}; ξ_{i,t});
6:   Compute the gradient ∇f_{i,t}(x_{i,t}; ξ_{i,t});
7:   Update the local model by x_{i,t+1} = Σ_{j ∈ user i's neighbor set} W_{i,j} x_{j,t} − η ∇f_{i,t}(x_{i,t}; ξ_{i,t});
8: end for
4.1 Algorithm description
In the DOG algorithm, users exchange their local models periodically. In each iteration, each user runs the following steps: 1) query the local models from all of his/her neighbors; 2) apply the local model to $f_{i,t}(\cdot;\xi_{i,t})$ and compute the gradient; 3) update the local model by averaging with the neighbors' models, followed by a gradient step. The detailed description of the DOG algorithm can be found in Algorithm 1. $W \in \mathbb{R}^{n\times n}$ is the confusion matrix defined on the undirected graph $G = ([n], E)$. It is assumed to be a doubly stochastic matrix [Wu et al., 2018, Zeng and Yin, 2018, Yuan et al., 2016], but not necessarily symmetric. Given a decentralized network $G$, there are many approaches to obtain a doubly stochastic $W$, for example, Sinkhorn matrix scaling [Knight, 2008]. The following are two naive ways to construct such a doubly stochastic matrix $W$. Let $N_i$ be the number of user $i$'s neighbors (excluding itself), and $N_{\max} := \max_{i\in[n]} N_i$. We can obtain a doubly stochastic matrix by: 1) $W_{i,j} = 0$ if $i$ and $j$ are not connected ($i \ne j$) in $E$; 2) $W_{i,j} = \frac{1}{n}$ if $i$ and $j$ are connected ($i \ne j$) in $E$; 3) $W_{i,i} = 1 - \frac{N_i}{n}$. When $N_{\max}$ is known, an alternative method is: 1) $W_{i,j} = 0$ if $i$ and $j$ are not connected ($i \ne j$) in $E$; 2) $W_{i,j} = \frac{1}{N_{\max}+1}$ if $i$ and $j$ are connected ($i \ne j$) in $E$; 3) $W_{i,i} = 1 - \frac{N_i}{N_{\max}+1}$. To take a closer look at the algorithm's updating rule, we define $\bar{x}_t = \frac{1}{n}\sum_{i=1}^n x_{i,t}$. It is not hard to verify that $\bar{x}_{t+1} = \bar{x}_t - \frac{\eta}{n}\sum_{i=1}^n \nabla f_{i,t}(x_{i,t};\xi_{i,t})$. The detailed proof is provided in Lemma 1 (see the supplementary materials).
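The following is a minimal Python sketch of one run of Algorithm 1 on a toy stream of quadratic losses, using the second naive confusion-matrix construction above; the helper names (`build_confusion_matrix`, `dog`) and the toy drifting-target loss are ours, introduced only for illustration and not part of the paper's experiments.

```python
import numpy as np

def build_confusion_matrix(adj):
    """Doubly stochastic W from a symmetric 0/1 adjacency matrix (no self-loops),
    using the (N_max + 1) construction described above."""
    n = adj.shape[0]
    n_max = int(adj.sum(axis=1).max())
    W = adj / (n_max + 1.0)
    np.fill_diagonal(W, 1.0 - adj.sum(axis=1) / (n_max + 1.0))
    return W

def dog(W, grad, d, T, eta):
    """One run of DOG. grad(i, t, x) returns the stochastic gradient of f_{i,t} at x."""
    n = W.shape[0]
    X = np.zeros((n, d))                       # row i holds x_{i,t}
    for t in range(T):
        G = np.stack([grad(i, t, X[i]) for i in range(n)])
        X = W @ X - eta * G                    # neighbor averaging + local gradient step
    return X

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, T = 8, 5, 200
    adj = np.zeros((n, n))
    for i in range(n):                         # ring topology
        adj[i, (i + 1) % n] = adj[(i + 1) % n, i] = 1
    W = build_confusion_matrix(adj)
    target = lambda t: np.sin(t / 20.0) * np.ones(d)       # slowly drifting optimum
    grad = lambda i, t, x: (x - target(t)) + 0.1 * rng.standard_normal(d)
    X = dog(W, grad, d, T, eta=0.1)
    print("final consensus error:", np.linalg.norm(X - X.mean(axis=0), axis=1).mean())
```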
4.2 Dynamic regret analysis
Next we present the dynamic regret of DOG. Before that, we first state some common assumptions used in our analysis.
Assumption 1. We make the following assumptions throughout this paper:
• For any $i \in [n]$, $t \in [T]$, and $x$, there exist constants $G$ and $\sigma$ such that $\left\|\mathbb{E}_{\xi_{i,t}\sim D_{i,t}}\nabla f_{i,t}(x;\xi_{i,t})\right\| \le G$ and $\mathbb{E}_{\xi_{i,t}\sim D_{i,t}}\left\|\nabla f_{i,t}(x;\xi_{i,t}) - \mathbb{E}_{\xi_{i,t}\sim D_{i,t}}\nabla f_{i,t}(x;\xi_{i,t})\right\|^2 \le \sigma^2$.
• For any vectors $x$ and $y$ of interest, we assume $\|x - y\|^2 \le R$. Besides, for any $i \in [n]$ and $t \in [T]$, we assume $f_{i,t}$ is convex and has an $L$-Lipschitz gradient.
• Let $W$ be doubly stochastic and $\rho := \left\|W - \frac{\mathbf{1}\mathbf{1}^\top}{n}\right\|$. Assume $\rho < 1$.
$G$ essentially gives an upper bound for the adversarial component in $f_{i,t}(\cdot;\xi_{i,t})$, while the stochastic component is bounded by $\sigma^2$. Note that if there is no stochastic component, $G$ is nothing but the upper bound of the gradient, as in many online learning papers. Splitting these two components is important for our analysis, which will become clear very soon.

The last assumption on $W$ is essential for the decentralized setting. The largest eigenvalue of a doubly stochastic matrix is 1, and $1-\rho$ measures how fast information can propagate within the network (the larger, the faster). Now we are ready to present the dynamic regret of DOG.
Theorem 1. Denote the constants $C_0$, $C_1$, and $C_2$ by $C_0 := \frac{2L(G^2+\sigma^2)}{(1-\rho)^2}$, $C_1 := \frac{4L^2(G^2+\sigma^2)}{(1-\rho)^2}$, and $C_2 := 2 + \frac{1}{1-\rho}$. Choosing $\eta > 0$ in Algorithm 1, under Assumption 1 we have
$$R^{DOG}_T \le \eta T\sigma^2 + C_0 nT\eta^2 + C_1 nT\eta^3 + \frac{n}{2\eta}\left(4\sqrt{R}\,M + R\right) + C_2\, n\eta T G^2.$$
By choosing an appropriate learning rate $\eta$, we obtain a sublinear regret as follows.
Corollary 1. Choosing $\eta = \sqrt{\frac{(1-\rho)\left(n\sqrt{R}M + nR\right)}{nTG^2 + T\sigma^2}}$ in Algorithm 1, under Assumption 1 we have
$$R^{DOG}_T \lesssim \sqrt{\frac{T\left(\sqrt{R}\,M + R\right)\left(n^2G^2 + n\sigma^2\right)}{1-\rho}} + \frac{n^2\left(\sqrt{R}\,M + R\right)}{1-\rho} + \frac{n^{\frac{5}{2}}\left(\sqrt{R}\,M + R\right)^{\frac{3}{2}}}{\sqrt{(1-\rho)T}}.$$
For simpler discussion, let us treat $R$, $G$, and $1-\rho$ as constants. The dynamic regret then simplifies to $O\left(\sqrt{(n^2TG^2 + nT\sigma^2)(M+1)}\right)$. If $M = 0$, the dynamic regret degenerates to the static regret $O\left(\sqrt{n^2TG^2 + nT\sigma^2}\right)$. More discussion is provided from the following aspects.
• (Tightness.) To see the tightness, we consider a few special cases:
  – ($\sigma = 0$ and $n = 1$.) The problem degenerates to the vanilla online learning setting, but with dynamic regret. The implied dynamic regret $O\left(\sqrt{TM}\right)$ is consistent with the dynamic regret result in Zhao et al. [2018], which is proven to be optimal.
  – ($M = 0$.) The bound degenerates to the static regret $O\left(\sqrt{n^2TG^2 + nT\sigma^2}\right)$. When $G < \sigma/\sqrt{n}$, that is, the stochastic component dominates the adversarial component, the static regret $O\left(\sqrt{nT}\sigma\right)$ implies that the average regret $\frac{1}{n}R^{DOG}_T$ is $\sigma\sqrt{T}/\sqrt{n}$, and the convergence rate (from the stochastic optimization perspective) is $\sigma/\sqrt{nT}$. This is consistent with the ergodic convergence rate in Tang et al. [2018], and improves the non-ergodic result $\sigma/\sqrt{T}$ in Nedic and Ozdaglar [2009]³.
• (Insight.) Setting $M = 0$, we obtain the static regret $O\left(\sqrt{n^2TG^2 + nT\sigma^2}\right)$. Consider the baseline in which all users do not communicate but only run local online gradient descent. It is not hard to verify that the static regret for this baseline approach is $O\left(\sqrt{n^2TG^2 + n^2T\sigma^2}\right)$. Compared with the baseline, the improvement of our new bound is only in the stochastic component. Recall that $G$ measures the magnitude of the adversarial component and $\sigma$ measures the stochastic component. This result reveals an important observation: communication does not really help with the adversarial component; only the stochastic component benefits from communication. This observation makes sense, since if the users' private data are totally arbitrary, there is no reason they can benefit from each other by exchanging anything.
• (Improving the existing dynamic regret in the decentralized setting.) If we also treat $G$ and $\sigma$ as constants, our analysis leads to a dynamic regret of $O\left(n\sqrt{MT}\right)$. Shahrampour and Jadbabaie [2018] only considers the adversarial loss, and provides an $O\left(n^{\frac{3}{2}}\sqrt{MT}\right)$ regret for DOG. Compared with their result, our regret enjoys the state-of-the-art dependence on $T$ and $M$, and meanwhile improves the dependence on $n$.

³ See the non-ergodic convergence rate in Proposition 3(b) in Nedic and Ozdaglar [2009].
Next we discuss how close all local models $x_{i,t}$ are to their average at each iteration. The following result suggests that the $x_{i,t}$'s get closer and closer to each other over iterations.
Theorem 2. Denote $\bar{x}_t = \frac{1}{n}\sum_{i=1}^n x_{i,t}$. Choosing $\eta = \sqrt{\frac{(1-\rho)\left(n\sqrt{R}M + nR\right)}{nTG^2 + T\sigma^2}}$ in Algorithm 1, under Assumption 1 we have
$$\frac{1}{nT}\,\mathbb{E}_{\Xi_{n,T}\sim D_{n,T}}\left[\sum_{i=1}^{n}\sum_{t=1}^{T}\|x_{i,t} - \bar{x}_t\|^2\right] \lesssim \frac{\sqrt{n}\,(M+R)}{(1-\rho)T}.$$
The result suggests that $x_{i,t}$ approaches $\bar{x}_t$ roughly at the rate $O(1/T)$ (treating $M$ and $\rho$ as constants), which is faster than the $O(1/\sqrt{T})$ convergence of the averaged regret from Corollary 1.
5 Empirical studies
We consider online logistic regression with squared $\ell_2$ norm regularization. In this task, $f_{i,t}(x;\xi_{i,t}) = \log\left(1 + \exp(-y_{i,t} A_{i,t}^\top x)\right) + \frac{\gamma}{2}\|x\|^2$, where $\gamma = 10^{-3}$ is a given hyper-parameter. $\xi_{i,t}$ is the randomness of the function $f_{i,t}$, which is caused by the randomness of the data in the experiment. Under this setting, we compare the performance of the proposed Decentralized Online Gradient method (DOG) with that of the Centralized Online Gradient method (COG).
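For concreteness, below is a minimal sketch of this regularized logistic loss and its gradient, which is what each node would plug into the DOG update; the function names are ours, and the per-sample feature vector $A_{i,t}$ and label $y_{i,t}$ are passed in explicitly.

```python
import numpy as np

GAMMA = 1e-3  # regularization weight gamma used in the experiments

def logistic_loss(x, a, y, gamma=GAMMA):
    """f_{i,t}(x) = log(1 + exp(-y * a^T x)) + (gamma / 2) * ||x||^2."""
    margin = y * a.dot(x)
    return np.logaddexp(0.0, -margin) + 0.5 * gamma * x.dot(x)

def logistic_grad(x, a, y, gamma=GAMMA):
    """Gradient of the loss above with respect to x."""
    margin = y * a.dot(x)
    weight = 1.0 / (1.0 + np.exp(margin))   # = 1 - sigmoid(margin)
    return -y * weight * a + gamma * x
```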
[Figure 1: Local OGD vs. DOG on synthetic data with different ratios of the adversarial component (10,000 nodes with ring topology). Panels: (a) β = 0.9, (b) β = 0.7, (c) β = 0.5, (d) β = 0.3; each panel plots the average loss of Local OGD (M = 10) and DOG (M = 10) against T.]
The dynamic budget $M$ is fixed at 10 to determine the space of reference points. The learning rate $\eta$ is tuned to be optimal for each dataset separately. We evaluate the learning performance by measuring the average loss $\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=1}^{T} f_{i,t}(x_{i,t};\xi_{i,t})$, instead of using the dynamic regret $\mathbb{E}_{\Xi_{n,T}\sim D_{n,T}}\sum_{i=1}^{n}\sum_{t=1}^{T}\left(f_{i,t}(x_{i,t};\xi_{i,t}) - f_{i,t}(x^*_t;\xi_{i,t})\right)$ directly, since the optimal reference points $\{x^*_t\}_{t=1}^T$ are the same for both DOG and COG. We perform empirical evaluation on a toy dataset and several real-world datasets, whose details are presented as follows.
Synthetic Data. For the $i$-th node, a data matrix $A_i \in \mathbb{R}^{10\times T}$ is generated such that $A_i = \beta\tilde{A}_i + (1-\beta)\hat{A}_i$, where $\tilde{A}_i$ represents the adversarial part of the data and $\hat{A}_i$ represents the stochastic part. $\beta$ with $0 < \beta < 1$ balances the adversarial and stochastic components: a large $\beta$ means the adversarial component is significant, and the stochastic component becomes more significant as $\beta$ decreases. Specifically, the elements of $\tilde{A}_i$ are uniformly sampled from the interval $[-0.5+\sin(i), 0.5+\sin(i)]$. Note that $\tilde{A}_i$ and $\tilde{A}_j$ with $i \ne j$ are drawn from different distributions. $\hat{A}_{i,t}$ is generated according to the label $y_{i,t} \in \{1,-1\}$, which is drawn uniformly. When $y_{i,t} = 1$, $\hat{A}_{i,t}$ is sampled from the time-varying distribution $\mathcal{N}((1+0.5\sin(t))\cdot\mathbf{1}, I)$; when $y_{i,t} = -1$, $\hat{A}_{i,t}$ is sampled from another time-varying distribution $\mathcal{N}((-1+0.5\sin(t))\cdot\mathbf{1}, I)$. Due to this correlation, $y_{i,t}$ can be regarded as the label of the instance $\hat{A}_{i,t}$.
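Below is a minimal sketch of this generation procedure, assuming the mixing and distribution parameters exactly as described above; the function name and its return layout are ours, for illustration only. Each column A[:, t] together with y[t] then defines the loss f_{i,t} via the logistic loss described earlier.

```python
import numpy as np

def generate_node_stream(i, T, d=10, beta=0.5, rng=None):
    """Synthetic stream for node i: features A_i in R^{d x T} and labels y_i in {+1,-1}^T."""
    rng = rng if rng is not None else np.random.default_rng(i)
    t = np.arange(1, T + 1)
    # Adversarial part: uniform on [-0.5 + sin(i), 0.5 + sin(i)], node-dependent.
    A_adv = rng.uniform(-0.5 + np.sin(i), 0.5 + np.sin(i), size=(d, T))
    # Stochastic part: label-dependent, time-varying Gaussian means.
    y = rng.choice([1.0, -1.0], size=T)
    means = y + 0.5 * np.sin(t)                    # (1 + 0.5 sin t) or (-1 + 0.5 sin t)
    A_sto = means + rng.standard_normal((d, T))    # column t ~ N(mean_t * 1, I)
    # Mix the two components: large beta -> adversarial part dominates.
    A = beta * A_adv + (1.0 - beta) * A_sto
    return A, y
```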
Real Data. The real public datasets include SUSY⁴ (5,000,000 samples), room-occupancy⁵ (20,560 samples), usenet2⁶ (1,500 samples), and spam⁷ (9,324 samples). SUSY is a large-scale binary classification dataset, and we use the whole dataset to generate two kinds of data: stochastic data and adversarial data. The stochastic data is generated by taking some instances, e.g., 80% of the whole dataset, and allocating them to nodes randomly and uniformly. The adversarial data is generated by taking the remaining instances, clustering them into $n$ clusters, and then allocating every cluster to a node. room-occupancy is a time-series dataset from a naturally dynamic environment. Both usenet2 and spam are "concept drift" [Katakis et al., 2010] datasets, for which the optimal model changes over time. For all datasets, all values of every feature have been normalized to zero mean and unit variance. We present the numerical results for the dataset SUSY here, and place the other results in the supplementary materials.
DOG is effective at reducing the stochastic component of the regret. We compare DOG with local online gradient descent (Local OGD), where every node trains a local model without communicating with others. We vary $\beta$ to generate different kinds of synthetic data, so as to obtain different balances between the adversarial and stochastic components. As discussed above, the stochastic component of the data becomes more significant as $\beta$ decreases. Figure 1 shows that DOG becomes significantly more effective at reducing the stochastic component of the regret for small $\beta$. This validates that exchanging models in a decentralized network is necessary and important for reducing the regret, which matches our theoretical result. In the following empirical studies, we generate synthetic data with $\beta = 0.1$.
DOG yields comparable performance with COG. Figure 2 summarizes the performance of DOG compared with COG. For the synthetic dataset, we simulate a network consisting of 10,000 nodes, where every node is randomly connected with 15 other nodes. Similarly, we simulate a ring network for the other real datasets, where every node is randomly connected with 3 other nodes.
4 https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#SUSY
5 https://archive.ics.uci.edu/ml/datasets/Occupancy+Detection+
6 http://mlkd.csd.auth.gr/concept_drift.html
7 http://mlkd.csd.auth.gr/concept_drift.html
[Figure 2: DOG vs. COG (1,000 nodes with ring topology for the real data). Panels: (a) synthetic, 10,000 nodes, β = 0.1, random topology; (b) SUSY, 100% stochastic data; (c) SUSY, 80% stochastic data; (d) SUSY, 50% stochastic data. Each panel plots the average loss of DOG (M = 10) and COG (M = 10) against T.]

[Figure 3: The robustness of DOG with respect to the network size. Panels: (a) synthetic, random topology; (b) SUSY, 100% stochastic data, ring topology; (c) SUSY, 80% stochastic data, ring topology; (d) SUSY, 50% stochastic data, ring topology (n = 500, 600, 800, 1000). Each panel plots the average loss against T.]

[Figure 4: The sensitivity of DOG with respect to the topology of the network. Panels: (a) synthetic, 10,000 nodes, β = 0.1; (b) SUSY, 1,000 nodes, 100% stochastic data; (c) SUSY, 1,000 nodes, 80% stochastic data; (d) SUSY, 1,000 nodes, 50% stochastic data. Topologies: Disconnected, WattsStrogatz(0.5), WattsStrogatz(1), Ring, Fully connected.]
Under these settings, we can observe that both DOG and COG are effective for the online learning tasks on all the datasets. Besides, DOG achieves slightly worse performance than COG; the gap becomes more noticeable when the adversarial component is large, as can be verified from Figures 2(c) and 2(d).

The performance of DOG is robust to the network size, but is sensitive to the variance of the stochastic data. Figure 3 summarizes the effect of the network size on the performance of DOG. We vary the number of nodes for the different datasets. The synthetic dataset is tested with the random topology, and the real datasets are tested with the ring topology. Figure 3 plots the average loss over time steps. We observe that the average-loss curves for different numbers of nodes mostly overlap. This shows that DOG is robust to the network size (or the number of users), which validates our theory that the average regret⁸ does not increase with the number of nodes. Furthermore, we observe that the average loss becomes larger as the variance of the stochastic data increases, which also validates our theoretical result nicely.

⁸ The 'average regret' equals regret$/(nT)$. Our theoretical analysis shows that the average regret of DOG is $O\left(\sqrt{\frac{G^2}{T} + \frac{\sigma^2}{nT}}\right)$.

The performance of DOG improves in a well-connected network. Figure 4 shows the effect of the network topology on the performance of DOG, where five different topologies are used. Besides the ring topology, the Disconnected topology means there are no edges in the network, so no node shares its local model with others. The Fully connected topology means all nodes are connected, in which case DOG degenerates to COG. The WattsStrogatz topology is a Watts-Strogatz small-world graph, for which a parameter controls the number of stochastic edges (set to 0.5 and 1 in this paper). The results show that Fully connected enjoys the best performance, because $\rho = 0$ for it, and Disconnected suffers the worst performance due to $\rho = 1$. The other topologies have $0 < \rho < 1$.
6 Conclusion
We investigate the decentralized online learning problem, where the loss is incurred by both adversarial and stochastic components. We define a new dynamic regret and analyze a decentralized online gradient method theoretically. The analysis shows that communication is only effective in decreasing the regret caused by the stochastic component, and thus users can benefit from sharing their private models instead of their private data.
References
D. Adamskiy, W. M. Koolen, A. Chernov, and V. Vovk. A closer look at adaptive regret. Journal of Machine
Learning Research, 17(23):1–21, 2016.
M. Akbari, B. Gharesifard, and T. Linder. Distributed online convex optimization on time-varying directed
graphs. IEEE Transactions on Control of Network Systems, 4(3):417–428, Sep. 2017.
A. S. Bedi, P. Sarma, and K. Rajawat. Tracking moving agents via inexact online gradient descent algorithm.
IEEE Journal of Selected Topics in Signal Processing, 12(1):202–217, Feb 2018.
A. A. Benczúr, L. Kocsis, and R. Pálovics. Online Machine Learning in Big Data Streams. CoRR, 2018.
S. Bubeck. Introduction to online optimization, December 2011.
N. Cesa-Bianchi, P. Gaillard, G. Lugosi, and G. Stoltz. Mirror Descent Meets Fixed Share (and feels no
regret). In Advances in Neural Information Processing Systems (NIPS), 2012.
F. Dufossé and B. Uccar. Notes on Birkhoff-von Neumann decomposition of doubly stochastic matrices.
Research Report RR-8852, Inria - Research Centre Grenoble – Rhône-Alpes, Feb. 2016.
A. György and C. Szepesvári. Shifting regret, mirror descent, and matrices. In Proceedings of the 33rd
International Conference on International Conference on Machine Learning - Volume 48, ICML’16, pages
2943–2951. JMLR.org, 2016.
A. György, T. Linder, and G. Lugosi. Tracking the Best of Many Experts. Proceedings of Conference on
Learning Theory (COLT), 2005.
A. Gyorgy, T. Linder, and G. Lugosi. Efficient tracking of large classes of experts. IEEE Transactions on
Information Theory, 58(11):6709–6725, Nov 2012.
E. C. Hall and R. Willett. Dynamical Models and tracking regret in online convex programming. In Proceedings
of International Conference on International Conference on Machine Learning (ICML), 2013.
E. C. Hall and R. M. Willett. Online Convex Optimization in Dynamic Environments. IEEE Journal of
Selected Topics in Signal Processing, 9(4):647–662, 2015.
E. Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):
157–325, 2016.
M. Herbster and M. K. Warmuth. Tracking the best expert. Machine Learning, 32(2):151–178, Aug 1998.
S. Hosseini, A. Chapman, and M. Mesbahi. Online distributed optimization via dual averaging. In 52nd
IEEE Conference on Decision and Control, pages 1484–1489, Dec 2013.
A. Jadbabaie, A. Rakhlin, S. Shahrampour, and K. Sridharan. Online Optimization : Competing with
Dynamic Comparators. In Proceedings of International Conference on Artificial Intelligence and Statistics
(AISTATS), pages 398–406, 2015.
K.-S. Jun, F. Orabona, S. Wright, and R. Willett. Improved strongly adaptive online learning using coin
betting. In A. Singh and J. Zhu, editors, Proceedings of the 20th International Conference on Artificial
Intelligence and Statistics (AISTATS), volume 54, pages 943–951, 20–22 Apr 2017.
M. Kamp, M. Boley, D. Keren, A. Schuster, and I. Sharfman. Communication-efficient distributed online
prediction by dynamic model synchronization. In Proceedings of the 2014th European Conference on
Machine Learning and Knowledge Discovery in Databases - Volume Part I, ECMLPKDD’14, pages 623–639,
Berlin, Heidelberg, 2014. Springer-Verlag.
I. Katakis, G. Tsoumakas, and I. Vlahavas. Tracking recurring contexts using ensemble classifiers: An
application to email filtering. Knowledge and Information Systems, 22(3):371–391, 2010.
P. Knight. The sinkhorn–knopp algorithm: Convergence and applications. SIAM Journal on Matrix Analysis
and Applications, 30(1):261–275, 2008.
A. Koppel, S. Paternain, C. Richard, and A. Ribeiro. Decentralized online learning with kernels. IEEE
Transactions on Signal Processing, 66(12):3240–3255, June 2018.
S. Lee, A. Ribeiro, and M. M. Zavlanos. Distributed continuous-time online optimization using saddle-point
methods. In 2016 IEEE 55th Conference on Decision and Control (CDC), pages 4314–4319, Dec 2016.
S. Lee, A. Nedić, and M. Raginsky. Coordinate dual averaging for decentralized online optimization with
nonseparable global objectives. IEEE Transactions on Control of Network Systems, 5(1):34–44, March
2018.
M. Mohri and S. Yang. Competing with automata-based expert sequences. In A. Storkey and F. Perez-Cruz,
editors, Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics,
volume 84, pages 1732–1740, 09–11 Apr 2018.
A. Mokhtari, S. Shahrampour, A. Jadbabaie, and A. Ribeiro. Online optimization in dynamic environments:
Improved regret rates for strongly convex problems. In Proceedings of IEEE Conference on Decision and
Control (CDC), pages 7195–7201. IEEE, 2016.
J. Mourtada and O.-A. Maillard. Efficient tracking of a growing number of experts. arXiv.org, Aug. 2017.
A. Nedic and A. E. Ozdaglar. Distributed Subgradient Methods for Multi-Agent Optimization. IEEE Trans.
Automat. Contr., 54(1):48–61, 2009.
A. Nedić, S. Lee, and M. Raginsky. Decentralized online optimization with global objectives and local
communication. In 2015 American Control Conference (ACC), pages 4497–4503, July 2015.
S. Shahrampour and A. Jadbabaie. Distributed online optimization in dynamic environments using mirror
descent. IEEE Transactions on Automatic Control, 63(3):714–725, March 2018.
S. Shalev-Shwartz. Online Learning and Online Convex Optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012.
H. Tang, S. Gan, C. Zhang, T. Zhang, and J. Liu. Communication Compression for Decentralized Training.
arXiv.org, Mar. 2018.
C.-Y. Wei, Y.-T. Hong, and C.-J. Lu. Tracking the best expert in non-stationary stochastic environments. In
D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Proceedings of Advances in
Neural Information Processing Systems, pages 3972–3980, 2016.
T. Wu, K. Yuan, Q. Ling, W. Yin, and A. H. Sayed. Decentralized consensus optimization with asynchrony
and delays. IEEE Transactions on Signal and Information Processing over Networks, 4(2):293–307, June
2018.
H.-F. Xu, Q. Ling, and A. Ribeiro. Online learning over a decentralized network through admm. Journal of
the Operations Research Society of China, 3(4):537–562, Dec 2015.
F. Yan, S. Sundaram, S. V. N. Vishwanathan, and Y. Qi. Distributed autonomous online learning: Regrets
and intrinsic privacy-preserving properties. IEEE Transactions on Knowledge and Data Engineering, 25
(11):2483–2493, Nov 2013.
T. Yang, L. Zhang, R. Jin, and J. Yi. Tracking Slowly Moving Clairvoyant - Optimal Dynamic Regret of
Online Learning with True and Noisy Gradient. In Proceedings of the 34th International Conference on
Machine Learning (ICML), 2016.
K. Yuan, Q. Ling, and W. Yin. On the Convergence of Decentralized Gradient Descent. SIAM Journal on
Optimization, 2016.
J. Zeng and W. Yin. On nonconvex decentralized gradient descent. IEEE Transactions on Signal Processing,
66(11):2834–2848, June 2018.
C. Zhang, P. Zhao, S. Hao, Y. C. Soh, B. S. Lee, C. Miao, and S. C. H. Hoi. Distributed multi-task
classification: a decentralized online learning approach. Machine Learning, 107(4):727–747, Apr 2018a.
L. Zhang, T. Yang, J. Yi, R. Jin, and Z.-H. Zhou. Improved Dynamic Regret for Non-degenerate Functions.
In Proceedings of Neural Information Processing Systems (NIPS), 2017a.
L. Zhang, S. Lu, and Z.-H. Zhou. Adaptive online learning in dynamic environments. In S. Bengio, H. Wallach,
H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information
Processing Systems 31, pages 1323–1333. 2018b.
L. Zhang, T. Yang, R. Jin, and Z.-H. Zhou. Dynamic regret of strongly adaptive methods. In Proceedings
of the 35th International Conference on Machine Learning (ICML), pages 5882–5891, 10–15 Jul 2018c.
W. Zhang, P. Zhao, W. Zhu, S. C. H. Hoi, and T. Zhang. Projection-free distributed online learning in
networks. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on
Machine Learning, pages 4054–4062, International Convention Centre, Sydney, Australia, 06–11 Aug 2017b.
Y. Zhao, S. Qiu, and J. Liu. Proximal Online Gradient is Optimum for Dynamic Regret. CoRR, cs.LG, 2018.
M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of
International Conference on Machine Learning (ICML), pages 928–935, 2003.
Supplementary materials for theoretical analysis

For brevity, throughout the proofs we denote the function $F_{i,t}$ by $F_{i,t}(\cdot) := \mathbb{E}_{\xi_{i,t}\sim D_{i,t}} f_{i,t}(\cdot;\xi_{i,t})$.
Proof of Theorem 1.

Proof. From the regret definition and the convexity of $f_{i,t}$, we have
$$\mathbb{E}_{\Xi_{n,t}\sim D_{n,t}}\frac{1}{n}\sum_{i=1}^n\left(f_{i,t}(x_{i,t};\xi_{i,t}) - f_{i,t}(x^*_t;\xi_{i,t})\right) \le \mathbb{E}_{\Xi_{n,t}\sim D_{n,t}}\frac{1}{n}\sum_{i=1}^n\left\langle\nabla f_{i,t}(x_{i,t};\xi_{i,t}),\, x_{i,t}-x^*_t\right\rangle$$
$$= \underbrace{\mathbb{E}_{\Xi_{n,t}\sim D_{n,t}}\frac{1}{n}\sum_{i=1}^n\left(\left\langle\nabla f_{i,t}(x_{i,t};\xi_{i,t}),\, x_{i,t}-\bar{x}_t\right\rangle + \left\langle\nabla f_{i,t}(x_{i,t};\xi_{i,t}),\, \bar{x}_t-\bar{x}_{t+1}\right\rangle\right)}_{I_1(t)} + \underbrace{\mathbb{E}_{\Xi_{n,t}\sim D_{n,t}}\left\langle\frac{1}{n}\sum_{i=1}^n\nabla f_{i,t}(x_{i,t};\xi_{i,t}),\, \bar{x}_{t+1}-x^*_t\right\rangle}_{I_2(t)}.$$
Now, we begin to bound $I_1(t)$:
$$I_1(t) = \underbrace{\mathbb{E}_{\Xi_{n,t}\sim D_{n,t}}\frac{1}{n}\sum_{i=1}^n\left\langle\nabla f_{i,t}(x_{i,t};\xi_{i,t}),\, x_{i,t}-\bar{x}_t\right\rangle}_{J_1(t)} + \underbrace{\mathbb{E}_{\Xi_{n,t}\sim D_{n,t}}\left\langle\frac{1}{n}\sum_{i=1}^n\nabla f_{i,t}(x_{i,t};\xi_{i,t}),\, \bar{x}_t-\bar{x}_{t+1}\right\rangle}_{J_2(t)}.$$
For $J_1(t)$, we have
$$J_1(t) = \frac{1}{n}\,\mathbb{E}_{\Xi_{n,t}\sim D_{n,t}}\sum_{i=1}^n\left\langle\nabla f_{i,t}(x_{i,t};\xi_{i,t}),\, x_{i,t}-\bar{x}_t\right\rangle = \frac{1}{n}\,\mathbb{E}_{\Xi_{n,t-1}\sim D_{n,t-1}}\sum_{i=1}^n\left\langle\nabla F_{i,t}(x_{i,t}),\, x_{i,t}-\bar{x}_t\right\rangle$$
$$= \frac{1}{n}\,\mathbb{E}_{\Xi_{n,t-1}\sim D_{n,t-1}}\sum_{i=1}^n\left\langle\nabla F_{i,t}(x_{i,t})-\nabla F_{i,t}(\bar{x}_t),\, x_{i,t}-\bar{x}_t\right\rangle + \frac{1}{n}\,\mathbb{E}_{\Xi_{n,t-1}\sim D_{n,t-1}}\sum_{i=1}^n\left\langle\nabla F_{i,t}(\bar{x}_t),\, x_{i,t}-\bar{x}_t\right\rangle$$
$$\le \frac{L}{n}\,\mathbb{E}_{\Xi_{n,t-1}\sim D_{n,t-1}}\sum_{i=1}^n\|x_{i,t}-\bar{x}_t\|^2 + \frac{1}{n}\,\mathbb{E}_{\Xi_{n,t-1}\sim D_{n,t-1}}\sum_{i=1}^n\left\langle\nabla F_{i,t}(\bar{x}_t),\, x_{i,t}-\bar{x}_t\right\rangle. \qquad (2)$$
Consider the last term. Writing $X_t = [x_{1,t},\ldots,x_{n,t}]$, $\bar X_t = [\bar x_t,\ldots,\bar x_t]$, $\nabla F_t(\bar X_t) = [\nabla F_{1,t}(\bar x_t),\ldots,\nabla F_{n,t}(\bar x_t)]$, and $v_1 = \frac{\mathbf{1}}{\sqrt{n}}$, and using $X_t = -\sum_{s=1}^{t-1}\eta G_sW^{t-1-s}$ and $\bar X_t = X_tv_1v_1^\top$ (see Lemma 1 and Lemma 5), we have
$$\frac{1}{n}\,\mathbb{E}\sum_{i=1}^n\left\langle\nabla F_{i,t}(\bar{x}_t),\, x_{i,t}-\bar{x}_t\right\rangle = \frac{1}{n}\,\mathbb{E}\,\mathrm{tr}\left(\nabla F_t(\bar X_t)^\top\left(X_t-\bar X_t\right)\right) = -\frac{\eta}{n}\,\mathbb{E}\sum_{s=1}^{t-1}\mathrm{tr}\left(\nabla F_t(\bar X_t)^\top\nabla F_s(X_s)\left(W^{t-1-s}-v_1v_1^\top\right)\right)$$
$$\le \frac{\eta}{2n}\,\mathbb{E}\sum_{s=1}^{t-1}\left(\rho^{t-1-s}\left\|\nabla F_t(\bar X_t)\right\|_F^2 + \frac{1}{\rho^{t-1-s}}\left\|\nabla F_s(X_s)\left(W^{t-1-s}-v_1v_1^\top\right)\right\|_F^2\right)$$
$$\le \frac{\eta}{2n}\,\mathbb{E}\sum_{s=1}^{t-1}\rho^{t-1-s}\left(\left\|\nabla F_t(\bar X_t)\right\|_F^2 + \left\|\nabla F_s(X_s)\right\|_F^2\right) \le \eta G^2\sum_{s=1}^{t-1}\rho^{t-1-s} \le \frac{\eta G^2}{1-\rho}.$$
Here the second equality uses $\mathbb{E}_{\xi_s}G_s = \nabla F_s(X_s)$, the first inequality is Young's inequality, the second inequality uses Lemma 3, and the third uses $\|\nabla F_t(\bar X_t)\|_F^2 \le nG^2$ and $\|\nabla F_s(X_s)\|_F^2 \le nG^2$ from Assumption 1. Substituting this into (2), we obtain
$$J_1(t) \le \frac{L}{n}\,\mathbb{E}_{\Xi_{n,t-1}\sim D_{n,t-1}}\sum_{i=1}^n\|x_{i,t}-\bar x_t\|^2 + \frac{\eta G^2}{1-\rho}.$$
For $J_2(t)$, we have
$$J_2(t) = \mathbb{E}\left\langle\frac{1}{n}\sum_{i=1}^n\nabla f_{i,t}(x_{i,t};\xi_{i,t}),\, \bar x_t - \bar x_{t+1}\right\rangle \le \frac{\eta}{2}\,\mathbb{E}\left\|\frac{1}{n}\sum_{i=1}^n\nabla f_{i,t}(x_{i,t};\xi_{i,t})\right\|^2 + \frac{1}{2\eta}\,\mathbb{E}\|\bar x_t - \bar x_{t+1}\|^2$$
$$\le \eta\,\mathbb{E}\left\|\frac{1}{n}\sum_{i=1}^n\left(\nabla f_{i,t}(x_{i,t};\xi_{i,t}) - \nabla F_{i,t}(x_{i,t})\right)\right\|^2 + \eta\,\mathbb{E}\left\|\frac{1}{n}\sum_{i=1}^n\nabla F_{i,t}(x_{i,t})\right\|^2 + \frac{1}{2\eta}\,\mathbb{E}\|\bar x_t - \bar x_{t+1}\|^2$$
$$\overset{(a)}{\le} \frac{\eta}{n}\sigma^2 + 2\eta\,\mathbb{E}\left\|\frac{1}{n}\sum_{i=1}^n\left(\nabla F_{i,t}(x_{i,t}) - \nabla F_{i,t}(\bar x_t)\right)\right\|^2 + 2\eta\,\mathbb{E}\left\|\frac{1}{n}\sum_{i=1}^n\nabla F_{i,t}(\bar x_t)\right\|^2 + \frac{1}{2\eta}\,\mathbb{E}\|\bar x_t - \bar x_{t+1}\|^2$$
$$\overset{(b)}{\le} \frac{\eta}{n}\sigma^2 + \frac{2\eta L^2}{n}\,\mathbb{E}\sum_{i=1}^n\|x_{i,t}-\bar x_t\|^2 + 2\eta G^2 + \frac{1}{2\eta}\,\mathbb{E}\|\bar x_t - \bar x_{t+1}\|^2.$$
Here (a) holds because, using the independence of the $\xi_{i,t}$'s,
$$\mathbb{E}\left\|\frac{1}{n}\sum_{i=1}^n\left(\nabla f_{i,t}(x_{i,t};\xi_{i,t}) - \nabla F_{i,t}(x_{i,t})\right)\right\|^2 = \frac{1}{n^2}\sum_{i=1}^n\mathbb{E}\left\|\nabla f_{i,t}(x_{i,t};\xi_{i,t}) - \nabla F_{i,t}(x_{i,t})\right\|^2 \le \frac{\sigma^2}{n},$$
since the cross terms vanish in expectation; and (b) holds because $F_{i,t}$ has an $L$-Lipschitz gradient and $\|\nabla F_{i,t}(\bar x_t)\| \le G$.
Therefore, we obtain
$$I_1(t) = J_1(t) + J_2(t) \le \frac{L+2\eta L^2}{n}\,\mathbb{E}\sum_{i=1}^n\|x_{i,t}-\bar x_t\|^2 + \frac{\eta\sigma^2}{n} + \left(2+\frac{1}{1-\rho}\right)\eta G^2 + \frac{1}{2\eta}\,\mathbb{E}\|\bar x_t - \bar x_{t+1}\|^2,$$
and summing over $t$,
$$\sum_{t=1}^T I_1(t) \le \frac{L+2\eta L^2}{n}\,\mathbb{E}_{\Xi_{n,T}\sim D_{n,T}}\sum_{i=1}^n\sum_{t=1}^T\|x_{i,t}-\bar x_t\|^2 + \left(2+\frac{1}{1-\rho}\right)T\eta G^2 + \frac{T\eta\sigma^2}{n} + \frac{1}{2\eta}\,\mathbb{E}_{\Xi_{n,T}\sim D_{n,T}}\sum_{t=1}^T\|\bar x_t - \bar x_{t+1}\|^2.$$
Now, we begin to bound $I_2(t)$. Recall the update rule $x_{i,t+1} = \sum_{j=1}^n W_{ij}x_{j,t} - \eta\nabla f_{i,t}(x_{i,t};\xi_{i,t})$. According to Lemma 1, we have
$$\bar x_{t+1} = \bar x_t - \eta\left(\frac{1}{n}\sum_{i=1}^n\nabla f_{i,t}(x_{i,t};\xi_{i,t})\right).$$
Define a new auxiliary function $\phi(z)$ as
$$\phi(z) = \left\langle\frac{1}{n}\sum_{i=1}^n\nabla f_{i,t}(x_{i,t};\xi_{i,t}),\, z\right\rangle + \frac{1}{2\eta}\|z-\bar x_t\|^2. \qquad (3)$$
It is straightforward to verify that $\bar x_{t+1}$ in (3) satisfies the first-order optimality condition of the problem $\min_{z\in\mathbb{R}^d}\phi(z)$, that is, $\nabla\phi(\bar x_{t+1}) = 0$. We thus have
$$\bar x_{t+1} = \operatorname*{argmin}_{z\in\mathbb{R}^d}\phi(z) = \operatorname*{argmin}_{z\in\mathbb{R}^d}\left\langle\frac{1}{n}\sum_{i=1}^n\nabla f_{i,t}(x_{i,t};\xi_{i,t}),\, z\right\rangle + \frac{1}{2\eta}\|z-\bar x_t\|^2.$$
Furthermore, define the auxiliary point $\bar x_\tau = \bar x_{t+1} + \tau(x^*_t - \bar x_{t+1})$ with $0 < \tau \le 1$. By the optimality of $\bar x_{t+1}$,
$$0 \le \phi(\bar x_\tau) - \phi(\bar x_{t+1}) = \left\langle\frac{1}{n}\sum_{i=1}^n\nabla f_{i,t}(x_{i,t};\xi_{i,t}),\, \tau(x^*_t - \bar x_{t+1})\right\rangle + \frac{1}{2\eta}\left(\left\|\tau(x^*_t-\bar x_{t+1})\right\|^2 + 2\left\langle\tau(x^*_t-\bar x_{t+1}),\, \bar x_{t+1}-\bar x_t\right\rangle\right).$$
Note that the above inequality holds for any $0 < \tau \le 1$. Dividing both sides by $\tau$ and letting $\tau\to 0^+$, we obtain
$$I_2(t) = \mathbb{E}\left\langle\frac{1}{n}\sum_{i=1}^n\nabla f_{i,t}(x_{i,t};\xi_{i,t}),\, \bar x_{t+1}-x^*_t\right\rangle \le \frac{1}{\eta}\,\mathbb{E}\left\langle x^*_t - \bar x_{t+1},\, \bar x_{t+1}-\bar x_t\right\rangle = \frac{1}{2\eta}\,\mathbb{E}\left(\|x^*_t-\bar x_t\|^2 - \|x^*_t-\bar x_{t+1}\|^2 - \|\bar x_t-\bar x_{t+1}\|^2\right). \qquad (4)$$
Besides, we have
$$\|x^*_{t+1}-\bar x_{t+1}\|^2 - \|x^*_t-\bar x_{t+1}\|^2 = \|x^*_{t+1}\|^2 - \|x^*_t\|^2 - 2\left\langle\bar x_{t+1},\, x^*_{t+1}-x^*_t\right\rangle$$
$$= \left(\|x^*_{t+1}\| - \|x^*_t\|\right)\left(\|x^*_{t+1}\| + \|x^*_t\|\right) - 2\left\langle\bar x_{t+1},\, x^*_{t+1}-x^*_t\right\rangle \le \|x^*_{t+1}-x^*_t\|\left(\|x^*_{t+1}\| + \|x^*_t\|\right) + 2\|\bar x_{t+1}\|\,\|x^*_{t+1}-x^*_t\| \le 4\sqrt{R}\,\|x^*_{t+1}-x^*_t\|.$$
The last inequality holds due to our assumption, that is, $\|x^*_{t+1}\| = \|x^*_{t+1}-\mathbf{0}\| \le \sqrt{R}$, $\|x^*_t\| = \|x^*_t-\mathbf{0}\| \le \sqrt{R}$, and $\|\bar x_{t+1}\| = \|\bar x_{t+1}-\mathbf{0}\| \le \sqrt{R}$.
Thus, telescoping $I_2(t)$ over $t\in[T]$, we have
$$\sum_{t=1}^T I_2(t) \le \frac{1}{2\eta}\,\mathbb{E}_{\Xi_{n,T}\sim D_{n,T}}\left(4\sqrt{R}\sum_{t=1}^{T-1}\|x^*_{t+1}-x^*_t\| + \|x^*_1-\bar x_1\|^2 - \|x^*_T-\bar x_{T+1}\|^2 - \sum_{t=1}^T\|\bar x_t-\bar x_{t+1}\|^2\right)$$
$$\le \frac{1}{2\eta}\left(4\sqrt{R}\,M + R\right) - \frac{1}{2\eta}\,\mathbb{E}_{\Xi_{n,T}\sim D_{n,T}}\sum_{t=1}^T\|\bar x_t-\bar x_{t+1}\|^2,$$
where $M$ is the budget of the dynamics.
Combining the bounds of $I_1(t)$ and $I_2(t)$, we finally obtain
$$\mathbb{E}_{\Xi_{n,T}\sim D_{n,T}}\sum_{t=1}^T\sum_{i=1}^n\left(f_{i,t}(x_{i,t};\xi_{i,t}) - f_{i,t}(x^*_t;\xi_{i,t})\right) \le n\sum_{t=1}^T\left(I_1(t) + I_2(t)\right)$$
$$\le \eta T\sigma^2 + \left(L+2\eta L^2\right)\mathbb{E}_{\Xi_{n,T}\sim D_{n,T}}\sum_{t=1}^T\sum_{i=1}^n\|\bar x_t - x_{i,t}\|^2 + \frac{n\eta T G^2}{1-\rho} + \frac{n}{2\eta}\left(4\sqrt{R}\,M+R\right) + 2n\eta TG^2$$
$$\overset{(a)}{\le} \eta T\sigma^2 + \frac{\left(2L+4\eta L^2\right)nT\eta^2(G^2+\sigma^2)}{(1-\rho)^2} + \frac{n}{2\eta}\left(4\sqrt{R}\,M+R\right) + \left(2+\frac{1}{1-\rho}\right)n\eta TG^2,$$
where (a) holds due to Lemma 5:
$$\mathbb{E}_{\Xi_{n,T}\sim D_{n,T}}\sum_{i=1}^n\sum_{t=1}^T\|x_{i,t}-\bar x_t\|^2 \le \frac{nT\eta^2(2G^2+2\sigma^2)}{(1-\rho)^2}.$$
Rearranging the terms completes the proof.
Lemma 1. Denote $\bar x_t = \frac{1}{n}\sum_{i=1}^n x_{i,t}$. We have
$$\bar x_{t+1} = \bar x_t - \eta\left(\frac{1}{n}\sum_{i=1}^n\nabla f_{i,t}(x_{i,t};\xi_{i,t})\right).$$
Proof. Denote
$$X_t = [x_{1,t}, x_{2,t}, \ldots, x_{n,t}] \in \mathbb{R}^{d\times n}, \qquad G_t = [\nabla f_{1,t}(x_{1,t};\xi_{1,t}), \nabla f_{2,t}(x_{2,t};\xi_{2,t}), \ldots, \nabla f_{n,t}(x_{n,t};\xi_{n,t})] \in \mathbb{R}^{d\times n}.$$
Recall the update rule $x_{i,t+1} = \sum_{j=1}^n W_{ij}x_{j,t} - \eta\nabla f_{i,t}(x_{i,t};\xi_{i,t})$, which can equivalently be written in matrix form as
$$X_{t+1} = X_t W - \eta G_t.$$
Since the confusion matrix $W$ is doubly stochastic, we have $W\mathbf{1} = \mathbf{1}$. Thus,
$$\bar x_{t+1} = \frac{1}{n}\sum_{i=1}^n x_{i,t+1} = X_{t+1}\frac{\mathbf{1}}{n} = X_t W\frac{\mathbf{1}}{n} - \eta G_t\frac{\mathbf{1}}{n} = X_t\frac{\mathbf{1}}{n} - \eta G_t\frac{\mathbf{1}}{n} = \bar x_t - \eta\left(\frac{1}{n}\sum_{i=1}^n\nabla f_{i,t}(x_{i,t};\xi_{i,t})\right).$$
It completes the proof.
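As a quick numerical sanity check of Lemma 1 (not part of the original proof), the following snippet builds an exactly doubly stochastic W as a convex combination of permutation matrices, in the spirit of the Birkhoff-von Neumann decomposition used in Lemma 2, and verifies that the averaged model follows the claimed recursion; all names here are ours.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, eta = 6, 4, 0.1
# Exactly doubly stochastic W: convex combination of permutation matrices.
perms = [np.eye(n)[rng.permutation(n)] for _ in range(3)]
weights = rng.dirichlet(np.ones(3))
W = sum(w * P for w, P in zip(weights, perms))

X = rng.random((d, n))                      # column i holds x_{i,t}
G = rng.random((d, n))                      # column i holds the gradient at node i
X_next = X @ W - eta * G                    # matrix form of the DOG update
lhs = X_next.mean(axis=1)                   # new average model
rhs = X.mean(axis=1) - eta * G.mean(axis=1) # Lemma 1's recursion
print(np.allclose(lhs, rhs))                # prints True
```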
Lemma 2. For any doubly stochastic matrix $W$, its norm satisfies $\|W\| = 1$.
Proof. According to the Birkhoff-von Neumann theorem [Dufossé and Uccar, 2016], $W$ is a convex combination of some $k$ permutation matrices $\{M_i\}_{i=1}^k$, that is,
$$W = \sum_{i=1}^k \zeta_i M_i, \qquad (5)$$
where $0 \le \zeta_i \le 1$ for any $1 \le i \le k$ and $\sum_{i=1}^k\zeta_i = 1$. For any vector $u$ with $\|u\| = 1$, we have $\|M_iu\| = \|u\| = 1$, so $\|M_i\| = \sup_{\|u\|=1}\|M_iu\| = 1$. Therefore,
$$\|W\| \le \sum_{i=1}^k\zeta_i\|M_i\| = \sum_{i=1}^k\zeta_i = 1. \qquad (6)$$
Since $\|W\|^2$ is the maximal eigenvalue of $WW^\top$, and $WW^\top$ has eigenvalue 1, we also have $\|W\| \ge 1$. Therefore, $\|W\| = 1$, and the proof is completed.
Lemma 3. Denote $v_1 = \frac{\mathbf{1}}{\sqrt{n}}$. Given any matrix $X$ and doubly stochastic matrix $W$, we have
$$\left\|XW^t - Xv_1v_1^\top\right\|_F^2 \le \rho^{2t}\|X\|_F^2,$$
where $\rho = \|W - V_n\|$ and $V_n = \frac{1}{n}\mathbf{1}\mathbf{1}^\top$.

Proof. We start from the left-hand side:
$$\left\|XW^t - XV_n\right\|_F^2 = \mathrm{tr}\left((XW^t-XV_n)(XW^t-XV_n)^\top\right) = \mathrm{tr}\left(XW^t(W^t)^\top X^\top - XW^tV_n^\top X^\top - XV_n(W^t)^\top X^\top + XV_nV_n^\top X^\top\right) \overset{(a)}{=} \mathrm{tr}\left(XW^t(W^t)^\top X^\top - XV_nX^\top\right).$$
Here 'tr' denotes the trace operator, and (a) holds because $V_n = V_n^\top$,
$$W^tV_n^\top = \frac{1}{n}W^{t-1}(W\mathbf{1})\mathbf{1}^\top = \frac{1}{n}W^{t-1}\mathbf{1}\mathbf{1}^\top = \cdots = V_n, \qquad V_n(W^t)^\top = \frac{1}{n}\mathbf{1}(W^t\mathbf{1})^\top = \frac{1}{n}\mathbf{1}\mathbf{1}^\top = V_n,$$
and $V_nV_n^\top = V_n$.
Additionally, since $W$ is doubly stochastic, we have
$$W^t(W^t)^\top\frac{\mathbf{1}}{\sqrt{n}} = W^t(W^\top)^{t-1}\frac{\mathbf{1}}{\sqrt{n}} = \cdots = \frac{\mathbf{1}}{\sqrt{n}}.$$
Thus, 1 is an eigenvalue of $W^t(W^t)^\top$, and its largest eigenvalue satisfies $\lambda^{(1)}_{W^t(W^t)^\top} \ge 1$. According to Lemma 2, we also have
$$\lambda^{(1)}_{W^t(W^t)^\top} = \left\|W^t(W^t)^\top\right\| \le \left\|W^t\right\|\left\|(W^\top)^t\right\| \le \|W\|^{2t} \le 1. \qquad (7)$$
Thus, $\lambda^{(1)}_{W^t(W^t)^\top} = 1$ is its largest eigenvalue, and $v_1 = \frac{\mathbf{1}}{\sqrt{n}}$ is a corresponding eigenvector. Since $W^t(W^t)^\top$ is symmetric and positive semi-definite, we can decompose it as
$$W^t(W^t)^\top = \sum_{i=1}^n\lambda^{(i)}_{W^t(W^t)^\top}v_iv_i^\top = P\Lambda P^\top,$$
where $P = [v_1, v_2, \ldots, v_n] \in \mathbb{R}^{n\times n}$, $v_i$ is the normalized eigenvector corresponding to the $i$-th eigenvalue $\lambda^{(i)}_{W^t(W^t)^\top}$, $\Lambda$ is the diagonal matrix of the eigenvalues, and the eigenvalues are sorted so that $\lambda^{(1)}_{W^t(W^t)^\top} \ge \cdots \ge \lambda^{(n)}_{W^t(W^t)^\top}$.
Since $W^t(W^t)^\top = V_n + \sum_{i=2}^n\lambda^{(i)}_{W^t(W^t)^\top}v_iv_i^\top$, we have
$$\left\|XW^t - XV_n\right\|_F^2 = \mathrm{tr}\left(X\left(\sum_{i=2}^n\lambda^{(i)}_{W^t(W^t)^\top}v_iv_i^\top\right)X^\top\right) = \sum_{i=2}^n\lambda^{(i)}_{W^t(W^t)^\top}\|Xv_i\|^2 \le \lambda^{(2)}_{W^t(W^t)^\top}\sum_{i=1}^n\|Xv_i\|^2 = \lambda^{(2)}_{W^t(W^t)^\top}\|XP\|_F^2 = \lambda^{(2)}_{W^t(W^t)^\top}\mathrm{tr}\left(XPP^\top X^\top\right) = \lambda^{(2)}_{W^t(W^t)^\top}\|X\|_F^2,$$
where the last equalities hold by the definition of $P$ (it is orthogonal).
Recall that $V_n = V_n^\top$, $W^tV_n^\top = V_n(W^t)^\top = V_n$, and $V_nV_n = V_n$. We thus have
$$(W-V_n)^t(W^\top-V_n)^t = (W-V_n)^{t-1}\left(WW^\top - WV_n - V_nW^\top + V_nV_n\right)(W^\top-V_n)^{t-1} = (W-V_n)^{t-1}\left(WW^\top - V_n\right)(W^\top-V_n)^{t-1} = \cdots = W^t(W^\top)^t - V_n.$$
Since $\rho = \|W-V_n\| = \|W^\top - V_n\|$, we have
$$\lambda^{(2)}_{W^t(W^t)^\top} = \left\|W^t(W^\top)^t - V_n\right\| \le \|W-V_n\|^t\,\|W^\top-V_n\|^t = \rho^{2t}.$$
We finally have
$$\left\|XW^t - XV_n\right\|_F^2 \le \rho^{2t}\|X\|_F^2.$$
It completes the proof.
Lemma 4 (Lemma 6 in [Tang et al., 2018]). Given two non-negative sequences $\{a_t\}_{t=1}^\infty$ and $\{b_t\}_{t=1}^\infty$ satisfying
$$a_t = \sum_{s=1}^{t}\rho^{t-s}b_s$$
with $\rho \in [0,1)$, we have
$$\sum_{t=1}^{k}a_t^2 \le \frac{1}{(1-\rho)^2}\sum_{s=1}^{k}b_s^2.$$
Shahrampour and Jadbabaie [2018] investigates the dynamic regret of DOG, and provides the following sublinear regret.

Theorem 3 (Implied by Theorem 3 and Corollary 4 in Shahrampour and Jadbabaie [2018]). Choose $\eta = \sqrt{\frac{(1-\rho)M}{T}}$ in Algorithm 1. Under Assumption 1, the dynamic regret $R^{DOG}_T$ is bounded by $O\left(n^{\frac{3}{2}}\sqrt{\frac{MT}{1-\rho}}\right)$.

As illustrated in Theorem 3, Shahrampour and Jadbabaie [2018] has provided an $O\left(n\sqrt{nTM}\right)$ regret for DOG. Compared with their result, our analysis improves the dependence on $n$, which benefits from the following tighter bound on the difference between $x_{i,t}$ and $\bar x_t$.
Lemma 5. Setting $\eta > 0$ in Algorithm 1, under Assumption 1 we have
$$\mathbb{E}_{\Xi_{n,T}\sim D_{n,T}}\sum_{i=1}^n\sum_{t=1}^T\|x_{i,t}-\bar x_t\|^2 \le \frac{2nT\eta^2(G^2+\sigma^2)}{(1-\rho)^2}.$$
Proof. Recall the update rule $x_{i,t+1} = \sum_{j=1}^n W_{ij}x_{j,t} - \eta\nabla f_{i,t}(x_{i,t};\xi_{i,t})$, and, according to Lemma 1,
$$\bar x_{t+1} = \bar x_t - \eta\left(\frac{1}{n}\sum_{i=1}^n\nabla f_{i,t}(x_{i,t};\xi_{i,t})\right).$$
Denote
$$X_t = [x_{1,t}, \ldots, x_{n,t}] \in \mathbb{R}^{d\times n}, \qquad G_t = [\nabla f_{1,t}(x_{1,t};\xi_{1,t}), \ldots, \nabla f_{n,t}(x_{n,t};\xi_{n,t})] \in \mathbb{R}^{d\times n}.$$
By letting $x_{i,1} = \mathbf{0}$ for all $i\in[n]$, the update rule can be re-formulated as
$$X_{t+1} = X_t W - \eta G_t = -\sum_{s=1}^{t}\eta G_s W^{t-s}.$$
Similarly, denoting $\bar G_t = \frac{1}{n}\sum_{i=1}^n\nabla f_{i,t}(x_{i,t};\xi_{i,t})$, we have
$$\bar x_{t+1} = \bar x_t - \eta\bar G_t = -\sum_{s=1}^{t}\eta\bar G_s.$$
Therefore, we obtain
$$\sum_{i=1}^n\|x_{i,t}-\bar x_t\|^2 \overset{(a)}{=} \sum_{i=1}^n\left\|\sum_{s=1}^{t-1}\left(\eta\bar G_s - \eta G_sW^{t-s-1}e_i\right)\right\|^2 \overset{(b)}{=} \left\|\sum_{s=1}^{t-1}\left(\eta G_sv_1v_1^\top - \eta G_sW^{t-s-1}\right)\right\|_F^2 = \eta^2\left\|\sum_{s=1}^{t-1}\left(G_sv_1v_1^\top - G_sW^{t-s-1}\right)\right\|_F^2$$
$$\le \eta^2\left(\sum_{s=1}^{t-1}\left\|G_sv_1v_1^\top - G_sW^{t-s-1}\right\|_F\right)^2 \overset{(c)}{\le} \eta^2\left(\sum_{s=1}^{t-1}\rho^{t-s-1}\|G_s\|_F\right)^2.$$
Here (a) holds because $e_i$ is the unit basis vector whose $i$-th element is 1 and the others are 0, (b) holds because $v_1 = \frac{\mathbf{1}}{\sqrt{n}}$, and (c) holds due to Lemma 3.
Thus, we have
$$\mathbb{E}_{\Xi_{n,T}\sim D_{n,T}}\sum_{i=1}^n\sum_{t=1}^T\|x_{i,t}-\bar x_t\|^2 \le \mathbb{E}_{\Xi_{n,T}\sim D_{n,T}}\sum_{t=1}^T\left(\sum_{s=1}^{t-1}\eta\rho^{t-s-1}\|G_s\|_F\right)^2 \overset{(d)}{\le} \frac{\eta^2}{(1-\rho)^2}\,\mathbb{E}_{\Xi_{n,T}\sim D_{n,T}}\sum_{t=1}^T\|G_t\|_F^2$$
$$= \frac{\eta^2}{(1-\rho)^2}\,\mathbb{E}_{\Xi_{n,T}\sim D_{n,T}}\sum_{t=1}^T\sum_{i=1}^n\left\|\nabla f_{i,t}(x_{i,t};\xi_{i,t}) - \nabla F_{i,t}(x_{i,t}) + \nabla F_{i,t}(x_{i,t})\right\|^2$$
$$\le \frac{2\eta^2}{(1-\rho)^2}\,\mathbb{E}_{\Xi_{n,T}\sim D_{n,T}}\sum_{t=1}^T\sum_{i=1}^n\left\|\nabla f_{i,t}(x_{i,t};\xi_{i,t}) - \nabla F_{i,t}(x_{i,t})\right\|^2 + \frac{2\eta^2}{(1-\rho)^2}\,\mathbb{E}_{\Xi_{n,T}\sim D_{n,T}}\sum_{t=1}^T\sum_{i=1}^n\left\|\nabla F_{i,t}(x_{i,t})\right\|^2 \le \frac{nT\eta^2(2G^2+2\sigma^2)}{(1-\rho)^2},$$
where (d) holds due to Lemma 4. It completes the proof.
Proof of Theorem 2.
Proof. Substituting $\eta = \sqrt{\frac{(1-\rho)\left(n\sqrt{R}M + nR\right)}{nTG^2+T\sigma^2}}$ into Lemma 5 completes the proof.
Supplementary materials for empirical studies

The dynamics of the time-varying distributions are illustrated in Figure 5, which shows the change of the optimal learning model over time and the importance of studying the dynamic regret. We use such time-varying distributions to simulate the dynamics of the synthetic data. More numerical results are presented in Figures 6-8.
[Figure 5: An illustration of the dynamics caused by the time-varying distributions of the data. Data distributions 1 and 2 follow N(1 + sin(t), 1) and N(−1 + sin(t), 1), respectively. Suppose we want to classify data drawn from distributions 1 and 2; the optimal classification model then changes over time.]

[Figure 6: The average loss yielded by DOG is comparable to that yielded by COG. Panels: (a) room-occupancy, 5 nodes, ring topology; (b) usenet2, 5 nodes, ring topology; (c) spam, 5 nodes, ring topology. Each panel plots the average loss of DOG (M = 10) and COG (M = 10) against T.]

[Figure 7: The average loss yielded by DOG is insensitive to the network size. Panels: (a) room-occupancy, ring topology; (b) usenet2, ring topology; (c) spam, ring topology, each with n = 5, 10, 20.]

[Figure 8: The average loss yielded by DOG is insensitive to the topology of the network. Panels: (a) room-occupancy, 20 nodes; (b) usenet2, 20 nodes; (c) spam, 20 nodes. Topologies: Disconnected, WattsStrogatz(0.5), WattsStrogatz(1), Ring, Fully connected.]