ODE: An Online Data Selection Framework for
Federated Learning with Limited Storage
Chen Gong1 , Zhenzhe Zheng1 , Yunfeng Shao2 , Bingshuai Li2 , Fan Wu1 , Guihai Chen1
1 Shanghai Jiao Tong University, Shanghai, China
2 Huawei Noah’s Ark Lab, Beijing, China
{gongchen, zhengzhenzhe}@sjtu.edu.cn, {shaoyunfeng, libingshuai}@huawei.com, {fwu, gchen}@cs.sjtu.edu.cn
Abstract—Machine learning models have been deployed in
mobile networks to deal with massive data from different layers
to enable automated network management and intelligence on
devices. To overcome high communication cost and severe privacy
concerns of centralized machine learning, federated learning
(FL) has been proposed to achieve distributed machine learning
among networked devices. While computation and communication limitations have been widely studied, the impact of on-device storage on the performance of FL remains unexplored.
Without an effective data selection policy to filter the massive
streaming networked data on devices, classical FL can suffer from
much longer model training time (4×) and significant inference
accuracy reduction (7%), observed in our experiments. In this
work, we take the first step to consider the online data selection
for FL with limited on-device storage. We first define a new
data valuation metric for data evaluation and selection in FL
with theoretical guarantee for speeding up model convergence
and enhancing final model accuracy, simultaneously. We further
design ODE, an Online Data sElection framework for FL, to
coordinate networked devices to store valuable data samples
collaboratively. Experimental results on one industrial dataset
and three public datasets show the remarkable advantages of
ODE over the state-of-the-art approaches. Particularly, on the
industrial dataset, ODE achieves as high as 2.5× speedup of
training time and 6% increase in inference accuracy, and is robust
to various factors in practical environments, including number
of local training epochs, client participation rate, device storage
capacity, mini-batch size and data heterogeneity among clients.
Index Terms—Federated Learning, Limited On-Device Storage, Online Data Selection
I. INTRODUCTION
The next-generation mobile computing systems require
effective and efficient management of mobile networks
and devices in various aspects, including resource provisioning [1], [2], security and intrusion detection [3], quality of service guarantee [4], and performance monitoring [5]. Analyzing
and controlling such an increasingly complex mobile network
with traditional human-in-the-loop approaches [6] will not be
possible any more, due to low-latency requirement [7], massive
real-time data and complicated correlation among data [8]. For
example, in network traffic analysis, a fundamental task in
mobile networks, devices such as routers and ONTs (Optical
Network Terminal) can receive/send as many as 5000 packets
(≈ 5MB) per second. It is impractical to manually analyze
such a huge quantity of high-dimensional data within milliseconds. Thus, machine learning models have been widely
applied to discover patterns behind high-dimensional networked
data, enable data-driven network control, and fully automate
the mobile network operation [9], [10], [11].
Although ML models overcome the limitations of human-in-the-loop approaches, their good performance relies heavily on a large amount of high-quality data for model training [12],
which is hard to obtain in mobile networks as the data resides on heterogeneous devices in a distributed manner.
On the one hand, an on-device ML model trained locally
with limited data and computational resources is unlikely
to achieve desirable inference accuracy and generalization
ability [13]. On the other hand, directly transmitting data from
distributed networked devices to a cloud server for centralized
learning (CL) will bring prohibitively high communication
cost and severe privacy concerns [14], [15]. Recently, federated
learning (FL) [16] emerges as a distributed privacy-preserving
ML paradigm to resolve the above concerns, which allows
networked devices to upload local model updates instead of
raw data, and a central server to aggregate these local models
into a global model, solving the on-device data limitation,
communication bottlenecks and potential privacy leakage.
Motivation and New Problem. For applying FL to mobile
networks, we identify two unique properties of networked
devices: limited on-device storage and streaming networked
data, which have not been fully considered in previous FL
literature. (1) Limited on-device storage: due to hardware
constraints, mobile devices have restricted storage volume
for each mobile application, and can reserve only a small
space to store training data samples for model training without
compromising the quality of other services. For example, most
smart home routers have only 9-32MB storage to support
various kinds of services [17], such as buffering packets
awaiting transmission or storing configuration information for
routing, and thus only tens of training data samples can be
stored in this scenario. (2) Streaming networked data: data
samples are continuously generated or received by mobile
devices in a streaming manner, and we need to make online
decisions on whether to store each arrived data sample. For
instance, each router can receive over 5, 000 packets per
minute, 14.58 million photos were uploaded on Facebook by
mobile users per hour in 2018 [18], and Alibaba reached a
peak of 583, 000 sales transactions per hour in 2020 [19].
Under these circumstances, the training data stored locally can change dynamically over time.
Without a carefully designed data selection policy to maintain the data samples in storage, the empirical distribution of the stored data could deviate from the true data distribution, which further complicates the notorious problem of non-independent and identically distributed (non-IID) data in FL [20], [21]. Specifically, the naive random selection policy significantly degrades the performance of classic FL algorithms in both model training and inference, with more than 4× longer training time and 7% accuracy reduction, observed in our experiments with an industrial network traffic classification dataset shown in Figure 1 (detailed discussion is presented in Section IV-B). This is unacceptable in modern mobile networks, because the longer training time reduces the timeliness and effectiveness of ML models in dynamic environments, and the accuracy reduction results in failure to guarantee the quality of service [4] and incurs extra operational expenses [22] as well as security breaches [23], [24]. Therefore, a fundamental problem when applying FL to mobile networks is how to filter valuable data samples from on-device streaming networked data to simultaneously accelerate model training convergence and enhance the inference accuracy of the final global model.

Fig. 1. To investigate the impact of on-device storage on FL model training, we conduct experiments on an industrial traffic classification dataset with 30 mobile devices and 35,000+ data samples under different storage capacities. (a) Convergence Time; (b) Inference Accuracy.

Design Challenges. The design of such an online data selection framework for FL involves three key challenges:

(1) There is still no theoretical understanding of the impact of local on-device data on the training speedup and accuracy enhancement of the global model in FL. Lacking information about the raw data and local models of other devices, it is challenging for one device to individually figure out the impact of its local data samples on the performance of the global model. Furthermore, the data sample-level correlation between convergence rate and model accuracy has not been explored in FL, and it is non-trivial to simultaneously improve these two aspects through one unified data valuation metric.

(2) The lack of temporal and spatial information complicates online data selection in FL. For streaming networked data, we cannot access the data samples coming in the future or discarded in the past. Lacking such temporal information, one device is unable to leverage complete statistical information (e.g., the unbiased local data distribution) for accurate data quality evaluation1, such as outlier and noise detection [25], [26]. Additionally, due to the distributed paradigm of FL, one device cannot conduct effective data selection without knowledge of the other devices' stored data and local models, which we refer to as spatial information. This is because the valuable data samples selected locally could overlap with each other, and locally valuable data may not be globally valuable.

(3) On-device data selection needs to incur low computation and memory cost, due to the conflict between limited hardware resources and the requirement on quality of user experience. As the additional time delay and memory cost introduced by the online data selection process would degrade the performance of the mobile network and the user experience, incoming data samples must be evaluated in a computation- and memory-efficient way. However, increasingly complex ML models with huge numbers of parameters lead to high computation complexity as well as a large memory footprint for storing intermediate model outputs during data selection.

1 In this work, we use data quality and data value interchangeably, as both of them reflect the sample-level contribution to the model training in FL.

Limitations of Related Works. Prior works on data evaluation and selection in ML fail to solve the above challenges. (1) The data selection methods in CL, such as the leave-one-out test [27], Data Shapley [28] and Importance Sampling [15], [29], are not appropriate for FL due to the first challenge: they can only measure the value of each data sample with respect to the local model training process, instead of the global model in FL. (2) The prior works on data selection in FL did not consider the two new properties of FL devices. Mercury [13], FedBalancer [25] and the work from Li et al. [26] adopt the importance sampling framework [29] to select the data samples with high loss or gradient norm for reducing the training time per epoch, but fail to solve the second challenge: these methods all need to inspect the whole dataset for normalized sampling weight computation as well as noise and outlier removal [30], [26].

Our Solutions. To solve the above challenges, we design ODE, an online data selection framework that coordinates networked devices to select and store valuable data samples locally and collaboratively in FL, with theoretical guarantees for accelerating model convergence and enhancing inference accuracy, simultaneously.

In ODE, we first theoretically analyze the impact of an individual local data sample on the convergence rate and final accuracy of the global model in FL. We discover a common dominant term in these two analytical expressions, i.e., ⟨∇w l(w, x, y), ∇w F(w)⟩ (please refer to Section III-A for detailed notation definitions), which implies that the projection of the gradient of a local data sample onto the global gradient is a reasonable metric for data selection in FL. As this metric is a deterministic value, we can simply maintain a priority queue for each device to store the valuable data samples. Second, considering the lack of temporal and spatial information, we propose an efficient method for clients to approximate this data selection metric by maintaining a local gradient estimator on each device and a global one on the server. Third, to overcome the potential overlap of the stored data caused by data selection in a distributed manner, we further propose a strategy for the server to coordinate each device to store highly valuable data from different data distribution regions. Finally, to achieve computation and memory efficiency, we propose a simplified version of ODE, which replaces the full model gradient with a partial model gradient to concurrently reduce the computational complexity of model backpropagation and save the memory footprint for storing extra intermediate outputs of the complex ML model.
System Implementation and Experimental Results. We
evaluated ODE on three public tasks: synthetic task (ST) [31],
Image Classification (IC) [32] and Human Activity Recognition (HAR) [33], [34], as well as one industrial mobile traffic
classification dataset (TC) collected from our 30-day deployment on 30 ONTs in practice, consisting of 560,000+ packets
from 250 mobile applications. We compare ODE against three
categories of data selection baselines: random sampling [35],
classic data sampling approaches in CL setting [25], [26], [13]
and previous data selection methods for FL [25], [26]. The
experimental results show that ODE outperforms all these baselines, achieving as high as 9.52× speedup of model training
and 7.56% increase in final model accuracy on ST, 1.35× and
1.4% on IC, 2.22× and 6.38% on HAR, 2.5× and 6% on TC,
with minor additional time delay and memory costs. Moreover,
ODE is robust to different environmental factors including
local epoch number, client participation rate, on-device storage
volume, mini-batch size and data heterogeneity across devices.
We also conduct ablation experiments to demonstrate the
importance of each component in ODE.
Summary of Contributions. Our contributions in this work
can be summarized as follows:
(1) To the best of our knowledge, we are the first to identify
two new properties of applying FL in mobile networks:
limited on-device storage and streaming networked data, and
demonstrate their significant impact on the effectiveness and efficiency of model training in FL.
(2) We provide analytical formulas on the impact of an
individual local data sample on the convergence rate and the
final inference accuracy of the global model, based on which
we propose a new data valuation metric for data selection in FL
with theoretical guarantee for accelerating model convergence
and improving inference accuracy, simultaneously. Furthermore, we propose ODE, an online data selection framework
for FL, to realize the on-device data selection and cross-device
collaborative data storage.
(3) We conduct extensive experiments on three public
datasets and one industrial traffic classification dataset to
demonstrate the remarkable advantages of ODE against existing methods in terms of the convergence rate and inference
accuracy of the final global FL model.
II. PRELIMINARIES

In this section, we present the learning model and training process of FL. We consider the synchronous FL framework [16], [36], [20], where a server coordinates a set of mobile devices/clients2 C to conduct distributed model training. Each client c ∈ C generates data samples in a streaming manner with a velocity vc. We use Pc to denote client c's underlying local data distribution, and P̃c to represent the empirical distribution of the data samples Bc stored in her local storage. The goal of FL is to train a global model w from the locally stored data P̃ = ∪_{c∈C} P̃c with good performance with respect to the underlying unbiased data distribution P = ∪_{c∈C} Pc:

\min_{w \in \mathbb{R}^n} F(w) = \sum_{c \in C} \zeta_c \cdot F_c(w),

where ζc = vc / Σ_{c′∈C} vc′ denotes the normalized weight of each client, n is the dimension of the model parameters, and Fc(w) = E_{(x,y)∼Pc}[l(w, x, y)] is the expected loss of the model w over the true data distribution of client c. We also use F̃c(w) = (1/|Bc|) Σ_{(x,y)∈Bc} l(w, x, y) to denote the empirical loss of the model over the data samples stored by client c.

2 We will use mobile devices and clients interchangeably in this work.

In this work, we investigate the impacts of each client's limited storage on FL, and consider the widely adopted algorithm Fed-Avg [16] for easy illustration3. Under the synchronous FL framework, the global model is trained by repeating the following two steps for each communication round t ∈ [1, T]:

3 Our results for limited on-device storage can be extended to other FL algorithms, such as FedBoost [37], FedNova [38], FedProx [36].

(1) Local Training: In round t, the server selects a client subset Ct ⊆ C to participate in the model training process. Each participating client c ∈ Ct downloads the current global model w_fed^{t−1} (the ending global model from the last round), and performs model updates with the locally stored data for m epochs:

w_c^{t,i} \leftarrow w_c^{t,i-1} - \eta \nabla_w \tilde{F}_c(w_c^{t,i-1}), \quad i = 1, \cdots, m, \qquad (1)

where the starting local model w_c^{t,0} is initialized as w_fed^{t−1}, and η denotes the learning rate.

(2) Model Aggregation: Each participating client c ∈ Ct uploads the updated local model w_c^{t,m}, and the server aggregates them to generate a new global model w_fed^t by taking a weighted average:

w_{fed}^{t} \leftarrow \sum_{c \in C_t} \zeta_c^t \cdot w_c^{t,m}, \qquad (2)

where ζ_c^t = vc / Σ_{c′∈Ct} vc′ is the normalized weight of client c in the current round t.

In the scenario of FL with limited on-device storage and streaming networked data, we have an additional data selection step for clients:

(3) Data Selection: In each round t, once receiving a new data sample, the client has to make an online decision on whether to store the new sample (in place of an old one if the storage area is fully occupied) or discard it. The goal of this data selection process is to select valuable data samples from the streaming networked data for model training in the coming rounds.

We list the frequently used notations in Table I.

TABLE I
FREQUENTLY USED NOTATIONS.

Notation | Remark
C, Ct, c | Set of all clients, set of participants in round t, one specific client.
F(w) | Loss of model w over the underlying global data distribution.
Fc(w), F̃c(w) | Loss of model w over the true data distribution and the locally stored data of client c.
l(w, x, y) | Loss of model w over a data sample (x, y).
Bc, |Bc| | Data samples stored by client c and the corresponding data size.
w_c^{t,i} | Local model of participant c in round t after i local updates.
w_fed^t, w_cen^t | Models in round t trained through federated learning and centralized learning.
vc | The velocity of data generated by client c.
ζc | Weight of client c ∈ C in the global objective function of FL.
ζ_c^t | Weight of client c ∈ Ct in the model aggregation of round t.
ĝc, ĝ^t | Local gradient estimator of client c, global gradient estimator in round t.
n_c^{client} | Number of labels that are assigned to client c for data selection and storage.
n_y^{label} | Number of clients that are coordinated to select and store data with label y.
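To make the training procedure above concrete, the following Python sketch implements one Fed-Avg communication round under the formulation of (1) and (2). It is a minimal illustration rather than the paper's implementation; the helper grad_loss (per-sample loss gradient) and the client dictionaries are hypothetical stand-ins.

import numpy as np

def local_training(w_global, stored_data, grad_loss, eta, m):
    # Eq. (1): m epochs of full-batch gradient descent on the stored samples B_c.
    w = w_global.copy()
    for _ in range(m):
        g = np.mean([grad_loss(w, x, y) for (x, y) in stored_data], axis=0)
        w = w - eta * g
    return w

def fedavg_round(w_global, participants, grad_loss, eta=1e-3, m=5):
    # Eq. (2): weighted average of the updated local models, with weights
    # zeta_c^t proportional to each client's data velocity v_c.
    updates, velocities = [], []
    for client in participants:          # client: {"data": [(x, y), ...], "velocity": v_c}
        updates.append(local_training(w_global, client["data"], grad_loss, eta, m))
        velocities.append(client["velocity"])
    zeta = np.array(velocities, dtype=float)
    zeta /= zeta.sum()
    return sum(z * w for z, w in zip(zeta, updates))

The extra data selection step (3) would run on each client between rounds, replacing items of client["data"] as new samples arrive.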
III. DESIGN OF ODE

In this section, we first quantify the impact of a local data sample on the performance of the global FL model in terms of convergence rate and inference accuracy. Based on the common dominant term in the two analytical expressions, we propose a new data valuation metric for data evaluation and selection in FL (Section III-A), and develop a practical method to estimate this metric with low extra computation and communication overhead (Section III-B). We further design a policy for the server to coordinate the cross-client data selection process, avoiding potential overlap among the data selected and stored by clients (Section III-C). Finally, we summarize the detailed procedure of ODE (Section III-D).

A. Data Valuation Metric

We evaluate the impact of a local data sample on the FL model from the perspectives of convergence rate and inference accuracy, which are two critical aspects for the success of FL. The convergence rate quantifies the reduction of the loss function in each training round, and determines the communication cost of FL. The inference accuracy reflects the effectiveness of an FL model in guaranteeing the quality of service and user experience. For theoretical analysis, we follow one typical assumption on the FL models, which is widely adopted in the literature [36], [39], [40].

Assumption 1. (Lipschitz Gradient) For each client c ∈ C, the loss function Fc(w) is Lc-Lipschitz gradient, i.e., ∥∇w Fc(w1) − ∇w Fc(w2)∥₂ ⩽ Lc∥w1 − w2∥₂, which implies that the global loss function F(w) is L-Lipschitz gradient with L = Σ_{c∈C} ζc Lc.

Convergence Rate. We provide a lower bound on the reduction of the loss function of the global model after model aggregation in each communication round.

Theorem 1. (Global Loss Reduction) With Assumption 1, for an arbitrary set of clients Ct ⊆ C selected by the server in round t, the reduction of the global loss F(w) is bounded by:

F(w_{fed}^{t-1}) - F(w_{fed}^{t}) \geq \sum_{c \in C_t} \sum_{i=0}^{m-1} \sum_{(x,y) \in B_c} \Big( \underbrace{-\alpha_c \| \nabla_w l(w_c^{t,i}, x, y) \|_2^2}_{\text{term 1}} + \underbrace{\beta_c \langle \nabla_w l(w_c^{t,i}, x, y), \nabla_w F(w_{fed}^{t-1}) \rangle}_{\text{term 2}} \Big), \qquad (3)

where \alpha_c = \frac{L \zeta_c^t}{2} \cdot \big( \frac{\eta}{|B_c|} \big)^2 and \beta_c = \zeta_c^t \cdot \frac{\eta}{|B_c|}.

Proof. From the L-Lipschitz continuity of the global loss function F(w), we have

F(w_{fed}^{t}) - F(w_{fed}^{t-1}) \leq \langle \nabla_w F(w_{fed}^{t-1}), w_{fed}^{t} - w_{fed}^{t-1} \rangle + \frac{L}{2} \| w_{fed}^{t} - w_{fed}^{t-1} \|^2. \qquad (4)

Using the aggregation formula of Fed-Avg [16], we can decompose w_fed^t into the weighted sum of the updated models w_c^{t,m} of the participants c ∈ Ct:

(4) = \langle \nabla_w F(w_{fed}^{t-1}), \sum_{c \in C_t} \zeta_c^t w_c^{t,m} - w_{fed}^{t-1} \rangle + \frac{L}{2} \| \sum_{c \in C_t} \zeta_c^t w_c^{t,m} - w_{fed}^{t-1} \|^2
\leq \sum_{c \in C_t} \zeta_c^t \langle \nabla_w F(w_{fed}^{t-1}), w_c^{t,m} - w_{fed}^{t-1} \rangle + \frac{L}{2} \sum_{c \in C_t} \zeta_c^t \| w_c^{t,m} - w_{fed}^{t-1} \|^2. \qquad (5)

We can further express the locally updated model w_c^{t,m} of each client c ∈ Ct as the difference between the original global model w_fed^{t−1} and m local model updates:

(5) \leq \sum_{c \in C_t} \zeta_c^t \Big[ \langle \nabla_w F(w_{fed}^{t-1}), -\eta \sum_{i=0}^{m-1} \nabla_w \tilde{F}_c(w_c^{t,i}) \rangle + \frac{L}{2} \| \sum_{i=0}^{m-1} \eta \nabla_w \tilde{F}_c(w_c^{t,i}) \|^2 \Big]
= \sum_{c \in C_t} \zeta_c^t \Big[ \frac{L}{2} \| \sum_{i=0}^{m-1} \eta \nabla_w \tilde{F}_c(w_c^{t,i}) \|^2 - \eta \sum_{i=0}^{m-1} \langle \nabla_w F(w_{fed}^{t-1}), \nabla_w \tilde{F}_c(w_c^{t,i}) \rangle \Big], \qquad (6)

where the i-th local model update of each client c can be expressed as the average gradient of the samples stored locally in Bc:

(6) = \sum_{c \in C_t} \zeta_c^t \Big[ \frac{L}{2} \| \sum_{i=0}^{m-1} \frac{\eta}{|B_c|} \sum_{(x,y) \in B_c} \nabla_w l(w_c^{t,i}, x, y) \|^2 - \frac{\eta}{|B_c|} \sum_{i=0}^{m-1} \sum_{(x,y) \in B_c} \langle \nabla_w F(w_{fed}^{t-1}), \nabla_w l(w_c^{t,i}, x, y) \rangle \Big]
\leq \sum_{c \in C_t} \sum_{i=0}^{m-1} \sum_{(x,y) \in B_c} \Big( \alpha_c \| \nabla_w l(w_c^{t,i}, x, y) \|^2 - \beta_c \langle \nabla_w F(w_{fed}^{t-1}), \nabla_w l(w_c^{t,i}, x, y) \rangle \Big),

where \alpha_c = \frac{L \zeta_c^t}{2} \cdot ( \frac{\eta}{|B_c|} )^2 and \beta_c = \zeta_c^t \cdot \frac{\eta}{|B_c|}.

Theorem 1 shows that the loss reduction of the global FL model F(w_fed^t) in each round is closely related to two terms: the local gradient magnitude of a data sample (term 1) and the projection of the local gradient of a data sample onto the direction of the global gradient over the global data distribution (term 2). Due to the different magnitude orders of the coefficients αc and βc4, and also the different orders of the values of terms 1 and 2 as shown in Figure 2(a), we can focus on term 2 (i.e., the projection of the local gradient of a data sample onto the global gradient) to evaluate a local data sample's impact on the convergence rate.

4 αc/βc ∝ η/|Bc| ≈ 10−4 with common learning rate 10−3 and storage size 10.

We next briefly describe how to evaluate data samples based on term 2. The local model parameter w_c^{t,i} in term 2 is computed from (1), where the gradient ∇w F̃c(w_c^{t,i−1}) depends on all the locally stored data samples. In other words, the value of term 2 depends on the "cooperation" of all the locally stored data samples. Thus, we can formulate the computation of term 2 via cooperative game theory [41], where each data sample represents a player and the utility of each sample set is the value of term 2. Within this cooperative game, we can regard the individual contribution of each data sample to term 2 as its value, and quantify it through leave-one-out [27], [42] or Shapley Value [28], [43]. As these methods require multiple model retrainings to compute the marginal contribution of each individual data sample, we propose a one-step look-ahead policy to approximately evaluate each sample's value by only focusing on the first local training epoch (m = 1).

Inference Accuracy. We can assume that the optimal FL model can be obtained by gathering all clients' generated data and conducting CL. Moreover, as the testing dataset and the corresponding testing accuracy are hard to obtain in FL, we use the weight divergence between the models trained through FL and CL, i.e., ∥w_fed^t − w_cen^{mt}∥, to quantify the accuracy of the FL model in each round t. With t → ∞, we can evaluate the final accuracy of the FL model.

Theorem 2. (Model Weight Divergence) With Assumption 1, for an arbitrary participating client set Ct, we have the following inequality for the weight divergence between the models trained through FL and CL after the t-th training round:

\| w_{fed}^{t} - w_{cen}^{mt} \|_2 \leq (1 + \eta L)^m \| w_{fed}^{t-1} - w_{cen}^{m(t-1)} \|_2 + \sum_{c \in C_t} \zeta_c^t \Big[ \eta \sum_{i=0}^{m-1} (1 + \eta L)^{m-1-i} G_c(w_c^{t,i}) \Big], \qquad (7)

where G_c(w) = \| \nabla_w \tilde{F}_c(w) - \nabla_w F(w) \|_2.

Proof. From the aggregation formula of Fed-Avg, we can express the global model w_fed^t as the weighted sum of the locally updated models of the participating clients c ∈ Ct:

\| w_{fed}^{t} - w_{cen}^{mt} \| = \| \sum_{c \in C_t} \zeta_c^t w_c^{t,m} - w_{cen}^{mt} \|,

which can be further decomposed according to the formula of model update in (1):

\| w_{fed}^{t} - w_{cen}^{mt} \| = \| \sum_{c \in C_t} \zeta_c^t \big[ w_c^{t,m-1} - \eta \nabla_w \tilde{F}_c(w_c^{t,m-1}) \big] - w_{cen}^{mt-1} + \eta \nabla_w F(w_{cen}^{mt-1}) \|
\leq \| \sum_{c \in C_t} \zeta_c^t w_c^{t,m-1} - w_{cen}^{mt-1} \| + \eta \| \sum_{c \in C_t} \zeta_c^t \nabla_w \tilde{F}_c(w_c^{t,m-1}) - \nabla_w F(w_{cen}^{mt-1}) \|. \qquad (8)

Then, we leverage the triangle inequality to upper bound (8):

(8) \leq \sum_{c \in C_t} \zeta_c^t \| w_c^{t,m-1} - w_{cen}^{mt-1} \| + \eta \| \sum_{c \in C_t} \zeta_c^t \big[ \nabla_w \tilde{F}_c(w_c^{t,m-1}) - \nabla_w F(w_c^{t,m-1}) + \nabla_w F(w_c^{t,m-1}) - \nabla_w F(w_{cen}^{mt-1}) \big] \|
= \sum_{c \in C_t} \zeta_c^t \| w_c^{t,m-1} - w_{cen}^{mt-1} \| + \eta \| \sum_{c \in C_t} \zeta_c^t \big[ \nabla_w \tilde{F}_c(w_c^{t,m-1}) - \nabla_w F(w_c^{t,m-1}) \big] \| + \eta \| \sum_{c \in C_t} \zeta_c^t \big[ \nabla_w F(w_c^{t,m-1}) - \nabla_w F(w_{cen}^{mt-1}) \big] \|. \qquad (9)

According to the Lipschitz continuity of the global model F in Assumption 1, we can further decompose (9):

(9) \leq \sum_{c \in C_t} \zeta_c^t (1 + \eta L) \| w_c^{t,m-1} - w_{cen}^{mt-1} \| + \eta \| \sum_{c \in C_t} \zeta_c^t \big[ \nabla_w \tilde{F}_c(w_c^{t,m-1}) - \nabla_w F(w_c^{t,m-1}) \big] \|
\leq \sum_{c \in C_t} \zeta_c^t \Big[ (1 + \eta L) \| w_c^{t,m-1} - w_{cen}^{mt-1} \| + \eta G_c(w_c^{t,m-1}) \Big], \qquad (10)

where Gc(w) = ∥∇w F̃c(w) − ∇w F(w)∥ intuitively represents the divergence between the empirical losses over the local data distribution and the global data distribution.

Next, we derive the upper bound of ∥w_c^{t,m−1} − w_cen^{mt−1}∥ for each participating client c ∈ Ct:

\| w_c^{t,m-1} - w_{cen}^{mt-1} \| = \| w_c^{t,m-2} - \eta \nabla_w \tilde{F}_c(w_c^{t,m-2}) - w_{cen}^{mt-2} + \eta \nabla_w F(w_{cen}^{mt-2}) \|
\leq \| w_c^{t,m-2} - w_{cen}^{mt-2} \| + \eta \| \nabla_w F(w_c^{t,m-2}) - \nabla_w F(w_{cen}^{mt-2}) \| + \eta \| \nabla_w \tilde{F}_c(w_c^{t,m-2}) - \nabla_w F(w_c^{t,m-2}) \|, \qquad (11)

which can be further bounded using Assumption 1:

(11) \leq (1 + \eta L) \| w_c^{t,m-2} - w_{cen}^{mt-2} \| + \eta G_c(w_c^{t,m-2})
\leq (1 + \eta L)^{m-1} \| w_c^{t,0} - w_{cen}^{(t-1)m} \| + \eta \sum_{i=0}^{m-2} (1 + \eta L)^{m-2-i} G_c(w_c^{t,i}). \qquad (12)

Combining inequalities (10) and (12), we have:

\| w_{fed}^{t} - w_{cen}^{mt} \| \leq \sum_{c \in C_t} \zeta_c^t \Big[ (1 + \eta L)^m \| w_c^{t,0} - w_{cen}^{(t-1)m} \| + \eta \sum_{i=0}^{m-1} (1 + \eta L)^{m-1-i} G_c(w_c^{t,i}) \Big]
= (1 + \eta L)^m \| w_{fed}^{t-1} - w_{cen}^{(t-1)m} \| + \sum_{c \in C_t} \zeta_c^t \Big[ \eta \sum_{i=0}^{m-1} (1 + \eta L)^{m-1-i} G_c(w_c^{t,i}) \Big].

Hence, we bound the divergence between the models trained through FL and CL by the initial model difference and the additional divergence caused by the m local model updates of the heterogeneous participating clients.

The following lemma further shows the impact of a local data sample on ∥w_fed^t − w_cen^{mt}∥₂ through Gc(w_c^{t,i}).

Lemma 1. (Gradient Divergence) For an arbitrary client c ∈ C, Gc(w) = ∥∇F̃c(w) − ∇F(w)∥₂ is bounded by:

G_c(w) \leq \Big[ \delta + \frac{1}{|B_c|} \sum_{(x,y) \in B_c} \big( \underbrace{\| \nabla_w l(w, x, y) \|_2^2}_{\text{term 1}} - \underbrace{2 \langle \nabla_w l(w, x, y), \nabla_w F(w) \rangle}_{\text{term 2}} \big) \Big]^{1/2}, \qquad (13)

where δ = ∥∇w F(w)∥₂² is a constant term for all data samples.

Proof. The main idea of the proof is to express the local gradient ∇w F̃c(w) as the average gradient of the locally stored data samples in Bc:

G_c(w) = \| \nabla_w \tilde{F}_c(w) - \nabla_w F(w) \|
= \Big[ \| \nabla_w F(w) \|^2 + \| \nabla_w \tilde{F}_c(w) \|^2 - 2 \langle \nabla_w F(w), \nabla_w \tilde{F}_c(w) \rangle \Big]^{1/2}
= \Big[ \| \nabla_w F(w) \|^2 + \| \sum_{(x,y) \in B_c} \frac{\nabla_w l(w, x, y)}{|B_c|} \|^2 - 2 \langle \nabla_w F(w), \sum_{(x,y) \in B_c} \frac{\nabla_w l(w, x, y)}{|B_c|} \rangle \Big]^{1/2}
\leq \Big[ \| \nabla_w F(w) \|^2 + \frac{1}{|B_c|} \sum_{(x,y) \in B_c} \big( \| \nabla_w l(w, x, y) \|^2 - 2 \langle \nabla_w F(w), \nabla_w l(w, x, y) \rangle \big) \Big]^{1/2},

where ∥∇w F(w)∥² is a constant term for all clients and local data samples.

Intuitively, due to the different coefficients, the twofold projection (term 2) has a larger mean and variance than the gradient magnitude (term 1) among different data samples, which is also demonstrated in our experiments in Figure 2(a). Thus, we can quantify the impact of a local data sample on Gc(w), and hence on the inference accuracy, mainly through term 2 in (13), which happens to be the same as term 2 in the bound of the global loss reduction in (3).

Fig. 2. Empirical results to support some claims; the experiment setting can be found in Section IV-A. (a) Term 1 vs Term 2: comparison of the normalized values of the two terms in (3) and (13). (b) Round 0 vs Round 30: comparison of the data values computed in round 0 and round 30. (c) Full model vs Last layer: comparison of the data values computed using gradients of all model layers and of only the last layer.

Data Valuation. Based on the above analysis, we define a new data valuation metric for FL, and provide a theoretical understanding as well as an intuitive interpretation.

Definition 1. (Data Valuation Metric) In the t-th round, for a client c ∈ C, the value of a data sample (x, y) is defined as the projection of its local gradient ∇l(w, x, y) onto the global gradient of the current global model over the unbiased global data distribution:

v(x, y) \overset{\text{def}}{=} \langle \nabla_w l(w_{fed}^{t}, x, y), \nabla_w F(w_{fed}^{t}) \rangle. \qquad (14)

Based on this new data valuation metric, once a client receives a new data sample, she can make an online decision on whether to store this sample by comparing the data value of the new sample with those of the old samples in storage, which can be easily implemented as a priority queue.

Theoretical Understanding. On the one hand, maximizing the above data valuation metric of the selected data samples is a one-step greedy strategy for minimizing the loss of the updated global model in each training round according to (3), because it optimizes the dominant term (term 2) in the lower bound of the global loss reduction in (3). The one-step look-ahead policy means that we only consider the first epoch of the local model training in (3), i.e., ⟨∇w l(w_c^{t,i}, x, y), ∇w F(w_fed^{t−1})⟩, where the index i is set as 0 and w_c^{t,0} = w_fed^{t−1}. On the other hand, this metric also improves the inference accuracy of the final global model by narrowing the gap between the models trained through FL and CL, as it reduces the high-weight term of the dominant part in (7), i.e., (1 + ηL)^{m−1} Gc(w_c^{t,0}).

Intuitive Interpretation. Under the proposed new data valuation metric, a large data value of one data sample indicates that its impact on the global FL model is similar to that of the unbiased global data distribution, guiding the clients to select the data samples which not only follow their own local data distribution but also have a similar effect to the global data distribution. In this way, the personalized information of the local data distribution is preserved and the data heterogeneity across clients is also reduced, both of which have been demonstrated to improve FL performance [44], [45], [38], [46].
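As a minimal sketch of how Definition 1 can be used online, the following Python code values an incoming sample by the inner product in (14) and keeps the top-|Bc| samples in a priority queue. Here grad_loss and global_grad are hypothetical stand-ins for the per-sample gradient and the (estimated) global gradient, both flattened into vectors.

import heapq
import itertools
import numpy as np

_counter = itertools.count()   # tie-breaker so heapq never compares raw samples

def data_value(w, x, y, grad_loss, global_grad):
    # Eq. (14): v(x, y) = <grad_w l(w, x, y), grad_w F(w)>
    return float(np.dot(grad_loss(w, x, y), global_grad))

def maybe_store(buffer, capacity, w, sample, grad_loss, global_grad):
    # Online decision: keep the new sample only if the buffer has room or the
    # sample is more valuable than the least valuable stored one.
    x, y = sample
    v = data_value(w, x, y, grad_loss, global_grad)
    entry = (v, next(_counter), sample)
    if len(buffer) < capacity:
        heapq.heappush(buffer, entry)          # min-heap keyed by data value
    elif v > buffer[0][0]:
        heapq.heapreplace(buffer, entry)       # evict the lowest-value sample
    return buffer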
B. On-Client Data Selection

In practice, it is non-trivial for one client to directly utilize the above data valuation metric for online data selection, due to the following two challenges:

(1) Lack of the latest global model: due to the partial participation of clients in FL [47], each client c ∈ C does not receive the global FL model w_fed^{t−1} in the rounds in which she is not selected for model training, and only has the outdated global FL model from her previous participating round, i.e., w_fed^{t_{c,last}−1}.

(2) Lack of the unbiased global gradient: the accurate global gradient over the unbiased global data distribution can only be obtained by aggregating all the clients' local gradients over their unbiased local data distributions. This is hard to achieve because only part of the clients participate in each communication round, and the locally stored data distribution can become biased during the on-client data selection process.

We consider that challenge (1) does not affect the online data selection process much, as the value of each data sample remains relatively stable across a few training rounds, which is also demonstrated by the experimental results in Figure 2(b); thus clients can simply use the old global model for data valuation.

To overcome challenge (2), we propose a gradient estimation method. First, to solve the issue of the skewed local gradient, we require each client c ∈ C to maintain a local gradient estimator ĝc, which is updated whenever the client receives the n-th new data sample (x, y) since the last participating round:

\hat{g}_c \leftarrow \frac{n-1}{n} \hat{g}_c + \frac{1}{n} \nabla_w l(w_{fed}^{t_{c,last}-1}, x, y), \qquad (15)

where t_{c,last} denotes the last participating round of client c, and thus w_fed^{t_{c,last}−1} stands for the latest global model that client c received. When client c is selected to participate in FL at a certain round t, the client uploads the current local gradient estimator ĝc to the server, and resets the local gradient estimator, i.e., ĝc ← 0, n ← 0, because a new global FL model w_fed^{t−1} is received. Second, to solve the problem of the skewed global gradient due to partial client participation, the server also maintains a global gradient estimator ĝ^t, which is an aggregation of the local gradient estimators, ĝ^t = Σ_{c∈C} ζc ĝc. As it would incur high communication cost to collect ĝc from all the clients, the server only uses the ĝc of the participating clients to update the global gradient estimator ĝ^t in each round t:

\hat{g}^{t} \leftarrow \hat{g}^{t-1} + \sum_{c \in C_t} \zeta_c (\hat{g}_c - \hat{g}_c^{t_{last}}). \qquad (16)

Thus, in each training round t, the server needs to distribute both the current global FL model w_fed^{t−1} and the latest global gradient estimator ĝ^{t−1} to each selected client c ∈ Ct, who will conduct local model training, and upload both the locally updated model w_c^{t,m} and the local gradient estimator ĝc back to the server.

Simplified Version. In both the local gradient estimation in (15) and the data valuation in (14), for a new data sample, we need to backpropagate through the entire model to compute its gradient, which introduces high computation cost and a large memory footprint for storing intermediate model outputs. To reduce these costs, we only use the gradients of the last few network layers of the ML model instead of the whole model, as the partial model gradient is also able to reflect the trend of the full gradient, which is verified in Figure 2(c).

Privacy Concern. The transmission of local gradient estimators may disclose the local gradient of each client to some extent, which can be avoided by adding Gaussian noise to each local gradient estimator before uploading, as in differential privacy [48], [49].
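The running estimators in (15) and (16) can be implemented in a few lines of NumPy. The sketch below is an assumption-laden illustration: clients are objects holding an estimator, zeta maps each client to its weight ζc, and last_uploaded caches each client's previously uploaded estimator ĝ_c^{t_last}.

import numpy as np

class LocalGradientEstimator:
    # Eq. (15): streaming average of per-sample gradients since the last participation.
    def __init__(self, dim):
        self.g_hat = np.zeros(dim)
        self.n = 0
    def update(self, sample_grad):
        self.n += 1
        self.g_hat += (sample_grad - self.g_hat) / self.n   # ((n-1)/n)*g_hat + (1/n)*grad
    def reset(self):
        self.g_hat[:] = 0.0
        self.n = 0

def update_global_estimator(g_prev, participants, zeta, last_uploaded):
    # Eq. (16): replace only the participating clients' stale contributions.
    g_new = g_prev.copy()
    for c in participants:
        g_new += zeta[c] * (c.estimator.g_hat - last_uploaded[c])
        last_uploaded[c] = c.estimator.g_hat.copy()
        c.estimator.reset()                    # a fresh global model will be received
    return g_new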
C. Cross-Client Data Storage

Since the local data distributions of clients may overlap with each other, independently conducting the data selection process on each client may lead to a distorted distribution of the data stored across clients.

One potential solution is to divide the global data distribution into several regions, and coordinate each client to store valuable data samples for one specific distribution region, while the union of all stored data still follows the unbiased global data distribution. In this work, we consider the label of data samples as the dividing criterion5. Thus, before the training process, the server needs to instruct each client on the labels and the corresponding quantity of data samples to store. Considering the partial client participation and the heterogeneous data distribution among clients, the cross-client coordination policy needs to satisfy the following four desirable properties:

(1) Efficient Data Selection: To improve the efficiency of data selection, a label y ∈ Y should be assigned to the clients who generate more data samples with this label, following the intuition that there is a higher probability of selecting more valuable data samples from a larger pool of candidate data.

(2) Redundant Label Assignment: To ensure that all the labels are likely to be covered in each round even with partial client participation, we require each label y ∈ Y to be assigned to more than n_y^{label} clients, which is a hyperparameter decided by the server.

(3) Limited Storage: Due to limited on-device storage, each client c should be assigned fewer than n_c^{client} labels to ensure a sufficient number of valuable data samples stored for each assigned label within the limited storage volume, where n_c^{client} is also a hyperparameter decided by the server.

(4) Unbiased Global Distribution: The weighted average of all clients' stored data distributions is expected to be equal to the unbiased global data distribution, i.e., P̃(y) = P(y), y ∈ Y.

5 There are some other methods to divide the data distribution, such as K-means [50] and hierarchical clustering [51], and our results are independent of these methods.

We formulate the cross-client data storage problem with the above four properties by representing the coordination strategy as a matrix D ∈ N^{|C|×|Y|}, where D_{c,y} denotes the number of data samples with label y that client c should store. We use a matrix V ∈ R^{|C|×|Y|} to denote the statistical information of each client's generated data, where V_{c,y} = vc Pc(y) is the average speed of the data samples with label y generated by client c. The cross-client coordination strategy can be obtained by solving the following optimization problem, where condition (1) is formulated as the objective, and conditions (2), (3), and (4) are described by the constraints (18), (19), and (20), respectively:

\max_{D} \sum_{c \in C} \sum_{y \in Y} D_{c,y} \cdot V_{c,y} = \| D V \|_1, \qquad (17)
\text{s.t.} \quad \| D_y^{T} \|_0 \geq n_y^{label}, \quad \forall y \in Y, \qquad (18)
\| D_c \|_0 \leq n_c^{client}, \quad \forall c \in C, \qquad (19)
\| D_c \|_1 = |B_c|, \quad \forall c \in C,
\frac{\sum_{c \in C} D_{c,y}}{\| B \|_1} = \frac{\sum_{c \in C} V_{c,y}}{\| V \|_1}, \quad \forall y \in Y. \qquad (20)

Algorithm 1: Greedy cross-client coordination
Input: Label set Y, client set C, data velocity matrix V, storage size vector B, redundant label assignment n^{label}, limited storage n^{client}
Output: Coordination matrix D
1: // Sort labels by number of owners
2: Y ← sort Y in increasing order of the number of non-zero elements in V[:, y];
3: for y in Y do
4:     // Sort clients by data velocity
5:     C_tmp ← sort c ∈ C in decreasing order of the data velocity V[c, y];
6:     for c in C_tmp do
7:         // Limited storage requirement
8:         if Sum(D[c, :]) < n_c^{client} then
9:             D[c, y] ← 1;
10:        end
11:        // Redundant label requirement
12:        if Sum(D[:, y]) ≥ n_y^{label} then
13:            break;
14:        end
15:    end
16: end
17: // Divide each device's storage evenly
18: for c in C do
19:     cnt ← Sum(D[c, :]);
20:     for y in Y do
21:         D[c, y] ← B[c] ∗ D[c, y] / cnt;
22:     end
23: end
24: return D;
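For readers who prefer executable code, the following Python function mirrors Algorithm 1 under the assumption that V is a nested dict with V[c][y] the velocity of label-y samples at client c, B[c] is client c's storage size, and n_label / n_client are the hyperparameters from constraints (18) and (19). It is a sketch, not the deployed implementation.

def greedy_coordination(V, B, n_label, n_client):
    clients = list(V.keys())
    labels = sorted({y for c in clients for y in V[c]},
                    key=lambda y: sum(1 for c in clients if V[c].get(y, 0) > 0))
    assigned = {c: [] for c in clients}                 # labels assigned to each client
    owners = {y: 0 for y in labels}                     # clients assigned to each label
    for y in labels:                                    # rarest labels first
        for c in sorted(clients, key=lambda c: -V[c].get(y, 0)):  # fastest producers first
            if len(assigned[c]) < n_client[c]:          # limited-storage requirement
                assigned[c].append(y)
                owners[y] += 1
            if owners[y] >= n_label[y]:                 # redundant-label requirement met
                break
    # divide each client's storage evenly over its assigned labels
    return {c: {y: B[c] // len(assigned[c]) for y in assigned[c]}
            for c in clients if assigned[c]}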
Complexity Analysis. We can verify that the above optimization problem with l0 norms is a general convex-cardinality problem, which is NP-hard [52], [53]. To solve it, we divide it into two subproblems: (1) decide which elements of the matrix D are non-zero, i.e., S = {(c, y) | D_{c,y} ≠ 0}, that is, assign labels to clients under the constraints (18) and (19); (2) determine the specific values of the non-zero elements of D by solving a simplified convex optimization problem:

\max_{D} \sum_{c \in C, y \in Y, (c,y) \in S} D_{c,y} \cdot V_{c,y}
\text{s.t.} \quad \sum_{y \in Y, (c,y) \in S} D_{c,y} = |B_c|, \quad \forall c \in C, \qquad (21)
\frac{\sum_{c \in C, (c,y) \in S} D_{c,y}}{\| D \|_1} = \frac{\sum_{c \in C} V_{c,y}}{\| V \|_1}, \quad \forall y \in Y.

As the number of possible sets S can be exponential in |C| and |Y|, it is still prohibitively expensive to derive the globally optimal solution of D with large |C| (massive clients in FL). The classic approach is to replace the non-convex discontinuous l0 norm constraints with convex continuous l1 norm regularization terms in the objective function [53], which fails to work in our scenario, because simultaneously minimizing as many as |C| + |Y| non-differentiable l1 norms in the objective function would lead to high computation and memory costs as well as unstable solutions [54], [55]. Thus, we propose a greedy algorithm to solve this complicated problem.
Greedy Cross-Client Coordination Algorithm. We achieve the four desirable properties through the following three steps, which are summarized in Algorithm 1:

(1) Information Collection: Each client c ∈ C sends rough information about its local data to the server, including the storage capacity |Bc| and the data velocity V_{c,y} of each label y, which can be obtained from the statistics of previous time periods. Then, the server can construct the vector of storage sizes B ∈ N^{|C|} and the matrix of data velocities V ∈ R^{|C|×|Y|} of all clients.

(2) Label Assignment: The server sorts the labels in non-decreasing order of the total number of clients having each label. We prioritize the labels with top rank in the label-client assignment, because these labels are more difficult to cover with enough clients to meet the Redundant Label Assignment property (lines 1-2). Specifically, for each considered label y ∈ Y in this order, there could be multiple clients to be assigned, and the server allocates label y to the clients c who generate data samples with label y at a higher data velocity (lines 5-6). By doing this, we attempt to satisfy the Efficient Data Selection property (lines 6-15). Once the number of labels assigned to a client c reaches n_c^{client}, this client is no longer considered, due to the Limited Storage property (lines 7-10).

(3) Quantity Assignment: With the above two steps, we have decided the non-zero elements of the client-label matrix D, i.e., the set S. To further reduce computational complexity and avoid imbalanced on-device data storage for each label, we do not directly solve the optimization problem in (21). Instead, we require each client to divide its storage capacity evenly among the assigned labels, and compute a weight γy for each label y ∈ Y to guarantee that the weighted distribution of the stored data approximates the unbiased global data distribution (lines 18-23), i.e., satisfying γy P̃(y) = P(y). Accordingly, we can derive the weight γy for each label y by setting

\gamma_y = \frac{P(y)}{\tilde{P}(y)} = \frac{\| V^{T}[y] \|_1 / \| V \|_1}{\| D^{T}[y] \|_1 / \| D \|_1}. \qquad (22)

Thus, each client c only needs to maintain one priority queue of size |B[c]| / ∥D[c]∥₀ for each assigned label. In the local model training, each participating client c ∈ Ct updates
the local model using the weighted stored data samples:

w_c^{t,i} \leftarrow w_c^{t,i-1} - \frac{\eta}{\zeta_c} \sum_{(x,y) \in B_c} \gamma_y \nabla_w l(w_c^{t,i-1}, x, y), \qquad (23)

where ζc = Σ_{(x,y)∈Bc} γy denotes the new weight of each client, and the normalized weight of client c ∈ Ct for model aggregation in round t becomes ζ_c^t = ζc / Σ_{c′∈Ct} ζc′.

We illustrate a simple example in Figure 3 for a better understanding of the above procedure.

Fig. 3. A simple example to illustrate the greedy coordination. ① Sort labels according to #owners and obtain the label order {2, 1, 3}; ② Sort and allocate clients for each label under the constraints n_c^{client} = 2 and n_y^{label} = 2; ③ Obtain the coordination matrix D and compute the class weights γ according to (22).

Privacy Concern. The potential privacy leakage from uploading rough local information is tolerable in practice, and can be further avoided through homomorphic encryption [56], which enables sorting k encrypted data samples with computational complexity O(k log k²) [57].

D. Overall Procedure of ODE

ODE incorporates cross-client data storage and on-client data evaluation to coordinate mobile devices to store valuable samples, speeding up the model training process and improving the inference accuracy of the final global model. The overall procedure is shown in Figure 4.

Fig. 4. Overview of ODE framework.

Key Idea. Intuitively, the cross-client data storage component coordinates clients to store high-quality data samples with different labels, avoiding overlapped data stored by clients. The on-client data evaluation component instructs each client to select the data whose local gradients are similar to the global gradient, which reduces data heterogeneity among clients while also preserving the personalized data distribution.

Cross-Client Data Storage. Before the FL model training process, the central server collects data distribution information and storage capacities from all clients (①), and solves the optimization problem in (17) through our greedy algorithm (②).

On-Client Data Evaluation. During the FL process, clients train the local model in participating rounds, and utilize idle computation and memory resources to conduct on-device data selection in non-participating rounds. In the t-th training round, non-selected clients, selected clients and the server perform different operations:

Non-selected clients:
• Data Evaluation (③): each non-selected client c ∈ C \ Ct continuously evaluates and selects data samples according to the data valuation metric in (14), within which the clients use the estimated global gradient received in the last participating round instead of the accurate one.
• Local Gradient Estimator Update (④): the client also timely updates the local gradient estimator ĝc using (15).

Selected clients:
• Local Model Update (⑤): after receiving the new global model w_fed^{t−1} and the new global gradient estimator ĝ^{t−1} from the server, each selected client c ∈ Ct performs local model updates based on the stored data samples by (23).
• Local Model and Estimator Transmission (⑥): each selected client sends the updated model w_c^{t,m} and the local gradient estimator ĝc to the server. The local estimator ĝc is then reset to 0 for approximating the local gradient of the newly received global model w_fed^{t−1}.

Server:
• Global Model and Estimator Transmission (⑦): at the beginning of each training round, the server distributes the global model w_fed^{t−1} and the global gradient estimator ĝ^{t−1} to the selected clients.
• Local Model and Estimator Aggregation (⑧): at the end of each training round, the server collects the updated local models w_c^{t,m} and local gradient estimators ĝc from the participating clients c ∈ Ct, which are aggregated to obtain a new global model by (2) and a new global gradient estimator by (16).
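To connect steps ③ through ⑤ with the equations above, the following Python sketch shows the label-weighted local update of (23) that a selected client runs on its stored buffer; gamma holds the label weights from (22) and grad_loss is again a hypothetical per-sample gradient helper.

def weighted_local_update(w, stored_buffer, gamma, grad_loss, eta, epochs):
    # Eq. (23): gradients are re-weighted by gamma[y] so that the stored data
    # behaves like a draw from the unbiased global label distribution.
    zeta_c = sum(gamma[y] for (_, y) in stored_buffer)      # client's new weight
    for _ in range(epochs):
        g = sum(gamma[y] * grad_loss(w, x, y) for (x, y) in stored_buffer)
        w = w - (eta / zeta_c) * g
    return w, zeta_c        # zeta_c is later normalized for model aggregation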
IV. EVALUATION RESULTS

In this section, we first introduce the experiment settings, baselines and evaluation metrics (Section IV-A). Second, we show the individual and combined impacts of limited on-device storage and streaming networked data on FL to demonstrate the motivation of our work (Section IV-B). Third, we present the overall performance of ODE and different data selection methods on model training speedup and inference accuracy improvement, as well as the memory footprint and data evaluation delay (Section IV-C). Next, we show the robustness of ODE against various environmental factors (Section IV-D), such as the number of local training epochs m, the client participation rate |Ct|/|C|, the on-device storage capacity Bc, the mini-batch size and the data heterogeneity across clients. Finally, we analyze the individual effect of different components of ODE (Section IV-E).

TABLE II
INFORMATION OF DIFFERENT TASKS, DATASETS, MODELS AND DEFAULT EXPERIMENT SETTINGS.

Tasks | Datasets | Models | #Samples | #Labels | #Devices | n_y^{label} | |Ct|/|C| | η | m | |Bc|
ST | Synthetic Dataset [31] | LogReg | 1,016,442 | 10 | 200 | 5 | 5% | 1e−4 | 5 | 10
IC | Fashion-MNIST [32] | LeNet [58] | 70,000 | 10 | 50 | 5 | 10% | 1e−3 | 5 | 5
HAR | HARBOX [33] | Customized DNN | 34,115 | 5 | 120 | 5 | 10% | 1e−3 | 5 | 5
TC | Industrial Dataset | Customized CNN | 37,707 | 20 | 30 | 5 | 20% | 5e−3 | 5 | 10
A. Experiment Setting
Learning Tasks, Datasets and ML Models. To demonstrate ODE's advanced performance and generalization across various learning tasks, datasets and ML models, we
evaluate ODE on one synthetic dataset, two real-world datasets
and one industrial dataset, all of which vary in data quantities,
distributions and model outputs, and cover the tasks of Synthetic Task (ST), Image Classification (IC), Human Activity
Recognition (HAR) and mobile Traffic Classification (TC).
The statistics of the tasks are summarized in Table II, and
introduced as follows:
• Synthetic Task. The synthetic dataset we use is proposed in the LEAF benchmark [31] and is also described in detail in [36]. It contains 200 clients and 1 million data samples, and a Logistic Regression model is trained for this 10-class task.
• Image Classification. Fashion-MNIST [32] contains 60,000 training images and 10,000 testing images, which are divided among 50 clients according to labels [16]. We train LeNet [58] for this 10-class image classification task.
• Human Activity Recognition. HARBOX [33] is a 9-axis IMU dataset collected from 121 users' smartphones in a crowdsourcing manner, including 34,115 data samples with 900-dimensional features. Considering the simplicity of the dataset and task, a lightweight customized DNN with two dense layers followed by a SoftMax layer is deployed for this 5-class human activity recognition task [59].
• Traffic Classification. The industrial dataset for mobile application classification was collected by our deployment of 30 ONT devices in a simulated network environment from May 2019 to June 2019. In total, the dataset contains more than 560,000 data samples with more than 250 applications as labels, covering the application categories of videos (such as YouTube and TikTok), games (such as LOL and WOW), file downloading (such as AppStore and Thunder) and communication (such as WhatsApp and WeChat). We manually label the application of each data sample. The model we apply is a CNN consisting of 4 convolutional layers with kernel size 1×3 to extract features and 2 fully-connected layers for classification, which is able to achieve 95% accuracy through CL and satisfies the limited on-device resource requirements due to its small number of model parameters. To reduce the training time caused by the large scale of the dataset, we randomly select 20 out of the 250 applications as labels, with a total of 37,707 data samples, whose distribution is shown in Figure 6.

Fig. 5. Unbalanced data quantity of clients.
Fig. 6. Data distribution in the traffic classification dataset.
Parameter Configurations. For all the experiments, we use SGD as the optimizer and decay the learning rate every 100 rounds by ηnew = 0.95 × ηold. To simulate the setting of streaming networked data, we set the on-device data velocity to vc = (#training samples)/500, which means that each device c ∈ C receives vc data samples one by one in each communication round, and the samples are shuffled and appear again every 500 rounds. Other default configurations are shown in Table II. Note that the participating clients in each round are randomly selected, and for each experiment, we repeat 5 times and report the average results [14], [59].
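The streaming setup and the learning-rate schedule can be reproduced with a short helper like the one below (a sketch of the description above, with hypothetical names; the per-device sample order is simply reshuffled once per 500-round pass).

import random

def make_stream(train_samples, rounds_per_pass=500, seed=0):
    # v_c = (#training samples) / 500 samples arrive per communication round.
    rng = random.Random(seed)
    v_c = max(len(train_samples) // rounds_per_pass, 1)
    order = list(train_samples)
    def samples_for_round(t):
        if t % rounds_per_pass == 0:
            rng.shuffle(order)                 # samples reappear, reshuffled, each pass
        start = (t % rounds_per_pass) * v_c
        return order[start:start + v_c]
    return samples_for_round

def learning_rate(eta0, t):
    # eta_new = 0.95 * eta_old every 100 rounds
    return eta0 * (0.95 ** (t // 100))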
Baselines. In our experiments, we compare two versions of
ODE, ODE-Exact (using exact global gradient) and ODE-Est
(using estimated global gradient), with four categories of data
selection methods, including random sampling methods (RS),
importance-based methods for CL (HL and GN), the existing
data selection methods for FL (FB and SLD) and the ideal
method with unlimited on-device storage (FD):
• Random sampling (RS) methods, including Reservoir Sampling [35] and First-In-First-Out (storing the latest |Bc| data samples). The experimental results of these two methods are similar, and thus we show the results of First-In-First-Out for RS only.
• Importance sampling-based methods including HighLoss
(HL), which uses the loss of each data sample as data
value to reflect the informativeness of data [60], [61],
[62], and GradientNorm (GN), which quantifies the impact of each data sample on model update through its
gradient norm [63], [29].
• Existing data selection methods for canonical FL, including FedBalancer (FB) [25] and SLD (Sample-Level Data selection) [26], which are revised slightly to adapt to the streaming networked setting: (i) We store the
loss/gradient norm of the latest 50 samples for noise
removal; (ii) For FedBalancer, we ignore the data samples
with loss larger than top 10% loss value, and for SLD,
we remove the samples with gradient norm larger than
the median norm value.
• Ideal method with unlimited on-device storage, denoted
as FullData (FD), using the entire dataset of each client
for training to simulate the unlimited storage scenario.
Metrics for Training Performance. We use two metrics
to evaluate the performance of each data selection method:
(1) Time-to-Accuracy Ratio: we measure the model training
speedup by the ratio of training time of RS and the considered
method to reach the same target accuracy, which is set to
be the final inference accuracy of RS. As the time of one
communication round is usually fixed in practical FL scenarios, we can quantify the training time by the number of communication rounds. (2) Final Inference Accuracy: we evaluate the
inference accuracy of the final global model on each device’s
testing data, and report the average accuracy for evaluation.
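Concretely, the two metrics can be computed from per-round accuracy curves as in the short sketch below (our own illustration; acc_rs and acc_method are lists of global-model accuracies per communication round).

def rounds_to_target(acc_curve, target):
    for t, acc in enumerate(acc_curve, start=1):
        if acc >= target:
            return t
    return None                                    # never reaches the target ("-")

def time_to_accuracy_ratio(acc_rs, acc_method):
    # Target accuracy = final inference accuracy of RS; speedup = T_RS / T_method.
    target = acc_rs[-1]
    t_rs, t_method = rounds_to_target(acc_rs, target), rounds_to_target(acc_method, target)
    return None if t_method is None else t_rs / t_method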
B. Experiment Results for Motivation

In this subsection, we provide comprehensive experimental results to motivate our work. First, we show that the properties of limited on-device storage and streaming networked data can deteriorate the classic FL model training process significantly in various settings, such as different numbers of local training epochs and different characteristics of clients' data. Then, we analyze the separate impacts of these two properties on FL with different data selection methods.

Overall Impact. To investigate how the properties of limited on-device storage and streaming networked data impact classic FL under different data characteristics, such as the range and variance of the local data distribution, we conduct experiments over three settings with different numbers of labels owned by each client, which are constructed from the industrial traffic classification dataset (TC).

The experimental results are shown in Table III, where average variance represents the average local data variance of clients. The results demonstrate that: (1) As the local data variance increases, the reduction of convergence rate and model accuracy becomes larger. This is because the stored data samples are more likely to be biased when the data distribution has a wide range. (2) When the number of local epochs m increases, the negative impact of limited on-device storage becomes more serious due to larger steps taken along the biased update direction, slowing down convergence by 3.92× and decreasing the final model accuracy by as much as 6.7%.

Individual Impact. To show the individual impact of streaming networked data, we ignore limited storage and assume that each device can utilize all the generated data, instead of only the stored data, for online data evaluation and selection. To show the individual impact of limited storage, we ignore streaming networked data and suppose that in each epoch, all the new data are generated concurrently and can be accessed arbitrarily.

The experimental results are shown in Table IV, where − denotes failing to reach the target inference accuracy. They show that limited on-device storage is the essential cause of the degradation of the FL training process, because: (1) The property of streaming networked data mainly prevents previous methods from making accurate online decisions on data selection, as they select each sample according to a normalized probability depending on both discarded and upcoming samples, which are not available in the streaming setting but can be estimated. (2) The property of limited storage prevents previous methods from obtaining full local and global data information, and guides clients to select suboptimal or even low-value data from an insufficient candidate dataset, which is more harmful to the FL process.

C. Overall Performance

We compare the performance of ODE with the six baselines on all the datasets, and show the results in Table V.

ODE significantly speeds up the model training process. We observe that ODE improves the time-to-accuracy performance over the existing data selection methods on all four datasets. Compared with the baselines, ODE achieves the target accuracy 5.88× ∼ 9.52× faster on ST, 1.20× ∼ 1.35× faster on IC, 1.55× ∼ 2.22× faster on HAR, and 2.5× faster on TC. Also, we observe that the largest speedup is achieved on the dataset ST, because the high non-i.i.d. degree across clients and the large data divergence within clients leave great potential for ODE to reduce data heterogeneity and improve the training process through the designed on-device data selection policy.

ODE largely improves the inference accuracy of the final model. Table V shows that, in comparison with baselines with the same storage, ODE enhances the final accuracy on all the datasets, achieving 3.24% ∼ 7.56% higher accuracy on ST, a 3.13% ∼ 6.38% increase on HAR, and around a 6% rise on TC. We also notice that ODE has a marginal accuracy improvement (≈ 1.4%) on IC, because Fashion-MNIST has less data variance within each label, and a randomly selected subset is sufficient to represent the entire data distribution for model training.

Importance-based data selection methods perform poorly. Table V shows that these methods cannot even reach the target final accuracy on tasks IC, HAR and TC, as these datasets are collected from the real world and contain noisy data, making such importance sampling methods fail to work [25], [26].

Previous data selection methods for FL outperform importance-based methods but are worse than ODE.
TABLE III
I MPACT OF LIMITED ON - DEVICE STORAGE IN THE SETTINGS WITH DIFFERENT N ON -IID DATA DISTRIBUTION AND THE TRAFFIC CLASSIFICATION TASK .
Settings
#Label per
Client
Average
Variance
Setting 1
5
0.227
Setting 2
10
0.234
Setting 3
15
0.235
#Local
Epoch
2
5
2
5
2
5
Convergence Rate
TRS /TFullData
2.18
1.70
2.50
2.84
3.46
3.92
Inference Accuracy
RS
FullData
0.867
0.912
0.855
0.890
0.894
0.932
0.874
0.926
0.906
0.960
0.890
0.957
TABLE IV
I NDIVIDUAL IMPACTS OF ON - DEVICE STORAGE AND STREAMING NETWORKED DATA ON DATA SELECTION IN FL WITH THE SYNTHETIC TASK .
Setting
Limited Storage
Streaming Data
Both
RS
79.6
79.6
79.6
Final Accuracy (%)
HL
GN
FB SLD
78.4 83.3 77.6 76.9
87.2 87.6 87.2 87.0
78.4 83.3 78.5 82.3
ODE
87.1
88.2
87.1
RS
1.00
1.00
1.00
Training Speedup
HL
GN
FB
−
4.87
−
2.86 2.30 2.86
−
4.87
−
(×)
SLD
−
2
4.08
ODE
9.52
2.67
9.52
TABLE V
THE OVERALL PERFORMANCE OF DIFFERENT DATA SELECTION METHODS ON MODEL TRAINING SPEEDUP AND INFERENCE ACCURACY. THE SYMBOL '−' MEANS THAT THE METHOD FAILS TO REACH THE TARGET ACCURACY.

Model Training Speedup (×)
Task | RS  | HL | GN   | FB | SLD  | ODE-Exact | ODE-Est | FD
ST   | 1.0 | −  | 4.87 | −  | 4.08 | 9.52      | 5.88    | 2.67
IC   | 1.0 | −  | −    | −  | −    | 1.35      | 1.20    | 1.01
HAR  | 1.0 | −  | −    | −  | −    | 2.22      | 1.55    | 4.76
TC   | 1.0 | −  | −    | −  | −    | 2.51      | 2.50    | 3.92

Inference Accuracy (%)
Task | RS    | HL    | GN    | FB    | SLD   | ODE-Exact | ODE-Est | FD
ST   | 79.56 | 78.44 | 83.28 | 78.56 | 82.38 | 87.12     | 82.80   | 88.14
IC   | 71.31 | 51.95 | 41.45 | 60.43 | 69.15 | 72.71     | 72.70   | 71.37
HAR  | 67.25 | 48.16 | 51.02 | 48.33 | 56.24 | 73.63     | 70.39   | 77.54
TC   | 89.30 | 69.00 | 69.3  | 72.19 | 72.30 | 95.30     | 95.30   | 96.00
As shown in Table V, FedBalancer and SLD perform better than HL and GN, but considerably worse than RS, which differs from the phenomenon observed in traditional settings [64], [26], [25]. This is because (1) their noise-reduction steps, such as removing the samples with the highest loss or gradient norm, rely heavily on complete statistical information of the full dataset, and (2) their on-client data valuation metrics fail to work for global model training in FL, as discussed in Section I.
Simplified ODE reduces computation and memory costs significantly with marginal performance degradation. We conduct two additional experiments that use only the last 1 and 2 layers (out of 5 layers in total) for data valuation on the industrial TC dataset. Empirical results in Figure 7 demonstrate that the simplified version reduces memory by up to 44% and evaluation delay by up to 83%, with only a 0.1× loss in training speedup and a 1% loss in accuracy.
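For intuition, the following sketch (our own simplified reading of the idea, with assumed module names; not the paper's implementation) shows how a per-sample valuation based on the gradient of only the final fully connected layer of a softmax classifier can be computed without any backward pass, since that layer's per-sample weight gradient is (p − y_onehot)hᵀ and its norm factorizes as ‖p − y_onehot‖·‖h‖.

```python
import torch
import torch.nn.functional as F

# Hypothetical classifier split into a feature backbone and a last linear layer.
backbone = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(28 * 28, 128), torch.nn.ReLU())
last_fc = torch.nn.Linear(128, 10)

def last_layer_grad_norm(x, label):
    """Cheap per-sample valuation using only the last layer's gradient norm."""
    with torch.no_grad():
        h = backbone(x.unsqueeze(0))                        # penultimate features
        p = F.softmax(last_fc(h), dim=-1)                   # predicted distribution
        y = F.one_hot(torch.tensor([label]), 10).float()    # ground-truth one-hot
        return ((p - y).norm() * h.norm()).item()           # = ||(p - y) h^T||_F

value = last_layer_grad_norm(torch.randn(1, 28, 28), label=3)
```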
ODE introduces a small extra memory footprint and data processing delay during the data selection process. Empirical results in Table VI demonstrate that simplified ODE brings only a tiny evaluation delay and memory burden to mobile devices (1.23 ms and 14.47 MB for the industrial TC task), and thus can be applied in practical mobile network scenarios.
Fig. 7. Performance and cost of simplified ODE with different numbers of model layers for gradient computation: (a) inference accuracy, (b) training speedup, (c) memory footprint, (d) evaluation latency.
TABLE VI
THE MEMORY FOOTPRINT AND EVALUATION DELAY PER SAMPLE VALUATION OF BASELINES ON THREE REAL-WORLD TASKS.

Memory Footprint (MB)
Task | RS   | HL    | GN    | ODE-Est | ODE-Sim
IC   | 1.70 | 11.91 | 16.89 | 18.27   | 16.92
HAR  | 1.92 | 7.27  | 12.23 | 13.46   | 12.38
TC   | 0.75 | 10.58 | 19.65 | 25.15   | 14.47

Evaluation Time (ms): RS takes about 0.05 ms per sample on all three tasks, while simplified ODE (ODE-Sim) takes 1.23 ms per sample on the industrial TC task; the other measured per-sample delays of HL, GN, ODE-Est and ODE-Sim range from 0.36 ms to 22.8 ms.
Fig. 8. The training process of different data selection methods with various numbers of local training epochs on the industrial traffic classification dataset: (a) Epoch = 2, (b) Epoch = 10.

Fig. 9. Robustness of ODE and baselines to local epoch number on three public datasets: (a) Synthetic Dataset, (b) Image Classification, (c) Human Activity Recognition.
D. Robustness of ODE
In this subsection, we compare the robustness of ODE and previous methods to various factors in real-world environments, such as the number of local training epochs m, the client participation rate |C^t|/|C|, the storage capacity |B_c|, the mini-batch size and the data heterogeneity across clients, on the industrial traffic classification dataset.
Number of Local Training Epochs. Empirical results shown in Figures 8 and 9 demonstrate that (1) ODE can work with various numbers of local training epochs m. With m increasing from 2 to 10, both ODE-Exact and ODE-Est achieve higher training speedup and final inference accuracy than existing methods under the same setting (e.g., 8.1× → 10× training speedup and 3.5% → 5.1% accuracy improvement on ST), and thus bring more improvement in user experience; (2) ODE-Est has similar performance to ODE-Exact in improving the model performance. The first phenomenon coincides with the previous analysis in Section III-A: (1) for convergence rate, the one-step-look-ahead strategy of ODE only optimizes the loss reduction of the first local training epoch in (3), so its influence on model convergence is weakened when the number of local epochs becomes larger; (2) for inference accuracy, ODE narrows the gap between the models trained through FL and CL by optimizing the dominant term with the maximum weight, i.e., (1 + ηL)^{m−1} G_c(w_c^{t,0}), of the gap bound in (13), leading to better final inference accuracy with a larger m.
6 We choose to compare with RS because it outperforms importance
sampling and previous data selection methods in general cases, and here we
mainly want to investigate the robustness of ODE.
Participation Rate. The results in Figure 10 demonstrate that ODE can improve the FL process significantly even with a small client participation rate, accelerating model training by 2.57× and increasing inference accuracy by 6.6% on the TC task, by 9× and 8% on ST, by 2× and 3% on IC, and by 2× and 6% on HAR. This demonstrates the practicality of ODE in real-world environments, where only a small proportion of mobile devices are able to participate in each FL round. We also notice that in some cases (IC and HAR), the effectiveness of ODE in speeding up convergence is weakened under a high client participation rate; this is because increasing the participation rate itself reduces the negative impact of cross-client data heterogeneity on global model convergence, a benefit that ODE already provides.

Storage Capacity. We also conduct experiments to investigate how on-device storage impacts ODE on the industrial dataset. For each of the four data selection methods (RS, ODE-Exact, ODE-Est, FullData)6, we conduct experiments with on-device storage capacities ranging from 5 samples to 20 samples, and plot the training time as well as the corresponding final accuracy in Figure 11. The training time is normalized by the time-to-accuracy of FullData to exhibit the variation of training time more clearly. Empirical results indicate that (1) the performance of RS (both convergence rate and inference accuracy) degrades rapidly as the storage capacity decreases; (2) ODE has stable and good performance that is nearly independent of the storage capacity, and is thus robust to the potentially diverse storage capacities of heterogeneous mobile devices in practice; (3) compared with ODE, RS needs more than twice the storage space to achieve the same model performance.
Fig. 10. The training speedup and final accuracy of different data selection methods with various participation rates on four tasks: (a) Synthetic Task, (b) Image Classification, (c) Human Activity Recognition, (d) Traffic Classification.
TABLE VIII
ROBUSTNESS TO DIFFERENT DATA HETEROGENEITY.

#Label | Speedup of ODE-Est | Final Accuracy (RS) | Final Accuracy (ODE-Est)
1      | 1.56×              | 81.8%               | 85.4%
5      | 2.19×              | 85.5%               | 90.1%
15     | 2.07×              | 89.0%               | 94.6%
Fig. 11. Robustness of ODE to storage capacity on the industrial dataset: (a) training time, (b) inference accuracy.
TABLE VII
ROBUSTNESS OF ODE TO DIFFERENT MINI-BATCH SIZES ON THE INDUSTRIAL TRAFFIC CLASSIFICATION DATASET.

Batch Size | Speedup of ODE-Est | Final Accuracy (RS) | Final Accuracy (ODE-Est)
1          | 2.42×              | 92.0%               | 95.0%
2          | 2.07×              | 92.4%               | 95.3%
5          | 1.97×              | 92.9%               | 95.5%
Mini-batch Size. As mini-batches are widely used in FL model training to reduce training time, we conduct experiments with different mini-batch sizes on the industrial TC dataset, and adjust the learning rate from 0.005 to 0.001 and the local epoch number from 5 to 2 to reduce the instability caused by multiple local model updates. We show the performance of ODE-Est and RS in Table VII, which demonstrates that (1) ODE can improve the FL model training process with various mini-batch sizes, increasing the model accuracy by 2.6% ∼ 3% and speeding up model convergence by as much as 2.42×; (2) the smaller the batch size, the larger the training speedup and accuracy improvement obtained by ODE, which is consistent with the results for different numbers of local epochs.
Data Heterogeneity. The data heterogeneity (Non-IID data) among clients has always been an obstacle to accelerating convergence and improving accuracy in FL model training. Experimental results in Table VIII show that ODE achieves a 3.6% ∼ 5.6% increase in inference accuracy and a 1.56× ∼ 2.2× speedup in training time under different heterogeneous settings. Although the performance of ODE seems to degrade in the extremely heterogeneous case (#Label = 1), the degradation is caused by the small number of local data labels, which reduces the effectiveness of the cross-device data selection component of ODE; this case rarely arises in practice, as mobile devices usually hold data with multiple labels (e.g., different applications).
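For reference, a sketch of how such label-limited Non-IID partitions can be constructed (the helper below is our own illustration, not necessarily the exact construction used for the TC settings): each client is assigned a fixed number of labels and only receives samples whose labels fall in its assigned set.

```python
import random
from collections import defaultdict

def partition_by_label(samples, labels, num_clients, labels_per_client, seed=0):
    """Assign each client `labels_per_client` labels and route samples accordingly."""
    rng = random.Random(seed)
    all_labels = sorted(set(labels))
    client_labels = [rng.sample(all_labels, labels_per_client) for _ in range(num_clients)]
    owners = {lb: [c for c in range(num_clients) if lb in client_labels[c]] for lb in all_labels}
    partition = defaultdict(list)
    for sample, lb in zip(samples, labels):
        if owners[lb]:                                   # skip labels no client owns
            partition[rng.choice(owners[lb])].append(sample)
    return partition
```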
E. Component-wise Analysis
In this subsection, we evaluate the effectiveness of each of the three components of ODE: on-device data selection, the global gradient estimator and the cross-client coordination policy. The results altogether show that each component is critical to the good performance of ODE.
On-Client Data Selection. To show the effect of on-client data selection, we compare ODE with another baseline, namely Valuation-, which replaces the on-device data selection policy with random sampling but still allows the central server to coordinate the clients. The result in Figure 12 demonstrates that, without the on-device data valuation module, Valuation- performs only slightly better than RS and much worse than ODE-Est, confirming the significant role of the proposed data valuation metric and data selection policy.
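For concreteness, the random-sampling policy that Valuation- falls back to can be realized with classic reservoir sampling [35], which keeps a uniformly random subset of the stream and ignores sample values entirely (a minimal sketch with our own naming):

```python
import random

def reservoir_sample(stream, capacity, seed=0):
    """Keep a uniform random subset of a stream with a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for i, sample in enumerate(stream):
        if i < capacity:
            buffer.append(sample)
        else:
            j = rng.randint(0, i)        # each kept slot is replaced with prob. capacity/(i+1)
            if j < capacity:
                buffer[j] = sample
    return buffer
```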
Fig. 12. Component-wise analysis of ODE on industrial traffic classification
dataset.
Global Gradient Estimator. To show the effectiveness of the global gradient estimator, we compare ODE-Est with a naive estimation method, namely Estimator-, in which (1) the server utilizes the local gradient estimators of only the participating clients instead of all clients, and (2) each participant leverages only the samples stored locally to calculate its local gradient instead of utilizing all the generated data samples. Figure 12 shows that the naive estimation method has the poorest performance, as relying on a subset of clients and on biased local gradients leads to a highly inaccurate estimate of the global gradient, which further misleads the clients into selecting data samples with an incorrect data valuation metric.
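The gap between the two estimators can be illustrated with toy numbers (the sketch below is purely illustrative and uses synthetic gradients, not the estimator actually maintained by ODE): averaging only the participants' buffer-biased gradients is far noisier than additionally reusing the server's slightly stale estimators for all clients.

```python
import numpy as np

rng = np.random.default_rng(0)
num_clients, dim = 100, 8
local_grads = rng.normal(size=(num_clients, dim))             # true per-client gradients
global_grad = local_grads.mean(axis=0)

participants = rng.choice(num_clients, size=10, replace=False)
server_estimators = local_grads + rng.normal(scale=0.2, size=(num_clients, dim))   # stale estimators
server_estimators[participants] = local_grads[participants] + rng.normal(scale=0.5, size=(10, dim))  # refreshed from limited buffers

naive_estimate = server_estimators[participants].mean(axis=0)  # participants only (Estimator-)
full_estimate = server_estimators.mean(axis=0)                  # all clients

print("participants-only error:", np.linalg.norm(naive_estimate - global_grad))
print("all-clients error:", np.linalg.norm(full_estimate - global_grad))
```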
Cross-Client Coordination. We conduct another experiment, namely Coor-, in which clients select and store data samples based only on their local information, without the coordination of the server. Figure 12 shows that in this situation the performance of ODE is largely weakened, as the clients tend to store similar and overlapping valuable data samples, while the remaining data become under-represented.
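As a rough illustration of why coordination matters (the rule below is a deliberately simple stand-in, not ODE's actual coordination policy): if the server assigns each client a distinct subset of labels to prioritize, clients stop competing to store the same globally valuable samples and the collective storage covers the whole label space.

```python
from collections import defaultdict

def assign_label_responsibilities(client_ids, labels):
    """Round-robin assignment of labels to clients so stored data do not overlap."""
    assignment = defaultdict(list)
    for i, label in enumerate(sorted(labels)):
        assignment[client_ids[i % len(client_ids)]].append(label)
    return dict(assignment)

print(assign_label_responsibilities(["client-1", "client-2", "client-3"], range(10)))
```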
V. RELATED WORKS
Federated Learning is a distributed learning framework
that aims to collaboratively learn a global model over the
networked devices’ data under the constraint that the data
is stored and processed locally [21], [16]. Existing works
mostly focus on how to improve FL from the perspectives
of communication and computation costs.
Previous works on communication cost reduction mainly leveraged model quantization [65], sparsification [66], [67] and pruning [14] to decrease the number of parameters transmitted between clients and the server, or considered hierarchical federated learning to optimize the network structure [68], [69], [70] and improve the communication efficiency.
The existing literature improved the computation efficiency of FL [71] through algorithms at different levels, such as improving the algorithms for local model update [72], [36], [73] and global model aggregation [45], [44], [74], [75], [38], and optimizing client selection algorithms to mitigate the heterogeneity among clients [76], [77], [78].
However, most of them assume a static training dataset on each device and overlook the device properties of limited on-device storage and streaming networked data, which are common for devices in practical mobile networks and will have a severe impact on the success of FL, as demonstrated in our experiments.

Data Selection. In FL, selecting data from streaming data can be seen as sampling batches of data from the data distribution, which is similar to mini-batch SGD [79]. To improve the training process of SGD, existing methods quantify the importance of each data sample (such as loss [61], [62], gradient norm [63], [29], uncertainty [80], [81], data Shapley value [28] and representativeness [82], [83]) and leverage importance sampling or a priority queue to select training samples. Previous works [25], [26] on data selection in FL simply applied the above data selection methods on each client individually for local model training, without considering the impact on the global model. Further, all of them require either access to all the data or multiple inspections over the streaming data, which is not feasible in mobile network scenarios.

Traffic Classification is a significant task in mobile networks, which associates traffic packets with specific applications for downstream network management tasks, such as user experience optimization, capacity planning and resource provisioning [84]. The techniques for traffic classification can be classified into three categories. Port-based approaches [85] utilize the association of ports in the TCP or UDP header with the well-known port numbers assigned by the IANA. Payload-based approaches [86] identify applications by inspecting the packet headers or payload, which incurs very high computational cost and cannot handle encrypted traffic [87]. Statistics-based approaches leverage payload-independent features such as packet length, inter-arrival time and flow direction to circumvent the problems of payload encryption and user privacy.

VI. CONCLUSION

In this work, we identify two key properties of networked FL in mobile networks: limited on-device storage and streaming networked data, which have not been fully explored in the literature. We then present the design, implementation and evaluation of ODE, an online data selection framework for FL with limited on-device storage, consisting of two components: on-device data evaluation and cross-device collaborative data storage. Our analysis shows that ODE improves the convergence rate and the inference accuracy of the global model simultaneously. Empirical results on three public datasets and one industrial dataset demonstrate that ODE significantly outperforms state-of-the-art data selection methods in terms of training time, final accuracy and robustness to various factors in industrial environments. ODE is a significant step towards applying FL to network scenarios with limited on-device storage.
REFERENCES
[1] M. Giordani, M. Polese, M. Mezzavilla, S. Rangan, and M. Zorzi, “Toward 6g networks: Use cases and technologies,” IEEE Communications
Magazine, vol. 58, no. 3, pp. 55–61, 2020.
[2] D. Bega, M. Gramaglia, M. Fiore, A. Banchs, and X. Costa-Perez,
“Deepcog: Optimizing resource provisioning in network slicing with
ai-based capacity forecasting,” IEEE Journal on Selected Areas in
Communications, vol. 38, no. 2, pp. 361–376, 2019.
[3] M. H. Bhuyan, D. K. Bhattacharyya, and J. K. Kalita, “Network anomaly
detection: methods, systems and tools,” IEEE Communications Surveys
& Tutorials, vol. 16, no. 1, pp. 303–336, 2013.
[4] R. L. Cruz, “Quality of service guarantees in virtual circuit switched
networks,” IEEE Journal on Selected areas in Communications, vol. 13,
no. 6, pp. 1048–1056, 1995.
[5] S. S. Kolahi, S. Narayan, D. D. Nguyen, and Y. Sunarto, “Performance
monitoring of various network traffic generators,” in International Conference on Computer Modelling and Simulation (UkSim), 2011.
[6] D. S. Nunes, P. Zhang, and J. S. Silva, “A survey on human-in-the-loop
applications towards an internet of all,” IEEE Communications Surveys
& Tutorials, vol. 17, no. 2, pp. 944–965, 2015.
[7] M. A. Siddiqi, H. Yu, and J. Joung, “5g ultra-reliable low-latency
communication implementation challenges and operational issues with
iot devices,” Electronics, vol. 8, no. 9, p. 981, 2019.
[8] P. K. Aggarwal, P. Jain, J. Mehta, R. Garg, K. Makar, and P. Chaudhary,
“Machine learning, data mining, and big data analytics for 5g-enabled
iot,” in Blockchain for 5G-Enabled IoT, 2021, pp. 351–375.
[9] A. Banchs, M. Fiore, A. Garcia-Saavedra, and M. Gramaglia, “Network
intelligence in 6g: challenges and opportunities,” in ACM Workshop on
Mobility in the Evolving Internet Architecture (MobiArch), 2021.
[10] D. Rafique and L. Velasco, “Machine learning for network automation:
overview, architecture, and applications,” Journal of Optical Communications and Networking, vol. 10, no. 10, pp. D126–D143, 2018.
[11] S. Ayoubi, N. Limam, M. A. Salahuddin, N. Shahriar, R. Boutaba,
F. Estrada-Solano, and O. M. Caicedo, “Machine learning for cognitive
network management,” IEEE Communications Magazine, vol. 56, no. 1,
pp. 158–165, 2018.
[12] Y. Deng, “Deep learning on mobile devices: a review,” in Mobile
Multimedia/Image Processing, Security, and Applications, 2019.
[13] X. Zeng, M. Yan, and M. Zhang, “Mercury: Efficient on-device distributed dnn training via stochastic importance sampling,” in ACM
Conference on Embedded Networked Sensor Systems (Sensys), 2021.
[14] A. Li, J. Sun, P. Li, Y. Pu, H. Li, and Y. Chen, “Hermes: an efficient federated learning framework for heterogeneous mobile clients,”
in International Conference On Mobile Computing And Networking
(MobiCom), 2021.
[15] F. Mireshghallah, M. Taram, A. Jalali, A. T. T. Elthakeb, D. Tullsen,
and H. Esmaeilzadeh, “Not all features are equal: Discovering essential
features for preserving prediction privacy,” in The ACM Web Conference
(WWW), 2021.
[16] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas,
“Communication-efficient learning of deep networks from decentralized
data,” in International Conference on Artificial Intelligence and Statistics
(AISTATS), 2017.
[17] UCSC, “Packet buffers,” 2020. [Online]. Available: https://people.ucsc.
edu/∼warner/buffer.html
[18] G. Fox, “60 incredible facebook statistics,” https://www.garyfox.co/
facebook-statistics/, 2018.
[19] A. Cloud, "Alibaba cloud supported 583,000 orders/second for 2020 double 11 - the highest traffic peak in the world!" https://www.alibabacloud.com/blog/alibaba-cloud-supported-583000-orderssecond-for-2020-double-11---the-highest-traffic-peak-in-the-world_596884, 2020.
[20] S. P. Karimireddy, S. Kale, M. Mohri, S. Reddi, S. Stich, and A. T.
Suresh, “Scaffold: Stochastic controlled averaging for federated learning,” in International Conference on Machine Learning (ICML), 2020.
[21] T. Li, A. K. Sahu, A. Talwalkar, and V. Smith, “Federated learning:
Challenges, methods, and future directions,” IEEE Signal Processing
Magazine, vol. 37, no. 3, pp. 50–60, 2020.
[22] I. Akbari, M. A. Salahuddin, L. Ven, N. Limam, R. Boutaba, B. Mathieu,
S. Moteau, and S. Tuffin, “A look behind the curtain: traffic classification
in an increasingly encrypted web,” Proceedings of the ACM on Measurement and Analysis of Computing Systems, vol. 5, no. 1, pp. 1–26,
2021.
[23] H. Tahaei, F. Afifi, A. Asemi, F. Zaki, and N. B. Anuar, “The rise of
traffic classification in iot networks: A survey,” Journal of Network and
Computer Applications, vol. 154, p. 102538, 2020.
[24] D. Arora, K. F. Li, and A. Loffler, “Big data analytics for classification
of network enabled devices,” in International Conference on Advanced
Information Networking and Applications Workshops (WAINA), 2016.
[25] J. Shin, Y. Li, Y. Liu, and S. Lee, “Fedbalancer: data and pace control
for efficient federated learning on heterogeneous clients,” in ACM
International Conference on Mobile Systems, Applications, and Services
(MobiSys), 2022, pp. 436–449.
[26] A. Li, L. Zhang, J. Tan, Y. Qin, J. Wang, and X.-Y. Li, “Sample-level
data selection for federated learning,” in IEEE International Conference
on Computer Communications (INFOCOM), 2021.
[27] R. D. Cook, “Detection of influential observation in linear regression,”
Technometrics, vol. 19, no. 1, pp. 15–18, 1977.
[28] A. Ghorbani and J. Zou, “Data shapley: Equitable valuation of data for
machine learning,” in International Conference on Machine Learning
(ICML), 2019.
[29] P. Zhao and T. Zhang, “Stochastic optimization with importance sampling for regularized loss minimization,” in International Conference on
Machine Learning (ICML), 2015.
[30] N. T. Hu, X. Hu, R. Liu, S. Hooker, and J. Yosinski, “When does lossbased prioritization fail?” ICML 2021 workshop on Subset Selection in
ML, 2021.
[31] S. Caldas, S. M. K. Duddu, P. Wu, T. Li, J. Konečnỳ, H. B. McMahan,
V. Smith, and A. Talwalkar, “Leaf: A benchmark for federated settings,”
arXiv preprint arXiv:1812.01097, 2018.
[32] H. Xiao, K. Rasul, and R. Vollgraf. (2017) Fashion-mnist: a novel image
dataset for benchmarking machine learning algorithms.
[33] X. Ouyang, Z. Xie, J. Zhou, J. Huang, and G. Xing, “Clusterfl: a
similarity-aware federated learning system for human activity recognition,” in ACM International Conference on Mobile Systems, Applications, and Services (MobiSys), 2021.
[34] C. Li, D. Niu, B. Jiang, X. Zuo, and J. Yang, “Meta-har: Federated
representation learning for human activity recognition,” in The ACM
Web Conference (WWW), 2021.
[35] J. S. Vitter, “Random sampling with a reservoir,” ACM Transactions on
Mathematical Software, vol. 11, no. 1, pp. 37–57, 1985.
[36] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith,
“Federated optimization in heterogeneous networks,” in Proceedings of
Machine Learning and Systems (MLSys), 2018.
[37] J. Hamer, M. Mohri, and A. T. Suresh, “Fedboost: A communicationefficient algorithm for federated learning,” in International Conference
on Machine Learning (ICML), 2020.
[38] J. Wang, Q. Liu, H. Liang, G. Joshi, and H. V. Poor, “Tackling the objective inconsistency problem in heterogeneous federated optimization,”
in Advances in neural information processing systems, 2020.
[39] H. T. Nguyen, V. Sehwag, S. Hosseinalipour, C. G. Brinton, M. Chiang,
and H. V. Poor, “Fast-convergent federated learning,” IEEE Journal on
Selected Areas in Communications, vol. 39, no. 1, pp. 201–218, 2020.
[40] Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra, “Federated
learning with non-iid data,” arXiv preprint arXiv:1806.00582, 2018.
[41] R. Branzei, D. Dimitrov, and S. Tijs, Models in cooperative game theory.
Springer Science & Business Media, 2008, vol. 556.
[42] P. W. Koh and P. Liang, “Understanding black-box predictions via
influence functions,” in International Conference on Machine Learning
(ICML), 2017.
[43] L. Shapley, "Quota solutions of n-person games," edited by Emil Artin and Marston Morse, p. 343, 1953.
[44] S. Zawad, A. Ali, P.-Y. Chen, A. Anwar, Y. Zhou, N. Baracaldo, Y. Tian,
and F. Yan, “Curse or redemption? how data heterogeneity affects the
robustness of federated learning,” in The AAAI Conference on Artificial
Intelligence (AAAI), 2021.
[45] Z. Chai, H. Fayyaz, Z. Fayyaz, A. Anwar, Y. Zhou, N. Baracaldo,
H. Ludwig, and Y. Cheng, "Towards taming the resource and data heterogeneity in federated learning," in USENIX Conference on Operational
Machine Learning (OpML), 2019.
[46] C. Yang, Q. Wang, M. Xu, Z. Chen, K. Bian, Y. Liu, and X. Liu,
“Characterizing impacts of heterogeneity in federated learning upon
large-scale smartphone data,” in The ACM Web Conference (WWW),
2021.
[47] D. Jhunjhunwala, P. SHARMA, A. Nagarkatti, and G. Joshi, “Fedvarp:
Tackling the variance due to partial client participation in federated
learning,” in Conference on Uncertainty in Artificial Intelligence (UAI),
2022.
[48] K. Wei, J. Li, M. Ding, C. Ma, H. H. Yang, F. Farokhi, S. Jin, T. Q. Quek,
and H. V. Poor, “Federated learning with differential privacy: Algorithms
and performance analysis,” IEEE Transactions on Information Forensics
and Security, vol. 15, pp. 3454–3469, 2020.
[49] C. Dwork, “Differential privacy: A survey of results,” in International
Conference on Theory and Applications of Models of Computation
(ICTAMC), 2008.
[50] K. Krishna and M. N. Murty, “Genetic k-means algorithm,” IEEE
Transactions on Systems, Man, and Cybernetics, vol. 29, no. 3, pp. 433–
439, 1999.
[51] F. Murtagh and P. Contreras, “Algorithms for hierarchical clustering: an
overview,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge
Discovery, vol. 2, no. 1, pp. 86–97, 2012.
[52] B. K. Natarajan, “Sparse approximate solutions to linear systems,” SIAM
Journal on Computing, vol. 24, no. 2, pp. 227–234, 1995.
[53] D. Ge, X. Jiang, and Y. Ye, “A note on the complexity of l p
minimization,” Mathematical Programming, vol. 129, no. 2, pp. 285–
299, 2011.
[54] M. Schmidt, G. Fung, and R. Rosales, “Fast optimization methods for
l1 regularization: A comparative study and two new approaches,” in
European Conference on Machine Learning (ECML), 2007.
[55] J. Shi, W. Yin, S. Osher, and P. Sajda, “A fast hybrid algorithm for
large-scale l1-regularized logistic regression,” The Journal of Machine
Learning Research, vol. 11, pp. 713–741, 2010.
[56] A. Acar, H. Aksu, A. S. Uluagac, and M. Conti, “A survey on
homomorphic encryption schemes: Theory and implementation,” ACM
Computing Surveys, vol. 51, no. 4, pp. 1–35, 2018.
[57] S. Hong, S. Kim, J. Choi, Y. Lee, and J. H. Cheon, “Efficient sorting
of homomorphic encrypted data with k-way sorting network,” IEEE
Transactions on Information Forensics and Security, vol. 16, pp. 4389–
4404, 2021.
[58] Y. LeCun, B. Boser, J. Denker, D. Henderson, R. Howard, W. Hubbard,
and L. Jackel, “Handwritten digit recognition with a back-propagation
network,” in Conference on Neural Information Processing Systems
(NeurIPS), vol. 2, 1989.
[59] C. Li, X. Zeng, M. Zhang, and Z. Cao, “Pyramidfl: A fine-grained client
selection framework for efficient federated learning,” in International
Conference On Mobile Computing And Networking (MobiCom), 2022.
[60] I. Loshchilov and F. Hutter, “Online batch selection for faster training
of neural networks,” arXiv preprint arXiv:1511.06343, 2015.
[61] A. Shrivastava, A. Gupta, and R. Girshick, “Training region-based
object detectors with online hard example mining,” in The IEEE / CVF
Computer Vision and Pattern Recognition Conference (CVPR), 2016.
[62] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized experience replay,” in International Conference on Learning Representations
(ICLR), 2015.
[63] T. B. Johnson and C. Guestrin, “Training deep models faster with
robust, approximate importance sampling,” in Conference on Neural
Information Processing Systems (NeurIPS), 2018.
[64] A. Katharopoulos and F. Fleuret, “Not all samples are created equal:
Deep learning with importance sampling,” in International conference
on machine learning (ICML), 2018.
[65] F. Haddadpour, M. M. Kamani, A. Mokhtari, and M. Mahdavi, “Federated learning with compression: Unified analysis and sharp guarantees,”
in The 24th International Conference on Artificial Intelligence and
Statistics, (AISTATS), ser. Proceedings of Machine Learning Research,
2021, pp. 2350–2358.
[66] A. Li, J. Sun, X. Zeng, M. Zhang, H. Li, and Y. Chen, “Fedmask:
Joint computation and communication-efficient personalized federated
learning via heterogeneous masking,” in ACM Conference on Embedded
Networked Sensor (SenSys), 2021, pp. 42–55.
[67] P. Han, S. Wang, and K. K. Leung, “Adaptive gradient sparsification
for efficient federated learning: An online learning approach,” in IEEE
International Conference on Distributed Computing Systems (ICDCS),
2020, pp. 300–310.
[68] J. Wu, Q. Liu, Z. Huang, Y. Ning, H. Wang, E. Chen, J. Yi, and B. Zhou,
“Hierarchical personalized federated learning for user modeling,” in The
ACM Web Conference (WWW), 2021.
[69] S. Hosseinalipour, S. S. Azam, C. G. Brinton, N. Michelusi, V. Aggarwal, D. J. Love, and H. Dai, “Multi-stage hybrid federated learning
over large-scale d2d-enabled fog networks,” IEEE/ACM Transactions on
Networking, vol. 30, no. 4, pp. 1569–1584, 2022.
[70] X. Liu, Z. Zhong, Y. Zhou, D. Wu, X. Chen, M. Chen, and Q. Z. Sheng, "Accelerating federated learning via parallel servers: A theoretically guaranteed approach," IEEE/ACM Transactions on Networking, vol. 30, no. 5, pp. 2201–2215, 2022.
[71] Y. Zhang, X. Lan, J. Ren, and L. Cai, "Efficient computing resource sharing for mobile edge-cloud computing networks," IEEE/ACM Transactions on Networking, vol. 28, no. 3, pp. 1227–1240, 2020.
[72] C. T. Dinh, N. H. Tran, M. N. H. Nguyen, C. S. Hong, W. Bao, A. Y. Zomaya, and V. Gramoli, "Federated learning over wireless networks: Convergence analysis and resource allocation," IEEE/ACM Transactions on Networking, vol. 29, no. 1, pp. 398–409, 2021.
[73] A. Fallah, A. Mokhtari, and A. Ozdaglar, "Personalized federated learning with theoretical guarantees: A model-agnostic meta-learning approach," 2020.
[74] Y. J. Cho, A. Manoel, G. Joshi, R. Sim, and D. Dimitriadis, "Heterogeneous ensemble knowledge transfer for training large models in federated learning," in International Joint Conference on Artificial Intelligence (IJCAI), 2022.
[75] J. Wolfrath, N. Sreekumar, D. Kumar, Y. Wang, and A. Chandra, "Haccs: Heterogeneity-aware clustered client selection for accelerated federated learning," in IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2022.
[76] Y. J. Cho, J. Wang, and G. Joshi, "Towards understanding biased client selection in federated learning," in International Conference on Artificial Intelligence and Statistics (AISTATS), 2022.
[77] T. Nishio and R. Yonetani, "Client selection for federated learning with heterogeneous resources in mobile edge," in IEEE International Conference on Communications (ICC), 2019.
[78] F. Li, J. Liu, and B. Ji, "Federated learning with fair worker selection: A multi-round submodular maximization approach," in IEEE International Conference on Mobile Ad-Hoc and Smart Systems (MASS), 2021, pp. 180–188.
[79] B. E. Woodworth, K. K. Patel, and N. Srebro, "Minibatch vs local SGD for heterogeneous distributed learning," in Advances in Neural Information Processing Systems (NeurIPS), 2020.
[80] H.-S. Chang, E. Learned-Miller, and A. McCallum, "Active bias: Training more accurate neural networks by emphasizing high variance samples," in Conference on Neural Information Processing Systems (NeurIPS), 2017.
[81] C.-Y. Wu, R. Manmatha, A. J. Smola, and P. Krahenbuhl, "Sampling matters in deep embedding learning," in The IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), 2017.
[82] B. Mirzasoleiman, J. Bilmes, and J. Leskovec, "Coresets for data-efficient training of machine learning models," in International Conference on Machine Learning (ICML), 2020.
[83] Y. Wang, F. Fabbri, and M. Mathioudakis, "Fair and representative subset selection from data streams," in The ACM Web Conference (WWW), 2021.
[84] I. Akbari, M. A. Salahuddin, L. Ven, N. Limam, R. Boutaba, B. Mathieu, S. Moteau, and S. Tuffin, "A look behind the curtain: Traffic classification in an increasingly encrypted web," in Proceedings of the ACM on Measurement and Analysis of Computing Systems (SIGMETRICS), 2021, pp. 04:1–04:26.
[85] T. Karagiannis, A. Broido, M. Faloutsos, and K. Claffy, "Transport layer identification of p2p traffic," in ACM Internet Measurement Conference (IMC), 2004.
[86] Y. Wang, Y. Xiang, and S.-Z. Yu, "Automatic application signature construction from unknown traffic," in IEEE International Conference on Advanced Information Networking and Applications (AINA), 2010.
[87] M. Finsterbusch, C. Richter, E. Rocha, J.-A. Muller, and K. Hanssgen, "A survey of payload-based traffic classification approaches," IEEE Communications Surveys & Tutorials, vol. 16, no. 2, pp. 1135–1156, 2013.