ODE: An Online Data Selection Framework for Federated Learning with Limited Storage
Chen Gong1, Zhenzhe Zheng1, Yunfeng Shao2, Bingshuai Li2, Fan Wu1, Guihai Chen1
1 Shanghai Jiao Tong University, Shanghai, China
2 Huawei Noah's Ark Lab, Beijing, China
{gongchen, zhengzhenzhe}@sjtu.edu.cn, {shaoyunfeng, libingshuai}@huawei.com, {fwu, gchen}@cs.sjtu.edu.cn

Abstract—Machine learning models have been deployed in mobile networks to deal with massive data from different layers, enabling automated network management and on-device intelligence. To overcome the high communication cost and severe privacy concerns of centralized machine learning, federated learning (FL) has been proposed to achieve distributed machine learning among networked devices. While computation and communication limitations have been widely studied, the impact of on-device storage on the performance of FL remains unexplored. Without an effective data selection policy to filter the massive streaming networked data on devices, classical FL can suffer from much longer model training time (4×) and significant inference accuracy reduction (7%), as observed in our experiments. In this work, we take the first step toward online data selection for FL with limited on-device storage. We first define a new data valuation metric for data evaluation and selection in FL, with theoretical guarantees for speeding up model convergence and enhancing final model accuracy simultaneously. We further design ODE, an Online Data sElection framework for FL, to coordinate networked devices to store valuable data samples collaboratively. Experimental results on one industrial dataset and three public datasets show the remarkable advantages of ODE over state-of-the-art approaches. In particular, on the industrial dataset, ODE achieves as high as 2.5× speedup in training time and a 6% increase in inference accuracy, and is robust to various factors in practical environments, including the number of local training epochs, client participation rate, device storage capacity, mini-batch size and data heterogeneity among clients.

Index Terms—Federated Learning, Limited On-Device Storage, Online Data Selection

I. INTRODUCTION

The next-generation mobile computing systems require effective and efficient management of mobile networks and devices in various aspects, including resource provisioning [1], [2], security and intrusion detection [3], quality of service guarantee [4], and performance monitoring [5]. Analyzing and controlling such an increasingly complex mobile network with traditional human-in-the-loop approaches [6] is no longer possible, due to low-latency requirements [7], massive real-time data and complicated correlations among data [8]. For example, in network traffic analysis, a fundamental task in mobile networks, devices such as routers and ONTs (Optical Network Terminals) can receive/send as many as 5000 packets (≈ 5MB) per second. It is impractical to manually analyze such a huge quantity of high-dimensional data within milliseconds. Thus, machine learning models have been widely applied to discover the patterns behind high-dimensional networked data, enable data-driven network control, and fully automate mobile network operation [9], [10], [11].
Although ML models overcome the limitations of human-in-the-loop approaches, their good performance relies heavily on a large amount of high-quality training data [12], which is hard to obtain in mobile networks, as the data resides on heterogeneous devices in a distributed manner. On the one hand, an on-device ML model trained locally with limited data and computational resources is unlikely to achieve desirable inference accuracy and generalization ability [13]. On the other hand, directly transmitting data from distributed networked devices to a cloud server for centralized learning (CL) would incur prohibitively high communication cost and severe privacy concerns [14], [15]. Recently, federated learning (FL) [16] has emerged as a distributed privacy-preserving ML paradigm to resolve the above concerns: networked devices upload local model updates instead of raw data, and a central server aggregates these local models into a global model, addressing the on-device data limitation, the communication bottleneck and the potential privacy leakage.

Motivation and New Problem. For applying FL to mobile networks, we identify two unique properties of networked devices that have not been fully considered in the previous FL literature: limited on-device storage and streaming networked data. (1) Limited on-device storage: due to hardware constraints, mobile devices have restricted storage volume for each mobile application, and can reserve only a small space to store training data samples without compromising the quality of other services. For example, most smart home routers have only 9–32MB of storage to support various kinds of services [17], such as buffering packets awaiting transmission or storing configuration information for routing, and thus only tens of training data samples can be stored in this scenario. (2) Streaming networked data: data samples are continuously generated or received by mobile devices in a streaming manner, and we need to make online decisions on whether to store each arriving data sample. For instance, each router can receive over 5,000 packets per minute, 14.58 million photos were uploaded to Facebook by mobile users per hour in 2018 [18], and Alibaba reached a peak of 583,000 sales transactions per hour in 2020 [19]. Under these circumstances, the training data stored locally changes over time. Without a carefully designed data selection policy to maintain the data samples in storage, the empirical distribution of the stored data can deviate from the true data distribution, which further aggravates the notorious problem of non-independent and identically distributed (Non-IID) data in FL [20], [21]. Specifically, a naive random selection policy significantly degrades the performance of classic FL algorithms in both model training and inference, with more than 4× longer training time and 7% accuracy reduction, as observed in our experiments on an industrial network traffic classification dataset, shown in Figure 1 (a detailed discussion is presented in Section IV-B).

Fig. 1. To investigate the impact of on-device storage on FL model training, we conduct experiments on an industrial traffic classification dataset with 30 mobile devices and 35,000+ data samples under different storage capacities. (a) Convergence Time. (b) Inference Accuracy.

This is unacceptable in modern mobile networks, because the longer training time reduces the timeliness and effectiveness of ML models in dynamic environments, and the accuracy reduction leads to failures in guaranteeing the quality of service [4] and incurs extra operational expenses [22] as well as security breaches [23], [24]. Therefore, a fundamental problem when applying FL to mobile networks is how to filter valuable data samples from on-device streaming networked data so as to simultaneously accelerate model training convergence and enhance the inference accuracy of the final global model.

Design Challenges. The design of such an online data selection framework for FL involves three key challenges:

(1) There is still no theoretical understanding of the impact of local on-device data on the training speedup and accuracy enhancement of the global model in FL. Lacking information about the raw data and local models of other devices, it is challenging for one device to individually figure out the impact of its local data samples on the performance of the global model. Furthermore, the sample-level correlation between convergence rate and model accuracy is still unexplored in FL, and it is non-trivial to improve both aspects simultaneously through one unified data valuation metric.

(2) The lack of temporal and spatial information complicates online data selection in FL. For streaming networked data, we cannot access the data samples arriving in the future or discarded in the past. Lacking such temporal information, a device is unable to leverage complete statistical information (e.g., the unbiased local data distribution) for accurate data quality evaluation¹, such as outlier and noise detection [25], [26]. Additionally, due to the distributed paradigm of FL, one device cannot conduct effective data selection without knowledge of the other devices' stored data and local models, which we call spatial information. This is because the valuable data samples selected locally could overlap with each other, and locally valuable data may not be globally valuable.

(3) On-device data selection needs to incur low computation and memory cost, due to the conflict between limited hardware resources and the required quality of user experience. As the additional time delay and memory cost introduced by the online data selection process would degrade the performance of the mobile network and the user experience, real-time data samples must be evaluated in a computation- and memory-efficient way. However, increasingly complex ML models with huge numbers of parameters lead to high computation complexity as well as a large memory footprint for storing intermediate model outputs during data selection.

¹ In this work, we use data quality and data value interchangeably, as both reflect the sample-level contribution to model training in FL.

Limitations of Related Works. Prior works on data evaluation and selection in ML fail to solve the above challenges. (1) The data selection methods in CL, such as the leave-one-out test [27], Data Shapley [28] and Importance Sampling [15], [29], are not appropriate for FL due to the first challenge: they can only measure the value of each data sample with respect to the local model training process, instead of the global model in FL. (2) Prior works on data selection in FL do not consider the two new properties of FL devices. Mercury [13], FedBalancer [25] and the work of Li et al. [26] adopt the importance sampling framework [29] to select the data samples with high loss or gradient norm for reducing the training time per epoch, but fail to solve the second challenge: these methods all need to inspect the whole dataset for normalized sampling weight computation as well as noise and outlier removal [30], [26].

Our Solutions. To solve the above challenges, we design ODE, an online data selection framework that coordinates networked devices to select and store valuable data samples locally and collaboratively in FL, with theoretical guarantees for accelerating model convergence and enhancing inference accuracy simultaneously. In ODE, we first theoretically analyze the impact of an individual local data sample on the convergence rate and final accuracy of the global model in FL. We discover a common dominant term in these two analytical expressions, i.e., ⟨∇w l(w, x, y), ∇w F(w)⟩ (please refer to Section III-A for detailed notation definitions), which implies that the projection of the gradient of a local data sample onto the global gradient is a reasonable metric for data selection in FL. As this metric is a deterministic value, each device can simply maintain a priority queue to store the valuable data samples. Second, considering the lack of temporal and spatial information, we propose an efficient method for clients to approximate this data selection metric by maintaining a local gradient estimator on each device and a global one on the server. Third, to overcome the potential overlap of stored data caused by data selection in a distributed manner, we further propose a strategy for the server to coordinate the devices to store valuable data from different regions of the data distribution. Finally, to achieve computation and memory efficiency, we propose a simplified version of ODE, which replaces the full model gradient with a partial model gradient, concurrently reducing the computational complexity of model backpropagation and saving the memory footprint for storing extra intermediate outputs of the complex ML model.

System Implementation and Experimental Results. We evaluated ODE on three public tasks, a synthetic task (ST) [31], Image Classification (IC) [32] and Human Activity Recognition (HAR) [33], [34], as well as one industrial mobile traffic classification dataset (TC) collected from our 30-day deployment on 30 ONTs in practice, consisting of 560,000+ packets from 250 mobile applications. We compare ODE against three categories of data selection baselines: random sampling [35], classic data sampling approaches from the CL setting [25], [26], [13], and previous data selection methods for FL [25], [26]. The experimental results show that ODE outperforms all these baselines, achieving as high as 9.52× speedup in model training and 7.56% increase in final model accuracy on ST, 1.35× and 1.4% on IC, 2.22× and 6.38% on HAR, and 2.5× and 6% on TC, with minor additional time delay and memory cost. Moreover, ODE is robust to different environmental factors, including the local epoch number, client participation rate, on-device storage volume, mini-batch size and data heterogeneity across devices. We also conduct ablation experiments to demonstrate the importance of each component of ODE.

Summary of Contributions.
Our contributions in this work can be summarized as follows:

(1) To the best of our knowledge, we are the first to identify two new properties of applying FL in mobile networks, limited on-device storage and streaming networked data, and to demonstrate their significant impact on the effectiveness and efficiency of model training in FL.

(2) We provide analytical formulas for the impact of an individual local data sample on the convergence rate and the final inference accuracy of the global model, based on which we propose a new data valuation metric for data selection in FL with theoretical guarantees for accelerating model convergence and improving inference accuracy simultaneously. Furthermore, we propose ODE, an online data selection framework for FL, to realize on-device data selection and cross-device collaborative data storage.

(3) We conduct extensive experiments on three public datasets and one industrial traffic classification dataset to demonstrate the remarkable advantages of ODE over existing methods in terms of the convergence rate and inference accuracy of the final global FL model.

II. PRELIMINARIES

In this section, we present the learning model and training process of FL. We consider the synchronous FL framework [16], [36], [20], where a server coordinates a set of mobile devices/clients² $C$ to conduct distributed model training. Each client $c \in C$ generates data samples in a streaming manner with a velocity $v_c$. We use $\mathcal{P}_c$ to denote client $c$'s underlying local data distribution, and $\tilde{\mathcal{P}}_c$ to represent the empirical distribution of the data samples $B_c$ stored in her local storage. The goal of FL is to train a global model $w$ from the locally stored data $\tilde{\mathcal{P}} = \bigcup_{c\in C} \tilde{\mathcal{P}}_c$ with good performance with respect to the underlying unbiased data distribution $\mathcal{P} = \bigcup_{c\in C} \mathcal{P}_c$:

\[
\min_{w\in\mathbb{R}^n} F(w) = \sum_{c\in C} \zeta_c \cdot F_c(w),
\]

where $\zeta_c = v_c / \sum_{c'\in C} v_{c'}$ denotes the normalized weight of each client, $n$ is the dimension of the model parameters, and $F_c(w) = \mathbb{E}_{(x,y)\sim\mathcal{P}_c}\, l(w,x,y)$ is the expected loss of the model $w$ over the true data distribution of client $c$. We also use $\tilde{F}_c(w) = \frac{1}{|B_c|}\sum_{(x,y)\in B_c} l(w,x,y)$ to denote the empirical loss of the model over the data samples stored by client $c$.

In this work, we investigate the impact of each client's limited storage on FL, and consider the widely adopted algorithm Fed-Avg [16] for ease of illustration³. Under the synchronous FL framework, the global model is trained by repeating the following two steps in each communication round $t \in [1, T]$:

(1) Local Training: In round $t$, the server selects a client subset $C_t \subseteq C$ to participate in the model training process. Each participating client $c \in C_t$ downloads the current global model $w_{fed}^{t-1}$ (the ending global model from the last round), and performs model updates with the locally stored data for $m$ epochs:

\[
w_c^{t,i} \leftarrow w_c^{t,i-1} - \eta \nabla_w \tilde{F}_c(w_c^{t,i-1}), \quad i = 1, \cdots, m, \tag{1}
\]

where the starting local model $w_c^{t,0}$ is initialized as $w_{fed}^{t-1}$, and $\eta$ denotes the learning rate.

(2) Model Aggregation: Each participating client $c \in C_t$ uploads the updated local model $w_c^{t,m}$, and the server aggregates these models into a new global model $w_{fed}^t$ by taking a weighted average:

\[
w_{fed}^t \leftarrow \sum_{c\in C_t} \zeta_c^t \cdot w_c^{t,m}, \tag{2}
\]

where $\zeta_c^t = v_c / \sum_{c'\in C_t} v_{c'}$ is the normalized weight of client $c$ in the current round $t$.

In the scenario of FL with limited on-device storage and streaming networked data, clients perform an additional data selection step:

(3) Data Selection: In each round $t$, once receiving a new data sample, the client has to make an online decision on whether to store the new sample (in place of an old one if the storage is fully occupied) or discard it. The goal of this data selection process is to select valuable samples from the streaming networked data for model training in the coming rounds.

We list the frequently used notations in Table I.

² We will use mobile devices and clients interchangeably in this work.
³ Our results for limited on-device storage can be extended to other FL algorithms, such as FedBoost [37], FedNova [38], FedProx [36].
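To make the round structure concrete, the following minimal Python sketch (our own illustration under the notation above, not the authors' released code; `sample_gradient` is a hypothetical stand-in for $\nabla_w l(w,x,y)$) shows how the data selection step (3) interleaves with local training (1) and aggregation (2):

```python
import numpy as np

def sample_gradient(w, x, y):
    # Stand-in per-sample gradient for a squared loss l(w,x,y) = 0.5*(w@x - y)^2;
    # in ODE this would be the gradient of the actual on-device model.
    return (w @ x - y) * x

def local_training(w_global, stored_data, eta=0.01, epochs=5):
    """Step (1): m epochs of full-batch gradient descent on the stored samples."""
    w = w_global.copy()
    for _ in range(epochs):
        grad = np.mean([sample_gradient(w, x, y) for x, y in stored_data], axis=0)
        w = w - eta * grad
    return w

def aggregate(local_models, velocities):
    """Step (2): velocity-weighted average of the participants' local models."""
    z = np.asarray(velocities, dtype=float)
    z /= z.sum()
    return sum(zc * wc for zc, wc in zip(z, local_models))

def select_online(storage, new_sample, capacity, value_fn):
    """Step (3): keep the new sample only if storage has room or it beats
    the least valuable stored sample."""
    if len(storage) < capacity:
        storage.append(new_sample)
        return
    worst = min(range(len(storage)), key=lambda i: value_fn(storage[i]))
    if value_fn(new_sample) > value_fn(storage[worst]):
        storage[worst] = new_sample
```

The choice of `value_fn` is exactly what the rest of the paper develops: Section III-A derives a principled valuation metric, and Section III-B shows how to approximate it on-device.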
TABLE I
FREQUENTLY USED NOTATIONS.

| Notation | Remark |
|---|---|
| $C$, $C_t$, $c$ | Set of all clients, set of participants in round $t$, one specific client. |
| $F(w)$ | Loss of model $w$ over the underlying global data distribution. |
| $F_c(w)$, $\tilde{F}_c(w)$ | Loss of model $w$ over the true data distribution and the locally stored data of client $c$. |
| $l(w,x,y)$ | Loss of model $w$ on a data sample $(x,y)$. |
| $B_c$, $|B_c|$ | Data samples stored by client $c$ and the corresponding data size. |
| $w_c^{t,i}$ | Local model of participant $c$ in round $t$ after $i$ local updates. |
| $w_{fed}^t$, $w_{cen}^t$ | Models in round $t$ trained through federated learning and centralized learning. |
| $v_c$ | Velocity of the data generated by client $c$. |
| $\zeta_c$ | Weight of client $c\in C$ in the global objective function of FL. |
| $\zeta_c^t$ | Weight of client $c\in C_t$ in the model aggregation of round $t$. |
| $\hat{g}_c$, $\hat{g}^t$ | Local gradient estimator of client $c$, global gradient estimator in round $t$. |
| $n_c^{client}$ | Number of labels assigned to client $c$ for data selection and storage. |
| $n_y^{label}$ | Number of clients coordinated to select and store data with label $y$. |

III. DESIGN OF ODE

In this section, we first quantify the impact of a local data sample on the performance of the global FL model in terms of convergence rate and inference accuracy. Based on the common dominant term in the two analytical expressions, we propose a new data valuation metric for data evaluation and selection in FL (Section III-A), and develop a practical method to estimate this metric with low extra computation and communication overhead (Section III-B). We further design a policy for the server to coordinate the cross-client data selection process, avoiding potential overlap among the data selected and stored by different clients (Section III-C). Finally, we summarize the detailed procedure of ODE (Section III-D).

A. Data Valuation Metric

We evaluate the impact of a local data sample on the FL model from the perspectives of convergence rate and inference accuracy, which are two critical aspects for the success of FL. The convergence rate quantifies the reduction of the loss function in each training round, and determines the communication cost of FL. The inference accuracy reflects the effectiveness of an FL model in guaranteeing the quality of service and user experience. For the theoretical analysis, we follow one typical assumption on the FL models, which is widely adopted in the literature [36], [39], [40].

Assumption 1. (Lipschitz Gradient) For each client $c \in C$, the loss function $F_c(w)$ has an $L_c$-Lipschitz gradient, i.e., $\| \nabla_w F_c(w_1) - \nabla_w F_c(w_2) \|_2 \leqslant L_c \| w_1 - w_2 \|_2$, which implies that the global loss function $F(w)$ has an $L$-Lipschitz gradient with $L = \sum_{c\in C} \zeta_c L_c$.

Convergence Rate. We provide a lower bound on the reduction of the loss function of the global model after model aggregation in each communication round.

Theorem 1. (Global Loss Reduction) With Assumption 1, for an arbitrary set of clients $C_t \subseteq C$ selected by the server in round $t$, the reduction of the global loss $F(w)$ is bounded by:

\[
\underbrace{F(w_{fed}^{t-1}) - F(w_{fed}^{t})}_{\text{global loss reduction}} \;\geqslant\; \sum_{c\in C_t}\sum_{i=0}^{m-1}\sum_{(x,y)\in B_c} \Big( -\alpha_c \underbrace{\| \nabla_w l(w_c^{t,i}, x, y) \|_2^2}_{\text{term 1}} + \beta_c \underbrace{\big\langle \nabla_w l(w_c^{t,i}, x, y),\, \nabla_w F(w_{fed}^{t-1}) \big\rangle}_{\text{term 2}} \Big), \tag{3}
\]

where $\alpha_c = \frac{L\zeta_c^t}{2}\cdot\frac{\eta^2}{|B_c|^2}$ and $\beta_c = \zeta_c^t \cdot \frac{\eta}{|B_c|}$.

Proof. From the $L$-Lipschitz gradient property of the global loss function $F(w)$, we have

\[
F(w_{fed}^t) - F(w_{fed}^{t-1}) \;\leqslant\; \big\langle \nabla_w F(w_{fed}^{t-1}),\, w_{fed}^t - w_{fed}^{t-1} \big\rangle + \frac{L}{2}\, \| w_{fed}^t - w_{fed}^{t-1} \|^2. \tag{4}
\]

Using the aggregation formula of Fed-Avg [16], we can decompose $w_{fed}^t$ into the weighted sum of the updated models $w_c^{t,m}$ of the participants $c \in C_t$:

\[
(4) = \Big\langle \nabla_w F(w_{fed}^{t-1}),\, \sum_{c\in C_t} \zeta_c^t \big( w_c^{t,m} - w_{fed}^{t-1} \big) \Big\rangle + \frac{L}{2} \Big\| \sum_{c\in C_t} \zeta_c^t \big( w_c^{t,m} - w_{fed}^{t-1} \big) \Big\|^2
\;\leqslant\; \sum_{c\in C_t} \zeta_c^t \big\langle \nabla_w F(w_{fed}^{t-1}),\, w_c^{t,m} - w_{fed}^{t-1} \big\rangle + \frac{L}{2} \sum_{c\in C_t} \zeta_c^t \,\| w_c^{t,m} - w_{fed}^{t-1} \|^2. \tag{5}
\]

We can further express the locally updated model $w_c^{t,m}$ of each client $c \in C_t$ as the difference between the original global model $w_{fed}^{t-1}$ and $m$ local model updates:

\[
(5) \;\leqslant\; \sum_{c\in C_t} \zeta_c^t \Big[ -\eta \Big\langle \nabla_w F(w_{fed}^{t-1}),\, \sum_{i=0}^{m-1} \nabla_w \tilde{F}_c(w_c^{t,i}) \Big\rangle + \frac{L}{2} \Big\| \sum_{i=0}^{m-1} \eta \nabla_w \tilde{F}_c(w_c^{t,i}) \Big\|^2 \Big], \tag{6}
\]

where the $i$th local model update of each client $c$ can be expressed as the average gradient of the samples stored locally in $B_c$:

\[
(6) = \sum_{c\in C_t} \zeta_c^t \Big[ \frac{L}{2} \Big\| \sum_{i=0}^{m-1} \frac{\eta}{|B_c|} \sum_{(x,y)\in B_c} \nabla_w l(w_c^{t,i}, x, y) \Big\|^2 - \eta \sum_{i=0}^{m-1} \sum_{(x,y)\in B_c} \frac{1}{|B_c|} \big\langle \nabla_w F(w_{fed}^{t-1}),\, \nabla_w l(w_c^{t,i}, x, y) \big\rangle \Big]
\;\leqslant\; \sum_{c\in C_t}\sum_{i=0}^{m-1}\sum_{(x,y)\in B_c} \Big( \alpha_c \| \nabla_w l(w_c^{t,i}, x, y) \|^2 - \beta_c \big\langle \nabla_w F(w_{fed}^{t-1}),\, \nabla_w l(w_c^{t,i}, x, y) \big\rangle \Big),
\]

where $\alpha_c = \frac{L\zeta_c^t}{2}\cdot\frac{\eta^2}{|B_c|^2}$ and $\beta_c = \zeta_c^t \cdot \frac{\eta}{|B_c|}$. ∎
Theorem 1 shows that the loss reduction of the global FL model $F(w_{fed}^t)$ in each round is closely related to two terms: the local gradient magnitude of a data sample (term 1), and the projection of the local gradient of a data sample onto the direction of the global gradient over the global data distribution (term 2). Due to the different magnitude orders of the coefficients $\alpha_c$ and $\beta_c$⁴, and also the different orders of the values of terms 1 and 2 as shown in Figure 2(a), we can focus on term 2 (i.e., the projection of the local gradient of a data sample onto the global gradient) to evaluate a local data sample's impact on the convergence rate.

⁴ $\alpha_c / \beta_c \propto \eta / |B_c| \approx 10^{-4}$ with a common learning rate of $10^{-3}$ and a storage size of 10.

We next briefly describe how to evaluate data samples based on term 2. The local model parameter $w_c^{t,i}$ in term 2 is computed from (1), where the gradient $\nabla_w \tilde{F}_c(w_c^{t,i-1})$ depends on all the locally stored data samples. In other words, the value of term 2 depends on the "cooperation" of all the locally stored data samples. Thus, we can formulate the computation of term 2 via cooperative game theory [41], where each data sample represents a player and the utility of each sample set is the value of term 2. Within this cooperative game, we can regard the individual contribution of each data sample to term 2 as its value, and quantify it through leave-one-out [27], [42] or the Shapley Value [28], [43]. As these methods require multiple model retrainings to compute the marginal contribution of each individual data sample, we propose a one-step look-ahead policy that approximates each sample's value by focusing only on the first local training epoch ($m = 1$).
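Concretely, restricting term 2 to the first local epoch makes each sample's contribution independent of the other stored samples: with $i = 0$, the local model is simply the received global model, so

\[
w_c^{t,0} = w_{fed}^{t-1} \;\Longrightarrow\; \text{term 2} = \big\langle \nabla_w l(w_{fed}^{t-1}, x, y),\, \nabla_w F(w_{fed}^{t-1}) \big\rangle,
\]

which is exactly the quantity we later adopt as the data valuation metric in Definition 1.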
Inference Accuracy. We assume that the optimal FL model can be obtained by gathering all clients' generated data and conducting CL. Moreover, as a testing dataset and the corresponding testing accuracy are hard to obtain in FL, we use the weight divergence between the models trained through FL and CL, i.e., $\| w_{fed}^t - w_{cen}^{mt} \|$, to quantify the accuracy of the FL model in each round $t$. With $t \to \infty$, we can evaluate the final accuracy of the FL model.

Theorem 2. (Model Weight Divergence) With Assumption 1, for an arbitrary participating client set $C_t$, we have the following inequality for the weight divergence between the models trained through FL and CL after the $t$th training round:

\[
\| w_{fed}^t - w_{cen}^{mt} \|_2 \;\leqslant\; (1+\eta L)^m \,\| w_{fed}^{t-1} - w_{cen}^{m(t-1)} \|_2 + \sum_{c\in C_t} \zeta_c^t \Big[ \eta \sum_{i=0}^{m-1} (1+\eta L)^{m-1-i}\, G_c(w_c^{t,i}) \Big], \tag{7}
\]

where $G_c(w) = \| \nabla_w \tilde{F}_c(w) - \nabla_w F(w) \|_2$.

Proof. From the aggregation formula of Fed-Avg, we can express the global model $w_{fed}^t$ as the weighted sum of the locally updated models of the participating clients $c \in C_t$:

\[
\| w_{fed}^t - w_{cen}^{mt} \| = \Big\| \sum_{c\in C_t} \zeta_c^t\, w_c^{t,m} - w_{cen}^{mt} \Big\|,
\]

which can be further decomposed according to the formula of model update in (1):

\[
\| w_{fed}^t - w_{cen}^{mt} \| = \Big\| \sum_{c\in C_t} \zeta_c^t \big[ w_c^{t,m-1} - \eta\nabla_w \tilde{F}_c(w_c^{t,m-1}) \big] - w_{cen}^{mt-1} + \eta\nabla_w F(w_{cen}^{mt-1}) \Big\|
\;\leqslant\; \Big\| \sum_{c\in C_t} \zeta_c^t\, w_c^{t,m-1} - w_{cen}^{mt-1} \Big\| + \eta \Big\| \sum_{c\in C_t} \zeta_c^t \nabla_w\tilde{F}_c(w_c^{t,m-1}) - \nabla_w F(w_{cen}^{mt-1}) \Big\|. \tag{8}
\]

Then, we leverage the triangle inequality to upper bound (8):

\[
(8) \;\leqslant\; \sum_{c\in C_t} \zeta_c^t \| w_c^{t,m-1} - w_{cen}^{mt-1} \| + \eta \Big\| \sum_{c\in C_t} \zeta_c^t \big[ \nabla_w \tilde{F}_c(w_c^{t,m-1}) - \nabla_w F(w_c^{t,m-1}) + \nabla_w F(w_c^{t,m-1}) - \nabla_w F(w_{cen}^{mt-1}) \big] \Big\|
= \sum_{c\in C_t} \zeta_c^t \| w_c^{t,m-1} - w_{cen}^{mt-1} \| + \eta \Big\| \sum_{c\in C_t} \zeta_c^t \big[ \nabla_w \tilde{F}_c(w_c^{t,m-1}) - \nabla_w F(w_c^{t,m-1}) \big] \Big\| + \eta \Big\| \sum_{c\in C_t} \zeta_c^t \big[ \nabla_w F(w_c^{t,m-1}) - \nabla_w F(w_{cen}^{mt-1}) \big] \Big\|. \tag{9}
\]

According to the Lipschitz continuity of the global loss $F$ in Assumption 1, we can further bound (9):

\[
(9) \;\leqslant\; \sum_{c\in C_t} \zeta_c^t (1+\eta L) \| w_c^{t,m-1} - w_{cen}^{mt-1} \| + \eta \Big\| \sum_{c\in C_t} \zeta_c^t \big[ \nabla_w \tilde{F}_c(w_c^{t,m-1}) - \nabla_w F(w_c^{t,m-1}) \big] \Big\|
\;\leqslant\; \sum_{c\in C_t} \zeta_c^t \Big[ (1+\eta L) \| w_c^{t,m-1} - w_{cen}^{mt-1} \| + \eta\, G_c(w_c^{t,m-1}) \Big], \tag{10}
\]

where $G_c(w) = \| \nabla_w \tilde{F}_c(w) - \nabla_w F(w) \|$ intuitively represents the divergence between the gradients over the local stored data and the global data distribution.

Next, we derive an upper bound of $\| w_c^{t,m-1} - w_{cen}^{mt-1} \|$ for each participating client $c \in C_t$:

\[
\| w_c^{t,m-1} - w_{cen}^{mt-1} \| = \| w_c^{t,m-2} - \eta\nabla_w \tilde{F}_c(w_c^{t,m-2}) - w_{cen}^{mt-2} + \eta\nabla_w F(w_{cen}^{mt-2}) \|
\;\leqslant\; \| w_c^{t,m-2} - w_{cen}^{mt-2} \| + \eta \| \nabla_w F(w_c^{t,m-2}) - \nabla_w F(w_{cen}^{mt-2}) \| + \eta \| \nabla_w \tilde{F}_c(w_c^{t,m-2}) - \nabla_w F(w_c^{t,m-2}) \|, \tag{11}
\]

which can be further bounded using Assumption 1:

\[
(11) \;\leqslant\; (1+\eta L)\, \| w_c^{t,m-2} - w_{cen}^{mt-2} \| + \eta\, G_c(w_c^{t,m-2})
\;\leqslant\; (1+\eta L)^{m-1} \| w_c^{t,0} - w_{cen}^{(t-1)m} \| + \eta \sum_{i=0}^{m-2} (1+\eta L)^{m-2-i}\, G_c(w_c^{t,i}). \tag{12}
\]

Combining inequalities (10) and (12), we have:

\[
\| w_{fed}^t - w_{cen}^{mt} \| \;\leqslant\; \sum_{c\in C_t} \zeta_c^t \Big[ (1+\eta L)^m \| w_c^{t,0} - w_{cen}^{(t-1)m} \| + \eta \sum_{i=0}^{m-1} (1+\eta L)^{m-1-i}\, G_c(w_c^{t,i}) \Big]
= (1+\eta L)^m \| w_{fed}^{t-1} - w_{cen}^{(t-1)m} \| + \sum_{c\in C_t} \zeta_c^t \Big[ \eta \sum_{i=0}^{m-1} (1+\eta L)^{m-1-i}\, G_c(w_c^{t,i}) \Big]. \;∎
\]

Fig. 2. Empirical results to support our claims; the experiment setting is described in Section IV-A. (a) Comparison of the normalized values of the two terms in (3) and (13). (b) Comparison of the data values computed in round 0 and round 30. (c) Comparison of the data values computed using the gradients of all model layers and of only the last layer.

Hence, we bound the divergence between the models trained through FL and CL by the initial model difference and the additional divergence caused by the $m$ local model updates of the heterogeneous participating clients. The following lemma further shows the impact of a local data sample on $\| w_{fed}^t - w_{cen}^{mt} \|_2$ through $G_c(w_c^{t,i})$.

Lemma 1. (Gradient Divergence) For an arbitrary client $c \in C$, $G_c(w) = \| \nabla_w\tilde{F}_c(w) - \nabla_w F(w) \|_2$ is bounded by:

\[
G_c(w) \;\leqslant\; \Big( \delta + \underbrace{\frac{1}{|B_c|} \sum_{(x,y)\in B_c} \| \nabla_w l(w,x,y) \|_2^2}_{\text{term 1}} - \underbrace{\frac{2}{|B_c|} \sum_{(x,y)\in B_c} \big\langle \nabla_w l(w,x,y),\, \nabla_w F(w) \big\rangle}_{\text{term 2}} \Big)^{1/2}, \tag{13}
\]

where $\delta = \| \nabla_w F(w) \|_2^2$ is a constant term for all data samples.

Proof. The main idea of the proof is to express the local gradient $\nabla_w \tilde{F}_c(w)$ as the average gradient of the locally stored data samples in $B_c$:

\[
G_c(w) = \| \nabla_w \tilde{F}_c(w) - \nabla_w F(w) \| = \Big[ \| \nabla_w F(w) \|^2 + \| \nabla_w \tilde{F}_c(w) \|^2 - 2\big\langle \nabla_w F(w),\, \nabla_w \tilde{F}_c(w) \big\rangle \Big]^{1/2}
= \Big[ \| \nabla_w F(w) \|^2 + \Big\| \sum_{(x,y)\in B_c} \frac{\nabla_w l(w,x,y)}{|B_c|} \Big\|^2 - 2\Big\langle \nabla_w F(w),\, \sum_{(x,y)\in B_c} \frac{\nabla_w l(w,x,y)}{|B_c|} \Big\rangle \Big]^{1/2}
\;\leqslant\; \Big[ \| \nabla_w F(w) \|^2 + \frac{1}{|B_c|} \sum_{(x,y)\in B_c} \| \nabla_w l(w,x,y) \|^2 - \frac{2}{|B_c|} \sum_{(x,y)\in B_c} \big\langle \nabla_w F(w),\, \nabla_w l(w,x,y) \big\rangle \Big]^{1/2},
\]

where $\| \nabla_w F(w) \|^2$ is a constant term for all clients and local data samples. ∎

Intuitively, due to the different coefficients, the twofold projection (term 2) has a larger mean and variance than the gradient magnitude (term 1) among different data samples, which is also demonstrated in our experiments in Figure 2(a). Thus, a local data sample influences $G_c(w)$, and hence the inference accuracy, mainly through term 2 in (13), which happens to coincide with term 2 in the bound of the global loss reduction in (3).
Data Valuation. Based on the above analysis, we define a new data valuation metric for FL, and provide a theoretical understanding as well as an intuitive interpretation.

Definition 1. (Data Valuation Metric) In the $t$th round, for a client $c \in C$, the value of a data sample $(x, y)$ is defined as the projection of its local gradient $\nabla_w l(w_{fed}^t, x, y)$ onto the global gradient of the current global model over the unbiased global data distribution:

\[
v(x,y) \;\overset{\mathrm{def}}{=}\; \big\langle \nabla_w l(w_{fed}^t, x, y),\, \nabla_w F(w_{fed}^t) \big\rangle. \tag{14}
\]

Based on this new data valuation metric, once a client receives a new data sample, she can make an online decision on whether to store it by comparing the data value of the new sample with those of the old samples in storage, which can be easily implemented as a priority queue.

Theoretical Understanding. On the one hand, maximizing the above data valuation metric of the selected data samples is a one-step greedy strategy for minimizing the loss of the updated global model in each training round according to (3), because it optimizes the dominant term (term 2) in the lower bound of the global loss reduction in (3). The one-step look-ahead policy means that we only consider the first epoch of the local model training in (3), i.e., $\langle \nabla_w l(w_c^{t,i}, x, y), \nabla_w F(w_{fed}^{t-1}) \rangle$ with the index $i$ set to 0 and $w_c^{t,0} = w_{fed}^{t-1}$. On the other hand, this metric also improves the inference accuracy of the final global model by narrowing the gap between the models trained through FL and CL, as it reduces the highest-weight term of the dominant part in (7), i.e., $(1+\eta L)^{m-1} G_c(w_c^{t,0})$.

Intuitive Interpretation. Under the proposed data valuation metric, a large data value of one sample indicates that its impact on the global FL model is similar to that of the unbiased global data distribution, guiding the clients to select the data samples that not only follow their own local data distributions but also have an effect similar to the global data distribution. In this way, the personalized information of the local data distribution is preserved and the data heterogeneity across clients is reduced, both of which have been demonstrated to improve FL performance [44], [45], [38], [46].
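Since the value in (14) is a scalar per sample, on-device selection reduces to keeping the top-$|B_c|$ samples seen so far, e.g., with a min-heap. Below is a minimal sketch (hypothetical helper names; in practice the global gradient would be the estimator introduced in Section III-B rather than the exact $\nabla_w F$):

```python
import heapq
import numpy as np

class ValueQueue:
    """Min-heap of (value, sample) entries that keeps the highest-value samples."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.heap = []       # entries: (value, counter, sample)
        self.counter = 0     # tie-breaker so raw samples are never compared

    def offer(self, sample, value):
        entry = (value, self.counter, sample)
        self.counter += 1
        if len(self.heap) < self.capacity:
            heapq.heappush(self.heap, entry)      # storage not full: keep it
        elif value > self.heap[0][0]:
            heapq.heappushpop(self.heap, entry)   # evict the current minimum
        # otherwise: discard the new sample

def data_value(w_global, sample, global_grad_est, grad_sample):
    """Eq. (14): projection of the per-sample gradient onto the
    (estimated) global gradient."""
    x, y = sample
    return float(np.dot(grad_sample(w_global, x, y), global_grad_est))
```

A single `offer` call costs $O(\log |B_c|)$ heap operations plus one gradient evaluation, which is what makes the metric usable for per-sample online decisions.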
B. On-Client Data Selection

In practice, it is non-trivial for one client to directly utilize the above data valuation metric for online data selection, due to the following two challenges:

(1) Lack of the latest global model: due to the partial participation of clients in FL [47], each client $c \in C$ does not receive the global FL model $w_{fed}^{t-1}$ in the rounds where she is not selected for model training, and only has the outdated global model from her previous participating round, i.e., $w_{fed}^{t_{c,last}-1}$.

(2) Lack of the unbiased global gradient: the accurate global gradient over the unbiased global data distribution can only be obtained by aggregating all the clients' local gradients over their unbiased local data distributions. This is hard to achieve because only some of the clients participate in each communication round, and the locally stored data distribution can become biased during the on-client data selection process.

Challenge (1) does not affect the online data selection process much, as the value of each data sample remains relatively stable across a few training rounds, which is demonstrated by the experimental results in Figure 2(b); clients can thus simply use the old global model for data valuation. To overcome challenge (2), we propose a gradient estimation method. First, to resolve the skew of the local gradient, we require each client $c \in C$ to maintain a local gradient estimator $\hat{g}_c$, which is updated whenever the client receives the $n$th new data sample $(x,y)$ since her last participating round:

\[
\hat{g}_c \;\leftarrow\; \frac{n-1}{n}\,\hat{g}_c + \frac{1}{n}\,\nabla_w l(w_{fed}^{t_{c,last}-1}, x, y), \tag{15}
\]

where $t_{c,last}$ denotes the last participating round of client $c$, and thus $w_{fed}^{t_{c,last}-1}$ is the latest global model that client $c$ has received. When client $c$ is selected to participate in FL at a certain round $t$, she uploads the current local gradient estimator $\hat{g}_c$ to the server, and resets the local gradient estimator, i.e., $\hat{g}_c \leftarrow 0$, $n \leftarrow 0$, because a new global FL model $w_{fed}^{t-1}$ is received. Second, to resolve the skew of the global gradient due to partial client participation, the server maintains a global gradient estimator $\hat{g}^t$, which is an aggregation of the local gradient estimators, $\hat{g}^t = \sum_{c\in C} \zeta_c \hat{g}_c$. As collecting $\hat{g}_c$ from all the clients would incur high communication cost, the server only uses the $\hat{g}_c$ of the participating clients to update the global gradient estimator $\hat{g}^t$ in each round $t$:

\[
\hat{g}^t \;\leftarrow\; \hat{g}^{t-1} + \sum_{c\in C_t} \zeta_c \big( \hat{g}_c - \hat{g}_c^{t_{last}} \big). \tag{16}
\]

Thus, in each training round $t$, the server distributes both the current global FL model $w_{fed}^{t-1}$ and the latest global gradient estimator $\hat{g}^{t-1}$ to each selected client $c \in C_t$, who conducts local model training and uploads both the locally updated model $w_c^{t,m}$ and the local gradient estimator $\hat{g}_c$ back to the server.

Simplified Version. In both the local gradient estimation in (15) and the data valuation in (14), we need to backpropagate through the entire model to compute the gradient of a new data sample, which introduces high computation cost as well as a high memory footprint for storing intermediate model outputs. To reduce these costs, we only use the gradients of the last few network layers of the ML model instead of the whole model, as the partial model gradient is also able to reflect the trend of the full gradient, which is verified in Figure 2(c).

Privacy Concern. The transmission of local gradient estimators may disclose the local gradient of each client to some extent, which can be avoided by adding Gaussian noise to each local gradient estimator before uploading, as in differential privacy [48], [49].
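A minimal sketch of the two estimators follows (a NumPy illustration under our reading of (15) and (16), not the authors' implementation; the per-sample gradient function is assumed to be provided):

```python
import numpy as np

class LocalGradientEstimator:
    """Running mean of per-sample gradients since the last participation, eq. (15)."""
    def __init__(self, dim):
        self.g = np.zeros(dim)
        self.n = 0

    def update(self, grad_sample):
        self.n += 1
        self.g += (grad_sample - self.g) / self.n   # incremental mean

    def reset(self):
        self.g[:] = 0.0
        self.n = 0

def update_global_estimator(g_global, uploads, weights, last_uploads):
    """Eq. (16): replace each participant's stale contribution with its fresh one.
    uploads[c] is client c's newly uploaded estimator, last_uploads[c] the one
    the server currently holds for c, weights[c] its aggregation weight."""
    for c in uploads:
        g_global = g_global + weights[c] * (uploads[c] - last_uploads[c])
        last_uploads[c] = uploads[c]
    return g_global
```

The incremental-mean form avoids storing past gradients on-device, and the server-side delta update in (16) touches only the participating clients' entries, keeping the per-round communication proportional to $|C_t|$ rather than $|C|$.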
C. Cross-Client Data Storage

Since the local data distributions of clients may overlap with each other, conducting the data selection process independently on each client may lead to a distorted distribution of the data stored across clients. One potential solution is to divide the global data distribution into several regions, and coordinate each client to store valuable data samples for one specific distribution region, so that the union of all the stored data still follows the unbiased global data distribution. In this work, we use the label of data samples as the dividing criterion⁵. Thus, before the training process, the server needs to instruct each client on the labels and the corresponding quantity of data samples to store. Considering the partial client participation and heterogeneous data distributions among clients, the cross-client coordination policy needs to satisfy the following four desirable properties:

(1) Efficient Data Selection: to improve the efficiency of data selection, each label $y \in Y$ should be assigned to the clients that generate more data samples with this label, following the intuition that there is a higher probability of selecting more valuable data samples from a larger pool of candidates.

(2) Redundant Label Assignment: to ensure that all the labels are likely to be covered in each round even with partial client participation, we require each label $y \in Y$ to be assigned to more than $n_y^{label}$ clients, which is a hyperparameter decided by the server.

(3) Limited Storage: due to the limited on-device storage, each client $c$ should be assigned fewer than $n_c^{client}$ labels, so that a sufficient number of valuable data samples can be stored for each assigned label within the limited storage volume; $n_c^{client}$ is also a hyperparameter decided by the server.

(4) Unbiased Global Distribution: the weighted average of all clients' stored data distributions is expected to equal the unbiased global data distribution, i.e., $\tilde{P}(y) = P(y)$ for $y \in Y$.

⁵ There are other methods to divide the data distribution, such as K-means [50] and hierarchical clustering [51], and our results are independent of these methods.

We formulate the cross-client data storage with the above four properties by representing the coordination strategy as a matrix $D \in \mathbb{N}^{|C|\times|Y|}$, where $D_{c,y}$ denotes the number of data samples with label $y$ that client $c$ should store. We use a matrix $V \in \mathbb{R}^{|C|\times|Y|}$ to denote the statistical information of each client's generated data, where $V_{c,y} = v_c P_c(y)$ is the average velocity of the data samples with label $y$ generated by client $c$. The cross-client coordination strategy can be obtained by solving the following optimization problem, where property (1) is formulated as the objective, and properties (2), (3) and (4) are described by the constraints (18), (19) and (20), respectively:

\[
\max_{D} \;\; \sum_{c\in C}\sum_{y\in Y} D_{c,y} \cdot V_{c,y} = \| DV \|_1, \tag{17}
\]

\[
\text{s.t.} \quad \| D_y^T \|_0 \geqslant n_y^{label}, \quad \forall y \in Y, \tag{18}
\]

\[
\| D_c \|_0 \leqslant n_c^{client}, \quad \forall c \in C, \tag{19}
\]

\[
\| D_c \|_1 = |B_c| \;\;\forall c \in C, \qquad \frac{\sum_{c\in C} D_{c,y}}{\| B \|_1} = \frac{\sum_{c\in C} V_{c,y}}{\| V \|_1} \;\;\forall y \in Y. \tag{20}
\]
Complexity Analysis. We can verify that the above optimization problem with $l_0$ norms is a general convex-cardinality problem, which is NP-hard [52], [53]. To solve it, we divide it into two subproblems: (1) decide which elements of the matrix $D$ are non-zero, i.e., $S = \{(c,y)\,|\,D_{c,y} \neq 0\}$, which assigns labels to clients under the constraints (18) and (19); (2) determine the specific values of the non-zero elements of $D$ by solving a simplified convex optimization problem:

\[
\max_{D} \;\; \sum_{(c,y)\in S} D_{c,y} \cdot V_{c,y}
\quad \text{s.t.} \quad \sum_{y:(c,y)\in S} D_{c,y} = |B_c| \;\;\forall c \in C, \qquad \frac{\sum_{c:(c,y)\in S} D_{c,y}}{\| D \|_1} = \frac{\sum_{c\in C} V_{c,y}}{\| V \|_1} \;\;\forall y \in Y. \tag{21}
\]

As the number of possible sets $S$ can be exponential in $|C|$ and $|Y|$, it is still prohibitively expensive to derive the globally optimal $D$ for large $|C|$ (massive clients in FL). The classic approach is to replace the non-convex discontinuous $l_0$-norm constraints with convex continuous $l_1$-norm regularization terms in the objective function [53], which fails in our scenario, because simultaneously minimizing as many as $|C| + |Y|$ non-differentiable $l_1$ norms in the objective leads to high computation and memory costs as well as unstable solutions [54], [55]. Thus, we propose a greedy algorithm for this problem.

Greedy Cross-Client Coordination Algorithm. We achieve the four desirable properties through the following three steps, which are summarized in Algorithm 1.

Algorithm 1: Greedy cross-client coordination
Input: label set Y, client set C, data velocity matrix V, storage size vector B, redundancy requirement n^label, storage limit n^client
Output: coordination matrix D
1  // Sort labels by number of owners
2  Y ← sort Y in increasing order of the number of non-zero elements in V[:, y];
3  for y in Y do
4      // Sort clients by data velocity
5      C_tmp ← sort c ∈ C in decreasing order of the data velocity V[c, y];
6      for c in C_tmp do
7          // Limited storage requirement
8          if Sum(D[c, :]) < n_c^client then
9              D[c, y] ← 1;
10         end
11         // Redundant label requirement
12         if Sum(D[:, y]) ≥ n_y^label then
13             break;
14         end
15     end
16 end
17 // Divide each device's storage evenly
18 for c in C do
19     cnt ← Sum(D[c, :]);
20     for y in Y do
21         D[c, y] ← B[c] * D[c, y] / cnt;
22     end
23 end
24 return D;

(1) Information Collection: each client $c \in C$ sends rough information about her local data to the server, including the storage capacity $|B_c|$ and the data velocity $V_{c,y}$ of each label $y$, which can be obtained from the statistics of previous time periods. The server then constructs the storage size vector $B \in \mathbb{N}^{|C|}$ and the data velocity matrix $V \in \mathbb{R}^{|C|\times|Y|}$ of all clients.

(2) Label Assignment: the server sorts the labels in increasing order of the total number of clients owning each label, and prioritizes the labels at the top of this order in the label-client assignment, because these labels have more difficulty finding enough clients to satisfy the Redundant Label Assignment property (lines 1-2). Specifically, for each considered label $y \in Y$ in this order, there can be multiple candidate clients, and the server allocates label $y$ to the clients $c$ that generate data samples with label $y$ at a higher velocity, attempting to satisfy the Efficient Data Selection property (lines 5-6, 6-15).
Once the number of labels assigned to a client $c$ reaches $n_c^{client}$, this client is removed from further consideration, due to the Limited Storage property (lines 7-10).

(3) Quantity Assignment: with the above two steps, we have decided the non-zero elements of the client-label matrix $D$, i.e., the set $S$. To further reduce the computational complexity and avoid imbalanced on-device data storage for each label, we do not directly solve the optimization problem in (21). Instead, we require each client to divide her storage capacity evenly among the assigned labels, and compute a weight $\gamma_y$ for each label $y \in Y$ to guarantee that the weighted distribution of the stored data approximates the unbiased global data distribution (lines 18-23), i.e., $\gamma_y \tilde{P}(y) = P(y)$. Accordingly, we can derive the weight $\gamma_y$ of each label $y$ by setting

\[
\gamma_y = \frac{P(y)}{\tilde{P}(y)} = \frac{\| V^T[y] \|_1 / \| V \|_1}{\| D^T[y] \|_1 / \| D \|_1}. \tag{22}
\]

Thus, each client $c$ only needs to maintain one priority queue of size $|B[c]| / \| D[c] \|_0$ for each assigned label. In local model training, each participating client $c \in C_t$ updates the local model using the weighted stored data samples:

\[
w_c^{t,i} \;\leftarrow\; w_c^{t,i-1} - \frac{\eta}{\zeta_c} \sum_{(x,y)\in B_c} \gamma_y \nabla_w l(w_c^{t,i-1}, x, y), \tag{23}
\]

where $\zeta_c = \sum_{(x,y)\in B_c} \gamma_y$ denotes the new weight of each client, and the normalized weight of client $c \in C_t$ for model aggregation in round $t$ becomes $\zeta_c^t = \zeta_c / \sum_{c'\in C_t} \zeta_{c'}$. We illustrate a simple example of the above procedure in Figure 3.

Fig. 3. A simple example to illustrate the greedy coordination. ① Sort labels according to the number of owners and obtain the label order {2, 1, 3}; ② sort and allocate clients for each label under the constraints $n_c^{client} = 2$ and $n_y^{label} = 2$; ③ obtain the coordination matrix $D$ and compute the label weights $\gamma$ according to (22).
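The same greedy procedure in runnable form, together with the label-weight computation from (22), is sketched below (a NumPy illustration following Algorithm 1; the variable names are ours):

```python
import numpy as np

def greedy_coordination(V, B, n_label, n_client):
    """Greedy label-to-client assignment following Algorithm 1.
    V: |C| x |Y| matrix of per-label data velocities; B: per-client storage sizes;
    n_label[y]: minimum number of clients per label (property (2));
    n_client[c]: maximum number of labels per client (property (3))."""
    num_c, num_y = V.shape
    D = np.zeros((num_c, num_y))
    # Labels owned by fewer clients are the hardest to cover, so assign them first.
    for y in np.argsort((V > 0).sum(axis=0)):
        for c in np.argsort(-V[:, y]):          # prefer high-velocity clients
            if D[c].sum() < n_client[c]:
                D[c, y] = 1
            if D[:, y].sum() >= n_label[y]:
                break
    for c in range(num_c):                       # split storage evenly (lines 18-23)
        if D[c].sum() > 0:
            D[c] = B[c] * D[c] / D[c].sum()
    return D

def label_weights(D, V):
    """Eq. (22): gamma_y = P(y) / P~(y), assuming every label is assigned to at
    least one client (which the redundancy constraint is meant to guarantee)."""
    p_true = V.sum(axis=0) / V.sum()     # global label distribution P(y)
    p_stored = D.sum(axis=0) / D.sum()   # stored label distribution P~(y)
    return p_true / p_stored
```

Running `greedy_coordination` once before training and broadcasting each client's row of `D` (plus the vector of $\gamma_y$) is all the coordination the protocol requires; no raw data leaves the devices.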
Privacy Concern. The potential privacy leakage of uploading this rough local information is tolerable in practice, and can be further avoided through homomorphic encryption [56], which enables sorting $k$ encrypted data samples with computational complexity $O(k \log k^2)$ [57].

D. Overall Procedure of ODE

ODE incorporates cross-client data storage and on-client data evaluation to coordinate mobile devices to store valuable samples, speeding up the model training process and improving the inference accuracy of the final global model. The overall procedure is shown in Figure 4.

Fig. 4. Overview of ODE framework.

Key Idea. Intuitively, the cross-client data storage component coordinates clients to store high-quality data samples with different labels, avoiding overlap among the data stored by different clients. The on-client data evaluation component instructs each client to select the data whose local gradients are similar to the global gradient, which reduces the data heterogeneity among clients while preserving the personalized data distribution.

Cross-Client Data Storage. Before the FL model training process, the central server collects the data distribution information and storage capacity from all clients (①), and solves the optimization problem in (17) through our greedy algorithm (②).

On-Client Data Evaluation. During the FL process, clients train the local model in participating rounds, and utilize idle computation and memory resources to conduct on-device data selection in non-participating rounds. In the $t$th training round, the non-selected clients, the selected clients and the server perform different operations:

• Non-selected clients:
Data Evaluation (③): each non-selected client $c \in C \setminus C_t$ continuously evaluates and selects the arriving data samples according to the data valuation metric in (14), in which the client uses the estimated global gradient received in her last participating round instead of the accurate one.
Local Gradient Estimator Update (④): the client also continually updates the local gradient estimator $\hat{g}_c$ using (15).

• Selected clients:
Local Model Update (⑤): after receiving the new global model $w_{fed}^{t-1}$ and the new global gradient estimator $\hat{g}^{t-1}$ from the server, each selected client $c \in C_t$ performs local model updates on the stored data samples by (23).
Local Model and Estimator Transmission (⑥): each selected client sends the updated model $w_c^{t,m}$ and the local gradient estimator $\hat{g}_c$ to the server. The local estimator $\hat{g}_c$ is then reset to 0, so as to start approximating the local gradient of the newly received global model $w_{fed}^{t-1}$.

• Server:
Global Model and Estimator Transmission (⑦): at the beginning of each training round, the server distributes the global model $w_{fed}^{t-1}$ and the global gradient estimator $\hat{g}^{t-1}$ to the selected clients.
Local Model and Estimator Aggregation (⑧): at the end of each training round, the server collects the updated local models $w_c^{t,m}$ and local gradient estimators $\hat{g}_c$ from the participating clients $c \in C_t$, and aggregates them to obtain a new global model by (2) and a new global gradient estimator by (16).

IV. EVALUATION RESULTS

In this section, we first introduce the experiment settings, baselines and evaluation metrics (Section IV-A). Second, we show the individual and combined impacts of limited on-device storage and streaming networked data on FL, to demonstrate the motivation of our work (Section IV-B). Third, we present the overall performance of ODE and other data selection methods on model training speedup and inference accuracy improvement, as well as their memory footprint and data evaluation delay (Section IV-C). Next, we show the robustness of ODE against various environmental factors (Section IV-D), such as the number of local training epochs $m$, the client participation rate $|C_t|/|C|$, the on-device storage capacity $B_c$, the mini-batch size and the data heterogeneity across clients. Finally, we analyze the individual effect of each component of ODE (Section IV-E).

TABLE II
INFORMATION OF DIFFERENT TASKS, DATASETS, MODELS AND DEFAULT EXPERIMENT SETTINGS.

| Task | Dataset | Model | #Samples | #Labels | #Devices | $n_y^{label}$ | $|C_t|/|C|$ | $\eta$ | $m$ | $|B_c|$ |
|---|---|---|---|---|---|---|---|---|---|---|
| ST | Synthetic Dataset [31] | LogReg | 1,016,442 | 10 | 200 | 5 | 5% | 1e-4 | 5 | 10 |
| IC | Fashion-MNIST [32] | LeNet [58] | 70,000 | 10 | 50 | 5 | 10% | 1e-3 | 5 | 5 |
| HAR | HARBOX [33] | Customized DNN | 34,115 | 5 | 120 | 5 | 10% | 1e-3 | 5 | 5 |
| TC | Industrial Dataset | Customized CNN | 37,707 | 20 | 30 | 5 | 20% | 5e-3 | 5 | 10 |

A. Experiment Setting

Learning Tasks, Datasets and ML Models. To demonstrate ODE's advanced performance and generalization across various learning tasks, datasets and ML models, we evaluate ODE on one synthetic dataset, two real-world datasets and one industrial dataset, which vary in data quantity, distribution and model outputs, and cover the tasks of Synthetic Task (ST), Image Classification (IC), Human Activity Recognition (HAR) and mobile Traffic Classification (TC). The statistics of the tasks are summarized in Table II, and the tasks are introduced as follows:

Fig. 5. Unbalanced data quantity of clients.

• Synthetic Task. The synthetic dataset we use is proposed in the LEAF benchmark [31] and is also described in detail in [36]. It contains 200 clients and 1 million data samples, and a logistic regression model is trained for this 10-class task.
• Image Classification. Fashion-MNIST [32] contains 60,000 training images and 10,000 testing images, which are divided among 50 clients according to labels [16]. We train LeNet [58] for this 10-class image classification task.

• Human Activity Recognition. HARBOX [33] is a 9-axis IMU dataset collected from 121 users' smartphones in a crowdsourcing manner, including 34,115 data samples with 900 dimensions. Considering the simplicity of the dataset and task, a lightweight customized DNN with two dense layers followed by a SoftMax layer is deployed for this 5-class human activity recognition task [59].

• Traffic Classification. The industrial dataset for mobile application classification was collected from our deployment of 30 ONT devices in a simulated network environment from May 2019 to June 2019. Overall, the dataset contains more than 560,000 data samples covering more than 250 applications as labels, spanning the categories of video (such as YouTube and TikTok), gaming (such as LOL and WOW), file downloading (such as AppStore and Thunder) and communication (such as WhatsApp and WeChat). We manually labeled the application of each data sample. The model is a CNN consisting of 4 convolutional layers with kernel size 1×3 to extract features and 2 fully connected layers for classification, which achieves 95% accuracy through CL and satisfies the on-device resource constraints thanks to its small number of parameters. To reduce the training time caused by the large scale of the dataset, we randomly select 20 out of the 250 applications as labels, with a total of 37,707 data samples, whose distribution is shown in Figure 6.

Fig. 6. Data distribution in the traffic classification dataset.

Parameter Configurations. For all experiments, we use SGD as the optimizer and decay the learning rate every 100 rounds by $\eta_{new} = 0.95 \times \eta_{old}$. To simulate the setting of streaming networked data, we set the on-device data velocity to $v_c = \frac{\#\text{training samples}_c}{500}$, which means that each device $c \in C$ receives $v_c$ data samples one by one in each communication round, and the samples are shuffled and appear again every 500 rounds. Other default configurations are shown in Table II. Note that the participating clients in each round are randomly selected, and each experiment is repeated 5 times with the average results reported [14], [59].
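The streaming simulation described above can be reproduced with a few lines (our own sketch of the stated protocol, not the released experiment code):

```python
import random

def stream(client_samples, rounds, period=500):
    """Yield per-round batches for one device: v_c = len(samples)/period samples
    arrive per round, and the pool is reshuffled every `period` rounds so the
    same samples reappear, as described in the parameter configuration."""
    v_c = max(1, len(client_samples) // period)
    for r in range(rounds):
        if r % period == 0:
            random.shuffle(client_samples)
        start = (r % period) * v_c
        yield client_samples[start:start + v_c]
```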
Baselines. In our experiments, we compare two versions of ODE, ODE-Exact (using the exact global gradient) and ODE-Est (using the estimated global gradient), against four categories of data selection methods: random sampling methods (RS), importance-based methods from CL (HL and GN), existing data selection methods for FL (FB and SLD), and the ideal method with unlimited on-device storage (FD):

• Random sampling (RS) methods, including Reservoir Sampling [35] and First-In-First-Out (storing the latest $|B_c|$ data samples). The experimental results of these two methods are similar, so we only show the results of First-In-First-Out for RS.

• Importance sampling-based methods, including HighLoss (HL), which uses the loss of each data sample as its value to reflect the informativeness of data [60], [61], [62], and GradientNorm (GN), which quantifies the impact of each data sample on the model update through its gradient norm [63], [29].

• Existing data selection methods for canonical FL, including FedBalancer (FB) [25] and SLD (Sample-Level Data selection) [26], which we revise slightly to adapt to the streaming networked setting: (i) we store the loss/gradient norm of the latest 50 samples for noise removal; (ii) for FedBalancer, we ignore the data samples whose loss exceeds the top 10% loss value, and for SLD, we remove the samples whose gradient norm exceeds the median norm value.

• Ideal method with unlimited on-device storage, denoted FullData (FD), which uses the entire dataset of each client for training to simulate the unlimited-storage scenario.

Metrics for Training Performance. We use two metrics to evaluate the performance of each data selection method. (1) Time-to-Accuracy Ratio: we measure the model training speedup by the ratio of the training time of RS to that of the considered method for reaching the same target accuracy, which is set to the final inference accuracy of RS. As the duration of one communication round is usually fixed in practical FL scenarios, we quantify the training time by the number of communication rounds. (2) Final Inference Accuracy: we evaluate the inference accuracy of the final global model on each device's testing data, and report the average accuracy.
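For clarity, the time-to-accuracy ratio can be computed as follows (a hypothetical sketch, where each `acc_curve` is a list of per-round test accuracies of a method):

```python
def time_to_accuracy_speedup(acc_curve_rs, acc_curve_method):
    """Ratio of rounds needed by RS vs. the considered method to first reach
    the target accuracy (defined as the final accuracy of RS)."""
    target = acc_curve_rs[-1]

    def rounds_to(curve):
        for r, acc in enumerate(curve, start=1):
            if acc >= target:
                return r
        return None  # corresponds to '-' in Table V: target never reached

    t_rs, t_m = rounds_to(acc_curve_rs), rounds_to(acc_curve_method)
    return None if t_m is None else t_rs / t_m
```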
B. Experiment Results for Motivation

In this subsection, we provide comprehensive experimental results supporting the motivation of our work. First, we show that the properties of limited on-device storage and streaming networked data can significantly deteriorate the classic FL model training process in various settings, such as different numbers of local training epochs and different characteristics of clients' data. Then, we analyze the separate impacts of these two properties on FL under different data selection methods.

Overall Impact. To investigate how the properties of limited on-device storage and streaming networked data impact classic FL under different data characteristics, such as the range and variance of the local data distributions, we conduct experiments over three settings with different numbers of labels owned by each client, constructed from the industrial traffic classification dataset (TC). The experimental results are shown in Table III, where the average variance denotes the average local data variance of the clients. The results demonstrate that: (1) as the local data variance increases, the reduction in convergence rate and model accuracy becomes larger, because the stored data samples are more likely to be biased when the data distribution has a wide range; (2) as the number of local epochs $m$ increases, the negative impact of limited on-device storage becomes more serious due to larger steps toward the biased update direction, slowing down convergence by up to 3.92× and decreasing the final model accuracy by as much as 6.7%.

TABLE III
IMPACT OF LIMITED ON-DEVICE STORAGE IN SETTINGS WITH DIFFERENT NON-IID DATA DISTRIBUTIONS ON THE TRAFFIC CLASSIFICATION TASK.

| Setting | #Labels per Client | Average Variance | #Local Epochs | $T_{RS}/T_{FullData}$ | Accuracy (RS) | Accuracy (FullData) |
|---|---|---|---|---|---|---|
| Setting 1 | 5 | 0.227 | 2 | 2.18 | 0.867 | 0.912 |
| | | | 5 | 1.70 | 0.855 | 0.890 |
| Setting 2 | 10 | 0.234 | 2 | 2.50 | 0.894 | 0.932 |
| | | | 5 | 2.84 | 0.874 | 0.926 |
| Setting 3 | 15 | 0.235 | 2 | 3.46 | 0.906 | 0.960 |
| | | | 5 | 3.92 | 0.890 | 0.957 |

Individual Impact. To show the individual impact of streaming networked data, we ignore the limited storage and assume that each device can utilize all the generated data, instead of only the stored data, for online data evaluation and selection. To show the individual impact of limited storage, we ignore the streaming property and suppose that, in each epoch, all the new data are generated concurrently and can be accessed arbitrarily. The experimental results are shown in Table IV, where '−' denotes failing to reach the target inference accuracy. The results show that limited on-device storage is the essential cause of the degeneration of the FL training process, because: (1) the streaming property mainly prevents previous methods from making accurate online decisions on data selection, as they select each sample according to a normalized probability that depends on both discarded and upcoming samples, which are not available in the streaming setting but can be estimated; (2) the limited storage prevents previous methods from obtaining full local and global data information, and guides clients to select suboptimal or even low-value data from an insufficient candidate dataset, which is more harmful to the FL process.

TABLE IV
INDIVIDUAL IMPACTS OF ON-DEVICE STORAGE AND STREAMING NETWORKED DATA ON DATA SELECTION IN FL WITH THE SYNTHETIC TASK.

Final Accuracy (%):
| Setting | RS | HL | GN | FB | SLD | ODE |
|---|---|---|---|---|---|---|
| Limited Storage | 79.6 | 78.4 | 83.3 | 77.6 | 76.9 | 87.1 |
| Streaming Data | 79.6 | 87.2 | 87.6 | 87.2 | 87.0 | 88.2 |
| Both | 79.6 | 78.4 | 83.3 | 78.5 | 82.3 | 87.1 |

Training Speedup (×):
| Setting | RS | HL | GN | FB | SLD | ODE |
|---|---|---|---|---|---|---|
| Limited Storage | 1.00 | − | 4.87 | − | − | 9.52 |
| Streaming Data | 1.00 | 2.86 | 2.30 | 2.86 | 2 | 2.67 |
| Both | 1.00 | − | 4.87 | − | 4.08 | 9.52 |

C. Overall Performance

We compare the performance of ODE with the six baselines on all the datasets, and show the results in Table V.

ODE significantly speeds up the model training process. We observe that ODE improves the time-to-accuracy performance over the existing data selection methods on all four datasets. Compared with the baselines, ODE achieves the target accuracy 5.88×–9.52× faster on ST, 1.20×–1.35× faster on IC, 1.55×–2.22× faster on HAR, and 2.5× faster on TC. The largest speedup is achieved on ST, because the high Non-IID degree across clients and the large data divergence within clients leave great potential for ODE to reduce data heterogeneity and improve the training process through the designed on-device data selection policy.

TABLE V
OVERALL PERFORMANCE OF DIFFERENT DATA SELECTION METHODS ON MODEL TRAINING SPEEDUP AND INFERENCE ACCURACY. THE SYMBOL '−' MEANS THAT THE METHOD FAILS TO REACH THE TARGET ACCURACY.

Model Training Speedup (×):
| Task | RS | HL | GN | FB | SLD | ODE-Exact | ODE-Est | FD |
|---|---|---|---|---|---|---|---|---|
| ST | 1.0 | − | 4.87 | − | 4.08 | 9.52 | 5.88 | 2.67 |
| IC | 1.0 | − | − | − | − | 1.35 | 1.20 | 1.01 |
| HAR | 1.0 | − | − | − | − | 2.22 | 1.55 | 4.76 |
| TC | 1.0 | − | − | − | − | 2.51 | 2.50 | 3.92 |

Inference Accuracy (%):
| Task | RS | HL | GN | FB | SLD | ODE-Exact | ODE-Est | FD |
|---|---|---|---|---|---|---|---|---|
| ST | 79.56 | 78.44 | 83.28 | 78.56 | 82.38 | 87.12 | 82.80 | 88.14 |
| IC | 71.31 | 51.95 | 41.45 | 60.43 | 69.15 | 72.71 | 72.70 | 71.37 |
| HAR | 67.25 | 48.16 | 51.02 | 48.33 | 56.24 | 73.63 | 70.39 | 77.54 |
| TC | 89.30 | 69.00 | 69.3 | 72.19 | 72.30 | 95.30 | 95.30 | 96.00 |

ODE largely improves the inference accuracy of the final model. Table V shows that, in comparison with the baselines under the same storage, ODE enhances the final accuracy on all the datasets, achieving 3.24%–7.56% higher accuracy on ST, a 3.13%–6.38% increase on HAR, and around a 6% rise on TC. We also notice that ODE brings a marginal accuracy improvement (≈ 1.4%) on IC, because Fashion-MNIST has less data variance within each label, so a randomly selected subset is sufficient to represent the entire data distribution for model training.

Importance-based data selection methods perform poorly. Table V shows that these methods often cannot even reach the target final accuracy on IC, HAR and TC, as these datasets are collected from the real world and contain noisy data, which makes such importance sampling methods fail [25], [26].

Previous data selection methods for FL outperform the importance-based methods but remain worse than ODE. As shown in Table V, FedBalancer and SLD perform better than HL and GN, but considerably worse than RS, which differs from the phenomenon observed in traditional settings [64], [26], [25]. This is because (1) their noise reduction steps, such as removing the samples with the highest loss or gradient norm, rely heavily on the complete statistical information of the full dataset, and (2) their on-client data valuation metrics fail to work for global model training in FL, as discussed in Section I.
C. Overall Performance

We compare the performance of ODE with the six baselines on all the datasets, and show the results in Table V.

ODE significantly speeds up the model training process. ODE improves the time-to-accuracy performance over the existing data selection methods on all four datasets. Compared with the baselines, ODE achieves the target accuracy 5.88× ∼ 9.52× faster on ST, 1.20× ∼ 1.35× faster on IC, 1.55× ∼ 2.22× faster on HAR, and 2.5× faster on TC. The largest speedup is achieved on the dataset ST, because the high non-i.i.d. degree across clients and the large data divergence within clients leave great potential for ODE to reduce data heterogeneity and improve the training process through the designed on-device data selection policy.

ODE largely improves the inference accuracy of the final model. Table V shows that, compared with the baselines under the same storage constraint, ODE enhances the final accuracy on all the datasets, achieving a 3.24% ∼ 7.56% improvement on ST, 3.13% ∼ 6.38% on HAR, and around 6% on TC. ODE yields only a marginal accuracy improvement (≈ 1.4%) on IC, because FashionMNIST has little data variance within each label, so a randomly selected subset already suffices to represent the entire data distribution for model training.

Importance-based data selection methods perform poorly. Table V shows that these methods cannot even reach the target final accuracy on tasks IC, HAR and TC, as these datasets are collected from the real world and contain noisy data, which makes such importance-sampling methods fail to work [25], [26].

Previous data selection methods for FL outperform importance-based methods but are worse than ODE. As shown in Table V, FedBalancer and SLD perform better than HL and GN, but considerably worse than RS, which differs from the phenomenon observed in traditional settings [64], [26], [25]. This is because (1) their noise reduction steps, such as removing samples with top loss or gradient norm, rely heavily on complete statistical information of the full dataset, and (2) their on-client data valuation metrics fail to work for global model training in FL, as discussed in Section I.

TABLE V: THE OVERALL PERFORMANCE OF DIFFERENT DATA SELECTION METHODS ON MODEL TRAINING SPEEDUP AND INFERENCE ACCURACY ("−" = FAILS TO REACH THE TARGET ACCURACY).
Model Training Speedup (×):
Task | RS | HL | GN | FB | SLD | ODE-Exact | ODE-Est | FD
ST | 1.0 | − | 4.87 | − | 4.08 | 9.52 | 5.88 | 2.67
IC | 1.0 | − | − | − | − | 1.35 | 1.20 | 1.01
HAR | 1.0 | − | − | − | − | 2.22 | 1.55 | 4.76
TC | 1.0 | − | − | − | − | 2.51 | 2.50 | 3.92
Inference Accuracy (%):
Task | RS | HL | GN | FB | SLD | ODE-Exact | ODE-Est | FD
ST | 79.56 | 78.44 | 83.28 | 78.56 | 82.38 | 87.12 | 82.80 | 88.14
IC | 71.31 | 51.95 | 41.45 | 60.43 | 69.15 | 72.71 | 72.70 | 71.37
HAR | 67.25 | 48.16 | 51.02 | 48.33 | 56.24 | 73.63 | 70.39 | 77.54
TC | 89.30 | 69.00 | 69.3 | 72.19 | 72.30 | 95.30 | 95.30 | 96.00

Simplified ODE reduces computation and memory costs significantly with marginal performance degradation. We conduct two further experiments that consider only the last 1 and 2 layers (out of 5 layers in total) for data valuation on the industrial TC dataset. Empirical results in Figure 7 demonstrate that the simplified version reduces up to 44% of the memory footprint and 83% of the time delay, at the cost of only 0.1× speedup and 1% accuracy degradation.

Fig. 7. Performance and cost of simplified ODE with different numbers of model layers for gradient computation. Panels: (a) Inference Accuracy, (b) Training Speedup, (c) Memory Footprint, (d) Evaluation Latency.

ODE introduces small extra memory footprint and data processing delay during the data selection process. Empirical results in Table VI demonstrate that simplified ODE brings only a tiny evaluation delay and memory burden to mobile devices (1.23 ms and 14.47 MB for the industrial TC task), and thus can be applied in practical mobile network scenarios.

TABLE VI: THE MEMORY FOOTPRINT AND EVALUATION DELAY PER SAMPLE VALUATION OF BASELINES ON THREE REAL-WORLD TASKS.
Memory Footprint (MB):
Task | RS | HL | GN | ODE-Est | ODE-Sim
IC | 1.70 | 11.91 | 16.89 | 18.27 | 16.92
HAR | 1.92 | 7.27 | 12.23 | 13.46 | 12.38
TC | 0.75 | 10.58 | 19.65 | 25.15 | 14.47
Evaluation Time (ms):
Task | RS | HL | GN | ODE-Est | ODE-Sim
IC | 0.05 | 11.1 | 0.36 | 1.03 | 11.4
HAR | 0.05 | 21.1 | 1.04 | 9.06 | 0.53
TC | 0.05 | 22.8 | 1.93 | 9.69 | 1.23
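The simplification measured in Figure 7 restricts per-sample gradient computation to the last layers of the model. The PyTorch sketch below shows the idea; the model is a stand-in, not the actual 5-layer TC network, and ODE's data valuation metric (defined earlier in the paper) is what consumes such gradients.

import torch
import torch.nn as nn

model = nn.Sequential(  # stand-in for the 5-layer TC model
    nn.Linear(64, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 10),
)
loss_fn = nn.CrossEntropyLoss()

def truncated_sample_gradient(x, y, num_layers=2):
    # Per-sample gradient restricted to the last `num_layers` Linear
    # layers: far fewer parameter gradients are materialized, cutting
    # the memory and delay of each sample valuation.
    linear_layers = [m for m in model if isinstance(m, nn.Linear)]
    params = [p for m in linear_layers[-num_layers:] for p in m.parameters()]
    loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.flatten() for g in grads])

g = truncated_sample_gradient(torch.randn(64), torch.tensor(3))

With num_layers set to 1 or 2 out of 5, this corresponds to the configurations whose savings are reported in Figure 7.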
D. Robustness of ODE

In this subsection, we compare the robustness of ODE and previous methods to various factors in real-world environments on the industrial traffic classification dataset: the number of local training epochs m, the client participation rate |C^t|/|C|, the storage capacity |B_c|, the mini-batch size, and the data heterogeneity across clients.

Number of Local Training Epochs. Empirical results in Figures 8 and 9 demonstrate that (1) ODE works well with various numbers of local training epochs m. As m increases from 2 to 10, both ODE-Exact and ODE-Est achieve higher training speedup and final inference accuracy than existing methods under the same setting (e.g., 8.1× → 10× training speedup and 3.5% → 5.1% accuracy improvement on ST), and thus bring larger improvement in user experience; (2) ODE-Est performs similarly to ODE-Exact in improving model performance. The first phenomenon coincides with the analysis in Section III-A: (1) for convergence rate, the one-step-look-ahead strategy of ODE optimizes only the loss reduction of the first local training epoch in (3), so its influence on model convergence weakens as the number of local epochs grows; (2) for inference accuracy, ODE narrows the gap between the models trained through FL and CL by optimizing the dominant term with the maximum weight, i.e., (1 + ηL)^{m−1} G_c(w_c^{t,0}), of the gap bound in (13), leading to better final inference accuracy for larger m.

Fig. 8. The training process of different data selection methods with various numbers of local training epochs on the industrial traffic classification dataset. Panels: (a) Epoch=2, (b) Epoch=10.

Fig. 9. Robustness of ODE and baselines to the local epoch number on the three public datasets. Panels: (a) Synthetic Dataset, (b) Image Classification, (c) Human Activity Recognition.

Participation Rate. The results in Figure 10 demonstrate that ODE improves the FL process significantly even with a small client participation rate, accelerating model training by 2.57× and increasing inference accuracy by 6.6% on TC, 9× and 8% on ST, 2× and 3% on IC, and 2× and 6% on HAR. This demonstrates the practicality of ODE in real-world environments, where only a small proportion of mobile devices may participate in each FL round. We also notice that in some cases (IC and HAR), the effectiveness of ODE in speeding up convergence is weakened under a high client participation rate, because increasing this rate reduces the negative impact of cross-client data heterogeneity on global model convergence, a benefit that ODE already provides through data selection.

Fig. 10. The training speedup and final accuracy of different data selection methods with various participation rates on four tasks. Panels: (a) Synthetic Task, (b) Image Classification, (c) Human Activity Recognition, (d) Traffic Classification.

Storage Capacity. We also conduct experiments to investigate how on-device storage impacts ODE on the industrial dataset. For each of four data selection methods (RS, ODE-Exact, ODE-Est, FullData), we vary the on-device storage capacity from 5 to 20 samples, and plot the training time as well as the corresponding final accuracy in Figure 11. (We compare with RS because it outperforms importance sampling and previous data selection methods in general cases, and here we mainly aim to investigate the robustness of ODE.) The training time is normalized by the time-to-accuracy of FullData to exhibit the variation of training time more clearly. Empirical results indicate that (1) the performance of RS (both convergence rate and inference accuracy) degrades rapidly as the storage capacity decreases; (2) ODE performs stably and well, nearly independently of the storage capacity, and is thus robust to the diverse storage capacities of heterogeneous mobile devices in practice; (3) compared with ODE, RS needs more than twice the storage space to achieve the same model performance.

Fig. 11. Robustness of ODE to storage capacity on the industrial dataset. Panels: (a) Training Time, (b) Inference Accuracy.
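Since RS is the reference point throughout, it is worth recalling how uniform random selection over a stream under a fixed budget is typically realized: reservoir sampling [35]. A textbook sketch, under the assumption that RS follows this scheme:

import random

def reservoir(stream, capacity, seed=0):
    """Keep a uniformly random subset of `capacity` samples from a
    stream of unknown length (Vitter's Algorithm R [35])."""
    rng = random.Random(seed)
    buffer = []
    for t, sample in enumerate(stream):
        if len(buffer) < capacity:
            buffer.append(sample)
        else:
            j = rng.randint(0, t)      # inclusive bounds
            if j < capacity:           # replace with probability capacity/(t+1)
                buffer[j] = sample
            # otherwise the arriving sample is discarded
    return buffer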
Mini-batch Size. As mini-batching is widely used in FL model training to reduce training time, we conduct experiments with different mini-batch sizes on the industrial TC dataset, adjusting the learning rate from 0.005 to 0.001 and the number of local epochs from 5 to 2 to reduce the instability caused by multiple local model updates. The performance of ODE-Est and RS in Table VII demonstrates that (1) ODE improves the FL training process under various mini-batch sizes, increasing model accuracy by 2.6% ∼ 3% and speeding up model convergence by up to 2.42×; (2) the smaller the batch size, the larger the training speedup and accuracy improvement obtained by ODE, which coincides with the results for different numbers of local epochs.

TABLE VII: ROBUSTNESS OF ODE TO DIFFERENT MINI-BATCH SIZES ON THE INDUSTRIAL TRAFFIC CLASSIFICATION DATASET.
Batch Size | Speedup of ODE-Est | Final Accuracy (RS) | Final Accuracy (ODE-Est)
1 | 2.42× | 92.0% | 95.0%
2 | 2.07× | 92.4% | 95.3%
5 | 1.97× | 92.9% | 95.5%

Data Heterogeneity. Data heterogeneity (non-IID data) among clients has always been an obstacle to accelerating convergence and improving accuracy in FL model training. Experimental results in Table VIII show that ODE achieves a 3.6% ∼ 5.6% increase in inference accuracy and a 1.56× ∼ 2.2× speedup in training time across different heterogeneous settings. Although the performance of ODE appears to degrade in the extremely heterogeneous case (#Label=1), the degradation is caused by the small number of local data labels, which reduces the effectiveness of the cross-client data selection component of ODE; this case rarely arises in practice, as mobile devices usually hold data from multiple labels (e.g., applications).

TABLE VIII: ROBUSTNESS TO DIFFERENT DATA HETEROGENEITY.
#Label | Speedup of ODE-Est | Final Accuracy (RS) | Final Accuracy (ODE-Est)
1 | 1.56× | 81.8% | 85.4%
5 | 2.19× | 85.5% | 90.1%
15 | 2.07× | 89.0% | 94.6%

E. Component-wise Analysis

In this subsection, we evaluate the effectiveness of each of the three components of ODE: on-device data selection, the global gradient estimator, and the cross-client coordination policy. The results altogether show that each component is critical to the good performance of ODE.

On-Client Data Selection. To show the effect of on-client data selection, we compare ODE with a baseline, namely Valuation-, which replaces the on-device data selection policy with random sampling but still allows the central server to coordinate the clients. The results in Figure 12 demonstrate that without the on-device data valuation module, Valuation- performs slightly better than RS but much worse than ODE-Est, which underlines the significant role of the proposed data valuation metric and data selection policy; a simplified sketch of such a value-aware selection policy is given below.
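Conceptually, the on-device component can be viewed as a value-aware replacement policy over the storage buffer, in contrast to Valuation-'s random replacement. The following sketch is ours and simplified; value_fn stands in for ODE's actual data valuation metric.

import heapq

class ValueAwareBuffer:
    """Keep the `capacity` highest-valued samples seen so far, using a
    min-heap so each arriving sample is handled in O(log capacity)."""
    def __init__(self, capacity, value_fn):
        self.capacity = capacity
        self.value_fn = value_fn   # placeholder for ODE's valuation metric
        self.heap = []             # entries: (value, counter, sample)
        self._n = 0

    def offer(self, sample):
        item = (self.value_fn(sample), self._n, sample)  # counter breaks ties
        self._n += 1
        if len(self.heap) < self.capacity:
            heapq.heappush(self.heap, item)
        elif item[0] > self.heap[0][0]:  # beats the worst stored sample
            heapq.heapreplace(self.heap, item)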
Fig. 12. Component-wise analysis of ODE on the industrial traffic classification dataset.

Global Gradient Estimator. To show the effectiveness of the global gradient estimator, we compare ODE-Est with a naive estimation method, namely Estimator-, in which (1) the server utilizes only the local gradient estimators of the participating clients instead of all clients, and (2) each participant leverages only the samples stored locally, instead of all generated data samples, to compute its local gradient. Figure 12 shows that this naive estimation method has the poorest performance, as partial client participation and biased local gradients lead to highly inaccurate estimates of the global gradient, which further misleads clients into selecting data samples with an incorrect data valuation metric.

Cross-Client Coordination. We conduct another experiment, namely Coor-, in which clients select and store data samples based only on their local information, without the coordination of the server. Figure 12 shows that the performance of ODE is largely weakened in this situation, as clients tend to store similar, overlapping valuable data samples, leaving the remaining data under-represented.
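The estimator's full specification appears earlier in the paper; the ablation above isolates its two ingredients, namely aggregation over all clients rather than only the current participants, and per-client estimates built from every generated sample rather than only stored ones. A schematic of the server-side aggregation step that Estimator- breaks, with all names and the data-size weighting being our assumptions:

import numpy as np

def aggregate_global_estimate(local_estimates, data_sizes):
    """Server side: combine the latest local gradient estimates of ALL
    clients, weighted by their data volumes. Estimator- instead averages
    only the participants' stored-data gradients, which is what biases
    its estimate of the global gradient."""
    total = sum(data_sizes.values())
    dim = len(next(iter(local_estimates.values())))
    g = np.zeros(dim)
    for cid, est in local_estimates.items():
        g += (data_sizes[cid] / total) * np.asarray(est, dtype=float)
    return g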
V. RELATED WORKS

Federated Learning is a distributed learning framework that aims to collaboratively learn a global model over networked devices' data under the constraint that the data is stored and processed locally [21], [16]. Existing works mostly focus on improving FL from the perspectives of communication and computation cost. Previous works on communication cost reduction mainly leveraged model quantization [65], sparsification [66], [67] and pruning [14] to decrease the number of parameters transmitted between clients and the server, or considered hierarchical federated learning to optimize the network structure [68], [69], [70] and improve communication efficiency. The existing literature improved the computation efficiency of FL [71] at different algorithmic levels, such as improving the algorithms for local model update [72], [36], [73] and global model aggregation [45], [44], [74], [75], [38], and optimizing client selection to mitigate the heterogeneity among clients [76], [77], [78]. However, most of these works assume a static training dataset on each device and overlook the device properties of limited on-device storage and streaming networked data, which are common for devices in practical mobile networks and have severe impact on the success of FL, as demonstrated in our experiments.

Data Selection. In FL, selecting data from streaming data can be seen as sampling batches of data from the data distribution, which is similar to mini-batch SGD [79]. To improve the training process of SGD, existing methods quantify the importance of each data sample (such as loss [61], [62], gradient norm [63], [29], uncertainty [80], [81], Data Shapley [28] and representativeness [82], [83]) and leverage importance sampling or a priority queue to select training samples. Previous works [25], [26] on data selection in FL simply applied these data selection methods on each client individually for local model training, without considering the impact on the global model. Further, all of them require either access to all the data or multiple passes over the streaming data, neither of which is feasible in mobile network scenarios.

Traffic Classification is an important task in mobile networks, which associates traffic packets with specific applications for downstream network management tasks such as user experience optimization, capacity planning and resource provisioning [84]. Techniques for traffic classification fall into three categories. Port-based approaches [85] utilize the association between ports in the TCP or UDP header and the well-known port numbers assigned by the IANA. Payload-based approaches [86] identify applications by inspecting packet headers or payload, which incurs very high computational cost and cannot handle encrypted traffic [87]. Statistics-based approaches leverage payload-independent features such as packet length, inter-arrival time and flow direction to circumvent payload encryption and protect user privacy.

VI. CONCLUSION

In this work, we identify two key properties of FL in mobile networks, limited on-device storage and streaming networked data, which have not been fully explored in the literature. We then present the design, implementation and evaluation of ODE, an online data selection framework for FL with limited on-device storage, consisting of two components: on-device data evaluation and cross-device collaborative data storage. Our analysis shows that ODE improves the convergence rate and the inference accuracy of the global model simultaneously. Empirical results on three public datasets and one industrial dataset demonstrate that ODE significantly outperforms state-of-the-art data selection methods in terms of training time, final accuracy, and robustness to various factors in industrial environments. ODE is a significant step towards applying FL to network scenarios with limited storage.

REFERENCES

[1] M. Giordani, M. Polese, M. Mezzavilla, S. Rangan, and M. Zorzi, "Toward 6G networks: Use cases and technologies," IEEE Communications Magazine, vol. 58, no. 3, pp. 55–61, 2020.
[2] D. Bega, M. Gramaglia, M. Fiore, A. Banchs, and X. Costa-Perez, "DeepCog: Optimizing resource provisioning in network slicing with AI-based capacity forecasting," IEEE Journal on Selected Areas in Communications, vol. 38, no. 2, pp. 361–376, 2019.
[3] M. H. Bhuyan, D. K. Bhattacharyya, and J. K. Kalita, "Network anomaly detection: methods, systems and tools," IEEE Communications Surveys & Tutorials, vol. 16, no. 1, pp. 303–336, 2013.
[4] R. L. Cruz, "Quality of service guarantees in virtual circuit switched networks," IEEE Journal on Selected Areas in Communications, vol. 13, no. 6, pp. 1048–1056, 1995.
[5] S. S. Kolahi, S. Narayan, D. D. Nguyen, and Y. Sunarto, "Performance monitoring of various network traffic generators," in International Conference on Computer Modelling and Simulation (UkSim), 2011.
[6] D. S. Nunes, P. Zhang, and J. S. Silva, "A survey on human-in-the-loop applications towards an internet of all," IEEE Communications Surveys & Tutorials, vol. 17, no. 2, pp. 944–965, 2015.
[7] M. A. Siddiqi, H. Yu, and J. Joung, "5G ultra-reliable low-latency communication implementation challenges and operational issues with IoT devices," Electronics, vol. 8, no. 9, p. 981, 2019.
[8] P. K. Aggarwal, P. Jain, J. Mehta, R. Garg, K. Makar, and P. Chaudhary, "Machine learning, data mining, and big data analytics for 5G-enabled IoT," in Blockchain for 5G-Enabled IoT, 2021, pp. 351–375.
[9] A. Banchs, M. Fiore, A. Garcia-Saavedra, and M. Gramaglia, "Network intelligence in 6G: challenges and opportunities," in ACM Workshop on Mobility in the Evolving Internet Architecture (MobiArch), 2021.
[10] D. Rafique and L. Velasco, "Machine learning for network automation: overview, architecture, and applications," Journal of Optical Communications and Networking, vol. 10, no. 10, pp. D126–D143, 2018.
[11] S. Ayoubi, N. Limam, M. A. Salahuddin, N. Shahriar, R. Boutaba, F. Estrada-Solano, and O. M. Caicedo, "Machine learning for cognitive network management," IEEE Communications Magazine, vol. 56, no. 1, pp. 158–165, 2018.
[12] Y. Deng, "Deep learning on mobile devices: a review," in Mobile Multimedia/Image Processing, Security, and Applications, 2019.
[13] X. Zeng, M. Yan, and M. Zhang, "Mercury: Efficient on-device distributed DNN training via stochastic importance sampling," in ACM Conference on Embedded Networked Sensor Systems (SenSys), 2021.
[14] A. Li, J. Sun, P. Li, Y. Pu, H. Li, and Y. Chen, "Hermes: An efficient federated learning framework for heterogeneous mobile clients," in International Conference on Mobile Computing and Networking (MobiCom), 2021.
[15] F. Mireshghallah, M. Taram, A. Jalali, A. T. T. Elthakeb, D. Tullsen, and H. Esmaeilzadeh, "Not all features are equal: Discovering essential features for preserving prediction privacy," in The ACM Web Conference (WWW), 2021.
[16] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, "Communication-efficient learning of deep networks from decentralized data," in International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.
[17] UCSC, "Packet buffers," 2020. [Online]. Available: https://people.ucsc.edu/~warner/buffer.html
[18] G. Fox, "60 incredible facebook statistics," https://www.garyfox.co/facebook-statistics/, 2018.
[19] A. Cloud, "Alibaba cloud supported 583,000 orders/second for 2020 double 11 - the highest traffic peak in the world!" https://www.alibabacloud.com/blog/alibaba-cloud-supported-583000-orderssecond-for-2020-double-11---the-highest-traffic-peak-in-the-world_596884, 2020.
[20] S. P. Karimireddy, S. Kale, M. Mohri, S. Reddi, S. Stich, and A. T. Suresh, "SCAFFOLD: Stochastic controlled averaging for federated learning," in International Conference on Machine Learning (ICML), 2020.
[21] T. Li, A. K. Sahu, A. Talwalkar, and V. Smith, "Federated learning: Challenges, methods, and future directions," IEEE Signal Processing Magazine, vol. 37, no. 3, pp. 50–60, 2020.
[22] I. Akbari, M. A. Salahuddin, L. Ven, N. Limam, R. Boutaba, B. Mathieu, S. Moteau, and S. Tuffin, "A look behind the curtain: traffic classification in an increasingly encrypted web," Proceedings of the ACM on Measurement and Analysis of Computing Systems, vol. 5, no. 1, pp. 1–26, 2021.
[23] H. Tahaei, F. Afifi, A. Asemi, F. Zaki, and N. B. Anuar, "The rise of traffic classification in IoT networks: A survey," Journal of Network and Computer Applications, vol. 154, p. 102538, 2020.
[24] D. Arora, K. F. Li, and A. Loffler, "Big data analytics for classification of network enabled devices," in International Conference on Advanced Information Networking and Applications Workshops (WAINA), 2016.
[25] J. Shin, Y. Li, Y. Liu, and S. Lee, "FedBalancer: Data and pace control for efficient federated learning on heterogeneous clients," in ACM International Conference on Mobile Systems, Applications, and Services (MobiSys), 2022, pp. 436–449.
[26] A. Li, L. Zhang, J. Tan, Y. Qin, J. Wang, and X.-Y. Li, "Sample-level data selection for federated learning," in IEEE International Conference on Computer Communications (INFOCOM), 2021.
[27] R. D. Cook, "Detection of influential observation in linear regression," Technometrics, vol. 19, no. 1, pp. 15–18, 1977.
[28] A. Ghorbani and J. Zou, "Data Shapley: Equitable valuation of data for machine learning," in International Conference on Machine Learning (ICML), 2019.
[29] P. Zhao and T. Zhang, "Stochastic optimization with importance sampling for regularized loss minimization," in International Conference on Machine Learning (ICML), 2015.
[30] N. T. Hu, X. Hu, R. Liu, S. Hooker, and J. Yosinski, "When does loss-based prioritization fail?" ICML 2021 Workshop on Subset Selection in ML, 2021.
[31] S. Caldas, S. M. K. Duddu, P. Wu, T. Li, J. Konečnỳ, H. B. McMahan, V. Smith, and A. Talwalkar, "LEAF: A benchmark for federated settings," arXiv preprint arXiv:1812.01097, 2018.
[32] H. Xiao, K. Rasul, and R. Vollgraf, "Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms," 2017.
[33] X. Ouyang, Z. Xie, J. Zhou, J. Huang, and G. Xing, "ClusterFL: A similarity-aware federated learning system for human activity recognition," in ACM International Conference on Mobile Systems, Applications, and Services (MobiSys), 2021.
[34] C. Li, D. Niu, B. Jiang, X. Zuo, and J. Yang, "Meta-HAR: Federated representation learning for human activity recognition," in The ACM Web Conference (WWW), 2021.
[35] J. S. Vitter, "Random sampling with a reservoir," ACM Transactions on Mathematical Software, vol. 11, no. 1, pp. 37–57, 1985.
[36] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith, "Federated optimization in heterogeneous networks," in Proceedings of Machine Learning and Systems (MLSys), 2018.
[37] J. Hamer, M. Mohri, and A. T. Suresh, "FedBoost: A communication-efficient algorithm for federated learning," in International Conference on Machine Learning (ICML), 2020.
[38] J. Wang, Q. Liu, H. Liang, G. Joshi, and H. V. Poor, "Tackling the objective inconsistency problem in heterogeneous federated optimization," in Advances in Neural Information Processing Systems (NeurIPS), 2020.
[39] H. T. Nguyen, V. Sehwag, S. Hosseinalipour, C. G. Brinton, M. Chiang, and H. V. Poor, "Fast-convergent federated learning," IEEE Journal on Selected Areas in Communications, vol. 39, no. 1, pp. 201–218, 2020.
[40] Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra, "Federated learning with non-IID data," arXiv preprint arXiv:1806.00582, 2018.
[41] R. Branzei, D. Dimitrov, and S. Tijs, Models in Cooperative Game Theory. Springer Science & Business Media, 2008, vol. 556.
[42] P. W. Koh and P. Liang, "Understanding black-box predictions via influence functions," in International Conference on Machine Learning (ICML), 2017.
[43] L. Shapley, "Quota solutions of n-person games," edited by Emil Artin and Marston Morse, p. 343, 1953.
[44] S. Zawad, A. Ali, P.-Y. Chen, A. Anwar, Y. Zhou, N. Baracaldo, Y. Tian, and F. Yan, "Curse or redemption? How data heterogeneity affects the robustness of federated learning," in The AAAI Conference on Artificial Intelligence (AAAI), 2021.
[45] Z. Chai, H. Fayyaz, Z. Fayyaz, A. Anwar, Y. Zhou, N. Baracaldo, H. Ludwig, and Y. Cheng, "Towards taming the resource and data heterogeneity in federated learning," in USENIX Conference on Operational Machine Learning (OpML), 2019.
[46] C. Yang, Q. Wang, M. Xu, Z. Chen, K. Bian, Y. Liu, and X. Liu, "Characterizing impacts of heterogeneity in federated learning upon large-scale smartphone data," in The ACM Web Conference (WWW), 2021.
[47] D. Jhunjhunwala, P. Sharma, A. Nagarkatti, and G. Joshi, "FedVARP: Tackling the variance due to partial client participation in federated learning," in Conference on Uncertainty in Artificial Intelligence (UAI), 2022.
[48] K. Wei, J. Li, M. Ding, C. Ma, H. H. Yang, F. Farokhi, S. Jin, T. Q. Quek, and H. V. Poor, "Federated learning with differential privacy: Algorithms and performance analysis," IEEE Transactions on Information Forensics and Security, vol. 15, pp. 3454–3469, 2020.
[49] C. Dwork, "Differential privacy: A survey of results," in International Conference on Theory and Applications of Models of Computation (ICTAMC), 2008.
[50] K. Krishna and M. N. Murty, "Genetic K-means algorithm," IEEE Transactions on Systems, Man, and Cybernetics, vol. 29, no. 3, pp. 433–439, 1999.
[51] F. Murtagh and P. Contreras, "Algorithms for hierarchical clustering: an overview," Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 2, no. 1, pp. 86–97, 2012.
[52] B. K. Natarajan, "Sparse approximate solutions to linear systems," SIAM Journal on Computing, vol. 24, no. 2, pp. 227–234, 1995.
[53] D. Ge, X. Jiang, and Y. Ye, "A note on the complexity of lp minimization," Mathematical Programming, vol. 129, no. 2, pp. 285–299, 2011.
[54] M. Schmidt, G. Fung, and R. Rosales, "Fast optimization methods for l1 regularization: A comparative study and two new approaches," in European Conference on Machine Learning (ECML), 2007.
[55] J. Shi, W. Yin, S. Osher, and P. Sajda, "A fast hybrid algorithm for large-scale l1-regularized logistic regression," The Journal of Machine Learning Research, vol. 11, pp. 713–741, 2010.
[56] A. Acar, H. Aksu, A. S. Uluagac, and M. Conti, "A survey on homomorphic encryption schemes: Theory and implementation," ACM Computing Surveys, vol. 51, no. 4, pp. 1–35, 2018.
[57] S. Hong, S. Kim, J. Choi, Y. Lee, and J. H. Cheon, "Efficient sorting of homomorphic encrypted data with k-way sorting network," IEEE Transactions on Information Forensics and Security, vol. 16, pp. 4389–4404, 2021.
[58] Y. LeCun, B. Boser, J. Denker, D. Henderson, R. Howard, W. Hubbard, and L. Jackel, "Handwritten digit recognition with a back-propagation network," in Conference on Neural Information Processing Systems (NeurIPS), vol. 2, 1989.
[59] C. Li, X. Zeng, M. Zhang, and Z. Cao, "PyramidFL: A fine-grained client selection framework for efficient federated learning," in International Conference on Mobile Computing and Networking (MobiCom), 2022.
[60] I. Loshchilov and F. Hutter, "Online batch selection for faster training of neural networks," arXiv preprint arXiv:1511.06343, 2015.
[61] A. Shrivastava, A. Gupta, and R. Girshick, "Training region-based object detectors with online hard example mining," in The IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), 2016.
[62] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, "Prioritized experience replay," in International Conference on Learning Representations (ICLR), 2015.
[63] T. B. Johnson and C. Guestrin, "Training deep models faster with robust, approximate importance sampling," in Conference on Neural Information Processing Systems (NeurIPS), 2018.
[64] A. Katharopoulos and F. Fleuret, "Not all samples are created equal: Deep learning with importance sampling," in International Conference on Machine Learning (ICML), 2018.
[65] F. Haddadpour, M. M. Kamani, A. Mokhtari, and M. Mahdavi, "Federated learning with compression: Unified analysis and sharp guarantees," in International Conference on Artificial Intelligence and Statistics (AISTATS), 2021, pp. 2350–2358.
[66] A. Li, J. Sun, X. Zeng, M. Zhang, H. Li, and Y. Chen, "FedMask: Joint computation and communication-efficient personalized federated learning via heterogeneous masking," in ACM Conference on Embedded Networked Sensor Systems (SenSys), 2021, pp. 42–55.
[67] P. Han, S. Wang, and K. K. Leung, "Adaptive gradient sparsification for efficient federated learning: An online learning approach," in IEEE International Conference on Distributed Computing Systems (ICDCS), 2020, pp. 300–310.
[68] J. Wu, Q. Liu, Z. Huang, Y. Ning, H. Wang, E. Chen, J. Yi, and B. Zhou, "Hierarchical personalized federated learning for user modeling," in The ACM Web Conference (WWW), 2021.
[69] S. Hosseinalipour, S. S. Azam, C. G. Brinton, N. Michelusi, V. Aggarwal, D. J. Love, and H. Dai, "Multi-stage hybrid federated learning over large-scale D2D-enabled fog networks," IEEE/ACM Transactions on Networking, vol. 30, no. 4, pp. 1569–1584, 2022.
[70] X. Liu, Z. Zhong, Y. Zhou, D. Wu, X. Chen, M. Chen, and Q. Z. Sheng, "Accelerating federated learning via parallel servers: A theoretically guaranteed approach," IEEE/ACM Transactions on Networking, vol. 30, no. 5, pp. 2201–2215, 2022.
[71] Y. Zhang, X. Lan, J. Ren, and L. Cai, "Efficient computing resource sharing for mobile edge-cloud computing networks," IEEE/ACM Transactions on Networking, vol. 28, no. 3, pp. 1227–1240, 2020.
[72] C. T. Dinh, N. H. Tran, M. N. H. Nguyen, C. S. Hong, W. Bao, A. Y. Zomaya, and V. Gramoli, "Federated learning over wireless networks: Convergence analysis and resource allocation," IEEE/ACM Transactions on Networking, vol. 29, no. 1, pp. 398–409, 2021.
[73] A. Fallah, A. Mokhtari, and A. Ozdaglar, "Personalized federated learning with theoretical guarantees: A model-agnostic meta-learning approach," 2020.
[74] Y. J. Cho, A. Manoel, G. Joshi, R. Sim, and D. Dimitriadis, "Heterogeneous ensemble knowledge transfer for training large models in federated learning," in International Joint Conference on Artificial Intelligence (IJCAI), 2022.
[75] J. Wolfrath, N. Sreekumar, D. Kumar, Y. Wang, and A. Chandra, "HACCS: Heterogeneity-aware clustered client selection for accelerated federated learning," in IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2022.
[76] Y. J. Cho, J. Wang, and G. Joshi, "Towards understanding biased client selection in federated learning," in International Conference on Artificial Intelligence and Statistics (AISTATS), 2022.
[77] T. Nishio and R. Yonetani, "Client selection for federated learning with heterogeneous resources in mobile edge," in IEEE International Conference on Communications (ICC), 2019.
[78] F. Li, J. Liu, and B. Ji, "Federated learning with fair worker selection: A multi-round submodular maximization approach," in IEEE International Conference on Mobile Ad-Hoc and Smart Systems (MASS), 2021, pp. 180–188.
[79] B. E. Woodworth, K. K. Patel, and N. Srebro, "Minibatch vs local SGD for heterogeneous distributed learning," in Advances in Neural Information Processing Systems (NeurIPS), 2020.
[80] H.-S. Chang, E. Learned-Miller, and A. McCallum, "Active bias: Training more accurate neural networks by emphasizing high variance samples," in Conference on Neural Information Processing Systems (NeurIPS), 2017.
[81] C.-Y. Wu, R. Manmatha, A. J. Smola, and P. Krahenbuhl, "Sampling matters in deep embedding learning," in The IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), 2017.
[82] B. Mirzasoleiman, J. Bilmes, and J. Leskovec, "Coresets for data-efficient training of machine learning models," in International Conference on Machine Learning (ICML), 2020.
[83] Y. Wang, F. Fabbri, and M. Mathioudakis, "Fair and representative subset selection from data streams," in The ACM Web Conference (WWW), 2021.
[84] I. Akbari, M. A. Salahuddin, L. Ven, N. Limam, R. Boutaba, B. Mathieu, S. Moteau, and S. Tuffin, "A look behind the curtain: Traffic classification in an increasingly encrypted web," in Proceedings of the ACM on Measurement and Analysis of Computing Systems (SIGMETRICS), 2021, pp. 04:1–04:26.
[85] T. Karagiannis, A. Broido, M. Faloutsos, and K. Claffy, "Transport layer identification of P2P traffic," in ACM Internet Measurement Conference (IMC), 2004.
[86] Y. Wang, Y. Xiang, and S.-Z. Yu, "Automatic application signature construction from unknown traffic," in IEEE International Conference on Advanced Information Networking and Applications (AINA), 2010.
[87] M. Finsterbusch, C. Richter, E. Rocha, J.-A. Muller, and K. Hanssgen, "A survey of payload-based traffic classification approaches," IEEE Communications Surveys & Tutorials, vol. 16, no. 2, pp. 1135–1156, 2013.