NOMS 2023-2023 IEEE/IFIP Network Operations and Management Symposium | 978-1-6654-7716-1/23/$31.00 ©2023 IEEE | DOI: 10.1109/NOMS56928.2023.10154300
6th International Workshop on Intelligent Transportation and Autonomous Vehicles Technologies (ITAVT 2023) - Workshop of NOMS 2023

Federated Learning Aided Deep Convolutional Neural Network Solution for Smart Traffic Management

Guanxiong Liu¹, Student Member, IEEE, Nicholas Furth¹, Hang Shi¹, Abdallah Khreishah¹, Senior Member, IEEE, Jo Young Lee¹, Nirwan Ansari¹, Fellow, IEEE, Chengjun Liu¹, and Yaser Jararweh²
¹ New Jersey Institute of Technology, Newark, NJ, US
² Jordan University of Science and Technology, Irbid, Jordan
{gl236, nf77, hs328, abdallah, jo.y.lee, nirwan.ansari, cliu}@njit.edu, yijararweh@just.edu.jo

Abstract—Machine learning models, especially neural network (NN) classifiers, have shown tremendous potential for complex tasks such as image classification, object detection and video analytics. However, several problems remain before they can be adopted in real-world applications. One of these problems is that training machine learning models, especially NN models, requires considerable computation and data processing. Other problems are the limited bandwidth of the network and the risk of exposing users' private data to attacks if the training data (especially video) is transferred through the network. To mitigate these problems, researchers recently proposed the concept of federated learning. In this paper, we build a video analytic application for traffic management and train it using federated learning. More specifically, each traffic surveillance camera, combined with its co-located small PC, is treated as a worker node in federated learning. In this way, the NN model in each node can be trained on data collected from all nodes without transmitting the data to, or sharing it with, a central server, which resolves all of the above-mentioned problems.
The performance of the trained NN model is evaluated through experiments on different open-source datasets, which demonstrate that the proposed work has the potential to improve the detection accuracy (mAP) by over 40%.

Index Terms—Machine Learning, Neural Network, Traffic Video Analytic

I. INTRODUCTION

Due to their surprisingly good representation power of complex distributions, neural network (NN) models have been shown to be the most successful solutions for many complex tasks. For example, recent NN classifiers have outperformed other methods in image classification, object detection and face recognition tasks [1]. However, before they can be applied in real-world applications, there are still problems that need to be resolved. For example, the training of an NN classifier, especially one designed with a large number of layers for better performance, requires considerable computational power and data with enough diversity. However, in many real-world applications, such as traffic management systems, neither the computational power nor the data is centralized. In such applications, training an NN classifier requires a huge amount of training data to be transmitted through the network, which could saturate the network and make the transmitted data vulnerable to privacy attacks.

To solve these problems, federated learning has been proposed [2]. In contrast to the centralized training architecture, federated learning trains the NN classifier at each node on a small local portion of the data, which reduces the computation required. Since this alone would lead to a biased and sub-optimal model, federated learning further aggregates the trained models in a central node to mitigate the bias. Besides saving computational power, federated learning also resolves the data privacy issue by transmitting only the model parameters, which also significantly reduces the traffic load on the connections between nodes when videos are used as the training data. Federated learning is also well suited to the edge computing architecture, which is widely adopted in today's Internet-of-things (IoT) [3], [4]. In such a combination, the cloudlet in the edge computing paradigm can take the role of the worker node in federated learning. In this paper, we propose a proof-of-concept architecture that combines federated learning and edge computing for traffic video analytic applications. The contributions of our work are as follows.
• We build a proof-of-concept architecture which combines federated learning with edge computing.
• Based on the proposed architecture, we implement an NN aided traffic video analytic application that can identify cars, buses and pedestrians in the video.
• Through extensive evaluation, we show that the federated model achieves much better overall performance. The average improvement in object detection accuracy (mAP) of the federated model over the single NN model is larger than 10% in all cases. Moreover, when a node has low quality data (e.g. insufficient training data and low data diversity), the improvement can increase to over 40%.

The rest of the work is organized as follows. Section II summarizes important background material. Section III details the design of our proposed architecture, federated training and NN model implementation. Section IV presents the evaluation setting and results. Section V concludes the paper.

II. BACKGROUND

In this section, we review some fundamental topics and provide references for further understanding of the concepts presented throughout this work.

A. Edge Computing

While cloud computing has been the most popular choice for services that require vast amounts of data processing power, the limited network bandwidth is still the bottleneck of such a cloud-centralized design. To achieve better performance, both data processing power and network bandwidth have to be taken into consideration. As a result, the paradigm of edge computing has recently been introduced, which pushes some of the computing resources away from the centralized nodes to the edge of the network. A considerable number of policies and algorithms have been proposed for the edge network architecture. For example, Chiang and Zhang [5] summarize the opportunities and challenges of edge computing in the networking context of IoT and indicate that the fog concept can fill the technology gaps in IoT. Moreover, Zhao et al. [6] propose a cluster content caching structure for cloud radio access networks (C-RANs) to tackle the problems of high power consumption and poor QoS for real-time services caused by significant data exchange in both backhaul and fronthaul links. Besides research on the edge computing architecture, there are also works that focus on edge computing aided applications. As an example, Kiani et al. [4] study the problem of combining edge computing with traffic surveillance systems. In our work, we further enhance the edge-computing based traffic surveillance system by utilizing federated learning, which allows us to perform more advanced tasks.

B. Deep Neural Network

Due to their surprisingly good representation power of complex distributions, deep neural networks (DNNs) have, in recent years, been widely used in many applications. One of the popular use cases is computer vision. For example, the RCNN and YOLO models are considered efficient DNNs that focus on object detection [7], [8].
Despite the tremendous success of DNNs, there are still problems and challenges that need to be resolved in many application scenarios, such as traffic surveillance systems. On one hand, recent research shows that DNNs are vulnerable to attacks such as adversarial examples [9]–[11]. On the other hand, the training and inference of DNNs place strict requirements on computational power and data diversity. Therefore, applying DNNs in such applications requires further investigation to solve these problems. Based on our review, current research on federated learning sheds light on how to resolve some of these challenges [2].

C. Federated Learning

Federated learning is proposed to allow the training of neural networks in a distributed manner. With federated learning, we can train a machine learning model (e.g. a DNN) on multiple local datasets (limited data diversity) held in local nodes (limited computational power) without exchanging individual data samples. The participants of federated learning include a central node and several worker nodes. The worker nodes own their training data and apply updates to the DNN. The central node collects these updates from the worker nodes and aggregates them into the final update [12]. This process can be summarized in the following steps (Figure 1):
• Step-1: The central node initializes the DNN model. The architecture is defined and all weight parameters are properly initialized.
• Step-2: Copies of the initialized DNN model are sent to each worker node.
• Step-3: The worker nodes train the DNN model with their own data for a few epochs. The updates of the weight parameters (compared with the DNN model received from the central node) are calculated.
• Step-4: The worker nodes send the updates back to the central node. The central node aggregates these updates based on a predefined method. Then, the DNN model in the central node is updated.
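The four steps above can be sketched in a few lines of Python. This is a minimal illustration on a toy linear model, not the authors' implementation: the function names, the learning rate, the number of local epochs, and the synthetic data are all assumptions. The sketch also includes two optional helpers that anticipate the enhancements described in the next paragraph (L2-distance outlier filtering and choosing between local and aggregated weights); the paper does not state its outlier threshold, so the median-based cutoff used here is purely hypothetical.

```python
import numpy as np


def local_update(theta0, X, y, lr=0.1, epochs=5):
    """Step-3: train a copy of the global weights locally and return the weight change."""
    theta = theta0.copy()
    for _ in range(epochs):
        grad = 2.0 * X.T @ (X @ theta - y) / len(y)  # gradient of the mean squared error
        theta -= lr * grad
    return theta - theta0


def filter_outliers(deltas, factor=3.0):
    """Drop updates that are far (in L2 distance) from the coordinate-wise median update.

    The median reference point and the `factor` cutoff are assumptions of this sketch.
    """
    stacked = np.stack(deltas)
    center = np.median(stacked, axis=0)
    dist = np.linalg.norm(stacked - center, axis=1)
    cutoff = factor * np.median(dist) + 1e-12
    return [d for d, ok in zip(deltas, dist <= cutoff) if ok]


def federated_round(theta0, datasets):
    """Step-2 to step-4: broadcast, train locally, filter, average, update."""
    deltas = [local_update(theta0, X, y) for X, y in datasets]
    kept = filter_outliers(deltas)
    return theta0 + np.mean(kept, axis=0)


def mse(theta, X, y):
    """Mean squared error of a linear model on one worker's data."""
    return float(np.mean((X @ theta - y) ** 2))


def pick_start_weights(theta_local, theta_global, X, y):
    """Keep whichever weights give the lower loss on the worker's own data."""
    return theta_local if mse(theta_local, X, y) < mse(theta_global, X, y) else theta_global


# Toy setup (hypothetical): four workers hold private samples of one linear relation.
rng = np.random.default_rng(0)
true_theta = np.array([2.0, -1.0])
datasets = []
for _ in range(4):
    X = rng.normal(size=(50, 2))
    datasets.append((X, X @ true_theta))

theta = np.zeros(2)      # Step-1: the central node initializes the model
for _ in range(30):      # repeat step-2 to step-4
    theta = federated_round(theta, datasets)
```

Only the weight changes returned by `local_update` leave a worker; the arrays in `datasets` are never passed to the aggregation code, which mirrors the privacy argument made in the text.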
In federated learning, the updated DNN model in the central node can be sent back to the worker nodes so that step-2 to step-4 are repeated many times. The requirement on computational power is relaxed since each worker node only needs to handle its own training data, which can be lightweight. To meet the data diversity requirement and prevent overfitting, the central node aggregates the updates before applying them to the DNN model. The central node could be the cloud server in the edge computing paradigm, which has enough computational power. In addition to the computational efficiency, federated learning also allows the worker nodes to collaborate in the training process without sharing data. In this way, the worker nodes that participate in federated learning can protect their data privacy [13].

In addition to the process shown in Figure 1, we enhance our algorithm with two key improvements. First, we remove any outlier gradients before aggregation; outliers are detected using the l2 distance, also known as the Euclidean distance. Second, after the aggregation has completed, we compare the loss of each local model on its own data under both the aggregated and the local weights, and select the better of the two before beginning a new training round.

D. Video Analysis on Edge

The works presented in [14]–[17] provide solutions and benchmarks for video analytic and computer vision applications on edge devices. The works presented in [14] and [17] provide frameworks for choosing the best pre-existing solution given hardware and latency constraints; however, their contribution does not include any new solutions for the problems.
Fig. 1: Federated Learning (four panels, each showing the central node and the worker nodes).

The work in [15] provides a framework to evaluate video analytic solutions. In addition to measuring traditional metrics such as the F1-score, this work also considers whether a program depends on specific features within a video, which would make it less generalizable. The work presented in [16] provides a solution for predicting traffic speed and congestion while minimizing computational costs on edge devices. Additionally, other existing works such as [18] and [19] only consider classification for either cars or pedestrians, not both simultaneously. Moreover, the solutions presented in [18], [19] do not generalize well to real-time object detection, and they only consider a centralized training setting, which limits their real-world potential. The work presented in [20] provides an overview of existing techniques for traffic detection, but does not provide any new solutions.

The works presented in [21]–[23] each consider computer vision and video analysis in federated learning (FL) settings. While the works presented in [22], [23] do not deal with real-world applications like traffic analysis, they do present a general solution for implementing large-scale models in FL. The work presented in [21], unlike [22] and [23], focuses on traffic analysis as the practical scenario to demonstrate its solution. Specifically, [21] proposes a multi-layer design of a distributed system that combines edge computing and federated learning for traffic surveillance. However, the focus of that work is a system design that can accelerate prediction, rather than a video analytic algorithm based on edge computing and federated learning.
In other words, [21] does not implement the proposed system and fails to provide any empirical evaluation results.

The approach which we present has several key differences from these aforementioned works. First of all, we propose the joint design of federated learning and a state-of-the-art video analytic method (i.e., YOLOv3). Secondly, in order to achieve cooperation between federated learning and the YOLOv3 model, we enhance the training method, which makes the trained model significantly outperform a YOLOv3 model that is trained on its own. Last but not least, we extensively evaluate our proposed approach on a real-world application, traffic analysis, with real-world datasets, which demonstrates its applicability.

III. SYSTEM DESIGN

As mentioned before, we build a traffic surveillance application with both edge computing and federated learning with DNN. Our system can be broken down into three components: (1) the edge computing based framework, (2) the implementation of federated learning, and (3) the DNN model details.

Fig. 2: Edge Computing based Framework

A. Edge Computing based Framework

In this work, we propose an edge computing based framework that consists of multiple cloudlets (with cameras) and a central cloud, as presented in Figure 2. Here, a cloudlet is formed by a mini computer with a co-located camera at the edge of the network. All cloudlets are connected to the cloud through a backhaul network. Within each cloudlet, an NN model is trained to solve a traffic surveillance related task (e.g. counting pedestrians and vehicles). These cloudlets are deployed in different locations, which means that the recorded videos can be captured under different conditions (e.g. lighting, environment, angle, traffic, etc.). Moreover, these recorded videos are labelled manually or by supporting devices (e.g. a loop vehicle detector). We assume the video data at each cloudlet is private, such that the cloudlets cannot share data with each other.
All the data transmissions happen between the cloudlets and the cloud. In the traditional edge computing framework, the raw video data is transferred from the cloudlet to the cloud for further processing, which tends to cause network congestion and privacy concerns. In our federated learning model, we only transfer the weights trained by the NN to the cloud, which can significantly reduce the data transmission pressure on the network and hide the details of the private video data.

B. Implementation of Federated Learning

On top of the edge computing framework, we train the NN model through cooperation among the distributed cloudlets and the central cloud using a federated learning based method. First, we focus on one of the cloudlets. We assume that $X_i = \{x_{i0}, \ldots, x_{in}\}$ is the set of training examples collected locally by the cloudlet's camera. Based on these examples, the weight parameters, $\theta_i$, of the NN model in the cloudlet can be updated to minimize the corresponding objective function, $L_i(X_i, Y_i, \theta^0)$. Here, $Y_i$ is the label set for $X_i$. When the updated NN model is ready, this cloudlet can calculate its overall change with respect to the weight parameters initialized by the cloud ($\theta^0$) as $\Delta\theta_i = \theta_i - \theta^0$. Since the environment conditions for any specific cloudlet are relatively stable, such a change in weight parameters ($\Delta\theta_i$) could lead to an overfitted model. To mitigate this issue, the overall changes to the weight parameters in the different cloudlets are transmitted to the cloud. It is worth mentioning that transmitting the overall change requires much less network bandwidth than transmitting the raw video ($S(\Delta\theta_i) \ll S(X_i)$, where $S(\cdot)$ denotes the data size). Once these changes arrive at the cloud, a pre-defined aggregation function, $f_{agg}$, is applied to calculate the final update, $\Delta\theta$, to the weight parameters:

$$\Delta\theta = f_{agg}(\Delta\theta_0, \ldots, \Delta\theta_k) \qquad (1)$$

$$\theta^0 \leftarrow \theta^0 + \Delta\theta \qquad (2)$$

In this work, the aggregation function calculates the average of all inputs. The above process is repeated multiple times, as summarized in Algorithm 1.

C. Object Detector Based on YOLOv3

You Only Look Once (YOLO) [24] is a widely known object detection method based on a deep neural network. It can achieve state-of-the-art object detection accuracy in real time. The YOLO method partitions the input frame into multiple grids and predicts bounding boxes and confidence scores. By setting a threshold on the confidence score, we can detect the objects with the highest likelihood in the frame. YOLOv3 [25] is an incremental update of YOLO that achieves a higher detection accuracy. It predicts the object class at three different scales and uses independent logistic classifiers and the binary cross-entropy loss instead of softmax layers in prediction. Therefore, we choose the structure of YOLOv3 as the network model in our federated learning framework. The structure of the neural network we use is shown in Figure 3. The first 52 layers of Darknet-53 are used for feature extraction. By connecting concatenation layers at different scales, the network can detect both small objects far away from the camera and large ones close to the camera.

Fig. 3: The network structure of YOLOv3.
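To make the prediction head described above more concrete, the following sketch shows independent logistic (sigmoid) classifiers trained with a binary cross-entropy loss, followed by confidence thresholding. It is an illustration under assumptions rather than the network used in this work: the function names and toy inputs are hypothetical, and scoring a box as objectness times class probability is a common convention in YOLO implementations that the text does not spell out. The 0.3 threshold matches the confidence score used later in the experiments.

```python
import numpy as np


def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))


def bce_loss(class_logits, targets):
    """Binary cross-entropy averaged over boxes and classes.

    Each class gets its own sigmoid, so class scores are not forced to sum to 1
    as they would be with a softmax layer.
    """
    p = np.clip(sigmoid(class_logits), 1e-7, 1 - 1e-7)
    return float(-np.mean(targets * np.log(p) + (1 - targets) * np.log(1 - p)))


def filter_detections(boxes, objectness_logits, class_logits, conf_threshold=0.3):
    """Keep boxes whose best per-class score reaches the confidence threshold."""
    scores = sigmoid(objectness_logits)[:, None] * sigmoid(class_logits)
    best_class = scores.argmax(axis=1)
    best_score = scores.max(axis=1)
    keep = best_score >= conf_threshold
    return boxes[keep], best_class[keep], best_score[keep]


# Two candidate boxes (x1, y1, x2, y2); class columns are (vehicle, human).
boxes = np.array([[0, 0, 10, 10], [5, 5, 20, 20]], dtype=float)
objectness_logits = np.array([4.0, -4.0])
class_logits = np.array([[3.0, -3.0], [3.0, -3.0]])

kept_boxes, kept_classes, kept_scores = filter_detections(
    boxes, objectness_logits, class_logits
)
```

In this toy input only the first box survives the threshold, because the second box has a very low objectness score even though its class logits are identical.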
Algorithm 1 Federated Learning
INPUT: The cloud weight parameters $\theta^0$
OUTPUT: Final weight parameters $\theta^*$
for each federated learning epoch do
    Update each cloudlet with the cloud weight parameters $\theta^0$
    for each cloudlet in parallel do
        Update the NN model weight parameters based on training examples collected and labelled locally.
            $\theta_i = \arg\min_{\theta} L_i(X_i, Y_i, \theta^0)$
        Calculate the overall change.
            $\Delta\theta_i = \theta_i - \theta^0$
        Transmit $\Delta\theta_i$ to the cloud.
    end for
    Aggregate the changes.
        $\Delta\theta = f_{agg}(\Delta\theta_0, \ldots, \Delta\theta_k)$
    Update the cloud weight parameters.
        $\theta^0 \leftarrow \theta^0 + \Delta\theta$
end for
Obtain the final cloud weight parameters.
    $\theta^* = \theta^0$
return $\theta^*$

IV. EXPERIMENTS

To illustrate the effectiveness of our proposed federated learning method, we use object detection in traffic videos as the evaluation task. The Voc data set [26] and the Coco data set [27] are two widely used data sets for object detection evaluation. We randomly sample 5 sub-datasets from each of these two datasets. We assume that each of these sub-datasets is the local data stored in an edge device, which cannot be shared or transmitted. Therefore, in total, we have 10 different clients in this federated learning system. For the single NN model, we utilize the entire Voc and Coco datasets, which makes the comparison more challenging for the federated learning model. To simplify the DNN model, we use the vehicle class and the human class, which are the most commonly seen objects in traffic videos, as the target object classes to train our DNN models. We have trained three DNN models in our experiment, two local models and one federated model, based on the network structure introduced in Sec. III-C.
The local models are trained with the data stored at one of the edges; specifically, one is trained using the Voc data set and the other is trained using the Coco data set. The federated model is trained with our proposed federated learning method, which shares the trained weights of the two edges. We use a GPU server with 8 NVIDIA V100 GPUs to train and test our federated learning method. For the local models, we first train with frozen layers and a learning rate of $10^{-3}$ for 50 epochs. Then, we fine-tune with a learning rate of $10^{-4}$ for another 50 epochs. For the federated model, we follow the same training process. The difference is that we transfer the weight updates from the two local models to the cloud at each epoch for aggregation, and the local models then re-initialize from the weights sent back from the cloud.

To test the performance of the models, we use a third data set, the urban tracker [28], as the testing data set. The urban tracker data set includes four video scenarios, with over 7000 annotated video frames. It includes three urban traffic videos, which contain both vehicles and humans, and one indoor video with only humans. The three traffic videos are taken from different viewing angles, which makes them effective for testing the generalization ability of the models. We use average precision (AP), precision, recall and F-score to measure the detection accuracy for each object class, and use the mean average precision (mAP) to measure the detection accuracy on average. The precision, recall, and F-score are calculated at a confidence score of 0.3.

By comparing the two local models, we can see that the Coco model always achieves a higher detection accuracy (mAP) than the Voc model. This reflects the unbalanced training data quality at the different edge sides. With the federated model, the training process integrates the features extracted at the two edge sides without exchanging the raw data. This training process, on the one hand, reduces the data transmission pressure on the network.
On the other hand, it resolves the data privacy issue. The rationale of the federated model is that we enhance the object classification accuracy by enlarging the training sample set. We now analyze the effectiveness of the federated model in more detail.

Table I shows the detection accuracy on the urban tracker data set. The first row in Table I shows the detection accuracy on the 'Stmarc' video. This video has a high viewing angle, as shown in the first row of Fig. 4. Considering the two single NN models, for both the vehicle and human classes, we can see that the Voc model achieves higher precision but a lower recall rate than the Coco model, because of the missed detections in the Voc model. In comparison, the Coco model reaches a higher AP and F-score, which means that the Coco model has better overall detection accuracy. For the vehicle class, both the AP and F-score of the federated model exceed those of the Coco model and the Voc model. This happens because the federated model achieves the highest recall while keeping its precision close to that of the Voc model. For the human class, the federated model also achieves a higher AP and F-score than the Coco model and the Voc model. The federated model clearly achieves a more balanced performance between precision and recall than the Voc model. The mAP represents the average detection accuracy over these two classes, and the federated model achieves the highest mAP among the three models. The mAP of the federated model is more than 10% higher than that of the Coco model, and around 30% higher than that of the Voc model. This improvement indicates that the federated learning structure can improve the detection accuracy in practice.
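The relationship between the reported metrics can be checked with a short calculation. The sketch below uses the Coco-model and federated-model entries for the 'Stmarc' video as read from Table I; the function names are hypothetical. It assumes the F-score is the harmonic mean of precision and recall and that mAP is the unweighted mean of the per-class APs, both of which are consistent with the table values. It also suggests that the improvements quoted in the text are differences in percentage points rather than relative gains.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def mean_average_precision(ap_per_class):
    """Unweighted mean of the per-class average precision values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


# Coco model on 'Stmarc', vehicle class: precision 35.22%, recall 55.27%.
coco_vehicle_f = f_score(35.22, 55.27)  # about 43.02, reported as 43%

# mAP over the two classes for the Coco model and the federated model.
coco_map = mean_average_precision({"vehicle": 32.97, "human": 29.46})  # 31.215
fed_map = mean_average_precision({"vehicle": 45.69, "human": 37.64})   # 41.665

# Difference in percentage points, matching the "more than 10%" in the text.
gain_over_coco = fed_map - coco_map  # about 10.45
```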
TABLE I: The comparative detection results of the Coco model, Voc model, and the federated model.

Video        Model   Vehicle AP   Vehicle Precision   Vehicle Recall   Vehicle F-score   Human AP   Human Precision   Human Recall   Human F-score   mAP
Stmarc       Coco    32.97%       35.22%              55.27%           43%               29.46%     55.54%            45.83%         50.22%          31.22%
Stmarc       Voc     16.01%       56.09%              17.09%           26.20%            9.09%      98.10%            7.01%          13.08%          12.55%
Stmarc       Fed     45.69%       54.02%              61.62%           57.57%            37.64%     60.37%            58.36%         59.35%          41.66%
Rouen        Coco    34.24%       49.32%              49.59%           49%               55.42%     85.41%            63.45%         72.81%          44.83%
Rouen        Voc     32.17%       50.43%              31.61%           38.86%            18.06%     90.21%            19.16%         31.61%          25.12%
Rouen        Fed     72.49%       88.08%              72.48%           79.52%            62.10%     95.08%            65.88%         77.84%          67.29%
Sherbrooke   Coco    60.58%       86.03%              69.96%           77%               21.15%     77.56%            21.37%         33.51%          40.87%
Sherbrooke   Voc     50.57%       87.73%              59.53%           70.93%            17.93%     65.75%            28.90%         40.15%          34.25%
Sherbrooke   Fed     67.43%       60.05%              81.40%           69.11%            74.15%     54.08%            97.18%         69.49%          70.79%
Atrium       Coco    NA           NA                  NA               NA                81.81%     98.48%            89.08%         93.54%          -
Atrium       Voc     NA           NA                  NA               NA                81.78%     99.76%            86.62%         92.73%          -
Atrium       Fed     NA           NA                  NA               NA                90.58%     98.43%            93.65%         95.98%          -

(Atrium contains humans only, so vehicle metrics are NA and no mAP is listed.)

The second row in Table I shows the detection accuracy on the 'Rouen' video. This video has a lower camera mounting height than the 'Stmarc' video, as shown in the second row of Fig. 4. The vehicles and humans appear larger and show more detailed features in comparison. The precision and recall of the Coco model and the Voc model follow a similar pattern to the 'Stmarc' video. The federated model, in comparison, still outperforms the two single NN models in terms of AP, Precision, Recall, and F-score on both the Vehicle and Human classes.
For the overall measurement on mAP, we can see that the federated model achieves 67.29%, while the values for the single models are 44.83% (the Coco model) and 25.12% (the Voc model).

The third row in Table I shows the detection accuracy on the 'Sherbrooke' video. This video has a lower camera viewing angle than the 'Stmarc' and 'Rouen' videos, as shown in the third row of Fig. 4. The different viewing angles result in different object features. At this viewing angle, the detection of humans becomes hard for the single NN models, since their best AP and F-score are much lower than in the previous two rows. However, the performance of the federated model is not affected, and we believe that the federated model indeed benefits from the diversity gained when the model is jointly trained with updates from multiple clients.

We can see from these three testing videos that the Voc model always achieves worse detection accuracy than the Coco model, due to its poorer training data quality (less data). Therefore, the worker node with the Voc dataset can always benefit significantly from the federated model. In addition, the worker node with the Coco dataset can also obtain a large improvement by using the federated model. For example, on the 'Sherbrooke' video, the improvement in mAP over the Coco model is around 30%. By analyzing the detection results, we can conclude that the federated model can enhance the detection accuracy for most worker nodes. For poorly performing clients, our empirical results show that using the federated model can improve the mAP by over 42%. Even for the best performing clients, the federated model can still enhance the mAP by 10%, which is significant in object detection.

The last row in Table I shows the detection accuracy on the 'Atrium' video. This is an indoor video that only contains humans, as shown in the last row of Fig. 4. The accuracy of all three models is at the same level, which is high for object detection.
This result shows that taking the federated learning approach does not degrade the original performance when the single NN model is already good enough.

Fig. 4: The comparative detection results of the Coco model, Voc model, and the federated model using the urban tracker data set. The first column shows the detection results achieved by the Coco model. The second column shows the detection results of the Voc model. The third column shows the detection results of the federated model. Each row represents a frame from a video (Stmarc, Rouen, Sherbrooke, Atrium), respectively.

V. CONCLUSION

In this work, we build an architecture that combines edge computing and federated learning to implement a traffic video analytic application with the NN based YOLOv3 model. To evaluate this application, we compare its performance on object detection (vehicle and human) across several different video scenarios. The results show that the federated model can significantly improve the object detection accuracy in almost all cases. The improvement in mAP over the best performing client is as large as 10%, which is substantial in object detection. For poorly performing clients, federated learning enables knowledge sharing, which effectively mitigates insufficient training data or low data diversity. Our empirical results show that the improvement in mAP can exceed 40%.

REFERENCES

[1] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning. MIT press Cambridge, 2016, vol. 1.
[2] J. Konečný, B. McMahan, and D. Ramage, “Federated optimization: Distributed optimization beyond the datacenter,” arXiv preprint arXiv:1511.03575, 2015.
[3] F. Bonomi, “Connected vehicles, the internet of things, and fog computing,” in The Eighth ACM International Workshop on Vehicular InterNetworking (VANET), Las Vegas, USA, 2011, pp. 13–15.
[4] A. Kiani, G. Liu, H. Shi, A. Khreishah, N. Ansari, J. Y. Lee, and C. Liu, “A two-tier edge computing based model for advanced traffic detection,” in 2018 Fifth International Conference on Internet of Things: Systems, Management and Security. IEEE, 2018, pp. 208–215.
[5] M. Chiang and T. Zhang, “Fog and IoT: An overview of research opportunities,” IEEE Internet of Things Journal, vol. 3, no. 6, pp. 854–864, 2016.
[6] Z. Zhao, M. Peng, Z. Ding, W. Wang, and H. V. Poor, “Cluster content caching: An energy-efficient approach to improve quality of service in cloud radio access networks,” IEEE Journal on Selected Areas in Communications, vol. 34, no. 5, pp. 1207–1221, 2016.
[7] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 580–587.
[8] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 779–788.
[9] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” International Conference on Learning Representations, 2015.
[10] G. Liu, I. Khalil, and A. Khreishah, “Zk-gandef: A gan based zero knowledge adversarial training defense for neural networks,” in 2019 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN). IEEE, 2019, pp. 64–75.
[11] G. Liu, I. Khalil, and A. Khreishah, “Gandef: A gan based adversarial training defense for neural network classifier,” in IFIP International Conference on ICT Systems Security and Privacy Protection. Springer, 2019, pp. 19–32.
[12] S. Banabilah, M. Aloqaily, E. Alsayed, N. Malik, and Y. Jararweh, “Federated learning review: Fundamentals, enabling technologies, and future applications,” Information processing & management, vol. 59, no. 6, p. 103061, 2022.
[13] J. Posner, L. Tseng, M. Aloqaily, and Y. Jararweh, “Federated learning in vehicular networks: Opportunities and solutions,” IEEE Network, vol. 35, no. 2, pp. 152–159, 2021.
[14] U. I. Minhas, L. Mukhanov, G. Karakonstantis, H. Vandierendonck, and R. Woods, “Leveraging transprecision computing for machine vision applications at the edge,” in 2021 IEEE Workshop on Signal Processing Systems (SiPS). IEEE, 2021, pp. 205–210.
[15] Z. Xiao, Z. Xia, H. Zheng, B. Y. Zhao, and J. Jiang, “Towards performance clarity of edge video analytics,” arXiv preprint arXiv:2105.08694, 2021.
[16] G. Liu, H. Shi, A. Kiani, A. Khreishah, J. Lee, N. Ansari, C. Liu, and M. M. Yousef, “Smart traffic monitoring system using computer vision and edge computing,” IEEE Transactions on Intelligent Transportation Systems, 2021.
[17] X. Ran, H. Chen, X. Zhu, Z. Liu, and J. Chen, “Deepdecision: A mobile deep learning framework for edge video analytics,” in IEEE INFOCOM 2018-IEEE Conference on Computer Communications. IEEE, 2018, pp. 1421–1429.
[18] P. Sermanet, K. Kavukcuoglu, S. Chintala, and Y. LeCun, “Pedestrian detection with unsupervised multi-stage feature learning,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2013, pp. 3626–3633.
[19] M. Stojmenovic, “Real time machine learning based car detection in images with fast training,” Machine Vision and Applications, vol. 17, no. 3, pp. 163–172, 2006.
[20] N. Buch, S. A. Velastin, and J. Orwell, “A review of computer vision techniques for the analysis of urban traffic,” IEEE Transactions on intelligent transportation systems, vol. 12, no. 3, pp. 920–939, 2011.
[21] A. B. Sada, M. A. Bouras, J. Ma, H. Runhe, and H. Ning, “A distributed video analytics architecture based on edge-computing and federated learning,” in 2019 IEEE Intl Conf on Dependable, Autonomic and Secure Computing, Intl Conf on Pervasive Intelligence and Computing, Intl Conf on Cloud and Big Data Computing, Intl Conf on Cyber Science and Technology Congress (DASC/PiCom/CBDCom/CyberSciTech). IEEE, 2019, pp. 215–220.
[22] C. He, A. D. Shah, Z. Tang, D. F. N. Sivashunmugam, K. Bhogaraju, M. Shimpi, L. Shen, X. Chu, M. Soltanolkotabi, and S. Avestimehr, “Fedcv: A federated learning framework for diverse computer vision tasks,” arXiv preprint arXiv:2111.11066, 2021.
[23] Y. Liu, A. Huang, Y. Luo, H. Huang, Y. Liu, Y. Chen, L. Feng, T. Chen, H. Yu, and Q. Yang, “Fedvision: An online visual object detection platform powered by federated learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 08, 2020, pp. 13172–13179.
[24] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016. IEEE Computer Society, 2016, pp. 779–788.
[25] J. Redmon and A. Farhadi, “Yolov3: An incremental improvement,” arXiv, 2018.
[26] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes challenge: A retrospective,” International Journal of Computer Vision, vol. 111, no. 1, pp. 98–136, Jan. 2015.
[27] T. Lin, M. Maire, S. J. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: common objects in context,” in Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V, ser. Lecture Notes in Computer Science, D. J. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, Eds., vol. 8693. Springer, 2014, pp. 740–755.
[28] J. Jodoin, G. Bilodeau, and N.
Saunier, “Urban tracker: Multiple object tracking in urban mixed traffic,” in IEEE Winter Conference on Applications of Computer Vision, 2014, pp. 885–892. Authorized licensed use limited to: BIRLA INSTITUTE OF TECHNOLOGY AND SCIENCE. Downloaded on June 28,2023 at 09:37:03 UTC from IEEE Xplore. Restrictions apply.
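As a rough illustration of the knowledge sharing credited in the conclusion, the sketch below aggregates client model parameters into one global model by taking a mean weighted by each client's number of training samples, in the style of federated averaging. It is only an assumption that the aggregation is a sample-weighted mean; the function name `federated_average`, the flat-list parameter layout, the layer name `conv1.weight`, and the toy numbers are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def federated_average(
    client_params: List[Dict[str, List[float]]],
    client_sizes: List[int],
) -> Dict[str, List[float]]:
    """Return the sample-weighted mean of the clients' parameters.

    Assumption: every client reports the same parameter names and shapes,
    with each tensor flattened to a list of floats.
    """
    if not client_params or len(client_params) != len(client_sizes):
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")

    averaged: Dict[str, List[float]] = {}
    for name, reference in client_params[0].items():
        merged = [0.0] * len(reference)
        for params, size in zip(client_params, client_sizes):
            weight = size / total  # share of all training samples
            for i, value in enumerate(params[name]):
                merged[i] += weight * value
        averaged[name] = merged
    return averaged


# Toy example (hypothetical values): two clients, one two-element tensor each.
clients = [
    {"conv1.weight": [1.0, 2.0]},
    {"conv1.weight": [3.0, 6.0]},
]
global_params = federated_average(clients, client_sizes=[1, 3])
```

Only the parameter dictionaries and sample counts cross the network in this sketch; the video frames stay on each node, which is the property the paper relies on for bandwidth and privacy.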