ADVANCED TOPICS IN SECURE AND DISTRIBUTED COMPUTING: DISTRIBUTED LEARNING - MOVIE RECOGNITION BY AUDIO USING FEDERATED LEARNING

Jannis Voigt
Institute for Computer Science
University of Innsbruck
Innsbruck, Austria
jannis.voigt@student.uibk.ac.at

Abstract

Distributed learning and federated learning have gained significant attention as techniques for training machine learning models on decentralized data sources while respecting data privacy. This study investigates the applicability of these approaches in the domain of movie recognition using audio samples. Leveraging cutting-edge machine learning algorithms and frameworks, we distribute the training process across multiple clients, enabling collaborative model training without relying on centralized data aggregation. Various data distribution scenarios are explored, encompassing complete data availability on all clients, allocation of different movies to specific clients, and partial data of each movie on each client. Additionally, parallelized speedup techniques are employed during data preprocessing to enhance efficiency. The efficacy of distributed and federated learning in achieving accurate models while preserving data privacy is evaluated. Performance metrics, including execution time, accuracy, and scalability, are analyzed to assess the effectiveness and efficiency of these approaches. By bridging the domains of distributed learning, federated learning, and movie recognition, this research contributes to advancing decentralized machine learning systems, offers valuable insights for future developments, and highlights the benefits of parallelized speedup in data preprocessing.

1 Introduction

Music recognition applications such as Shazam (https://www.shazam.com/de), SoundHound (https://www.soundhound.com/), and voice assistants have gained immense popularity since the advent of the smartphone era and continue to be essential tools today. These applications have revolutionized the way we identify songs by analyzing audio snippets and matching them to their corresponding tracks. Building upon this technology, our project delves into the realm of movie recognition based on audio samples.

In this study, we aim to investigate the effectiveness of existing music recognition techniques in the domain of movie identification. By leveraging state-of-the-art machine learning methods, we employ advanced algorithms that distribute the training of our model. This approach not only makes our setup more realistic but also improves the training time for large-scale data sets.

By extending the capabilities of audio recognition systems to movies, we hope to explore the potential of identifying films from audio snippets. While we anticipate challenges and limitations in this task, we believe that even partial success could have practical applications in the entertainment industry. Accurately identifying movies from audio samples could provide users with useful information such as the film's title and relevant details, facilitating content discovery in the digital age.

Throughout this project, we will assess the feasibility of movie recognition based on audio samples, evaluating the performance and limitations of our model. Through experimentation and analysis, we aim to contribute to the ongoing research in audio recognition and distributed learning technologies and gain insights that can guide future advancements in these topics.
2 Related work

A substantial body of research has focused on music recognition, providing valuable insights into various techniques and approaches. One notable study by Yan Ke et al., "Computer Vision for Music Identification" [1], explores the transformation of audio data into visual representations for music recognition. This innovative method has demonstrated superior performance compared to older techniques, motivating us to adopt a similar approach in our project.

In addition to music recognition, several papers have investigated the classification of movie genres based on audio data. However, most existing studies use the audio input as is, without any specific transformation or feature extraction. Various machine learning techniques, including Support Vector Machines [2], Latent Factor Analysis [3], and Neural Networks [4], have been employed to build genre classifiers. These classifiers have achieved promising results, with accuracies reaching up to 90 percent. However, it is important to note that many of these classifiers primarily distinguish between action and non-action movies/scenes, offering limited granularity in genre classification.

While the aforementioned research provides a foundation for our work, it is essential to address the specific challenges and limitations that arise when applying music recognition techniques to movie identification. By building upon the existing literature and incorporating advancements in audio analysis, we aim to extend the capabilities of audio recognition systems and explore the potential for accurate movie identification based on audio snippets.

3 System description

Our system is designed to leverage distributed and federated learning techniques for training machine learning models on decentralized data sources while ensuring data privacy. In this section, we provide further details on the data, the model architecture, the infrastructure setup, the data distribution among clients, and the approach to speeding up data preprocessing.

3.1 Data

Our data set consists of a total of 2,410 three-second audio snippets from 11 different movies. Following the related work described above, we transform these audio files into mel spectrograms (https://en.wikipedia.org/wiki/Mel-frequency_cepstrum) with the help of librosa (https://librosa.org/doc/main/index.html), a Python package for music and audio analysis.

A mel spectrogram is a visual representation of the frequencies present in an audio signal over time. It is created by applying the Fourier transform to the audio signal, which decomposes it into its constituent frequencies. The resulting spectrogram represents the amplitudes of these frequencies as a function of time. Furthermore, the mel scale is a frequency scale that approximates the way humans perceive pitch. It divides the frequency range into perceptually equal intervals by applying a log scale to the frequency axis, giving more resolution to lower frequencies and less resolution to higher frequencies. This scale is designed to better match the way our ears perceive sound, as human hearing is more sensitive to changes in lower frequencies than in higher frequencies. The mel spectrogram is particularly suited for music recognition tasks because it captures both temporal and spectral information.

Figure 1: Example of a mel spectrogram (image source: https://librosa.org/doc/main/generated/librosa.feature.melspectrogram.html)
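As a concrete illustration of this preprocessing step, the following is a minimal sketch of such a transformation with librosa. The sampling rate and the decibel scaling are assumptions on our part; the exact parameters that yield the 128 × 469 inputs used in Section 3.2 are not reproduced here.

```python
import librosa
import numpy as np

def audio_to_mel(path: str, sr: int = 22050) -> np.ndarray:
    """Load an audio snippet and convert it to a log-scaled mel spectrogram."""
    # Load the snippet at a fixed sampling rate so all inputs are comparable
    # (the sampling rate here is an assumed default, not the project's value)
    y, _ = librosa.load(path, sr=sr)
    # Compute a mel spectrogram with 128 mel bands (the height of our inputs)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
    # Convert power to decibels, compressing the dynamic range similarly to
    # how human hearing perceives loudness
    return librosa.power_to_db(mel, ref=np.max)
```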
3.2 Model

The input data has a size of 128 × 469. As convolutional layers are well suited for visual analysis, we start with three convolutional layers, each followed by a ReLU activation function, a batch normalization layer, and a pooling layer (except for the third and last convolutional layer, which has no pooling). We then flatten the output of the last convolutional layer and pass it as input to a linear layer with 11 output nodes, activated by a softmax layer for classification. The model incorporates several key features and their associated benefits:

1. Convolutional Layers: We employ three convolutional layers to extract hierarchical features from the input data. Convolutional layers are particularly well-suited for visual analysis due to their ability to capture local patterns and spatial relationships within images. By applying a set of learnable filters to the input data, these layers can detect relevant visual features at different scales.

2. ReLU Activation Function: Following each convolutional layer, we apply a rectified linear unit (ReLU) activation function. The ReLU function introduces non-linearity to the model, allowing it to capture complex relationships between the input features. ReLU activation has the benefit of being computationally efficient and preventing the vanishing gradient problem, which can impede training convergence.

3. Batch Normalization Layer: To improve stability and speed up the training process, we incorporate a batch normalization layer. This layer normalizes the activations of the previous layer across a batch of data, reducing the internal covariate shift. Batch normalization helps to alleviate the problem of unstable gradients during training and facilitates faster convergence by allowing higher learning rates to be used.

4. Pooling Layer: After the ReLU activation, we apply a pooling layer (except after the third and last convolutional layer). The pooling operation downsamples the feature maps, reducing their spatial dimensions while retaining important information. Pooling helps to achieve translation invariance, making the model more robust to variations in the input data and reducing the computational complexity of subsequent layers.

5. Flatten Operation: The output of the last convolutional layer is flattened into a one-dimensional vector. This flattening operation transforms the spatially arranged feature maps into a format suitable for linear layers, enabling the model to learn global relationships among the extracted features.

6. Linear Layer and Softmax Activation: The flattened feature vector is then passed through a linear layer with 11 output nodes. This linear layer captures high-level representations of the input data. Finally, a softmax activation function is applied to the output of the linear layer to produce probability scores for each class in the classification task. Softmax ensures that the predicted class probabilities sum to one, facilitating easy interpretation and enabling the selection of the most likely class label.

By employing these architectural choices, our model effectively captures relevant visual features from the input data. The combination of convolutional layers, activation functions, batch normalization, pooling, and linear layers contributes to the model's ability to extract hierarchical representations, achieve translation invariance, and handle complex visual patterns, ultimately enhancing its overall performance.
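To make this architecture concrete, a minimal PyTorch sketch is given below. The paper only specifies the layer types, the layer ordering, the input size, and the 11 output nodes; the number of filters and the kernel and pooling sizes are our own assumptions.

```python
import torch
import torch.nn as nn

class MovieAudioCNN(nn.Module):
    """Sketch of the described architecture. Channel counts and kernel sizes
    are assumptions; only the layer types and ordering follow the text."""

    def __init__(self, num_classes: int = 11):
        super().__init__()
        self.features = nn.Sequential(
            # Conv block 1: conv -> ReLU -> batch norm -> pooling
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.BatchNorm2d(16),
            nn.MaxPool2d(2),
            # Conv block 2: conv -> ReLU -> batch norm -> pooling
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.BatchNorm2d(32),
            nn.MaxPool2d(2),
            # Conv block 3: no pooling after the third and last conv layer
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.BatchNorm2d(64),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            # Inputs are 128 x 469; after two 2x2 poolings the maps are 32 x 117
            nn.Linear(64 * 32 * 117, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        logits = self.classifier(self.features(x))
        # Softmax yields class probabilities; it would be omitted when training
        # with nn.CrossEntropyLoss, which expects raw logits
        return torch.softmax(logits, dim=1)
```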
3.3 Infrastructure

In order to evaluate the efficiency of federated learning and compare it with local training, we established an infrastructure consisting of multiple Amazon EC2 instances (https://aws.amazon.com/de/pm/ec2/), each serving a specific role in the federated learning setup.

Firstly, we deployed one EC2 instance as a server responsible for coordinating the federated learning process. This server utilized the FedAvg [5] strategy, a widely adopted algorithm for federated learning that aggregates model updates from the participating clients. Additionally, we leveraged several EC2 instances as clients, actively participating in the training procedure. These client instances were responsible for performing the actual model training using the federated learning approach. By distributing the training workload among multiple clients, we aimed to showcase the collaborative and decentralized nature of federated learning.

To facilitate the training process, we opted for EC2 instances equipped with the Deep Learning AMI GPU PyTorch 2.0.1 (Amazon Linux 2) image. This pre-configured image provided a streamlined setup for deep learning frameworks and GPU support, enabling efficient execution of computationally intensive tasks. Specifically, we utilized EC2 instances of the t2.large instance type, which offered a suitable balance between computational capacity and cost-effectiveness.

By designing and utilizing this infrastructure, we were able to effectively implement and evaluate the performance of federated learning in comparison to local training. The combination of EC2 instances, the FedAvg strategy, and the Deep Learning AMI GPU PyTorch 2.0.1 image provided a reliable and efficient environment for conducting our experiments and drawing meaningful conclusions about the benefits and drawbacks of federated learning in our specific context.

3.4 Federated learning

Federated learning, a distributed machine learning approach, offers a promising solution for training models on decentralized data sources while ensuring data privacy. In this section, we delve deeper into the implementation of federated learning in our experiments and highlight its key features.

Figure 2: Federated learning overview (image source: https://gemmo.ai/federated-learning)

To enable federated learning, we utilized the Flower framework (https://flower.dev/), an open-source Python package specifically designed for implementing federated learning algorithms. Flower simplifies the development and deployment of federated learning systems, providing a comprehensive set of tools and functionalities that handle critical tasks such as communication, synchronization, and aggregation of model updates.

One of the key advantages of federated learning is its ability to train models without requiring centralized data aggregation. Instead of transmitting raw data to a central server, which poses privacy concerns, federated learning enables the clients to perform local training on their respective data sets while exchanging model updates with the central server. This decentralized approach ensures data privacy while allowing for collaborative model training.

In our experiments, we focused on exploring different data distribution strategies within the federated learning framework. We considered scenarios where all data resided on each client, where specific movies were allocated to specific clients, and where all movies were present on each client but with varying data samples. These diverse data distribution setups allowed us to assess the robustness of federated learning in handling heterogeneous and decentralized data.

The coordination and synchronization of the federated learning process were facilitated by the central server, which utilized the FedAvg [5] strategy. FedAvg aggregates the model updates received from the participating clients by computing an average of their weights, weighted by the number of local training samples, and distributes the updated weights back to the clients. This collaborative approach allows the server to leverage the collective knowledge of all participants while preserving the privacy of their individual data.
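Formally, with $n_k$ training samples on client $k$ and $n = \sum_k n_k$, one FedAvg round computes $w_{t+1} = \sum_{k=1}^{K} \frac{n_k}{n}\, w_{t+1}^{k}$, where $w_{t+1}^{k}$ denotes the weights client $k$ obtained by local training starting from the global weights $w_t$. As an illustration, the following is a minimal sketch of how such a setup can look with Flower; the model, data loaders, and the train/test helpers are assumed placeholders for the components described above, and the configuration values are illustrative rather than the project's actual settings.

```python
import flwr as fl
import torch

# `model`, `trainloader`, `testloader`, `train_one_round`, and `test` are
# assumed helpers standing in for the model and data pipeline from Section 3.

def get_weights(model):
    # Serialize the model weights as a list of NumPy arrays for transport
    return [t.detach().cpu().numpy() for t in model.state_dict().values()]

def set_weights(model, weights):
    # Restore the weights the server sent at the start of a round
    keys = model.state_dict().keys()
    model.load_state_dict({k: torch.tensor(w) for k, w in zip(keys, weights)})

class MovieClient(fl.client.NumPyClient):
    def get_parameters(self, config):
        return get_weights(model)

    def fit(self, parameters, config):
        set_weights(model, parameters)
        train_one_round(model, trainloader)  # local training on private data
        # Returning the sample count lets FedAvg weight this client's update
        return get_weights(model), len(trainloader.dataset), {}

    def evaluate(self, parameters, config):
        set_weights(model, parameters)
        loss, accuracy = test(model, testloader)
        return float(loss), len(testloader.dataset), {"accuracy": accuracy}

# On the server instance:
#   fl.server.start_server(server_address="0.0.0.0:8080",
#                          config=fl.server.ServerConfig(num_rounds=10),
#                          strategy=fl.server.strategy.FedAvg())
# On each client instance:
#   fl.client.start_numpy_client(server_address="<server-ip>:8080",
#                                client=MovieClient())
```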
By designing and implementing federated learning in our infrastructure, we aimed to evaluate its performance and efficiency in comparison to traditional local training methods. We assessed its effectiveness in different data distribution scenarios, considering factors such as privacy preservation, scalability, and collaborative learning. In the Evaluation section, we present the results and analysis of our federated learning experiments and discuss the benefits and challenges of federated learning in our specific context, highlighting its potential for decentralized training and collaborative knowledge sharing.

3.5 Speeding up data preparation

To optimize the data preprocessing, we introduced parallelization using Ray (https://www.ray.io/), a framework specifically designed to scale AI and Python workloads. By leveraging Ray, we achieved faster data loading and transformation processes, particularly for larger data sets.

In our implementation, we parallelized both the data loading and the transformation step, which involved converting the raw audio data into mel spectrograms. By distributing the workload across multiple cores, we were able to process multiple data samples simultaneously. While our initial experiments with a relatively small data set did not demonstrate substantial improvements in processing speed, the benefits of parallelization become more pronounced as the data set size increases. In the Evaluation section, we present results that highlight the efficiency gains achieved with larger data sets, illustrating the advantages of using Ray for scaling up data preprocessing.

By incorporating parallelization techniques through Ray, we enhanced the speed and efficiency of the data loading and mel spectrogram transformation steps. This optimization not only saves time but also enables researchers to process larger data sets more effectively, ultimately improving the overall scalability and performance of the preprocessing pipeline.

4 Evaluation

In this section, we evaluate the performance and effectiveness of our federated learning approaches and explore their advantages over local training methods. We also examine the impact of utilizing the Ray framework for parallelization and discuss findings related to the model architecture.

            Method 1   Method 2   Method 3
Time        18 min     8.3 min    8 min
Accuracy    46%        37%        48%

Table 1: Execution time per round of training and accuracy for the three different federated learning methods applied.

Figure 3: Accuracy and training time for the different federated learning methods.

4.1 Comparison of Federated Learning approaches

In our exploration of federated learning, we investigated three different approaches, each offering a unique data distribution scenario (a sketch of how such partitions can be constructed follows the list):

1. Complete data available on all clients
2. Different movies for each client
3. Partial data of each movie on each client
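The following minimal sketch illustrates one way these three partitions could be constructed. The representation of a sample as a (spectrogram, movie_id) pair and the round-robin movie assignment are our own simplifying assumptions, not the project's actual code.

```python
import random

def partition(samples, num_clients, scenario):
    """Split (spectrogram, movie_id) pairs among clients for the three scenarios."""
    if scenario == 1:
        # Scenario 1: every client holds the complete data set
        return [list(samples) for _ in range(num_clients)]
    if scenario == 2:
        # Scenario 2: each movie is assigned to exactly one client
        movies = sorted({movie_id for _, movie_id in samples})
        owner = {m: i % num_clients for i, m in enumerate(movies)}
        parts = [[] for _ in range(num_clients)]
        for spec, movie_id in samples:
            parts[owner[movie_id]].append((spec, movie_id))
        return parts
    # Scenario 3: every client sees every movie, but disjoint subsets of samples
    shuffled = random.sample(list(samples), len(samples))
    return [shuffled[i::num_clients] for i in range(num_clients)]
```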
We now proceed to compare these approaches based on their respective execution times and accuracy. The first approach achieved an overall accuracy of 46%, while the second approach yielded an average accuracy of 37%. The third approach demonstrated a slightly higher accuracy of 48%. However, it is worth noting that the first approach took more than twice as long as the second and third approaches due to the larger volume of data processed. Table 1 summarizes the execution times and accuracies for the three federated learning methods, and Figure 3 visualizes their performance characteristics.

Although these accuracies may not appear exceptionally high at first glance, considering that they are more than five times better than random guessing (9%), and that the audio snippets were only 3 seconds long, the achieved results are quite promising. Furthermore, we successfully demonstrated the potential for significant speed improvements by distributing the data among the clients, without compromising the performance of the trained model.

Interestingly, the second approach showed considerable variations in accuracy across different clients. For random movie allocation, the best-performing client achieved an accuracy of 60%, while the other clients attained accuracies of 30% and 17%, respectively. This discrepancy suggests that some movies are relatively easy to classify, while others pose more of a challenge. We delve deeper into this topic in the next subsection.

4.2 Genre-Based Data Partitioning and Accuracy Variations

In the second approach, where each client trains on different movies, we observed significant variations in accuracy among the clients, even when grouping the movies randomly. To gain deeper insights, we conducted an analysis in which we grouped the movies based on their similarity, which led to some interesting findings. Upon grouping the movies into three categories (Musicals, Action, and Thought-provoking / Epic movies), we obtained the following accuracies for each group:

1. Musicals (Mamma Mia, Shrek): 11.2%
2. Action (Top Gun, Fifth Element, Mad Max, Kill Bill): 35.3%
3. Thought-provoking / Epic movies (Shutter Island, Space Odyssey, Inception, V for Vendetta): 47.9%

We discovered that distinguishing musicals, such as Mamma Mia and Shrek, proved to be remarkably challenging, and they derived little benefit from the shared knowledge among the clients. In contrast, epic and thought-provoking movies like Inception demonstrated the most significant improvement in accuracy and were comparatively easier to distinguish, despite there being a substantially greater number of movies within this genre.

These findings shed light on the varying levels of complexity associated with different movie genres when using federated learning. Musicals present a considerable challenge, likely due to their unique characteristics and nuances that make them harder to classify accurately. On the other hand, epic and thought-provoking movies exhibit distinct patterns and features that are more easily captured by the trained model, resulting in enhanced accuracy.

By organizing the clients' data based on movie genres, we obtained an overall accuracy of 31.5%. This finding suggests that the training process is more successful when each client's data is diverse and covers a wide range of genres. When we compare this result to the accuracy achieved through random grouping of the data (37%), random grouping performs approximately 5% better.
Furthermore, when each client has access to data from all movies, the accuracy improves by an additional 10%. This finding underscores an inherent weakness of distributed and federated learning: performance can suffer when data is partitioned based on specific characteristics.

4.3 Federated training vs. local training

When comparing our best federated learning result (method 3) to the results obtained from local training, we observed similar accuracies. However, the federated learning approach showed its superiority in terms of execution time: local training took more than six times longer to complete.

It is important to note that the execution time can vary depending on the hardware used. In our case, the machine employed for local training lacked a GPU and used an Intel(R) Core(TM) i5-8265U CPU @ 1.60 GHz (1,800 MHz, 4 cores). The EC2 instances utilized for distributed training also lacked a GPU but had 2 cores running at up to 3 GHz. Additionally, it is worth mentioning that some background tasks were running in parallel on the local machine, potentially impacting its performance. The overall training time could be drastically improved by incorporating GPUs, as their architecture is highly suitable for machine learning tasks. Alternatively, utilizing hardware with more or faster CPUs could also contribute to significant performance enhancements.

Another crucial advantage of federated training is its inherent support for data privacy. With federated learning, only the model weights are shared among the clients, while the actual data remains stored locally. This ensures enhanced data privacy and confidentiality, addressing concerns associated with centralized data aggregation. Furthermore, federated training eliminates the scalability limitations typically associated with local machines. By adopting the federated learning approach, one can easily add more clients to the system, thereby increasing the training capacity and leveraging the collective knowledge of a larger network of participants.

In conclusion, federated training demonstrates its efficiency in terms of execution time, data privacy, and scalability, presenting a viable alternative to traditional local training methods. The incorporation of advanced hardware, such as GPUs or CPUs with higher performance capabilities, could further enhance the overall performance and accelerate the training process.

4.4 Ray Speedup

In order to improve the efficiency of data loading, preprocessing, and transformation, we leveraged the Ray framework, as discussed in Section 3.5. By parallelizing these operations with Ray, we aimed to reduce the overall processing time and enhance the scalability of our system. To evaluate the impact of Ray on the performance of data preparation, we measured the average time required to perform these operations, as shown in Table 2. It is worth noting that the initialization time of Ray is relatively high. However, as the number of movies increases, the benefits of utilizing Ray become more pronounced (see Figure 4). On our local machine (Intel(R) Core(TM) i5-8265U CPU @ 1.60 GHz, 1,800 MHz, 4 cores), Ray already demonstrated superior performance for more than 8 movies. This suggests that for a larger data set the time savings could be substantial, amounting to roughly 30 hours for 100,000 movies.
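A minimal sketch of how such a parallel preparation step can look with Ray is shown below. The helper names (audio_to_mel from the sketch in Section 3.1, list_snippets, movie_paths) are our own placeholders, not the project's actual code.

```python
import ray

ray.init()  # one-time startup cost (roughly 9 s in our measurements, see Table 2)

@ray.remote
def prepare_movie(path):
    """Load one movie's audio snippets and convert them to mel spectrograms."""
    return [audio_to_mel(snippet) for snippet in list_snippets(path)]

# Launch one task per movie; Ray schedules the tasks across the available cores
futures = [prepare_movie.remote(path) for path in movie_paths]
spectrograms = ray.get(futures)  # blocks until all tasks have finished
```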
Ray initialization time                     8.887 s
Data preparation with Ray (per movie)       4.185 s
Data preparation without Ray (per movie)    5.250 s

Table 2: Average time needed for Ray initialization and for preparing the data of one movie with and without Ray. When measuring the preparation time with Ray, the Ray startup time is not included.

Figure 4: Time needed for data preparation. At the dashed line, Ray begins to outperform the non-parallel variant. The red line shows how many movies we used in our experiment.

Table 2 provides the average time for Ray initialization and data preparation with and without Ray. It is important to note that when measuring the preparation time with Ray, the Ray startup time is not included in the calculation. The comparison clearly demonstrates the efficiency gains achieved by utilizing Ray for parallelization.

Furthermore, Figure 4 illustrates the time needed for data preparation. As the dashed line indicates, the point at which the benefits of parallelization become apparent is reached very early: from there on, Ray outperforms the non-parallel variant. The red line indicates the number of movies used in our experiment.

By incorporating Ray into our workflow, we improved the speed of data preprocessing and transformation even for our small data set. More importantly, we demonstrated the power of parallelizing workflows and its huge potential for large-scale data sets.

4.5 Model

After exploring the model described in Section 3.2, we also experimented with a modified version that had twice the number of nodes in each layer. However, despite the increase in complexity, the resulting accuracy did not exhibit an improvement. This observation suggests that the original model, with its relatively smaller layers, already had sufficient capacity to capture the relevant features and perform well on the classification task.

These findings indicate that model performance is not solely dependent on the number of nodes or the complexity of the architecture. Other factors, such as the effectiveness of the convolutional layers in feature extraction, the appropriate use of activation functions, and the regularization techniques employed, also play crucial roles in determining overall performance.

By acknowledging the limited impact of increasing the number of nodes in the linear layer, we can better understand the importance of designing a well-balanced and optimized architecture that leverages the strengths of different components. This knowledge can guide future model development and help researchers make informed decisions when tailoring models for specific tasks and data sets.

5 Conclusion

In this report, we have explored the domain of movie identification based on audio samples, leveraging advanced machine learning techniques and distributed learning methods. Our project aimed to investigate the feasibility of applying music recognition techniques to the task of movie recognition and to explore potential applications in the entertainment industry.

Through our experiments and analysis, we have made several key observations and findings. Firstly, we have demonstrated that existing music recognition techniques can be effectively adapted for movie identification, achieving promising results in terms of accuracy. By transforming audio snippets into mel spectrograms and employing a deep learning model with convolutional layers, we were able to capture relevant visual features and classify movies based on audio samples.
Additionally, we have evaluated the performance and efficiency of federated learning, a distributed machine learning approach that ensures data privacy and enables collaborative model training. By establishing a federated learning infrastructure using Amazon EC2 instances and the Flower framework, we explored different data distribution scenarios and compared the effectiveness of federated learning with local training methods. Our results showed that federated learning can achieve competitive accuracy while providing privacy-preserving training on decentralized data sources. Furthermore, we showed the limitations of distributed learning, particularly that performance can suffer when data is distributed based on specific characteristics.

Furthermore, we have optimized the data preprocessing pipeline by introducing parallelization using Ray. This optimization significantly improved the speed and efficiency of data loading and transformation processes, particularly for larger data sets, enhancing the scalability and performance of the overall system.

Overall, our findings suggest that movie identification based on audio samples is a promising field with practical applications in content discovery and the entertainment industry. By leveraging distributed learning methods like federated learning and incorporating advanced machine learning techniques, accurate movie recognition could be achieved while ensuring data privacy and scalability.

Moving forward, further research can be conducted to enhance the accuracy of movie identification, explore more sophisticated models, and investigate additional features beyond audio samples that can improve classification performance. Additionally, the application of federated learning can be extended to other domains and tasks to address privacy concerns and enable collaborative model training on decentralized data sources.

In conclusion, this study contributes to the ongoing research in audio recognition technologies and paves the way for future advancements in multimedia analysis, with potential applications in various industries beyond entertainment.

References

[1] Yan Ke, D. Hoiem, and R. Sukthankar. Computer vision for music identification. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), volume 1, pages 597–604, 2005.

[2] Mickael Rouvier, Georges Linares, and Driss Matrouf. Robust audio-based classification of video genre.

[3] Mickael Rouvier, Georges Linares, and Driss Matrouf. Factor analysis for audio-based video genre classification.

[4] Sanjay Jain and R.S. Jadon. Audio based movies characterization using neural network.

[5] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-Efficient Learning of Deep Networks from Decentralized Data. In Aarti Singh and Jerry Zhu, editors, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 1273–1282. PMLR, 2017.