
ADVANCED TOPICS IN SECURE AND DISTRIBUTED COMPUTING:
DISTRIBUTED LEARNING - MOVIE RECOGNITION BY AUDIO USING FEDERATED LEARNING
Jannis Voigt
Institute for Computer Science
University of Innsbruck
Innsbruck, Austria
jannis.voigt@student.uibk.ac.at
ABSTRACT
Distributed learning and federated learning have gained significant attention as techniques for training
machine learning models on decentralized data sources while respecting data privacy. This study
investigates the applicability of these approaches in the domain of movie recognition using audio
samples. Leveraging cutting-edge machine learning algorithms and frameworks, we distribute the
training process across multiple clients, enabling collaborative model training without relying on
centralized data aggregation. Various data distribution scenarios are explored, encompassing complete
data availability on all clients, allocation of different movies to specific clients, and partial data of
each movie on each client. Additionally, parallelized speedup techniques are employed during
data preprocessing to enhance efficiency. The efficacy of distributed and federated learning in
achieving accurate models while preserving data privacy is evaluated. Performance metrics, including
execution time, accuracy, and scalability, are analyzed to assess the effectiveness and efficiency of
these approaches. By bridging the domains of distributed learning, federated learning, and movie
recognition, this research contributes to advancing decentralized machine learning systems, offers
valuable insights for future developments, and highlights the benefits of parallelized speedup in data
preprocessing.
1 Introduction
Music recognition applications such as Shazam1 , SoundHound2 , and voice assistants have gained immense popularity
since the advent of the smartphone era and continue to be essential tools today. These applications have revolutionized
the way we identify songs by analyzing audio snippets and matching them to their corresponding tracks. Building upon
this technology, our project delves into the realm of movie recognition based on audio samples.
In this study, we aim to investigate the effectiveness of existing music recognition techniques in the domain of movie
identification. By leveraging state-of-the-art machine learning methods, we employ advanced algorithms that distribute
the training of our model. This approach not only enhances the realism of our model but also improves the training time
for large-scale data sets.
By extending the capabilities of audio recognition systems to movies, we hope to explore the potential of identifying
films from audio snippets. While we anticipate challenges and limitations in this task, we believe that even partial
success could have practical applications in the entertainment industry. Accurately identifying movies from audio
samples could provide users with useful information such as the film’s title and relevant details, facilitating content
discovery in the digital age.
1 https://www.shazam.com/de
2 https://www.soundhound.com/
Throughout this project, we will assess the feasibility of movie recognition based on audio samples, evaluating the
performance and limitations of our model. By conducting experimentation and analysis, we aim to contribute to the
ongoing research in audio recognition and distributed learning technologies and gain insights that can guide future
advancements in these topics.
2 Related works
A substantial body of research has focused on music recognition, providing valuable insights into various techniques
and approaches. One notable study by Yan Ke et al. titled "Computer Vision for Music Identification" [1] explores the
transformation of audio data into visual representations for music recognition. This innovative method has demonstrated
superior performance compared to older techniques, motivating us to adopt a similar approach in our project.
In addition to music recognition, several papers have investigated the classification of movie genres based on audio data.
However, most existing studies utilize the audio input as is, without any specific transformation or feature extraction.
Various machine learning techniques, including Support Vector Machines [2], Latent Factor Analysis [3], and Neural
Networks [4], have been employed to build genre classifiers. These classifiers have achieved promising results, with
accuracies reaching up to 90 percent. However, it is important to note that many of these classifiers primarily distinguish
between action and non-action movies/scenes, offering limited granularity in genre classification.
While the aforementioned research provides a foundation for our work, it is essential to address the specific challenges
and limitations that arise when applying music recognition techniques to movie identification. By building upon
the existing literature and incorporating advancements in audio analysis, we aim to extend the capabilities of audio
recognition systems and explore the potential for accurate movie identification based on audio snippets.
3 System description
Our system is designed to leverage distributed and federated learning techniques for training machine learning models
on decentralized data sources while ensuring data privacy. In this section, we provide further details on the data, the
model architecture, the infrastructure setup, the data distribution amongst clients and the approach to speeding up data
preprocessing.
3.1 Data
Our data set consists of a total of 2410 three-second audio snippets from 11 different movies. Following the approach from the related work described above, we transform these audio files into mel spectrograms3 with the help of librosa4, a Python package for music and audio analysis.
A mel spectrogram is a visual representation of the frequencies present in an audio signal over time. It is created by applying the short-time Fourier transform to the audio signal, which decomposes successive windows of the signal into their constituent frequencies. The resulting spectrogram represents the amplitudes of these frequencies as a function of time.
Furthermore, the mel scale is a frequency scale that approximates the way humans perceive pitch. It divides the frequency
range into perceptually equal intervals by applying a log scale to the frequency axis, giving more resolution to lower
frequencies and less resolution to higher frequencies. This scale is designed to better match the way our ears perceive
sound, as human hearing is more sensitive to changes in lower frequencies than higher frequencies.
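One widely used variant of this mapping (the HTK formula; librosa also offers a Slaney-style variant) converts a frequency f in Hz to mels as

m = 2595 \log_{10}\left(1 + \frac{f}{700}\right),

so that equal steps on the mel axis correspond to approximately equal perceived pitch steps.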
The mel spectrogram is particularly suited for music recognition tasks because it captures both temporal and spectral
information.
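As a minimal sketch of this transformation with librosa: the file path below is illustrative, and the sampling rate and hop length are librosa defaults rather than values fixed by our pipeline.

import librosa
import numpy as np

# Load one 3-second audio snippet (path and sampling rate are illustrative).
y, sr = librosa.load("snippets/movie_042.wav", sr=22050, duration=3.0)

# Compute a mel spectrogram with 128 mel bands, matching the 128-row model
# input, and convert the power values to a log (dB) scale.
S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
S_db = librosa.power_to_db(S, ref=np.max)

print(S_db.shape)  # (128, n_frames); the frame count depends on the hop length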
3.2 Model
The input data has a size of 128 × 469. As convolutional layers are well suited for visual analysis, we start with three convolutional layers, each followed by a ReLU activation function, a batch normalization layer, and a pooling layer (except after the third and last convolutional layer). We then flatten the output of the last convolutional layer and pass it as input to a linear layer with 11 output nodes activated by a softmax layer for classification.
The model incorporates several key features and their associated benefits:
3 https://en.wikipedia.org/wiki/Mel-frequency_cepstrum
4 https://librosa.org/doc/main/index.html
Figure 1: Example mel spectrogram (source: https://librosa.org/doc/main/generated/librosa.feature.melspectrogram.html).
1. Convolutional Layers: We employ three convolutional layers to extract hierarchical features from the input
data. Convolutional layers are particularly well-suited for visual analysis due to their ability to capture local
patterns and spatial relationships within images. By applying a set of learnable filters to the input data, these
layers can detect relevant visual features at different scales.
2. ReLU Activation Function: Following each convolutional layer, we apply a rectified linear unit (ReLU)
activation function. The ReLU function introduces non-linearity to the model, allowing it to capture complex
relationships between the input features. ReLU activation has the benefit of being computationally efficient
and preventing the vanishing gradient problem, which can impede training convergence.
3. Batch Normalization Layer: To improve stability and speed up the training process, we incorporate a batch
normalization layer. This layer normalizes the activations of the previous layer across a batch of data, reducing
the internal covariate shift. Batch normalization helps to alleviate the problem of unstable gradients during
training and facilitates faster convergence by allowing higher learning rates to be used.
4. Pooling Layer: After the ReLU activation, we apply a pooling layer (except for the third and last convolutional
layer). The pooling operation downsamples the feature maps, reducing their spatial dimensions while retaining
important information. Pooling helps to achieve translation invariance, making the model more robust to
variations in the input data and reducing the computational complexity of subsequent layers.
5. Flatten Operation: The output of the last convolutional layer is flattened into a one-dimensional vector. This
flattening operation transforms the spatially arranged feature maps into a format suitable for linear layers,
enabling the model to learn global relationships among the extracted features.
6. Linear Layer and Softmax Activation: The flattened feature vector is then passed through a linear layer with 11 output nodes. This linear layer captures high-level representations of the input data. Finally, a softmax activation function is applied to the output of the linear layer to produce probability scores for each class in the classification task. Softmax ensures that the predicted class probabilities sum to one, facilitating easy interpretation and enabling the selection of the most likely class label.

Figure 2: Overview of federated learning (source: https://gemmo.ai/federated-learning).
By employing these architectural choices, our model effectively captures relevant visual features in the input data. The
combination of convolutional layers, activation functions, batch normalization, pooling, and linear layers contributes to
the model’s ability to extract hierarchical representations, achieve translation invariance, and handle complex visual
patterns, ultimately enhancing its overall performance.
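A minimal PyTorch sketch of this architecture is shown below. The text above fixes the input size (128 × 469), the layer types and their order, and the 11 output classes; the channel counts, kernel sizes, and pooling parameters are illustrative assumptions.

import torch
import torch.nn as nn

class MovieAudioCNN(nn.Module):
    def __init__(self, n_classes=11):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),   # conv layer 1
            nn.ReLU(),
            nn.BatchNorm2d(16),
            nn.MaxPool2d(2),                              # pooling after conv 1
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # conv layer 2
            nn.ReLU(),
            nn.BatchNorm2d(32),
            nn.MaxPool2d(2),                              # pooling after conv 2
            nn.Conv2d(32, 64, kernel_size=3, padding=1),  # conv layer 3: no pooling
            nn.ReLU(),
            nn.BatchNorm2d(64),
        )
        # Two 2x2 poolings shrink the 128 x 469 input to 32 x 117 feature maps.
        self.classifier = nn.Linear(64 * 32 * 117, n_classes)

    def forward(self, x):
        x = self.features(x)               # x: (batch, 1, 128, 469)
        x = torch.flatten(x, start_dim=1)  # flatten to (batch, 64 * 32 * 117)
        return torch.softmax(self.classifier(x), dim=-1)

model = MovieAudioCNN()
probs = model(torch.randn(1, 1, 128, 469))  # probability scores for the 11 movies

In practice one would typically return raw logits and fold the softmax into nn.CrossEntropyLoss; the softmax is kept explicit here to mirror the description above.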
3.3 Infrastructure
In order to evaluate the efficiency of federated learning and compare it with local training, we established an infrastructure consisting of multiple Amazon EC2 instances5, each serving a specific role in the federated learning setup.
Firstly, we deployed one EC2 instance as a server responsible for coordinating the federated learning process. This
server utilized the FedAvg[5] strategy, a widely adopted algorithm for federated learning that aggregates model updates
from the participating clients.
Additionally, we leveraged several EC2 instances as clients, actively participating in the training procedure. These
client instances were responsible for performing the actual model training using the federated learning approach. By
distributing the training workload among multiple clients, we aimed to showcase the collaborative and decentralized
nature of federated learning.
To facilitate the training process, we opted for EC2 instances equipped with the Deep Learning AMI GPU PyTorch 2.0.1
(Amazon Linux 2) image. This pre-configured image provided a streamlined setup for deep learning frameworks and
GPU support, enabling efficient execution of computationally intensive tasks. Specifically, we utilized EC2 instances of
the t2.large instance type, which offered a suitable balance between computational capacity and cost-effectiveness.
By designing and utilizing this infrastructure, we were able to effectively implement and evaluate the performance
of federated learning in comparison to local training. The combination of EC2 instances, the FedAvg strategy, and
the Deep Learning AMI GPU PyTorch 2.0.1 image provided a reliable and efficient environment for conducting our
experiments and drawing meaningful conclusions about the benefits and drawbacks of federated learning in our specific
context.
5 https://aws.amazon.com/de/pm/ec2/
3.4 Federated learning
Federated learning, a distributed machine learning approach, offers a promising solution for training models on
decentralized data sources while ensuring data privacy. In this section, we delve deeper into the implementation of
federated learning in our experiments and highlight its key features.
To enable federated learning, we utilized the Flower6 framework, an open-source Python package specifically designed
for implementing federated learning algorithms. Flower simplifies the development and deployment of federated
learning systems, providing a comprehensive set of tools and functionalities, like handling critical tasks such as
communication, synchronization, and aggregation of model updates.
One of the key advantages of federated learning is its ability to train models without requiring centralized data
aggregation. Instead of transmitting raw data to a central server, which poses privacy concerns, federated learning
enables the clients to perform local training on their respective data sets while exchanging model updates with the
central server. This decentralized approach ensures data privacy while allowing for collaborative model training.
In our experiments, we focused on exploring different data distribution strategies within the federated learning
framework. We considered scenarios where all data resided on each client, specific movies were allocated to specific
clients, and all movies were present on each client, but with varying data samples. These diverse data distribution setups
allowed us to assess the robustness of federated learning in handling heterogeneous and decentralized data.
The coordination and synchronization of the federated learning process were facilitated by the central server, which
utilized the FedAvg[5] strategy. FedAvg involves aggregating the model updates received from the participating clients
by calculating their average and distributing the updated weights back to the clients. This collaborative approach allows
the server to leverage the collective knowledge of all participants while preserving the privacy of their individual data.
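Concretely, if K clients participate in a round and client k holds n_k of the n = \sum_{k=1}^{K} n_k total samples, FedAvg computes the new global weights as the data-weighted average of the locally updated client weights [5]:

w_{t+1} = \sum_{k=1}^{K} \frac{n_k}{n} \, w_{t+1}^{k}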
By designing and implementing federated learning in our infrastructure, we aimed to evaluate its performance and
efficiency in comparison to traditional local training methods. We assessed its effectiveness in different data distribution
scenarios, considering factors such as privacy preservation, scalability, and collaborative learning.
In the Evaluation section, we present the results and analysis of our federated learning experiments, focusing on the
performance and effectiveness of the federated learning approach. We discuss the benefits and challenges of federated
learning in our specific context, highlighting its potential for decentralized training and collaborative knowledge sharing.
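To make the setup concrete, the following is a minimal sketch of a Flower client and server, assuming the flwr 1.x API and the PyTorch model from Section 3.2; the train_one_round and evaluate_model helpers, the port, and the round count are hypothetical placeholders.

from collections import OrderedDict

import flwr as fl
import torch

def set_parameters(model, parameters):
    # Copy the aggregated weights received from the server into the model.
    keys = model.state_dict().keys()
    state = OrderedDict((k, torch.tensor(v)) for k, v in zip(keys, parameters))
    model.load_state_dict(state, strict=True)

class MovieClient(fl.client.NumPyClient):
    def get_parameters(self, config):
        return [v.cpu().numpy() for v in model.state_dict().values()]

    def fit(self, parameters, config):
        set_parameters(model, parameters)
        train_one_round(model, trainloader)  # hypothetical local training helper
        return self.get_parameters({}), len(trainloader.dataset), {}

    def evaluate(self, parameters, config):
        set_parameters(model, parameters)
        loss, acc = evaluate_model(model, testloader)  # hypothetical helper
        return float(loss), len(testloader.dataset), {"accuracy": float(acc)}

# On each EC2 client instance:
#   fl.client.start_numpy_client(server_address="<server-ip>:8080", client=MovieClient())

# On the coordinating EC2 server instance:
#   fl.server.start_server(
#       server_address="0.0.0.0:8080",
#       config=fl.server.ServerConfig(num_rounds=10),  # round count is illustrative
#       strategy=fl.server.strategy.FedAvg(),
#   )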
3.5 Speeding up data preparation
To optimize the data preprocessing, we introduced parallelization using Ray7 , a framework specifically designed to
scale AI and Python workloads. By leveraging Ray, we achieved faster data loading and transformation processes,
particularly for larger data sets.
In our implementation, we parallelized both the data loading and the transformation step, which involved converting the
raw audio data into mel spectrograms. By distributing the workload across multiple cores, we were able to process
multiple data samples simultaneously.
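A minimal sketch of this parallelization with Ray is given below; the directory layout and the mel conversion mirror the librosa snippet from Section 3.1 and are assumptions rather than our exact pipeline.

import glob

import librosa
import numpy as np
import ray

ray.init()  # starts the local worker pool; this startup cost is reported separately

@ray.remote
def load_mel(path):
    # Load one snippet and convert it to a log-scaled mel spectrogram.
    y, sr = librosa.load(path, sr=22050, duration=3.0)
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
    return librosa.power_to_db(S, ref=np.max)

paths = glob.glob("snippets/*.wav")            # illustrative directory layout
futures = [load_mel.remote(p) for p in paths]  # tasks dispatched across all cores
spectrograms = ray.get(futures)                # blocks until every task has finished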
While our initial experiments with a relatively small data set did not demonstrate substantial improvements in processing
speed, the benefits of parallelization become more pronounced as the data set size increases. In the Evaluation section,
we present results that highlight the efficiency gains achieved with larger data sets, illustrating the advantages of using
Ray for scaling up data preprocessing.
By incorporating parallelization techniques through Ray, we enhanced the speed and efficiency of the data loading and
mel spectrogram transformation steps. This optimization not only saves time but also enables researchers to process
larger data sets more effectively, ultimately improving the overall scalability and performance of the preprocessing
pipeline.
4 Evaluation
In this section, we evaluate the performance and effectiveness of our federated learning approaches and explore their
advantages over local training methods. We also examine the impact of utilizing the Ray framework for parallelization
and discuss findings related to the model architecture.
6 https://flower.dev/
7 https://www.ray.io/
           Method 1    Method 2    Method 3
Time       18 min      8.3 min     8 min
Accuracy   46%         37%         48%

Table 1: Execution time per round of training and accuracy for the three different federated learning methods applied.
Figure 3: Accuracy and training time for different FL Methods.
4.1 Comparison of Federated Learning approaches
In our exploration of federated learning, we investigated three different approaches, each offering unique data distribution
scenarios:
1. Complete data available on all clients
2. Different movies for each client
3. Partial data of each movie on each client
We now proceed to compare these approaches based on their respective execution times and accuracy. The first approach
achieved an overall accuracy of 46%, while the second approach yielded an average accuracy of 37%. The third
approach demonstrated a slightly higher accuracy of 48%. However, it is worth noting that the first approach took more
than twice as long as the second and third approaches due to the larger volume of data processed. Table 1 presents a
summary of the execution times and accuracies for the three different federated learning methods employed. Figure 3
illustrates the accuracy and training time for the different federated learning methods, providing a visual representation
of their performance characteristics.
Although these accuracies may not appear exceptionally high at first glance, considering that they are more than 5 times better than random guessing (1/11 ≈ 9%), and the fact that the audio snippets were only 3 seconds long, the achieved results
are quite promising. Furthermore, we successfully demonstrated the potential for significant speed improvements by
distributing the data among the clients, without compromising the performance of the trained model.
Interestingly, the second approach showcased considerable variations in accuracy across different clients. For random
movie allocation, the best-performing client achieved an accuracy of 60%, while the other clients attained accuracies of
30% and 17% respectively. This discrepancy suggests that some movies are relatively easy to classify, while others
pose more challenges. We will delve deeper into this topic in subsequent discussions.
4.2 Genre-Based Data Partitioning and Accuracy Variations
In the second approach where each client trains on different movies, we observed significant variations in accuracy
among the clients, even when grouping the movies randomly. To gain deeper insights, we conducted an analysis by
grouping the movies based on their similarity, which led to some interesting findings.
Upon grouping the movies into three categories (Musicals, Action, and Thought-provoking / Epic movies), we obtained the following accuracies for each group:
1. Musicals (Mamma Mia, Shrek): 11.2%
2. Action (Top Gun, Fifth Element, Mad Max, Kill Bill): 35.3%
3. Thought-provoking / Epic movies (Shutter Island, Space Odyssey, Inception, V for Vendetta): 47.9%
We discovered that distinguishing musicals, such as Mamma Mia and Shrek, proved to be remarkably challenging, and
they derived little benefit from the shared knowledge among the clients. In contrast, epic and thought-provoking movies
like Inception demonstrated the most significant improvement in accuracy and were comparatively easier to distinguish,
despite there being a substantially greater number of movies within this genre.
These findings shed light on the varying levels of complexity associated with different movie genres when using
federated learning. Musicals present a considerable challenge, likely due to their unique characteristics and nuances
that make them harder to classify accurately. On the other hand, epic and thought-provoking movies exhibit distinct
patterns and features that are more easily captured by the trained model, resulting in enhanced accuracy.
By organizing the clients' data based on movie genres, we obtained an overall accuracy of 31.5%. This finding suggests that the training process is more successful when each client's data is diverse and covers a wide range of genres: random grouping of the data yields an accuracy roughly 5 percentage points higher than genre-based grouping, and when each client has access to data from all movies, the accuracy improves by an additional 10 percentage points. These results underscore the inherent weaknesses of distributed or federated learning, particularly when data is partitioned based on specific characteristics.
4.3 Federated training vs. local training
When comparing our best federated learning result (method 3) to the results obtained from local training, we observed
similar accuracies. However, the federated learning approach showcased its superiority in terms of execution time
efficiency. Local training took more than six times longer to complete. It is important to note that the execution time
can vary depending on the hardware used. In our case, the machine employed for local training lacked a GPU and
consisted of an Intel(R) Core(TM) i5-8265U CPU @ 1.60GHz, 1800 MHz, with 4 cores. Conversely, the EC2 instances utilized for distributed training also lacked a GPU but had 2 cores with up to 3 GHz. Additionally, it is worth mentioning
that some background tasks were running in parallel on the local machine, potentially impacting its performance.
It is evident that the overall training time could be drastically improved by incorporating GPUs, as their architecture
is highly suitable for machine learning tasks. Alternatively, utilizing hardware with more or faster CPUs could also
contribute to significant performance enhancements.
Another crucial advantage of federated training is the inherent capability for data privacy. With federated learning, only
the model weights are shared among the clients, while the actual data remains stored locally. This ensures enhanced
data privacy and confidentiality, addressing concerns associated with centralized data aggregation.
Furthermore, federated training eliminates the scalability limitations typically associated with local machines. By
adopting the federated learning approach, one can easily add more clients to the system, thereby increasing the training
capacity and leveraging the collective knowledge of a larger network of participants.
In conclusion, federated training demonstrates its efficiency in terms of execution time, data privacy, and scalability,
presenting a viable alternative to traditional local training methods. The incorporation of advanced hardware, such as
GPUs or CPUs with higher performance capabilities, could further enhance the overall performance and accelerate the
training process.
4.4 Ray Speedup
In order to improve the efficiency of data loading, preprocessing, and transformation, we leveraged the Ray framework,
as discussed in Section 3.5. By parallelizing these operations with Ray, we aimed to reduce the overall processing time
and enhance the scalability of our system.
To evaluate the impact of Ray on the performance of data preparation, we measured the average time required to perform
these operations, as shown in Table 2. It is worth noting that the initialization time of Ray is relatively high. However, as the number of movies increases, the benefits of utilizing Ray become more pronounced (see Figure 4). On our local machine (Intel(R) Core(TM) i5-8265U CPU @ 1.60GHz, 1800 MHz, 4 cores), Ray already demonstrated superior performance for more than 8 movies. This suggests that for a larger data set the time savings could be significant: at roughly 1.07 s saved per movie (5.250 s - 4.185 s), they amount to roughly 30 hours for 100,000 movies.
Ray init time    Data preparation with Ray    Data preparation without Ray
8.887 s          4.185 s                      5.250 s

Table 2: Average time needed for Ray initialization and for preparing the data of one movie with and without Ray. The Ray startup time is not included in the with-Ray preparation time.
Figure 4: Time needed for data preparation. At the dashed line, Ray begins to outperform the non-parallel variant. The red line shows how many movies we used in our experiment.
Table 2 provides the average time for Ray initialization and data preparation with and without Ray. It is important to
note that when measuring the preparation time with Ray, the Ray startup time is not included in the calculation. The
comparison clearly demonstrates the efficiency gains achieved by utilizing Ray for parallelization.
Furthermore, Figure 4 illustrates the time needed for data preparation. As indicated by the dashed line, which marks the point at which the benefits of parallelization become apparent, Ray begins to outperform the non-parallel variant very early. The red line indicates the number of movies used in our experiment.
By incorporating Ray into our workflow, we improved the speed of data preprocessing and transformation even for our small data set. More importantly, we demonstrated the power of parallelizing workflows and the huge potential for large-scale data sets.
4.5 Model
After exploring the model described above (Section 3.2), we also experimented with a modified version that had twice the number of nodes in each layer. However, despite the increase in complexity, the resulting accuracy did not exhibit an improvement. This observation suggests that the original model, with its relatively smaller layers, already had sufficient capacity to capture the relevant features and perform well on the classification task.
These findings indicate that model performance is not solely dependent on the number of nodes or the complexity of the
architecture. Other factors, such as the effectiveness of the convolutional layers in feature extraction, the appropriate
use of activation functions, and the regularization techniques employed, also play crucial roles in determining overall
performance.
By acknowledging the limited impact of increasing the number of nodes in the linear layer, we can better understand the
importance of designing a well-balanced and optimized architecture that leverages the strengths of different components.
This knowledge can guide future model development and help researchers make informed decisions when tailoring
models for specific tasks and data sets.
5 Conclusion
In this report, we have explored the domain of movie identification based on audio samples, leveraging advanced
machine learning techniques and distributed learning methods. Our project aimed to investigate the feasibility of
applying music recognition techniques to the task of movie recognition and explore the potential for applications in the
entertainment industry.
Through our experiments and analysis, we have made several key observations and findings. Firstly, we have demonstrated that existing music recognition techniques can be effectively adapted for movie identification, achieving
promising results in terms of accuracy. By transforming audio snippets into mel spectrograms and employing a deep
learning model with convolutional layers, we were able to capture relevant visual features and classify movies based on
audio samples.
Additionally, we have evaluated the performance and efficiency of federated learning, a distributed machine learning
approach that ensures data privacy and enables collaborative model training. By establishing a federated learning
infrastructure using Amazon EC2 instances and the Flower framework, we explored different data distribution scenarios
and compared the effectiveness of federated learning with local training methods. Our results showed that federated
learning can achieve competitive accuracy while providing privacy-preserving training on decentralized data sources.
We also showed the limitations of distributed learning, in particular that performance can suffer when data is distributed based on specific characteristics.
Furthermore, we have optimized the data preprocessing pipeline by introducing parallelization using Ray. This
optimization significantly improved the speed and efficiency of data loading and transformation processes, particularly
for larger data sets, enhancing the scalability and performance of the overall system.
Overall, our findings suggest that movie identification based on audio samples is a promising field with practical
applications in content discovery and the entertainment industry. By leveraging distributed learning methods like
federated learning and incorporating advanced machine learning techniques, accurate movie recognition could be
achieved while ensuring data privacy and scalability.
Moving forward, further research can be conducted to enhance the accuracy of movie identification, explore more
sophisticated models, and investigate additional features beyond audio samples that can improve the classification
performance. Additionally, the application of federated learning can be extended to other domains and tasks to address
privacy concerns and enable collaborative model training on decentralized data sources.
In conclusion, this study contributes to the ongoing research in audio recognition technologies and paves the way for
future advancements in multimedia analysis, with potential applications in various industries beyond entertainment.
References
[1] Yan Ke, D. Hoiem, and R. Sukthankar. Computer vision for music identification. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), volume 1, pages 597-604, 2005.
[2] Mickael Rouvier, Georges Linares, and Driss Matrouf. Robust audio-based classification of video genre.
[3] Mickael Rouvier, Georges Linares, and Driss Matrouf. Factor analysis for audio-based video genre classification.
[4] Sanjay Jain and R.S. Jadon. Audio based movies characterization using neural network.
[5] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-Efficient Learning of Deep Networks from Decentralized Data. In Aarti Singh and Jerry Zhu, editors, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 1273-1282. PMLR, 20-22 Apr 2017.