
Lip Reading Synopsis

PROJECT SYNOPSIS

DEPARTMENT: ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING

TITLE OF THE PROJECT: LIP-READING FOR SPEECH RECOGNITION

STUDENT NAMES / USN / PHONE / MAIL ID:
POOJA SINGARI - 1DS20AI040 - 20beam062@dsce.edu.in
K SUJAN RAO CH - 1DS20AI024 - 20beam021@dsce.edu.in
SAMIKSHA MEHTA - 1DS20AI020 - 20beam006@dsce.edu.in
NAGANOSHITH - 1DS20AI050 - 20beam061@dsce.edu.in

PROJECT TIMELINE (Tentative Start Date - End Date): September 2023 to December 2023

PROJECT GUIDE: PROF. YASHASWINI B M

FIELD OF PROJECT: Deep Learning

PROJECT INTRODUCTION
Human perception of speech is intrinsically multimodal, involving both audition and
vision. Speech production is accompanied by movements of the lips and teeth,
which can be visually interpreted to understand speech. Vision plays a crucial role
in speech understanding, and the importance of using visual information to
improve the performance and robustness of speech recognition has been
demonstrated. Visual cues of speech not only play an essential role in language
learning for pre-lingual children but also improve speech understanding in noisy
environments and provide people with speech impairments a means of
communication. Furthermore, perceptual studies have shown that such visual cues
can alter the perceived sound.
Machine learning models can be trained to recognize lip movements and convert them
into text, providing a way to automatically transcribe speech without relying on audio
input. Lip reading machine learning models typically use deep neural networks to
process video frames and extract features that are relevant for speech recognition.
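Since the tools list of this synopsis includes OpenCV and Python, a minimal preprocessing sketch along these lines could turn raw video into model input. It is an illustration only: the fixed crop coordinates and the helper name load_lip_frames are assumptions, and a real pipeline would locate the mouth with a face or landmark detector instead of a fixed crop.

# Minimal preprocessing sketch (not the project's final pipeline): it assumes the
# mouth region can be approximated by a fixed crop of each frame.
import cv2
import numpy as np

def load_lip_frames(video_path, crop=(190, 236, 80, 220), size=(100, 50)):
    """Read a video, crop a (placeholder) mouth region from every frame,
    and return a (time, height, width, 1) array ready for a 3D CNN."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        y1, y2, x1, x2 = crop                      # placeholder coordinates
        mouth = cv2.resize(gray[y1:y2, x1:x2], size)
        frames.append(mouth)
    cap.release()
    video = np.array(frames, dtype=np.float32) / 255.0
    return video[..., np.newaxis]                  # add a channel dimension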
In the 1970s and 1980s, researchers developed early lip-reading systems that used
rule-based approaches to recognize speech from video input. These systems were
limited by the availability of computational resources and the lack of large labeled
datasets for training.
In the 1990s and early 2000s, researchers began to use feature-based approaches to lip
reading, which involved extracting specific features from video frames, such as the
shape of the lips and the position of the tongue. These approaches were able to
achieve some success in recognizing isolated words but struggled with more complex
speech patterns. In recent years, deep learning approaches have become the dominant
paradigm in lip reading research. These approaches use convolutional neural networks
(CNNs) and recurrent neural networks (RNNs) to process video frames and learn to
recognize patterns in the visual input. Deep learning approaches have achieved state-of-the-art performance on lip reading benchmarks, and have the potential to improve
communication for individuals who are deaf or hard of hearing. Another recent
development in lip reading recognition is the use of multimodal approaches, which
combine visual input from lip reading with audio input from speech recognition
systems. By combining these two modalities, researchers hope to improve the
accuracy and robustness of lip reading systems, particularly in noisy or adverse
conditions.
Google, Apple, DeepZen and SenseTime are just a few examples of companies
working on lip-reading models. As interest in this
technology grows, it is likely that more companies will begin to develop lip reading
systems for a variety of applications. The lip reading project aims to develop machine
learning models that can recognize speech by visually interpreting the movements of
the lips, tongue, and other facial features. This technology has the potential to
improve communication for individuals who are deaf or hard of hearing, as well as
enhance speech recognition in noisy or crowded environments, enable more natural
human-robot interactions, and support language learning.
Overall, the lip reading project has the potential to transform the way we
communicate and interact with technology, and holds great promise for improving
accessibility and inclusivity for individuals with hearing impairments.
Literature Survey Summary
● Lip reading is a challenging task due to the variability of lip movements and
the complexity of visual cues involved in speech production. Traditional
approaches to lip reading have relied on hand-crafted features and statistical
models, which are designed to capture the shape and motion of the lips, but
these methods have limited accuracy and robustness. In recent years, deep
learning approaches have shown promise in improving lip reading performance
by automatically learning features from raw video frames and capturing
temporal dependencies in lip movements.
● Convolutional neural networks (CNNs) and recurrent neural networks (RNNs)
are among the most commonly used deep learning architectures for lip reading.
CNNs are effective at learning local spatial features from video frames, while
RNNs are suitable for modeling temporal dependencies in lip movements.
Some researchers have also explored the use of attention mechanisms, which
allow the models to focus on relevant regions of the video frames and improve
their robustness to variations in lighting, head movements, and facial
expressions.
● In addition to architecture design, data augmentation techniques have also been
shown to be effective in improving the performance of lip reading models.
These techniques involve applying random transformations to the video
frames, such as cropping and flipping, to increase the size and diversity of the
training dataset. This helps the models to generalize better to unseen data and
reduces overfitting.
● Despite the recent advances in deep learning for lip reading, there are still
several challenges to overcome. One major challenge is the need for large
amounts of annotated data, which is time-consuming and expensive to collect.
Another challenge is the variability in speech and language, which can make it
difficult for the models to generalize to new speakers and languages.
● The architecture used in the most recent research is based on an off-the-shelf
architecture, which has achieved state-of-the-art performance on the LRS2 and
LRS3 datasets without the use of external data. The VSR front-end is based on
a modified ResNet-18, where the first layer is a spatio-temporal convolutional
layer with a kernel size of 5 × 7 × 7 and a stride of 1 × 2 × 2 (a minimal sketch
of this front-end follows this survey). The temporal
back-end, which follows the front-end, is a Conformer. Similarly, the ASR
encoder consists of a 1D ResNet-18 followed by a Conformer. The ASR and
VSR encoder outputs are fused via a multi-layer perceptron (MLP). The rest of
the network consists of a projection layer and a Transformer decoder for joint
CTC/attention training.
● The AUTO-AVSR system proposed in the recent papers achieves state-of-the-art performance on both LRS2 and LRS3 datasets, with a WER of 10.5% and
16.1%, respectively. This represents a significant improvement over the
previous state-of-the-art results, which had WERs of 12.4% and 18.6%,
respectively. The authors also report that their system outperforms several
existing audio-only, video-only, and audio-visual baselines on both datasets.
● The approach used is to pre-train ASR and VSR models using self-supervised
learning on unlabelled datasets such as AVSpeech and VoxCeleb2. The authors
then use publicly available pre-trained ASR models to automatically transcribe
the unlabelled data, and train ASR, VSR, and AV-ASR models on the augmented
training set, which consists of the LRS2 and LRS3 datasets as well as the
additional automatically transcribed data.
● The proposed AUTO-AVSR system uses a modified ResNet-18 for the VSR
front-end, a Conformer for the temporal back-end, a 1D ResNet-18 followed
by a Conformer for the ASR encoder, and a multi-layer perceptron (MLP) to
fuse the ASR and VSR encoder outputs. The rest of the network consists of a
projection layer and a Transformer decoder for joint CTC/attention training.
● Overall, the literature suggests that deep learning approaches have the potential
to significantly improve the performance of lip reading systems and enable
better communication for people with hearing impairments and enhanced
speech recognition in noisy environments. However, further research is needed
to overcome the challenges and limitations of current approaches and to
explore new possibilities for improving the accuracy and usability of lip
reading technology.
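As referenced in the survey above, a rough Keras sketch of the described spatio-temporal front-end might look like the following: a Conv3D stem with kernel 5x7x7 and stride 1x2x2, followed by a small 2D residual trunk applied per frame. The filter counts, the assumed input of 75 frames of 88x88 mouth crops, and the shallow trunk are illustrative stand-ins for the full modified ResNet-18; the Conformer back-end, MLP fusion and CTC/attention decoder from the papers are omitted.

# Hedged sketch of a VSR front-end: 3D stem convolution plus per-frame 2D trunk.
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    # Two 3x3 convolutions with an identity (or 1x1-projected) shortcut.
    shortcut = x
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same")(x)
    if shortcut.shape[-1] != filters:
        shortcut = layers.Conv2D(filters, 1, padding="same")(shortcut)
    return layers.ReLU()(layers.Add()([x, shortcut]))

def build_2d_trunk(height, width, channels):
    # Per-frame feature extractor; a stand-in for the deeper ResNet-18 trunk.
    inp = layers.Input((height, width, channels))
    x = residual_block(inp, 64)
    x = residual_block(x, 128)
    x = layers.GlobalAveragePooling2D()(x)
    return tf.keras.Model(inp, x)

def build_vsr_frontend(frames=75, height=88, width=88, channels=1):
    inp = layers.Input((frames, height, width, channels))
    # Spatio-temporal stem: kernel 5x7x7, stride 1x2x2 (temporal stride kept at 1).
    x = layers.Conv3D(64, (5, 7, 7), strides=(1, 2, 2),
                      padding="same", activation="relu")(inp)
    x = layers.MaxPool3D((1, 2, 2))(x)
    # Run the 2D trunk on every time step, giving one feature vector per frame.
    trunk = build_2d_trunk(x.shape[2], x.shape[3], x.shape[4])
    x = layers.TimeDistributed(trunk)(x)           # (batch, frames, feature_dim)
    return tf.keras.Model(inp, x, name="vsr_frontend")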
PROJECT PROBLEM STATEMENT AND CHALLENGES
“Lip reading is an essential technology for speech recognition in noisy
environments and for people with hearing impairments. Traditional lip reading
methods have limited accuracy and robustness due to the variability of lip
movements and the complexity of visual cues involved in speech production.
Deep learning approaches have shown promise in improving lip reading
performance, but there are still several challenges to overcome.”
● Evaluate the performance of several key architectures for the categories of
unseen and overlapped speakers.
● The main aim is to recognize what the speaker is saying merely from lip
movement. A crucial step is to mark the lips using four points for the left,
right, upper and lower lip. Features are extracted on the basis of the change
in the position of the lips while speaking.
● The primary reason lipreading is difficult is that much of the image in the
video input remains unchanged—the movement of the lips is the biggest
distinction. However, it is possible to perform action recognition, which is
a type of video classification, from a single image.
● The task is to derive the characteristics relevant to the speech content from a
single image and to analyze the time relationship between the entire series of
images to infer the content. The key problem with lip reading is visual
ambiguity.
● LipNet is the baseline model in our study; therefore, we have to evaluate it
for the categories of unseen and overlapped speakers.
● Lip-reading is often performed in noisy environments where the audio
signal is degraded or unavailable. This makes it challenging to
disambiguate between similar-looking lip movements, which can lead to
errors in transcription.
● Deep learning models can be complex and difficult to interpret, which
makes it challenging to diagnose errors or understand the reasoning behind
the model's predictions. This can be especially problematic in lip-reading
applications where errors can have significant consequences.
● Real-time lip-reading applications require fast and efficient models that
can process video frames in real time. Achieving real-time performance
requires careful optimization of the model architecture and
implementation.
OBJECTIVES OF THE PROJECT

The project aims to achieve the following goals:
1. Develop a deep learning model that can accurately recognize speech from lip
movements captured in video footage.
2. Improve the robustness and accuracy of the lip reading model.
3. Explore the use of different neural network architectures, such as
convolutional neural networks (CNNs) and recurrent neural networks
(RNNs), to optimize the performance of the lip reading model.
4. Investigate the impact of data augmentation techniques, such as random
cropping and flipping, on the performance of the lip reading model (a minimal
augmentation sketch is given after this list).
5. Investigate the potential applications of the lip reading model in areas such as
speech recognition for people with hearing impairments, speech enhancement
in noisy environments, and speech recognition in multi-modal settings.
6. Develop a user-friendly interface for the lip reading system, which allows
users to upload video footage and receive real-time speech recognition
results.
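For objective 4, one way to implement video-level augmentation is sketched below: a single random crop offset and a single flip decision are sampled per clip and applied to every frame, so that lip motion stays temporally consistent. The crop size is an assumption relative to the frame size, not a value fixed by the synopsis.

# Hedged sketch of clip-consistent random crop and horizontal flip.
import numpy as np

def augment_clip(clip, crop_hw=(44, 88), rng=np.random.default_rng()):
    """clip: (time, height, width, channels) array."""
    t, h, w, c = clip.shape
    ch, cw = crop_hw                              # assumed crop smaller than frame
    y = rng.integers(0, h - ch + 1)
    x = rng.integers(0, w - cw + 1)
    out = clip[:, y:y + ch, x:x + cw, :]          # same crop for every frame
    if rng.random() < 0.5:
        out = out[:, :, ::-1, :]                  # flip all frames horizontally
    return out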
ARCHITECTURE
1. 3D CNN with BiGRU: This hybrid architecture combines the strengths of a
3D Convolutional Neural Network (CNN) and a Bidirectional Gated Recurrent
Unit (BiGRU). The 3D CNN extracts spatial and temporal features from video
sequences, while the BiGRU captures long-term temporal dependencies.
Integrating both models enhances the understanding of lip movements,
providing a robust foundation for accurate lip-reading predictions (a minimal
Keras sketch of this design is given after this list).
2. 3D CNN with BiGRU and LSTM: This comprehensive architecture combines the
effectiveness of 3D Convolutional Neural Network (CNN), Bidirectional Gated
Recurrent Unit (BiGRU), and Long Short-Term Memory (LSTM). The 3D CNN
processes video frames to capture both spatial and temporal features. The BiGRU
model, along with LSTM, focuses on capturing temporal dependencies in a
bidirectional manner. This synergistic integration facilitates a robust understanding of
lip movements, utilizing both short-term and long-term context for accurate lip
reading predictions.
3. 3D CNN and BERT: This architecture combines the power of a 3D
Convolutional Neural Network (CNN) and BERT (Bidirectional Encoder
Representations from Transformers). The 3D CNN processes video frames to
capture spatial and temporal features. BERT, originally designed for natural
language processing, is adapted to comprehend sequential patterns in lip
movements, providing a comprehensive understanding of spoken content. This
fusion allows for effective lip reading by leveraging both visual and
linguistic context.
4. 3D CNN and Vision Transformer: This architecture amalgamates the
strengths of 3D Convolutional Neural Network (CNN) and Vision Transformer.
The 3D CNN processes video frames in three dimensions, capturing both spatial
and temporal features critical for lip reading. The Vision Transformer adapts the
Transformer architecture, originally developed for natural language processing,
to learn long-range dependencies in images. The fusion of the 3D CNN and Vision
Transformer allows the model to effectively extract spatial and temporal features
from video sequences while leveraging Transformer-based mechanisms to
understand the relationships and patterns within the data. This integration
enriches the lip reading process by combining both spatial and contextual
information, resulting in enhanced accuracy and understanding of lip movements.
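Purely as an illustration of how architecture 1 above could be assembled, a minimal Keras sketch is given below. The filter counts, the assumed input of 75 frames of 50x100 mouth crops, the 41-character vocabulary and the CTC-style output are assumptions for illustration, not specifications from this synopsis.

# Hedged sketch of architecture 1: 3D CNN feature extractor + BiGRU back-end.
import tensorflow as tf
from tensorflow.keras import layers

def build_3dcnn_bigru(frames=75, height=50, width=100, channels=1, vocab_size=41):
    inp = layers.Input((frames, height, width, channels))
    x = inp
    for filters in (32, 64, 96):
        # 3D convolutions capture short-range spatio-temporal lip motion.
        x = layers.Conv3D(filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPool3D((1, 2, 2))(x)        # pool over space only, keep time
    x = layers.TimeDistributed(layers.Flatten())(x)
    # Bidirectional GRUs model longer-range temporal context in both directions.
    x = layers.Bidirectional(layers.GRU(128, return_sequences=True))(x)
    x = layers.Bidirectional(layers.GRU(128, return_sequences=True))(x)
    # One extra class is reserved for the CTC blank token.
    out = layers.Dense(vocab_size + 1, activation="softmax")(x)
    return tf.keras.Model(inp, out, name="cnn3d_bigru")

model = build_3dcnn_bigru()
model.summary()
# Training would typically pair this with a CTC loss (e.g. tf.nn.ctc_loss), so the
# network can be supervised with sentences rather than frame-level alignments.

Architectures 2 to 4 would replace or extend the recurrent back-end with LSTM, BERT-style, or Vision Transformer blocks on top of the same 3D CNN features.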
Tools Required
Keras, TensorFlow, PyTorch
OpenCV, Python
Deep learning models
Dataset Details
GRID is a large multitalker audiovisual sentence corpus to support joint
computational-behavioral studies in speech perception. In brief, the corpus consists
of high-quality audio and video (facial) recordings of 1000 sentences spoken by
each of 34 talkers (18 male, 16 female). Sentences are of the form "put red at G9
now". The corpus, together with transcriptions, is freely available for research use.
Performance Metrics
● Comparison of the accuracy of the various models.
● To compare the performance and computational efficiency of all models, we
evaluate the number of parameters, epoch time, character error rate (CER),
and word error rate (WER) of each model.
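For reference, both error rates can be computed from a standard Levenshtein edit distance, as in the sketch below; in practice an existing library such as jiwer could be used instead.

# Minimal WER/CER sketch based on edit distance over words and characters.
def edit_distance(ref, hyp):
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1]

def wer(reference, hypothesis):
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / max(len(ref_words), 1)

def cer(reference, hypothesis):
    return edit_distance(list(reference), list(hypothesis)) / max(len(reference), 1)

# Example: one substituted word out of six.
# wer("put red at g nine now", "put red at g five now")  -> 1/6 ~= 0.167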
Demonstration Details (including GUI)
The project will be demonstrated via a web interface.
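One possible shape for this interface is sketched below using Streamlit, which is an assumption: the synopsis does not name a web framework. The model path, the lipreading.utils module, and the load_lip_frames and ctc_greedy_decode helpers are hypothetical placeholders.

# Hedged sketch of an upload-and-transcribe demo page.
import tempfile

import streamlit as st
import tensorflow as tf

# Hypothetical project module holding the preprocessing and decoding helpers
# sketched earlier in this synopsis.
from lipreading.utils import load_lip_frames, ctc_greedy_decode

st.title("Lip Reading Demo")
uploaded = st.file_uploader("Upload a talking-face video", type=["mpg", "mp4", "avi"])

if uploaded is not None:
    data = uploaded.getvalue()                    # raw bytes of the uploaded clip
    st.video(data)
    # Persist the upload so OpenCV-based preprocessing can read it from disk.
    with tempfile.NamedTemporaryFile(suffix=".mp4", delete=False) as tmp:
        tmp.write(data)
        video_path = tmp.name
    model = tf.keras.models.load_model("lipreading_model.keras")  # placeholder path
    frames = load_lip_frames(video_path)
    preds = model.predict(frames[None, ...])      # add a batch dimension
    st.write("Predicted transcript:", ctc_greedy_decode(preds))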
SYSTEM DIAGRAM
ARE THERE ANY STANDARD DATASETS AVAILABLE?
1) https://github.com/deepconvolution/LipNet/tree/master/Dataset
2) https://spandh.dcs.shef.ac.uk//gridcorpus/
Base Paper Link: https://arxiv.org/pdf/2303.14307.pdf

NAME OF THE PAPER: AUTO-AVSR: Audio-Visual Speech Recognition with Automatic Labels.
Pingchuan Ma, Alexandros Haliassos, Adriana Fernandez-Lopez, Honglie Chen,
Stavros Petridis, Maja Pantic. Imperial College London, UK; Meta AI, UK.
REFERENCES

[1] P. Ma, A. Haliassos, A. Fernandez-Lopez, H. Chen, S. Petridis, and M. Pantic, "AUTO-AVSR: Audio-Visual Speech Recognition with Automatic Labels," Imperial College London, UK; Meta AI, UK, 2023.
[2] X. Pan, P. Chen, Y. Gong, H. Zhou, X. Wang, and Z. Lin, "Leveraging Unimodal Self-Supervised Learning for Multimodal Audio-Visual Speech Recognition," 2022.
[3] Y. M. Assael, B. Shillingford, S. Whiteson, and N. de Freitas, "LipNet: End-to-End Sentence-Level Lipreading," University of Oxford, UK; Google DeepMind, London, UK; CIFAR, Canada.
[4] V. Jain, S. Lamba, and S. Airan, "LipNet: A Comparative Study," Institute of Technology, Nirma University.
[5] P. Ma, S. Petridis, and M. Pantic, "End-to-End Audio-Visual Speech Recognition with Conformers," 2021.
[6] S. Jeon, A. Elsharkawy, and M. S. Kim, "Lipreading Architecture Based on Multiple Convolutional Neural Networks for Sentence-Level Visual Speech Recognition."
[7] B. Shi, W.-N. Hsu, K. Lakhotia, and A. Mohamed, "Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction," 2022.
[8] S. Jeon and M. S. Kim, "End-to-End Lip-Reading Open Cloud-Based Speech Architecture."
[9] T. Adsare and K. Vayadande, "LipReadNet: A Deep Learning Approach to Lip Reading."