PROJECT SYNOPSIS

DEPARTMENT: ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING

TITLE OF THE PROJECT: LIP-READING FOR SPEECH RECOGNITION

STUDENT NAMES / USN / PHONE / MAIL ID:
POOJA SINGARI K SUJAN RAO CH SAMIKSHA MEHTA NAGANOSHITH
USN: 1DS20AI040, 1DS20AI024, 1DS20AI020, 1DS20AI050
Mail: 20beam062@dsce.edu.in, 20beam021@dsce.edu.in, 20beam006@dsce.edu.in, 20beam061@dsce.edu.in

PROJECT TIMELINE (Tentative Start Date - End Date): September 2023 to December 2023

PROJECT GUIDE: PROF. YASHASWINI B M

FIELD OF PROJECT: Deep Learning

PROJECT INTRODUCTION

Human perception of speech is intrinsically multimodal, involving both audition and vision. Speech production is accompanied by movements of the lips and teeth, which can be visually interpreted to understand speech. Vision plays a crucial role in speech understanding, and the importance of using visual information to improve the performance and robustness of speech recognition has been well demonstrated. Visual cues of speech not only play an essential role in language learning for pre-lingual children but also improve speech understanding in noisy environments and provide patients with speech impairments with a means of communication. Furthermore, perceptual studies have shown that such visual cues can alter the perceived sound.

Machine learning models can be trained to recognize lip movements and convert them into text, providing a way to automatically transcribe speech without relying on audio input. Lip-reading models typically use deep neural networks to process video frames and extract features that are relevant for speech recognition.

In the 1970s and 1980s, researchers developed early lip-reading systems that used rule-based approaches to recognize speech from video input. These systems were limited by the available computational resources and the lack of large labeled datasets for training. In the 1990s and early 2000s, researchers began to use feature-based approaches to lip reading, which involved extracting specific features from video frames, such as the shape of the lips and the position of the tongue. These approaches achieved some success in recognizing isolated words but struggled with more complex speech patterns.

In recent years, deep learning has become the dominant paradigm in lip-reading research. These approaches use convolutional neural networks (CNNs) and recurrent neural networks (RNNs) to process video frames and learn to recognize patterns in the visual input. Deep learning approaches have achieved state-of-the-art performance on lip-reading benchmarks and have the potential to improve communication for individuals who are deaf or hard of hearing. Another recent development is the use of multimodal approaches, which combine visual input from lip reading with audio input from speech recognition systems. By combining these two modalities, researchers hope to improve the accuracy and robustness of lip-reading systems, particularly in noisy or adverse conditions. Google, Apple, DeepZen and SenseTime are just a few of the companies working on lip-reading models, and as interest in this technology grows, more companies are likely to develop lip-reading systems for a variety of applications.

The lip-reading project aims to develop machine learning models that can recognize speech by visually interpreting the movements of the lips, tongue, and other facial features.
This technology has the potential to improve communication for individuals who are deaf or hard of hearing, enhance speech recognition in noisy or crowded environments, enable more natural human-robot interaction, and support language learning. Overall, the lip-reading project has the potential to transform the way we communicate and interact with technology, and holds great promise for improving accessibility and inclusivity for individuals with hearing impairments.

Literature Survey Summary

● Lip reading is a challenging task due to the variability of lip movements and the complexity of the visual cues involved in speech production. Traditional approaches to lip reading have relied on hand-crafted features and statistical models designed to capture the shape and motion of the lips, but these methods have limited accuracy and robustness. In recent years, deep learning approaches have shown promise in improving lip-reading performance by automatically learning features from raw video frames and capturing temporal dependencies in lip movements.

● Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are among the most commonly used deep learning architectures for lip reading. CNNs are effective at learning local spatial features from video frames, while RNNs are suitable for modeling temporal dependencies in lip movements. Some researchers have also explored attention mechanisms, which allow models to focus on relevant regions of the video frames and improve robustness to variations in lighting, head movement, and facial expression.

● In addition to architecture design, data augmentation techniques have been shown to improve the performance of lip-reading models. These techniques apply random transformations to the video frames, such as cropping and flipping, to increase the size and diversity of the training dataset. This helps models generalize better to unseen data and reduces overfitting.

● Despite recent advances in deep learning for lip reading, several challenges remain. One major challenge is the need for large amounts of annotated data, which is time-consuming and expensive to collect. Another is the variability in speech and language, which can make it difficult for models to generalize to new speakers and languages.

● The architecture used in the most recent research is an off-the-shelf design that achieves state-of-the-art performance on the LRS2 and LRS3 datasets without external data. The VSR front-end is a modified ResNet-18 whose first layer is a spatio-temporal convolution with a kernel size of 5 × 7 × 7 and a stride of 1 × 2 × 2. The temporal back-end, which follows the front-end, is a Conformer. Similarly, the ASR encoder consists of a 1D ResNet-18 followed by a Conformer. The ASR and VSR encoder outputs are fused via a multi-layer perceptron (MLP). The rest of the network consists of a projection layer and a Transformer decoder for joint CTC/attention training.

● The AUTO-AVSR system proposed in the recent papers achieves state-of-the-art performance on both LRS2 and LRS3, with WERs of 10.5% and 16.1%, respectively. This represents a significant improvement over the previous state-of-the-art results, which had WERs of 12.4% and 18.6%, respectively. The authors also report that their system outperforms several existing audio-only, video-only, and audio-visual baselines on both datasets.

● The approach used is to pre-train ASR and VSR models using self-supervised learning on unlabelled datasets such as AVSpeech and VoxCeleb2. The authors then use publicly available pre-trained ASR models to automatically transcribe these unlabelled datasets, and train ASR, VSR, and AV-ASR models on the augmented training set, which consists of the LRS2 and LRS3 datasets as well as the additional automatically transcribed data.

● The proposed AUTO-AVSR system uses a modified ResNet-18 for the VSR front-end, a Conformer for the temporal back-end, a 1D ResNet-18 followed by a Conformer for the ASR encoder, and a multi-layer perceptron (MLP) to fuse the ASR and VSR encoder outputs. The rest of the network consists of a projection layer and a Transformer decoder for joint CTC/attention training (a simplified sketch of such a spatio-temporal visual front-end is given just after this list).

● Overall, the literature suggests that deep learning approaches have the potential to significantly improve the performance of lip-reading systems, enabling better communication for people with hearing impairments and more robust speech recognition in noisy environments. However, further research is needed to overcome the challenges and limitations of current approaches and to explore new ways of improving the accuracy and usability of lip-reading technology.
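To make the visual front-end described above concrete, the following is a minimal Keras sketch of a spatio-temporal stem with a 5 × 7 × 7 kernel and a 1 × 2 × 2 stride, followed by a small per-frame convolutional trunk as a simplified stand-in for the ResNet-18 and Conformer stages used in the cited work. The layer sizes, input resolution, and trunk are illustrative assumptions, not the authors' exact configuration.

```python
# Hedged sketch of a VSR visual front-end in Keras: a spatio-temporal Conv3D stem
# (kernel 5x7x7, stride 1x2x2, as described in the survey) followed by a small
# per-frame convolutional trunk. The trunk is a simplified stand-in for the
# ResNet-18 + Conformer stack used in the cited work.
from tensorflow.keras import layers, models

def visual_front_end(frames=75, height=96, width=96, channels=1):
    inputs = layers.Input(shape=(frames, height, width, channels))
    # Spatio-temporal stem: preserves the time axis, downsamples space.
    x = layers.Conv3D(64, kernel_size=(5, 7, 7), strides=(1, 2, 2), padding="same")(inputs)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    x = layers.MaxPool3D(pool_size=(1, 2, 2))(x)
    # Simplified per-frame trunk (stand-in for the ResNet-18 of the paper).
    x = layers.TimeDistributed(layers.Conv2D(128, 3, padding="same", activation="relu"))(x)
    x = layers.TimeDistributed(layers.Conv2D(256, 3, padding="same", activation="relu"))(x)
    # One embedding vector per frame -> a sequence for a temporal back-end (e.g. a Conformer).
    x = layers.TimeDistributed(layers.GlobalAveragePooling2D())(x)
    return models.Model(inputs, x, name="visual_front_end")

model = visual_front_end()
model.summary()  # final output shape: (None, 75, 256), one feature vector per frame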
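One common way to obtain the four lip reference points mentioned above is a facial-landmark detector. The sketch below uses OpenCV together with dlib's standard 68-point landmark model; dlib and its shape_predictor_68_face_landmarks.dat file are extra dependencies beyond the tools listed in this synopsis, and the indices 48/54/51/57 (left corner, right corner, upper lip, lower lip) are those of that model. This is an illustrative option, not a fixed design decision of the project.

```python
# Hedged sketch: extracting four lip reference points (left, right, upper, lower)
# with dlib's 68-point facial landmark model and OpenCV. Assumes dlib and the
# shape_predictor_68_face_landmarks.dat file are available.
import cv2
import dlib

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def lip_points(frame_bgr):
    """Return (left, right, upper, lower) lip coordinates for the first detected face."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    pick = lambda i: (shape.part(i).x, shape.part(i).y)
    # 48 = left mouth corner, 54 = right corner, 51 = top of upper lip, 57 = bottom of lower lip
    return pick(48), pick(54), pick(51), pick(57)

def mouth_crop(frame_bgr, margin=10):
    """Crop a mouth region of interest around the four lip points."""
    pts = lip_points(frame_bgr)
    if pts is None:
        return None
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    y0, x0 = max(min(ys) - margin, 0), max(min(xs) - margin, 0)
    return frame_bgr[y0:max(ys) + margin, x0:max(xs) + margin]
```

In practice, tracking how these points move from frame to frame (or feeding the cropped mouth region to a network) gives the features described in the bullet above.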
OBJECTIVES OF THE PROJECT

The project aims to achieve the following goals:
1. Develop a deep learning model that can accurately recognize speech from lip movements captured in video footage.
2. Improve the robustness and accuracy of the lip-reading model.
3. Explore the use of different neural network architectures, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), to optimize the performance of the lip-reading model.
4. Investigate the impact of data augmentation techniques, such as random cropping and flipping, on the performance of the lip-reading model.
5. Investigate potential applications of the lip-reading model in areas such as speech recognition for people with hearing impairments, speech enhancement in noisy environments, and speech recognition in multimodal settings.
6. Develop a user-friendly interface for the lip-reading system, which allows users to upload video footage and receive real-time speech recognition results.

ARCHITECTURE

1. 3D CNN with BiGRU: This hybrid architecture combines the strengths of a 3D Convolutional Neural Network (CNN) and a Bidirectional Gated Recurrent Unit (BiGRU). The 3D CNN extracts spatial and temporal features from video sequences, while the BiGRU captures long-term temporal dependencies. Integrating both models enhances the understanding of lip movements, providing a robust foundation for accurate lip-reading predictions (a minimal sketch of this option appears after this list).

2. 3D CNN with BiGRU and LSTM: This architecture combines a 3D Convolutional Neural Network (CNN), a Bidirectional Gated Recurrent Unit (BiGRU), and Long Short-Term Memory (LSTM). The 3D CNN processes video frames to capture both spatial and temporal features. The BiGRU, together with the LSTM, captures temporal dependencies in a bidirectional manner. This integration facilitates a robust understanding of lip movements, using both short-term and long-term context for accurate lip-reading predictions.

3. 3D CNN and BERT: This architecture combines a 3D Convolutional Neural Network (CNN) and BERT (Bidirectional Encoder Representations from Transformers). The 3D CNN processes video frames to capture spatial and temporal features. BERT, originally designed for natural language processing, is adapted to model sequential patterns in lip movements, providing a comprehensive understanding of the spoken content. This fusion allows for effective lip reading by leveraging both visual and linguistic context.

4. 3D CNN and Vision Transformer: This architecture combines the strengths of a 3D Convolutional Neural Network (CNN) and a Vision Transformer. The 3D CNN processes video frames in three dimensions, capturing the spatial and temporal features critical for lip reading, while the Vision Transformer excels at learning long-range dependencies in images using attention mechanisms originally developed for natural language processing. The fusion of the 3D CNN and Vision Transformer allows the model to extract spatial and temporal features from video sequences while leveraging Transformer-based attention to understand the relationships and patterns within the data. This integration enriches the lip-reading process by combining spatial and contextual information, resulting in improved accuracy and understanding of lip movements.
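As a concrete reference point for option 1, the following is a minimal Keras sketch of a 3D CNN followed by bidirectional GRU layers and a character-level softmax suitable for CTC-style training (as in LipNet). The filter counts, the input size of 75 mouth-region frames, and the vocabulary size are illustrative assumptions, not final design choices.

```python
# Hedged sketch of architecture option 1 (3D CNN + BiGRU) in Keras, LipNet-style.
# Input: a clip of mouth-region frames; output: per-frame character probabilities
# that can be trained with a CTC loss. All sizes are illustrative.
from tensorflow.keras import layers, models

def cnn3d_bigru(frames=75, height=46, width=140, channels=1, vocab_size=28):
    inputs = layers.Input(shape=(frames, height, width, channels))
    x = inputs
    # Stacked 3D convolutions extract spatio-temporal lip features.
    for filters in (32, 64, 96):
        x = layers.Conv3D(filters, kernel_size=3, padding="same", activation="relu")(x)
        x = layers.MaxPool3D(pool_size=(1, 2, 2))(x)  # downsample space, keep time
    # Flatten each frame's feature map into a vector.
    x = layers.TimeDistributed(layers.Flatten())(x)
    # Bidirectional GRUs model temporal dependencies across the clip.
    x = layers.Bidirectional(layers.GRU(128, return_sequences=True))(x)
    x = layers.Bidirectional(layers.GRU(128, return_sequences=True))(x)
    # +1 output unit for the CTC "blank" symbol.
    outputs = layers.Dense(vocab_size + 1, activation="softmax")(x)
    return models.Model(inputs, outputs, name="cnn3d_bigru")

model = cnn3d_bigru()
model.summary()
# Training would pair this output with a CTC loss (e.g. tf.nn.ctc_loss or
# tf.keras.backend.ctc_batch_cost) so that unaligned sentence transcripts
# can supervise the per-frame predictions.
```

The other options in the list would replace or extend the recurrent part of this sketch (adding an LSTM stage, or swapping the BiGRU for a BERT-style or Vision Transformer encoder) while keeping the 3D CNN feature extractor.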
Tools Required
Python, Keras, TensorFlow, PyTorch, OpenCV, deep learning models

Dataset Details
GRID is a large multi-talker audiovisual sentence corpus designed to support joint computational-behavioral studies in speech perception. The corpus consists of high-quality audio and video (facial) recordings of 1000 sentences spoken by each of 34 talkers (18 male, 16 female). Sentences are of the form "put red at G9 now". The corpus, together with transcriptions, is freely available for research use.

Performance Metrics
● Comparison of the accuracy of the various models.
● To compare the performance and computational efficiency of all models, we evaluate the number of parameters, epoch time, CER, and WER of each model (a small WER/CER computation sketch is included at the end of this synopsis, after the references).

Demonstration Details (including GUI)
The project will be demonstrated via a web interface.

SYSTEM DIAGRAM

ARE THERE ANY STANDARD DATASETS AVAILABLE?
1) https://github.com/deepconvolution/LipNet/tree/master/Dataset
2) https://spandh.dcs.shef.ac.uk//gridcorpus/

Base Paper Link: https://arxiv.org/pdf/2303.14307.pdf
NAME OF THE PAPER: AUTO-AVSR: Audio-Visual Speech Recognition with Automatic Labels. Pingchuan Ma, Alexandros Haliassos, Adriana Fernandez-Lopez, Honglie Chen, Stavros Petridis, Maja Pantic. Imperial College London, UK; Meta AI, UK.

REFERENCES
[1] P. Ma, A. Haliassos, A. Fernandez-Lopez, H. Chen, S. Petridis, and M. Pantic, "AUTO-AVSR: Audio-Visual Speech Recognition with Automatic Labels," Imperial College London / Meta AI.
[2] X. Pan, P. Chen, Y. Gong, H. Zhou, X. Wang, and Z. Lin (2022), "Leveraging Unimodal Self-Supervised Learning for Multimodal Audio-Visual Speech Recognition."
[3] Y. M. Assael, B. Shillingford, S. Whiteson, and N. de Freitas, "LipNet: End-to-End Sentence-Level Lipreading," University of Oxford / Google DeepMind / CIFAR.
[4] V. Jain, S. Lamba, and S. Airan, "LipNet: A Comparative Study," Institute of Technology, Nirma University.
[5] P. Ma, S. Petridis, and M. Pantic (2021), "End-to-End Audio-Visual Speech Recognition with Conformers."
[6] S. Jeon, A. Elsharkawy, and M. S. Kim, "Lipreading Architecture Based on Multiple Convolutional Neural Networks for Sentence-Level Visual Speech Recognition."
[7] B. Shi, W.-N. Hsu, K. Lakhotia, and A. Mohamed (2022), "Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction."
[8] S. Jeon and M. S. Kim, "End-to-End Lip-Reading Open Cloud-Based Speech Architecture."
[9] T. Adsare and K. Vayadande, "LipReadNet: A Deep Learning Approach to Lip Reading."
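As a companion to the Performance Metrics section above, the following is a minimal sketch of how WER and CER can be computed with a standard Levenshtein edit distance. It is an illustrative helper, not code taken from any of the cited papers.

```python
# Hedged sketch: word error rate (WER) and character error rate (CER) via
# Levenshtein edit distance, for the metrics listed under Performance Metrics.
def edit_distance(ref, hyp):
    """Minimum number of substitutions, insertions, and deletions turning ref into hyp."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, start=1):
            prev, d[j] = d[j], min(d[j] + 1,         # deletion
                                   d[j - 1] + 1,     # insertion
                                   prev + (r != h))  # substitution (0 cost if equal)
    return d[-1]

def wer(reference, hypothesis):
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / max(len(ref_words), 1)

def cer(reference, hypothesis):
    return edit_distance(list(reference), list(hypothesis)) / max(len(reference), 1)

print(wer("put red at g nine now", "put red at g five now"))  # 1/6, about 0.167
print(cer("bin blue", "bin blu"))                              # 1/8 = 0.125
```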