Mini Project Report on
Sign Language Detection and Recognition Using Deep Learning

By
Group ID: 09
Syead Maaz Ahmed (201900007)
Farheen (201900014)
Ebbani Thapa (201900040)

Under the guidance of
Mr. Shantanu Kumar Mishra, Assistant Professor,
Department of Computer Science and Engineering, SMIT

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SIKKIM MANIPAL INSTITUTE OF TECHNOLOGY
(A constituent college of Sikkim Manipal University)
MAJITAR, RANGPO, EAST SIKKIM – 737136

LIST OF CONTENTS
Abstract
Introduction
Literature Survey
Problem Definition
Solution Strategy
System Requirements
Design
Implementation
Input / Output
Conclusion
Problems
Future Work
Gantt Chart
References

ABSTRACT

Conversing with a person who has a hearing disability is always a major challenge. Sign language has become a powerful tool for individuals with hearing and speech disabilities to communicate their feelings and opinions to the world, and it makes their integration with others smoother and less complex. Sign language alone, however, is not enough: the gestures are easily mixed up or misread by someone who has never learnt them or who knows a different sign language. This communication gap, which has existed for years, can now be narrowed with techniques that automate the detection of sign gestures. In this report we introduce a sign language recognition system based on American Sign Language (ASL). The user captures images of a hand gesture with a web camera, and the system predicts and displays the name of the captured sign. The images undergo a series of processing steps that include computer vision techniques such as conversion to grayscale and Gaussian blur, after which the region of interest, in our case the hand gesture, is segmented. The features extracted are the binary pixels of the images. A Convolutional Neural Network (CNN) is used for training and for classifying the images. We are able to recognize more than 20 letters of the American Sign Language alphabet with high accuracy; our model achieves a remarkable accuracy of above 96%.

INTRODUCTION

As Nelson Mandela put it, "Talk to a man in a language he understands, that goes to his head. Talk to him in his own language, that goes to his heart." Language is undoubtedly essential to human interaction and has existed since human civilization began. It is the medium humans use to communicate, to express themselves and to understand notions of the real world. Without it there would be no books, no cell phones and certainly none of the words written here. It is so deeply embedded in our everyday routine that we often take it for granted and fail to realize its importance. Sadly, in the fast-changing society we live in, people with hearing impairment are usually forgotten and left out. They have to struggle to bring up their ideas, voice their opinions and express themselves to people who are different from them. People with impaired speech and hearing use sign language as their form of communication: sign gestures are a tool of non-verbal communication through which they express their emotions and thoughts to others.
However, people who do not know sign language find it difficult to understand these expressions, so trained sign language interpreters are needed during medical and legal appointments and in educational and training sessions. Over the past few years the demand for such services has increased. Other services, such as video remote interpreting over a high-speed Internet connection, have been introduced; they provide an easy-to-use interpreting service, but they still have major limitations. To address this, we put forward a sign language recognition system. It can serve as a tool for people with hearing disabilities to communicate their thoughts, and as an interpreter that lets non-signers understand what is being said. In this recognition system we use a custom CNN model to recognize sign language gestures. A convolutional neural network of 11 layers is constructed: three convolution layers, three max-pooling layers, two dense layers, one flattening layer and two dropout layers. We use the American Sign Language (ASL) dataset in the MNIST (Modified National Institute of Standards and Technology) format to train the model to identify the gestures; the dataset contains the features of different augmented gestures. The trained custom CNN model is then used to identify the sign from a video frame captured with OpenCV. Initially, the feature-extracted dataset is used to train the 11-layer custom model at a default image size. The rest of the report contains the Literature Survey, an elaborated Problem Definition, the Solution Strategy, the Implementation with pseudo code, the Gantt chart and finally the References.

LITERATURE SURVEY

2.1 A convolutional neural network to classify American Sign Language fingerspelling from depth and color images

Title: A convolutional neural network to classify American Sign Language fingerspelling from depth and color images
Authors: Ameen, S. A. and Vadera, S.
Salient Features: The paper uses a deep learning architecture to recognize the kind of sign presented as an image. It explains how a convolution applies kernel transformations to an image to identify relevant features, while the main goal of pooling is to introduce invariance to local translation and to reduce the number of hidden units. The paper explores an architecture that treats depth and intensity as inherently different types of information, on the grounds that there may be advantages in keeping them separate in the initial layers of a ConvNet. An analysis of the confusion matrix identified two types of errors: (i) symmetric errors, where two letters can be misclassified as each other, and (ii) asymmetric errors, where one letter is misclassified as another but not the other way round.
Pros: The empirical evaluation showed an improvement of 3% over the authors' previous work, with a precision rate of over 82%.
Cons: The sign for the letter R has nearly the same shape as that for the letter U, especially when the hand moves; in both letters the signer uses two fingers to convey the meaning. In addition, the distance between the camera and the fingers is nearly the same, which makes it difficult to recognize the difference even when using depth.

2.2 Static Sign Language Recognition Using Deep Learning

Title: Static Sign Language Recognition Using Deep Learning
Authors: Lean Karlo S. Tolentino, Ronnie O. Serfa Juan, August C. Thio-ac, Maria Abigail B. Pamahoy, Joni Rose R. Forteza and Xavier Jet O. Garcia
Salient Features: The main objective of the project was to develop a system that translates static sign language into its corresponding word equivalent, covering letters, numbers and basic static signs, so as to familiarize users with the fundamentals of sign language. The system was developed on the basis of a skin-color modelling technique, explicit skin-color space thresholding, which extracts hand pixels from background pixels. The segmented images were fed into a Convolutional Neural Network (CNN) for classification; Keras and TensorFlow were used for training. Proper lighting conditions and a uniform background were required.
Pros: Testing accuracy was 90.04% for letter recognition, 93.44% for number recognition and 97.52% for static word recognition, an average of 93.667% for gesture recognition within limited time. Each system was trained on 2,400 images of 50 × 50 pixels per letter/number/word gesture.
Cons: The study relied on a complex skin-color thresholding process; when only the bare hands of the signer were used, the system found it difficult to recognize the gesture because of hindrances such as noise.

2.3 American Sign Language Alphabet Recognition using Deep Learning

Title: American Sign Language Alphabet Recognition using Deep Learning
Authors: Nikhil Kasukurthi, Brij Rokad, Shiv Bidani and Dr. Aju Dennisan
Salient Features: To translate an image to the relevant letter, the authors fine-tuned a pre-trained SqueezeNet model on the Surrey Finger dataset. The trained model is then used for inference on the images fed to it as input. The model was trained on an NVIDIA K80 GPU with a dataset of 41,258 images. Each sample provided an RGB image (320 × 320 pixels), a depth map (320 × 320 pixels) and segmentation masks (320 × 320 pixels) for the classes background, person, three classes for each finger and one for each palm.
Pros: Inference is a low-computation process and can also be carried out on a handheld mobile device. The maximum validation accuracy attained was 83.29% at the 9th epoch, while the maximum training accuracy attained was 87.47%. The correlation between the training and validation accuracy was 98.47%, which signified that the model had been trained accurately.
Cons: The model gives accurate predictions but fails in certain cases. This happens with similar-looking letters such as 'a' and 't', where the only difference is that the thumb is at the side for 'a' whereas for 't' the thumb is between the index and middle fingers. An image with different lighting conditions, or one in which the fingers are not visible, also leads to false predictions.

2.4 Sign Language Recognition Using Deep Learning and Computer Vision

Title: Sign Language Recognition Using Deep Learning and Computer Vision
Authors: R. S. Sabeenian, S. Sai Bharathwaj and M. Mohamed Aadhil
Salient Features: A custom CNN model is used to recognize gestures in sign language. A convolutional neural network of 11 layers is constructed: four convolution layers, three max-pooling layers, two dense layers, one flattening layer and one dropout layer.
The American Sign Language dataset from MNIST (Modified National Institute of Standards and Technology database) is used to train the model to identify the gestures; the dataset contains the features of different augmented gestures. A custom CNN (Convolutional Neural Network) model is also introduced to identify the sign from a video frame using OpenCV.
Pros: The custom CNN model makes it easy to choose the variety of convolution to use (3 × 3, 5 × 5) in the model itself. The validation dataset consisted of 7,172 samples, and the validation accuracy of the model was greater than 93%.
Cons: The major issue faced was the background of the image. As the model was trained with segmented grayscale gesture images, it did not support background subtraction when the frames were taken from a video.

PROBLEM DEFINITION

There are more than 70 million people who are hearing or speech impaired. Recognizing sign language is not a very challenging task for human interpreters, since all that is required is for the interpreter to learn the particular sign language. A sign language, however, is not just a "natural language represented by signs" or a word-for-word hand representation; it is a representation of meaning. There are various facts about sign languages, which are natural languages in their own right, that most of us are unaware of. Some of them are listed below:
● They are NOT the same all over the world.
● They are NOT just gestures and pantomime; they have their own grammar.
● Their dictionary is smaller than that of other languages.
● Unknown words are finger-spelled.
● In most sign languages, adjectives are placed after the noun.
● Suffixes are never used.
● Signing is always in the present tense.
● Articles are not used.
● "I" is not used; "me" is used instead.
● There are no gerunds.
● Eyebrows and other non-manual expressions are used.
People who do not know sign language face difficulty in understanding it. Hence there is a need for a system that recognizes the different signs and gestures and conveys the information to them.

SOLUTION STRATEGY

The major reason for developing this system is to make communication, including over the internet, easier for the deaf and mute community. A hand gesture recognition system offers deaf people an opportunity to talk with hearing people without the need for an interpreter. The approach works through the following modules:
• Data Set: A set of images for each letter in the sign language is fed into a database. The number of images may vary from 50 to 100, with different angles of each particular gesture included. The input obtained is then compared with the images in the dataset to identify the gesture made. This number of images is chosen to obtain good accuracy and to avoid ambiguity.
• Image Detection: This step comes right after camera capture and refers to detecting the hand in the image that is obtained.
• Feature Extraction: Feature extraction refers to extracting details from the captured image. In a sign language interpreter, the captured image is a gesture made by a hand; these features are then used to recognize the gesture with certain algorithms.
• Image Recognition: Image recognition is the most crucial procedure of this project. The acquired image is converted to its vector form.
• Output: The flow of execution is as follows. The camera gets the input gesture image from the user; a detection step checks whether it is a hand using certain algorithms; image recognition, performed with a Convolutional Neural Network (CNN), then compares the acquired image with the images in the dataset to interpret the shown gesture; finally, the recognized symbol is converted to text as the output.

SYSTEM REQUIREMENTS

Software Requirements
• Python 3.10
• JupyterLab
• GPU support (CUDA/cuDNN)
• Object Detection API
• OpenCV
• TensorFlow
• NumPy
• Pandas

Hardware Requirements
• Processor: Intel® Core™ i5 / AMD Ryzen 5
• RAM: 8 GB
• Storage: 20 GB
• Standard devices: keyboard, monitor, webcam and mouse

DESIGN

(Figures: the American Sign Language alphabet chart and the design of the system.)

IMPLEMENTATION

6.1 Algorithm

The entire model is based on the Convolutional Neural Network (CNN), a class of deep neural networks. CNNs are applied in many areas, most of which involve visual images. The steps are as follows: provide the input image to a convolution layer; choose the parameters and apply filters with strides, and padding if required; perform the convolution on the image and apply the ReLU activation to the resulting matrix; perform pooling to reduce the dimensionality; add as many convolutional layers as needed; flatten the output and feed it into a fully connected (FC) layer; and output the class using an activation function (logistic regression with a cost function) to classify the image. CNNs are used for image classification and recognition because of their high accuracy, which makes them the best choice for training our model, since the project revolves around images.

PSEUDO CODE

The code proceeds through the following steps (brief illustrative sketches of these steps, under stated assumptions, are given after the Problems section):
1. The dataset is transformed to grayscale.
2. A Convolutional Neural Network is applied for training and classification.
3. The model summary is printed.
4. The accuracy vs. val_accuracy graph is plotted.
5. The model is evaluated.
6. The camera is used to take an alphabet sign as input.

INPUT / OUTPUT

(Figures: sample input gestures and the corresponding recognized output letters.)

CONCLUSION

Many breakthroughs have been made in the fields of artificial intelligence, machine learning and computer vision. They have immensely changed how we perceive things around us and how we apply these techniques in our everyday lives. Much research has been conducted on sign gesture recognition using techniques such as ANN, LSTM and 3D CNN; however, most of it requires extra computing power. Our work, on the other hand, requires low computing power and gives a remarkable accuracy of above 95%. In our research we normalize and rescale our images to 64 pixels in order to extract features (binary pixels) and make the system more robust. We use a CNN to classify more than 20 alphabetical American Sign Language gestures and successfully achieve an accuracy of 97%, which is better than the related work described in this report.

PROBLEMS

Sign languages are very broad and differ from country to country in gestures, body language and facial expressions; the grammar and sentence structure also vary a lot. In our study, learning and capturing the gestures was quite a challenge, since the movement of the hands had to be precise and on point. Some gestures are difficult to reproduce, and it was hard to keep our hands in exactly the same position while creating our dataset.
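The sketch below illustrates, in TensorFlow/Keras, a CNN of the kind described in the Introduction and applied in step 2 of the pseudo code: eleven layers made up of three convolution layers, three max-pooling layers, one flattening layer, two dropout layers and two dense layers. The filter counts, kernel sizes, dropout rates, the 28 × 28 grayscale input (the Conclusion mentions rescaling to 64 pixels, so the size may differ) and the number of output classes are illustrative assumptions, not values taken from our code.

```python
# Minimal sketch (not the project's exact code) of the 11-layer CNN:
# three convolution, three max-pooling, one flatten, two dropout and
# two dense layers. Filter counts, kernel sizes and input size are assumed.
import tensorflow as tf
from tensorflow.keras import layers, models

IMG_SIZE = 28       # assumed: 28 x 28 grayscale, the Sign Language MNIST format
NUM_CLASSES = 25    # assumed: labels 0-24 in that encoding (J and Z need motion)

def build_model():
    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation="relu",
                      input_shape=(IMG_SIZE, IMG_SIZE, 1)),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(128, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dropout(0.25),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.25),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

if __name__ == "__main__":
    build_model().summary()   # the "model summary" step of the pseudo code
```

Counting the Conv2D, MaxPooling2D, Flatten, Dropout and Dense layers gives the eleven layers mentioned in the Introduction.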
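Next is a sketch of the data loading, training, evaluation and accuracy plotting steps of the pseudo code. The CSV file names, the "label" column name and the training settings are assumptions based on the common Sign Language MNIST layout and may differ from what we actually used.

```python
# Sketch of data loading, training, evaluation and the accuracy vs.
# val_accuracy plot. File names and the 'label' column are assumed.
import pandas as pd
import matplotlib.pyplot as plt

def load_csv(path):
    df = pd.read_csv(path)
    labels = df["label"].values                        # assumed column name
    pixels = df.drop(columns=["label"]).values
    images = pixels.reshape(-1, 28, 28, 1).astype("float32") / 255.0
    return images, labels

x_train, y_train = load_csv("sign_mnist_train.csv")    # assumed file name
x_val, y_val = load_csv("sign_mnist_test.csv")         # assumed file name

model = build_model()                                  # from the sketch above
history = model.fit(x_train, y_train,
                    validation_data=(x_val, y_val),
                    epochs=10, batch_size=128)         # assumed settings

val_loss, val_acc = model.evaluate(x_val, y_val)       # the "evaluation" step
print(f"Validation accuracy: {val_acc:.4f}")

plt.plot(history.history["accuracy"], label="accuracy")
plt.plot(history.history["val_accuracy"], label="val_accuracy")
plt.xlabel("Epoch")
plt.ylabel("Accuracy")
plt.legend()
plt.show()

model.save("asl_cnn.h5")                               # saved for the camera step
```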
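Finally, a sketch of step 6: taking an alphabet sign from the webcam with OpenCV and classifying it with the trained model, using the grayscale conversion, Gaussian blur and region-of-interest cropping mentioned in the Abstract. The ROI coordinates, the model path, the label-to-letter mapping and the quit key are illustrative assumptions.

```python
# Sketch of webcam capture and live prediction. ROI box, model path and
# the label-to-letter mapping are assumed for illustration.
import cv2
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model("asl_cnn.h5")          # assumed path
LETTERS = list("ABCDEFGHIJKLMNOPQRSTUVWXY")                # assumed encoding (J, Z unused)

cap = cv2.VideoCapture(0)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    x, y, w, h = 100, 100, 300, 300                        # assumed ROI box
    roi = frame[y:y + h, x:x + w]
    gray = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)           # grayscale
    blur = cv2.GaussianBlur(gray, (5, 5), 0)               # Gaussian blur
    small = cv2.resize(blur, (28, 28)).astype("float32") / 255.0
    probs = model.predict(small.reshape(1, 28, 28, 1), verbose=0)[0]
    letter = LETTERS[int(np.argmax(probs))]
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.putText(frame, letter, (x, y - 10),
                cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 255, 0), 2)
    cv2.imshow("Sign Language Recognition", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):                  # press q to quit
        break

cap.release()
cv2.destroyAllWindows()
```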
FUTURE WORK

We look forward to improving the model so that it recognizes more alphabet gestures while maintaining high accuracy. Further, we would like to extend this alphabet recognition system into a fully automated conversation recognition system. We would also like to enhance the system by adding speech recognition so that blind people can benefit from it as well.

GANTT CHART

(Gantt chart: the project phases Problem Identification, Feasibility Study, Literature Survey, SRS and Design, Coding, Testing and Documentation, scheduled from November to April.)

REFERENCES

[1] Ameen, S., & Vadera, S. (2017). A convolutional neural network to classify American Sign Language fingerspelling from depth and colour images. University of Salford, Manchester.
[2] Tolentino, L. K. S., Serfa Juan, R. O., Thio-ac, A. C., Pamahoy, M. A. B., Forteza, J. R. R., & Garcia, X. J. O. (2019). Static Sign Language Recognition Using Deep Learning. International Journal of Machine Learning and Computing, 9(6).
[3] Kasukurthi, N., Rokad, B., Bidani, S., & Dennisan, A. (2019). American Sign Language Alphabet Recognition using Deep Learning. Vellore Institute of Technology.
[4] Sabeenian, R. S., Sai Bharathwaj, S., & Mohamed Aadhil, M. (2020). Sign Language Recognition Using Deep Learning and Computer Vision. Journal of Advanced Research in Dynamical and Control Systems.
[5] Shirbhate, R. S., Shinde, V. D., Metkari, S. A., Borkar, P. U., & Khandge, M. A. (2020). Sign Language Recognition Using Machine Learning Algorithm. International Research Journal of Engineering and Technology (IRJET).