Making Kinnections
Software Requirements Specification Document

Date: 11/21/2014
Version: 1.0
Authors: Bhargav Shivkumar, Andrew Stohr, Christopher Weeden, Balu Swarna, Mark S.
Reviewed: 11/21/2014 by Prof. Michael Buckley

Table of Contents

Introduction
A note on American Sign Language
Existing gesture recognition systems
Software Architecture Block Diagram
Sample User Interface Screen
Change Management Procedures
Cross Reference Listing
Integration Thread
Technical Specifications
Conclusion

Introduction

This document addresses, in pseudo-technical terms, the software requirements given by the users for building the Kinect-based gesture recognition system. It provides a high-level explanation of the technical solution to the presented problem, touching on sample user interfaces and the interactions between the various modules in the system. The main intention, as perceived from the functional requirements, is that this system must allow the hearing and speech impaired to carry out a normal conversation seamlessly with their unimpaired counterparts. Before going into the details of how this system can be designed, we start by exploring existing systems and why they might not be a feasible solution. We then chalk out the various modules in our design and describe, at a high level, how each works and interacts with the other modules and the database.

A note on American Sign Language

The hand and body motions that comprise ASL are commonly classified using five parameters. The first of these is hand shape, which describes the state of one's fingers; it distinguishes, for example, between a hand held in a fist and one held in a high-five shape. The second parameter is location: where the hands are, whether in front of the chest or above the head. The next parameter is movement; signs in ASL are not just static shapes made with the hands, and frequently involve movements of the hands or body. The parameter of palm orientation describes whether a person's palm is facing toward or away from the other person. The final parameter is non-manual markers, which encompasses facial expressions. Signs can have completely different meanings based on whether the signer has a happy or sad facial expression. Our system takes these five parameters into account when interpreting a sign and is thus more robust than existing recognition systems. More on ASL parameters can be found at http://nshsasl2.weebly.com/uploads/5/2/8/7/5287242/five_paramters_of_asl_in_pdf.pdf
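As a rough sketch of how these five parameters might be carried through the recognition pipeline described later in this document, the C# type below captures one observed set of parameter values for a single sign. All type and member names here are illustrative assumptions made for this document, not part of any SDK.

    // Illustrative sketch only: one possible in-memory representation of the five
    // ASL parameters observed for a single sign. All names are assumptions.
    public enum HandShape { Fist, OpenPalm, IndexPoint, Unknown }

    public enum PalmOrientation { TowardViewer, AwayFromViewer, Up, Down, Unknown }

    public sealed class AslParameters
    {
        public HandShape HandShape { get; set; }             // state of the fingers (fist vs. high-five, etc.)
        public string Location { get; set; }                 // e.g. "in front of chest", "above head"
        public string Movement { get; set; }                 // trajectory of the hands over time
        public PalmOrientation PalmOrientation { get; set; } // facing toward or away from the other person
        public string NonManualMarkers { get; set; }         // facial expression, e.g. "happy", "sad"
    }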
Existing gesture recognition systems

At the moment, the best existing solution comes from Microsoft. The team currently working on the problem is the Kinect Sign Language Translator project. Their solution captures the gestures, while machine learning and pattern recognition programming help interpret the meaning. The system is capable of capturing a given conversation from both sides: a deaf person who is showing signs and a person who is speaking. Visual signs are converted to written and spoken translation rendered in real time, while spoken words are turned into accurate visual signs. The system consists of two modes, the first of which is translation mode, which translates sign language into text or speech. This includes isolated word recognition and sentence recognition. The raising and putting down of the hands is defined as the "start" and the "end" of each sign language word to be recognized. By tracking the position of the user's hands, many signs can be detected instantly using the research group's 3D trajectory matching algorithm. The second mode is communication mode, in which a non-hearing-impaired person can communicate with the signer through an avatar. They speak to the system, and the avatar signs what they spoke. This can also be done by typing into the Kinect software. [2] The downsides are that it currently takes five people to establish the recognition patterns for just one word, and only 300 Chinese sign language words have been added out of a total of 4,000. As these numbers show, there is still a long road ahead toward mastering this problem. [2]

[2] http://research.microsoft.com/en-us/collaboration/stories/kinect-sign-language-translator.aspx

Software Architecture Block Diagram

I. System Overview Diagram

Figure 1: System Overview Diagram

The inputs to the processing block are the Kinect depth data, the live video stream, and the speech data. The depth data carries the information needed to recognize signs and the meaning behind them; it contains the ASL parameter data, which is explained later in the recognition module. The speech data comes directly from the Kinect's microphone and is used to translate what the non-impaired person says into signs, so that the hearing-impaired person can hold a conversation with them. These signs are sent to the display avatar, and the avatar is eventually displayed on the monitor. The video stream is also sent to the monitor, which displays the interface; within the interface is a live video stream of the person signing to the Kinect. The live video stream is important because it gives the signer live feedback and confirmation that what he or she is signing is being processed correctly by the program.
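To make the data flow of Figure 1 concrete, the following sketch outlines one possible set of C# interfaces for the sub-modules detailed in the following sections. Every name here is an assumption made for illustration, not a type from the Kinect SDK; the actual module boundaries will be refined during design.

    // Illustrative interfaces mirroring the blocks in Figure 1; all names are assumptions.
    public sealed class SignData
    {
        public string Gloss { get; set; }    // English gloss of the sign to be performed by the avatar
    }

    public interface IRecognitionModule
    {
        // Depth data in, English text out (see the Recognition Module, Figure 3).
        string RecognizeSign(byte[] depthFrame);
    }

    public interface ITextToSpeechModule
    {
        // Text in, audio routed to the speaker (see the Text to Speech Module, Figure 4).
        void Speak(string text);
    }

    public interface ISpeechInterpreterModule
    {
        // Speech data in, sign data for the avatar out (see the Speech Interpreter, Figure 5).
        SignData Interpret(byte[] speechData);
    }

    public interface IDisplayAvatarModule
    {
        // Sign data in, avatar rendered to the monitor (see the Display Avatar Module, Figure 6).
        void Render(SignData sign);
    }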
II. Processing Block

Figure 2: Processing Module Block Diagram

The processing module is broken down into three sub-processes. The recognition module accepts the depth data from the Kinect and translates it into English text, which is then passed on to the text to speech module as well as the monitor. The text to speech module outputs the audio equivalent of the text input. Finally, the speech interpreter module takes speech data from the Kinect microphone as input and interprets it into sign data, which is passed to the display avatar.

III. Recognition Module

Figure 3: Recognition Module Block Diagram

The recognition module takes the depth data as input and interprets it using the five common classification parameters of ASL signs: hand shape, location, movement, palm orientation, and facial expression, analyzed by the hand shape analysis, location, movement analysis, palm orientation, and facial expression recognizer sub-modules respectively. Each of these is important for translating what the impaired person is signing into the correct meaning. The parameters are passed to the sign identifier module, which queries the sign database to identify potential signs and then selects, from the database results, the most likely meaning of the observed ASL parameters. The recognized gesture is passed to the sign to text converter, which queries the sign database for the equivalent text of the recognized gesture. The text is output to both the text to speech module and the monitor, where the interface displays the text the sign was recognized to mean. In order for the sign database to have all the meanings it needs, each sign or phrase and its meaning will have to be added to the database.

IV. Text To Speech Module

Figure 4: Text to Speech Module Block Diagram

The text to speech module receives the text from the recognition module. The text is passed to the text to speech converter, which queries the database for the equivalent audio of the text. There are many programs and APIs available for text-to-speech conversion, and one of these will be used to produce the correct audio for the text. The audio can take on many different voices, for instance a man's or a woman's voice. From here, the audio is sent to the speaker so that the non-impaired person can listen to what the impaired person signed.
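As one possible realization of this block, assuming a Windows/.NET environment, the sketch below hands the recognized text to the System.Speech synthesizer that ships with .NET. The class name and wiring are assumptions made for illustration; a different text-to-speech API could be substituted without changing the module's interface.

    using System;
    using System.Speech.Synthesis;   // part of .NET on Windows (System.Speech assembly)

    // Minimal sketch of the text to speech converter from Figure 4.
    public sealed class TextToSpeechConverter : IDisposable
    {
        private readonly SpeechSynthesizer _synth = new SpeechSynthesizer();

        public TextToSpeechConverter()
        {
            _synth.SetOutputToDefaultAudioDevice();   // route the audio to the speaker
            // A different installed voice (for instance a male or female voice) could be
            // selected here with _synth.SelectVoice(...).
        }

        public void Speak(string recognizedText)
        {
            _synth.Speak(recognizedText);             // blocking call; SpeakAsync is also available
        }

        public void Dispose()
        {
            _synth.Dispose();
        }
    }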
V. Speech Interpreter Module

Figure 5: Speech Interpreter Module Block Diagram

This module receives speech data from the Kinect's microphone. The microphone is part of the Kinect hardware, so the speech data can be obtained directly from the device. The data is passed to the speech to text converter, which queries the speech to text database for the equivalent text of the speech data; there are many programs and databases that can convert speech to text. Once the converted text is retrieved, it is sent to the text to sign converter, which queries the sign database for the equivalent sign for the text. Sign keys are the words or phrases identified in the text that have an equivalent representation in sign language. The sign for the text is passed on in the form of sign data, which is then passed to the display avatar.

VI. Display Avatar Module

Figure 6: Display Avatar Module Block Diagram

The display avatar module takes the output from the speech interpreter. The sign data is passed to the program avatar module, which computes the correct x and y coordinates of the hands and the facial expressions for the avatar. With the complete set of x and y coordinates, and the avatar model built accordingly, the model is passed to the render avatar module, which renders the avatar onto the monitor.

Sample User Interface Screen

Figure 7: Sample screen for the user interface

The image above shows a sample screen for user interaction. It includes a live video feed in the box on the right-hand side and, on the left-hand side, the avatar that displays signs. The text box below displays the converted text, and the Mode field indicates whether the system is in "Sign" mode or "Speak" mode. The aesthetics of the screen take into account that the users may not be receptive to a cluttered display, so the screen has a plain, clearly laid out structure.

Change Management Procedures

This SRS document elaborates on and relates to all the requirements specified in the functional requirements document. Once the SRS is signed off, all changes and further feature enhancement requests will be handled only through due procedure, which includes filling out the Change Request form below. A change request will be considered only after the impact of the change has been evaluated and a feasibility study has been completed.

Change Request Form

Requested By: _________________        Date Reported: _________________
Job Title: _________________           Contact Number: _________________

Change Details:
  Detailed Description:
  Type of Change (New Feature / Change to existing functionality or feature / Change to Requirements):
  Reason for Change:
  Priority Level (Low / Medium / High):
  Modules Affected:
  Impact (Low / Medium / High):
  Are Affected Modules Live? (Yes / No):
  Assign To for Action:
  Expected Duration for Completion:
  Additional Notes:

Closure:
  Closure Approved By:
  Closure Date:
  Closure Notes:

Cross Reference Listing

Serial Number | Function | System Specification | Software Requirements Sections | Changes Between Phases
1 | Kinect | Functional Requirement 1, 15 | 1. System Overview Diagram; 2. Processing Block; 3. Recognition Module |
2 | Speech to Sign Translator | Functional Requirement 2 | 5. Speech Interpreter |
3 | Monitor (only 1 is needed for the disabled person in Phase 3) | Functional Requirement 4a | 1. System Overview Diagram; 2. Processing Block; 3. Recognition Module; 6. Display Avatar | There will only be one monitor, for the impaired person; the unimpaired person will only use the microphone of the Kinect
4 | Visual Display for Disabled | Functional Requirement 5 | Sample User Interface Screen |
5 | Sign to Speech | Functional Requirement 6 | 1. System Overview Diagram; 2. Processing Block; 3. Recognition Module; 4. Text to Speech Module |
6 | UI Simplicity and Presentation | Functional Requirement 14 | Sample User Interface Screen |
7 | Error Checking for Impaired | Functional Requirement 16 | Sample User Interface Screen |
Integration Thread

The development of this project is planned to take place in increments, with each increment delivering a portion of the final expected product. This will not only help us resolve issues early on in the project, but will also help the customers visualize the final outcome and participate actively in the development process. The first increment is outlined below. It will demonstrate the entire working of the product end to end, minus the complete feature set. This will give a flavor of how the product can be used and will serve as a base for building the other features mentioned in this document. In this increment, we intend to deliver scaled-down versions of the following modules:

1. Recognition module
2. Speech interpreter

Combined, these two modules will convert most of the signs that are used into text, and the speech that is recorded will also be converted into text. Implementations of the avatar and the text to speech conversion are not in the scope of this increment. A more detailed explanation of each module is given below.

Scaled-Down Recognition Module

Figure 8: Scaled-down Recognition Module

As described earlier, the recognition module makes use of the five ASL parameters to interpret the input data from the Kinect. Each of the five parameters corresponds to a separate sub-module that interprets the data to analyze that particular parameter. In this version, we will analyze only the hand gestures and not place emphasis on the other ASL parameters. Since this is one of the most important aspects of sign language representation, this feature will enable the system to recognize most hand signs. The interpreted sign is then converted to text, which is displayed directly on the monitor. This module will thus enable partial reading of signs and will allow communication to continue uninterrupted as long as only hand signs are being used. The other parameters of ASL will be dealt with in subsequent iterations.

Scaled-Down Speech Interpreter

Figure 9: Scaled-down Speech Interpreter

This module is responsible only for converting speech into text. The crux of the module is the Microsoft Speech recognition engine, which uses a speech database to index spoken audio into recognizable text.
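A minimal sketch of how this scaled-down module could wrap the Microsoft speech engine is given below, assuming the System.Speech dictation grammar and the default audio device as input. The class and event names are assumptions for illustration, and wiring the Kinect's microphone array in as the audio source is deliberately omitted here.

    using System;
    using System.Speech.Recognition;   // Microsoft speech recognition engine (Windows)

    // Minimal sketch of the scaled-down speech interpreter in Figure 9:
    // speech in, recognized text out to whoever listens on TextRecognized.
    public sealed class ScaledSpeechInterpreter : IDisposable
    {
        private readonly SpeechRecognitionEngine _engine = new SpeechRecognitionEngine();

        public event Action<string> TextRecognized;           // consumed by the monitor/UI

        public ScaledSpeechInterpreter()
        {
            _engine.LoadGrammar(new DictationGrammar());      // free-form dictation
            _engine.SetInputToDefaultAudioDevice();           // Kinect microphone wiring omitted
            _engine.SpeechRecognized += (sender, e) =>
            {
                var handler = TextRecognized;
                if (handler != null) handler(e.Result.Text);  // converted text to the monitor
            };
        }

        public void Start()
        {
            _engine.RecognizeAsync(RecognizeMode.Multiple);   // keep listening continuously
        }

        public void Dispose()
        {
            _engine.Dispose();
        }
    }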
The overall block diagram of the first iteration is shown below.

Figure 10: Overall integration block diagram

This diagram highlights the main features in the first iteration, which include:

1. Analysis of hand gestures.
2. Converting the analyzed gesture to text.
3. Converting speech to text.

The process flow of the entire application will remain as described in this specification, barring the implementation of all the features mentioned here. When this iteration is successful, we can move on to the next iteration, which will involve building the modules responsible for capturing the nuances of sign language, as well as the modules that take care of the text to speech conversion and the building of the avatar. Plans for subsequent iterations will be released and put up for sign-off on successful completion of iteration 1 of the development.

Technical Specifications

The main language used for coding the application is C#. This choice is driven by the compatibility of the Kinect device with the Microsoft Kinect SDK. The choice of a .NET platform allows for greater flexibility in the use of the Kinect device and also exploits the powerful functionality provided by the development toolkit, which allows us to tweak the Kinect data to a large extent. OpenCV libraries are also to be used, and will provide strong support for the face recognition. Microsoft's Speech recognition engine, coupled with the Kinect's inbuilt array of microphones, is to be used for all speech to text conversion. The Microsoft XNA framework provides great flexibility for creating visually appealing avatars, and these can be seamlessly linked with the Kinect's gesture recognition capabilities.

Conclusion

This document aims to demonstrate a technical understanding of the functional requirements put forth by the customer. As requested by the customer, we have incorporated our research and thinking about the domain into the building of this system. The document provides holistic coverage of all customer requirements and will serve as the agreement on the system that is to be delivered.