Phase 3 document

Making Kinnections
Software Requirements Specification Document
Date: 11/21/2014
Version: 1.0
Authors: Bhargav Shivkumar, Andrew Stohr, Christopher Weeden, Balu Swarna, Mark S.
Reviews:
Date: 11/21/2014
Reviewed By: Prof. Michael Buckley
Table of Contents
Introduction
A note on American Sign Language
Existing gesture recognition systems
Software Architecture Block Diagram
Sample User Interface Screen
Change Management procedures
Cross Reference Listing
Integration Thread
Technical Specifications
Conclusion
Introduction
This document addresses the software requirements given by the users for building the
Kinect-based gesture recognition system, in semi-technical terms. It gives a high-level
explanation of the technical solution to the presented problem, touching on sample user
interfaces and on the interactions between the various modules in the system. The main
intention, as perceived from the functional requirements, is that the system must let the
hearing and speech impaired carry out a normal conversation seamlessly with their unimpaired
counterparts. Before going into the details of how this system can be designed, we start by
exploring existing systems and why they may not be a feasible solution. We then outline the
modules in our design, how each works at a high level, and how they interact with one another
and with the database.
A note on American Sign Language
The hand and body motions that comprise ASL are commonly classified using five parameters.
The first of these parameters is hand shape, which describes the state of one's fingers; it
distinguishes, for example, a hand held in a fist from a hand held in a high-five shape. The
second parameter is location: where the hands are, whether in front of the chest or above the
head. The next parameter is movement; signs in ASL are not just static shapes made with the
hands but frequently involve movements of the hands or body. The parameter of palm
orientation describes whether a person's palm is facing toward or away from the other person.
The final parameter is non-manual markers, which encompasses facial expressions; signs can
have completely different meanings based on whether the signer has a happy or sad facial
expression.
Our system takes all five of these parameters into account when interpreting a sign and is
thus more robust than existing recognition systems.
More on ASL parameters can be found at
http://nshsasl2.weebly.com/uploads/5/2/8/7/5287242/five_paramters_of_asl_in_pdf.pdf
Existing gesture recognition systems
At present the best existing solution comes from Microsoft, whose research team is working on
the Kinect Sign Language Translator project. Their solution captures the gestures, while
machine learning and pattern recognition help interpret their meaning. The system is capable
of capturing a conversation from both sides: a deaf person who is signing and a person who is
speaking. Visual signs are converted into written and spoken translation rendered in real time,
while spoken words are turned into accurate visual signs. The system consists of two modes. The
first is translation mode, which translates sign language into text or speech and includes both
isolated word recognition and sentence recognition. The raising and lowering of the hands is
defined as the "start" and "end" of each sign language word to be recognized. By tracking the
position of the user's hands, many signs can be detected instantly using the research group's
3D trajectory matching algorithm. The second mode is communication mode, in which a
non-hearing-impaired person communicates with the signer through an avatar: they speak to the
system and the avatar signs what they spoke, or they can type into the Kinect software instead. [2]
The downsides are that it currently takes five people to establish the recognition patterns for
just one word, and only 300 Chinese sign language words have been added out of a total of
4,000. There is clearly still a long road ahead toward mastering this problem.
[2] http://research.microsoft.com/en-us/collaboration/stories/kinect-sign-language-translator.aspx
Software Architecture Block Diagram
I. System Overview Diagram
[Block diagram: the Kinect (depth data, live video stream) and the Kinect's microphone (speech data) feed the Processing block, which sends audio to the Speaker, text to the Monitor, and sign data to the Display Avatar, whose avatar output is shown on the Monitor.]
Figure 1: System Overview Diagram
The inputs to the processing block are the Kinect depth data, the live video stream, and the
speech data. The depth data carries the information needed to recognize signs and the meaning
behind them; it contains the ASL parameter data, which is explained later in the recognition
module. The speech data comes directly from the Kinect's microphone and is used to translate
what the non-impaired person says into signs, so that the hearing-impaired person can have a
conversation with them. These signs are sent to the display avatar, which is eventually shown
on the monitor. The video stream is also sent to the monitor, where the interface displays a
live video feed of the person signing into the Kinect. The live video stream is important
because it gives the signer visual confirmation that what he or she is signing is being
processed correctly by the program.
II. Processing Block
[Block diagram: within the Processing block, depth data from the Kinect enters the Recog Module, which outputs text to the Text to speech module and to the Monitor; the Text to speech module sends audio to the Speaker; speech data from the Kinect's microphone enters the Speech Interpreter, which sends sign data to the Display Avatar, whose avatar output goes to the Monitor.]
Figure 2: Processing Module Block diagram
The processing module is broken down into three sub-processes. The recognition module accepts
the depth data from the Kinect and translates it into English text, which is then passed on to
the text to speech module as well as to the monitor. The text to speech module outputs the
audio equivalent of its text input. Finally, the speech interpreter module takes speech data
from the Kinect microphone as input and interprets it into sign data, which is then passed to
the display avatar.
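As a rough illustration, the C# sketch below shows one way these three sub-processes could be wired together. All interface and type names here (IRecognitionModule, ITextToSpeech, ISpeechInterpreter, IMonitor, IDisplayAvatar, DepthFrameData, SignData) are illustrative placeholders, not part of any SDK.

// Illustrative placeholder types for the data exchanged between modules.
public class DepthFrameData { /* depth pixels and tracked joints from the Kinect */ }
public class SignData       { /* hand/arm poses and movement for the avatar */ }

public interface IRecognitionModule { string Recognize(DepthFrameData depthData); }
public interface ITextToSpeech      { void Speak(string text); }
public interface ISpeechInterpreter { SignData Interpret(byte[] speechData); }
public interface IMonitor           { void ShowText(string text); }
public interface IDisplayAvatar     { void Display(SignData sign); }

public class ProcessingBlock
{
    private readonly IRecognitionModule recognition;
    private readonly ITextToSpeech textToSpeech;
    private readonly ISpeechInterpreter speechInterpreter;
    private readonly IMonitor monitor;
    private readonly IDisplayAvatar avatar;

    public ProcessingBlock(IRecognitionModule r, ITextToSpeech t,
                           ISpeechInterpreter s, IMonitor m, IDisplayAvatar a)
    {
        recognition = r; textToSpeech = t; speechInterpreter = s; monitor = m; avatar = a;
    }

    // Sign-to-speech path: depth data -> English text -> on-screen text and audio.
    public void OnDepthData(DepthFrameData depthData)
    {
        string text = recognition.Recognize(depthData);
        monitor.ShowText(text);
        textToSpeech.Speak(text);
    }

    // Speech-to-sign path: microphone audio -> sign data -> avatar display.
    public void OnSpeechData(byte[] speechData)
    {
        SignData sign = speechInterpreter.Interpret(speechData);
        avatar.Display(sign);
    }
}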
III. Recognition module
[Block diagram: within the Recog Module, depth data from the Kinect is analyzed by five sub-modules (hand shape analysis, location, movement analysis, palm orientation, facial expression recogniser); the resulting ASL parameter data goes to the Sign Identifier, which exchanges sign keys with the Sign database (built from the sign training dataset) and passes the recognized gesture to the Sign to text converter, which outputs text to the Text to speech module and to the Monitor.]
Figure 3: Recognition Module Block Diagram
The recognition module takes the depth data as input and interprets it using the five common
classification parameters of ASL signs: hand shape, location, movement, palm orientation, and
facial expression. Each of these is analyzed by its own sub-module, and each is essential for
translating what the impaired person is signing into the correct meaning. The resulting
parameters are passed to the sign identifier module, which queries the sign database to
identify potential signs and then selects, from the database results, the most likely meaning
of the ASL parameters. The recognized gesture is passed to the sign to text converter, which
queries the sign database for the equivalent text of the recognized gesture. The text is then
output both to the text to speech module and to the monitor, where the interface displays the
text that the sign was recognized to mean. In order for the sign database to have all the
meanings it needs, each sign or phrase and its meaning will have to be added to this database.
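As a rough sketch, the C# fragment below shows how the five ASL parameters might be represented and handed to a sign identifier that queries the sign database for the best-matching gesture. All types here (AslParameters, ISignDatabase, SignCandidate, and the enums) are hypothetical placeholders; the real matching logic would come from the trained sign dataset.

using System.Collections.Generic;
using System.Linq;

public enum HandShape        { Fist, OpenPalm, IndexPoint /* ... */ }
public enum PalmOrientation  { TowardViewer, AwayFromViewer, Up, Down }
public enum FacialExpression { Neutral, Happy, Sad, Questioning }

// One frame's worth of the five ASL classification parameters.
public class AslParameters
{
    public HandShape HandShape;
    public float LocationX, LocationY;            // hand position relative to the body
    public float MovementDeltaX, MovementDeltaY;  // frame-to-frame hand movement
    public PalmOrientation PalmOrientation;
    public FacialExpression Expression;
}

// A candidate sign returned by the sign database, with its English text (gloss).
public class SignCandidate
{
    public string Gloss;
    public double MatchScore;
}

public interface ISignDatabase
{
    IEnumerable<SignCandidate> FindCandidates(AslParameters parameters);
}

public class SignIdentifier
{
    private readonly ISignDatabase signDatabase;  // backed by the sign training dataset

    public SignIdentifier(ISignDatabase database) { signDatabase = database; }

    // Queries the sign database for potential signs and returns the text of
    // the most likely match, or an empty string if nothing matches.
    public string IdentifySign(AslParameters parameters)
    {
        SignCandidate best = signDatabase.FindCandidates(parameters)
                                         .OrderByDescending(c => c.MatchScore)
                                         .FirstOrDefault();
        return best != null ? best.Gloss : string.Empty;
    }
}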
IV. Text To Speech module
[Block diagram: within the Text to speech module, text from the Recog module enters the Text to speech converter, which exchanges text and audio with the Text audio database and sends the resulting audio to the Speaker.]
Figure 4: Text to Speech module block Diagram
The text to speech module receives the text from the recognition module. The text is passed to
the text to speech converter, which queries the database for the equivalent audio of the text.
There are many programs and APIs that provide text to speech conversion, and the database will
rely on one of these to translate the text into the correct audio. The audio can take on many
different voices, for instance a man's or a woman's voice. From here, the audio is sent to the
speaker so that the non-impaired person can listen to what the impaired person signed.
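As a minimal sketch, the built-in .NET System.Speech synthesis API is one of the text-to-speech options mentioned above; TextToSpeechModule is a hypothetical wrapper name, not a committed design.

using System.Speech.Synthesis;

// A minimal sketch of the text to speech converter using the .NET
// System.Speech.Synthesis API (one of several possible engines).
public class TextToSpeechModule
{
    private readonly SpeechSynthesizer synthesizer = new SpeechSynthesizer();

    public TextToSpeechModule()
    {
        synthesizer.SetOutputToDefaultAudioDevice();        // route audio to the speaker
        synthesizer.SelectVoiceByHints(VoiceGender.Female); // a male voice could be chosen instead
    }

    // Speaks the English text produced by the recognition module.
    public void Speak(string recognizedText)
    {
        synthesizer.SpeakAsync(recognizedText);
    }
}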
V. Speech Interpreter module
[Block diagram: within the Speech Interpreter, speech data from the Kinect's microphone enters the Speech to text converter, which queries the Speech to text database; the converted text goes to the Text to sign converter, which exchanges sign keys with the Sign database and sends sign data to the Display Avatar.]
Figure 5: Speech interpreter module block diagram
This module receives speech data from the Kinect's microphone. The microphone is part of the
Kinect hardware, so most modern systems can obtain speech data from it directly. The data is
passed to the speech to text converter, which queries the speech to text database for the
equivalent text of the speech data; many programs and databases exist for converting speech to
text. Once the converted text is retrieved, it is sent to the text to sign converter, which
queries the sign database for the equivalent sign for the text. Sign keys are words or phrases
identified in the text that have an equivalent representation in sign language. The sign for
the text is passed on in the form of sign data, which is then sent to the display avatar.
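The C# sketch below illustrates only the text to sign converter step, under the assumption that the sign database can be exposed as a lookup from sign keys to sign data; the speech to text step itself is sketched later, under the scaled-down speech interpreter. TextToSignConverter and SignData are hypothetical names.

using System.Collections.Generic;
using System.Linq;

// Placeholder for the hand/arm poses and movement that describe one sign.
public class SignData { }

// Illustrative text to sign converter: extracts sign keys from the converted
// text and collects the matching sign data for the display avatar.
public class TextToSignConverter
{
    private readonly IDictionary<string, SignData> signDatabase;  // sign key -> sign data

    public TextToSignConverter(IDictionary<string, SignData> database)
    {
        signDatabase = database;
    }

    public IList<SignData> Convert(string convertedText)
    {
        return convertedText
            .Split(' ')
            .Select(word => word.Trim().ToLowerInvariant())
            .Where(key => signDatabase.ContainsKey(key))   // keep only words with a known sign
            .Select(key => signDatabase[key])
            .ToList();
    }
}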
VI. Display Avatar Module
[Block diagram: within the Display Avatar module, sign data from the Speech Interpreter enters the Program Avatar process, which produces an avatar model for the Render Avatar process; the rendered avatar is sent to the Monitor.]
Figure 6: Display Avatar module Block Diagram
The Display Avatar module takes the output of the speech interpreter. This sign data is passed
to the program avatar module, which computes the correct x and y coordinates of the hands and
the facial expressions for the avatar. With the complete set of x and y coordinates, and the
avatar model created appropriately with these in mind, the avatar model is passed to the Render
Avatar module, which renders the avatar onto the monitor.
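A rough C# sketch of that flow is given below; SignData, SignFrame, and the rendering call are placeholders, since the actual avatar would be built and drawn with a rendering framework such as XNA (see Technical Specifications).

using System.Collections.Generic;

// One frame of the avatar model: hand coordinates plus a facial expression.
public class SignFrame
{
    public float LeftHandX, LeftHandY, RightHandX, RightHandY;
    public string Expression;
}

// Placeholder for the sign data produced by the speech interpreter.
public class SignData
{
    public List<SignFrame> Frames = new List<SignFrame>();
}

public class DisplayAvatarModule
{
    // Program Avatar: derive the per-frame hand coordinates and facial
    // expressions that make up the avatar model.
    public IEnumerable<SignFrame> ProgramAvatar(SignData sign)
    {
        return sign.Frames;
    }

    // Render Avatar: draw each pose of the avatar model onto the monitor
    // (the drawing itself would be done by the rendering framework).
    public void RenderAvatar(IEnumerable<SignFrame> avatarModel)
    {
        foreach (SignFrame pose in avatarModel)
        {
            // e.g. update the avatar's hands to (pose.LeftHandX, pose.LeftHandY) and
            // (pose.RightHandX, pose.RightHandY), apply pose.Expression, then draw
        }
    }
}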
Sample User Interface Screen
Figure 7: Sample screen for the user interface.
The above image shows a sample screen for user interaction. It includes a live video feed in
the box on the right-hand side and, on the left-hand side, the avatar that displays signs. The
text box below displays the converted text, and the Mode indicator shows whether the system is
in "Sign" mode or "Speak" mode. The aesthetics of the screen take into account that the users
may not be especially receptive to a complex interface, and so it has a plain, clearly laid out
structure.
Change Management procedures
This SRS document elaborates on and relates to all the requirements specified in the functional
requirements document. Once the SRS is signed off, all changes and further feature enhancement
requests will be handled only through due procedure, which includes filling out the Change
Request form below. A change request will be considered only after its impact has been
evaluated and a feasibility study completed.
Requested By: _________________          Date Reported: ____________________
Job Title: _________________             Contact Number: ____________________

Change Details:
Detailed Description:
Type of Change: New Feature / Changes to existing functionality/feature / Change to Requirements
Reason for Change:
Priority Level: Low / Medium / High
Modules Affected:
Impact: Low / Medium / High
Are Affected Modules Live? Yes / No
Assign to for action:
Expected duration for completion:
Additional Notes:

Closure:
Closure Approved by:                     Closure Date:
Closure Notes:
Cross Reference Listing
Serial Number | Function | System Specification | Software Requirements | Changes Between Phases
1 | Kinect | Functional Requirement 1, 15 | 1. System Overview Diagram; 2. Processing Block; 3. Recognition Module |
2 | Speech To Sign Translator | Functional Requirement 2 | 5. Speech Interpreter |
3 | Monitor (only 1 is needed for the disabled person in Phase 3) | Functional Requirement 4a | 1. Highest level; 2. Processing; 3. Recog Module; 6. Display Avatar | There will only be one monitor for the impaired person; the unimpaired person will only use the mic from the Kinect
4 | Visual Display for Disabled | Functional Requirement 5 | Sample User Interface Screen |
5 | Sign to Speech | Functional Requirement 6 | 1. System Overview Diagram; 2. Processing Block; 3. Recognition Module; 4. Text to Speech Module |
6 | UI simplicity and Presentation | Functional Requirement 14 | Sample User Interface Screen |
7 | Error Checking for Impaired | Functional Requirement 16 | Sample User Interface Screen |
Integration Thread
The development of this project is planned to take place in increments with each increment
giving a portion of the final expected product. This will not only help us resolve issues early on
in the project, but will also help customers visualize the final outcome and participate actively
in the development process.
The first increment is outlined below. It will demonstrate the entire working of the product
end to end, minus the complete feature set. This will give a flavor of how the product can be
used and will serve as a base for building the other features mentioned in this document.
In this increment, we intend to deliver scaled down versions of the following modules:
1. Recognition module
2. Speech Interpreter
Combined, these two modules will convert most commonly used signs into text, and recorded
speech will also be converted into text. Implementations of the avatar and of text to speech
conversion are not in the scope of this increment. A more detailed explanation of each module
is given below.
Scaled Down recognition module:
[Block diagram: in the scaled-down Recog Module, depth data from the Kinect goes only to hand shape analysis; the resulting ASL parameter data goes to the Sign Identifier, which exchanges sign keys with the Sign database (built from the sign training dataset) and passes the recognized gesture to the Sign to text converter, which outputs text to the Monitor.]
Figure 8: Scaled down Recog module.
As described earlier, the recognition module makes use of the five ASL parameters to interpret
the input data from the Kinect. Each of the five parameters corresponds to a separate module
that interprets the data in its own way to analyze that particular parameter. In this version
we will analyze only the hand gestures, without placing emphasis on the other parameters of
ASL. Since hand shape is one of the most important aspects of sign language representation,
this feature alone will enable the system to recognize most hand signs. The interpreted sign is
then converted to text, which is displayed directly on the monitor. This module will thus
enable partial reading of signs and will allow communication to continue uninterrupted as long
as only hand signs are being used. The other parameters of ASL will be dealt with in subsequent
iterations.
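A minimal C# sketch of this iteration-1 path (hand shape analysis only) is shown below; IHandShapeAnalyzer and ISignLookup are hypothetical placeholders.

// Iteration 1: only hand shape analysis feeds the sign identifier; the other
// ASL parameter analyzers are added in later iterations.
public interface IHandShapeAnalyzer { string AnalyzeHandShape(byte[] depthFrame); }
public interface ISignLookup        { string LookupText(string handShape); }

public class ScaledDownRecognitionModule
{
    private readonly IHandShapeAnalyzer handShapeAnalyzer;
    private readonly ISignLookup signDatabase;

    public ScaledDownRecognitionModule(IHandShapeAnalyzer analyzer, ISignLookup database)
    {
        handShapeAnalyzer = analyzer;
        signDatabase = database;
    }

    // Returns the English text for the recognized hand sign, or an empty
    // string if the hand shape does not match any sign in the database.
    public string Recognize(byte[] depthFrame)
    {
        string handShape = handShapeAnalyzer.AnalyzeHandShape(depthFrame);
        return signDatabase.LookupText(handShape) ?? string.Empty;
    }
}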
Scaled down Speech Interpreter
[Block diagram: in the scaled-down Speech Interpreter, speech data from the Kinect's microphone enters the Speech to text converter, which queries the Speech to text database; the converted text is sent directly to the Monitor.]
Figure 9: scaled down Speech Interpreter
This module is responsible for converting speech into text only. The crux of the module is
converting the speech into text using the Microsoft Speech recognition engine, which makes use
of a speech database to map spoken audio to recognizable text.
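A minimal C# sketch of the speech to text converter is given below; it assumes the desktop System.Speech recognition engine with free-form dictation, and the onText callback (which forwards the converted text to the monitor) is a hypothetical hook.

using System;
using System.Speech.Recognition;

// A minimal sketch of the scaled-down speech interpreter using the
// Microsoft speech recognition engine (desktop System.Speech API).
public class SpeechToTextModule
{
    private readonly SpeechRecognitionEngine engine = new SpeechRecognitionEngine();

    // onText forwards each piece of converted text to the monitor.
    public void Start(Action<string> onText)
    {
        engine.LoadGrammar(new DictationGrammar());     // free-form dictation
        engine.SetInputToDefaultAudioDevice();          // Kinect microphone as the default device
        engine.SpeechRecognized += (sender, e) => onText(e.Result.Text);
        engine.RecognizeAsync(RecognizeMode.Multiple);  // keep listening for speech
    }
}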
The overall block diagram of the first iteration can be seen below:
[Block diagram: in the first iteration, depth data from the Kinect enters the Recog Module and speech data from the Kinect's microphone enters the Speech Interpreter; both modules output text to the Monitor.]
Figure 10: Overall integration block diagram.
This diagram highlights the main features of the first iteration, which include:
1. Analysis of hand gestures.
2. Converting analyzed gesture to text.
3. Converting speech to text.
The process flow of the entire application will remain as described in this specification
document, apart from the features that are not yet implemented in this iteration. Once this
iteration is successful, we can move on to the next iteration, which will involve building the
modules responsible for capturing the nuances of sign language, as well as the modules that
take care of text to speech conversion and of building the avatar. Plans for subsequent modules
will be released and put up for sign-off on successful completion of iteration 1 of the
development.
Technical Specifications:
The main language used for coding the application is C#. The reason for this choice is the
compatibility of the Kinect device with the Microsoft Kinect SDK. The choice of a .NET platform
allows for greater flexibility in using the Kinect device and also exploits the powerful
functionality provided by the development toolkit, which allows us to work with the Kinect data
to a large extent.
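As an illustration (a sketch assuming the Kinect for Windows SDK v1.x), the C# snippet below shows how the skeleton and depth streams that feed the recognition module could be opened.

using System.Linq;
using Microsoft.Kinect;

// A minimal sketch of acquiring skeleton and depth data with the Kinect SDK.
public class KinectInput
{
    private KinectSensor sensor;
    private readonly Skeleton[] skeletons = new Skeleton[6];

    public void Start()
    {
        sensor = KinectSensor.KinectSensors
                             .FirstOrDefault(s => s.Status == KinectStatus.Connected);
        if (sensor == null) return;

        sensor.SkeletonStream.Enable();   // joint positions for hand location and movement
        sensor.DepthStream.Enable();      // raw depth for hand shape analysis
        sensor.SkeletonFrameReady += OnSkeletonFrameReady;
        sensor.Start();
    }

    private void OnSkeletonFrameReady(object sender, SkeletonFrameReadyEventArgs e)
    {
        using (SkeletonFrame frame = e.OpenSkeletonFrame())
        {
            if (frame == null) return;
            frame.CopySkeletonDataTo(skeletons);

            foreach (Skeleton skeleton in skeletons
                     .Where(s => s.TrackingState == SkeletonTrackingState.Tracked))
            {
                // Hand joint positions feed the recognition module's ASL parameter analysis.
                SkeletonPoint rightHand = skeleton.Joints[JointType.HandRight].Position;
                SkeletonPoint leftHand  = skeleton.Joints[JointType.HandLeft].Position;
                // ... pass joint data on to the recognition module
            }
        }
    }
}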
OpenCV libraries are also to be used, as they provide strong support for the face recognition component.
Microsoft’s Speech recognition engine, coupled with the Kinect’s inbuilt array of microphones,
is to be used for all speech to text conversion.
The Microsoft XNA framework provides great flexibility for creating visually appealing avatars,
which can be seamlessly linked with the Kinect's gesture recognition capabilities.
Conclusion
This document aims to demonstrate a technical understanding of the functional requirements put
forth by the customer. As requested by the customer, we have incorporated our own research and
thinking about the domain into the design of this system. The document provides holistic
coverage of all customer requirements and will serve as an agreement on the system that is to
be delivered.